Use flat set for spectators #4784
base: master
Conversation
@gesior could you kindly run your benchmarks over this? I recall you recently checked a few different implementations, sorry if I remember wrong 😆
I have no way to check metrics on real servers with many players online, but brute force tests show that it is slower than a simple vector.
How are you checking it?
I currently do not have access to servers with a large number of players to gather highly precise metrics. However, I am running a straightforward test that measures the time the CPU takes to execute millions of calls to this API. All the tests suggest that the process takes approximately twice as long. These tests were conducted with the cache disabled. The time difference between using a vector and a flat map is minimal. If a server were to run for an extended period and we could assess the total CPU time consumed by the process, it is likely that the flat map would increase the overall time. I am not certain this kind of evidence fully reflects real-world scenarios, but I tend to believe that lower total process time correlates with fewer CPU cycles.

The test:

for (int i = 0; i < 100000; i++) {
    Spectators spectators;
    g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
                             maxRangeY);
}

It's curious; I don't quite understand why it is twice as slow in this test compared to a vector.
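As an aside, here is a minimal, self-contained sketch of the kind of timing loop being described (not from the PR). The workload below is a stand-in function that just fills a container; the real test calls g_game.map.getSpectators, so the numbers are only illustrative of the measurement technique.

// Hypothetical stand-alone sketch of a wall-clock timing loop.
// collectSpectators is a placeholder for the real getSpectators call.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

static void collectSpectators(std::vector<std::uint32_t>& out)
{
    out.clear();
    for (std::uint32_t id = 0; id < 32; ++id) { // pretend ~32 creatures are in range
        out.push_back(id);
    }
}

int main()
{
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::uint32_t> spectators;
    for (int i = 0; i < 100000; ++i) {
        collectSpectators(spectators);
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() << " us\n";
    return 0;
}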
I've added your changes to TFS 1.4: gesior@72dab98

The problem with 'faster structures' is that they often only work better than std with big collections, e.g. a map with 50k keys.

First I benchmarked it on a 16-core Linux cloud server with 100 players running around with multiple Demons. In this test I compared the total CPU usage of the Dispatcher thread. The results showed that your new code is 2x SLOWER than the old getSpectators.

Then I benchmarked your optimization on a Windows PC with an i9-13900K. To get stable results I set the core affinity of TFS to the first 4 CPU cores (P cores). This time I tested with 1 player online and 16 Demons around. The Demons were not moving; each Demon cast a fire area spell every second and said some text.

CPU usage per 800 getSpectators executions: so again, your code is around 20% slower.

Another thing I noticed in both benchmarks. Code tracking:

if (foundCache) {
    AutoStat autoStat2("foundCache");
}

All it does is create an object and destroy it; the destructor adds the elapsed time to OTS Stats. I don't know if there is something wrong with my benchmarks, or if your code somehow affects compiler optimization or CPU cache (L1/L2/L3).

Raw results:

Code like this:

for (int i = 0; i < 100000; i++) {
    Spectators spectators;
    g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
                             maxRangeY);
}

can't be used to get realistic benchmark results. With this code you get a 100% cache hit ratio.

{
    AutoStat autoStat("test10k1");
    std::string test = "test";
    for (int i = 0; i < 10000; i++) {
        test = transformToSHA1(test);
    }
}
{
    AutoStat autoStat("turn1kk");
    for (int i = 0; i < 1000000; i++) {
        Spectators spectators;
        map.getSpectators(spectators, creature->getPosition(), true, true);
    }
}
{
    AutoStat autoStat("test10k2");
    std::string test = "test";
    for (int i = 0; i < 10000; i++) {
        test = transformToSHA1(test);
    }
}

Now clean TFS and your version:

Maybe the optimization would be visible with more creatures in the cache or while iterating over that cache, but it would be hard to test.
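For readers unfamiliar with the AutoStat pattern referenced above, here is a rough, self-contained sketch of how such a scoped timer is typically built (the real OTS Stats implementation differs; all names here are illustrative): the constructor records a start time and the destructor adds the elapsed time to a per-label counter.

#include <chrono>
#include <iostream>
#include <string>
#include <unordered_map>

// Illustrative stand-in for the OTS Stats sink: label -> accumulated nanoseconds.
static std::unordered_map<std::string, long long> g_statTotals;

class ScopedStat { // similar in spirit to AutoStat, not the actual class
public:
    explicit ScopedStat(std::string label) : label(std::move(label)), start(std::chrono::steady_clock::now()) {}

    ~ScopedStat()
    {
        const auto elapsed = std::chrono::steady_clock::now() - start;
        g_statTotals[label] += std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    }

private:
    std::string label;
    std::chrono::steady_clock::time_point start;
};

int main()
{
    {
        ScopedStat stat("example"); // destroyed at end of scope, recording the elapsed time
        volatile long long sum = 0;
        for (int i = 0; i < 1000000; ++i) {
            sum += i;
        }
    }
    std::cout << "example: " << g_statTotals["example"] << " ns\n";
    return 0;
}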
Why is it that the spectator cache in the OTS is cleared every time any monster/player moves? I haven't dug into it, but I couldn't help noticing that clearPlayersSpectatorCache and clearSpectatorCache are always called; it's a good time to analyze and understand that. 😀
Right, it seems that for the very small set sizes we usually have, the overhead of hashing is way larger than what the better asymptotic complexity saves. I wonder if using a…
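To make the small-set point concrete, here is a rough, self-contained comparison sketch (not from the PR): looking an element up in a tiny sorted std::vector via std::binary_search versus boost::unordered_flat_set::contains. At a handful of elements, the hash-and-probe cost can outweigh the asymptotic advantage; the actual numbers depend heavily on compiler, hardware and element type.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>
#include <boost/unordered/unordered_flat_set.hpp>

int main()
{
    // Tiny "spectator" sets: a handful of creature ids, as is typical in practice.
    std::vector<std::uint32_t> sortedVec = {3, 7, 11, 19, 42};
    boost::unordered_flat_set<std::uint32_t> flatSet(sortedVec.begin(), sortedVec.end());

    constexpr int iterations = 5000000;
    std::size_t hits = 0;

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        hits += std::binary_search(sortedVec.begin(), sortedVec.end(), static_cast<std::uint32_t>(i % 50));
    }
    const auto vecTime = std::chrono::steady_clock::now() - start;

    start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        hits += flatSet.contains(static_cast<std::uint32_t>(i % 50));
    }
    const auto setTime = std::chrono::steady_clock::now() - start;

    using std::chrono::duration_cast;
    using std::chrono::milliseconds;
    std::cout << "sorted vector: " << duration_cast<milliseconds>(vecTime).count() << " ms, "
              << "flat set: " << duration_cast<milliseconds>(setTime).count() << " ms, hits: " << hits << '\n';
    return 0;
}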
According to the tests I did for months and the problems I encountered, my conclusion is that no container by itself will help with anything; the best option is to use a cache, that is, a static vector. Obviously a truly static vector is impossible, but we can have a namespace with a local vector inside it and use it as a cache. However, since TFS can in some cases call getSpectators once and other times 5 or more times, the best option is to create a stack of caches: the stack starts empty, when we need one we simply create a new one or reuse one that is available on the stack, while the cache is in use it is popped off the stack, and when we are finished with it we put it back on the stack. The improvement is good, around 50% in most cases. Another improvement would be to enable the internal cache for all ranges and areas, so that it is not limited to the viewport. This will increase RAM usage if there are many creatures running around, but not by much; how many thousands of creatures would be needed to fill RAM with only vectors?
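A rough, self-contained sketch of the "stack of reusable caches" idea described above (names and types are hypothetical, not from TFS): buffers are popped off a free stack when needed and pushed back when released, so allocations only happen when the stack is empty and capacity is reused across calls.

#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical spectator buffer: in TFS this would hold Creature pointers.
using SpectatorVec = std::vector<std::uint32_t>;

class SpectatorVecPool {
public:
    // Pop a buffer off the stack, or allocate a fresh one if the stack is empty.
    SpectatorVec acquire()
    {
        if (freeStack.empty()) {
            return SpectatorVec{};
        }
        SpectatorVec buffer = std::move(freeStack.back());
        freeStack.pop_back();
        buffer.clear(); // keep the capacity, drop the old contents
        return buffer;
    }

    // Push the buffer back so the next caller can reuse its allocation.
    void release(SpectatorVec&& buffer) { freeStack.push_back(std::move(buffer)); }

private:
    std::vector<SpectatorVec> freeStack;
};

int main()
{
    SpectatorVecPool pool;
    for (int round = 0; round < 3; ++round) {
        SpectatorVec spectators = pool.acquire();
        for (std::uint32_t id = 0; id < 10; ++id) {
            spectators.push_back(id); // stand-in for getSpectators filling the buffer
        }
        std::cout << "round " << round << ": " << spectators.size() << " spectators, capacity "
                  << spectators.capacity() << '\n';
        pool.release(std::move(spectators));
    }
    return 0;
}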
This cache could be implemented using a floor quadtree, particularly for tracking which creatures are visible from a specific tile/floor. It's something I've been discussing with @ranisalt
I am currently keeping it very basic: a simple unordered map whose key is serialized from all the parameters that can be passed to the getSpectators method. I don't think it is the best way to do it, but it is something I have been testing, but if…
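A rough, self-contained sketch of what such a serialized cache key could look like (all names here are hypothetical, not the poster's actual code): the getSpectators parameters are packed into a string key for a std::unordered_map.

#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical key built from the parameters accepted by getSpectators.
static std::string makeSpectatorCacheKey(std::uint16_t x, std::uint16_t y, std::uint8_t z, bool multifloor,
                                          bool onlyPlayers, int minRangeX, int maxRangeX, int minRangeY,
                                          int maxRangeY)
{
    std::string key;
    key.reserve(64);
    key.append(std::to_string(x)).push_back(':');
    key.append(std::to_string(y)).push_back(':');
    key.append(std::to_string(z)).push_back(':');
    key.push_back(multifloor ? '1' : '0');
    key.push_back(onlyPlayers ? '1' : '0');
    key.push_back(':');
    key.append(std::to_string(minRangeX)).push_back(':');
    key.append(std::to_string(maxRangeX)).push_back(':');
    key.append(std::to_string(minRangeY)).push_back(':');
    key.append(std::to_string(maxRangeY));
    return key;
}

int main()
{
    // Cached result: a list of creature ids standing in for spectator pointers.
    std::unordered_map<std::string, std::vector<std::uint32_t>> cache;

    const std::string key = makeSpectatorCacheKey(100, 200, 7, true, false, -8, 8, -6, 6);
    cache[key] = {1, 2, 3};

    auto it = cache.find(key);
    std::cout << "cached spectators for " << key << ": " << (it != cache.end() ? it->second.size() : 0) << '\n';
    return 0;
}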
Are we still talking about the spectators cache? The cache that is cleared whenever any monster/player moves anywhere on the server? The read/write ratio of the spectators cache is probably below 1. On some OTSes it may hit 1-3 reads per 1 write. Its average size is probably around 1, as it's cleared all the time. It's even possible that we don't need any structure at all. Maybe even 2 variables would be enough.
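A rough illustration of the "two variables" idea (hypothetical, not from TFS): keep only the most recent query key and its result, which may be enough when the cache is invalidated almost as often as it is read.

#include <cstdint>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

// Hypothetical single-entry cache: the last query key and its result.
static std::optional<std::string> g_lastKey;
static std::vector<std::uint32_t> g_lastResult;

static std::vector<std::uint32_t> getSpectatorsCached(const std::string& key)
{
    if (g_lastKey == key) {
        return g_lastResult; // hit: same query as last time
    }
    // Miss: recompute (stand-in for the real spectator search) and remember it.
    g_lastResult = {1, 2, 3};
    g_lastKey = key;
    return g_lastResult;
}

static void clearSpectatorCache()
{
    // Would be called whenever a creature moves, as in TFS.
    g_lastKey.reset();
}

int main()
{
    getSpectatorsCached("100:200:7"); // miss
    getSpectatorsCached("100:200:7"); // hit
    clearSpectatorCache();
    getSpectatorsCached("100:200:7"); // miss again after invalidation
    std::cout << "done\n";
    return 0;
}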
Possibly it is not very relevant for the engine (though the cache is sometimes used, especially when there are recursive calls to getSpectators), but when using it from Lua the magic happens, since many times we can obtain the result from the cache. Without a doubt, though, there must be a better way to use the cache.
Pull Request Prelude
Changes Proposed
We currently have a sorted vector for spectators since std::unordered_set is painfully slow for the usage pattern. It turns out that boost::unordered_flat_set, recently released in Boost 1.81, is ridiculously fast and is a great replacement with proper semantics.

This is hard to measure without a proper running server, but with very limited local testing it averaged 20% less run time for getSpectators. I also expect an improvement when iterating over spectators to perform actions. Cache locality ftw.

Spectator caches have also been modified to use boost::unordered_flat_map, which shares the same performance uplift over std::map.
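For readers unfamiliar with the Boost containers mentioned here, a minimal, self-contained sketch of their use (a generic example, not the PR's actual Spectators code): both are open-addressing, drop-in replacements for their std counterparts and require Boost 1.81 or newer.

#include <cstdint>
#include <iostream>
#include <boost/unordered/unordered_flat_map.hpp>
#include <boost/unordered/unordered_flat_set.hpp>

int main()
{
    // Flat set standing in for a spectator set keyed by creature id.
    boost::unordered_flat_set<std::uint32_t> spectators;
    spectators.insert(101);
    spectators.insert(202);
    spectators.insert(101); // duplicates are ignored, as with std::unordered_set

    std::cout << "spectators: " << spectators.size() << ", contains 202: " << spectators.contains(202) << '\n';

    // Flat map standing in for a spectator cache keyed by a position hash.
    boost::unordered_flat_map<std::uint64_t, int> cache;
    cache.emplace(0xCAFEu, 3);
    if (auto it = cache.find(0xCAFEu); it != cache.end()) {
        std::cout << "cached spectator count: " << it->second << '\n';
    }
    return 0;
}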