
Use flat set for spectators #4784

Open
wants to merge 6 commits into
base: master

Conversation

ranisalt
Member

Pull Request Prelude

Changes Proposed

We currently have a sorted vector for spectators since std::unordered_set is painfully slow for the usage pattern. It turns out that boost::unordered_flat_set, recently released in Boost 1.81, is ridiculously fast and is a great replacement with proper semantics.

This is hard to measure without a proper running server, but in very limited local testing it averaged 20% less run time for getSpectators. I also expect iterating over spectators to perform actions to get faster. Cache locality ftw

Spectator caches have also been modified to use boost::unordered_flat_map, which shares the same performance uplift over std::map.
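
A minimal sketch of the container swap, assuming a Creature* element type (the class shape is illustrative, not the actual diff):

#include <boost/unordered/unordered_flat_set.hpp>

#include <cstddef>

class Creature;

class Spectators
{
public:
	void insert(Creature* creature) { data.insert(creature); } // amortized O(1), no sorted-insert shuffling
	bool erase(Creature* creature) { return data.erase(creature) > 0; } // real set semantics, duplicates impossible
	auto begin() const { return data.begin(); }
	auto end() const { return data.end(); }
	std::size_t size() const { return data.size(); }

private:
	// Open-addressing table over a contiguous bucket array: one allocation
	// and cache-friendly probing, unlike node-based std::unordered_set.
	boost::unordered_flat_set<Creature*> data;
};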

@ranisalt
Member Author

@gesior could you kindly run your benchmarks over this? I recall you recently checked a few different implementations, sorry if I remember wrong 😆

@ranisalt ranisalt marked this pull request as ready for review September 16, 2024 09:59
@MillhioreBT
Contributor

I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.

@ranisalt
Member Author

> I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.

How are you checking it?

@MillhioreBT
Contributor

MillhioreBT commented Sep 18, 2024

> > I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.
>
> How are you checking it?

I currently do not have access to servers with a large number of players, so I cannot gather highly precise metrics.

However, I am running a straightforward test that measures the time the CPU takes to execute millions of calls to this API. All the tests suggest that the process takes approximately twice as long. These tests were conducted with the cache disabled.

The time difference between using a vector and a flat map is minimal. If a server ran for an extended period and we could assess the total CPU time consumed by the process, the flat map would likely increase the overall time. I am not certain this kind of evidence fully reflects real-world scenarios, but I tend to believe that lower total process time correlates with fewer CPU cycles.

the test:

for (int i = 0; i < 100000; i++) {
	Spectators spectators;
	g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
	                         maxRangeY);
}

It's curious; I don't quite understand why it is twice as slow in this test compared to the SpectatorVec class.

@gesior
Contributor

gesior commented Sep 22, 2024

> @gesior could you kindly run your benchmarks over this? I recall you recently checked a few different implementations, sorry if I remember wrong 😆

I've added your changes to TFS 1.4: gesior@72dab98

The problem with 'faster structures' is that they often only beat std with big collections, e.g. a map with 50k keys.
The spectator cache on an OTS is cleared every time any monster/player moves, so it probably rarely goes over 10 positions, and the 'cache hit ratio' is very low, probably around 50%.


First I benchmarked it on a 16-core Linux cloud server with 100 players running around among multiple Demons. In this test I compared the total CPU usage of the Dispatcher thread. The results showed that your new code is 2x SLOWER than the old getSpectators.
Then I benchmarked some other optimizations on the same machine; those benchmarks had a problem with the compilation switching from Release to Debug, but it probably did not affect the tests of your code: TFS compiled in Debug mode loads the .otbm 4x slower, and I did not notice 20+ second server start times while testing your optimization.


Then I benchmarked your optimization on a Windows PC with an i9-13900K. To get stable results, I pinned the core affinity of TFS to the first 4 CPU cores (P-cores). This time I tested with 1 player online and 16 Demons around. The Demons were not moving; each Demon cast a fire area spell every second and said some text.
I tracked the Map::getSpectators execution time and the number of Map::getSpectators calls with foundCache = true. 60% of getSpectators executions had foundCache = true.

CPU usage per 800 getSpectators executions:
TFS getSpectators CPU usage: 0.00396% - 0.00548%, avg ~0.005%
new getSpectators CPU usage: 0.00554% - 0.00684%, avg ~0.006%

So again, your code is around 20% slower.


Another thing I noticed in both benchmarks: the code tracking foundCache = true also used around 25% more CPU.
The code that tracked foundCache:

	if (foundCache) {
		AutoStat autoStat2("foundCache");
	}

All it does is create an object and destroy it; the destructor adds the measured time to OTS Stats.
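
For context, a minimal sketch of what such an RAII stat timer might look like (this is a guess at the helper's shape, not the actual OTS Stats source):

#include <chrono>
#include <string>

class AutoStat
{
public:
	// Constructor records the start time under the given label.
	explicit AutoStat(std::string label) : label(std::move(label)), start(std::chrono::steady_clock::now()) {}

	// Destructor reports the elapsed time, so the cost of the guard itself
	// (construction + destruction + reporting) lands in the measurement.
	~AutoStat()
	{
		reportToStats(label, std::chrono::steady_clock::now() - start);
	}

private:
	static void reportToStats(const std::string&, std::chrono::steady_clock::duration) {} // assumed reporting hook

	std::string label;
	std::chrono::steady_clock::time_point start;
};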

IDK if there is something wrong with my benchmarks, or if your code somehow affects compiler optimization or the CPU cache (L1/L2/L3).

Raw results:
TFS: https://paste.ots.me/564127/text
new code: https://paste.ots.me/564128/text


Code like this:

for (int i = 0; i < 100000; i++) {
	Spectators spectators;
	g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
	                         maxRangeY);
}

can't be used to get realistic benchmark results: with this code you get a 100% cache hit ratio.
Anyway, I've tested it like this:

	{
		AutoStat autoStat("test10k1");
		std::string test = "test";
		for (int i = 0; i < 10000; i++) {
			test = transformToSHA1(test);
		}
	}
	{
		AutoStat autoStat("turn1kk");
		for (int i = 0; i < 1000000; i++) {
			Spectators spectators;
			map.getSpectators(spectators, creature->getPosition(), true, true);
		}
	}
	{
		AutoStat autoStat("test10k2");
		std::string test = "test";
		for (int i = 0; i < 10000; i++) {
			test = transformToSHA1(test);
		}
	}

With this harness, clean TFS and your version gave the same test10k results (+/- 2%), so there is no problem with optimization/CPU cache.
TFS: 30.22 ms per 1kk getSpectators calls
new code: 70.75 ms per 1kk getSpectators calls
So hitting the cache with your code is 2.3x slower.

Maybe the optimization would be visible with more creatures in the cache, or while iterating over that cache, but that would be hard to test.

@ramon-bernardo
Contributor

ramon-bernardo commented Sep 25, 2024

Why is it that 'the spectator cache on the OTS is cleared every time any monster/player moves'?

I haven't dug into it, but I couldn't help noticing that clearPlayersSpectatorCache and clearSpectatorCache are always called; it's a good time to analyze and understand why. 😀

@gesior
Contributor

gesior commented Sep 25, 2024

> Why is it that 'the spectator cache on the OTS is cleared every time any monster/player moves'?

spectators.clear() is probably faster than updating a cache of 3703 (23 * 23 * ~7, on levels above ground) tiles around a given Position on every monster/player step.
The current very basic cache implementation was added to speed up scripts that send multiple things to the players who see a given position, e.g. spells. For example, exura vita sends text at the player's position plus an effect animation and an add-health animation at the player's position; with the cache, getSpectators lists the creatures on the 3703 surrounding tiles only once.

@ranisalt
Member Author

Right, it seems that for the very small set sizes we usually have, the overhead of hashing far outweighs the algorithmic gains. I wonder if using a boost::small_vector instead, keeping the data on the stack, could help with locality. But I expect the improvement would be minimal.
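
For reference, a minimal sketch of that idea (the alias name and the inline capacity of 32 are assumptions, not measured values):

#include <boost/container/small_vector.hpp>

#include <algorithm>

class Creature;

// The first 32 pointers live inline in the object itself (typically on the
// stack), so small spectator lists need no heap allocation at all and are
// iterated as one contiguous block.
using SpectatorsVec = boost::container::small_vector<Creature*, 32>;

void addSpectator(SpectatorsVec& spectators, Creature* creature)
{
	// A linear scan keeps set semantics; for tiny N this beats hashing.
	if (std::find(spectators.begin(), spectators.end(), creature) == spectators.end()) {
		spectators.push_back(creature);
	}
}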

@MillhioreBT
Contributor

According to the tests I ran over several months and the problems I encountered, my conclusion is that no container by itself will help; the best option is to use a cache, that is, a static vector.

Obviously a single static vector is impossible, but we can have a namespace with a local vector inside it and use it as a cache. However, since TFS sometimes calls getSpectators once and other times 5 or N times, it is neither safe nor predictable how many vectors the cache should hold to satisfy demand without overwrites.

So the best option is to create a stack of caches (see the sketch below). The stack starts empty; when we need a vector, we either create a new one or reuse one available on the stack. While a cache is in use it is popped off the stack, and when we are done with it we push it back, so there are never overwrites. According to my tests the stack never exceeds 3 vectors, but it depends on many factors; on a large server with many nested systems making recursive getSpectators calls, more vectors could be generated.
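
A minimal sketch of that pool, assuming plain std::vector storage (SpectatorPool, acquire and release are hypothetical names, not code from this PR):

#include <utility>
#include <vector>

class Creature;
using SpectatorVec = std::vector<Creature*>;

namespace SpectatorPool {

static std::vector<SpectatorVec> freeList; // vectors not currently in use

SpectatorVec acquire()
{
	if (freeList.empty()) {
		return SpectatorVec{}; // pool empty: create a fresh vector
	}
	SpectatorVec vec = std::move(freeList.back());
	freeList.pop_back(); // removed while in use, so nested calls can't overwrite it
	vec.clear(); // keep the heap capacity, drop the stale contents
	return vec;
}

void release(SpectatorVec&& vec) { freeList.push_back(std::move(vec)); }

} // namespace SpectatorPool

An RAII wrapper whose destructor calls release() would make the pop/push pairing automatic and exception-safe.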

The improvement is good: around 50% in most cases.
An excellent thing is that we don't even need to change the syntax in all the places where getSpectators is used; editing the spectator.h file is enough.

Another improvement would be to enable the internal cache for all ranges and areas, so that it is not limited to the viewport. This would increase RAM usage if there are many creatures running around, but not by much: how many thousands of creatures would it take to fill RAM with mere vectors?

@ramon-bernardo
Contributor

ramon-bernardo commented Oct 24, 2024

@MillhioreBT

This cache could be implemented using a floor quadtree, particularly for tracking which creatures are visible from a specific tile/floor. It's something I've been discussing with @ranisalt

@MillhioreBT
Contributor

> @MillhioreBT
>
> This cache could be implemented using a floor quadtree, particularly for tracking which creatures are visible from a specific tile/floor. It's something I've been discussing with @ranisalt

I am currently keeping it very basic: a simple unordered map whose key is serialized from all the parameters that can be passed to the getSpectators method. I don't think it is the best way to do it, but it is something I have been testing.
Having an optimal cache is definitely the key to better performance in general.
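
A minimal sketch of that key serialization (SpectatorCacheKey and hash_value are hypothetical names; boost::hash picks hash_value up via ADL):

#include <boost/container_hash/hash.hpp>
#include <boost/unordered/unordered_flat_map.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

class Creature;

struct Position
{
	uint16_t x, y;
	uint8_t z;
	bool operator==(const Position&) const = default;
};

// Every argument that changes the result of getSpectators must be part of
// the key, otherwise different queries would alias to the same entry.
struct SpectatorCacheKey
{
	Position center;
	bool multifloor, onlyPlayers;
	int32_t minRangeX, maxRangeX, minRangeY, maxRangeY;

	bool operator==(const SpectatorCacheKey&) const = default;
};

std::size_t hash_value(const SpectatorCacheKey& k)
{
	std::size_t seed = 0;
	boost::hash_combine(seed, k.center.x);
	boost::hash_combine(seed, k.center.y);
	boost::hash_combine(seed, k.center.z);
	boost::hash_combine(seed, k.multifloor);
	boost::hash_combine(seed, k.onlyPlayers);
	boost::hash_combine(seed, k.minRangeX);
	boost::hash_combine(seed, k.maxRangeX);
	boost::hash_combine(seed, k.minRangeY);
	boost::hash_combine(seed, k.maxRangeY);
	return seed;
}

boost::unordered_flat_map<SpectatorCacheKey, std::vector<Creature*>> spectatorCache;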

@gesior
Contributor

gesior commented Oct 25, 2024

> Having an optimal cache is definitely the key to better performance in general.

Are we still talking about the spectators cache? The cache that is cleared whenever any monster/player moves anywhere on the server?

The read/write ratio of the spectators cache is probably below 1. On some OTSes it may hit 1-3 reads per write. Its average size is probably around 1, as it's cleared all the time.

It's even possible that we don't need any structure at all. Maybe even two variables, Position lastPosition and std::vector<Creature*> listOfCreaturesAroundLastPosition, would work.
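
A minimal sketch of that two-variable cache (the function names are hypothetical):

#include <cstdint>
#include <optional>
#include <vector>

class Creature;

struct Position
{
	uint16_t x, y;
	uint8_t z;
	bool operator==(const Position&) const = default;
};

// One-entry cache: enough if reads almost always target the position of
// the most recent write, which matches the usage pattern described above.
static std::optional<Position> lastPosition;
static std::vector<Creature*> listOfCreaturesAroundLastPosition;

bool tryGetCachedSpectators(const Position& pos, std::vector<Creature*>& out)
{
	if (lastPosition && *lastPosition == pos) {
		out = listOfCreaturesAroundLastPosition; // cache hit
		return true;
	}
	return false; // miss: the caller recomputes and refills the two variables
}

void invalidateSpectatorCache() // call whenever any creature moves
{
	lastPosition.reset();
}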

@MillhioreBT
Contributor

> > Having an optimal cache is definitely the key to better performance in general.
>
> Are we still talking about the spectators cache? The cache that is cleared whenever any monster/player moves anywhere on the server?
>
> The read/write ratio of the spectators cache is probably below 1. On some OTSes it may hit 1-3 reads per write. Its average size is probably around 1, as it's cleared all the time.
>
> It's even possible that we don't need any structure at all. Maybe even two variables, Position lastPosition and std::vector<Creature*> listOfCreaturesAroundLastPosition, would work.

Possibly it is not very relevant for the engine (though the cache is sometimes used, especially when there are recursive calls to getSpectators), but when using it from Lua the magic happens, since we can often serve the result from the cache.

But without a doubt there must be a better way to use the cache.
