
Use flat set for spectators #4784

Open
wants to merge 6 commits into
base: master

Conversation

ranisalt
Member

Pull Request Prelude

Changes Proposed

We currently have a sorted vector for spectators since std::unordered_set is painfully slow for the usage pattern. It turns out that boost::unordered_flat_set, recently released in Boost 1.81, is ridiculously fast and is a great replacement with proper semantics.

This is hard to measure without a proper running server, but in very limited local testing it averaged 20% less run time for getSpectators. I also expect iterating over spectators to perform actions to get faster. Cache locality ftw

Spectator caches have also been modified to use boost::unordered_flat_map, which shares the same performance uplift over std::map.
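
A minimal sketch of the container swap, assuming a Creature* element type (the class shape is illustrative, not the actual diff):

#include <boost/unordered/unordered_flat_set.hpp>

#include <cstddef>

class Creature;

class Spectators
{
public:
	void insert(Creature* creature) { data.insert(creature); } // amortized O(1), no sorted-insert shuffling
	bool erase(Creature* creature) { return data.erase(creature) > 0; } // real set semantics, duplicates impossible
	auto begin() const { return data.begin(); }
	auto end() const { return data.end(); }
	std::size_t size() const { return data.size(); }

private:
	// Open-addressing table over a contiguous bucket array: one allocation
	// and cache-friendly probing, unlike node-based std::unordered_set.
	boost::unordered_flat_set<Creature*> data;
};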

@ranisalt
Member Author

@gesior could you kindly run your benchmarks over this? I recall you recently checked a few different implementations, sorry if I remember wrong 😆

@ranisalt ranisalt marked this pull request as ready for review September 16, 2024 09:59
@MillhioreBT
Contributor

I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.

@ranisalt
Member Author

> I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.

How are you checking it?

@MillhioreBT
Contributor

MillhioreBT commented Sep 18, 2024

> > I have no way to check metrics on real servers with many players online, but brute-force tests show that it is slower than a simple std::vector / SpectatorVec on Windows.
>
> How are you checking it?

I currently do not have access to servers with a large number of players, so I cannot gather highly precise metrics.

However, I am running a straightforward test that measures the time the CPU takes to execute millions of calls to this API. All the tests suggest that the process takes approximately twice as long. These tests were conducted with the cache disabled.

The time difference between using a vector and a flat map is minimal. If a server ran for an extended period and we could assess the total CPU time consumed by the process, the flat map would likely increase the overall time. I am not certain this kind of evidence fully reflects real-world scenarios, but I tend to believe that lower total process time correlates with fewer CPU cycles.

the test:

for (int i = 0; i < 100000; i++) {
	Spectators spectators;
	g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
	                         maxRangeY);
}

It's curious; I don't quite understand why it is twice as slow in this test compared to the SpectatorVec class.

@gesior
Contributor

gesior commented Sep 22, 2024

> @gesior could you kindly run your benchmarks over this? I recall you recently checked a few different implementations, sorry if I remember wrong 😆

I've added your changes to TFS 1.4: gesior@72dab98

The problem with 'faster structures' is that they often only beat std with big collections, e.g. a map with 50k keys.
The spectator cache on an OTS is cleared every time any monster/player moves, so it probably rarely goes over 10 positions, and the 'cache hit ratio' is very low, probably around 50%.


First I benchmarked it on a 16-core Linux cloud server with 100 players running around among multiple Demons. In this test I compared the total CPU usage of the Dispatcher thread. The results showed that your new code is 2x SLOWER than the old getSpectators.
Then I benchmarked some other optimizations on the same machine; those benchmarks had a problem with the compilation switching from Release to Debug, but it probably did not affect the tests of your code: TFS compiled in Debug mode loads the .otbm 4x slower, and I did not notice 20+ second server start times while testing your optimization.


Then I benchmarked your optimization on a Windows PC with an i9-13900K. To get stable results, I pinned the core affinity of TFS to the first 4 CPU cores (P-cores). This time I tested with 1 player online and 16 Demons around. The Demons were not moving; each Demon cast a fire area spell every second and said some text.
I tracked the Map::getSpectators execution time and the number of Map::getSpectators calls with foundCache = true. 60% of getSpectators executions had foundCache = true.

CPU usage per 800 getSpectators executions:
TFS getSpectators CPU usage: 0.00396% - 0.00548%, avg ~0.005%
new getSpectators CPU usage: 0.00554% - 0.00684%, avg ~0.006%

So again, your code is around 20% slower.


Another thing I noticed in both benchmarks: the code tracking foundCache = true also used around 25% more CPU.
The code that tracked foundCache:

	if (foundCache) {
		AutoStat autoStat2("foundCache");
	}

All it does is create an object and destroy it; the destructor adds the measured time to OTS Stats.
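
For context, a minimal sketch of what such an RAII stat timer might look like (this is a guess at the helper's shape, not the actual OTS Stats source):

#include <chrono>
#include <string>

class AutoStat
{
public:
	// Constructor records the start time under the given label.
	explicit AutoStat(std::string label) : label(std::move(label)), start(std::chrono::steady_clock::now()) {}

	// Destructor reports the elapsed time, so the cost of the guard itself
	// (construction + destruction + reporting) lands in the measurement.
	~AutoStat()
	{
		reportToStats(label, std::chrono::steady_clock::now() - start);
	}

private:
	static void reportToStats(const std::string&, std::chrono::steady_clock::duration) {} // assumed reporting hook

	std::string label;
	std::chrono::steady_clock::time_point start;
};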

IDK if there is something wrong with my benchmarks, or if your code somehow affects compiler optimization or the CPU cache (L1/L2/L3).

Raw results:
TFS: https://paste.ots.me/564127/text
new code: https://paste.ots.me/564128/text


Code like this:

for (int i = 0; i < 100000; i++) {
	Spectators spectators;
	g_game.map.getSpectators(spectators, position, multifloor, onlyPlayers, minRangeX, maxRangeX, minRangeY,
	                         maxRangeY);
}

can't be used to get realistic benchmark results: with this code you get a 100% cache hit ratio.
Anyway, I've tested it like this:

	{
		AutoStat autoStat("test10k1");
		std::string test = "test";
		for (int i = 0; i < 10000; i++) {
			test = transformToSHA1(test);
		}
	}
	{
		AutoStat autoStat("turn1kk");
		for (int i = 0; i < 1000000; i++) {
			Spectators spectators;
			map.getSpectators(spectators, creature->getPosition(), true, true);
		}
	}
	{
		AutoStat autoStat("test10k2");
		std::string test = "test";
		for (int i = 0; i < 10000; i++) {
			test = transformToSHA1(test);
		}
	}

With this harness, clean TFS and your version gave the same test10k results (+/- 2%), so there is no problem with optimization/CPU cache.
TFS: 30.22 ms per 1kk getSpectators calls
new code: 70.75 ms per 1kk getSpectators calls
So hitting the cache with your code is 2.3x slower.

Maybe the optimization would be visible with more creatures in the cache, or while iterating over that cache, but that would be hard to test.

@ramon-bernardo
Contributor

ramon-bernardo commented Sep 25, 2024

Why is it that 'the spectator cache on the OTS is cleared every time any monster/player moves'?

I haven't dug into it, but I couldn't help noticing that clearPlayersSpectatorCache and clearSpectatorCache are always called; it's a good time to analyze and understand why. 😀

@gesior
Contributor

gesior commented Sep 25, 2024

> Why is it that 'the spectator cache on the OTS is cleared every time any monster/player moves'?

spectators.clear() is probably faster than updating a cache of 3703 (23 * 23 * ~7, on levels above ground) tiles around a given Position on every monster/player step.
The current very basic cache implementation was added to speed up scripts that send multiple things to the players who see a given position, e.g. spells. For example, exura vita sends text at the player's position plus an effect animation and an add-health animation at the player's position; with the cache, getSpectators lists the creatures on the 3703 surrounding tiles only once.

@ranisalt
Member Author

Right, it seems that for the very small set sizes we usually have, the overhead of hashing far outweighs the algorithmic gains. I wonder if using a boost::small_vector instead, keeping the data on the stack, could help with locality. But I expect the improvement would be minimal.
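
For reference, a minimal sketch of that idea (the alias name and the inline capacity of 32 are assumptions, not measured values):

#include <boost/container/small_vector.hpp>

#include <algorithm>

class Creature;

// The first 32 pointers live inline in the object itself (typically on the
// stack), so small spectator lists need no heap allocation at all and are
// iterated as one contiguous block.
using SpectatorsVec = boost::container::small_vector<Creature*, 32>;

void addSpectator(SpectatorsVec& spectators, Creature* creature)
{
	// A linear scan keeps set semantics; for tiny N this beats hashing.
	if (std::find(spectators.begin(), spectators.end(), creature) == spectators.end()) {
		spectators.push_back(creature);
	}
}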

@MillhioreBT
Contributor

According to the tests I ran over several months and the problems I encountered, my conclusion is that no container by itself will help; the best option is to use a cache, that is, a static vector.

Obviously a single static vector is impossible, but we can have a namespace with a local vector inside it and use it as a cache. However, since TFS sometimes calls getSpectators once and other times 5 or N times, it is neither safe nor predictable how many vectors the cache should hold to satisfy demand without overwrites.

So the best option is to create a stack of caches (see the sketch below). The stack starts empty; when we need a vector, we either create a new one or reuse one available on the stack. While a cache is in use it is popped off the stack, and when we are done with it we push it back, so there are never overwrites. According to my tests the stack never exceeds 3 vectors, but it depends on many factors; on a large server with many nested systems making recursive getSpectators calls, more vectors could be generated.
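
A minimal sketch of that pool, assuming plain std::vector storage (SpectatorPool, acquire and release are hypothetical names, not code from this PR):

#include <utility>
#include <vector>

class Creature;
using SpectatorVec = std::vector<Creature*>;

namespace SpectatorPool {

static std::vector<SpectatorVec> freeList; // vectors not currently in use

SpectatorVec acquire()
{
	if (freeList.empty()) {
		return SpectatorVec{}; // pool empty: create a fresh vector
	}
	SpectatorVec vec = std::move(freeList.back());
	freeList.pop_back(); // removed while in use, so nested calls can't overwrite it
	vec.clear(); // keep the heap capacity, drop the stale contents
	return vec;
}

void release(SpectatorVec&& vec) { freeList.push_back(std::move(vec)); }

} // namespace SpectatorPool

An RAII wrapper whose destructor calls release() would make the pop/push pairing automatic and exception-safe.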

The improvement is good: around 50% in most cases.
An excellent thing is that we don't even need to change the syntax in all the places where getSpectators is used; editing the spectator.h file is enough.

Another improvement would be to enable the internal cache for all ranges and areas, so that it is not limited to the viewport. This would increase RAM usage if there are many creatures running around, but not by much: how many thousands of creatures would it take to fill RAM with mere vectors?

@ramon-bernardo
Contributor

ramon-bernardo commented Oct 24, 2024

@MillhioreBT

This cache could be implemented using a floor quadtree, particularly for tracking which creatures are visible from a specific tile/floor. It's something I've been discussing with @ranisalt

@MillhioreBT
Contributor

> @MillhioreBT
>
> This cache could be implemented using a floor quadtree, particularly for tracking which creatures are visible from a specific tile/floor. It's something I've been discussing with @ranisalt

I am currently keeping it very basic: a simple unordered map whose key is serialized from all the parameters that can be passed to the getSpectators method. I don't think it is the best way to do it, but it is something I have been testing.
Having an optimal cache is definitely the key to better performance in general.
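
A minimal sketch of that key serialization (SpectatorCacheKey and hash_value are hypothetical names; boost::hash picks hash_value up via ADL):

#include <boost/container_hash/hash.hpp>
#include <boost/unordered/unordered_flat_map.hpp>

#include <cstddef>
#include <cstdint>
#include <vector>

class Creature;

struct Position
{
	uint16_t x, y;
	uint8_t z;
	bool operator==(const Position&) const = default;
};

// Every argument that changes the result of getSpectators must be part of
// the key, otherwise different queries would alias to the same entry.
struct SpectatorCacheKey
{
	Position center;
	bool multifloor, onlyPlayers;
	int32_t minRangeX, maxRangeX, minRangeY, maxRangeY;

	bool operator==(const SpectatorCacheKey&) const = default;
};

std::size_t hash_value(const SpectatorCacheKey& k)
{
	std::size_t seed = 0;
	boost::hash_combine(seed, k.center.x);
	boost::hash_combine(seed, k.center.y);
	boost::hash_combine(seed, k.center.z);
	boost::hash_combine(seed, k.multifloor);
	boost::hash_combine(seed, k.onlyPlayers);
	boost::hash_combine(seed, k.minRangeX);
	boost::hash_combine(seed, k.maxRangeX);
	boost::hash_combine(seed, k.minRangeY);
	boost::hash_combine(seed, k.maxRangeY);
	return seed;
}

boost::unordered_flat_map<SpectatorCacheKey, std::vector<Creature*>> spectatorCache;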

@gesior
Contributor

gesior commented Oct 25, 2024

> Having an optimal cache is definitely the key to better performance in general.

Are we still talking about the spectators cache? The cache that is cleared whenever any monster/player moves anywhere on the server?

The read/write ratio of the spectators cache is probably below 1. On some OTSes it may hit 1-3 reads per write. Its average size is probably around 1, as it's cleared all the time.

It's even possible that we don't need any structure at all. Maybe even two variables, Position lastPosition and std::vector<Creature*> listOfCreaturesAroundLastPosition, would work.
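
A minimal sketch of that two-variable cache (the function names are hypothetical):

#include <cstdint>
#include <optional>
#include <vector>

class Creature;

struct Position
{
	uint16_t x, y;
	uint8_t z;
	bool operator==(const Position&) const = default;
};

// One-entry cache: enough if reads almost always target the position of
// the most recent write, which matches the usage pattern described above.
static std::optional<Position> lastPosition;
static std::vector<Creature*> listOfCreaturesAroundLastPosition;

bool tryGetCachedSpectators(const Position& pos, std::vector<Creature*>& out)
{
	if (lastPosition && *lastPosition == pos) {
		out = listOfCreaturesAroundLastPosition; // cache hit
		return true;
	}
	return false; // miss: the caller recomputes and refills the two variables
}

void invalidateSpectatorCache() // call whenever any creature moves
{
	lastPosition.reset();
}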

@MillhioreBT
Contributor

> > Having an optimal cache is definitely the key to better performance in general.
>
> Are we still talking about the spectators cache? The cache that is cleared whenever any monster/player moves anywhere on the server?
>
> The read/write ratio of the spectators cache is probably below 1. On some OTSes it may hit 1-3 reads per write. Its average size is probably around 1, as it's cleared all the time.
>
> It's even possible that we don't need any structure at all. Maybe even two variables, Position lastPosition and std::vector<Creature*> listOfCreaturesAroundLastPosition, would work.

Possibly it is not very relevant for the engine (though the cache is sometimes used, especially when there are recursive calls to getSpectators), but when using it from Lua the magic happens, since we can often serve the result from the cache.

But without a doubt there must be a better way to use the cache.
