The performance of libmypaint/MyPaint is quite good compared to other drawing
applications, but there is still room for improvement.
Note that the ideas proposed here and their implementation will have to
be judged by benchmarking. Unit tests for correctness should also be in
place before starting work in this area.
=== IMPLEMENTED: Deferred processing, multithreading and vectorization ===
Implemented as of November 2012:
https://mail.gna.org/public/mypaint-discuss/2012-11/msg00003.html
=== TODO: Improve vectorization ===
Currently only a small amount of the tile processing is (auto)vectorized.
Try to improve the coverage of vectorized code by:
* Removing the run-length encoding of the dab mask
* Using floats instead of uint16_t
Also make sure that GCC is generating efficient vectorized code, for example by using:
* C99 restrict keyword
* __aligned__ attributes
Passing -ftree-vectorizer-verbose=6 to gcc makes it report details from the autovectorizer
(newer GCC releases use -fopt-info-vec instead), and -S or -save-temps together with
-fverbose-asm is useful for inspecting the generated assembly code.
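As an illustration of the two hints above, here is a minimal sketch of a loop written so that GCC has a good chance of vectorizing it. The names MaskBuffer, scale_mask and TILE_PIXELS are hypothetical and not part of libmypaint, and the 64x64 float layout is an assumption made only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TILE_PIXELS (64 * 64)  /* assumed tile size, for illustration only */

/* Hypothetical buffer type: the aligned attribute on the array member makes
 * every MaskBuffer start its data on a 16-byte boundary, so the compiler can
 * use aligned vector loads and stores. */
typedef struct {
    float values[TILE_PIXELS] __attribute__((aligned(16)));
} MaskBuffer;

/* Hypothetical helper: multiply every mask value by an opacity factor.
 * The C99 restrict qualifiers promise that dst and src do not overlap,
 * so the compiler does not need runtime aliasing checks. */
void scale_mask(MaskBuffer *restrict dst, const MaskBuffer *restrict src,
                float opacity)
{
    assert(opacity >= 0.0f && opacity <= 1.0f);
    for (size_t i = 0; i < TILE_PIXELS; i++) {
        dst->values[i] = src->values[i] * opacity;
    }
}
```

Compiling this with -O3 and the reporting flags mentioned above should show whether the loop was actually vectorized; that still has to be checked per compiler version.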
=== TODO: More efficient serial code ===
It may be possible to optimize the inner loops of dab mask calculation and dab compositing
by rewriting the computation or by improving the memory layout for better cache line alignment.
See Ulrich Drepper, 2007: What Every Programmer Should Know About Memory
Try to benchmark these inner functions under an instruction/cache usage analyzer.
=== TODO: Try different tile sizes ===
It could be that libmypaint will perform better with smaller or bigger tile sizes.
A smaller size would make it more common for a set of operations to span multiple tiles
and thus be processed in parallel. It may also improve cache locality.
On the other hand, a smaller tile size will increase the tile get/set overhead.
Implementation: Make the tile size selectable when a MyPaintTiledSurface is created,
instead of fixing it with a #define.
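A rough sketch of what a runtime tile size could look like is given below. SketchTiledSurface and its functions are hypothetical stand-ins, not the real MyPaintTiledSurface API, and the RGBA layout with uint16_t channels is an assumption made for the size calculation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical surface type; the real MyPaintTiledSurface has other fields. */
typedef struct {
    int tile_size;      /* tile edge length in pixels, chosen at creation */
    size_t tile_bytes;  /* bytes per tile, assuming 4 uint16_t channels */
} SketchTiledSurface;

/* Create a surface with a caller-chosen tile size instead of a #define.
 * Returns NULL if allocation fails. */
SketchTiledSurface *sketch_surface_new(int tile_size)
{
    assert(tile_size > 0);
    SketchTiledSurface *s = malloc(sizeof *s);
    if (s == NULL) {
        return NULL;
    }
    s->tile_size = tile_size;
    s->tile_bytes = (size_t)tile_size * (size_t)tile_size * 4 * sizeof(uint16_t);
    return s;
}

/* Index of the tile containing pixel coordinate x. Uses floor division so
 * that negative coordinates map to negative tile indices. */
int sketch_tile_index(const SketchTiledSurface *s, int x)
{
    int q = x / s->tile_size;
    if (x < 0 && x % s->tile_size != 0) {
        q--;
    }
    return q;
}
```

Note that turning the tile size into a runtime value may itself cost something: divisions and loop bounds that are compile-time constants today would no longer be, so this needs benchmarking like everything else here.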
=== IDEA: Dab masks cache ===
Dab mask generation is one of the most time-consuming parts of the rendering.
_If_ the same dab masks are used over and over again, it could be very beneficial
to cache and reuse these.
First: Check the calls to draw_dab when using typical brushes. Do they often
have exactly the same radius, opaque, hardness, aspect_ratio and angle?
Simulate the cache hit/miss rate for a cache holding, say, the 32 most recently used masks.
Is it high enough that there could be a benefit?
Each hit would turn a dab mask calculation into a plain copy, but there will
be costs associated with the cache lookup and the extra memory needed.
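The hit/miss simulation could be prototyped along the following lines. This is only a sketch: DabKey, CacheSim and the functions are hypothetical names, the key fields mirror the draw_dab parameters listed above, and eviction of the least recently used entry is an assumed policy. No masks are stored; only the counters matter.

```c
#include <assert.h>
#include <string.h>

#define CACHE_SIZE 32  /* number of entries to simulate */

/* Hypothetical cache key built from the dab parameters. */
typedef struct {
    float radius, opaque, hardness, aspect_ratio, angle;
} DabKey;

typedef struct {
    DabKey keys[CACHE_SIZE];
    unsigned long last_used[CACHE_SIZE];  /* logical time of last access */
    int count;                            /* slots currently in use */
    unsigned long clock;                  /* incremented on every lookup */
    unsigned long hits, misses;
} CacheSim;

static int key_equal(const DabKey *a, const DabKey *b)
{
    /* Exact comparison on purpose: the question is whether dabs repeat exactly. */
    return a->radius == b->radius && a->opaque == b->opaque &&
           a->hardness == b->hardness &&
           a->aspect_ratio == b->aspect_ratio && a->angle == b->angle;
}

void cache_sim_init(CacheSim *c)
{
    assert(c != NULL);
    memset(c, 0, sizeof *c);
}

/* Record one draw_dab call. Returns 1 on a hit, 0 on a miss. On a miss the
 * key replaces the least recently used entry once the cache is full. */
int cache_sim_lookup(CacheSim *c, DabKey key)
{
    c->clock++;
    for (int i = 0; i < c->count; i++) {
        if (key_equal(&c->keys[i], &key)) {
            c->last_used[i] = c->clock;
            c->hits++;
            return 1;
        }
    }
    c->misses++;
    int slot = 0;
    if (c->count < CACHE_SIZE) {
        slot = c->count++;
    } else {
        for (int i = 1; i < CACHE_SIZE; i++) {
            if (c->last_used[i] < c->last_used[slot]) {
                slot = i;
            }
        }
    }
    c->keys[slot] = key;
    c->last_used[slot] = c->clock;
    return 0;
}
```

Feeding it the parameters of every draw_dab call recorded from a real painting session, then comparing hits against misses, would give a first estimate of whether the cache is worth building.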
Implementation:
* Make the brush engine request mask handles from the surface, and pass these to draw_dab().
* In draw_dab(), render the mask for each tile and store it in the operation queue.
* Cache the rendered masks in the surface, so that they can be returned when an equivalent mask is requested.
Challenge: Keeping memory consumption down.
=== IDEA: Make use of GPU processing: OpenCL and OpenGL ===
Challenge: Mitigating the high latency of CPU<->GPU transfers
Challenge: Keeping the amount of branching (at least within warps) in GPU code down
http://www.nvidia.com/content/cudazone/CUDABrowser/downloads/papers/NVIDIA_OpenCL_BestPracticesGuide.pdf
Implementation idea:
1. Move only the rendering of operations and tiles to the GPU side,
and trigger this from end_atomic and similar.
Note: for interactive drawing on canvas, this might only result in performance
improvements if the result does not have to be transferred back to the CPU before
being displayed.
That transfer can be avoided by integrating the OpenCL operations with OpenGL, so that
changes to the tiles in OpenCL automatically update an OpenGL texture.
http://www.dyn-lab.com/articles/cl-gl.html