Improve performance when being forced to double buffer #488

Manually merged
dnkl merged 12 commits from double-buffering into master 7 months ago
dnkl commented 7 months ago
Owner

This is still very early, but the idea is this: when the compositor forces us to double buffer (i.e. we are swapping between two shm buffers), instead of copying the old buffer wholesale and then applying the current frame's damage, we re-apply the old frame's damage first, then our own.

In short:

First, while applying this frame’s scroll damage, copy it to the buffer’s scroll damage list (so that we can access it via term->render.last_buf).

Also, when iterating and rendering the grid, build a pixman region covering the damaged areas. This is currently done on a per-row basis, and the resulting region is also stored in the buffer.
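
A rough sketch of that per-row accumulation, assuming a hypothetical `row_is_dirty()` helper and geometry parameters (this is not foot's actual code):

```c
#include <pixman.h>
#include <stdbool.h>

/* Accumulate each dirty row into a pixman region kept with the buffer, so
 * that the next frame can re-apply it. The helper and the geometry
 * parameters are assumptions made for this example. */
static void
collect_row_damage(pixman_region32_t *dmg, int rows, int cell_height,
                   int width_px, bool (*row_is_dirty)(int row))
{
    for (int r = 0; r < rows; r++) {
        if (!row_is_dirty(r))
            continue;

        /* Damage the whole row: x=0, y=top of row, full width, one row high */
        pixman_region32_union_rect(dmg, dmg, 0, r * cell_height,
                                   width_px, cell_height);
    }
}
```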

Now, when being forced to double buffer, first iterate the old buffer’s damage, and re-apply it to the current buffer. Then, composite the old buffer on top of the current buffer, using the old frame’s damage region as clip region. This effectively copies everything that was rendered to the last frame. Remember, this is on a per-row basis.

Then we go on and render the frame as usual.
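
A minimal pixman sketch of that copy step, assuming the old frame's scroll damage has already been re-applied (function and parameter names are illustrative, not foot's actual code):

```c
#include <pixman.h>

/* Copy the last frame's content into the current buffer, restricted to the
 * old frame's damage region. */
static void
copy_old_frame_damage(pixman_image_t *new_pix, pixman_image_t *old_pix,
                      pixman_region32_t *old_damage, int width, int height)
{
    /* Use the previous frame's damage as the clip region... */
    pixman_image_set_clip_region32(new_pix, old_damage);

    /* ...and composite the old buffer on top of the new one */
    pixman_image_composite32(PIXMAN_OP_SRC, old_pix, NULL, new_pix,
                             0, 0, 0, 0, 0, 0, width, height);

    /* Reset the clip so the rest of the frame renders normally */
    pixman_image_set_clip_region32(new_pix, NULL);
}
```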

Note that it would be really nice if we could subtract the current frame’s damage region from the clip region (no point in copying areas we’re going to overwrite anyway). Unfortunately, that’s harder than it looks; the current frame’s damage region is only valid after this frame’s scroll damage has been applied, while the last frame’s damage region is only valid before it has been applied.

Translating one to the other isn’t easy, since scroll damage isn’t just about counting lines; there may be multiple scroll damage records, each with its own scrolling region. This creates very complex scenarios.

Edit: we now subtract the current frame's damage from the copy region if the current frame has no scroll damage.
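
A hypothetical sketch of that subtraction (names are illustrative, not foot's actual code): only when the current frame has no scroll damage is it safe to subtract its damage, since scrolling would shift coordinates between the two frames.

```c
#include <pixman.h>
#include <stdbool.h>

/* Compute the region to copy from the old buffer: the old frame's damage,
 * minus the current frame's damage when no scrolling is involved.
 * 'copy' must be an initialized (e.g. empty) region. */
static void
compute_copy_region(pixman_region32_t *copy, pixman_region32_t *old_dmg,
                    pixman_region32_t *cur_dmg, bool cur_has_scroll_damage)
{
    pixman_region32_copy(copy, old_dmg);

    if (!cur_has_scroll_damage) {
        /* No point copying pixels the current frame will overwrite anyway */
        pixman_region32_subtract(copy, copy, cur_dmg);
    }
}
```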

I do have some early benchmark numbers, but will wait before publishing them since they are from debug builds. It does look promising though. There is a very noticeable/measurable performance hit compared to not being double buffered at all, but the improvement compared to the master branch is still huge.

TODO

  • code cleanup
  • more code cleanup
  • spend some more time trying to subtract current frame's damage from the old frame's damage, thereby reducing the amount of data we need to copy
  • check buffer age and do a full frame copy if the age is >= 2
  • skip old frame's damage if current frame is a full refresh
  • properly handle the last frame being force-refreshed (urgent margins, flashing, scrollback search, etc.). TODO: more testing
  • Mutter
  • KWin
  • Changelog

Closes #478

dnkl added the
performance
label 7 months ago
dnkl commented 7 months ago
Poster
Owner

Benchmark results (regular LTO release builds):

Sway 1.6, wlroots 0.13.0
Terminal size: 135x67 cells
Surface size: 953x1024
(CPU: i5-8250U CPU @ 1.60GHz, 4/8 cores/threads, 6MB L3)

Times are in microseconds (µs).

Numbers in parentheses are the time taken to “prepare” the buffer before applying the current frame’s damage (hence they are always zero in the “Immediate release” column).

Not covered here: ignoring old buffer content and instead re-rendering the entire frame.

| Benchmark           | Immediate release [1]    | Damage tracking [2]       | Copy last frame [3]           |
| ------------------- | ------------------------:| -------------------------:| -----------------------------:|
| typing [4]          | 143.6 ±40.1 (0.0 ±0.0)   | 181.6 ±35.3 (35.2 ±3.2)   | 1404.2 ±135.2 (1212.2 ±120.7) |
| cursor movement [5] | 165.9 ±39.9 (0.0 ±0.0)   | 231.5 ±33.8 (74.2 ±5.2)   | 1246.6 ±113.9 (1094.6 ±103.4) |
| scrolling [6]       | 540.6 ±106.7 (0.0 ±0.0)  | 854.5 ±91.1 (350.6 ±29.1) | 1677.8 ±282.8 (1082.8 ±213.2) |

Observations:

  • re-applying the old frame’s damage is a very clear improvement over doing a dumb copy of the last frame (which in turn is a huge improvement over re-rendering the frame from scratch).
  • scrolling is a fairly expensive operation. This is exacerbated when we’re forced to do it twice (350µs vs. 35-75µs).
  • a simple memcpy() of buffers this large is fairly expensive. In addition to taking time, it can easily thrash the cache, slowing things down further.

  [1] no double buffering; foot re-uses the previous frame’s buffer
  [2] foot re-applies the last frame’s damage before applying the current frame’s damage
  [3] foot copies the old buffer (all of it) before applying the current frame’s damage
  [4] running cat - in the shell, at the bottom of the screen, typing a single letter at a time
  [5] large C file in vim, moving the cursor with arrow keys without scrolling the content
  [6] large C file in vim, scrolling the content by holding down an arrow key

dnkl force-pushed double-buffering from 0c727d2e00 to 9bc0572c4d 7 months ago
dnkl force-pushed double-buffering from 9bc0572c4d to 8047e7372c 7 months ago
dnkl changed title from WIP: improve performance when being forced to double buffer to Improve performance when being forced to double buffer 7 months ago
dnkl force-pushed double-buffering from 9cfe0548e8 to dc4f60fd4f 7 months ago
dnkl added 1 commit 7 months ago
dnkl merged commit 04215bac6c into master manually 7 months ago
The pull request has been manually merged as 04215bac6c.