Improve performance when being forced to double buffer #488

Manually merged
dnkl merged 12 commits from double-buffering into master 1 month ago
dnkl commented 1 month ago
Owner

This is still very early, but the idea is: when the compositor forces us to double buffer (i.e. we are swapping between two shm buffers), instead of copying the old buffer wholesale and then applying the current frame's damage, we re-apply the old frame's damage first, then our own.

In short:

First, while applying this frame’s scroll damage, copy it to the buffer’s scroll damage list (so that we can access it via term->render.last_buf).

Also, when iterating and rendering the grid, build a pixman region covering the damaged areas. This is currently done on a per-row basis, and the region is also stored in the buffer.

Now, when being forced to double buffer, first iterate the old buffer’s damage, and re-apply it to the current buffer. Then, composite the old buffer on top of the current buffer, using the old frame’s damage region as clip region. This effectively copies everything that was rendered to the last frame. Remember, this is on a per-row basis.

Then we go on and render the frame as usual.

Note that it would be really nice if we could subtract the current frame’s damage region from the clip region (no point in copying areas we’re going to overwrite anyway). Unfortunately, that’s harder than it looks; the current frame’s damage region is only valid after this frame’s scroll damage has been applied, while the last frame’s damage region is only valid before it’s been applied.

Translating one to the other isn’t easy, since scroll damage isn’t just a matter of counting lines: there may be multiple scroll damage records, each with its own scrolling region, which creates very complex scenarios.

Edit: we now subtract the current frame's damage from the copy-region if the current frame has no scroll damage.

I do have some early benchmark numbers, but will wait to publish them since they are from debug builds. They look promising, though: there is a very noticeable/measurable performance hit, but the improvement compared to the master branch is still huge.

TODO

  - [x] code cleanup
  - [x] more code cleanup
  - [x] spend some more time trying to subtract the current frame's damage from the old frame's damage, thereby reducing the amount of data we need to copy
  - [x] check buffer age and do a full frame copy if the age is >= 2
  - [x] skip the old frame's damage if the current frame is a full refresh
  - [x] properly handle the last frame being force-refreshed (urgent margins, flashing, scrollback search etc.) TODO: more testing
  - [x] Mutter
  - [x] KWin
  - [x] Changelog

Closes #478

dnkl added the
performance
label 1 month ago
dnkl added 3 commits 1 month ago
c860d1792f
shm: track busy buffers’ age, and add compile-time option to force double buffering
e05866a1a6
render: wip: re-apply last frame’s damage when forced to double buffer
dnkl added 2 commits 1 month ago
4cbfc19949
render: subtract current frame’s damage when there’s no scroll damage
0c727d2e00
render: code cleanup, log double buffering time
Poster
Owner

Benchmark results (regular LTO release builds):

Sway 1.6, wlroots 0.13.0
Terminal size: 135x67 cells
Surface size: 953x1024
(CPU: i5-8250U CPU @ 1.60GHz, 4/8 cores/threads, 6MB L3)

Times are in microseconds (µs).

The numbers in parentheses are the time taken to “prepare” the buffer before applying the current frame’s damage (hence they are always zero in the “Immediate release” column).

Not covered here: ignoring old buffer content and instead re-rendering the entire frame.

| Benchmark           | Immediate release[^1]   | Damage tracking[^2]       | Copy last frame[^3]           |
| ------------------- | ----------------------: | ------------------------: | ----------------------------: |
| typing[^4]          | 143.6 ±40.1 (0.0 ±0.0)  | 181.6 ±35.3 (35.2 ±3.2)   | 1404.2 ±135.2 (1212.2 ±120.7) |
| cursor movement[^5] | 165.9 ±39.9 (0.0 ±0.0)  | 231.5 ±33.8 (74.2 ±5.2)   | 1246.6 ±113.9 (1094.6 ±103.4) |
| scrolling[^6]       | 540.6 ±106.7 (0.0 ±0.0) | 854.5 ±91.1 (350.6 ±29.1) | 1677.8 ±282.8 (1082.8 ±213.2) |

Observations:

  • double buffering damage is a very clear improvement over doing a dumb copy of the last frame (which in turn is a huge improvement over re-rendering the frame from scratch).
  • scrolling is a fairly expensive operation. This is exacerbated when we’re forced to do it twice (350µs vs. 35-75µs).
  • a simple `memcpy()` is fairly expensive on buffers as large as these. And in addition to taking time, it can easily thrash the cache, slowing things down further.

[^1]: no double buffering; foot re-uses the previous frame’s buffer
[^2]: foot re-applies the last frame’s damage before applying the current frame’s damage
[^3]: foot copies the old buffer (all of it) before applying the current frame’s damage
[^4]: running `cat -` in the shell, at the bottom of the screen, typing a single letter at a time
[^5]: large C file in vim, moving the cursor with the arrow keys without scrolling the content
[^6]: large C file in vim, scrolling the content by holding down an arrow key

dnkl force-pushed double-buffering from 0c727d2e00 to 9bc0572c4d 1 month ago
dnkl force-pushed double-buffering from 9bc0572c4d to 8047e7372c 1 month ago
dnkl added 1 commit 1 month ago
dnkl changed title from WIP: improve performance when being forced to double buffer to Improve performance when being forced to double buffer 1 month ago
dnkl added 1 commit 1 month ago
dnkl added 4 commits 1 month ago
dnkl force-pushed double-buffering from 9cfe0548e8 to dc4f60fd4f 1 month ago
dnkl added 1 commit 1 month ago
dnkl merged commit 04215bac6c into master 1 month ago manually
The pull request has been manually merged as 04215bac6c.