CI: slow I/O #805

Open
opened 2 weeks ago by dachary · 12 comments

The sqlite testsuite of [forgejo](https://codeberg.org/Forgejo) is rather I/O intensive. Today it took 1h39min, although it runs in about 20 minutes on my laptop. I wonder if that is because HDDs are used instead of SSDs?

https://ci.codeberg.org/dachary/forgejo/pipeline/35/27


Poster

[Another run of test-sqlite today](https://ci.codeberg.org/dachary/forgejo/pipeline/40) took 1h20min. It failed, which indicates the presence of time-sensitive tests / race conditions.


dachary added a new dependency 1 week ago
Owner

Why is a temporary database even stored on disk? Can't it be kept in system memory? I don't think it makes sense to write this to disk at all. Installing SSDs is of course possible (we had this at first); according to our benchmarks, CI builds were about 10% slower on HDDs, though it might be much worse for I/O-intensive operations. But our estimates also concluded that the SSDs would last about one year each. I don't plan to create a pile of e-waste for repeated CI builds every year if software optimizations are possible.

Poster

While this is indeed debatable, changing the Gitea tests to implement this would require a **lot** of work. An alternative to SSDs would be to have a lot of RAM and use a RAM-backed file system for the tests, since everything is thrown away when a test run completes.

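For illustration, a minimal sketch of what a RAM-backed scratch area could look like; the mount point, size, and checkout path are placeholders, not the actual Codeberg runner layout:

```sh
# Sketch only: run the whole checkout from a RAM-backed filesystem so the
# sqlite database and the fixture git repositories never touch the HDD.
# Mount point, size and checkout path are placeholders.
sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramwork
cp -a ~/forgejo /mnt/ramwork/forgejo        # hypothetical location of the checkout
cd /mnt/ramwork/forgejo && make test-sqlite # everything is discarded on unmount
```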
Collaborator

It might make some sense to run this test suite on bare metal once (on HDD) and see whether we are losing a significant amount of performance to virtualization somewhere.

Poster

I would be surprised if it makes a difference. Storage virtualization may degrade performance by ~25% if badly configured. What is observed here is a factor-of-five slowdown.

Owner

Honestly, I cannot imagine this being down to HDD vs SSD performance alone. We discussed this a little today, and we think that memory caching should already be sufficient. What is the write load of Gitea? I suppose it creates a few records in the db (which are then in cache) and performs several reads / checks on them? Then repeats this for the next test? If this isn't writing tons of data, even the write operations shouldn't be slow while the backlog is flushed. I don't know if there are any filesystem specifics that could be optimized (I think we're using BTRFS inside a ZFS pool, IIRC? Can't say much about that choice).

What about meeting virtually to figure this out? We could run several test scenarios, run write benchmarks inside the container, look at the netdata graphs for the disks, and more. I might be available today (starting in an hour).

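As a starting point for the write benchmarks mentioned above, a hedged sketch (the target path and the 1 GiB size are placeholders; `oflag=direct` bypasses the page cache so the disk itself is measured):

```sh
# Sketch only: rough sequential write benchmark from inside the CI container.
dd if=/dev/zero of=/workspace/bench.tmp bs=1M count=1024 oflag=direct conv=fsync
rm /workspace/bench.tmp
```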
Poster

To figure that out:

  • launch a test
  • measure I/O wait while it runs

If I/O wait is >10% the whole time, it's the HDD killing performance. If not, it's something else.

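One way to take that measurement while a run is in progress, assuming the sysstat package is available on the runner:

```sh
# Sketch only: sample CPU-level I/O wait every 5 seconds during the test run.
# Watch the %iowait column; >10% sustained points at the disk.
iostat -c 5
# vmstat's "wa" column reports the same thing if iostat is not installed:
vmstat 5
```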
Owner

I haven't looked yet at whether forgejo modifies any tests, but the following diff applied to my 1.18 Gitea checkout seems to let make test-sqlite run fine (noticeably faster than on my local disk; PASSed after ~15 minutes on a crappy machine). Am I missing something?

diff --git a/tests/sqlite.ini.tmpl b/tests/sqlite.ini.tmpl
index f5e8895e0..a397f8a79 100644
--- a/tests/sqlite.ini.tmpl
+++ b/tests/sqlite.ini.tmpl
@@ -3,7 +3,7 @@ RUN_MODE = prod
 
 [database]
 DB_TYPE = sqlite3
-PATH    = tests/{{TEST_TYPE}}/gitea-{{TEST_TYPE}}-sqlite/gitea.db
+PATH    = ":memory:"
 
 [indexer]
 REPO_INDEXER_ENABLED = true

And as per https://www.sqlite.org/inmemorydb.html, you can likely even do more advanced stuff if really necessary.
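For instance (a sketch taken from that documentation page, not something the Gitea test harness is known to require): a named shared-cache in-memory database lets several connections in the same process see the same data, whereas a plain `:memory:` database is private to the connection that opened it.

```sh
# Sketch only: open a shared in-memory database via an SQLite URI filename.
sqlite3 "file::memory:?cache=shared"
```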

Poster

The tests are not modified in any way, except for two tests that are commented out because they are flaky.

I applied the patch above and ran the test again. The DB is not the only disk operation the tests do: they also set up and use git repositories of various complexities, since they are integration tests that require semi-realistic conditions to operate effectively.

  • https://ci.codeberg.org/dachary/forgejo/pipeline/43/8 1h45 (starts 0h58)
  • https://ci.codeberg.org/dachary/forgejo/pipeline/44/8 16mn (starts 3h23)
  • https://ci.codeberg.org/dachary/forgejo/pipeline/45/8 timeout after 2h (starts 3h57)
  • https://ci.codeberg.org/dachary/forgejo/pipeline/46/8 46min, started as fast as the 16mn run and slowed down considerably towards the end (starts 11h19)
  • https://ci.codeberg.org/dachary/forgejo/pipeline/47/8 timeout after 2h (starts 12h33)

Note that the second run, which completed within 16mn, ran at a time (3h23) when the disk was **not** stalling processes. The other two ran while the disk was stalling all processes between 10% and 20% of the time.

> Pressure Stall Information identifies and quantifies the disruptions caused by resource contentions. The "some" line indicates the share of time in which at least some tasks are stalled on I/O. The "full" line indicates the share of time in which all non-idle tasks are stalled on I/O simultaneously. In this state actual CPU cycles are going to waste, and a workload that spends extended time in this state is considered to be thrashing. The ratios (in %) are tracked as recent trends over 10-, 60-, and 300-second windows.

(screenshot: netdata I/O pressure stall graph, "some" and "full" lines)
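The same numbers can be read directly from the kernel on the host; a generic sketch, not specific to the Codeberg machines:

```sh
# Sketch only: Pressure Stall Information for I/O.
# "some" = share of time at least one task was stalled on I/O,
# "full" = share of time all non-idle tasks were stalled, over 10/60/300s windows.
cat /proc/pressure/io
```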

The global I/O wait is another way to look at this, and it shows the same pattern.

> Total CPU utilization (all cores). 100% here means there is no CPU idle time at all. You can get per core usage at the CPUs section and per application usage at the Applications Monitoring section. **Keep an eye on iowait. If it is constantly high, your disks are a bottleneck and they slow your system down.**

(screenshot: netdata total CPU utilization graph, including iowait)

There is a lot going on on this machine and it is unclear what exactly is applying pressure. It does not look like it is the Forgejo CI job because the pressure / iowait is high even when it is not running. And when it ran within 16mn, there was no spike in pressure.

Looking at the other users / cgroups I don't see a correlation that suggests one of them is responsible for the pressure. My conclusion at this time is that:

  • Forgejo integration tests are sensitive to iowait/pressure because they do a significant amount of IO.
  • **Other processes** are almost constantly applying pressure (doing a significant amount of IO), which impacts Forgejo (but also any other process doing IO), as shown below on a 12h period.
  • **When HDDs are in play, a single process can apply high pressure without doing much IO, just random reads** (see [this blog post that explains why](https://blog.dachary.org/2011/04/29/random-read-disk-stress-test/)).
  • It is very difficult to figure out which process is applying pressure on IO at a given point in time because there is a wide variety of workloads.

(screenshot: 12-hour netdata view of I/O pressure)

Poster

https://ci.codeberg.org/dachary/forgejo/pipeline/46/8 took 46min: it started as fast as the 16mn run (see above) and slowed down considerably towards the end (starts 11h19).


When the job is slowed down by IO pressure, the logs show lines such as:

=== TestAPICommentReactions (/go/src/codeberg/gitea/tests/integration/api_issue_reaction_test.go:82)
+++ TestAPICommentReactions is a slow test (took 32.989528651s)

In the [logs of the run that took only 16mn](https://ci.codeberg.org/dachary/forgejo/pipeline/44/8), there are 27 slow tests out of 672. In the logs of the [run that took 1h46](https://ci.codeberg.org/dachary/forgejo/pipeline/43/8), there are 202 slow tests out of 672.
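For reference, a quick way to reproduce those counts from a saved job log (the file name is a placeholder for a downloaded log):

```sh
# Sketch only: count the tests the harness flags as slow in a saved CI log.
grep -c "is a slow test" pipeline-44.log
```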

rwa added the s/Woodpecker label 5 days ago

Could donating some hardware to Codeberg help solve this issue?

If yes, what kind of SSD or server would fit the bill?

(maybe I can have servers equipped with "Dual Xeon E5-2690 v2", 128 GB of RAM and SSD)

Poster

That would be perfect. Out of the 128GB of RAM, 64GB could be used as a RAM-backed file system for running the tests, so as not to wear the SSDs too fast.

Blocks: forgejo/forgejo#35 [CI] slow I/O
Reference: Codeberg/Community#805