Confirm git GC is working #419

Open
opened 9 months ago by Captain4LK · 11 comments

I rebased two of my repositories, since the first commits didn't really have anything to do with the project and there were quite a few larger files that have since been removed.

However, after doing so the repository size increased, so I wonder if there is any way to prune these files on the server? On GitLab there is a housekeeping feature which seems to do just that.
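For reference, a rough local equivalent of such housekeeping looks something like the sketch below (this is not necessarily what a server-side job does, and the expiry values are deliberately aggressive):

~~~
# Sketch: prune rewritten-away history in a local clone.
# Not what Codeberg runs server-side; expiry values are illustrative only.
git reflog expire --expire=now --all   # drop reflog entries that keep old commits reachable
git gc --prune=now --aggressive        # repack and delete now-unreachable objects
git count-objects -vH                  # show remaining loose objects and pack sizes
~~~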

Collaborator

There is a garbage cleaning job which probably takes care of this, but we have started to wonder why we don't see anything getting removed. We're on it, testing it on codeberg-test.org, and will keep you updated here.

Can you keep the links to the original commits around, in case we need them for further testing?

Also relevant: #339

fnetX added the question label 9 months ago
Poster

Here is a link to the latest commit removed by the rebase:

https://codeberg.org/Captain4LK/SoftLK-lib/commit/375ca2e37242097db602d8aa9d180515a89a898c (diff), https://codeberg.org/Captain4LK/SoftLK-lib/commits/commit/375ca2e37242097db602d8aa9d180515a89a898c (full repo)

I can't provide links from the other repo, since I pruned it locally.

fnetX changed title from Housekeeping equivalent? to Confirm git GC is working 7 months ago
Collaborator

I did not yet manage to confirm that the Gitea GC task is doing anything; I never found a commit that was gone afterwards.

@techknowlogick any ideas how we could check that? The codeberg-test repos are gone, but before that we had a test repo with force-pushed commits, removed branches, closed pull requests etc., and the commits had not been cleaned up after more than a month (even after manually triggering the job).
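One low-level check would be to test whether the rebased-away object still exists in the bare repo on disk; a sketch (the storage path is an assumption about the on-disk layout):

~~~
# Sketch: does the old commit object still exist? (path is assumed, adjust as needed)
git -C /data/git/gitea-repositories/captain4lk/softlk-lib.git \
    cat-file -e 375ca2e37242097db602d8aa9d180515a89a898c \
    && echo "object still present" || echo "object gone"
~~~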

Collaborator

The Gitea-internal git GC is apparently working correctly (confirmed roughly the same behaviour as running `git gc` in the data folders, so all seems fine), but it runs out of memory for very big repos. That's probably why it didn't do anything on the old codeberg-test: I assume it failed on some very big repos and never got around to repacking our test data - we just didn't look closely enough at the log messages.

I'll be looking into improving this on codeberg-test; if it doesn't run OOM for the biggest repos there, it should be fine to start the job on prod with the same settings (and monitor it closely!).

Note: It **does** remove commits that are referenced somewhere, but keeps the repos intact. This should be fine. It keeps the old commits for a while; we might want to increase the default time here, because it's probably not uncommon that a PR is open for a while and force-pushes still need to be compared.
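If we want a longer grace period, the relevant git settings could be raised accordingly; a sketch, assuming we picked eight weeks (the value is an example, not a decision):

~~~
# Sketch: keep unreachable objects (e.g. force-pushed PR heads) around longer before pruning.
# "8.weeks.ago" is an assumed example value.
git config --system gc.pruneExpire "8.weeks.ago"
git config --system gc.reflogExpireUnreachable "8.weeks.ago"
~~~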

fnetX added the infrastructure label and removed the question label 7 months ago
Collaborator

> Here is a link to the latest commit removed by the rebase:
>
> https://codeberg.org/Captain4LK/SoftLK-lib/commit/375ca2e37242097db602d8aa9d180515a89a898c (diff), https://codeberg.org/Captain4LK/SoftLK-lib/commits/commit/375ca2e37242097db602d8aa9d180515a89a898c (full repo)
>
> I can't provide links from the other repo, since I pruned it locally.

I think git gc doesn't remove it, since there is still a reference attached to it: `pull/1/head`. The diff view of pull #1 should still work; if git had pruned it, it would be gone!

-> https://codeberg.org/Captain4LK/SoftLK-lib/pulls/1/files
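To see which refs still keep that commit reachable, something like this should work on the server-side bare repo (a sketch; the repository path is an assumption):

~~~
# Sketch: list all refs (branches, tags, Gitea's refs/pull/*) that contain the commit.
cd /data/git/gitea-repositories/captain4lk/softlk-lib.git   # assumed on-disk path
git for-each-ref --contains=375ca2e37242097db602d8aa9d180515a89a898c
git for-each-ref refs/pull   # the pull refs Gitea keeps for PR diff views
~~~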

Collaborator

git gc would remove it: https://codeberg.org/fnetX/SoftLK-lib-git-gc-test/commit/375ca2e37242097db602d8aa9d180515a89a898c (gone after manually running `git gc` on the repo)

We never fired the job on prod for all repos.

Update: Ah well, I don't have this specific pull ... hmm ...

fnetX self-assigned this 6 months ago
Collaborator

I updated the Gitea config file with some parameters regarding Garbage Collection.

It seems to work now (it simply calls `git gc`, now with the additional settings I provided). Previous experiments on codeberg-test probably didn't yield any results because memory usage blew up and the process ran out of memory before it ever reached the test data I wanted it to repack and clean up. I didn't pay much attention to the log messages back then; I only wondered why pretty much nothing happened.

This is confirmed to work on codeberg-test now: Git GC is invoked on each repo, one after another, with the correct arguments:

  • `--auto` to skip repos that don't need much cleaning (see the sketch after this list)
  • `--prune=8.weeks.ago` to make sure people don't lose working objects, e.g. force-pushes on PRs that are still relevant. Two months should be reasonable. Active commits (on the latest branches etc.) won't be affected.
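For reference, `--auto` only does anything once a repo crosses the `gc.auto` loose-object threshold (git's default is 6700), and the threshold can be lowered per invocation to make more repos eligible. A sketch of the per-repo call, not necessarily Gitea's exact command line:

~~~
# Sketch: per-repository invocation (placeholder path; not necessarily Gitea's exact command).
# gc.auto=1000 lowers the loose-object threshold so --auto acts more often.
git -C /data/git/gitea-repositories/<owner>/<repo>.git \
    -c gc.auto=1000 gc --auto --prune=8.weeks.ago
~~~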

But of course, the few testing repos on codeberg-test.org are not enough to estimate the real impact on prod, especially the time it will take.

Running this might lead to high resource usage, but it seems to cap at a few gigabytes of RAM for the biggest repos. And since the repos are optimized one by one, this should be fine.

The git repack operations should also improve working with the Git repositories; it would be interesting to measure this impact somehow. A single experiment did not show any significant difference, but it is also not representative of all the Git repos on Codeberg.
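One crude way to measure it would be to compare object statistics and clone times before and after a repack; a sketch (the repository path and target directory are placeholders):

~~~
# Sketch: before/after comparison for a single repo (paths are placeholders).
git -C /data/git/gitea-repositories/<owner>/<repo>.git count-objects -vH     # loose objects, packs, sizes
time git clone --no-local /data/git/gitea-repositories/<owner>/<repo>.git /tmp/clone-bench   # --no-local forces the normal pack transfer
~~~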

I think we're good to go to run this on prod. 🚀

Collaborator

Just adding some thoughts:

Running this the first time on prod will also heavily touch files on the system and might grow the size of the backup snapshot a lot. We must make sure that everything works fine there and stay prepared for an increased size.

During normal maintenance this should not be an issue, since repos should only be occasionally repacked. But of course, transfer sizes might grow.

For the first run, it could also be an option to run the Git GC for some known and active repos manually first, so that we can catch potential problems early (like removing data that was still wanted, although I think this is super unlikely). This would also distribute the load on the backup systems over a larger timeframe.
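A hand-picked first round could look roughly like this (the repository paths are made-up examples):

~~~
# Sketch: GC a few hand-picked, known-active repos first (paths are hypothetical).
for repo in /data/git/gitea-repositories/example-org/big-project.git \
            /data/git/gitea-repositories/example-user/active-repo.git; do
    git -C "$repo" gc --auto --prune=8.weeks.ago
done
~~~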

Collaborator

I just ran some experiments on a random subset of 1000 repos. I wanted to share a nice table of how many repos had (no) action taken, but the result is very simple: no action was taken at all.

Looks like we have to tune the settings a little more aggressively to see any effect.


Btw, this might be useful:

~~~
# pick N random repositories from the Gitea data directory
randomrepos() { find /data/git/gitea-repositories/ -mindepth 2 -maxdepth 2 -type d | sort -R | tail -n "$1"; }
# run an auto GC on 100 random repos
randomrepos 100 | while read -r repo; do echo "$repo"; git -C "$repo" gc --auto --prune=8.weeks.ago; done
~~~
Collaborator

Repeated this with a changed setting (`gc.auto = 1000`) on a subset of 500+500 random repos tonight and later on; 37+17 (= 7.4% + 3.4%) of them were compacted.

Will look into fine-tuning this now ... then put these settings into the config and run it for all repos.


Update: And finally, for all repos:
638 of 15758 repos were compacted (≈ 4%), saving about 5.2 GiB of disk space.

Collaborator

Just want to "close" this topic, as I will stop actively working on it.

Only remaining question on my side:

Some repos show something along the lines of:

~~~
Auto packing the repository for optimum performance.
See "git help gc" for manual housekeeping.
Nothing new to pack.
Expanding reachable commits in commit graph: 1109949, done.
Writing out commit graph in 3 passes: 100% (3329847/3329847), done.
warning: There are too many unreachable loose objects; run 'git prune' to remove them.
~~~

and I don't really understand this. `git gc` is supposed to invoke `git prune`, and the docs for the latter say that it is normally not required to run it directly. I can't find any information on why it is sometimes necessary to run `git prune` in addition to `git gc`. Maybe someone finds this and can enlighten me?
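For inspecting this, the loose-object count can be checked per repo; a sketch (the expiry mirrors the value used above, and my understanding of the behaviour may be incomplete):

~~~
# Sketch: inspect loose objects after gc. "count" is the number of loose objects;
# unreachable objects newer than the prune expiry are kept loose on purpose.
git count-objects -v
# an explicit prune honouring the same expiry only drops the older unreachable objects:
git prune --expire=8.weeks.ago
~~~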

Currently this was always invoked via the shell; I'll move it into the Gitea config and schedule it to run from time to time (when Gitea has been running long enough). It should, however, be considered to make the Gitea schedules persistent, since a "once a week" job will only fire if Gitea has not been restarted in the meantime.
