Archive URL generated archives changed (but their content remains identical) #1366

Closed
opened 2023-12-11 12:45:15 +00:00 by Arsen · 33 comments

Comment

hi,

it seems that recently the compressor for .tar.gz archive URLs changed.

```
~$ diff <(<dls/foot-1.16.2.tar.gz gzip -d) <(</var/cache/distfiles/foot-1.16.2.tar.gz gzip -d)
~$ diff <(<dls/foot-1.16.2.tar.gz cat) <(</var/cache/distfiles/foot-1.16.2.tar.gz cat)
Binary files /dev/fd/63 and /dev/fd/62 differ
```

the above shows that compressed content differs while decompressed
content remains identical.
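
the same comparison expressed as checksums, for reference (a sketch; assumes GNU coreutils and gzip are available):

```
~$ gzip -dc dls/foot-1.16.2.tar.gz | sha256sum
~$ gzip -dc /var/cache/distfiles/foot-1.16.2.tar.gz | sha256sum   # same digest as above
~$ sha256sum dls/foot-1.16.2.tar.gz /var/cache/distfiles/foot-1.16.2.tar.gz   # two different digests
```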

(dls/foot-1.16.2.tar.gz was downloaded from the gentoo master distfiles mirror, which carries older copies of these files; /var/cache/distfiles/foot-1.16.2.tar.gz was fetched from codeberg at around two in the morning last night)

was this intended? what is the stability guarantee on these endpoints?

EDIT: both of the archives were fetched from https://codeberg.org/dnkl/foot/archive/1.16.2.tar.gz but at different times (a few weeks back vs. today) - apologies for the lack of context.

see also: https://lwn.net/Articles/921787/

n0toose added the bug and s/Forgejo labels 2023-12-12 00:24:41 +00:00

If the checksum changed, that's a huge problem for anyone who wants to package software from Codeberg. Many formats, e.g. AUR or Flatpak, rely on these checksums. I fortunately never faced this problem myself. It looks like you can configure how Git creates archives (https://github.com/go-gitea/gitea/issues/26620#issuecomment-1685850286), so this should maybe be done on Codeberg.
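
The Gitea comment linked above suggests pinning the compressor via Git's `tar.<format>.command` configuration. A minimal sketch of what that could look like server-side (whether and where Codeberg should apply it is exactly the open question here):

```
# Make `git archive` pipe its tar output through the external gzip binary
# (the pre-2.38 default) instead of Git's builtin gzip implementation.
git config --system tar.tar.gz.command 'gzip -cn'
git config --system tar.tgz.command 'gzip -cn'
```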
Member

@fnetX Might be worth pinging you over this.

Gusted referenced this issue from a commit 2023-12-15 19:07:21 +00:00
Owner

Wouldn't a change now potentially break recently generated archives, too? Is this problem "small enough"?

Author

unsure - i don't have stats on which packages from codeberg were affected.

Owner

We have applied a change to Codeberg that uses the old way to generate archives for all commits that are older than November 12th 2023, which is the date when we deployed the new Git version.

It is only a heuristic, but probably good enough. Please give us feedback if this solves your problem.

Also note that this is best-effort. We still do not make explicit guarantees about archive checksum stability, and recommend that tooling rely on, e.g., a checksum of the content rather than of the archive.


We have applied a change to Codeberg that uses the old way to generate archives

Thanks

We still do not make explicit guarantees about archive checksum stability, and recommend that tooling rely on, e.g., a checksum of the content rather than of the archive.

That's a big problem. Most things use the archive checksum, with no way to use the content checksum. If checksum stability is not guaranteed, that's a really big problem for open-source projects when packages randomly stop working.

Author

checking contents requires decompressing first. if decompressor inputs aren't checked, the decompressors themselves are exposed to possible exploitation (e.g. zip bombs, among other things)
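
one way to bound that risk when hashing decompressed content, sketched with an arbitrary size cap (the 512 MiB limit and the file name archive.tar.gz below are only examples, not a recommendation):

```
# refuse to hash more than 512 MiB of decompressed data, so a maliciously
# compressed archive cannot expand without bound
limit=$((512 * 1024 * 1024))
size=$(gzip -dc archive.tar.gz | head -c "$((limit + 1))" | wc -c)
if [ "$size" -gt "$limit" ]; then
    echo "archive expands past ${limit} bytes; refusing to hash it" >&2
    exit 1
fi
gzip -dc archive.tar.gz | sha256sum
```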
Owner

In my opinion, checksum stability should be guaranteed within Git itself, since we only use the standard way of compressing archives.

Our patch also leads to inconsistencies across platforms. For example, a Git archive downloaded from GitHub is now different from an archive downloaded from Codeberg in most cases, and old Git archives downloaded from arbitrary Forgejo instances now also differ from Git archives downloaded from Codeberg.

I would expect that when I mirror source code on three forges, they all produce the same checksum, because there is a standard way for Git to create these archives.

It's hard to make guarantees for something that is under the control of the upstream project. Unlike GitHub, we don't have the people to operate a fork of Git itself.

Owner

My offer is that you follow the announcements regarding new Forgejo releases and test-drive them, e.g. on https://next.forgejo.org (or work with us on our environments). If you report that checksums change between releases, it will be considered a bug and will block the release until it is resolved. This will likely work out.

But I doubt that we have enough human resources for a proper fix at our scale. Maybe you have a simple proposal for how we could guarantee checksum stability, though?


There needs to be a new regression test that verifies the stability of these checksums. https://code.forgejo.org/forgejo/end-to-end already has tests of that kind, but not this one specifically.
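
A rough sketch of what such a test could look like (the URL and the pinned digest below are placeholders, not values from the existing test suite):

```
# Compare a freshly generated archive against a pinned, known-good digest.
expected="0000000000000000000000000000000000000000000000000000000000000000"  # placeholder
url="https://example-forgejo.test/owner/repo/archive/v1.0.0.tar.gz"          # placeholder

actual=$(curl -fsSL "$url" | sha256sum | cut -d' ' -f1)
if [ "$actual" != "$expected" ]; then
    echo "archive checksum drifted: expected $expected, got $actual" >&2
    exit 1
fi
```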

I have found a changed checksum in one of my repos. The checksum of https://codeberg.org/JakobDev/jdMinecraftLauncher/archive/5.2.tar.gz changed from `0f63819ad0c7ed27b0e4e486f3550031d12986ac4160d34ba1932c97393208db` to `a434bd1db63276ca3d50cf668952629d6914b5e5fc053fd43832393a1197191f`.
Owner

Can you say when the checksum changed? The commit this tag references is old, and it should have chosen the old checksum implementation in this case 😕

When was the 0f… checksum generated?


The 0f… checksum was from the day I created the release. I don't know when the checksum changed. I updated the Flatpak of the program today, which is how I found out it had changed.

I found another one: https://codeberg.org/JakobDev/minecraft-launcher-lib/archive/6.4.tar.gz changed from `78456049ad624337eb00595279363f0215743b427abb62bac620ed1e4fb6c854` to `0e9e6248f514caffd282f9597f48f4eac131c23c15b1034f724e4a66ea235a93`.

That's a huge problem for me. I create AUR packages for my software. If the checksum just randomly changes, the package breaks and can no longer be installed.
Owner

@JakobDev Can you see something obviously wrong with https://codeberg.org/Codeberg/forgejo/commit/53ee27c8f8df1c429da1186ca287674be612f240 ? I am wondering, because our testing confirmed that the patch was working as desired (we compared the hashes with and without the patch, and confirmed that those after the specified timestamp use the new method and older ones the external gzip method).

The only thing I can currently think of is that we might actually need to specify it for the tgz format, too (although I read the Git docs as saying this should only be necessary when the file name is actually .tgz).
Owner

Reading the patch again, `if commit.Author.When.After(gitVersionAbove2_38_0) {` sounds wrong. Mustn't it be "Before" this specific timestamp that we want to use the gzip command? With "After", old commits get the new builtin-gzip archives and only new commits keep the external-gzip ones, which is the inverse of what we intended. Could we have screwed up our testing?

CC @Gusted
Owner

Mustn't it be "Before" this specific timestamp that we want to use the gzip command

Weirdly enough, yeah, it should be. I must've read it the other way around; working with time sucks.
Owner
https://codeberg.org/Codeberg/forgejo/commit/7b60917310152691af40a9baf79b861d5fa7e8dd

The linked archives don't have the old checksum back. Maybe the generated archive is cached somewhere on the server.

@Gusted The wrong check was there for two weeks. By fixing it, you most likely break the checksums of all archives from this time.
Owner

The linked archives don't have the old checksum back. Maybe the generated archive is cached somewhere on the server.

As far as I'm aware it's not deployed, given we still need to decide what we should do here.

I'm of the opinion that we should drop the special check altogether, as that would fix what my broken check unfortunately caused; and people don't seem to rely that heavily on the checksums of the 'old' archives, given that it took a whole month to even notice the checksum changed.

it took a whole month to even notice the checksum changed.

I wouldn't say that nobody noticed. Many users don't write a bug report when something breaks. But if Codeberg can now guarantee the checksum will not change any further, I'm happy with it.
Owner

I wouldn't say that nobody noticed. Many users don't write a bug report when something breaks.

Fair, but if it's breaking things like packages that rely on it, I would've expected some noise about this sooner.

But if Codeberg can now guarantee the checksum will not change any further, I'm happy with it.

The problem is that the currently deployed code is not 'correct' and IMO still needs fixing, either by removing the fix or by reversing the check so it's stable for old archives again (their checksums are currently breaking).
Owner

All archives created for commits prior to 23:00 UTC today will use the old checksum algorithm from now on, which is consistent both with past behaviour (before the broken patch was applied) and with how the checksums are created now.

Starting with commits created tonight, we'll use the new Git default.

Choosing a time in the future ensures that a commit only has one way the archive is generated.
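
A rough sketch of the resulting selection, expressed as shell commands (the cutoff value, output path, and `$ref` placeholder are illustrative; the actual logic lives in Forgejo's Go code):

```
# Illustrative cutoff; the deployed value is "23:00 UTC today" as described above.
cutoff=$(date -u -d '2024-01-12 23:00' +%s)
author_time=$(git log -1 --format=%at "$ref")   # author date, matching commit.Author.When

if [ "$author_time" -lt "$cutoff" ]; then
    # old behaviour: pipe the tar stream through external gzip, as Git < 2.38 did
    git -c tar.tar.gz.command='gzip -cn' archive --format=tar.gz -o out.tar.gz "$ref"
else
    # new behaviour: Git's builtin gzip (the default since 2.38)
    git archive --format=tar.gz -o out.tar.gz "$ref"
fi
```

Because the cutoff lies in the future relative to any affected commit, each commit maps to exactly one compression method.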

Last but not least, this likely needs an integration test in Forgejo to prevent the same from happening again with future Git versions, in case the algorithm is modified again. I still fear this is out of scope for the Codeberg issue tracker; contributions are welcome, though.

fnetX closed this issue 2024-01-12 18:46:50 +00:00

Last but not least, this likely needs an integration test in Forgejo to prevent the same from happening again with future Git versions, in case the algorithm is modified again. I still fear this is out of scope for the Codeberg issue tracker; contributions are welcome, though.

I opened an issue on the Forgejo tracker so this doesn't get lost: https://code.forgejo.org/forgejo/end-to-end/issues/83

Member

External issue apparently caused by this problem, leaving here for posterity: https://github.com/microsoft/vcpkg/pull/35898

CC: @generic-pers0n


Looks like this is happening again: https://github.com/gentoo/guru/pull/138

n0toose reopened this issue 2024-02-18 10:20:11 +00:00
n0toose added the Codeberg label 2024-02-18 10:20:21 +00:00
Author

seems that it is indeed wise for maintainers to run `make distcheck`, given how many problems of this sort have happened
Owner

There has been no further change on our end. This issue is not actionable without a detailed description of which checksum changed between which values. And it's especially interesting to learn when the checksum was created.

fnetX closed this issue 2024-02-18 12:49:57 +00:00
Member

Sorry about that. I will pass on what you mentioned to the upstream PR.

Owner

So the situation likely looks like this:

  • archives downloaded before November 12th 2023 (the date we deployed the new Git version which introduced the problem): They have the same checksum as today
  • archives downloaded between November 12th 2023 and Jan 12 2024: The checksums might have changed between the initial download and today, because of the initial problem and our attempt to fix it; archives from older commits downloaded during this time might also have changed checksums. But they should be back to the initial checksums now.
  • archives downloaded after Jan 12 2024: Our patch has made checksums deterministic depending on the date of the commit.

So the problem only affects archive checksums which were generated or compared between Nov 12 and Jan 12. Because the behaviour during this time is a little complicated and depends on when a checksum was created, there is not much we can do about it.

For example:

  • a commit was made before Nov 12th, and a person downloading an archive sees 1234abcd as the hash.
  • someone downloading the same archive after Nov 12th might see afd3c523 as the hash, and put this somewhere for reference.
  • someone comparing the archive after Nov 12th on another day might have received 1234abcd as the hash again, resulting in a mismatch
  • after Jan 12, the hash was made deterministic. All commits from before Jan 12 use the old compression algorithm and the checksum is back to 1234abcd
  • someone downloading the commit after Jan 12 receives the original checksum; but if they compare it with the checksum of a "broken" archive from between Nov 12 and Jan 12 (the reference mentioned above), they have a mismatch.

@fnetX Your analysis seems to be correct. The newest checksum "fix" commit is basically just reverting back to the old one:

https://github.com/gentoo/guru/commit/a26b4c839e8ce5880ff3a1bce3d0e5176e07bab4
https://github.com/gentoo/guru/commit/8b748050a20c9d216e33c3a8d5017429159a4b5d
Owner

OK. So it looks like you attempted to fix the problem on your end on Jan 5 using the "wrong" checksum for the archive (which was the version generated by Codeberg on that day). The problem came from both sides trying to fix the same problem, which made it incompatible again.

I apologize for all the trouble this has caused. We didn't initially pay much attention to the change in Git's defaults for archive generation, but should have.


This issue was reported to MacPorts yesterday (https://trac.macports.org/ticket/69395), and upon investigation I see that 4 of our 8 ports that get their source code from Codeberg are affected because they were last updated between November 12 2023 and January 12 2024.

It took a while for me to find this issue tracker and then this issue. I had originally thought it might be mentioned in your blog, but I did not find anything about it there. When GitHub had this problem, they had writeups about it on their blog, both before they deployed the change that caused the problem (https://github.blog/changelog/2023-01-30-git-archive-checksums-may-change/) and after they reverted it (https://github.blog/2023-02-21-update-on-the-future-stability-of-source-code-archives-and-hashes/).

if it's breaking things like packages that rely on it, I would've expected some noise about this sooner.

Reasons I can think of why it was not noticed sooner:

  • There are very few projects hosted on Codeberg compared to how many are hosted elsewhere.
  • The problem only affects builds from source. MacPorts and other package management systems often provide pre-compiled binaries which most users use.
  • The problem only affects source distfile downloads from Codeberg. MacPorts and other package management systems mirror source code distfiles on their own infrastructure (for reasons including unintentional or malicious file changes on the original server). Most users who build from source with MacPorts get the files from those mirrors. Many users who build from source outside the context of a package manager (and maybe even some that do) fetch using git, not a distfile; that wouldn't be affected either.