Serving file content/diff requests from gitea/forgejo is quite expensive computationally. And these bots tend to tarpit themselves when they come across e.g. a Linux repo mirror.
> Serving file content/diff requests from gitea/forgejo is quite expensive computationally
One time, sure. But unauthenticated requests would surely be cached, while authenticated ones skip the cache (just like HN works :) ); most internet-facing websites end up using this pattern.
There are _lots_ of objects in a large git repository. E.g., I happen to have a fork of VLC lying around. VLC has 70k+ commits (as of that fork). Each commit has about 10k files. The typical AI crawler wants, for every commit, to download every file (so 700M objects), every tarball (70k+ .tar.gz files), and the blame view of every file (another 700M objects, where blame has to look back through 35k commits on average). Plus some more.
Saying “just cache this” is not sustainable. And this is only one repository; the only reasonable way to deal with this is some sort of traffic mitigation. You cannot just treat this traffic as the happy path.
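A quick back-of-envelope sketch of those numbers (the constants are the rough figures from the comment above, not measurements):

```python
# Rough back-of-envelope for the VLC example above.
commits = 70_000           # commits in the repository (approx.)
files_per_commit = 10_000  # files visible at a typical commit (approx.)

file_views = commits * files_per_commit   # one page per (commit, file) pair
tarballs = commits                        # one snapshot archive per commit
blame_views = commits * files_per_commit  # one blame page per (commit, file) pair

print(f"file views:  {file_views:,}")     # 700,000,000
print(f"tarballs:    {tarballs:,}")       # 70,000
print(f"blame views: {blame_views:,}")    # 700,000,000
```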
You can't feasibly cache large repositories' diffs/content-at-version without reimplementing a significant part of git - this stuff is extremely high-cardinality and you'd just constantly thrash the cache the moment someone does a BFS/DFS through available links (as these bots tend to do).
https://social.hackerspace.pl/@q3k/114358881508370524
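A toy illustration of the thrashing (purely a sketch; the key-space and cache sizes are assumptions based on the VLC numbers above, and a real BFS crawler is even worse since it visits each URL exactly once, i.e. zero repeats):

```python
# Why an HTTP cache barely helps: the crawler's (commit, path) keys are
# almost all distinct, so even a large LRU cache rarely sees a repeat.
from collections import OrderedDict
import random

CACHE_SIZE = 1_000_000    # entries the cache can hold (assumed)
KEY_SPACE = 700_000_000   # distinct (commit, file) pages, per the estimate above
REQUESTS = 5_000_000      # requests we simulate

cache, hits = OrderedDict(), 0
for _ in range(REQUESTS):
    key = random.randrange(KEY_SPACE)  # crawler requests some page
    if key in cache:
        hits += 1
        cache.move_to_end(key)         # mark as recently used
    else:
        cache[key] = True
        if len(cache) > CACHE_SIZE:
            cache.popitem(last=False)  # evict least-recently-used entry

print(f"hit rate: {hits / REQUESTS:.4%}")  # ~0.14%: nearly every request misses
```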