
I assume they're trying to keep AI bots from strip mining the whole place.

Or maybe your IP/browser is questionable.



What's being strip mined is the openness of the Internet, and AI isn't the one closing up shop. Github was created to collaborate on and share source code. The company in the best position to maximize access to free and open software is now just a dragon guarding other people's coins.

The future is a .txt file of John Carmack pointing out how efficient software used to be, locked behind a repeating WAF captcha, forever.


AI isn't the one closing up shop, it’s the one looting all the stores and taking everything that isn’t bolted down. The AI companies are bad actors that are exploiting the openness of the internet in a fashion that was obviously going to lead to this result - the purpose of these scrapers is to grab everything they can and repackage it into a commercial product which doesn’t return anything to the original source. Of course this was going to break the internet, and people have been warning about that from the first moment these jackasses started - what the hell else was the outcome of all this going to be?


This rings the same tune as the MPAA and RIAA utilizing lawfare to destroy freedom online when pirates were the ones "break[ing] the internet."

Could you help me understand the difference between your point and the arguments the MPAA and RIAA used to ruin the lives of the torrent users they concluded were "thieves"?

As a rule of thumb, do you think people who are happy with the services they contribute content to being open access and wish them to remain so should be the ones who are forced to constantly migrate to new services to keep their content free?

When AI can perfectly replicate the browsing behavior of a human being, should Github restrict viewing a git repository to those who have verified blood biometrics or had their eyes scanned by an Orb? If they make that change, will you still place blame on "jackasses"?


The moral argument in favor of piracy was that it didn’t cost the companies anything and the uses were noncommercial. Neither of those applies to the AI scrapers - they’re aggressively overusing freely-provided services (listen to some of the other folks on this thread about how the scrapers behave) and they’re doing so to create competing commercial products.

I’m not arguing you shouldn’t be annoyed by these changes, I’m arguing you should be mad at the right people. The scrapers violated the implicit contract of the open internet, and now that’s being made more explicit. GitHub’s not actually a charity, but they’ve been able to provide a free service in exchange for the good will and community that comes along with it driving enough business to cover their costs of providing that service. The scrapers have changed that math, as they did with every other site on the internet in a similar fashion. You can’t loot a store and expect them not to upgrade the locks - as the saying goes, the enemy gets a vote on your strategy, too.


There are plenty of commercial pirates, and those commercial uses were grouped in with noncommercial sharing in much the same way you are doing with scraping. Am I wrong in assuming most of this scraping comes from people utilizing AI agents for things like AI-assisted coding? If an AI agent scrapes a page at a user's request (say the 1 billionth git commit scraped today), do you consider that "loot[ing] a store"? What got looted? Is it the bandwidth? The CPU? Or does this require the assumption that the author of that commit wouldn't be excited that their work is being used?

I'd like to focus on your strongest point, which is the cost to the companies. I would love to know what that increase in cost looks like. You can install nginx on a tiny server and serve 10k rps of static content, versus maybe 50 (not 50k) rps from a typical web framework generating the same content dynamically. So this increase in cost must be weighed against how efficient the software serving that content is.
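To put a rough number behind the static case: a bare-bones nginx server block along these lines (paths and cache times are purely illustrative) is about all it takes, because nothing but nginx sits in the request path:

    # hypothetical config: serve pre-rendered pages straight off disk
    server {
        listen 80;
        root /var/www/static;                     # pre-generated HTML lives here
        location / {
            try_files $uri $uri/index.html =404;  # no app server involved
            expires 10m;                          # let clients and CDNs cache
            add_header Cache-Control "public";
        }
    }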

If this Github post included a bunch of numbers and details demonstrating how they have reached the end of the line on optimizing their web frontend, they have run out of things to cache, and the increase in costs is a real cause for concern to the company (not just a quick shave to the bottom line, not a bigger net/compute check written from Github to their owners), I'd throw my hands up with them and start rallying against the (unquestionably inefficient and borderline hostile) AI agent scrapers causing the increase in traffic.

Because they did not provide that information, I have to assume that Github and Microsoft are doing this out of pure profit motivations and have abandoned any sense of commitment to open access of software. In fact, they have much to gain from building the walls of their garden up as high as they can get away with, and I'm skeptical their increase in costs is very material at all.

I would rather support services that don't camouflage as open and free software proponents one day and as victims of a robbery the next. I still think this question is important and valid: There is tons of software on Github written by users who wish for their work to remain open access. Is that the class of software and people you believe should be shuffled around into smaller and smaller services that haven't yet abandoned the commitments that allowed them to become popular?


> There are plenty of commercial pirates, and those commercial uses were grouped in with noncommercial sharing

I don't think many people were particularly sympathetic to people making money off piracy - by and large, people were upset because people committing piracy for personal use were getting hit with the kinds of fines and legal charges usually reserved for, well, people who make money off piracy.

> Am I wrong in assuming most of this scraping comes from people utilizing AI agents for things like AI-assisted coding?

Yes. The huge increases in traffic aren't from, say, Claude going and querying Github when you ask it to - they're from the scraping that drives the initial training process. Claude and the others know the first thing about code because Github and StackOverflow were part of their training corpus, because the companies which made them scraped the whole damn site and used it as part of their training data for making a ~competing product. That's what Github's reacting to, that's what Reddit reacted to, that's what everyone's been reacting to - it's the scraping of the data for training that's leading to these reactions.

To be clear, because I think this is maybe a core of our disagreement: The problem that's leading to this isn't LLM agents acting on behalf of a user - it's not that Cursor googled python code for you - it's that the various companies training the models are aggressively scraping everything they can get their hands on. It's not one request for one repo on behalf of one user, it's the wholesale scraping of everything on the site by a rival company to make a rival product, most likely in violation of terms of service and certainly in violation of anything that anyone could reasonably assume another corporate entity would stand for. Github's not mad at you, they're mad at OpenAI.

> There is tons of software on Github written by users who wish for their work to remain open access. Is that the class of software and people you believe should be shuffled around into smaller and smaller services that haven't yet abandoned the commitments that allowed them to become popular?

You store your money in a bank. The bank gets robbed repeatedly by an organized group of serial bank robbers, and increases security at the branch. You move your money to another bank, because the increased security annoys you. You understand the problem here may repeat itself elsewhere as well, right?


>You understand the problem here may repeat itself elsewhere as well, right?

I do, and how is this to cap off our discussion:

>You move your money to another bank, because the increased security annoys you.

On my way out, I would quote Benjamin Franklin: "Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety."

I go out of my way to help my community, and I expect those I support to do the same. That bank could put more resources into investigating/catching the robbers who are attacking not just the bank but the security and liberty of their community, or it could treat every customer with more suspicion than they did yesterday. I know where the latter option leads, and I won't stand for it.

You're right that it repeats itself elsewhere. Often, I find.


> Could you help me understand what the difference is

Well the main difference is that this is being used to justify blocking and not demanding thousands of dollars.

> When AI can perfectly replicate the browsing behavior of a human being

They're still being jackasses because I'm willing to pay to give free service to X humans but not 20X bots pretending to be humans.


Free and open source software is on GitHub, but AI- and other crawlers do not respect the licenses. As someone who writes a lot of code under specific FOSS licenses, I welcome any change that makes it harder for machines to take my code and just steal it


I encountered this on GitHub last week. Very aggressive rate limiting. My browser and IP are very ordinary.

Since Microsoft is struggling to make ends meet, maybe they could throw up a captcha or a proof-of-work challenge like Anubis by Xe Iaso.
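Anubis-style proof of work is basically making the client burn a little CPU before it gets the page. A toy sketch of the idea (not Anubis's actual protocol, just the general shape of a hash-based challenge):

    # toy proof-of-work check: the server hands out a random challenge and the
    # client must find a nonce whose hash has enough leading zeros; the server
    # only has to re-hash once to verify. The difficulty value is arbitrary.
    import hashlib
    import itertools
    import os

    DIFFICULTY = 5  # leading hex zeros required; higher = more client CPU burned

    def make_challenge() -> str:
        return os.urandom(16).hex()

    def verify(challenge: str, nonce: int) -> bool:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).hexdigest()
        return digest.startswith("0" * DIFFICULTY)

    def solve(challenge: str) -> int:
        # this is the part the visitor's browser would grind through in JS
        for nonce in itertools.count():
            if verify(challenge, nonce):
                return nonce

    challenge = make_challenge()
    nonce = solve(challenge)
    print(challenge, nonce, verify(challenge, nonce))

A real deployment would remember the solved challenge (e.g. in a signed cookie) so the cost is paid once per visitor rather than per request - negligible for a human, expensive at scraper volumes.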

They already disabled code search for unauthenticated users. It's totally plausible they will disable code browsing as well.


That hit me, too. I thought it was an accidental bug and didn’t realize it was actually malice.


Just sign in if it's an issue for your usage.


My usage isn't high. I was rate limited to like 5 requests per minute. It was a repo with several small files.

And seriously, if they keep this up, limiting their web interface but leaving unauthenticated cloning allowed, I'd rather clone the repo than log in.

GitHub code browsing has gone south since Microsoft bought them anyway. A simple proxy that clones a repo and serves it would solve both the rate limits and their awful UX.
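A rough sketch of that proxy idea, assuming unauthenticated cloning stays open (repo URL, path, and port are placeholders): shallow-clone once over plain git, then serve the checkout as static files:

    # clone-and-browse sketch: one git fetch instead of hundreds of web requests
    import functools
    import http.server
    import subprocess

    REPO = "https://github.com/example/project.git"  # placeholder public repo
    DEST = "/tmp/mirror"

    subprocess.run(["git", "clone", "--depth", "1", REPO, DEST], check=True)

    Handler = functools.partial(http.server.SimpleHTTPRequestHandler, directory=DEST)
    with http.server.ThreadingHTTPServer(("127.0.0.1", 8080), Handler) as srv:
        print("browse the checkout at http://127.0.0.1:8080/")
        srv.serve_forever()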


> Or maybe your IP/browser is questionable.

I'm using Firefox and Brave on Linux from a residential internet provider in Europe and the 429 error triggers consistently on both browsers. Not sure I would call my setup questionable given their target audience.


I’m browsing from an iPhone in Europe right now and can browse source code just fine without being logged in.


Then it means they're looking at the User-Agent string and determining that an iPhone in Europe most likely has a human using it, and might not require rate-limiting.


*other AI bots; MS will obviously mine anything on there.

Personally, I like sourcehut (sr.ht)


Same way Reddit sells all its content to Google, then stops everyone else from getting it. Same way Stack Overflow sells all its content to Google, then stops everyone else from getting it.

(Joke's on Reddit, though, because Reddit content has become pretty worthless since the deal, and everything from before it was already publicly archived)


Other bots or MS bots too?



