Sites like the New Yorker have had paywalls since before LLMs entered the scene, and moreover they purposely make their content freely available to the very web crawlers that are supposedly robbing them (hence why archive links work).
Yes, it surprised me how this meme was everywhere in the comments when the data doesn't support it. I'd bet it's down to splashy headlines in news outlets. It's important to correct so that policy focuses on what's most effective.
You can simply use specific training examples that teach the model what you please, e.g. a set of examples that steer the ranking/retrieval/filtering models. The models are likely already trained online, with weights updated every ~1 hour (or even less).
It'd be easy to go from a set of "moderators" who find bad examples, use those to query for related content, and feed the results back in as negative training samples. Just a guess; a rough sketch of that loop is below.
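A minimal sketch of what that moderator-to-negative-samples loop might look like. Everything here is hypothetical: the toy hashing "embedding", the corpus, and the similarity cutoff are stand-ins; a real system would use a proper embedding model and vector index, and would push the batch into its online training pipeline.

```python
"""Hypothetical sketch: moderator-flagged examples -> retrieve similar content
-> queue as negative samples for the next online update. All names are made up."""
import hashlib
import numpy as np


def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in for a real text-embedding model: hash character trigrams into a vector."""
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


# Content the ranking/filtering model currently serves (toy corpus).
corpus = [
    "miracle cure doctors don't want you to know",
    "peer-reviewed study on vaccine efficacy",
    "one weird trick to erase your debt overnight",
    "central bank announces interest rate decision",
]
corpus_vecs = np.stack([toy_embed(doc) for doc in corpus])

# Step 1: moderators flag a concrete bad example.
flagged = ["secret miracle cure they are hiding from you"]

# Step 2: use the flagged example to pull in related content (cosine similarity).
negatives = set()
for example in flagged:
    sims = corpus_vecs @ toy_embed(example)
    for idx in np.argsort(sims)[::-1][:2]:   # top-2 nearest neighbours
        if sims[idx] > 0.3:                  # arbitrary similarity cutoff
            negatives.add(corpus[idx])

# Step 3: queue the retrieved items as negative training samples for the next
# online weight update of the ranking/filtering model (here just printed).
training_batch = [{"text": doc, "label": 0} for doc in negatives]  # 0 = down-rank
print(training_batch)
```

The point of the retrieval step is leverage: one human-flagged example can down-rank a whole neighbourhood of similar content on the next hourly update, without anyone labelling each item individually.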