
GitHub raw seems like the simplest possible case for cache invalidation: invalidate the cached copy of a changed file when it's pushed.

They have access to both GitHub and the raw service. I know there are usually all sorts of layers in between that make interconnectivity logistically complicated, but am I wrong that at the top level it's that simple?



That is too simple for the feature they are using. The client itself has its own cache, and the only way to fully prevent traffic from a client is to tell it that the content it caches will remain valid for some amount of time into the future.

For URLs that return the latest version of a file, there is no amount of time GitHub can know to be valid in advance, unless they want to introduce mandatory publication delays. For URLs that pin a specific change set, the content should never change again, and an effectively infinite cache lifetime is valid unless a user overrides good git practice.

I think GitHub frequently misidentifies which scenario it is in: when they return one day for a current-state URL, users notice; when they return 5 minutes for a permanent change set that gets a lot of traffic, they waste network capacity.
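
Roughly, the distinction can be read off the URL itself. A minimal sketch (the URL shape and numbers are illustrative, not GitHub's actual policy): a full 40-character hex SHA in the ref position means the content is immutable, anything else tracks a moving branch or tag.

    import re

    # Illustrative sketch: pick Cache-Control from the ref segment of a raw URL.
    # A full 40-char hex SHA pins one commit; a branch or tag name can move.
    PINNED_SHA = re.compile(r"[0-9a-f]{40}")

    def cache_control_for(ref: str) -> str:
        if PINNED_SHA.fullmatch(ref):
            # Commit-addressed content never changes: cache it indefinitely.
            return "public, max-age=31536000, immutable"
        # Moving ref: keep the window short so pushes show up quickly.
        return "public, max-age=300"

    cache_control_for("main")                                        # short max-age
    cache_control_for("a94a8fe5ccb19ba61c4c0873d391e987982fbbd3")    # immutable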


Isn’t this where things like ETags are supposed to help? Not completely solve the problem, but at least reduce it a bit more?


Yes, but cache invalidation is a hard problem, so most services side-step ETags (plus If-None-Match), only to hit caches elsewhere.


Tag it with ETag = hash, done. The client side isn't the hard part.

The server side would require pushing any invalidation through (I imagine) a whole tree of caches, which isn't exactly that hard if you plan for it from the start and have some way of the upstream telling the downstream which files changed. But, well, they probably don't, as I'd imagine they didn't expect people to pin their infrastructure to some binary blob on GitHub that mutates.
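
For what it's worth, the "upstream telling downstream" part could be as blunt as fanning a purge out per changed path on push. A rough sketch; the node list, URL layout, and the PURGE verb are assumptions (real CDNs expose their own purge APIs), not anything GitHub documents:

    import requests

    # Hypothetical downstream cache nodes; in reality this would be a CDN purge API.
    CACHE_NODES = ["https://cache-1.example", "https://cache-2.example"]

    def invalidate_on_push(repo: str, ref: str, changed_paths: list[str]) -> None:
        """Tell every downstream cache that these paths changed on this ref."""
        for node in CACHE_NODES:
            for path in changed_paths:
                # Varnish-style PURGE per URL; assumed, not GitHub's actual mechanism.
                requests.request("PURGE", f"{node}/{repo}/{ref}/{path}", timeout=5)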


Etag doesn't let you "fully prevent traffic from a client" (GP's exact words). They'll still send a request to which you need to reply with a 304 after checking the resource.
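
Concretely, the round trip still happens; the ETag only makes the answer cheap. A minimal sketch with a hypothetical in-memory store:

    # Sketch of the conditional-GET flow: the client still sends a request,
    # matching ETags just lets the server skip re-sending the body (304).
    def handle_get(path: str, if_none_match: str | None, store: dict) -> tuple:
        etag, body = store[path]              # store: path -> (blob hash, bytes)
        if if_none_match == etag:
            return 304, {"ETag": etag}, b""   # no body, but still a round trip
        return 200, {"ETag": etag}, body

    store = {"install.sh": ('"ce01362"', b"#!/bin/sh\n")}
    handle_get("install.sh", '"ce01362"', store)   # -> (304, {...}, b"")
    handle_get("install.sh", None, store)          # -> (200, {...}, b"#!/bin/sh\n")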


They wouldn’t even need to check the resource. If the hash (which you get for free, since it’s part of the git commit) were the ETag, then they could quickly reply to that request from an edge node.


You get it "for free" if you load up the repo and check some files. That's not free at all.

In fact, loading a file by name from a Git repo is rather expensive, and is definitely not the way their CDN should be keeping things in cache: you have to load the ref, uncompress and parse the commit object, then uncompress and read the tree object(s), just to get the blob's hash. Every one of those objects is deflated and may also be delta-encoded.
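
For anyone curious, git's own plumbing spells out that chain; a sketch that shells out to git (the branch and path are just examples):

    import subprocess

    def git(*args: str) -> str:
        return subprocess.run(("git", *args), capture_output=True,
                              text=True, check=True).stdout.strip()

    # The chain described above: ref -> commit -> tree(s) -> blob hash.
    commit = git("rev-parse", "refs/heads/main")       # resolve the ref
    tree = git("rev-parse", f"{commit}^{{tree}}")      # parse commit, get root tree
    # Every path component below means reading and parsing another tree object.
    blob = git("rev-parse", f"{commit}:scripts/install.sh")
    print(commit, tree, blob)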


No, no: don't do it at lookup time; set the ETag when the repo is pushed.

I'm open to the idea that it's less computation to simply hash the files than to inflate and decode the stored objects, but my point was that the hash is already calculated on the client when the changed file is added to the repo.
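
That hash is just git's content address, which either side can recompute from the raw bytes alone, with no object walking involved:

    import hashlib

    # git's blob id: SHA-1 over a "blob <size>\0" header plus the file's bytes.
    # It is computed on the client the moment the file is staged.
    def git_blob_hash(content: bytes) -> str:
        header = f"blob {len(content)}\0".encode()
        return hashlib.sha1(header + content).hexdigest()

    git_blob_hash(b"hello\n")   # 'ce013625030ba8dba906f756967f9e9ca394464a'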


> The client itself has its own cache

Let’s leave this aside, since it both has known client-side mitigations and is not the cause of the issue that was posted.


Once you are using one style of caching, it's usually a mistake to introduce another, even if the first style is only marginally effective. They very clearly have an issue with max-age on clients/CDNs in the second half of the thread, and probably have similar problems on internal transparent proxies, etc.


You could also just commit a file containing the intended commit hash, make that file the indicator for changes, and use that commit in all other requests. This has the added benefit that clients only need to fetch a tiny file if nothing has changed.
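
A sketch of that pattern from the consuming side; the pointer-file name, repo, and path are made up:

    import urllib.request

    BASE = "https://raw.githubusercontent.com/someuser/somerepo"   # illustrative

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url) as resp:
            return resp.read()

    # Tiny pointer file on the branch names the commit everything else pins to.
    commit = fetch(f"{BASE}/main/CURRENT_COMMIT").decode().strip()
    # Everything else is commit-addressed, so it can be cached indefinitely.
    blob = fetch(f"{BASE}/{commit}/scripts/install.sh")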


GitHub responded that it was simply a bug: Cache-Control was set to a day instead of 5 minutes. It's already been fixed.



