I'm confused as to why this would see any improvement over time. Looking at the code, it's hitting the gpt-3.5-turbo API by default. Maybe I'm misremembering, but I thought I'd seen statements from people working at OpenAI claiming that the API is static, and that we'd be informed of any changes to the underlying model. Is the model actually receiving updates?
edit: Looking at previous days, too, it doesn't exactly seem to be improving. I think we just got a lucky sampling.
Yes, the models are officially updated roughly every three months, with notice; you can keep using the previous version for a while until it is decommissioned.
Some people claim there are also unannounced changes, but I can't vouch for that.
The daily variation is likely due to temperature, which is used to make responses less repetitive.
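To illustrate what temperature does, here's a minimal sketch of temperature-scaled softmax sampling. This is the general mechanism behind the API's `temperature` parameter, not OpenAI's actual implementation; the logits and function name are made up for the example:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng=random):
    """Sample a token index from raw logits after temperature scaling.

    Higher temperature flattens the distribution, so repeated runs
    produce more varied picks; temperature near 0 approaches a
    greedy, effectively deterministic choice of the top logit.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

# At very low temperature the top logit (index 1) wins essentially always;
# at high temperature the other indices start appearing across runs.
print(sample_with_temperature([1.0, 5.0, 2.0], temperature=0.01))
```

So even if the underlying weights never change, two requests on different days can produce noticeably different outputs purely from this sampling step.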
Wasn't there a study recently that tracked the performance of GPT over time and found a significant drop in quality? Did those drops occur at official model changes, or at other times (i.e. unannounced changes for safety or cost reduction)?
I mean, if I were OpenAI, I probably wouldn't make an announcement like "we've just quantized the model and increased our profit margins significantly! The only change on your end will be a slightly dumber model. (Don't worry! Most users won't even notice!)"
This one [1]? That tracks two distinct versions (0613 vs. 0314).
Also, IMO, the tasks they evaluate aren't useful (I rarely want my LLM to tell me whether 17077 is a prime number), and there's room for cherry-picking/survivorship bias. My guess is that OpenAI did something between 0314 and 0613 that shifted focus away from maths to other subjects.
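For what it's worth, the task in question is trivially verifiable without an LLM, which is part of why it feels like an odd benchmark. A few lines of trial division settle it:

```python
def is_prime(n: int) -> bool:
    """Deterministic primality check by trial division up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2  # only odd candidates need checking
    return True

print(is_prime(17077))  # the number from the study; it is prime
```

A task with a one-line ground-truth checker is easy to grade at scale, which is presumably why the study's authors picked it, but it says little about the kinds of open-ended queries most users actually send.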
The site linked in the OP is interesting because it takes a picture from GPT every day, so we can see for ourselves whether there is any difference over time. We can come back tomorrow and see what it has produced. If it produces random squiggly lines again, we might assume that today's success was just a fluke.
According to the author's blog post [1], the idea was that it "will use the latest gpt-4 model made available". Not sure whether the code is out of date or this was changed in the meantime...