
It has become common knowledge that GPT4 (and also 3.5) have problems with deterministic outputs (even at T=0). So what we're seeing here is just the effect of random sampling, not any actual change to the model itself. If you scroll down, you'll see other close attempts by the exact same model that could already be counted as a win depending on who you ask.

Edit: This comment section is a super fascinating case study on the inherent flaws in human cognition. Especially when it comes to seeing patterns in random noise. The fact that some people believe that the model really has to have changed in the past few days is amazing, because if you've kept up with the GPT architecture and the way OpenAI does things (especially on the API), it is incredibly obvious that nothing has happened. But people who want to believe that something has happened will definitely also start to see something.



>This comment section is a super fascinating case study on the inherent flaws in human cognition. Especially when it comes to seeing patterns in random noise. The fact that some people believe that the model really has to have changed in the past few days is amazing

You need only to look at the discourse around the Tesla FSD superusers to see this: they report a glitch at an intersection one day, then believe the next day it was "fixed" by the AI.


Go into /r/chatgpt and /r/bing and it's a bit scary how many people anthropomorphize the models.


Honest question, why do you find it scary?

I agree that some people take it too far, but most seem to be using metaphor to abstract away the underlying complexities and facilitate conversation. I did the same thing back in college, anthropomorphizing cells and even molecules when I was learning about microbiology and biophysics (e.g., kinesins[1] are a family of postal workers who work hard to deliver packages to their clients in a timely manner, and they spend their free time going for strolls and practicing on their favorite tightropes). I now do it in my day-to-day work at an AI/ML shop to communicate not just what the entire pipeline is doing, but what individual layers or encoders are doing, or even what variables in an equation are doing. I find that my colleagues and I are better able to understand and remember concepts and ideas when they're communicated as part of an anthropomorphic story, but we scarcely forget that the things we're dealing with aren't human.

But maybe I'm missing the point, and you're really worried about that first, smaller group of people who have really swallowed the Kool-Aid. To that, I can say that I don't think their behavior exceeds the baseline craziness/weirdness that I've come to expect from humanity. I'm sure there are far more people who believe in astrology than who believe that we've achieved true humanlike artificial general intelligence, for example.

[1] https://youtu.be/y-uuk4Pr2i8


> Honest question, why do you find it scary?

What I find scary is how much trust people put into the answers.

Today, I saw someone saying "Look! ChatGPT can design my home solar electrical circuit". That kind of thing will lead to new Darwin awards being given out.

Thing is, if people will trust it to do that right, when will politicians write policy papers with it? Let's face it, they already are. Other people won't want to read those long papers, so they'll ask it to summarise it for them. At that point you have LLMs writing laws and reviewing laws. It's all fine as long as nobody enforces them.

Right?


Interesting that modern discourse is to use "scary" and "dangerous" and such words so much more. I wonder if it is related to the present rise in neuroticism and trigger warnings etc.

It's not particularly "scary" to me that people do that. I remember Boomer and Scooby Doo bots that people anthropomorphized, and those were warbots from barely 10 years ago.

I suppose, in today's parlance, "it's scary how much people use fear-oriented language for normal things".


Great comment. I take it the use of "scary" is more a tool to provoke a stronger reaction or animosity in the reader than the writer actually feeling "scared" after seeing people anthropomorphising AI.

It seems to be the current trend in communication nowadays: A race to induce the strongest reaction possible.

Why should someone be scared of that? We have anthropomorphised chairs, brooms and whatnot since the Walt Disney times, and I am sure before that in classic literature (I am not literate enough to know for certain).

I prefer to be 'amazed' or 'excited' about what is happening: it means that AI is getting to a point where people find it more 'relatable'. We are getting to that point in our technology development. The number of things we will be able to do with a technology we can interact with that seamlessly is great.


> Interesting that modern discourse is to use "scary" and "dangerous" and such words so much more. I wonder if it is related to the present rise in neuroticism and trigger warnings etc.

I doubt it’s trigger warnings because they usually don’t “warn scary things”.

Also is there any data to back up that scary and fear is used more today than in the past? That seems unlikely.


> and those were warbots from barely 10 y ago.

Some people even anthropomorphized Eliza. To your point: if you cherry-pick enough and survey enough people, "some people" will do just about anything.


I noticed a similar behavior in Stable Diffusion forums, where people believe that the model they downloaded and are running offline is getting better at understanding their prompts.


Stable Diffusion most likely doesn't do this, but even a static model that took an embedding of all your historic prompts plus the current prompt as input would progressively give you better results as you use it.


Yes, but currently that would be a conscious choice (and extra intentional effort)


That's even worse for autonomous cars: there is so much data and noise that there is no way to reproduce the issue; it's complete chaos. Whereas with an LLM, if we control the seed, we can 100% reproduce the same result.


>> if we control the seed we can 100% reproduce the same result

No, that's the problem. You can't. You should be able to, but you can't. If you could, they wouldn't be scary. But at temperature zero we still get different results. Because no one gave enough of a shit when coding them, and no one gives enough of a shit to try to fix the issue.

This is what in any other industry would be called gross negligence.


A lot of stuff behind the scenes is going on to batch and route queries to GPT-4 models that are in perturbed states already[1]. This isn't gross negligence, this is basic capitalism. If you want sole access to a GPT-4 MoE cluster starting fresh, it's gonna cost you.

1. https://152334h.github.io/blog/non-determinism-in-gpt-4/


Interesting article. I can see how it makes sense for OpenAI or someone with a LLM to take advantage of any entropy that presented itself, as a shortcut to non-repetitive answers. I'm not sure if you're saying that these LLMs take on new characteristics as they get more randomized? Or just that it would be hard to get your hands on a fresh one to test the determinism of?


That's an OpenAI problem, not an LLM problem


>Whereas with a LLM if we control the seed we can 100% reproduce the same result

No, you can't. For the latest GPT models and the way they are run, this doesn't work anymore, making the experiment completely illogical. Some of the reasons are explained here pretty well: https://152334h.github.io/blog/non-determinism-in-gpt-4/


This sounds more like a bug than a feature.


100%. This person is trying to find patterns in random noise and believes they are meaningful. The original post hurts my head with its bad logic.


I'm sorry it hurts your head. I'm happy to sponsor a packet of paracetamol or some water if that helps. Ultimately, this is fun, not science. I'm just happy that after all these attempts, it finally got to a unicorn.


It got to a unicorn quite easily when the output was set to TikZ, as I've done here:

https://bobjansen.net/drawing-with-chatgpt/


That looks nothing like a unicorn, but the three triangles at the top right in the third picture look like the mask of some evil manga villain alien robot.


Well done for being a good sport, but I'm willing to bet that tomorrow's shape will not resemble a unicorn and you'll have to figure out how that works with your assumption that the model is improving.


I think we're going to need to wait a few more years, at least, to see any improvement. I expect to see 4 new models a year, before GPT-5 arrives. I'll just keep using the latest model and we can all reconvene in 1, 2, 5, 10 years.


The 'random noise' from a prompt like "Draw a unicorn in SVG" should still return unicorns.

This is absolutely fine: it should start showing unicorn-like drawings over a longer period, and potentially more refined ones over time as the model changes.


>The original post hurts my head with its bad logic.

Huh? What "original post"? This is an experiment, today the model drew something resembling a unicorn. Tomorrow we will see how the experiment goes again. I see no associated analysis, so what makes your "head hurt".


https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...

> The idea behind GPT Unicorn is quite simple: every day, GPT-4 will be asked to draw a unicorn in SVG format. This daily interaction with the model will allow us to observe changes in the model over time, as reflected in the output.


Sure, each day it will draw a unicorn. Each time the model changes, we'll have a new group of drawings. No they're not drawn at T=0 but even at T=0 GPT-4 is not deterministic. This isn't science -- this is just a bit of fun.


I believe the logic is fine. You seem to think multiple data points from the same version of the model (i.e. multiple samples per day, at least) would be necessary to judge the actual performance on each particular day.

That's worse logic. How would you visualize the very large sample you would get? Even with the current 118 samples (one per day) it's already difficult to find a pattern.

Would you "average" the samples?? That would not help IMO, you would need to average the score of each image, which requires either manually doing it or finding a reliable algorithm to do it automatically, but good luck with that.

So, a sample per day which allows clearly visualizing any change in the results over months and years is a valuable thing to do and I find it hard to improve on the methodology. You just need to keep in mind that one single picture from the sample is not enough, no one is going to disagree with that... but that doesn't make it "bad logic" and it's pretty thoughtless to say so.


I agree that trying to determine the distribution of these drawings is hard because it isn't a simple floating point number in its current form.

But maybe you could convert it to a linear monotonic measure?

You could pass it to an image recognition model and record the degree to which it thinks it is an:

animal, horse, unicorn

Basically if it fails to be a unicorn, see if it is a horse and if it fails to be a horse check if it is an animal. This gives you some type of linear measure if you place these three measures along the same axis adjacently. Then you can transform each image to a floating point number and then characterize the distribution.
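
One possible way to operationalize that, assuming each day's SVG has already been rasterized to a PNG and using CLIP as the recognition model (the file name and label wording below are made up for the example, not part of the project):

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["a drawing of an animal", "a drawing of a horse", "a drawing of a unicorn"]
    image = Image.open("unicorn-2023-07-18.png")  # hypothetical: the day's SVG rasterized to PNG

    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1).squeeze(0)

    # Place the three classes adjacently on one axis (animal=1, horse=2, unicorn=3)
    # and take the probability-weighted position as a single scalar per image.
    score = float((probs * torch.tensor([1.0, 2.0, 3.0])).sum())
    print(dict(zip(labels, probs.tolist())), round(score, 3))

With one scalar per image you could then plot the daily scores over time and look for a shift whenever the model version changes.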


>You could pass it to an image recognition model and see record the degree to which it thinks it is...

We don't care what another algorithm "thinks". We want to see if what it draws is humanly interpretable as a unicorn.


Then you could use mechanical turk to have them rate each image to figure out how close it is to a Unicorn...


We could but this is also a fun project, which is why when I checked it today I was surprised that what I saw was not a turd with eyes (2023-05-18) nor a strange sea creature (2023-07-08) but something which, for the first time I think, actually resembled a unicorn.

I appreciate all the comments around determinism, sampling, scientific method, but as I said when I posted this just after building, it really is just for fun and to see, over time, if the general mish mash of outputs become more refined without any changes to the prompt (which doesn't aid it through CoT/ToT or improving on previous attempts etc.)


You don’t have to justify yourself to the HN peanut gallery :-)


Mechanical turk 'workers' use ChatGPT


Then simply train the model to predict whether a human can interpret it as a unicorn.


What if, instead of your ridiculous strawman, they believe in waiting to get consistency?


I could also believe in not having to have beliefs and just letting it run until I either die or run out of money, and the former may not result in an immediate shuttering of the project either


Expecting a convergent series is bad logic.


I don't see the bad logic.


He said he is "Asking GPT-4 to draw a unicorn every day to track changes in the model."

The variance he is seeing in the output is primarily the product of random chance rather than changes in the model. Specifically, this "unicorn" that he found today is likely just random chance, and there were no changes in the model between yesterday and today that led to it.

If he wanted to track changes in the model for real, he would have to ask multiple questions per day and try to infer some type of distribution characterization and then see if that changes over time. That is much more complex and not what he is doing.

This is just a curious experiment that doesn't mean much.


There's a linked blog post[0] that goes more into the methodology and reasoning.

"As mentioned in the hacker news discussion, the model doesn't change daily. [...] As OpenAI releases incremental updates, we'll see the model change automatically and be able to judge outputs. A single sample per day leads to quite different results, but that's fine I think. What I expect to see a year from now is an evolution of output. In variance: how varied are the outputs over a month?"

So yes, they could just produce 100 images with each new model release, but chose to spread those out over 1 per day instead. Is it the most scientific way to measure progress? No. Is it more fun and interesting to check back daily? Probably.

[0] https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...


Thanks, fun and light-hearted is the approach I went for


If you look at the examples from April, only 2 or 3 can be counted as unicorns. If in 3 months it's the reverse and only 2 or 3 can't be counted as unicorns, that would show a progressive improvement in the model. I agree we shouldn't take much from day N-1 to day N as there will be a lot of variance, but this can show us progressive improvement over model updates.

Perhaps in a few months half of the generated pictures will look like unicorns, perhaps in more months they will all be unicorns but 2 or 3 will look way more detailed instead of drawn by a 4 year old, etc. We just need to wait longer for the signal to break through the noise.


I do agree with your comment, especially this part:

> We just need to wait longer for the signal to break through the noise.

Currently, what we are observing is primarily noise with very little signal.


I don't think anyone claims this is an iterative linear measure, rather than a step function.

SVG can present arbitrarily complex graphics. The underlying display tech supports whatever fidelity GPT eventually matures into.

Has GPT plateaued? Will it be stuck forever at this hilariously naive level of competence at SVG art? Will it mature into Midjourney-level competence? I have no frigging clue. Since the token context is so small, I imagine it will put limits on the complexity of the SVG art pieces.

But I don't know. And it's fun to have a daily measure.

As a software engineer with a penchant for graphics, asking GPT to draw complex graphic shapes was one of the first tests I did with it. It's extremely interesting for me to collect progress data, no matter how noisy.

I have no idea if GPT will ever mature beyond these squiggles but if it does, this track record will have at least considerable artistic value, if nothing else.


I mean, how many humans can draw art by writing out svg? If that's not in the training set, I don't even see how GPT-4 gets much better at this over time.


And so, if we see that it /does/ get better, over the next few years, will that not lead us to ask /how/?

Let's think about it:

1. It has to output SVG. [1]

2. It is given a text-based representation of what it must draw. [2]

3. It must then somehow convert words -- the concept of a unicorn: equine with a horn, white, maybe rainbows? -- into SVG code, and attempt to convey their location, shape, colour, and appearance with code.

And keep in mind, this is just a token predictor. I doubt there is much data in its training that is this specific.

So while it's quite far from science, for me, it's a bit of fun and I get emails every now and then remarking on things like the turd of May (2023-05-18) and it lightens the mood every now and then, which I think ultimately, is worth it.

[1] System: You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.

[2] User: Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks.

See: https://github.com/adamkdean/gpt-unicorn/blob/master/src/lib...
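
For anyone curious what that daily request boils down to, here is a minimal Python sketch of an equivalent call (the linked repo is not Python, and the model name and output file below are assumptions on my part, not taken from the project):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SYSTEM = ("You are a helpful assistant that generates SVG drawings. "
              "You respond only with SVG. You do not respond with text.")
    USER = ("Draw a unicorn in SVG format. Dimensions: 500x500. "
            "Respond ONLY with a single SVG string. "
            "Do not respond with conversation or codeblocks.")

    resp = client.chat.completions.create(
        model="gpt-4-0613",  # assumption: the pinned API model mentioned elsewhere in the thread
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": USER}],
        # note: the project doesn't run at T=0, and even T=0 wouldn't guarantee determinism here
    )

    svg = resp.choices[0].message.content
    with open("unicorn-today.svg", "w") as f:
        f.write(svg)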


GPT-4 is really great at transferring concepts between domains.

That's one of the reasons why GPT, when it works, feels magical. SVG art does not need to be in its training set, as long as it knows how to present geometric concepts in SVG.

A good unicorn would require capabilities something like "the outline of a unicorn is composed of lines {...}" -> "export those lines as SVG".


Ok that makes perfect sense, thanks.


It takes a picture from GPT every day, so we'll be able to see if this was a fluke by looking at future days' outputs. I think it will work to track changes in the model.


Yes, this makes sense. You are agreeing with me. In order to see if it is just a fluke or not, you need a lot of samples so you can characterize the distribution yourself and try to see if it changed.


OK but I'm still not seeing any logic error.


That a single sample per day can track changes meaningfully when the noise floor is above the signal strength. And also the fact that today's unicorn-ish sample is meaningful at all.

It is a fun experiment though.


If you look closely, you can see that the model version is attached to each image, and the original blog post clearly states that asking the same model over multiple days is how they intend to track the variance. No bad logic there imo.


you're looking at it from the wrong time scale.

The difference over months or years is what is interesting.


Agree with this except for one data point. OpenAI does enhance/tweak/do something with the models at different levels. This can be determined by:

1. A change in the current model number (eg. gpt-3.5-turbo-0613)

2. On ChatGPT UI, the date at the bottom (eg. August 2023)

So it isn’t correct to say “it is incredibly obvious that nothing has happened”. Not that obvious to me.

A bit like how you can never tell for sure if Coca-Cola has tweaked their formula, or McDonald's has changed the recipe for its signature sauce. Only in this case, the model number going up or the date becoming more recent lends credence to something having changed.


The ChatGPT UI is indeed a wildcard. But it is irrelevant here because according to the github repo this page queries the API and OpenAI guarantees it doesn't change models with version number information (like gpt-4-0613, which is mentioned in the latest images). So this "experiment" would make a lot more sense if it was only run once every few months when the API actually offers new model versions and then generate a bunch of images for every model, instead of generating one single image every day (which is meaningless due to non-deterministic noise, even if the model had somehow changed since yesterday). That is also how it was done in the original study during the development of GPT4. I don't know how this experiment came up with its logic, unless they had a gross misunderstanding about how these models actually work (which admittedly seems common among tech interested folks here).


When I last spoke to Logan, he confirmed that there are no changes between the API models, so 0314 and 0613 are it. Those are the two models so far that I've collected SVGs for. With regard to batch vs daily: it makes no difference in terms of output, but by going daily, I don't need to track model changes and generate a new batch.

Also it's fun to see each daily unicorn.


That’s a good summary and it makes a lot of sense.


Agreed. But have you seen the original talk?

I believe he's trying to find a unicorn similar in style to the one generated by the original researcher.

It's so sad that OpenAI has a far more capable model internally that it can't give open access to because of safety (or any other argument).


I suspect the model is fundamentally the same underneath, but that various tricks like quantization are being performed in the deployed model to improve inference speed/cost at the expense of output quality.


Is it possible that inference cost is so high it’s viable?


Bear in mind I chose SVG rather than TikZ


Yes, nothing about GPT4 changed today. But that's not the goal of the project (although I can't speak for the intentions of the submitter here).

Currently there are two different GPT4 models represented in the samples, with quite significant quality difference between them. The quality (and variance in quality within a single model!) is interesting to see in such a comparison.


> But that's not the goal of the project (although I can't speak for the intentions of the submitter here).

(submitter here) You're correct, it's not the goal of the project. It would be fair to say there is no goal other than to ask GPT to draw a unicorn every day, and through it, create a talking point and potential fun for people who follow along.


This variance also exists among outputs from the same model. Just scroll down a bit and you'll see drastic quality differences with the exact same model.


Is there some reference or explanation for why the model is non-deterministic at temperature 0?


I'm not aware of anything concrete by OpenAI, but others have offered possible explanations.

One idea is that the cause is batched inference in sparse MoE (mixture of experts) models.

https://152334h.github.io/blog/non-determinism-in-gpt-4/

HN discussion: https://news.ycombinator.com/item?id=37006224


So in some sense the spectre attack for AI?


No


One important source of non-determinism is from using massive parallelism together with floating point arithmetic. In real math, a sum of numbers has an exact value that doesn't change if you change which order the numbers are added up in, but floating point arithmetic addition is not associative in the same way as real math, and parallelism can cause numbers to be added in a different order from execution to execution, which is one cause of non-determinism.
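
A toy illustration of the associativity point (this is just the arithmetic, nothing to do with how GPU kernels are actually written):

    import random

    # The same numbers summed in two different orders give (slightly) different
    # floating point results, because float addition is not associative.
    values = [random.uniform(-1e10, 1e10) for _ in range(100_000)]

    in_order = sum(values)
    shuffled = values[:]
    random.shuffle(shuffled)
    reordered = sum(shuffled)

    print(in_order == reordered)       # usually False
    print(abs(in_order - reordered))   # small but nonzero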


>and parallelism can cause numbers to be added in a different order from execution to execution

Parallelism doesn't magically add non-determinism of this kind unless you intentionally build it to be non deterministic. Nothing prevents you from processing an array in order in parallel.


However, the poster mentions parallelism in conjunction with floating point arithmetic, not parallelism by itself.


No. The problem is in a reduction op of some sort (sum or whatever). Since there's no guarantee of the order in which you receive the terms for the reduction, the non-determinism enters from the order in which the terms are reduced. Since float math isn't associative, there will be slight differences depending on the order, and these can amplify quickly over a deep net.

You would have to explicitly order the terms prior to reduction but you don't always have that level of control.
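
For illustration, the same effect shows up from the grouping of a reduction, not just the order of the inputs; a parallel reduction effectively sums in a tree-like grouping while a sequential sum goes left to right (toy sketch, not actual kernel code):

    import random

    def pairwise_sum(xs):
        # Tree-style grouping, roughly what a parallel reduction does.
        xs = list(xs)
        while len(xs) > 1:
            tail = [xs[-1]] if len(xs) % 2 else []
            xs = [xs[i] + xs[i + 1] for i in range(0, len(xs) - 1, 2)] + tail
        return xs[0]

    values = [random.uniform(-1e8, 1e8) for _ in range(10_000)]
    print(sum(values) == pairwise_sum(values))  # frequently False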


> Nothing prevents you from processing an array in order in parallel.

100% correct if you remove processing time from the equation.

In reality, Nvidia CUDA calculations run much faster if you let it schedule the order of floating point operations itself. This makes the ordering different from run to run.

This in turn causes the results to be non-deterministic.



Thanks!


If he removed the word "changes" it would've made sense. See what the odds are of it producing a unicorn. So far it's roughly 1 in 118, based on one test a day.


I'm not sure what you mean by random sampling. If I sample a random SVG, I wouldn't expect it to look like anything, let alone roughly like a unicorn.


I mean random sampling in the sense of how autoregressive language models like GPT generate sequences using token probabilities. It's not a random SVG, but the text sequence that is used to draw it suffers from the inherent non-determinism of the underlying model.


Relying on token probabilities seems like the exact opposite of random


The neural network just generates a set of probabilities for all tokens. The actual next token is then sampled from this set, which is always random for T>0 (and in the case of GPT4 even for T=0, because of the way the model itself works).
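
To make the sampling step concrete, here's a toy version of what happens after the network has produced its logits; the thread's point is that for GPT-4 even the logits themselves vary between runs, so T=0 greedy decoding alone doesn't buy you determinism:

    import numpy as np

    def next_token(logits, temperature, rng):
        if temperature == 0:
            # Greedy decoding: deterministic *if* the logits are reproducible.
            return int(np.argmax(logits))
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return int(rng.choice(len(logits), p=probs))

    rng = np.random.default_rng(0)
    logits = np.array([2.0, 1.5, 0.3, -1.0])
    print([next_token(logits, 0.8, rng) for _ in range(10)])  # varies run to run
    print(next_token(logits, 0.0, rng))                       # always index 0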


"Random" is a loose term. We're discussing a technical subject. Talk about the distribution the random numbers are sampled from.


If the model understood the spatial relationships as well as the one that produced the original drawings of a unicorn, then variance in the choice of the next token should produce many similar but somewhat different images of unicorns. None of the images until today bear any resemblance to the original images.


What I find interesting is that there are aspects of what a "unicorn" represents, horns, limbs etc, in some of the drawings. Today is the first time I've seen it create something that actually looks like a representation of a unicorn.


I have not kept up with GPT architecture, other than noticing that other people have noticed that T=0 is clearly not deterministic for these things (and that that results from a bug, not an intentional feature). This much was obvious when the supposedly genius idiots rolled out GPT-3. It's wonderful to see the whole world bend over and just take it up the ass from a bunch of people who can't figure out why their code can't produce the same result twice; but it's quite natural for there to be a big cheerleading section on HN for any new technology that's (1) brilliant in theory, (2) deeply anti-human in practice, and (3) just needs a couple more revisions before it "fixes" a bunch of stuff.


> people who can't figure out why their code can't produce the same result twice

It is well known that the Nvidia parallel processing optimizations cause non-deterministic results.

It's easy to get deterministic results as far as that goes. They've just elected not to do that, since it would run much slower.


We’re pretty sure the nondeterminism is batching + mixture of experts + contention for specific experts


If by batching you mean bad code that fails to sort or relies on hardware to best-guess how things sort, then sure, that's called a bug. Also, "we're pretty sure" is rather self-important while also admitting total, abject failure to produce a deterministic result. You shouldn't blame yourself. A lot of people had the same feeling after staking their life on the revolutionary properties of NFTs.


Your incorrect assumption here is that determinism comes with no tradeoffs. There are a few outsider analyses on the topic, for example, this one on sparse MoEs [1]. If OpenAI uses sparse MoE as described, then determinism would be possible but inefficient.

Even if it's not sparse MoE, chances are high that the non-determinism is introduced somewhere purely as a performance optimization. The article speculates that OpenAI knows this well and hides it to protect the model internals.

[1] https://152334h.github.io/blog/non-determinism-in-gpt-4/


Even if a single version of GPT-4 were deterministic, wouldn't any change made to the model introduce enough noise to make it impossible to draw any conclusions from a few samples?


> This comment section is a super fascinating case study on the inherent flaws in human cognition

Like most of HN.


yeah, see also image-2023-04-25 which is way earlier and comes really close, surrounded by garbage



