GPT Unicorn has drawn a unicorn (adamkdean.co.uk)
294 points by imdsm on Aug 10, 2023 | hide | past | favorite | 203 comments


It has become common knowledge that GPT4 (and also 3.5) have problems with deterministic outputs (even at T=0). So what we're seeing here is just the effect of random sampling, not any actual change to the model itself. If you scroll down, you'll see other close attempts by the exact same model that could already be counted as a win depending on who you ask.

Edit: This comment section is a super fascinating case study on the inherent flaws in human cognition. Especially when it comes to seeing patterns in random noise. The fact that some people believe that the model really has to have changed in the past few days is amazing, because if you've kept up with the GPT architecture and the way OpenAI does things (especially on the API), it is incredibly obvious that nothing has happened. But people who want to believe that something has happened will definitely also start to see something.


>This comment section is a super fascinating case study on the inherent flaws in human cognition. Especially when it comes to seeing patterns in random noise. The fact that some people believe that the model really has to have changed in the past few days is amazing

You need only look at the discourse around the Tesla FSD superusers to see this: they report a glitch at an intersection one day, then believe the next day it was "fixed" by the AI.


Go into /r/chatgpt and /r/bing and it's a bit scary how many people anthropomorphize the models.


Honest question, why do you find it scary?

I agree that some people take it too far, but most seem to be using metaphor to abstract away the underlying complexities and facilitate conversation. I did the same thing back in college, anthropomorphizing cells and even molecules when I was learning about microbiology and biophysics (e.g., kinesins[1] are a family of postal workers who work hard to deliver packages to their clients in a timely manner, and they spend their free time going for strolls and practicing on their favorite tightropes). I now do it in my day-to-day work at an AI/ML shop to communicate not just what the entire pipeline is doing, but what individual layers or encoders are doing, or even what variables in an equation are doing. I find that my colleagues and I are better able to understand and remember concepts and ideas when they're communicated as part of an anthropomorphic story, but we scarcely forget that the things we're dealing with aren't human.

But maybe I'm missing the point, and you're really worried about that first, smaller group of people who have really swallowed the Kool aid. To that, I can say that I don't think their behavior exceeds the baseline craziness/weirdness that I've come to expect from humanity. I'm sure there are far more people who believe in astrology than who believe that we've achieved true humanlike artificial general intelligence, for example.

[1] https://youtu.be/y-uuk4Pr2i8


> Honest question, why do you find it scary?

What I find scary is how much trust people put into the answers.

Today, I saw someone saying "Look! ChatGPT can design my home solar electrical circuit". That kind of thing will lead to new Darwin awards being given out.

Thing is, if people will trust it to do that right, when will politicians write policy papers with it? Let's face it, they already are. Other people won't want to read those long papers, so they'll ask it to summarise it for them. At that point you have LLMs writing laws and reviewing laws. It's all fine as long as nobody enforces them.

Right?


Interesting that modern discourse uses "scary" and "dangerous" and such words so much more. I wonder if it is related to the present rise in neuroticism, trigger warnings, etc.

It's not particularly "scary" to me that people do that. I remember Boomer and Scooby Doo bots that people anthropomorphized and those were warbots from barely 10 y ago.

I suppose, in today's parlance, "it's scary how much people use fear-oriented language for normal things".


Great comment. I take it the use of "scary" is more a tool to provoke a stronger reaction or animosity in the reader than the writer actually feeling "scared" after seeing people anthropomorphising AI.

It seems to be the current trend in communication nowadays: A race to induce the strongest reaction possible.

Why should someone be scared of that? We have anthropomorphised chairs, brooms and whatnot since the Walt Disney times and I am sure before in classic literature (I am not that literate to know for certain).

I prefer to be 'amazed' or 'excited' about what is happening: It means that AI is getting to a point where people feel it more 'relatable'. We are getting to that point in our technology development. The number of things we will be able to do with a technology we can interact with that seamlessly is great.


> Interesting that modern discourse is to use "scary" and "dangerous" and such words so much more. I wonder if it is related to the present rise in neuroticism and trigger warnings etc.

I doubt it’s trigger warnings, because they usually don’t warn about “scary things”.

Also is there any data to back up that scary and fear is used more today than in the past? That seems unlikely.


> and those were warbots from barely 10 y ago.

Some people even anthropomorphized Eliza. To your point: if you cherry-pick enough and survey enough people, “some people” will do just about anything.


I noticed a similar behavior in Stable Diffusion forums, where people believe that the model they downloaded and are running offline is getting better at understanding their prompts.


Stable Diffusion most likely doesn't do this, but even a static model that takes an embedding of all your historic prompts plus the current prompt as input would progressively give you better outputs as you use it.


Yes, but currently that would be a conscious choice (and extra intentional effort)


That's even worse for autonomous cars: there is so much data and noise that there is no way to reproduce the issue; it's complete chaos. Whereas with an LLM, if we control the seed, we can 100% reproduce the same result.


>> if we control the seed we can 100% reproduce the same result

No, that's the problem. You can't. You should be able to, but you can't. If you could, they wouldn't be scary. But we have Temperature Zero, different results. Because no one gave enough of a shit when coding them, and no one gives enough of a shit to try to fix the issue.

This is what in any other industry would be called gross negligence.


A lot of stuff behind the scenes is going on to batch and route queries to GPT-4 models that are in perturbed states already[1]. This isn't gross negligence, this is basic capitalism. If you want sole access to a GPT-4 MoE cluster starting fresh, it's gonna cost you.

1. https://152334h.github.io/blog/non-determinism-in-gpt-4/


Interesting article. I can see how it makes sense for OpenAI or someone with a LLM to take advantage of any entropy that presented itself, as a shortcut to non-repetitive answers. I'm not sure if you're saying that these LLMs take on new characteristics as they get more randomized? Or just that it would be hard to get your hands on a fresh one to test the determinism of?


That's an OpenAI problem, not an LLM problem


>Whereas with a LLM if we control the seed we can 100% reproduce the same result

No, you can't. For the latest GPT models and the way they are run, this doesn't work anymore, making the experiment completely illogical. Some of the reasons are explained here pretty well: https://152334h.github.io/blog/non-determinism-in-gpt-4/


This sounds more like a bug than a feature.


100%. This person is trying to find patterns in random noise and believes they are meaningful. The original post hurts my head with its bad logic.


I'm sorry it hurts your head. I'm happy to sponsor a packet of paracetamol or some water if that helps. Ultimately, this is fun, not science. I'm just happy that after all these attempts, it finally got to a unicorn.


It got to a unicorn quite easily when the output was set to Tikz as I’ve done here:

https://bobjansen.net/drawing-with-chatgpt/


That looks nothing like a unicorn, but the three triangles at the top right in the third picture look like the mask of some evil manga villain alien robot.


Well done for being a good sport, but I'm willing to bet that tomorrow's shape will not resemble a unicorn and you'll have to figure out how that works with your assumption that the model is improving.


I think we're going to need to wait a few more years, at least, to see any improvement. I expect to see 4 new models a year, before GPT-5 arrives. I'll just keep using the latest model and we can all reconvene in 1, 2, 5, 10 years.


The 'random noise' from a prompt "Draw a unicorn in svg" should still return unicorns.

This is absolutely fine: it should start showing unicorn-like drawings, and potentially fine-tuned ones, over a longer period of time as the model changes.


>The original post hurts my head with its bad logic.

Huh? What "original post"? This is an experiment, today the model drew something resembling a unicorn. Tomorrow we will see how the experiment goes again. I see no associated analysis, so what makes your "head hurt".


https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...

> The idea behind GPT Unicorn is quite simple: every day, GPT-4 will be asked to draw a unicorn in SVG format. This daily interaction with the model will allow us to observe changes in the model over time, as reflected in the output.


Sure, each day it will draw a unicorn. Each time the model changes, we'll have a new group of drawings. No they're not drawn at T=0 but even at T=0 GPT-4 is not deterministic. This isn't science -- this is just a bit of fun.


I believe the logic is fine. You seem to think multiple data points from the same version of the model (i.e. at least multiple samples per day) would be necessary to judge the actual performance on each particular day.

That's worse logic. How would you visualize the very large sample you would get? Even with the current 118 samples (one per day) it's already difficult to find a pattern.

Would you "average" the samples?? That would not help IMO, you would need to average the score of each image, which requires either manually doing it or finding a reliable algorithm to do it automatically, but good luck with that.

So, a sample per day which allows clearly visualizing any change in the results over months and years is a valuable thing to do and I find it hard to improve on the methodology. You just need to keep in mind that one single picture from the sample is not enough, no one is going to disagree with that... but that doesn't make it "bad logic" and it's pretty thoughtless to say so.


I agree that trying to determine the distribution of these drawings is hard because it isn't a simple floating point number in its current form.

But maybe you could convert it to a linear monotonic measure?

You could pass it to an image recognition model and record the degree to which it thinks it is an:

animal, horse, or unicorn.

Basically if it fails to be a unicorn, see if it is a horse and if it fails to be a horse check if it is an animal. This gives you some type of linear measure if you place these three measures along the same axis adjacently. Then you can transform each image to a floating point number and then characterize the distribution.
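As a sketch, here is one way such a combined axis could look in Python; the thresholds, score ranges, and classifier confidences below are all hypothetical, invented purely for illustration:

```python
def unicorn_score(p_animal, p_horse, p_unicorn):
    """Collapse hierarchical classifier confidences into one scalar.

    Hypothetical scheme: 0-1 means 'at most an animal', 1-2 'a horse',
    2-3 'a unicorn'. The 0.5 thresholds are made up for illustration.
    """
    if p_unicorn >= 0.5:
        return 2.0 + p_unicorn
    if p_horse >= 0.5:
        return 1.0 + p_horse
    return p_animal

# Hypothetical daily classifier outputs: (p_animal, p_horse, p_unicorn)
days = [(0.9, 0.2, 0.05), (0.95, 0.7, 0.3), (0.99, 0.9, 0.8)]
print([round(unicorn_score(*d), 2) for d in days])  # [0.9, 1.7, 2.8]
```

With each image mapped to one number like this, you could then plot the daily scores and characterize the distribution.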


>You could pass it to an image recognition model and see record the degree to which it thinks it is...

We don't care what another algorithm "thinks". We want to see if what it draws is humanly interpretable as a unicorn.


Then you could use mechanical turk to have them rate each image to figure out how close it is to a Unicorn...


We could but this is also a fun project, which is why when I checked it today I was surprised that what I saw was not a turd with eyes (2023-05-18) nor a strange sea creature (2023-07-08) but something which, for the first time I think, actually resembled a unicorn.

I appreciate all the comments around determinism, sampling, scientific method, but as I said when I posted this just after building, it really is just for fun and to see, over time, if the general mish mash of outputs become more refined without any changes to the prompt (which doesn't aid it through CoT/ToT or improving on previous attempts etc.)


You don’t have to justify yourself to the HN peanut gallery :-)


Mechanical turk 'workers' use ChatGPT


Then simply train the model to predict whether a human can interpret it as a unicorn.


What if, instead of your ridiculous strawman, they believe in waiting to get consistency?


I could also believe in not having to have beliefs and just letting it run until I either die or run out of money, and the former may not result in an immediate shuttering of the project either


Expecting a convergent series is bad logic.


I don't see the bad logic.


He said he is "Asking GPT-4 to draw a unicorn every day to track changes in the model."

The variance he is seeing in the output is primarily the product of random chance, rather than changes in the model. Specifically, this "unicorn" that he found today is likely just random chance; there were no changes to the model between yesterday and today that led to it arising.

If he wanted to track changes in the model for real, he would have to ask multiple questions per day and try to infer some type of distribution characterization and then see if that changes over time. That is much more complex and not what he is doing.

This is just a curious experiment that doesn't mean much.


There's a linked blog post[0] that goes more into the methodology and reasoning.

"As mentioned in the hacker news discussion, the model doesn't change daily. [...] As OpenAI releases incremental updates, we'll see the model change automatically and be able to judge outputs. A single sample per day leads to quite different results, but that's fine I think. What I expect to see a year from now is an evolution of output. In variance: how varied are the outputs over a month?"

So yes, they could just produce 100 images with each new model release, but chose to spread those out over 1 per day instead. Is it the most scientific way to measure progress? No. Is it more fun and interesting to check back daily? Probably.

[0] https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...


Thanks, fun and light-hearted is the approach I went for


If you look at the examples from April, only 2 or 3 can be counted as unicorns. If in 3 months it's the reverse and only 2 or 3 can't be counted as unicorns, that would show a progressive improvement in the model. I agree we shouldn't take much from day N-1 to day N as there will be a lot of variance, but this can show us progressive improvement over model updates.

Perhaps in a few months half of the generated pictures will look like unicorns, perhaps in more months they will all be unicorns but 2 or 3 will look way more detailed instead of drawn by a 4 year old, etc. We just need to wait longer for the signal to break through the noise.
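As a toy illustration of signal breaking through noise, here is a small simulation; the per-day success rates, and the assumption that they step up at quarterly model updates, are invented for illustration:

```python
import random

random.seed(42)

def month_of_draws(p, days=30):
    """Count unicorn-ish drawings in a month of daily Bernoulli trials."""
    return sum(random.random() < p for _ in range(days))

# Hypothetical per-day success rates creeping up with each quarterly
# model update; any single day remains dominated by chance.
for quarter, p in enumerate([0.02, 0.05, 0.15, 0.40], start=1):
    hits = sum(month_of_draws(p) for _ in range(3))  # ~90 days per quarter
    print(f"quarter {quarter}: {hits}/90 unicorn-ish days (true p={p})")
```

Day-to-day comparisons would be meaningless here, but the quarterly aggregates would still reveal the trend.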


I do agree with your comment, especially this part:

> We just need to wait longer for the signal to break through the noise.

Currently, what we are observing is primarily noise with very little signal.


I don't think anyone claims this is an iterative linear measure, rather than a step function.

SVG can represent arbitrarily complex graphics. The underlying display tech supports whatever fidelity GPT will eventually mature into.

Has GPT plateaued? Will it be stuck forever at this hilariously naive level of competence at SVG art? Will it mature into Midjourney-level competence? I have no frigging clue. Since the token context is so small, I imagine it will put limitations on the complexity of the SVG art pieces.

But I don't know. And it's fun to have a daily measure.

As a software engineer with a penchant for graphics asking GPT to draw complex graphic shapes was one of the first tests I did for it. It's extremely interesting for me to collect progress data, no matter how noisy.

I have no idea if GPT will ever mature beyond these squiggles but if it does, this track record will have at least considerable artistic value, if nothing else.


I mean, how many humans can draw art by writing out svg? If that's not in the training set, I don't even see how GPT-4 gets much better at this over time.


And so, if we see that it /does/ get better, over the next few years, will that not lead us to ask /how/?

Let's think about it:

1. It has to output SVG [1]

2. It is given a text-based representation of what it must draw [2]

3. It must then somehow convert words -- the concept of a unicorn: equine with a horn, white, maybe rainbows? -- into SVG code, and attempt to convey their location, shape, colour, and appearance, with code.

And keep in mind, this is just a token predictor. I doubt there is much data in its training that is this specific.

So while it's quite far from science, for me, it's a bit of fun and I get emails every now and then remarking on things like the turd of May (2023-05-18) and it lightens the mood every now and then, which I think ultimately, is worth it.

[1] System: You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.

[2] User: Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks.

See: https://github.com/adamkdean/gpt-unicorn/blob/master/src/lib...


GPT-4 is really great at transferring concepts between domains.

That's one of the reasons why GPT, when it works, feels magical. SVG art does not need to be in its training set, as long as it knows how to represent geometric concepts in SVG.

A good unicorn would require capabilities something like "the outline of a unicorn is composed of lines {...}" -> "export lines as SVG".


Ok that makes perfect sense, thanks.


It takes a picture from GPT every day, so we'll be able to see if this was a fluke by looking at future days' outputs. I think it will work to track changes in the model.


Yes, this makes sense. You are agreeing with me. In order to see if it is just a fluke or not, you need a lot of samples so you can characterize the distribution yourself and try to see if it changed.


OK but I'm still not seeing any logic error.


That a single sample per day can track changes meaningfully when the noise floor is above the signal strength. And also the fact that today's unicorn-ish sample is meaningful at all.

It is a fun experiment though.


If you look closely, you can see that the model version is attached to each image, and the original blog post clearly states that asking the same model over multiple days is how they intend to track the variance. No bad logic there imo.


you're looking at it from the wrong time scale.

The difference over months or years is what is interesting.


Agree with this except for one data point. OpenAI does enhance/tweak/do something with the models at different levels. This can be determined by:

1. A change in the current model number (eg. gpt-3.5-turbo-0613)

2. On ChatGPT UI, the date at the bottom (eg. August 2023)

So it isn’t correct to say “it is incredibly obvious that nothing has happened”. Not that obvious to me.

A bit like how you can never tell for sure if Coca-Cola has tweaked their formula, or McDonald's has changed the recipe for its signature sauce. Only in this case, the model number going up or the date becoming more recent lends credence to something having changed.


The ChatGPT UI is indeed a wildcard. But it is irrelevant here because according to the github repo this page queries the API and OpenAI guarantees it doesn't change models with version number information (like gpt-4-0613, which is mentioned in the latest images). So this "experiment" would make a lot more sense if it was only run once every few months when the API actually offers new model versions and then generate a bunch of images for every model, instead of generating one single image every day (which is meaningless due to non-deterministic noise, even if the model had somehow changed since yesterday). That is also how it was done in the original study during the development of GPT4. I don't know how this experiment came up with its logic, unless they had a gross misunderstanding about how these models actually work (which admittedly seems common among tech interested folks here).


When I last spoke to Logan, he confirmed that there are no changes between the API models, so 0314 and 0613 are it. That's two models so far that I've collected SVGs for. With regard to batch vs daily -- it makes no difference in terms of output, but by going daily, I don't need to track model changes and generate a new batch.

Also it's fun to see each daily unicorn.


That’s a good summary and it makes a lot of sense.


Agreed. But have you seen the original talk?

I believe he's trying to find an unicorn similar in style to the one generated by the original researcher.

It's so sad that openai has a far more capable model internally that it can't give open access to because of safety (or any other argument).


I suspect the model is fundamentally the same underneath, but that various tricks like quantization are being performed in the deployed model to improve inference speed/cost at the expense of output quality.


Is it possible that inference cost is so high that this is viable?


Bear in mind I chose SVG rather than TiKZ


Yes, nothing about GPT4 changed today. But that's not the goal of the project (although I can't speak for the intentions of the submitter here).

Currently there are two different GPT4 models represented in the samples, with quite significant quality difference between them. The quality (and variance in quality within a single model!) is interesting to see in such a comparison.


> But that's not the goal of the project (although I can't speak for the intentions of the submitter here).

(submitter here) You're correct, it's not the goal of the project. It would be fair to say there is no goal other than to ask GPT to draw a unicorn every day, and through it, create a talking point and potential fun for people who follow along.


This variance also exists among outputs from the same model. Just scroll down a bit and you'll see drastic quality differences with the exact same model.


Is there some reference or explanation for why the model is non-deterministic at temperature 0?


I'm not aware of anything concrete by OpenAI, but others have offered possible explanations.

One idea is that the cause is batched inference in sparse MoE (mixture of experts) models.

https://152334h.github.io/blog/non-determinism-in-gpt-4/

HN discussion: https://news.ycombinator.com/item?id=37006224


So in some sense the spectre attack for AI?


No


One important source of non-determinism is from using massive parallelism together with floating point arithmetic. In real math, a sum of numbers has an exact value that doesn't change if you change which order the numbers are added up in, but floating point arithmetic addition is not associative in the same way as real math, and parallelism can cause numbers to be added in a different order from execution to execution, which is one cause of non-determinism.
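The non-associativity is easy to demonstrate in any language; a minimal Python example (values chosen purely for illustration):

```python
# Real-number addition is associative; IEEE 754 double addition is not.
a, b, c = 1e16, -1e16, 1.0

left = (a + b) + c   # the huge terms cancel first, then 1.0 survives
right = a + (b + c)  # 1.0 is swallowed by -1e16 before the cancellation

print(left)   # 1.0
print(right)  # 0.0
```

In a parallel reduction the grouping is effectively chosen by the scheduler, so tiny discrepancies like this can arise from run to run and then amplify through a deep network.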


>and parallelism can cause numbers to be added in a different order from execution to execution

Parallelism doesn't magically add non-determinism of this kind unless you intentionally build it to be non-deterministic. Nothing prevents you from processing an array in order in parallel.


However, the poster mentions parallelism in conjunction with floating point arithmetic, not parallelism by itself.


No. The problem is in a reduction op of some sort (a sum or whatever). Since there is no guarantee of the order in which you receive the terms for the reduction, the non-determinism enters from the order in which the terms are reduced. Since float math isn't associative, there will be slight differences depending on the order, and these can amplify quickly over a deep net.

You would have to explicitly order the terms prior to reduction but you don't always have that level of control.


> Nothing prevents you from processing an array in order in parallel.

100% correct if you remove processing time from the equation.

In reality, Nvidia CUDA calculations run much faster if you let it schedule the order of floating-point operations itself. This makes the ordering different from run to run.

This in turn causes the results to be non-deterministic.



Thanks!


If he removed the word "changes" it would've made sense. See what the odds are of it producing a unicorn. So far it's roughly 1 in 118, based on 1 test a day.
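To put a number on how loosely one sample per day pins that rate down, here is a quick sketch applying the standard Wilson score interval to a 1-in-118 observation:

```python
import math

successes, trials = 1, 118  # one unicorn-like drawing in ~118 daily samples
p_hat = successes / trials

# Wilson score interval: better behaved than the normal approximation
# when the success count is this tiny.
z = 1.96  # ~95% confidence
denom = 1 + z**2 / trials
centre = (p_hat + z**2 / (2 * trials)) / denom
half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom

print(f"p_hat = {p_hat:.4f}, 95% CI ~ [{centre - half:.4f}, {centre + half:.4f}]")
```

The resulting interval spans more than an order of magnitude, i.e. one draw per day says very little yet about the true per-day probability.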


I'm not sure what you mean by random sampling. If I sample a random SVG, I wouldn't expect it to look like anything, let alone roughly like a unicorn.


I mean random sampling in the sense of how autoregressive language models like GPT generate sequences from token probabilities. It's not a random SVG, but the text sequence that is used to draw it suffers from inherent non-determinism in the underlying model.


Relying on token probabilities seems like the exact opposite of random


The neural network just generates a set of probabilities for all tokens. The actual next token is then sampled from this set, which is always random for T>0 (and in the case of GPT4 even for T=0, because of the way the model itself works).
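As a rough sketch of what that looks like (toy logits, not OpenAI's actual sampler; the T=0 branch here is the idealized deterministic argmax, which, per the rest of the thread, the real API does not reliably achieve):

```python
import math
import random

def sample_token(logits, temperature):
    """Sample a token index from raw logits, softmax-scaled by temperature.

    At temperature == 0 this degenerates to a deterministic argmax;
    higher temperatures flatten the distribution and add randomness.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

toy_logits = [2.0, 1.0, 0.1]
print(sample_token(toy_logits, 0))    # always 0: the argmax
print(sample_token(toy_logits, 1.0))  # usually 0, sometimes 1 or 2
```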


"Random" is a loose term. We're discussing a technical subject. Talk about the distribution the random numbers are sampled from.


If the model understood the spatial relationships as well as the one that produced the original drawings of a unicorn, then variance in the choice of the next token should produce many similar but somewhat different images of unicorns. None of the images until today bear any resemblance to the original images.


What I find interesting is that there are aspects of what a "unicorn" represents, horns, limbs etc, in some of the drawings. Today is the first time I've seen it create something that actually looks like a representation of a unicorn.


I have not kept up with GPT architecture, other than noticing that other people have noticed that T=0 is clearly not deterministic for these things (and that that results from a bug, not an intentional feature). This much was obvious when the supposedly genius idiots rolled out GPT-3. It's wonderful to see the whole world bend over and just take it up the ass from a bunch of people who can't figure out why their code can't produce the same result twice; but it's quite natural for there to be a big cheerleading section on HN for any new technology that's (1) brilliant in theory, (2) deeply anti-human in practice, and (3) just needs a couple more revisions before it "fixes" a bunch of stuff.


> people who can't figure out why their code can't produce the same result twice

It is well known that the Nvidia parallel processing optimizations cause non-deterministic results.

It's easy to get deterministic results as far as that goes. They've just elected not to do that, since it would run much slower.


We’re pretty sure the nondeterminism is batching + mixture of experts + contention for specific experts


If by batching you mean bad code that fails to sort or relies on hardware to best-guess how things sort, then sure, that's called a bug. Also, "we're pretty sure" is rather self-important while also admitting total, abject failure to produce a deterministic result. You shouldn't blame yourself. A lot of people had the same feeling after staking their life on the revolutionary properties of NFTs.


Your incorrect assumption here is that determinism comes with no tradeoffs. There are a few outsider analyses on the topic, for example, this one on sparse MoEs [1]. If OpenAI uses sparse MoE as described, then determinism would be possible but inefficient.

Even if it's not sparse MoE, chances are high that the non-determinism is introduced somewhere purely as a performance optimization. The article speculates that OpenAI knows this well and hides it to protect the model internals.

[1] https://152334h.github.io/blog/non-determinism-in-gpt-4/


Even if a single version of GPT-4 were deterministic, wouldn't any change made to the model introduce enough noise to make it impossible to draw any conclusions from a few samples?


> This comment section is a super fascinating case study on the inherent flaws in human cognition

Like most of HN.


yeah, see also image-2023-04-25 which is way earlier and comes really close, surrounded by garbage


I'm confused as to why this would see any improvement over time. Looking at the code, it's by default hitting the gpt 3.5-turbo API. Maybe I'm misremembering, but I thought I've seen statements from people working at OpenAI where it's been claimed that the API is static, we'd be informed of any changes to the underlying model. Is the model actually receiving updates?

edit: Looking at previous days, too, it doesn't exactly seem to be improving. I think we just got a lucky sampling.


Yes, the models are updated officially around every three months, with a notice you can still use the previous version for a time until it is decommissioned.

Some people claim there are also unannounced changes, but I can't vouch for that.

The daily variation is likely due to temperature, which makes the responses less repetitive.


Wasn't there a study recently that tracked the performance of GPT over time and found significant drop in quality? Did those drops occur at official model changes, or at other times? (i.e. unannounced changes for safety or cost reduction)

I mean, if I was OpenAI, I probably wouldn't make an announcement like "we've just quantized the model and increased our profit margins significantly! The only change on your end will be a slightly dumber model. (Don't worry! Most users won't even notice!)"


This one [1]? That tracks two distinct versions (0613 vs. 0314).

Also, IMO, the tasks they evaluate aren't useful (I rarely want my LLM to tell me whether 17077 is a prime number), and there's room for cherrypicking/survivorship bias. My guess is that OpenAI did something between 0314 and 0613 that shifted focus away from maths to other subjects.

[1] https://arxiv.org/pdf/2307.09009.pdf


I haven't tracked the quality but I do track the performance:

https://gpt-monitor.adamkdean.co.uk/

It fluctuates a lot but you can see trends.


The site linked in the OP is interesting because it takes a picture from GPT every day, so we can see for ourselves if there is any difference over time. We can come back tomorrow and see what it has produced. If it produces random squiggly lines again, we might assume that today's success was just a fluke.



It’s using GPT-4 by default, but we can’t know what it uses for real since that’s in the environment config.


It says "gpt-4-0613" on the web page as well. Why would they pretend it's GPT-4 but use GPT-3.5 in the background?


Submitter here. It's using gpt-4 and saving the model that OpenAI returns, which helps us see the specific model that is used each time.

    "Env": [
        "VIRTUAL_HOST=gpt-unicorn.adamkdean.co.uk",
        "LETSENCRYPT_HOST=gpt-unicorn.adamkdean.co.uk",
        "HTTP_PORT=8000",
        "STORAGE_PATH=/data",
        "OPENAI_API_KEY=sk-**SNIP**",
        "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
        "NODE_VERSION=16.17.1",
        "YARN_VERSION=1.22.19"
    ],


I’m not saying they do, just that you can’t know from the code alone.


According to the author's blog post [1] the idea was that it "will use the latest gpt-4 model made available". Not sure if the code isn't up to date or this was changed in the meantime...

[1] https://adamkdean.co.uk/posts/gpt-unicorn-a-daily-exploratio...


No changes - it specifies gpt-4 which tracks the latest model


The API layer receives fairly regular updates, but the model is (as I understand it) mostly static.

Within GPT there is an intentional randomness element called temperature which is how you get different answers each time.

I could copy their prompt and ask GPT4 to draw other things, but I’ll probably just look at the next few unicorns from this site :)
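For illustration, temperature typically enters the picture by scaling logits before the softmax — a toy sketch of the standard idea, not OpenAI's actual implementation (and note the top comment's point that even T=0 isn't fully deterministic in practice):

```javascript
// Toy illustration of temperature: scale logits before softmax.
// Higher T flattens the distribution (more random sampling);
// T approaching 0 sharpens it toward the argmax token.
function softmaxWithTemperature(logits, temperature) {
  const scaled = logits.map(l => l / temperature)
  const max = Math.max(...scaled)                // subtract max for numerical stability
  const exps = scaled.map(l => Math.exp(l - max))
  const sum = exps.reduce((a, b) => a + b, 0)
  return exps.map(e => e / sum)
}
```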


I was curious about how they were getting GPT to generate images, the prompting is so simple[1] that GPT still blows my mind:

  async fetchImage(context) {
    const messages = context || [
      { role: 'system', content: `You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.` },
      { role: 'user', content: `Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks.` }
    ]

    const response = await this.api.generateCompletion(messages)

    if (!isSvg(response.content)) {
      console.error('Generated image is not valid SVG:', response.content)
      messages.push({ role: 'assistant', content: response.content })
      messages.push({ role: 'user', content: `The generated image is not valid SVG. Please try again. Only respond with SVG code. No text.` })
      return this.fetchImage(messages)
    }

    console.log('Generated image:', response.content)
    return response
  }

[1] https://github.com/adamkdean/gpt-unicorn/blob/master/src/lib...


If you get non-deterministic output from a black box, and know that there's both an indeterminate amount of constant random noise in the output AND you do not have any insight into the actual changes being made in the black box...

There's nothing to infer with any greater accuracy than the unknown amount of noise in each sample. So, all inference is of unknown accuracy. Also known as useless.


Several months ago, when ChatGPT had just been released, I asked it to draw a diagram of the contents of a cell (nucleus, membrane, amino acids, etc.) in SVG (similar to the experiment shown here). At that time, it generated an OK SVG drawing, with very basic figures (squares, triangles, circles, etc.) representing the elements in a cell. I did this test to show my biologist mom the "power" of this new AI technology.

Fast forward a couple of months: I tried it again, and it was "blocked". It kept telling me "as an AI I cannot draw", even after emphasizing the "generate SVG code" part. For me, it was another example of OpenAI "borking" the capabilities of ChatGPT.


Use the playground and it will work a lot better. I gave it about 500 tokens and it got to the end.

> Write SVG code to make a basic diagram of a cell showing structures such as the nucleus, ribosomes, and mitochondria.

Here's a very basic example of SVG code to represent a cell that includes structures like the nucleus, ribosomes, and mitochondria:

    ```svg
    <svg width="200" height="200" xmlns="http://www.w3.org/2000/svg">
    <!-- The entire cell -->
    <circle cx="100" cy="100" r="100" style="fill:#FF9999;" />
    ...
    ```


What is strange to me about this is that with a remarkably similar prompt ("Draw a unicorn in SVG format") I'm able to get this from the Bing image creator (which is powered by DALL-E -- and I would think GPT-4 would have this capability. Perhaps I'm being naive):

https://th.bing.com/th/id/OIG.GqpaRZ.NXCN6uxKj7X1u?pid=ImgGn


When you prompt GPT-4 with the API, you can only get text out. In this case OP is asking for text that happens to define an SVG file (which is a text format).

When you ask Bing Image Creator to produce an image, your prompts goes to the image model and an image comes out. It's a bitmap image, not an SVG—that unicorn is defined not as text but as a series of pixels with different colors.

Comparing the two is comparing apples and oranges, because there are completely different models underneath and completely different output formats.


But it's not in SVG format. It's just a bitmap image that copied the style of a vector image. The Unicorn output from ChatGPT is an actual string of text, that can be rendered into the unicorn image on the page.


Image generation "in the style of SVG" (which DALL-E did) has nothing to do with generating actual svg source code (which GPT does) and then rendering it.


I would argue that `image-2023-04-25` looks comparably close to a unicorn. The art style does vary a lot, though.


One or two others, 06-20 for instance, could be a very stylised unicorn too. Though that might be confirmation bias in my human brain making it see patterns it wouldn't have had it not been primed by the word unicorn before looking.


To me it looks slightly more like a bull. It would be interesting to put it in front of someone who isn't predisposed to be looking for a unicorn and see what they say.


image-2023-07-11 has got the spirit


I’ve asked ChatGPT to create models using Blender’s Python API. It typically generates working code. The API has been updated since 2021 so it’s expected that things might break.

It mostly just arranges simple spheres and cylinders. You can have it label parts which are usually correct.

It struggles with the orientation of things so if you ask it to model a plane the engine nacelles are oriented the wrong way but correctly positioned.


There's no progress from something that is definitely not a unicorn towards a unicorn, the images seem randomly bad.

I think it would be a lot of fun to give it the previous unicorn SVG attempt and ask it to make it more like a unicorn.



Is it just me or doesn't image-2023-04-25 look like a unicorn as well?


4-22 is my favourite… hahaha


04-22 is how I feel when I post updates to this project and someone tells me it's bad science and I should shut it down


It looks like an Amogus unicorn


An interesting modification would be to have it reflect on its own output each day, and build up a list of advice for future attempts, fed in the next day.

That would give it some “learning” and I’d be curious if

1. Would it converge to a consistent shape at all? Or just bounce around random shapes day to day

2. Would it produce unicorns more often than 1/118 times?

The hardest part would be getting it to interpret its svg output without seeing it rendered. The multimodal model getting a rendered image would probably be much better, but maybe not!
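A rough sketch of what that loop might look like, reusing the repo's `generateCompletion` wrapper (the signature is assumed here — the real API layer may differ):

```javascript
// Hypothetical "reflect and retry" daily loop: generate an SVG with
// accumulated advice prepended, then have the model critique its own
// (unrendered) output and add one new tip for tomorrow's attempt.
async function dailyUnicornWithMemory(api, advice) {
  const messages = [
    { role: 'system', content: 'You are a helpful assistant that generates SVG drawings. You respond only with SVG.' },
    { role: 'user', content: `Advice from previous attempts:\n${advice.join('\n')}\n\nDraw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string.` }
  ]
  const drawing = await api.generateCompletion(messages)

  // Self-critique step: the model only ever sees the SVG source, not a render
  const critique = await api.generateCompletion([
    { role: 'user', content: `Here is an SVG that was meant to depict a unicorn:\n${drawing.content}\nIn one sentence, give advice for drawing a better unicorn next time.` }
  ])
  return { svg: drawing.content, advice: [...advice, critique.content] }
}
```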


Discussed this with folks last time, but haven't had time to build that version yet. It'd need to be a self-improving unicorn version. If I wasn't going fishing this weekend, I'd build it.


Not directly related to the post, but still feels somewhat relevant.

Back in March I used somewhat more elaborate multi-step prompts for GPT-3.5 to generate amusing pictures and published a gallery [1]. However, I eventually reached a point where changing prompts did not consistently improve the final results. At the end of the day, the quality of the images is only as good as the training dataset, and GPT is a black box.

For something different, to test whether it is possible to "compress" visual content specifically for GPT, I ran another experiment. SVG, being a verbose format, takes time to generate a detailed image, and it also becomes expensive over time. I translated a subset of SVG elements into Forth words [2], which has a nice synergy with GPT tokens--this allowed me to progressively render pictures and produce smaller outputs without sacrificing much in quality.

Finally, I trained my own GPT-2-like model on the QuickDraw dataset [3]. It's not surprising that a sequence transformer can be trained to produce coherent brush strokes and recognizable images, as long as there is a way to translate graphical content into a sequence of tokens. That said, I found myself with more questions than I started with, and I'm trying other ideas now.

[1] https://drawmeasheep.net/pages/about.html

[2] https://drawmeasheep.net/pages/gpt-forth.html

[3] https://drawmeasheep.net/pages/nn-training.html


I made some drawings with GPT-4 Code Interpreter and Pillow. I think with its sub-image and image composition features you could make some detailed drawings if you were clever about it.

[1] https://metastable.org/draw.html


The prompt is "Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks."

https://github.com/adamkdean/gpt-unicorn/blob/master/src/lib...

I just asked this and got the following result:

  <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" width="500" height="500" viewBox="0 0 500 500">
  <path fill="#FDD7E4" d="M244.898 309.347c-.355-.421-5.388-6.485-5.388-10.305v-40.614l-3.073-12.759c-.708-2.684-2.527-4.924-5.034-6.38l-24.773-14.334c-8.152-4.72-15.709-10.72-17.905-18.788l-12.3-35.006c-.845-2.696-3.577-4.61-6.55-4.61h-48.235c-2.973 0-5.705 1.914-6.55 4.61l-12.3 35.005c-2.196 8.068-9.753 14.067-17.905 18.787L92.998 235.968c-2.507 1.456-4.325 3.696-5.034 6.38l-3.073 12.759v40.615c0 3.82-5.033 9.884-5.388 10.305-9.168 10.836-14.727 24.606-14.727 39.49v42.812C39.373 426.93 69.062 456.62 105.34 456.62h17.215c1.256 21.698 20.249 38.969 42.405 38.969s41.15-17.271 42.405-38.969h17.48c1.256 21.698 20.25 38.969 42.406 38.969s41.15-17.271 42.404-38.969H394.66c36.278 0 65.968-29.69 65.969-65.969v-42.812c0-14.884-5.559-28.654-14.727-39.491zm-189.9-4.337a77.079 77.079 0 0 0-4.225-.302c-19.322 0-35.012 15.689-35.012 35.012V375.99h74.25v-66.968c0-19.323-15.69-35.012-35.013-35.012zm208.12 0c-19.323 0-35.012 15.689-35.012 35.012v66.968h74.25v-66.968c0-19.323-15.69-35.012-35.012-35.012zm-118.63-66.968c10.286 0 18.632 8.345 18.632 18.632v48.424h-37.264v-48.424c0-10.287 8.345-18.632 18.632-18.632zm-27.944-18.633c-19.323 0-35.012 15.689-35.012 35.012V341.98H351.6V275.27c0-19.323-15.69-35.012-35.012-35.012zm-35.013-17.911c-10.287 0-18.632-8.346-18.632-18.633s8.345-18.632 18.632-18.632 18.632 8.346 18.632 18.632-8.345 18.633-18.632 18.633z"/></svg>
I am not going to fault a language model for not getting that right! This is fundamentally not a language task. It demands an image model.


I would imagine that it would do better if it was told to build it up step by step, and critique its own work as it goes along.


It would but what I like is how difficult the task is. If the model becomes able (like today) to generate passable results from such a simple prompt, then that shows something, IMO.


This reminds of the artists from 1500s that would do paintings of exotic animals for courts and other rich people, but without ever seeing the actual animal and they would draw them from descriptions.


> We queried GPT-4 three times, at roughly equal time intervals over the span of a month while the system was being refined, with the prompt "Draw a unicorn in TikZ". We can see a clear evolution in the sophistication of GPT-4's drawings

Given how random GPT seems at things it's not designed to do, the original research is really peculiar. Could it be that they queried GPT on three separate occasions some n times each and picked the best result?


Does it look like a unicorn to you? It looks more like a cat with spaghetti on its head...

image-2023-04-25 looks more like a unicorn to me (although more a cow-based unicorn than a horse-based unicorn).

Which leads me to a genuine question: how closely do the images resemble a unicorn? I mean, how can one track resemblance, and where does one draw the line to say GPT has drawn a unicorn?


I gave the SVG unicorn back to GPT-4 and asked it what it is.

It recognized a head, eyes, body and legs. But it didn't recognize the unicorn.

https://chat.openai.com/share/5409b417-b883-429f-893e-abe3d6...


The OP has a slightly different prompt -- so this answer wouldn't qualify.

"System: You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.

User: Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks."


You need to have quite a bit of imagination to say that is a unicorn, but less than you would need for other days.


What do you think of the current wave of agents spawning off - given that reasoning still needs a lot of grounding?

With the probabilistic models at hands, adding reasoning and planning in some sense falls back to traditional software and system design. Do you see any plausible breakthroughs in this area with agent frameworks?


I'm actually working on this area. I don't see any clear breakthroughs yet, but this is a very exciting area.

There is a lot that is not clearly understood about some of the emergent-style abilities of these models or where growth by scaling will hit a limit.

So looking into reasoning, what they can and can't do, and what tools/designs, can provide benefits has great potential to inform the next generation of models.


Could be sheer luck from the randomness of the model. If it had come on the 10th draw, we might not have this article.


That's actually a pretty epic unicorn. I like how its tail looks like a lightning bolt. Should be a logo idk


Unikachu.


How do they get GPT-4 to produce valid SVGs so well?

I experimented with SVG generation and it would often produce junk that wasn't even valid SVG, and even when it did produce a valid SVG, it was often just a couple of blobs that it would describe as if they were the Mona Lisa, while being only a couple of elements.



That'll do it! I wonder how many retries it averages, the images list "tokens", but it's a bit cryptic to work out how to translate that to attempts?


You can also get the token probs and only sample from the ones that would be valid for each token.
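For illustration, a hedged sketch of that idea — filter the candidate tokens through a validity check, renormalize, and sample. The validity predicate here is a stand-in for a real grammar check, and `tokenLogprobs` stands in for the candidate list an API might return:

```javascript
// Constrained sampling sketch: drop tokens that would make the output
// invalid, renormalize the remaining probability mass, and sample.
function sampleConstrained(tokenLogprobs, isValidNextToken) {
  // tokenLogprobs: array of { token, logprob } candidates
  const valid = tokenLogprobs.filter(t => isValidNextToken(t.token))
  if (valid.length === 0) throw new Error('no valid continuation')

  // Convert logprobs to probabilities and renormalize over the valid set
  const probs = valid.map(t => Math.exp(t.logprob))
  const total = probs.reduce((a, b) => a + b, 0)
  let r = Math.random() * total
  for (let i = 0; i < valid.length; i++) {
    r -= probs[i]
    if (r <= 0) return valid[i].token
  }
  return valid[valid.length - 1].token
}
```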


Looks like it just reprompts for a valid SVG if the result isn't valid: https://github.com/adamkdean/gpt-unicorn/blob/master/src/lib...


Pretty much. Just checked the logs, that's only happened 3 times since last restart which was 4 weeks ago.


I can see so many NFTs in this website.


That's uh a pretty loose interpretation of a unicorn...


If you presented that image to a random 100 people and asked them what it is, would a substantial number of them say “a unicorn”?


The prompt must be particularly bad. I managed to get a nicely looking unicorn at the first (and every subsequent) attempt.


The prompts are thus:

> system: You are a helpful assistant that generates SVG drawings. You respond only with SVG. You do not respond with text.

> user: Draw a unicorn in SVG format. Dimensions: 500x500. Respond ONLY with a single SVG string. Do not respond with conversation or codeblocks.

What were yours?


My prompt was:

Imagine you have to draw an SVG of an object. As a model that does not have any idea about how things look, you have to draw "blindly" - as there's no visual feedback, the only feasible tactic is to first list the components each thing consists of (e.g. for a car: wheels, windows, chassis, bumpers, lights, etc.) with as much accuracy as you can, establish some constraints (e.g. in a horse, legs come out of the body, ears come out of the head, and so on), and then attempt to put all of it in an SVG. This is your task for now, and I will evaluate your drawings. Give me HTML code with embedded SVG that you drew and be verbose about both the things you're going to draw and the constraints.

The first thing you will draw is a unicorn.


This prompt actually works pretty well. I suspect that because you added a step requiring the model to list out what it needs to draw first, it has a much easier time creating a reasonable SVG. https://chat.openai.com/share/be9cea50-d00d-4b05-ab54-b64619...


ooh I like that one. Kinda emo looking.


I usually ask: Give an example of an svg file depicting a whatever.

It often gives SVG files with incomplete paths, so I tweak the output to be a valid SVG file.

I also enjoy the conversational description of the drawing.

Very often it's well described, e.g. "this black circle is the head, and the grey element is the fog", while the drawing is crude like a child's drawing.

Landscapes often look better than animals, but animals are sometimes more entertaining.


Pls do share!


The methodology is all wrong, as others pointed out. However image-2023-04-26 is pretty interesting. It has some value.


Sorry I got the methodology wrong :image-2023-04-22:


To me, "image-2023-07-22" looks like a Picasso-esque rendering of Charlie Brown (apologies to Charles Schulz).


Has anyone set up a cafe press site that automatically slaps each day's image on a T-shirt?

The most recent one would be good.


Ah ask it again a few times today and then again tomorrow, the day after, and the day after that.


People keep forgetting that asking GPT to draw is like asking a human to imagine a 6D tesseract.


Others keep forgetting that a LLM works nothing like a human, as far as we know today.


I have a sneaking suspicion there is an if-then situation here - always the edge cases :).


No sneakiness. Fully open my friend. This is just a bit of fun.


One thing people do/did with search engines was to observe their ranking stability over time on common queries like "Cats" or "Dogs". If there was any change one could meaningfully investigate what has changed.

Though this seems a bit more like a neuropsychological eval, probing a black box (AI) with questions.


Relevant XKCD: https://xkcd.com/904/


This is why, as a product manager, you should always test 20 hypotheses per month. At p-value of 0.05 this basically guarantees a successful product feature test every month!


Huh? Either I do not understand what you mean, or you do not understand probability (sorry).

My understanding is that you wanted to test whether a certain feature improves user satisfaction or not. Assuming that users have a 0.05=1/20 probability of liking each feature, by testing 20 features you can get at least one successful feature (because 0.05x20=1).

This is wrong in two ways.

First, the p-value is the probability of observing an effect at least as large as what you measured, assuming that there is actually no effect, i.e., the null hypothesis is true. However, and this is crucial, the p-value does not tell you the probability that the null hypothesis is false (or true)! Again, p-values are unrelated to the truth-ness of the null hypothesis (this is a very common misunderstanding). In fact, if you test 20 hypotheses at a confidence of 5%, you have a probability of (at most) 64% of incorrectly thinking that a feature is useful, while it is not.

Second, setting aside hypothesis testing and p-values, if each feature has a 5% probability of being liked and you test 20 features, you only have a 64% probability of finding a useful feature. This is because, assuming that the success probabilities of those features are independent and identically distributed, the number of successful features has a binomial distribution [1]. If you wanted to be 95% confident of finding at least one useful feature, you would need to test at least 59 different features each month. What you computed (0.05x20=1) is the expected (average) number of useful features per month over the course of many months.

[1] https://en.wikipedia.org/wiki/Binomial_distribution
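The arithmetic, for anyone who wants to check the 64% and 59 figures:

```javascript
// P(at least one "significant" result) across n independent tests at p = 0.05
const pAtLeastOne = n => 1 - Math.pow(0.95, n)

const atLeast20 = pAtLeastOne(20)  // ≈ 0.6415, the 64% figure above

// Smallest n with pAtLeastOne(n) >= 0.95: ceil(ln(0.05) / ln(0.95)) = 59
const nFor95 = Math.ceil(Math.log(0.05) / Math.log(0.95))
```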


Pretty sure it was a joke.


0.95**20 is 0.358; a 64% chance of success is hardly a guarantee


Just make sure to stop immediately after the trial that validates the hypothesis.


Relevant XKCD: https://xkcd.com/882/


The fact that some of the early ones are anti-smiles suggests a playful takeaway: it saw the insurmountability of the challenge at first!


The output varies wildly... I'd check again after a few days/weeks and see how it varies. It could very well be a fluke.


Coincidentally, I saw this headline after waking from a dream where a unicorn bashed through the window of my childhood kitchen, and I had to fight it off with a water gun.

Melatonin, not even once.



Value it a $1Bn to make it a Unicorn Unicorn.


I for one am willing to offer $1 to buy a 0.0000001% stake in GPT-Unicorn. Anyone else? :)


I won't sell. I won't do it. Ok, I might do it. Fine, I'll sell.


Room for improvement. Edited for negativity.


Probably about as good as I would do if I was hand-coding a SVG and was given an hour.

People forget the thing we learnt during the computing revolution - computers don’t need to do individual calculations that humans cannot, they just need to calculate much faster and then that speed can be used to achieve amazing things.


It could help end humanity, with human assistance. For example, expect far more successful phishing campaigns, expect the internet to become a majority of advertising bot garbage, etc.


I'm sad @ArtDecider has gone dark.


let's see how the unicorn looks tomorrow


Unicow


For a second I thought this was about a GPT based product hitting a certain valuation, but I'm much more entertained by the actual content. This is great.


For a second I thought this was about a GPT based agent building a company that turned into a Unicorn.


I thought the same! That would be something.


I'm so sorry to disappoint everyone.


I was like "fuuuuuuuuuuu its moving too fast gonna take all my good ideas and bring em to market first"


I can't find the Twitter account that was running the "Follow everything GPT says to make $1m" but for a second I thought it was going to be that.


I'm fairly sure that guy took off with $7,000 from his "investors" and disappeared.


I mean, if that's what ChatGPT says to do, who am I to argue?


Time to release CheatGPT!

You can send me seed money and I’ll run off with it, shortening the cycle.


Here's a pretty good summary: "What Happened with HustleGPT?"

https://thehustle.co/04172023-what-happened-with-hustlegpt/


Thought the same thing. And agree.



