
How it works: a probability distribution over sequences of consecutive tokens.

Why it works: these absolute madmen downloaded the internet.



This is the thing: these AI models aren't that impressive once you understand what they do. What's impressive is the massive amount of data. One day the law will catch up, too, because what they are all producing is essentially a combination of many little pieces of compressed versions of human-produced work. In effect it's a kind of distributed plagiarism.


Like pretty much all human work...


Thankfully most human work is generally not controlled and monetized by three madmen


True! But that’s a critique of capitalism, not AI.


Actually did not mean for that statement to be understood in reverse. Is that the opposite of Poe's law? :thinking_emoji:


It has long been experimentally shown that neural networks do in fact generalise and do not just memorise the training samples. What we do not see here is convergence of the empirical distribution to the ideal distribution: the data is too sparse, the dimensionality too high. The amount of data is undoubtedly enormous, but it is not so simple. Only years and years of research have led to models capable of learning from such enormous amounts of data, while we also see steady improvements on fixed datasets, which means we do in fact make real progress on quite a lot of fronts. More data-efficiency would be great, but at least we do have those datasets for language-related tasks; it has also been shown that fine-tuning works quite well, which might be a way to escape the dreaded data-inefficiency of our learning models.

In the end, we are not really in the business of copying the brain but of creating models that learn from data. If we arrive at a model that can solve the problem we are interested in through different means than a human would, e.g. first pre-train on half of the internet and then fine-tune on your task, we would be quite happy and it would not be seen as a dealbreaker. Of course, we would really like to have models that learn faster or have more skills, but it's amazing what's possible right now. What I find inspiring is how simple the fundamental building blocks are that our models are composed of, from gradient descent to matrix multiplication to ReLUs (just a max(x, 0)). It's not magic, just research.


> matrix multiplication to Relus (just a max(x,0))

Transformers famously employ the Softmax activation inside the attention matrix. Very rare to see Softmax anywhere other than the final layer.
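To make the point concrete, here is a minimal, dependency-free sketch of scaled dot-product attention (toy dimensions, hypothetical values): the softmax is applied to each row of the query-key score matrix inside the layer, turning scores into weights that sum to 1, rather than producing class probabilities at the output.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention over lists of vectors. The softmax
    # sits *inside* the layer: each row of query-key scores becomes a
    # set of attention weights summing to 1.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        w = softmax(scores)
        row = [sum(wi * v[j] for wi, v in zip(w, V))
               for j in range(len(V[0]))]
        out.append(row)
    return out

Q = [[1.0, 0.0], [0.0, 1.0]]   # toy queries
K = [[1.0, 0.0], [0.0, 1.0]]   # toy keys
V = [[1.0, 2.0], [3.0, 4.0]]   # toy values
print(attention(Q, K, V))
```

Each query attends most to the key it aligns with, so the two output rows lean toward different value vectors.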


This is an unfalsifiable claim.


> A probability distribution over sequences of consecutive tokens.

... reads like

A monad is just a monoid in the category of endofunctors.


>> is a monad really just a monoid in the category of endofunctors?

> In category theory, a monad is a structure that is similar to a monoid in the category of endofunctors. A monoid is a mathematical object that consists of a set of elements, along with a binary operation that combines the elements in a specific way. For example, the set of integers, with the operation of addition, forms a monoid.

> In the category of endofunctors, a monad is a similar structure, but it is defined in terms of functions instead of elements. An endofunctor is a function that takes a mathematical object, called the "domain," and maps it to another mathematical object, called the "codomain." For example, the function that maps a set to its power set (the set of all subsets of the set) is an endofunctor.

> A monad in the category of endofunctors is a structure that consists of an endofunctor, along with two natural transformations that satisfy certain conditions. A natural transformation is a function that maps one endofunctor to another, in a way that is consistent with the structure of the domain and codomain of the endofunctors. The two natural transformations that are part of a monad are called the "unit" and the "multiplication" natural transformations.

> The unit natural transformation is a function that maps an object in the domain of the endofunctor to the endofunctor itself. The multiplication natural transformation is a function that maps the endofunctor to itself, in a way that is consistent with the structure of the domain and codomain of the endofunctor. The unit and multiplication natural transformations must satisfy certain conditions in order for the structure to be a monad.

> Overall, a monad in the category of endofunctors is a structure that is similar to a monoid in the category of sets. It consists of an endofunctor, along with two natural transformations that satisfy certain conditions. This structure is useful in category theory and related fields, because it allows us to abstract away from the specific details of the endofunctors and focus on the general properties of the structure.

pretty good explanation, actually


But how can a probability distribution over sequences of consecutive tokens create new things?

Like, I saw the other day it created C code that generates Lisp code that generates Pascal code. Is this based entirely on a previous creation?


It doesn't create anything new. It creates things that look new.

The code examples are perfect case studies: they don't actually work. They aren't just slightly wrong, they're completely nonsensical.

Another example is "is <number> prime?", it can't answer things like that, and it will make up something that may or may not be accurate.

The model has no concept of what is true or false, it's essentially trying to predict what is the most likely token to come next.

It seems to know stuff because the knowledge comes from the dataset, hence techniques like zero-shot, few-shot and prompt-based learning.


> It doesn't create anything new. It creates things that look new.

This is not technically true. It can and does create things that are new. There are lots of new poems and jokes right here in this thread. I asked it, for example, to give me its top 10 reasons why Bigfoot knocks on camper trailers, and one of its answers was "because it likes to play with its food." I did a lot of searching to try to find this joke out there on the internet, and could not. I've also had it create Weird Al style songs for a variety of things, and it does great.

If these aren't new creations, I'm not sure what your threshold is for creating something new. In a sense I can see how you can say that it only "looks" new, but surely the essays generated by students worldwide mostly only "look" new, too...


ChatGPT created a poem to cheer up my sick girlfriend. I wrote a bit about how she feels, what she has (just the flu), and what I did to cheer her up. ChatGPT created a decent poem which exactly fitted my description but was a bit dramatic; she's not dying, just tired of being sick. I asked ChatGPT to create a less dramatic version that rhymes more, and ChatGPT just did it. Amazing. I also googled parts of it but didn't find them! This certainly counts as novel, or I would also be totally unable to create novel poems about my sick girlfriend (because I have read poems about girlfriends before?!).

A good test when dismissing these machine learning models is to check whether a human would pass your standards. I miss that aspect whenever the dismissive "they only interpolate or memorise" arguments come up. I am also quite bounded by my knowledge, by what I have seen. Describe something I have never seen and ask me to draw it, and I would fail in a quite hilarious way.

Hilariously, ChatGPT is also quite bad at arithmetic, like me. I thought that's what machines are supposed to be good at!


People solve this by getting the GPT to describe a series of computations and then running those steps externally (e.g. asking GPT what Python code to run).

That's not so different from how humans do this. When we need to add or multiply, we switch from freeform thought to executing the maths programs that were uploaded into our brains at school.
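The external-computation idea can be sketched very simply: instead of trusting the model's freeform answer, have it emit an arithmetic expression and evaluate that expression deterministically. The restricted evaluator below is a hypothetical sketch, not any particular product's implementation.

```python
import ast
import operator

# Map AST operator nodes to their functions; anything else is rejected,
# so the model can only trigger plain arithmetic.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def evaluate(expr: str) -> float:
    # Parse the expression and walk the tree, computing exactly;
    # a language model's guess at the answer is never consulted.
    def walk(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp):
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

print(evaluate("123 * 456 + 7"))  # 56095, computed, not predicted
```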


If I recall correctly, in his paper on whether machines could think, Turing gives an imaginary dialogue with a computer trying to pass as a human (what we later came to call the Turing test) where the judge poses an arithmetic problem, and the computer replies after a pause of 30 seconds — with the wrong answer.


That joke is a great example of why the creativity is surprising.

A human might have a thought process that starts with the idea that people are food for Bigfoot, and then connects that to the phrase "playing with your food".

But GPT generates responses word by word. And it operates at a word (token) level, rather than thinking about the concepts abstractly. So it starts with "Because it likes to play" which is a predictable continuation that could end in many different ways. But it then delivers the punchline of "with its food".

Was it just a lucky coincidence that it found an ending to the sentence that paid off so well? Or is the model so sophisticated that it can suggest the word "play" because it can predict the punchline related to "food"?


I think what you are saying is just not true in the case of GPT-style LLMs. The output is not just single-word generation one token at a time. It is indeed taking into account the entire structure, preceding structures, and to a certain extent the abstractions inherent to the structure throughout the model. Just because it tokenizes input doesn't mean it is seeing things word by word or outputting word by word. Transformers are not just fancy LSTMs. The whole point of transformers is that they take the input in parallel, where RNNs are sequential.


It seems I'd gotten the wrong impression of how it works. Do you have any recommendations for primers on GPT and similar systems? Most content seems to be either surface level or technical and opaque.


No. You got the right impression. It is indeed doing "next token prediction" in an autoregressive way, over and over again.

The best source would be the GPT-3 paper itself: https://paperswithcode.com/method/gpt-3


I wish someone would pass it the entirety of an IQ test. I bet it would score around 100, since it does seem to get some logic questions wrong.


Well, since it is a text-only AI, it could only attempt the VIQ part of a Wechsler-style IQ test, since the PIQ part requires understanding image abstractions (arrangements, block design, matrices of sequences, etc).

I know there were some deep learning papers on how to train a model to pass the PIQ portion without human-coded heuristics (because, you could easily write a program to solve such questions if you knew ahead of time the format of the questions). I don't remember the outcomes however.


It got 52% on an SAT exam. Better than most people.


I have seen a score of 83 on twitter


Interesting, but I wonder how it has the ability to combine those, i.e. creating a song in a KJV/SpongeBob style, or creating code that writes code that writes code.


“create a song in spongebob style” will be cut into tokens which are roughly syllables (out of 50257 possible tokens), and each token is converted to a list of 12288 numbers. Each token always maps to the same list, called its embedding; the conversion table is called the token embedding matrix. Two embeddings a short distance apart correspond to similar concepts.

Then each token’s embedding is roughly multiplied with a set of matrices called an “attention head” that yield three lists: query, key, value, each of 128 numbers behaving somewhat like a fragment of an embedding. We then take the query lists for the past 2048 tokens, and multiply each with the key lists of each of those 2048 tokens: the result indicates how much one token influences another. Each token’s value list gets multiplied by that, so that the output (which is a fragment of an embedding associated with that token, as a list of 128 numbers) is somewhat proportional to the value lists of the tokens that influence it.

We compute 96 attention heads in parallel, so that we get 128×96 = 12288 numbers, which is the size of the embedding we had at the start. We then multiply each with weights, sum the result, pass it through a nonlinear function; we do it 49152 times. Then we do the same again with other weights, but only 12288 times, so that we obtain 12288 numbers, which is what we started with. This is the feedforward layer. Thanks to it, each fragment of a token’s embedding is modified by the other fragments of that token’s embedding.

Then we pass that output (a window of 2048 token embeddings, each of 12288 numbers) through another multi-attention head, then another feedforward layer, again. And again. And again. 96 times in total.

Then we convert the output to a set of 50257 numbers (one for each possible next token) that give the probability of that token being the next syllable.

The token embedding matrix, multi-head attention weights, etc. have been learned by computing the gradient of the cross-entropy (i.e. roughly, the negative log-likelihood of the correct next syllable) of the model’s output with respect to each weight in the model, and nudging the weights towards lower loss.

So really, it works because there is a part of the embedding space that knows that a song is lyrical, and that a part of the attention head knows that sponge and bob together represent a particular show, and that a part of the feedforward layer knows that this show is near “underwater” in the embedding space, and so on.


Nobody really knows, because the model is too large and complex to really analyze.


It doesn't create anything new.

Who does? This is nothing but a "God of the Gaps" argument in reverse.


Sounds like you are thinking of language models in isolation, working in closed-book mode. That is just the default, it doesn't need to be how they are used in practice.

Do you know language models can use external tools, such as a calculator? They just need to write <calc>23+34=</calc> and the result "57" gets automatically added. Likewise, they can run <search>keyword</search> and get up-to-date snippets of information. They could write <work>def is_prime(x): ... print(is_prime(57))</work> and get the exact answer.
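A toy version of that tag-based pattern is easy to sketch: scan the model's output for tool tags and splice in computed results. The tag format follows the comment above; the dispatch logic is hypothetical.

```python
import re

def run_tools(text: str) -> str:
    # Replace <calc>expr=</calc> spans in model output with computed
    # results, so arithmetic comes from the tool, not the model.
    def calc(match):
        expr = match.group(1).rstrip("=")
        # restrict to digits and arithmetic operators before evaluating
        if not re.fullmatch(r"[0-9+\-*/(). ]+", expr):
            return match.group(0)   # leave unrecognised spans untouched
        return str(eval(expr))
    return re.sub(r"<calc>(.*?)</calc>", calc, text)

print(run_tools("The sum is <calc>23+34=</calc>."))  # The sum is 57.
```

A real system would dispatch `<search>` and `<work>` tags similarly, to a retrieval backend and a sandboxed interpreter.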

I think the correlation pattern in language is enough to do real work, especially when fortified with external resources. Intelligence is most likely a property of language, culture and tools, not of humans and neural networks.


I've been using it to write code for my business. It's often not perfect, but usually you can say fix bug XX in the code you gave me and it works.


The model also really loves stock phrases and platitudes.


“As a large language model trained by OpenAI, I do not have personal preferences or emotions. My primary function is to provide accurate and informative responses to questions based on the data I have been trained on. I am not capable of experiencing emotions or using stock phrases or platitudes.”


If it gives you broken code, you can tell it to fix the code and it often will


Sometimes it will, sometimes it won't. The point is that it's "random", it has no way to tell truth from falsity.

Language models are unsuitable for anything where the output needs to be "correct" for some definition of "correct" (code, math, legal advice, medical advice).

This is a well-known limitation that doesn't make those systems any less impressive from a technical point of view.


How can this interface be useful as a search engine replacement if the answers are often incorrect?

Can we fix it?

Because earlier today it told me that George VI was currently king of England. And I asked it a simple arithmetic question, which it got subtly wrong. And it told my friend there were a handful of primes less than 1000.

Everyone’s talking about it being a Google replacement. What’s the idea? That we train it over time by telling it when things are wrong? Or is the reality that these types of language models will only be useful for generating creative output?


there are plenty of google queries that return incorrect answers, and they've been operating for decades


It's not the same.

If you ask a chat interface a question and it says "this is true", that's very different from a search engine containing a list of results where one of them might be untrue.

For one thing, you can look at all the results and take a majority vote, etc. Second, you can look at the source to see if it's trustworthy.


Doctors are often not totally correct, but they're useful.


It absolutely replies to “is <number> prime” with the correct answer.


Also "Why is <number> interesting?" is an interesting question to ask. It finds something interesting about most integers, and falls back to giving you a good rational approximation for 'uninteresting' reals.


> It finds something interesting about most integers

Every integer is interesting. "[...] if there exists a non-empty set of uninteresting natural numbers, there would be a smallest uninteresting number – but the smallest uninteresting number is itself interesting because it is the smallest uninteresting number [...]" (https://en.wikipedia.org/wiki/Interesting_number_paradox)


A few days ago, I got to thinking about this. On a theoretical level every integer is interesting due to that, but on a practical level there's not much point calling a number "interesting" if the only property giving it that quality is being the first uninteresting number larger than n-1.

One way of checking what makes a number interesting is seeing which sequences it appears in in the OEIS[1]. Interesting properties (squares, primes, palindromes, friendly numbers, etc) all have the first hundred or so numbers with those properties in a list. So I downloaded the dataset of all sequences from OEIS, and iterated through the integers to find the first one that is not in any sequence in the OEIS.

So, I present the smallest uninteresting number: 20,067. Every number before it is in at least one of the sequences - for example, 20,066 is 'a partial sum for a coordination sequence for the "pcu-i" 3D uniform tiling'. It is the first number that has no interesting properties other than trivial ones (like being a multiple of 3). It is the first uninteresting number, but one could say that makes it meta-interesting rather than being interesting in and of itself.

[1] https://oeis.org/

Edit: and after doing all that work I saw that someone else has already done so and put it in the wikipedia article. Oh well, at least I learned about set processing in Python.
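The set-processing idea can be sketched like this; the toy dictionary below stands in for the full OEIS "stripped" dump (which, if I recall correctly, lists each sequence's initial terms on one line), so the answer here is for the toy data only, not the real 20,067.

```python
# Toy stand-in for the OEIS data: sequence ID -> initial terms.
toy_oeis = {
    "A000040": [2, 3, 5, 7, 11],      # primes
    "A000290": [0, 1, 4, 9, 16],      # squares
    "A005843": [0, 2, 4, 6, 8, 10],   # even numbers
}

# Union of every number that appears in any sequence.
seen = set()
for terms in toy_oeis.values():
    seen.update(terms)

# Smallest non-negative integer appearing in no sequence.
n = 0
while n in seen:
    n += 1
print(n)
```

With the real dump, `seen` holds every term of every catalogued sequence, and the loop walks up to the first integer the OEIS has never mentioned.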


is 4021 prime

No, 4021 is not a prime number. A prime number is a whole number greater than 1 that is divisible only by 1 and itself. 4021 is divisible by 17, which is a number other than 1 and itself, so it is not considered to be a prime number.


For the curious, 4021 is not divisible by 17.

I guess I only tried the first few single digit primes. Fair enough!
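A quick trial-division check confirms both points: 4021 is not divisible by 17, and it is in fact prime, so the model's confident answer was wrong twice over.

```python
import math

def is_prime(n: int) -> bool:
    # Deterministic trial division; fine for small n.
    if n < 2:
        return False
    for d in range(2, math.isqrt(n) + 1):
        if n % d == 0:
            return False
    return True

print(4021 % 17)        # 9, so not divisible by 17
print(is_prime(4021))   # True
```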


> But how can a probability distribution over sequences of consecutive tokens create new things?

If you start a sentence with a few words, think about the probability of what the next word might be. Imagine a vector (list) with a probability for every single other word in the language, proper nouns included. This is a huge list, and the probabilities of almost everything are near zero. If you take the very highest-probability word, you'll get a fairly predictable thing. But if you start taking things a little lower down the probability list, you start to get what amounts to "creativity" but is actually just applied statistics plus randomness. (The parameter that controls how strongly the sampling favours the highest-probability words is called the "temperature", and it is usually tunable in these models.)

When you also consider the fact that it has a lot of knowledge about how the world works, and that this knowledge gets factored into the relative probabilities, you have true creativity. Creativity is, after all, just trying a lot of random thoughts and throwing out the ones that are too impractical.
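Temperature sampling can be sketched in a few lines (toy logits, hypothetical values): divide the scores by the temperature before the softmax, then sample. Low temperature concentrates the distribution on the top word; high temperature flattens it, admitting less likely words.

```python
import math
import random

def sample_next(logits, temperature=1.0):
    # Softmax over temperature-scaled logits, then a weighted draw.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for 3 candidate words
random.seed(0)
cold = [sample_next(logits, temperature=0.1) for _ in range(100)]
hot  = [sample_next(logits, temperature=5.0) for _ in range(100)]
print(cold.count(0), hot.count(0))
```

At temperature 0.1 the top word is chosen nearly every time; at temperature 5.0 the three candidates are drawn almost uniformly.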

Some models, such as LaMDA, will actually generate multiple random responses, and run each of those responses through another model to determine how suitable the response is based on other criteria such as how on-topic things are, and whether it violates certain rules.

> Is this based on an entirely previous creation?

Yes, it's based entirely on its knowledge of basically everything in the world. Basically just like us, except we have personal volition and experience to draw from, and the capability to direct our own experiments and observe the results.


It turns out that human intelligence has left a detailed imprint in humanity’s written artifacts, and predicting the structure of this imprint requires something similar (perhaps identical, if we extrapolate out to “perfect prediction”) to human intelligence.

Not only that, but the imprint is also amenable to gradient descent, possessing a spectrum from easy- and difficult-to-predict structures.


You're forgetting to mention that you need to have a model that is capable of learning that probability distribution.



