
> so at some point the activations must be encoding some sort of overall plan for the upcoming sentences

This isn't obviously the case. Compare this "intelligent designer" view with evolution: there was no prior plan for rabbits. To create the appearance of design, it's sufficient that sequential steps are simply probabilistically modulated by prior ones.

Consider a continuation of "the cat...": a distribution over all possible next words suffices to create the illusion of a plan. Suppose "the cat sat...", then "on...", "the...", etc. each follow from the training data.
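
To make that concrete, here's a toy sketch (bigram counts invented purely for illustration) of how per-word conditionals alone can produce a "planned-looking" sentence:

    # Toy bigram model: each word is sampled only from P(next | previous word).
    # The counts are made up for illustration; there is no lookahead or plan.
    import random

    bigram = {
        "the": {"cat": 0.7, "mat": 0.3},
        "cat": {"sat": 0.8, "chased": 0.2},
        "sat": {"on": 1.0},
        "chased": {"the": 1.0},
        "on":  {"the": 1.0},
    }

    def continue_from(word, steps=5):
        out = [word]
        while len(out) <= steps and out[-1] in bigram:
            dist = bigram[out[-1]]
            out.append(random.choices(list(dist), list(dist.values()))[0])
        return " ".join(out)

    print(continue_from("the"))  # e.g. "the cat sat on the mat"

Each step consults only the previous word, yet the output reads as if the whole sentence had been intended.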

I think there's a strong argument against trying to model entire sentences, precisely because the system isn't modelling semantics: one should expect accuracy to drop off a cliff if there is no actual plan. I.e., predicting "sat on the mat" from "cat" shouldn't be a valid prediction, because there is an infinite number of possible continuations, against which any single whole-sentence guess is terrible (e.g., what about "chased the mouse", etc.?). The space of all possible sentences continuing from "the cat" is infinite, with much of that space actually useful; whereas the number of next words is very small, very finite, and many of them not useful.

The only reason "the cat sat..", "the cat sat on..." seems reasonable is that each sequential word can be modulated by the prompt so as to appear planned.



The modelling is advanced enough that you can't fundamentally distinguish it from (lossy, limited) planning in the way you're describing.

If the KQV activations didn't encode information about likely future token sequences, then a transformer couldn't empirically outperform Markov text generators.
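
For context, "KQV" refers to the query/key/value attention inside a transformer. A bare-bones, single-head, causal version (a generic sketch, not any particular model's implementation) looks like:

    # Scaled dot-product attention with a causal mask: each position's
    # output mixes the value vectors of itself and EARLIER positions,
    # weighted by how well its query matches their keys.
    import numpy as np

    def causal_attention(Q, K, V):            # each of shape (seq_len, d)
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)         # pairwise similarities
        future = np.triu(np.ones(scores.shape, bool), k=1)
        scores[future] = -1e9                 # mask out future positions
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)         # row-wise softmax
        return w @ V                          # contextualised activations

Whether these activations "encode a plan" or merely a sufficient statistic for the next-token distribution is exactly the point under dispute.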


No one is spending $10-50M building a Markov text model of everything ever digitised; if they did, its performance would approach a basic LLM's.

Though, more simply, you can take any LLM and rephrase it as a Markov model. All algorithms that model conditional probability are equivalent in this sense; you can even unpack a NN as a kNN model or a decision tree.
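
As a sketch of that rephrasing (with a hypothetical next_token_dist standing in for any trained LLM):

    # An LLM viewed as a Markov chain whose state is the entire context
    # window: the transition kernel is just P(next token | state).
    # `next_token_dist` is hypothetical, standing in for any such model.
    def markov_step(state):                  # state: tuple of tokens
        dist = next_token_dist(state)        # kernel P(. | state)
        token = max(dist, key=dist.get)      # greedy transition
        return state + (token,)              # the new Markov state

    # Iterating markov_step is a walk on states: state_{t+1} depends only
    # on state_t, with no computation referencing any future state.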

They all model 'planning' in the same way: P(C|A, B) is a 'plan' for C following A, B. There is no model of P("A B C" | "A B"). Literally, at inference time, no computation whatsoever is performed to anticipate any future prediction -- this follows trivially from the mathematical formalism (which no one seems to want to understand); you can also see it empirically: inference time per token is constant regardless of the continuation.

The reason 'the cat sat...' is completed by 'on the mat' is that P(on | the cat sat...), P(the | the cat sat on...), and P(mat | the cat sat on the...) are each maximal.
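
Spelled out with invented probabilities, that chain of conditionals looks like:

    # Each step maximises a single conditional; "mat" is never evaluated
    # while "on" is being chosen. All probabilities are made up.
    P = {
        ("the","cat","sat"):            {"on": 0.9, "down": 0.1},
        ("the","cat","sat","on"):       {"the": 0.95, "a": 0.05},
        ("the","cat","sat","on","the"): {"mat": 0.8, "sofa": 0.2},
    }
    seq = ("the", "cat", "sat")
    while seq in P:
        seq += (max(P[seq], key=P[seq].get),)
    print(" ".join(seq))  # the cat sat on the mat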

Why it's maximal is not in the model at all, nor in the data. It's in the data-generating process, i.e., us. It is we who arranged text with these frequencies, and we did so because the phrase is a popular one for academic demonstrations (and so on).

As ever, people attribute to "the data", or worse, to "the LLM", properties it doesn't have. Rather, it replays the data to us, and we suppose the LLM must have the property that generated this data originally. Nope.

Why did the tape recorder say "the cat sat on the mat"? What, on the tape or in the recorder, made "mat" the right word? Surely the tape must have planned the word...


>Why it's maximal is not in the model at all, nor the data

>It replays the data to us and we suppose the LLM must have the property that generated this data originally.

So to clarify, what you're saying is that under the hood, an LLM is essentially just performing a search for similar strings in its training data and regurgitating the most commonly found one?

Because that is demonstrably not what's happening. If this were 2019 and we were talking about GPT-2 it would be more understandable, but SoTA LLMs can in-context learn and translate entire languages which aren't in their training data.

Also, re: inference time: when you give transformers more compute for an individual token, they perform better: https://openreview.net/forum?id=ph04CRkPdC



