Hacker News | thomasahle's comments

We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.

I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.


To encourage female participation and representation. Most people think it would be good for chess long-term to have a larger female player base.

Most players actually peak in strength around age 35 [1].

But Carlsen has been number one for more time than any player, save Kasparov [2]:

- Kasparov 255 months at number 1

- Carlsen 188

- Karpov 102

- Fischer 54

Bonus nuance: Carlsen has the longest unbroken run at number one, 174 consecutive rating lists

[1]: https://en.chessbase.com/post/the-age-related-decline-in-che...

[2]: https://en.wikipedia.org/wiki/List_of_FIDE_chess_world_numbe...


On the topic of lecture notes, I can really recommend Scott Aaronson's quantum information lecture notes: https://www.scottaaronson.com/qclec.pdf

Maybe they just checked with a compiler and got the same code?

> This matters because (1) the world cannot be modeled anywhere close to completely with language alone

LLMs being "Language Models" means they model language, it doesn't mean they "model the world with language".

On the contrary, modeling language requires you to also model the world, but that's in the hidden state, and not using language.


Let's be more precise: LLMs have to model the world from an intermediate tokenized representation of the text on the internet. Most of this text is natural language, but to allow for e.g. code and math, let's say "tokens" to keep it generic, even though in practice, tokens mostly tokenize natural language.

LLMs can only model tokens, and tokens are produced by humans trying to model the world. Tokenized models are NOT the only kinds of models humans can produce (we can have visual, kinaesthetic, tactile, gustatory, and all sorts of sensory, non-linguistic models of the world).

LLMs are trained on tokenizations of text, and most of that text is humans attempting to translate their various models of the world into tokenized form. I.e. humans make tokenized models of their actual models (which are still just messy models of the world), and this is what LLMs are trained on.

So, do "LLMs model the world with language"? Well, they are constrained in that they can only model the world that is already modeled by language (generally: tokenized). So the "with" here is vague. But patterns encoded in the hidden state are still patterns of tokens.

Humans can have models that are much more complicated than patterns of tokens. Non-LLM models (e.g. models connected to sensors, such as those in self-driving vehicles, and VLMs) can use more than simple linguistic tokens to model the world, but LLMs are deeply constrained relative to humans, in this very specific sense.


I don't get the importance of the distinction really. Don't LLMs and Large non-language Models fundamentally work kind of similarly underneath? And use similar kinds of hardware?

But I know very little about this.


You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models. It's the so-called latent space, and people who focus on next-token prediction completely miss the point that all the interesting thinking takes place in abstract world-model space.

> You are correct: the token representation gets abstracted away very quickly and is then identical for textual or image models.

This is mostly incorrect, unless you mean "they both become tensor / vector representations (embeddings)". But these vector representations are not comparable.

E.g. take a VLM with a frozen dual-backbone architecture: say, a vision transformer encoder trained on images and an LLM encoder backbone pre-trained in the usual LLM way. Even if you design this architecture so the embedding vectors produced by each encoder have the same shape, to be combined via another component (e.g. some unified transformer), it will not be the case that, e.g., the cosine similarity between an image embedding and a text embedding is a meaningful quantity (it will just be random nonsense). The representations from each backbone are not identical, and the semantic structure of each space is almost certainly very different.
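
A toy illustration of that last point, as a sketch rather than a real VLM: random projections stand in for the frozen backbones, so the names and dimensions below are made up. Even though both encoders emit vectors of the same shape, the cosine similarity between them carries no cross-modal meaning unless the spaces were aligned by joint training (e.g. a CLIP-style contrastive objective).

    import numpy as np

    rng = np.random.default_rng(0)
    d = 512
    # Stand-ins for two independently trained, frozen backbones: random projections
    # from each modality's feature space into embedding spaces of the same shape.
    W_vision = rng.standard_normal((2048, d))
    W_text = rng.standard_normal((768, d))

    img_emb = rng.standard_normal(2048) @ W_vision
    text_emb = rng.standard_normal(768) @ W_text

    cos = img_emb @ text_emb / (np.linalg.norm(img_emb) * np.linalg.norm(text_emb))
    print(cos)  # hovers near 0; without joint alignment the number is just noise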


They do not model the world.

They present a statistical model of an existing corpus of text.

If this existing corpus includes useful information it can regurgitate that.

It cannot, however, synthesize new facts by combining information from this corpus.

The strongest thing you could feasibly claim is that the corpus itself models the world, and that the LLM is a surrogate for that model. But this is not true either. The corpus of human produced text is messy, containing mistakes, contradictions, and propaganda; it has to be interpreted by someone with an actual world model (a human) in order for it to be applied to any scenario; your typical corpus is also biased towards internet discussions, the English language, and Western prejudices.


If we focus on base models and ignore the tuning steps after that, then LLMs are "just" token predictors. But we know that pure statistical models aren't very good at this. After all, we tried for decades to get Markov chains to generate text, and it always became a mess after a couple of words. If you tried to come up with the best way to actually predict the next token, a world model seems like an incredibly strong component. If you know what the sentence so far means, and how it relates to the world, human perception of the world and human knowledge, that makes guessing the next word/token much more reliable than just looking at statistical distributions.
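
To make the Markov-chain point above concrete, here is a toy bigram "language model" on a tiny made-up corpus (purely illustrative; the corpus and seed are arbitrary). Each next word depends only on the current word, so the output is locally plausible but globally incoherent:

    import random
    from collections import defaultdict

    corpus = ("the cat sat on the mat . the dog sat on the rug . "
              "the cat chased the dog . the dog chased the cat .").split()

    # Bigram counts: which words follow which.
    counts = defaultdict(list)
    for cur, nxt in zip(corpus, corpus[1:]):
        counts[cur].append(nxt)

    random.seed(0)
    word, out = "the", ["the"]
    for _ in range(15):
        word = random.choice(counts[word])  # sample the next word from the bigram counts
        out.append(word)
    print(" ".join(out))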

The bet OpenAI has made is that if this is the optimal final form, then given enough data and training, gradient descent will eventually build it. And I don't think that's entirely unreasonable, even if we haven't quite reached that point yet. The issues are more in how language is an imperfect description of the world. LLMs seem to be able to navigate the mistakes, contradictions and propaganda with some success, but fail at things like spatial awareness. That's why OpenAI is pushing image models and 3d world models, despite making very little money from them: they are working towards LLMs with more complete world models unchained by language.

I'm not sure if they are on the right track, but from a theoretical point I don't see an inherent fault


There's plenty of faults in this idea.

First, the subjectivity of language.

1) People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has. This context is extremely important to any form of communication and is entirely missing when you train a pure language model. The subjective experience required to parse the text is missing.

2) When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.

A world model should be as objective as possible. Using language, the most subjective form of information is a bad fit.

The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute.


> People only speak or write down information that needs to be added to a base "world model" that a listener or receiver already has

Which companies try to address with image, video and 3d world capabilities, to add that missing context. "Video generation as world simulators" is what OpenAI once called it

> When people produce text, there is always a motive to do so which influences the contents of the text. This subjective information component of producing the text is interpreted no different from any "world model" information.

Obviously you need not only a model of the world, but also of the messenger, so you can understand how subjective information relates to the speaker and the world. Similar to what humans do

> The other issue in this argument is that you're inverting the implication. You say an accurate world model will produce the best word model, but then suddenly this is used to imply that any good word model is a useful world model. This does not compute

The argument is that gradient descent is a universal optimizer for training neural networks. It will always try to find weights for the neural network that produce the "best" results on your training data, within the constraints of your architecture, training time, random chance, etc. If you give it training data that is best solved by learning basic math, with a neural architecture that is capable of learning basic math, gradient descent will teach your model basic math. Give it enough training data that is best solved with a solution that involves building a world model, and a neural network that is capable of encoding this, and gradient descent will eventually create a world model.

Of course in reality this is not simple. Gradient descent loves to "cheat" and find unexpected shortcuts that apply to your training data but don't generalize. Just because it should be principally possible doesn't mean it's easy, but it's at least a path that can be monetized along the way, and for the moment seems to have captivated investors
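
A toy version of the "training data best solved by learning basic math" point, as a deliberately trivial sketch (least-squares on a linear model, nothing LLM-specific; all names below are made up): given pairs of numbers labeled with their sums, gradient descent finds the weights that implement addition.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.uniform(-10, 10, size=(1000, 2))   # pairs of numbers
    y = X.sum(axis=1)                          # the "task": basic addition

    w = np.zeros(2)                            # linear model y_hat = X @ w
    lr = 0.01
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(X)  # gradient of the mean squared error
        w -= lr * grad

    print(w)  # converges to approximately [1, 1]: the weights that implement addition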


You did not address the second issue at all. You are inverting the implication in your argument. Whether gradient descent helps solve the language model problem or not does not help you show that this means it's a useful world model.

Let me illustrate the point using a different argument with the same structure: 1) the best professional chefs are excellent at cutting onions; 2) therefore, if we train a model to cut onions using gradient descent, that model will be a very good professional chef.

2) clearly does not follow from 1)


I think the commenter is saying that they will combine a world model with the word model. The resulting combination may be sufficient for very solid results.

Note humans generate their own non-complete world model. For example there are sounds and colors we don’t hear or see. Odors we don’t smell. Etc…. We have an incomplete model of the world, but we still have a model that proves useful for us.


> they will combine a world model with the word model.

This takes "world model" far too literally. Audio-visual generative AI models that create non-textual "spaces" are not world models in the sense the previous poster meant. I think what they meant by world model is that the vast majority of the knowledge we rely upon to make decisions is tacit, not something that has been digitized, and not something we even know how to meaningfully digitize and model. And even describing it as tacit knowledge falls short; a substantial part of our world model is rooted in our modes of action, motivations, etc., and not coupled together in simple recursive input -> output chains. There are dimensions to our reality that, before generative AI, didn't see much systematic introspection. After all, we're still mired in endless nature v. nurture debates; we have a very poor understanding of ourselves. In particular, we have extremely poor understanding of how we and our constructed social worlds evolve dynamically, and it's that aspect of our behavior that drives the frontier of exploration and discovery.

OTOH, the "world model" contention feels tautological, so I'm not sure how convincing it can be for people on the other side of the debate.


Really all you're saying is the human world model is very complex, which is expected as humans are the most intelligent animal.

At no point have I seen anyone here ask the question, "What is the minimum viable state of a world model?"

We as humans with our ego seem to state that because we are complex, any introspective intelligence must be as complex as us to be as intelligent as us. Which doesn't seem too dissimilar to saying a plane must flap its wings to fly.


Has any generative AI been demonstrated to exhibit the generalized intelligence (e.g. achieving in a non-simulated environment complex tasks or simple tasks in novel environments) of a vertebrate, or even a higher-order non-vertebrate? Serious question--I don't know either way. I've had trouble finding a clear answer; what little I have found is highly qualified and caveated once you get past the abstract, much like attempts in prior AI eras.

> e.g. achieving in a non-simulated environment complex tasks or simple tasks in novel environments

I think one could probably argue "yes", to "simple tasks in novel environments". This stuff is super new though.

Note the "Planning" and "Robot Manipulation" parts of V-JEPA 2: https://arxiv.org/pdf/2506.09985:

> Planning: We demonstrate that V-JEPA 2-AC, obtained by post-training V-JEPA 2 with only 62 hours of unlabeled robot manipulation data from the popular Droid dataset, can be deployed in new environments to solve prehensile manipulation tasks using planning with given subgoals. Without training on any additional data from robots in our labs, and without any task-specific training or reward, the model successfully handles prehensile manipulation tasks, such as Grasp and Pick-and-Place with novel objects and in new environments.


There is no real bar any more for generalized intelligence. The bars that existed prior to LLMs have largely been met. Now we’re in a state where we are trying to find new bars, but there are none that are convincing.

ARC-AGI 2 private test set is one current bar that a large number of people find important and will be convincing to a large amount of people again if LLMs start doing really well on it. Performance degradation on the private set is still huge though and far inferior to human performance.

> It cannot, however, synthesize new facts by combining information from this corpus.

That would be like saying studying mathematics can't lead to someone discovering new things in mathematics.

Nothing would ever be "novel" if studying the existing knowledge could not lead to novel solutions.

GPT 5.2 Thinking is solving Erdős Problems that had no prior solution - with a proof.


The Erdős problem was solved by interacting with a formal proof tool, and the problem was trivial. I also don't recall if this was the problem someone had already solved but not reported, but that does not matter.

The point is that the LLM did not model maths to do this; it made calls to a formal proof tool that did model maths, and was essentially working as the step function of a search algorithm, iterating until it found the zero of the function.

That's clever use of the LLM as a component in a search algorithm, but the secret sauce here is not the LLM but the middleware that operated both the LLM and the formal proof tool.

That middleware was the search tool that a human used to find the solution.

This is not the same as a synthesis of information from the corpus of text.


  It cannot, however, synthesize new facts by combining information from this corpus.
Are we sure? Why can't the LLM use tools, run experiments, and create new facts like humans?

Then the LLM is not actually modelling the world, but using other tools that do.

The LLM is not the main component in such a system.


So do we expect real world models to just regurgitate new facts from their training data?

Regurgitating facts kind of assumes it is a language model, as you're assuming a language interface. I would assume a real "world model" or digital twin to be able to reliably model relationships between phenomena in whatever context is being modeled. Validation would probably require experts in whatever thing is being modeled to confirm that the model captures phenomena to some standard of fidelity. Not sure if that's regurgitating facts to you -- it isn't to me.

But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.


  But I don't know what you're asking exactly. Maybe you could specify what it is you mean by "real world model" and what you take fact-regurgitating to mean.
You said this:

  If this existing corpus includes useful information it can regurgitate that. It cannot, however, synthesize new facts by combining information from this corpus.
So I'm wondering if you think world models can synthesize new facts.

A world model can be used to learn something about the real system. I said synthesize because, in the context that LLMs work in (using a corpus to generate sentences), that is what that would look like.

Why can’t an LLM run experiments to synthesize new facts?

That's not what synthesis is.

But that small semantic note aside, if an LLM is used to trigger other tools to find new facts, then those other tools are modeling the "world" or a particular domain. Alternatively you could say that the system as a whole, that the LLM is a part of, models the "world" or a particular domain.


Does it matter how the new fact is acquired?

If an LLM uses a calculator to come up with an answer, does it make it worse than a model that can inference the answer without using a tool?


They do model the world. Watch Nobel Prize winner Hinton, or let's admit that this is more of a religious question than a technical one.

They model the part of the world that (linguistic models of the world posted on the internet) try to model. But what is posted on the internet is not IRL. So, to be glib: LLMs trained on the internet do not model IRL, they model talking about IRL.

His point is that human language and the written record is a model of the world, so if you train an LLM you're training a model of a model of the world.

That sounds highly technical if you ask me. People complain if you recompress music or images with lossy codecs, but when an LLM does that suddenly it's religious?


A model of a model of X is a model of X, albeit extra lossy.

An LLM has an internal linguistic model (i.e. it knows token patterns), and that linguistic model models humans' linguistic models (a stream of tokens) of their actual world models (which involve far, far more than linguistics and tokens, such as logical relations beyond mere semantic relations, sensory representations like imagery and sounds, and, yes, words and concepts).

So LLMs are linguistic (token pattern) models of linguistic models (streams of tokens) describing world models (more than tokens).

It thus does not in fact follow that LLMs model the world (as they are missing everything that is not encoded in linguistic semantics).


At this point, anyone claiming that LLMs are "just" language models isn't arguing in good faith. LLMs are a general purpose computing paradigm. LLMs are circuit builders: the converged parameters define pathways through the architecture that pick out specific programs. Or as Karpathy puts it, LLMs are a differentiable computer[1]. Training LLMs discovers programs that reproduce the input sequence well. Tokens can represent anything, not just words. Roughly the same architecture can generate passable images, music, or even video.

[1] https://x.com/karpathy/status/1582807367988654081


If it's an LLM it's a (large) language model. If you use ideas from LLM architecture in other non-language models, they are not language models.

But it is extremely silly to say that "large language models are language models" is a bad faith argument.


No, it's extremely silly to use the incidental name of a thing as an argument for the limits of its relevance. LLMs were designed to model language, but that does not determine the range of their applicability, or even the class of problems they are most suited for. It turns out that LLMs are a general computing architecture. What they were originally designed for is incidental. Any argument that starts off "but they are language models" is specious out of the gate.

Sorry, but using "LLM" when you mean "AI" is a basic failure to understand simple definitions, and also is ignoring the meat of the blog post and much of the discussion here (which is that LLMs are limited by virtue of being only / mostly trained on language).

Everything you are saying is either incoherent because you actually mean "AI" or "transformer", or is just plain wrong, since not all problems can be solved using, e.g., single-channel, recursively applied transformers, as I mention elsewhere here: https://news.ycombinator.com/item?id=46948612. The design of LLMs absolutely determines the range of their applicability, and the class of problems they are most suited for. This isn't even a controversial take; lots of influencers and certainly most serious researchers recognize the fundamental limitations of the LLM approach to AI.

You literally have no idea what you are talking about and clearly do not read or understand any actual papers where these models are developed, and are just repeating simplistic metaphors from blog posts, and buying into marketing.


In this case this is not so. The primary model is not a model at all, and the surrogate has bias added to it. It's also missing any way to actually check the internal consistency of statements or otherwise combine information from its corpus, so it fails as a world model.

There's a graveyard of 100s of papers with "approximate near linear time attention."

They always hope the speed increase makes up for the lower quality, but it never does. The quadratic time seems inherent to the problem.

Indeed, there are lower bounds showing that sub n^2 algorithms can't work: https://arxiv.org/pdf/2302.13214


The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

I.e., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.
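
Not the paper's factorized algorithm, but a quick sanity check of the underlying idea (a sketch with arbitrary sizes and roughly unit-scale scores): replace exp with its degree-P Taylor polynomial and watch the elementwise error of the attention weights shrink as P grows.

    import math
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 64, 32
    Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    S = Q @ K.T / np.sqrt(d)                   # attention scores, roughly unit scale

    exact = np.exp(S)
    exact /= exact.sum(axis=1, keepdims=True)  # standard softmax attention weights

    for P in (2, 4, 8):
        # exp(s) replaced by its degree-P Taylor polynomial around 0
        approx = sum(S**p / math.factorial(p) for p in range(P + 1))
        approx /= approx.sum(axis=1, keepdims=True)
        print(P, np.abs(approx - exact).max())  # max elementwise error drops as P grows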


> approximately the same magnitude

and they really do mean that: their results show +/- 1 on log10 plots.


I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) are all showing `log_10(|Y - \dot{Y}|)` as having a median of ~-3 (difference of 0.001) and a max of ~1.5 (difference of 0.035), and this is with only 3 Taylor terms.

Oh, you're right, that is a misread on my part; the appendix charts don't say that. I think they're just useless then, though? Since they're reporting absolute error (on a log10 scale), we can't assess the relative error needed to check the 'within an order of magnitude' claim in the text.

It converges on conventional attention as P goes up

The method is more general. The github repository's first example is with eight Taylor terms (P = 8).

I'm clueless about this whole thing, but from my EE education I remember that in general:

Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similarly to Fourier series with Gibbs oscillations).

Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids, and going beyond cubic approximations tends to make errors worse (as expressed in SNR).


> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.


It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.

Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)

It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.

> N tokens looking at N tokens is quadratic

Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.

Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b).

Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
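
A quick numpy check of the convolution claim above (a sketch; sizes are arbitrary): the FFT route incorporates every pairwise contribution yet runs in O(n log n), and it matches the direct O(n^2) convolution to rounding error.

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = rng.standard_normal(1000), rng.standard_normal(1000)

    direct = np.convolve(a, b)            # schoolbook convolution, O(n^2)

    m = len(a) + len(b) - 1               # length of the full linear convolution
    via_fft = np.fft.irfft(np.fft.rfft(a, m) * np.fft.rfft(b, m), m)  # O(n log n)

    print(np.max(np.abs(direct - via_fft)))  # agrees to floating-point rounding error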


One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral

\iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy

which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).

My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.


That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there'll be some problems it's no longer able to solve. It can no longer approximate any arbitrary function that needs perfect memory over the sequence.

This brings me back to DSP class; man, learning about the FFT was eye-opening.

Convolution is a local operation.

Attention is a global operation.


Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.

But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute force search to find an approximate solution. That's not optimal by any stretch of the imagination though.

Neural nets are structured as matrix multiplication, yet, they are universal approximators.

You're missing the non-linear activations.

I'm not saying whether the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.


Integer multiplication x * y can be trivially done in O(k): k = log₂(min(x, y)). This is because we can do addition in constant time, adding all bits in parallel.

By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier


Multiplication can be sub-quadratic using Karatsuba's algorithm.
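
For reference, a minimal sketch of the Karatsuba trick (base-10 splitting for readability, not optimized): three recursive multiplications per split instead of four gives roughly O(n^1.585) digit operations.

    def karatsuba(x, y):
        # Multiply non-negative integers with three recursive multiplications per split.
        if x < 10 or y < 10:
            return x * y
        half = max(len(str(x)), len(str(y))) // 2
        a, b = divmod(x, 10**half)        # x = a * 10^half + b
        c, d = divmod(y, 10**half)        # y = c * 10^half + d
        ac = karatsuba(a, c)
        bd = karatsuba(b, d)
        ad_plus_bc = karatsuba(a + b, c + d) - ac - bd  # the third multiplication
        return ac * 10**(2 * half) + ad_plus_bc * 10**half + bd

    assert karatsuba(12345678, 87654321) == 12345678 * 87654321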

That is the poster's point!

Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?

Well, for multiplication, complexity is defined in terms of the number of digits/bits directly. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in reality it doesn't matter, since there is no value in anything over float16 for machine learning.

Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.

I think you meant commutative.

Attention also has some specific properties.

And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347


That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.

As the error via linear approximation approaches similar magnitude as numerical error via quadratic computation, don’t the two start becoming comparable in practice?

I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.

Numerical error, in fact, may be a key factor as to why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:

https://www.anthropic.com/engineering/effective-context-engi...


That website says nothing about numerical error potentially causing context rot.

As far as I know, there is no widely accepted explanation for context rot.

Numerical error in long sequences of query-key dot-products may be a key factor.


That should be easy to test: test a 16 bit model on various benchmarks, once with fresh context and once with the context filled up with irrelevant tokens. Record the relative performance degradation, and then do the same for a quantized model. Compare whether the quantized model has a significant relatively larger performance drop from context rot. If so, numerical error should be the cause.

I think any kind of innovation here will have to take advantage of some structure inherent to the problem, like eliminating attention in favour of geometric structures like Grassman flows [1].

[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428


Right - e.g., if you're modeling a physical system it makes sense to bake in some physics - like symmetry.

Indeed, and I think natural language and reasoning will have some kind of geometric properties as well. Attention is just a sledgehammer that lets us brute force our way around not understanding that structure well. I think the next step change in AI/LLM abilities will be exploiting this geometry somehow [1,2].

[1] GrokAlign: Geometric Characterisation and Acceleration of Grokking, https://arxiv.org/abs/2510.09782

[2] The Geometry of Reasoning: Flowing Logics in Representation Space, https://arxiv.org/abs/2506.12284


Unlike previous efforts, which typically stop at a low-order (e.g., quadratic) term of the Taylor expansion, this work derives a succinct, efficient, parallel general method for approximating attention with any number of Taylor terms, to arbitrary precision.

The github repository's first toy example is with 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical with current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.)

Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.

It's worth mentioning also that the mathematical techniques introduced by this work are likely of interest for other applications besides attention.


Dumb question: is the quadratic time complexity for training, inference, or both?

Both, with caveats. The attention computation is fundamentally quadratic: for every token in the sequence, you're doing a computation that has to compute over every other token in the sequence. So it's O(N) per token, O(N^2) for the whole sequence.

The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.

That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.

In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.

The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
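
To make the caching point concrete, here is a minimal single-head sketch (plain numpy, identity projections, no batching; not any particular model's implementation): with a K/V cache, each new token costs O(t) work at step t, so the whole sequence is still O(N^2), but a cached prefix never has to be recomputed.

    import numpy as np

    def attend(q, K, V):
        # q: (d,); K, V: (t, d). Softmax over the t cached positions: O(t) per new token.
        scores = K @ q / np.sqrt(q.shape[-1])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ V

    rng = np.random.default_rng(0)
    d = 16
    K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
    for x in rng.standard_normal((100, d)):   # a stream of 100 token embeddings
        q = k = v = x                         # identity "projections", purely for illustration
        K_cache = np.vstack([K_cache, k])     # the cache grows by one row per token
        V_cache = np.vstack([V_cache, v])
        out = attend(q, K_cache, V_cache)     # O(t) at step t; O(N^2) over the full sequence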


Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).

Dumb question: Can inference be done in a reverse pass? Outputs predicting inputs?

Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.

The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it makes the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.

The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.

And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.


I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.

Not as trivially as in the forwards direction; unsurprisingly, information is lost, but it works better than you might expect. See for example https://arxiv.org/pdf/2405.15012

Sounds like a great premise for a sci-fi short story.

Sci-fi ? You mean historical fiction!

I agree with the fundamental idea that attention must be O(N^2), with the exception of the recent DeepSeek sparse attention approach (DSA), which does not escape N^2 but attempts to lower constant factors so much that N^2 becomes more acceptable, by creating a much faster layer that predicts high-scoring tokens.

Yeah, this(-ish): there are shipping models that don't eliminate N^2 (if a model can repeat your code back with edits, it needs to reference everything somehow), but still change the picture a lot when you're thinking about, say, how resource-intensive a long-context coding session is.

There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.


You can't stuff O(N) bits in O(1) space, so any scheme that purports, in general to do constant-time inference on unbounded context is snake oil, like a perpetual motion machine. Every such scheme must decay somehow. All you can do is choose how it decays.

Right, not to "defend" the paper's claims, but it seems to be more like tuning how the leaky bucket leaks, using lossy compression to try to preserve some measure of coherency? Seems to turn on the fixed size summary.

The 2023 paper, even if true, doesn't preclude the 2026 paper from being true; it just sets constraints on how a faster attention solution would have to work.

I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.

It really isn't sub N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that still has complexity O(N^2). So overall it still has the same complexity; just with a smaller constant factor [1]

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556


Okay, then let's see whether we are going to see real linear architectures, like Gated DeltaNet or Mamba-3, in some larger models. I don't believe there is a "lower bound" which states that those can never get to (or exceed) the real-world performance of quadratic attention. (Perfect recall in unrealistic needle-in-haystack tests doesn't count.)

I'm also sure that some kind of linear architecture is possible. After all, humans don't have N^2 perfect recall either.

I agree. This is from the paper mill, for the paper mill.

I tried Prism, but it's actually a lot more work than just using Claude Code. The latter allows you to "vibe code" your paper with no manual interaction, while Prism actually requires you to review every change.

I actually think Prism promotes a much more responsible approach to AI writing than "copying from chatgpt" or the likes.


> And also plagiarism, when you claim authorship of it.

I don't actually mind putting Claude as a co-author on my github commits.

But for papers there are usually so many tools involved. It would be crowded to include each of Claude, Gemini, Codex, Mathematica, Grammarly, Translate etc. as co-authors, even though I used all of them for some parts.

Maybe just having a "tools used" section could work?


I suspect the parent post was concerned about plagiarizing the author of training data; not software tools.


I'm always surprised that Python doesn't have TUI libraries as good as JavaScript's or Rust's. With the amount of CLI tooling written in Python, you'd think it had better libraries than any other language.


Blessed was a decent one iirc:

https://github.com/jquast/blessed

One reason for the lack of python might be the timing of the TUI renaissance, which I think happened (is happening?) alongside the rise of languages like Go and Rust.


it has, but python being single threaded (until recently) didn't make it an attractive choice for CLI tools.

example: `ranger` is written in python and it's freaking slow. in comparison, `yazi` (Rust) has been a breeze.

Edit: Sorry, I meant GIL, not single thread.


> it has, but python being single threaded (until recently) didn't make it an attractive choice for CLI tools.

You probably mean GIL, as python has supported multi threading for like 20 years.

Idk if ranger is slow because it is written in python. Probably it is the specific implementation.


> You probably mean GIL

They also probably mean TUIs, as CLIs don't do the whole "draw every X" thing (and usually aren't interactive); that's basically what sets TUIs apart from CLIs.


Even my CC status line script enjoyed a 20x speed improvement when I rewrote it from python to rust.


It’s surprising how quickly the bottleneck starts to become python itself in any nontrivial application, unless you’re very careful to write a thin layer that mostly shells out to C modules.


Textual looks really nice, but I usually make web apps so I haven’t tried it for anything serious:

https://textual.textualize.io/


Textual is cool, but it's maintained by a single guy, and the roadmap hasn't been updated since 2023: https://textual.textualize.io/roadmap/


Textual is A++. Feels a bit less snappy than Ink, but it makes up in all things with its immense feature-set. Seriously fun building apps of all kinds with this lib.


I’m using Textual for my TUI needs, it’s very decent.

