There's a graveyard of 100s of papers with "approximate near linear time attenti...

jcarreiro · 2026-02-04T16:47:25 1770223645

The paper says that:

> In practice, we find that four Taylor terms (P = 4) suffice for recovering conventional attention with elementwise errors of approximately the same magnitude as Float16 resolution, acceptable for many AI applications.

i.e., the claim is that this method reproduces the results of conventional attention, up to float16 numerical precision.

kristjansson · 2026-02-04T19:06:23 1770231983

> approximately the same magnitude

and they really do mean that, their results show +/- 1 on log10 plots.

cptroot · 2026-02-05T00:50:14 1770252614

I don't think this is an accurate characterization of the error magnitude? Their error plots (from appendix 3) all show `log_10(|Y - \dot{Y}|)` with a median of ~-3 (difference of 0.001) and a max of ~-1.5 (difference of 0.035), and this is with only 3 Taylor terms.
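For reference, the arithmetic behind those numbers can be checked in a few lines. This is only a sketch: it assumes the maximum is read as log10 ≈ -1.5 (which is what a difference of roughly 0.03 implies), and the variable names are illustrative rather than taken from the paper.

```python
import math

# Values read off the comment above; the plots themselves are not reproduced here.
median_log10_error = -3.0
max_log10_error = -1.5

median_abs_error = 10 ** median_log10_error   # 0.001
max_abs_error = 10 ** max_log10_error         # about 0.032

# float16 stores 10 explicit mantissa bits, so the gap between 1.0 and the
# next representable value is 2**-10.
float16_epsilon = 2.0 ** -10                  # 0.0009765625
float16_epsilon_log10 = math.log10(float16_epsilon)  # about -3.01
```

So a median absolute error near 10^-3 is comparable to float16 spacing for values near 1, while the reported maximum is more than an order of magnitude larger.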

kristjansson · 2026-02-06T18:01:19 1770400879

Oh you're right, that is a misread on my part; the appendix charts don't say that. I think they're just useless then, though? Since they're reporting absolute error (on a log10 scale), we can't assess the relative error to compare against the 'within an order of magnitude' claim in the text.

energy123 · 2026-02-04T17:29:32 1770226172

It converges on conventional attention as P goes up
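A scalar illustration of that convergence, as a sketch only: it truncates the Taylor series of exp(x), which is the function softmax applies to each score, and measures the worst-case error over a hypothetical score range of [-0.5, 0.5]. It assumes P counts the number of terms kept, and it is not the paper's code.

```python
import math

def taylor_exp(x, num_terms):
    """Sum of the first num_terms terms of the Taylor series of exp(x) at 0."""
    total, term = 0.0, 1.0
    for p in range(num_terms):
        total += term          # term equals x**p / p! at this point
        term *= x / (p + 1)    # advance to x**(p+1) / (p+1)!
    return total

def max_abs_error(num_terms, xs):
    """Largest absolute gap between the truncated series and math.exp on xs."""
    return max(abs(taylor_exp(x, num_terms) - math.exp(x)) for x in xs)

# Hypothetical range of scores; larger magnitudes need more terms.
xs = [i / 100 - 0.5 for i in range(101)]
errors = {P: max_abs_error(P, xs) for P in (2, 4, 8)}
```

On this range the error shrinks quickly as terms are added, but how many terms are needed depends on how large the scores can get.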

fheinsen · 2026-02-04T17:13:04 1770225184

The method is more general. The github repository's first example is with eight Taylor terms (P = 8).

torginus · 2026-02-05T00:45:34 1770252334

I'm clueless about this whole thing, but from my EE education I remember that in general:

Taylor approximations converge slowly in terms of error if the function they're representing is discontinuous (the error disappears quadratically if continuous, linearly if not), and they tend to create highly energetic swings near discontinuities (similarly to Fourier series with Gibbs oscillations).

Moreover, Taylor series are inherently nonlinear, and much of the mathematical toolset around AI assumes general linearity (cue linear algebra), with the exception of sigmoids, and going beyond cubic approximations tends to make errors worse (as expressed in SNR).

kristjansson · 2026-02-04T16:15:41 1770221741

> self-attention is efficiently computable to arbitrary precision with constant cost per token

This paper at least aspires to reproduce 'true' attention, which distinguishes it from many of the others. TBD if it's successful in that.

energy123 · 2026-02-04T16:20:56 1770222056

It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.

SeanAnderson · 2026-02-04T16:49:39 1770223779

Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)

logicchains · 2026-02-04T16:47:42 1770223662

It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.

orlp · 2026-02-04T18:59:54 1770231594

> N tokens looking at N tokens is quadratic

Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.

Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]

This can be computed in linear time as sum(a) * sum(b).

Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.
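A quick check of that identity, written as a small sketch with made-up integer inputs (the function names are illustrative):

```python
import random

def pairwise_sum_quadratic(a, b):
    """Add up a[i] * b[j] over every pair (i, j): len(a) * len(b) multiplications."""
    total = 0
    for x in a:
        for y in b:
            total += x * y
    return total

def pairwise_sum_linear(a, b):
    """Same value via distributivity: one pass over each list."""
    return sum(a) * sum(b)

random.seed(0)
a = [random.randint(-9, 9) for _ in range(200)]
b = [random.randint(-9, 9) for _ in range(300)]
quadratic_result = pairwise_sum_quadratic(a, b)
linear_result = pairwise_sum_linear(a, b)
```

Integers are used so the two results agree exactly; with floats they would agree only up to rounding.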

CrazyStat · 2026-02-04T23:11:54 1770246714

One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral

\iiint f(x, y, z) dx dy dz = \int [\int g(x, y) dx]*[\int h(y, z) dz] dy

which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).
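A sketch of why the factoring helps, assuming the integrand splits as f(x, y, z) = g(x, y) * h(y, z) as the formula implies. The particular g and h, the unit-cube domain, and the midpoint rule are placeholders, not the functions from the dissertation.

```python
import math

def g(x, y):
    """Hypothetical factor depending on x and y."""
    return math.exp(-x * y)

def h(y, z):
    """Hypothetical factor depending on y and z."""
    return math.cos(y + z)

def midpoints(n):
    """Midpoints of n equal subintervals of [0, 1]."""
    step = 1.0 / n
    return [(i + 0.5) * step for i in range(n)]

def integrate_naive(n):
    """Midpoint rule over the unit cube: n**3 evaluations of g * h."""
    pts = midpoints(n)
    total = 0.0
    for y in pts:
        for x in pts:
            for z in pts:
                total += g(x, y) * h(y, z)
    return total / n ** 3

def integrate_factored(n):
    """Same rule, but the x and z sums are done separately for each y: about 2 * n**2 evaluations."""
    pts = midpoints(n)
    total = 0.0
    for y in pts:
        inner_x = sum(g(x, y) for x in pts)
        inner_z = sum(h(y, z) for z in pts)
        total += inner_x * inner_z
    return total / n ** 3

naive_value = integrate_naive(20)
factored_value = integrate_factored(20)
```

Both functions compute the same quadrature sum, so they agree up to floating-point rounding; only the amount of work differs.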

My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.

logicchains · 2026-02-04T21:52:43 1770241963

That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there'll be some problems it's no longer able to solve. It can no longer approximate any arbitrary function that needs perfect memory over the sequence.

anvuong · 2026-02-04T20:07:03 1770235623

This brings me back to DSP class, man learning about FFT was eye-opening.

noosphr · 2026-02-05T00:02:18 1770249738

Convolution is a local operation.

Attention is a global operation.

naasking · 2026-02-04T18:42:46 1770230566

Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.

logicchains · 2026-02-04T21:53:45 1770242025

It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.

naasking · 2026-02-04T22:52:28 1770245548

But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute force search to find an approximate solution. That's not optimal by any stretch of the imagination though.

direwolf20 · 2026-02-04T23:52:14 1770249134

Neural nets are structured as matrix multiplication, yet, they are universal approximators.

noosphr · 2026-02-05T00:05:27 1770249927

You're missing the non-linear activations.

hellohello2 · 2026-02-04T17:22:34 1770225754

I'm not saying if the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.

nine_k · 2026-02-05T02:02:30 1770256950

Integer multiplication x * y can be trivially done in O(k): k = log₂(min(x, y)). This is because we can do addition in constant time, adding all bits in parallel.

By combining many more adding units, we can do (fixed-size) multiplication in constant time, too: https://en.wikipedia.org/wiki/Dadda_multiplier

sifar · 2026-02-05T02:16:15 1770257775

Multiplication can be sub-quadratic using Karatsuba's algorithm.
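For anyone who hasn't seen it, a minimal sketch of Karatsuba on Python integers. The base-case threshold of 16 and the bit-based split are arbitrary choices for illustration; the point is that each level makes three half-size multiplications instead of four, which gives roughly O(n^1.585) digit operations.

```python
def karatsuba(x, y):
    """Multiply non-negative integers using three recursive half-size products."""
    if x < 16 or y < 16:
        return x * y  # small inputs: fall back to the built-in product
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask   # x = x_hi * 2**half + x_lo
    y_hi, y_lo = y >> half, y & mask   # y = y_hi * 2**half + y_lo
    low = karatsuba(x_lo, y_lo)
    high = karatsuba(x_hi, y_hi)
    # (x_lo + x_hi)(y_lo + y_hi) - low - high equals x_lo*y_hi + x_hi*y_lo
    cross = karatsuba(x_lo + x_hi, y_lo + y_hi) - low - high
    return (high << (2 * half)) + (cross << half) + low

big_a = 12345678901234567890
big_b = 98765432109876543210
product = karatsuba(big_a, big_b)
```

No information is lost: the result is exact, even though not every digit pair is multiplied explicitly.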

nullc · 2026-02-06T06:05:32 1770357932

That is the poster's point!

actionfromafar · 2026-02-04T18:00:25 1770228025

Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?

hellohello2 · 2026-02-04T19:03:27 1770231807

Well, for multiplication, complexity is defined in terms of the number of digits/bits directly. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in reality it doesn't matter, since there is no value in anything over float16 for machine learning.

logicchains · 2026-02-04T22:00:52 1770242452

Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.

direwolf20 · 2026-02-04T23:45:32 1770248732

I think you meant commutative.

Attention also has some specific properties.

And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347

oasisaimlessly · 2026-02-04T17:48:19 1770227299

That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.

fheinsen · 2026-02-04T15:59:50 1770220790

As the error via linear approximation approaches similar magnitude as numerical error via quadratic computation, don’t the two start becoming comparable in practice?

I ask because in practice, for inference, attention is typically computed with low-precision (4-bit, 8-bit, 16-bit) floats.

Numerical error, in fact, may be a key factor as to why quadratic attention, in practice, exhibits context rot as context gets longer, analogous to an RNN:

https://www.anthropic.com/engineering/effective-context-engi...

cubefox · 2026-02-05T06:19:08 1770272348

That website says nothing about numerical error potentially causing context rot.

fheinsen · 2026-02-05T13:56:17 1770299777

As far as I know, there is no widely accepted explanation for context rot.

Numerical error in long sequences of query-key dot-products may be a key factor.

cubefox · 2026-02-05T16:59:29 1770310769

That should be easy to test: test a 16 bit model on various benchmarks, once with fresh context and once with the context filled up with irrelevant tokens. Record the relative performance degradation, and then do the same for a quantized model. Compare whether the quantized model has a significant relatively larger performance drop from context rot. If so, numerical error should be the cause.

naasking · 2026-02-04T16:19:31 1770221971

I think any kind of innovation here will have to take advantage of some structure inherent to the problem, like eliminating attention in favour of geometric structures like Grassmann flows [1].

[1] Attention Is Not What You Need, https://arxiv.org/abs/2512.19428

findalex · 2026-02-04T16:57:06 1770224226

Right - e.g., if you're modeling a physical system it makes sense to bake in some physics - like symmetry.

naasking · 2026-02-04T17:34:38 1770226478

Indeed, and I think natural language and reasoning will have some kind of geometric properties as well. Attention is just a sledgehammer that lets us brute force our way around not understanding that structure well. I think the next step change in AI/LLM abilities will be exploiting this geometry somehow [1,2].

[1] GrokAlign: Geometric Characterisation and Acceleration of Grokking, https://arxiv.org/abs/2510.09782

[2] The Geometry of Reasoning: Flowing Logics in Representation Space, https://arxiv.org/abs/2506.12284

fheinsen · 2026-02-07T15:07:16 1770476836

Unlike previous efforts, which typically stop at a low-order (e.g., quadratic) term of the Taylor expansion, this work derives a succinct, efficient, parallel general method for approximating attention with any number of Taylor terms, to arbitrary precision.

The github repository's first toy example is with 8 Taylor terms, applied to a context of 1B tokens, with attention computed over 1K heads per token. (Note that applying the quadratic formulation to 1B tokens, each with 1K heads, is not practical with current hardware, because it would require computing 1K attention matrices, each with 1B×1B dot-product scores.)

Like every other proposed method, this one must be tested too. If it works, AI service providers who ignore it will find themselves at a disadvantage.

It's worth mentioning also that the mathematical techniques introduced by this work are likely of interest for other applications besides attention.
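To make the general idea concrete, here is a generic textbook-style sketch of Taylor-feature linear attention. It is not the paper's formulation or the repository's code. It assumes non-causal, single-head attention with unscaled scores q·k, and it uses the identity (q·k)^p = <q^{⊗p}, k^{⊗p}>, so a truncated series for exp(q·k) becomes an inner product of feature vectors. All function names are illustrative.

```python
import itertools
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def taylor_features(x, num_terms):
    """Concatenate the flattened tensor powers of x, scaled by 1/sqrt(p!), for p < num_terms."""
    feats = []
    for p in range(num_terms):
        scale = 1.0 / math.sqrt(math.factorial(p))
        for idx in itertools.product(range(len(x)), repeat=p):
            feats.append(scale * math.prod(x[i] for i in idx))
    return feats

def softmax_attention(Q, K, V):
    """Reference: quadratic attention with exact exp(q.k) weights."""
    out = []
    for q in Q:
        w = [math.exp(dot(q, k)) for k in K]
        norm = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) / norm
                    for c in range(len(V[0]))])
    return out

def taylor_linear_attention(Q, K, V, num_terms):
    """Linear in sequence length: keys and values are summarised once, then reused per query."""
    phi_K = [taylor_features(k, num_terms) for k in K]
    n_feat, d_v = len(phi_K[0]), len(V[0])
    S = [[0.0] * d_v for _ in range(n_feat)]  # sum over j of phi(k_j) v_j^T
    z = [0.0] * n_feat                         # sum over j of phi(k_j)
    for fk, v in zip(phi_K, V):
        for f in range(n_feat):
            z[f] += fk[f]
            for c in range(d_v):
                S[f][c] += fk[f] * v[c]
    out = []
    for q in Q:
        fq = taylor_features(q, num_terms)
        norm = dot(fq, z)
        out.append([sum(fq[f] * S[f][c] for f in range(n_feat)) / norm
                    for c in range(d_v)])
    return out

# Toy data with small entries so that |q.k| stays well below 1.
random.seed(0)
def rand_rows(n, d):
    return [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(n)]

Q, K, V = rand_rows(6, 3), rand_rows(6, 3), rand_rows(6, 2)
exact = softmax_attention(Q, K, V)
approx = taylor_linear_attention(Q, K, V, num_terms=8)
max_err = max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))
```

The catch in this naive version is the feature size, which grows like d^p per term; the sketch makes no attempt to compress it, and larger scores would need more terms for the same accuracy.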

cobolexpert · 2026-02-04T16:05:46 1770221146

Dumb question: is the quadratic time complexity for training, inference, or both?

dave_universetf · 2026-02-04T17:31:27 1770226287

Both, with caveats. The attention computation is fundamentally quadratic: for every token in the sequence, you're doing a computation that has to compute over every other token in the sequence. So it's O(N) per token, O(N^2) for the whole sequence.

The big mitigation for this is that in causal transformers (i.e. all the chatbot type applications, where each token is only allowed to see tokens before it), you're running inference repeatedly on the same prefix in order to grow it by one token at a time. So if you cache the computations for tokens 0..N-1, on each inference pass you only have to compute O(N) for the newly added token at the end of the sequence.

That's why caching (and caching charges) appear so prominently everywhere in the pricing of inference.

In practice, caching is most beneficial at inference time, because you typically have relatively long conversations that start with the same cacheable prefix (the system prompt). At training time the same optimization can apply, but you're typically not pushing the same prefixes through the model repeatedly so you end up paying the quadratic cost more often.

The quadratic cost of attention is the fundamental compute bottleneck for transformer architectures, which is why there's research like this trying to find shortcuts in computing attention, as well as research into completely new primitives to replace attention (e.g. SSM, which is O(N) on a cold cache and O(1) on a warm cache).
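A minimal sketch of the caching idea described above, assuming a single head and taking queries, keys, and values as given (a real model would also compute them with learned projections, across many layers). `attend`, `KVCache`, and `full_causal_attention` are illustrative names, not any library's API.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over the given keys and values: O(len(keys))."""
    scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [sum(w * val[c] for w, val in zip(weights, values)) / norm
            for c in range(len(values[0]))]

class KVCache:
    """Stores keys and values of tokens seen so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        """Add the new token, then attend over the whole cache: O(N) for this token."""
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

def full_causal_attention(Q, K, V):
    """Recompute every position from scratch: O(N^2) for N tokens."""
    return [attend(Q[i], K[:i + 1], V[:i + 1]) for i in range(len(Q))]

# Toy inputs, made up for illustration.
Q = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]
K = [[0.5, 0.1], [-0.3, 0.2], [0.0, 0.6]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached = [cache.step(q, k, v) for q, k, v in zip(Q, K, V)]
recomputed = full_causal_attention(Q, K, V)
```

The outputs are identical; the cache only avoids redoing work for the prefix. Summed over a whole generation the cost is still quadratic, but each new token pays O(N) rather than a full recomputation.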

omneity · 2026-02-04T16:07:42 1770221262

Attention is calculated during the forward pass of the model, which happens in both inference (forward only) and training (forward & backward).

SubiculumCode · 2026-02-04T16:41:44 1770223304

Dumb question: Can inference be done in a reverse pass? Outputs predicting inputs?

dave_universetf · 2026-02-04T18:30:48 1770229848

Strictly speaking: no. The "forward pass" terminology does not imply that there exists a "reverse pass" that does the same kind of computation. Rather, it's describing two different kinds of computation, and the direction they occur in.

The forward pass is propagating from inputs to outputs, computing the thing the model was trained for. The reverse/backwards pass is propagating from outputs back to inputs, but it's calculating the gradients of parameters for training (roughly: how much changing each parameter in isolation affects the output, and whether it makes the output closer to the desired training output). The result of the "reverse pass" isn't a set of inputs, but a set of annotations on the model's parameters that guide their adjustment.

The computations of the forward pass are not trivially reversible (e.g. they include additions, which destroy information about the operand values). As a sibling thread points out, you can still probabilistically explore what inputs _could_ produce a given output, and get some information back that way, but it's a lossy process.

And of course, you could train a "reverse" model, one that predicts the prefix of a sequence given a suffix (trivially: it's the same suffix prediction problem, but you train it on reversed sequences). But that would be a separate model trained from scratch on that task, and in that model the prefix prediction would be its forward pass.

direwolf20 · 2026-02-04T23:54:55 1770249295

I do want to see ChatGPT running upwards on my screen now, predicting earlier and earlier words in a futile attempt to explain a nonsense conclusion. We could call it ChatJeopardy.

gpm · 2026-02-04T17:02:21 1770224541

Not as trivially as the forwards direction, unsurprisingly information is lost, but better than you might expect. See for example https://arxiv.org/pdf/2405.15012

root_axis · 2026-02-04T16:47:40 1770223660

Sounds like a great premise for a sci-fi short story.

anu7df · 2026-02-04T17:14:49 1770225289

Sci-fi ? You mean historical fiction!

antirez · 2026-02-04T20:48:37 1770238117

I agree with the fundamental idea that attention must be O(N^2), with the exception of the recent DeepSeek sparse attention approach (DSA), which does not escape N^2 but attempts to lower the constant factor so much that N^2 is more acceptable, by creating a much faster layer that predicts high-scoring tokens.

twotwotwo · 2026-02-05T06:58:07 1770274687

Yeah, this(-ish): there are shipping models that don't eliminate N^2 (if a model can repeat your code back with edits, it needs to reference everything somehow), but still change the picture a lot when you're thinking about, say, how resource-intensive a long-context coding session is.

There are other experiments where model designers mix full-attention layers with limited-memory ones. (Which still doesn't avoid N^2, but if e.g. 3/4 of layers use 'light' attention, it still improves efficiency a lot.) The idea is the model can still pull information from far back in context, just not in every layer. Use so far is limited to smaller models (maybe it costs too much model capability to use at the high end?) but it seems like another interesting angle on this stuff.

quotemstr · 2026-02-04T23:23:28 1770247408

You can't stuff O(N) bits in O(1) space, so any scheme that purports, in general, to do constant-time inference on unbounded context is snake oil, like a perpetual motion machine. Every such scheme must decay somehow. All you can do is choose how it decays.

polynomial · 2026-02-05T02:58:56 1770260336

Right, not to "defend" the paper's claims, but it seems to be more like tuning how the leaky bucket leaks, using lossy compression to try to preserve some measure of coherency? Seems to turn on the fixed size summary.

WhitneyLand · 2026-02-04T17:53:23 1770227603

The 2023 paper even if true doesn’t preclude the 2026 paper from being true, it just sets constraints on how a faster attention solution would have to work.

cubefox · 2026-02-04T15:43:39 1770219819

I think DeepSeek V3.2 is sub n^2, but it clearly performs quite well, refuting the alleged lower bounds in the paper.

andy12_ · 2026-02-04T16:13:14 1770221594

It really isn't sub N^2. The main attention is only O(Nk), but only thanks to a lightning indexer that still has complexity O(N^2). So overall it still has the same complexity; just with a smaller constant factor [1]

> DSA reduces the core attention complexity of the main model from O(L^2) to O(Lk), where k (<< L) is the number of selected tokens. Although the lightning indexer still has a complexity of O(L^2), it requires much less computation compared with MLA in DeepSeek-V3.1-Terminus

[1] https://arxiv.org/pdf/2512.02556

cubefox · 2026-02-04T18:03:16 1770228196

Okay, then let's see whether we are going to see real linear architectures, like Gated DeltaNet or Mamba-3, in some larger models. I don't believe there is a "lower bound" which states that those can never get to (or exceed) the real-world performance of quadratic attention. (Perfect recall in unrealistic needle-in-haystack tests doesn't count.)

andy12_ · 2026-02-04T22:07:47 1770242867

I'm also sure that some kind of linear architecture is possible. After all, humans don't have N^2 perfect recall either.

wetwater · 2026-02-04T22:47:15 1770245235

I agree. This is from the paper mill, for the paper mill.