
Tokens/s is entirely determined by memory bandwidth. TTFT (time to first token) is compute bound.
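The bandwidth-bound claim can be made concrete with a back-of-envelope sketch: at batch size 1, every generated token streams the full set of weights from memory, so bandwidth divided by model size caps the decode rate. A minimal sketch, with hypothetical numbers (the function and figures below are illustrative, not from any real benchmark):

```python
# Back-of-envelope decode speed ceiling, assuming every generated token
# must stream all model weights from memory (batch size 1, no batching
# or caching tricks). All numbers are hypothetical.
def est_tokens_per_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    model_gb = params_b * bytes_per_param  # e.g. 7B params at 2 bytes (fp16) = 14 GB
    return bandwidth_gb_s / model_gb

# A 7B fp16 model on ~1 TB/s of memory bandwidth:
print(round(est_tokens_per_s(1000, 7, 2), 1))  # 71.4 tokens/s ceiling
```

This is why quantization (fewer bytes per parameter) and speculative decoding (more tokens per weight pass) both raise tokens/s without more bandwidth.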


This is broadly correct for currently favoured software, but in computer science optimization problems you can usually trade off compute for memory and vice versa.

For example just now from the front page: https://news.ycombinator.com/item?id=47242637 "Speculative Speculative Decoding"

Or this: https://openreview.net/forum?id=960Ny6IjEr "Low-Rank Compression of Language Models Via Differentiable Rank Selection"


Good point on speculative decoding techniques. I'd forgotten about them, and they're good. Would love to see some of these get into llama.cpp and friends, but it does require somebody to come up with a distilled draft model.
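The core idea is that a cheap draft model proposes several tokens, the expensive target model verifies them in one pass, and only the agreeing prefix is kept, so the output matches what the target alone would produce. A toy greedy sketch of that idea (both "models" here are stand-in functions, not real LLMs):

```python
# Toy greedy speculative decoding sketch. draft_next and target_next are
# hypothetical stand-ins for a cheap draft model and an expensive target
# model; each maps a token context to the next greedy token.
def draft_next(ctx):
    return (sum(ctx) * 7 + len(ctx)) % 10

def target_next(ctx):
    # Mostly agrees with the draft, but diverges when sum(ctx) % 3 == 0.
    return (sum(ctx) * 7 + len(ctx)) % 10 if sum(ctx) % 3 else (sum(ctx) + 1) % 10

def speculate(ctx, k=4):
    # Draft model cheaply proposes k tokens.
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft_next(c)
        proposal.append(t)
        c.append(t)
    # Target model verifies: keep the agreeing prefix, then emit one
    # target token, so we always make progress and the result is
    # identical to pure greedy decoding with the target model.
    accepted, c = [], list(ctx)
    for t in proposal:
        if target_next(c) != t:
            break
        accepted.append(t)
        c.append(t)
    accepted.append(target_next(c))
    return accepted
```

The speedup comes from the target verifying k draft tokens in one (parallel, compute-heavy) pass instead of k sequential (bandwidth-heavy) passes, which is exactly the compute-for-bandwidth trade being discussed.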

But low-rank compression isn't trading off compute for memory - it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little added compute. Same goals as quantization. If there were some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating-point values look a lot like Gaussian noise, making them extremely difficult to compress.
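Both points can be demonstrated in a few lines: truncated SVD shows the lossy size-for-quality trade, and running a general-purpose lossless compressor over noise-like float bytes shows why the lossless route is hard. A sketch using a random stand-in weight matrix (real trained weights have more structure, so the SVD error here is pessimistic):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight matrix

# Lossy low-rank compression: keep only the top-r singular directions.
# Storage drops from m*n to r*(m+n) numbers, and W @ x becomes two
# cheaper matmuls; the reconstruction error is the quality given up.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
r = 64
A = U[:, :r] * s[:r]  # m x r factor (columns scaled by singular values)
B = Vt[:r, :]         # r x n factor
W_approx = A @ B

print((A.size + B.size) / W.size)  # 0.5: half the parameters kept
rel_err = float(np.linalg.norm(W - W_approx) / np.linalg.norm(W))
print(rel_err)  # nonzero: the compression is lossy

# Lossless attempt: noise-like float32 bytes barely compress at all.
raw = W.tobytes()
print(len(zlib.compress(raw, 9)) / len(raw))  # close to 1.0: essentially incompressible
```

The zlib ratio near 1.0 is the "looks like Gaussian noise" point in miniature: there is almost no redundancy for an entropy coder to exploit in the raw mantissa bits.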


None of these really change the fundamental shape of the problem.


