Good point on speculative decoding techniques. I'd forgotten about them, and they're good. Would love to see some of these get into llama.cpp and friends, but it does require somebody to come up with a distilled draft model.

But low-rank compression isn't trading off compute for memory; it's just compressing the model. And critically, that's lossy compression. That's primarily a trade-off of quality for speed/size, with a little bit of added compute. Same goals as quantization. If there were some compute-intensive lossless compression of parameters, lots of people would be happy. But those floating-point values look a lot like Gaussian noise, making them extremely difficult to compress.
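To make the "looks like Gaussian noise" point concrete, here's a rough sketch using only the Python standard library. It's an illustration under assumptions, not a measurement of any real model: the "weights" are synthetic samples from `random.gauss` with a made-up standard deviation of 0.02, stored as float32, and zlib stands in for a generic lossless compressor.

```python
import random
import struct
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(payload, 9)) / len(payload)


random.seed(0)
n = 100_000

# Assumption: synthetic stand-in for model weights, not real parameters.
weights = [random.gauss(0.0, 0.02) for _ in range(n)]
noise_bytes = struct.pack(f"<{n}f", *weights)  # little-endian float32

# Baseline: the same number of float32 values, but all identical.
structured_bytes = struct.pack(f"<{n}f", *([0.02] * n))

noise_ratio = compression_ratio(noise_bytes)
structured_ratio = compression_ratio(structured_bytes)

print(f"gaussian float32:  {noise_ratio:.3f} of original size")
print(f"constant float32:  {structured_ratio:.3f} of original size")
```

The constant array collapses to almost nothing, while the Gaussian samples shrink only a little: the mantissa bits are effectively random, so a lossless compressor has little left to exploit beyond the sign/exponent bytes. Real weights aren't exactly Gaussian, so treat this as intuition rather than a bound.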