The catch you're missing is that DeepSeek did this ages ago.
They're just using MLA, which is well known to cut KV-cache size by ~90%. You know, the MLA that's used in... DeepSeek V2, DeepSeek V3, DeepSeek R1, DeepSeek V3.1, DeepSeek V3.2.
Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? DeepSeek V3.2.
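For anyone who wants the arithmetic behind that ~90% figure, here's a rough sketch. The numbers are the DeepSeek-V2 paper's config as I remember it (128 heads of dim 128, a 512-dim compressed KV latent plus a 64-dim shared decoupled RoPE key); the baseline here is full MHA, so against a GQA baseline the saving would be smaller. Back-of-envelope only, not a measurement:

```python
# KV cache per token per layer: standard MHA vs. MLA.
# Config figures are from the DeepSeek-V2 paper (illustrative, not
# a claim about any particular deployment).

n_heads  = 128   # attention heads per layer
head_dim = 128   # dimension per head
d_latent = 512   # MLA compressed KV latent dimension (c_KV)
d_rope   = 64    # decoupled RoPE key dimension, shared across heads

# Elements cached per token per layer:
mha_cache = 2 * n_heads * head_dim   # full K and V for every head
mla_cache = d_latent + d_rope        # one latent + one RoPE key

print(f"MHA: {mha_cache} elems/token/layer")          # 32768
print(f"MLA: {mla_cache} elems/token/layer")          # 576
print(f"reduction: {1 - mla_cache / mha_cache:.1%}")  # ~98.2%
```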
For a "copy Deepseek's homework" model, it's really good, preferable to DeepSeek for me (at least prior to V3.2, which I haven't been able to fully put through its paces yet). post-training really makes that much of a difference I guess
Linear attention is really bad. It's only good for benchmaxing, and it leads to a loss of valuable granularity, which you can feel in the latest DeepSeek randomly forgetting, ignoring, or "correcting" facts stated explicitly in the prompt.
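To make the "loss of granularity" complaint concrete: linear attention replaces the softmax with a feature map, so the entire prefix gets squeezed into a fixed-size running state no matter how long the context is. A minimal sketch of generic kernelized linear attention (Katharopoulos-style elu+1 feature map; this is illustrative, not DeepSeek's actual variant):

```python
import numpy as np

def phi(x):
    # A common positive feature map for linear attention: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Causal kernelized linear attention.

    q, k: (seq_len, d_k); v: (seq_len, d_v).
    The whole prefix is summarized by S (d_k x d_v) and z (d_k,),
    whose sizes never grow with seq_len -- that fixed budget is
    exactly where per-token detail gets smeared together.
    """
    d_k, d_v = q.shape[1], v.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of outer(phi(k_t), v_t)
    z = np.zeros(d_k)          # running sum of phi(k_t) (normalizer)
    out = np.zeros_like(v)
    for t in range(q.shape[0]):
        qt, kt = phi(q[t]), phi(k[t])
        S += np.outer(kt, v[t])
        z += kt
        out[t] = (qt @ S) / (qt @ z + 1e-6)
    return out

# 10,000 tokens or 10 tokens: the state is the same (d_k x d_v) matrix.
q = np.random.randn(10_000, 64)
k = np.random.randn(10_000, 64)
v = np.random.randn(10_000, 64)
print(linear_attention(q, k, v).shape)  # (10000, 64)
```

Full softmax attention keeps every past key/value around and can re-weight them per query; here everything before "now" has already been summed into S, so two facts that land in similar directions of the state can blur into each other.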