> Quants at 4 bits are generally considered good, and 8 bits are generally considered overkill
Two-year-old info that only really applies to heavily undertrained models with small tokenizers. Perplexity is a terrible metric for measuring quantization impact, and quantized models also tend to score higher than they should in benchmarks run at top_k=1, where the added randomness seems to help.
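To be concrete about why perplexity hides this: it's just exp of the mean negative log-likelihood over the eval tokens, so the handful of catastrophically wrong tokens that actually derail a generation barely move the average. A toy sketch:

```python
import math

# Perplexity = exp(mean negative log-likelihood) over the reference tokens.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# One catastrophic token in a thousand is nearly invisible in the average,
# even though a single token like that can throw a whole generation off.
print(perplexity([-0.1] * 1000))          # ~1.105
print(perplexity([-0.1] * 999 + [-8.0]))  # ~1.114
```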
In my experience it really seems to affect reliability most, which isn't often tested consistently. An fp16 model might get a question right every time, Q8 every other time, Q6 every third time, etc. In a long-form conversation this means wasting a lot of time regenerating responses when the model throws itself off and loses coherence. It also destroys knowledge that isn't very strongly ingrained, so low-learning-rate fine-tune data gets obliterated at a much higher rate. Gemma-2 specifically also loses a lot of its multilingual ability with quantization.
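This kind of reliability is cheap to measure yourself, it just takes repeated trials instead of a single benchmark pass. A rough sketch of the idea, assuming a local llama.cpp llama-server exposing its OpenAI-compatible endpoint on port 8080 (the URL, model name, question and answer check are all placeholders to adapt):

```python
# Ask the same factual question N times per quant and compare the hit rates
# across fp16 / Q8 / Q6 builds of the same model.
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local llama-server
QUESTION = "In what year did the Apollo 11 mission land on the Moon?"
EXPECTED = "1969"
N_TRIALS = 20

def ask_once() -> str:
    resp = requests.post(URL, json={
        "model": "gemma-3",  # placeholder, whatever your server is serving
        "messages": [{"role": "user", "content": QUESTION}],
        "temperature": 0.7,  # sample the way you would in a real chat
        "max_tokens": 64,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

hits = sum(EXPECTED in ask_once() for _ in range(N_TRIALS))
print(f"{hits}/{N_TRIALS} correct")
```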
I was in the Q6 camp for a long time, but these days I run as much as I can in FP16, or at least Q8, because it's worth the tradeoff in most cases.
Now, granted, it's different for cases like R1, where training is natively FP8, or for models trained with QAT; how different, I'm not sure, since we haven't had more than a few examples yet.
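For what it's worth, the core trick in QAT is simple: fake-quantize the weights in the forward pass during training so the model learns to tolerate the rounding, while gradients flow through as if the rounding weren't there (straight-through estimator). A toy PyTorch sketch of that idea, not Google's actual recipe:

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that trains against symmetric n-bit fake quantization."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.levels = 2 ** (bits - 1) - 1  # e.g. +/-7 for 4-bit

    def forward(self, x):
        scale = self.weight.abs().max() / self.levels
        q = torch.round(self.weight / scale).clamp(-self.levels, self.levels) * scale
        # Straight-through estimator: quantized weights in the forward pass,
        # but backprop treats the rounding as the identity function.
        w = self.weight + (q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = FakeQuantLinear(256, 256, bits=4)
layer(torch.randn(8, 256)).sum().backward()  # gradients reach layer.weight as usual
```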
> there’s a lack of inference framework for gemma 2 is really off the mark
I mean mainly the QAT format for Gemma 3, which surprisingly seems to be a standard GGUF this time. Last time around, Google decided llama.cpp wasn't good enough for them and half-assedly implemented their own ripoff, gemma.cpp, with basically zero usable features.
> llama.cpp support for gemma-3 is out
Yeah, testing it right now. I'm surprised it runs coherently at all given the new global attention, tbh. Every architectural change is usually followed by up to a month of buggy inference, back-and-forth patching, model reuploads and similar nonsense.
Strongly agree with the first part of your post :)
BTW in addition to the weights, it's also interesting to consider the precision of accumulation. f16 is just not enough for the large matrix sizes we are now seeing.
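A quick way to see it: a toy numpy dot product 16k elements long, accumulating the fp16 products in an fp16 register versus an fp32 one and comparing both against a float64 reference:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(16384).astype(np.float16)
b = rng.standard_normal(16384).astype(np.float16)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))  # high-precision reference

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc16 += np.float16(x * y)              # fp16 accumulator drifts
    acc32 += np.float32(x) * np.float32(y)  # fp32 accumulator stays close

print("fp16 accumulation error:", abs(float(acc16) - ref))
print("fp32 accumulation error:", abs(float(acc32) - ref))
```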
(Gemma.cpp TL here) FYI, we are a research testbed, neither full-featured nor user-centric. Some interesting things there are the fp8 weights and extremely fast matmul, especially on workstation CPUs, plus some attention to numerics.