> Quants at 4 bits are generally considered good, and 8 bits are generally considered overkill
Two-year-old info that only really applies to heavily undertrained models with small tokenizers. Perplexity is a terrible metric for measuring quantization impact, and quantized models also tend to score higher than they should in benchmarks run at top_k=1, where the added randomness seems to help.
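To be concrete about why perplexity hides this: it's just exp of the mean negative log-likelihood over the eval tokens, so the handful of catastrophically wrong tokens that actually derail a generation barely move the average. A toy sketch:

```python
import math

# Perplexity = exp(mean negative log-likelihood) over the reference tokens.
def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# One catastrophic token in a thousand is nearly invisible in the average,
# even though a single token like that can throw a whole generation off.
print(perplexity([-0.1] * 1000))          # ~1.105
print(perplexity([-0.1] * 999 + [-8.0]))  # ~1.114
```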
In my experience it really seems to affect reliability most, which isn't often tested consistently. An fp16 model might get a question right every time, Q8 every other time, Q6 every third time, etc. In a long-form conversation this means wasting a lot of time regenerating responses when the model throws itself off and loses coherence. It also destroys knowledge that isn't very strongly ingrained, so low-learning-rate fine-tune data gets obliterated at a much higher rate. Gemma-2 specifically also loses a lot of its multilingual ability with quantization.
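This kind of reliability is cheap to measure yourself, it just takes repeated trials instead of a single benchmark pass. A rough sketch of the idea, assuming a local llama.cpp llama-server exposing its OpenAI-compatible endpoint on port 8080 (the URL, model name, question and answer check are all placeholders to adapt):

```python
# Ask the same factual question N times per quant and compare the hit rates
# across fp16 / Q8 / Q6 builds of the same model.
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local llama-server
QUESTION = "In what year did the Apollo 11 mission land on the Moon?"
EXPECTED = "1969"
N_TRIALS = 20

def ask_once() -> str:
    resp = requests.post(URL, json={
        "model": "gemma-3",  # placeholder, whatever your server is serving
        "messages": [{"role": "user", "content": QUESTION}],
        "temperature": 0.7,  # sample the way you would in a real chat
        "max_tokens": 64,
    }, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

hits = sum(EXPECTED in ask_once() for _ in range(N_TRIALS))
print(f"{hits}/{N_TRIALS} correct")
```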
I was in the Q6 camp for a long time, but these days I run as much as I can in FP16, or at least Q8, because it's worth the tradeoff in most cases.
Now, granted, it's different for cases like R1, where training is natively FP8, or for models trained with QAT; how different, I'm not sure, since we haven't had more than a few examples yet.
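For what it's worth, the core trick in QAT is simple: fake-quantize the weights in the forward pass during training so the model learns to tolerate the rounding, while gradients flow through as if the rounding weren't there (straight-through estimator). A toy PyTorch sketch of that idea, not Google's actual recipe:

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Linear):
    """Linear layer that trains against symmetric n-bit fake quantization."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.levels = 2 ** (bits - 1) - 1  # e.g. +/-7 for 4-bit

    def forward(self, x):
        scale = self.weight.abs().max() / self.levels
        q = torch.round(self.weight / scale).clamp(-self.levels, self.levels) * scale
        # Straight-through estimator: quantized weights in the forward pass,
        # but backprop treats the rounding as the identity function.
        w = self.weight + (q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = FakeQuantLinear(256, 256, bits=4)
layer(torch.randn(8, 256)).sum().backward()  # gradients reach layer.weight as usual
```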
> there’s a lack of inference framework for gemma 2 is really off the mark
I mean mainly the QAT format for Gemma 3, which surprisingly seems to be a standard GGUF this time. Last time around, Google decided llama.cpp wasn't good enough for them and half-assedly implemented their own ripoff, gemma.cpp, with basically zero usable features.
> llama.cpp support for gemma-3 is out
Yeah, testing it right now. I'm surprised it runs coherently at all given the new global attention, tbh. Every architectural change is usually followed by up to a month of buggy inference, back-and-forth patching, model reuploads and similar nonsense.
Strongly agree with the first part of your post :)
BTW in addition to the weights, it's also interesting to consider the precision of accumulation. f16 is just not enough for the large matrix sizes we are now seeing.
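A quick way to see it: a toy numpy dot product 16k elements long, accumulating the fp16 products in an fp16 register versus an fp32 one and comparing both against a float64 reference:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(16384).astype(np.float16)
b = rng.standard_normal(16384).astype(np.float16)

ref = np.dot(a.astype(np.float64), b.astype(np.float64))  # high-precision reference

acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for x, y in zip(a, b):
    acc16 += np.float16(x * y)              # fp16 accumulator drifts
    acc32 += np.float32(x) * np.float32(y)  # fp32 accumulator stays close

print("fp16 accumulation error:", abs(float(acc16) - ref))
print("fp32 accumulation error:", abs(float(acc32) - ref))
```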
(Gemma.cpp TL here) FYI, we are a research testbed, neither full-featured nor user-centric. Some interesting things there are the fp8 weights and extremely fast matmul, especially on workstation CPUs, plus some attention to numerics.