
In my opinion QwQ is the strongest model that fits on a single GPU (an RTX 3090, for example, at Q4_K_M quantization, which is the standard in Ollama).
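Rough math for why that works (a sketch: the ~4.8 effective bits per weight for Q4_K_M and the 32.5B parameter count are approximations, and KV cache/context overhead comes on top):

    # back-of-the-envelope VRAM estimate for quantized weights
    def weight_vram_gib(params_billion, bits_per_weight):
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    print(round(weight_vram_gib(32.5, 4.8), 1))  # ~18.2 GiB of weights
    # leaves a few GiB of a 24 GiB RTX 3090 for KV cache and activations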


Gemma 2 27B at 4 bits would be a drooling idiot anyway; even going down to 8 bits seems to significantly lobotomize it. Qwen models are surprisingly resistant to quantization compared to most, so QwQ already pulls ahead in coherence for the same amount of VRAM.

We'll see if the quantization-aware versions are any better this time around, but I doubt any inference framework will even support them. Gemma.cpp never got a standard-compatible server API that people could actually use, and as a result it got absolutely zero adoption.


Quants at 4 bits are generally considered good, and 8 bits is generally considered overkill unless you somehow need to squeeze the last bit of performance (in terms of generation quality) out of the model. There are papers to that effect, though admittedly specific models might show divergent behavior (https://arxiv.org/abs/2212.09720).

All the above is subjective, so maybe that's true for you, but claiming there's a lack of inference frameworks for Gemma 2 is really off the mark.

Obviously Ollama supports it. So does llama.cpp. So does MLX. That's three frameworks that support quantized versions of Gemma 2.

llama.cpp support for Gemma 3 is out; the PR was merged a couple of hours after Google's announcement. Obviously Ollama supports it as well, as you can see in TFA.

I'm really curious how you arrived at those conclusions. Are we living in alternate universes?


> Quants at 4 bits are generally considered good, and 8 bits are generally considered overkill

That's two-year-old info and only really applies to heavily undertrained models with small tokenizers. Perplexity is a really poor metric for measuring quantization impact, and quantized models also tend to score higher than they should in benchmarks run with top_k=1, where the added randomness seems to help.

In my experience it mostly affects reliability, which isn't often tested consistently. An FP16 model might get a question right every time, Q8 every other time, Q6 every third time, and so on. In a long-form conversation that means wasting a lot of time regenerating responses when the model throws itself off and loses coherence. Quantization also destroys knowledge that isn't strongly ingrained, so data from low-learning-rate fine-tunes gets obliterated at a much higher rate. Gemma 2 specifically also loses a lot of its multilingual ability when quantized.
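If you want to check that kind of reliability gap yourself, here's a rough sketch assuming a local OpenAI-compatible endpoint (llama-server and Ollama both expose one); the URL, model tags, question, and the substring check are all placeholders:

    # re-ask the same question N times per quant and count how often it's right
    import requests

    URL = "http://localhost:8080/v1/chat/completions"  # placeholder endpoint
    QUESTION = "In what year did the Apollo 11 landing happen?"
    EXPECTED = "1969"

    def pass_rate(model, n=20):
        hits = 0
        for _ in range(n):
            r = requests.post(URL, json={
                "model": model,
                "messages": [{"role": "user", "content": QUESTION}],
                "temperature": 0.7,
            })
            hits += EXPECTED in r.json()["choices"][0]["message"]["content"]
        return hits / n

    for tag in ["model-fp16", "model-q8_0", "model-q4_k_m"]:  # placeholder tags
        print(tag, pass_rate(tag))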

I was in the Q6 camp for a long time; these days I run as much as I can in FP16, or at least Q8, because the tradeoff is worth it in most cases.

Now granted, it's different for cases like R1, where training is natively FP8, or with QAT; how different, I'm not sure, since we haven't had more than a few examples yet.

> claiming there’s a lack of inference framework for gemma 2 is really off the mark

I mean mainly the QAT format for Gemma 3, which surprisingly seems to be a standard GGUF this time. Last time around, Google decided llama.cpp wasn't good enough for them and half-assedly implemented their own ripoff as gemma.cpp, with basically zero usable features.
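If the QAT release really is a plain GGUF, running it should look like any other GGUF. A minimal sketch with the llama-cpp-python bindings (the filename is a placeholder, and chat-template handling for Gemma 3 depends on the build):

    # load a (hypothetical) Gemma 3 QAT GGUF via llama-cpp-python
    from llama_cpp import Llama

    llm = Llama(
        model_path="gemma-3-27b-it-qat-q4_0.gguf",  # placeholder filename
        n_ctx=8192,
        n_gpu_layers=-1,  # offload as many layers as fit
    )
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Explain QAT in one sentence."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])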

> llama.cpp support for gemma-3 is out

Yeah, testing it right now; I'm surprised it runs coherently at all given the new global attention, tbh. Every architectural change is usually followed by up to a month of buggy inference, back-and-forth patching, model reuploads, and similar nonsense.


Strongly agree with the first part of your post :) BTW in addition to the weights, it's also interesting to consider the precision of accumulation. f16 is just not enough for the large matrix sizes we are now seeing.
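A tiny illustration of the accumulation point (numpy; the vector length is arbitrary): force the accumulator to f16 and a long reduction that f32 handles trivially simply overflows:

    import numpy as np

    x = np.ones(100_000, dtype=np.float16)
    print(np.sum(x, dtype=np.float16))  # inf -- f16 saturates at 65504
    print(np.sum(x, dtype=np.float32))  # 100000.0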

(Gemma.cpp TL here.) FYI, we are a research testbed, neither full-featured nor user-centric. Some interesting things there are the fp8 weights and extremely fast matmul, especially on workstation CPUs, plus some attention to numerics.



