
For those who know speculative decoding: This is basically self-speculative decoding. It still auto-regressively feeds the predicted label sequence through the network again, and only keeps the prediction up to the point where it matches. So it will not get worse in performance but only faster (here up to 3 times, which is normal for speculative decoding).
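For readers unfamiliar with the mechanics, here is a minimal sketch of that accept/verify step, using a toy deterministic "model" of my own (not the linked work's code): the model drafts K tokens ahead, then what would be one batched forward pass checks each drafted token against ordinary one-step greedy decoding, keeping the longest matching prefix plus one corrected token.

```python
# Toy sketch of self-speculative greedy decoding (my own illustration, not
# the linked work's code). `next_token` stands in for the model's ordinary
# one-step greedy prediction; `propose_k` stands in for its multi-token
# prediction heads, which occasionally guess wrong.

def next_token(ctx):
    # Deterministic toy "model": next token from the running sum.
    return (sum(ctx) + 1) % 7

def propose_k(ctx, k):
    # Draft k tokens ahead; inject a wrong guess at every third position
    # to simulate an imperfect multi-token head.
    out, c = [], list(ctx)
    for i in range(k):
        t = next_token(c)
        if i % 3 == 2:
            t = (t + 1) % 7  # deliberately wrong draft token
        out.append(t)
        c.append(t)
    return out

def self_speculative_step(ctx, k=4):
    # Verify the draft: in a real model this loop is ONE batched forward
    # pass. Keep drafted tokens while they match one-step decoding, then
    # emit the first corrected token and stop.
    draft, accepted, c = propose_k(ctx, k), [], list(ctx)
    for t in draft:
        correct = next_token(c)
        accepted.append(correct)
        c.append(correct)
        if t != correct:
            break
    return accepted
```

Because the verifier only ever keeps what one-step greedy decoding would have produced anyway, the generated sequence is identical to plain decoding; only the number of forward passes changes.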

Due to the multi-task training, however, it can also get better. (The idea of predicting multiple targets into the future as an auxiliary loss is already quite old.)

Nice work.



The problem with speculative decoding is that there are hardly any models that support it and adding support takes extra GPU time. If speculative decoding also improves planning performance, then it will be more readily adopted.


What do you mean? Speculative decoding can be done with any auto-regressive model. Normally you use a second, much faster model to draft the next N subwords, and then you use the big model to verify in a single pass whether it would have produced the same output (or use it to rerank the draft). Evaluating N subwords in one go is much faster than doing it subword by subword; that's where the speedup comes from. Not all N subwords will necessarily match, so you may need to redo the prediction for the remaining M < N subwords, but there are many simple cases where a faster and weaker model is still accurate enough. In the extreme case where the drafted subwords are almost always wrong, it would even be slightly slower, but usually you get quite a big speedup, e.g. 3x or so.
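That "3x or so" can be made concrete with the usual back-of-the-envelope analysis from the speculative decoding literature (the numbers below are illustrative assumptions, not measurements from this thread): if each drafted subword is accepted independently with probability a, a draft of length N yields (1 - a^(N+1)) / (1 - a) tokens per target-model pass, and dividing by the relative cost of drafting gives the speedup.

```python
# Expected speedup of draft-and-verify decoding under the standard
# independence assumption (illustrative, not from the thread).

def tokens_per_pass(a, n):
    # Expected tokens produced per target-model pass when each of the n
    # drafted tokens is accepted i.i.d. with probability a (geometric
    # series; the target model always contributes at least one token).
    return (1 - a ** (n + 1)) / (1 - a)

def speedup(a, n, c):
    # c = cost of one draft-model step relative to one target-model step.
    return tokens_per_pass(a, n) / (n * c + 1)
```

With an 80% acceptance rate, drafts of length 4, and a draft model 20x cheaper than the target (c = 0.05), this gives roughly 2.8x, which is in the ballpark quoted above.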

The nice thing here is that you actually don't need another smaller model but the model itself already predicts the next N subwords.

Or maybe you mean it's not implemented in some of the common software? I'm not sure about that, but I thought it's quite a popular feature by now.


For anyone interested in exploring this, llama.cpp has an example implementation here:

https://github.com/ggerganov/llama.cpp/tree/master/examples/...


> So it will not get worse in performance but only faster

A bit confused by this statement. Speculative decoding does not decrease the performance of the model in terms of "accuracy" or "quality" of output. Mathematically, the altered distribution being sampled from is identical to the original distribution if you had just used regular autoregressive decoding. The only reason you get variability between autoregressive vs speculative is simply due to randomness.

Unless you meant performance as in "speed", in which case it's possible that speculative decoding could degrade speed (but on most inputs, and with a good selection of the draft model, this shouldn't be the case).
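The distribution-preservation claim follows from the modified rejection step used in the speculative sampling literature; a self-contained sketch (my own toy example, not code from this thread): draw a token from the draft distribution q, accept it with probability min(1, p[x]/q[x]), and on rejection resample from the renormalized residual max(0, p - q). The resulting samples follow the target distribution p exactly.

```python
import random

def spec_sample(p, q, rng):
    # Draw a token from the draft distribution q.
    x = rng.choices(range(len(q)), weights=q)[0]
    # Accept with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual distribution max(0, p - q),
    # renormalized. Overall this yields exact samples from p.
    resid = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(resid)
    return rng.choices(range(len(p)), weights=[r / z for r in resid])[0]
```

Checking empirically: with p = [0.5, 0.3, 0.2] and a deliberately mismatched draft q = [0.2, 0.5, 0.3], the sample frequencies converge to p, not q.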


I think the parent is saying the same thing as you, pointing out to folks unfamiliar with it that speculative decoding doesn't trade quality for speed.


Yes that's what I mean, speculative decoding does not decrease the performance in terms of quality. I guess my wording was confusing on this.



