My understanding is that, while all 8B parameters are loaded into memory, only about 2B of them are selected and used for each token's inference step - so tokens are produced faster because less computation is needed.

Hoping someone will correct me if that's not the right mental model!
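That's roughly the mental model behind Mixture-of-Experts routing. Below is a minimal, hypothetical sketch (the layer sizes, expert count, and top-k value are illustrative assumptions, not the actual model's architecture): every expert is allocated in memory, but a small router picks only the top-k experts for each token, so per-token compute is a fraction of the total parameter count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy Mixture-of-Experts feed-forward layer (illustrative sizes)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # All experts are allocated up front (the "all 8B loaded in memory" part).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = self.router(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)    # each token keeps top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        # Only the selected experts actually run (the "~2B active per token" part).
        for e, expert in enumerate(self.experts):
            mask = (idx == e)
            token_ids, slot = mask.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

x = torch.randn(4, 512)           # 4 tokens
print(MoELayer()(x).shape)        # torch.Size([4, 512])
```

In this sketch the memory footprint is that of all n_experts, but each token's forward pass only touches top_k of them, which is why active-parameter count (and latency), not total parameter count, governs per-token compute.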


