
Matrix-vector multiplication for the feed-forward layers accounts for most of the bandwidth as I understand things. There's not really a way to do it "better"; it's just a bunch of memory-bound dot products.

(Posting this comment in hopes of being corrected and learning something).
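To make "memory-bound" concrete, here's a back-of-the-envelope sketch in Python. The layer sizes and fp16 weights are assumptions for illustration, not anything from this thread:

    # Rough arithmetic intensity of a single feed-forward mat-vec.
    # Hypothetical layer dimensions; real models vary.
    d_model, d_ff = 4096, 16384        # assumed hidden / feed-forward widths
    dtype_bytes = 2                    # assuming fp16/bf16 weights

    flops = 2 * d_model * d_ff                   # one multiply-add per weight
    bytes_moved = d_model * d_ff * dtype_bytes   # weight traffic dominates

    print(flops / bytes_moved)                   # ~1 FLOP per byte moved

At roughly one FLOP per byte of weight traffic, the compute units sit idle waiting on memory, which is why single-stream decode speed tracks memory bandwidth rather than peak FLOPs.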



The problem is that different parts of the SoC (CPU, GPU, NPU) may not actually be able to consume all of the bandwidth available to the system as a whole. This is why you'd need to benchmark: different chips may be able to feed their cores better than others.
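A crude way to probe this on the CPU side is a large streaming copy; the array size here is an arbitrary choice, and to compare the GPU or NPU paths you'd repeat the same idea with that device's own framework:

    import time
    import numpy as np

    # Crude CPU-side bandwidth probe; ~1 GB of float32 (hypothetical size).
    n = 256 * 1024 * 1024
    src = np.ones(n, dtype=np.float32)
    dst = np.empty_like(src)

    reps = 10
    start = time.perf_counter()
    for _ in range(reps):
        np.copyto(dst, src)            # one read + one write per element
    elapsed = time.perf_counter() - start

    gb_moved = reps * 2 * src.nbytes / 1e9
    print(f"~{gb_moved / elapsed:.1f} GB/s effective copy bandwidth")

The number you get from one block of the SoC can sit well below the headline memory-controller figure, which is the point being made above.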


Ah, yeah. I guess that will become more common as we venture further into SoCs; I was just thinking "it's whatever the memory controller can do".


Training is performed in parallel with batching and is more FLOPs-heavy. I don't have an intuition for how memory-bandwidth-intensive updating the parameters is, but it shouldn't be much worse than a single forward pass.
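A quick sketch of why batching shifts the balance toward compute (same assumed layer sizes as above, and it ignores activation and optimizer-state traffic, so it's only the weight-reuse part of the story):

    # Weights are read once and reused across the batch, so FLOPs per byte
    # of weight traffic grows roughly with batch size.
    d_model, d_ff, dtype_bytes = 4096, 16384, 2   # assumed sizes, fp16

    for batch in (1, 8, 64, 512):
        flops = 2 * d_model * d_ff * batch
        weight_bytes = d_model * d_ff * dtype_bytes
        print(batch, flops / weight_bytes)        # 1, 8, 64, 512 FLOPs/byte

With a large enough batch the arithmetic intensity climbs into the compute-bound regime, which is why training throughput is quoted in FLOPs while single-user inference is quoted in memory bandwidth.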



