
Matrix-vector multiplication for the feed-forward layers accounts for most of the bandwidth as I understand things. There's not really a way to do it "better"; it's just a bunch of memory-bound dot products.

(Posting this comment in hopes of being corrected and learning something).
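To make "memory-bound" concrete, here's a back-of-the-envelope sketch in Python. The layer sizes and fp16 weights are assumptions for illustration, not anything from this thread:

    # Rough arithmetic intensity of a single feed-forward mat-vec.
    # Hypothetical layer dimensions; real models vary.
    d_model, d_ff = 4096, 16384        # assumed hidden / feed-forward widths
    dtype_bytes = 2                    # assuming fp16/bf16 weights

    flops = 2 * d_model * d_ff                   # one multiply-add per weight
    bytes_moved = d_model * d_ff * dtype_bytes   # weight traffic dominates

    print(flops / bytes_moved)                   # ~1 FLOP per byte moved

At roughly one FLOP per byte of weight traffic, the compute units sit idle waiting on memory, which is why single-stream decode speed tracks memory bandwidth rather than peak FLOPs.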



The problem is that different parts of the SoC (CPU, GPU, NPU) may not actually be able to consume all of the bandwidth available to the system as a whole. This is why you'd need to benchmark: different chips may be able to feed their cores better than others.
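A crude way to probe this on the CPU side is a large streaming copy; the array size here is an arbitrary choice, and to compare the GPU or NPU paths you'd repeat the same idea with that device's own framework:

    import time
    import numpy as np

    # Crude CPU-side bandwidth probe; ~1 GB of float32 (hypothetical size).
    n = 256 * 1024 * 1024
    src = np.ones(n, dtype=np.float32)
    dst = np.empty_like(src)

    reps = 10
    start = time.perf_counter()
    for _ in range(reps):
        np.copyto(dst, src)            # one read + one write per element
    elapsed = time.perf_counter() - start

    gb_moved = reps * 2 * src.nbytes / 1e9
    print(f"~{gb_moved / elapsed:.1f} GB/s effective copy bandwidth")

The number you get from one block of the SoC can sit well below the headline memory-controller figure, which is the point being made above.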


Ah, yeah. I guess that will become more common as we venture further into SoCs; I was just thinking "it's whatever the memory controller can do".


Training is performed in parallel with batching and is more FLOPs-heavy. I don't have an intuition for how memory-bandwidth-intensive updating the parameters is, but it shouldn't be much worse than a single forward pass.
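A quick sketch of why batching shifts the balance toward compute (same assumed layer sizes as above, and it ignores activation and optimizer-state traffic, so it's only the weight-reuse part of the story):

    # Weights are read once and reused across the batch, so FLOPs per byte
    # of weight traffic grows roughly with batch size.
    d_model, d_ff, dtype_bytes = 4096, 16384, 2   # assumed sizes, fp16

    for batch in (1, 8, 64, 512):
        flops = 2 * d_model * d_ff * batch
        weight_bytes = d_model * d_ff * dtype_bytes
        print(batch, flops / weight_bytes)        # 1, 8, 64, 512 FLOPs/byte

With a large enough batch the arithmetic intensity climbs into the compute-bound regime, which is why training throughput is quoted in FLOPs while single-user inference is quoted in memory bandwidth.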



