Yes. Otherwise next-token models wouldn't be nearly as good as they are. But the question is how to train these capabilities most efficiently! We had some interesting findings on how, as model size, dataset size, and data quality increase, a capability can move from "only learnable with multi-token prediction" to "indifferent" to "multi-token prediction actually hurts". Where this happens depends on the capability itself: induction, for example, matures much earlier in this sense than code generation.
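For reference, here is a rough sketch of what the multi-token objective looks like (illustrative only, not our exact training code; the function name and shapes are my own): one shared forward pass produces logits for several future offsets, and the per-offset cross-entropy losses are averaged.

```python
# Rough sketch of the multi-token prediction objective (illustrative only):
# one shared forward pass produces logits for several future offsets, and we
# average the per-offset cross-entropy losses.
import torch.nn.functional as F

def multi_token_loss(per_head_logits, tokens):
    """per_head_logits: list of tensors, each (batch, seq, vocab);
    head k (1-indexed) is trained to predict the token k positions ahead.
    tokens: (batch, seq) integer token ids."""
    loss = 0.0
    for k, logits in enumerate(per_head_logits, start=1):
        # Predictions at position t are scored against the token at t + k.
        pred = logits[:, :-k, :]                 # (batch, seq - k, vocab)
        target = tokens[:, k:]                   # (batch, seq - k)
        loss = loss + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return loss / len(per_head_logits)
```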
Is it possible that the anti-scaling effect occurs because you are removing some middle layers to free up space for the extra output heads? I only scanned the paper quickly, but what happens if you treat the technique as strictly additive and don't keep the parameter count fixed?
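To make that concrete, here is a minimal sketch of the "strictly additive" variant I mean (illustrative names and sizes, not the paper's architecture): the shared trunk keeps its full depth, and the extra future-token heads are simply added on top, so parameter count grows with the number of heads instead of being held fixed.

```python
# Rough sketch of a "strictly additive" multi-token model (illustrative only):
# the trunk depth is NOT reduced to "pay" for the extra heads.
import torch.nn as nn

class AdditiveMultiTokenModel(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_trunk_layers=12, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # Shared trunk, same depth as the next-token baseline.
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_trunk_layers)
        # Head k predicts the token k positions ahead
        # (causal masking omitted here for brevity).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, token_ids):
        h = self.trunk(self.embed(token_ids))    # (batch, seq, d_model)
        return [head(h) for head in self.heads]  # one logit tensor per offset
```

Comparing this against a parameter-matched variant (fewer trunk layers, same heads) would show whether the anti-scaling comes from the lost depth or from the objective itself.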