Maybe slightly related: canon layers provide direct horizontal information flow along the residual stream. See this paper, which argues precisely that LLMs struggle with horizontal information flow, since "looking back a token" is fairly expensive: it can only be done via attention layers and encoding in the residual stream.
https://openreview.net/pdf?id=kxv0M6I7Ud
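For intuition, a canon-style layer amounts to a causal, token-local mix: each position's hidden state gets a weighted contribution from a few immediately preceding positions, without going through attention. This is just an illustrative sketch (the kernel size `K` and the scalar per-offset weights are assumptions for clarity, not the paper's exact parameterization):

```python
import numpy as np

def canon_layer(x, weights):
    """Causal local mix of hidden states.

    x: (T, d) array of per-token hidden states.
    weights: length-K list; weights[k] scales the state k tokens back.
    Position t receives weights[0]*x[t] + weights[1]*x[t-1] + ...,
    with out-of-range (pre-sequence) positions treated as zero.
    """
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        for k, w in enumerate(weights):
            if t - k >= 0:
                out[t] += w * x[t - k]
    return out

# Tiny demo: 3 tokens, 1-dim states, mix current token with half of the previous one.
x = np.array([[1.0], [2.0], [3.0]])
y = canon_layer(x, [1.0, 0.5])
print(y)  # → [[1.], [2.5], [4.]]
```

The point of the sketch: the "look back a token" operation here is a cheap fixed-window mix on the residual stream, rather than something the model has to reconstruct through attention.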