Maybe slightly related: canon layers provide direct horizontal information flow along the residual stream. See this paper, which argues precisely that LLMs struggle with horizontal information flow, since "looking back a token" is fairly expensive: it can only be done via attention layers and encoding in the residual stream.
https://openreview.net/pdf?id=kxv0M6I7Ud
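For intuition, a canon-style layer amounts to a causal, token-local mix: each position's hidden state gets a weighted contribution from a few immediately preceding positions, without going through attention. This is just an illustrative sketch (the kernel size `K` and the scalar per-offset weights are assumptions for clarity, not the paper's exact parameterization):

```python
import numpy as np

def canon_layer(x, weights):
    """Causal local mix of hidden states.

    x: (T, d) array of per-token hidden states.
    weights: length-K list; weights[k] scales the state k tokens back.
    Position t receives weights[0]*x[t] + weights[1]*x[t-1] + ...,
    with out-of-range (pre-sequence) positions treated as zero.
    """
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        for k, w in enumerate(weights):
            if t - k >= 0:
                out[t] += w * x[t - k]
    return out

# Tiny demo: 3 tokens, 1-dim states, mix current token with half of the previous one.
x = np.array([[1.0], [2.0], [3.0]])
y = canon_layer(x, [1.0, 0.5])
print(y)  # → [[1.], [2.5], [4.]]
```

The point of the sketch: the "look back a token" operation here is a cheap fixed-window mix on the residual stream, rather than something the model has to reconstruct through attention.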