An attention layer transforms word vectors by adding information from the other words in the sequence. The amount of information added from each neighboring word is regulated by a weight called the "attention weight". If the attention weight for one of the neighbors is very large, almost all of the added information comes from that word; conversely, if the attention weight for a neighbor is zero, that neighbor contributes no information at all. This is called an 'attention mechanism' because it literally decides which information to pass through the network, i.e. which other words the model should 'pay attention to' when it is considering a particular word.
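To make this concrete, here is a minimal sketch of the idea in NumPy. It assumes the word vectors are rows of an array, and it computes the attention weights one common way: a softmax over scaled dot-product similarities, so each row of weights sums to one. The function names and the toy input are illustrative, not part of any particular library.

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max for numerical stability before exponentiating.
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(word_vectors):
    """Return updated word vectors, each a weighted mix of all word vectors.

    word_vectors: array of shape (seq_len, dim), one row per word.
    """
    dim = word_vectors.shape[-1]
    # Dot-product similarity between every pair of words, scaled by sqrt(dim).
    scores = word_vectors @ word_vectors.T / np.sqrt(dim)
    # Attention weights: row i says how much word i "pays attention to"
    # every word in the sequence (including itself); each row sums to 1.
    weights = softmax(scores)
    # Each output vector is the attention-weighted sum of all word vectors,
    # so a weight near 1 means that neighbor dominates, and a weight of 0
    # means that neighbor contributes nothing.
    return weights @ word_vectors

# Toy example: three 4-dimensional word vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(3, 4))
updated = attention(vectors)
print(updated.shape)  # (3, 4)
```

In this sketch the weights come straight from comparing the word vectors to each other; in a full attention layer the vectors are first projected into separate query, key, and value spaces, but the core operation, a weighted sum governed by attention weights, is the same.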