Sliding Window Attention
One of the largest issues with standard Multi-Head Attention is that the attention computation scales quadratically with the sequence length.
To address this, in sliding window attention the current token can only attend to a limited number of tokens defined by the window size. For example, with a window size of 2, a token can attend to 2 tokens on its left and 2 tokens on its right, 5 tokens in total (including itself). Sliding window attention can be thought of as local attention, whereas standard multi-head attention is global attention.
One might object that token W7 can no longer attend to W3 directly, but because layers are stacked, W5 has already attended to W3 in an earlier layer, and W7 can attend to W5, so information from W3 still reaches W7 indirectly.
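Below is a minimal NumPy sketch of this idea, assuming a symmetric (Longformer-style) window where each token sees `window` tokens on each side of itself; the function name and shapes are just for illustration.

```python
# Minimal sketch of sliding-window (local) attention with a symmetric window.
import numpy as np

def sliding_window_attention(q, k, v, window):
    """q, k, v: arrays of shape (seq_len, d). Returns (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len)

    # Band mask: position i may only see positions j with |i - j| <= window.
    idx = np.arange(seq_len)
    band = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(band, scores, -np.inf)

    # Softmax over the allowed positions only.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Example: window = 2 -> each token sees at most 2 left + itself + 2 right = 5 tokens.
q = k = v = np.random.randn(8, 16)
out = sliding_window_attention(q, k, v, window=2)
print(out.shape)  # (8, 16)
```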
Also, the window size can vary across layers. Usually, in the early layers (closer to the inputs) the window size is small, and it grows larger in the top layers (closer to the outputs).
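A rough illustration of such a schedule; the layer count and numbers below are made up, not taken from any particular model.

```python
# Hypothetical window schedule: small windows near the inputs,
# larger windows near the outputs. Numbers are illustrative only.
num_layers = 12
window_sizes = [64 * (layer + 1) for layer in range(num_layers)]
print(window_sizes)  # [64, 128, 192, ..., 768]
```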
Example models: Gemma, Longformer
In Gemma 2, they used a 1:1 ratio: for each sliding window attention layer, there is one full attention layer.
In Gemma 3, they extended this to a 5:1 ratio: for every 5 sliding window attention layers, there is one full attention layer.
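A small sketch of how such an interleaving could be laid out; the 5:1 ratio matches the Gemma 3 description above, but the layer count and the string labels are just for illustration.

```python
# Assign an attention type to each layer: every 6th layer uses full (global)
# attention, the other 5 use sliding-window (local) attention, i.e. a 5:1 ratio.
num_layers = 24  # illustrative
layer_types = [
    "full" if (layer + 1) % 6 == 0 else "sliding"
    for layer in range(num_layers)
]
print(layer_types.count("sliding"), layer_types.count("full"))  # 20 4
```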
