
Figure 1. Illustration of the emergence of Slash-Dominant Heads (SDHs). Attention scores are determined by pre-PE queries, keys, and RoPE (left bottom). Because token embeddings lie approximately on a cone, queries/keys are almost rank-one and nearly identical across tokens (left top), so RoPE primarily governs the variation of attention scores across tokens. RoPE's high- and medium-frequency components then interact constructively at specific lags, producing attention score peaks at offset $\Delta$ (right top). As a result, SDHs emerge and generalize (right bottom).
Authors: **Yuan Cheng\*, Fengzhuo Zhang\*, Yunlong Hou\*, Cunxiao Du, Chao Du, Tianyu Pang, Aixin Sun, Zhuoran Yang**
\*Co-First Authors
Demystifying the Slash Pattern in Attention: The Role of RoPE
<aside> 💡
We investigate a key mechanism for information propagation in LLMs—slash-dominant heads (SDHs)—which exhibit a distinctive slash attention pattern, and demystify the emergence of SDHs as follows.
SDHs are Intrinsic: SDHs are intrinsic to the model and do not depend on the input prompt; consequently, they generalize Out-Of-Distribution (OOD) and appear for arbitrary prompts. (Section 3.1: OOD Generalization of SDHs)
One-rankness of Queries and Keys: The pre-PE queries and keys of SDHs are nearly rank-one, indicating little token-specific semantics, so attention variations are driven primarily by RoPE. (Section 3.2: Approximate One-Rankness of pre-PE Queries and Keys)
Origin of Rank-One Structure: We further show that this rank-one structure arises because token embeddings lie on a cone and $W_Q, W_K$ project them onto its main axis. (Section 3.3: How Approximately Rank-One Queries and Keys Arise?)
Role of RoPE Frequencies: As a result of RoPE and rank-one queries and keys, for any $i,j\in\mathbb{N}$, the pre-softmax attention logit from position $i$ attending to position $j$ admits a Fourier-like decomposition:
$$ \mathtt{AttnLogit}(i,j) = \sum\nolimits_{l=1}^{d/2} A_{l} \cdot \cos \bigl (\theta_l \cdot (i-j) + \varphi_{l}\bigr) $$
where $d$ is the hidden dimension, $\{\theta_l\}_{l=1}^{d/2}$ are the RoPE frequencies, and $\{A_l\}_{l=1}^{d/2}$ and $\{\varphi_l\}_{l=1}^{d/2}$ are the amplitudes and phases. We observe that high- and medium-frequency components play a dominant role in forming slash patterns, whereas low frequencies contribute little; see the numerical sketch after this overview. (Section 3.4: Collaboration of Frequencies in RoPE Determines Slash Pattern)
A Sufficient Frequency Condition: We propose a condition that quantitatively characterizes the RoPE frequency interactions and show theoretically that it is sufficient to induce SDHs. (Section 3.5: A Sufficient Frequency Condition for Slash Pattern Learning) </aside>
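To make the decomposition above concrete, the following NumPy sketch recovers the amplitudes $A_l$ and phases $\varphi_l$ from a single pre-PE query and key shared across positions (the approximately rank-one regime). The function names, the RoPE base of 10000, and the pairing of consecutive dimensions into 2-D rotation blocks are our illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_l = base^(-2(l-1)/d) for l = 1, ..., d/2 (common RoPE convention)."""
    return base ** (-2.0 * np.arange(d // 2) / d)

def logit_decomposition(q, k, base=10000.0):
    """Amplitudes A_l and phases phi_l such that
    AttnLogit(i, j) = sum_l A_l * cos(theta_l * (i - j) + phi_l),
    assuming a single pre-PE query q and key k shared by all positions."""
    d = q.shape[0]
    q2, k2 = q.reshape(-1, 2), k.reshape(-1, 2)            # 2-D RoPE rotation blocks
    cos_part = q2[:, 0] * k2[:, 0] + q2[:, 1] * k2[:, 1]   # A_l * cos(phi_l)
    sin_part = q2[:, 1] * k2[:, 0] - q2[:, 0] * k2[:, 1]   # A_l * sin(phi_l)
    A = np.hypot(cos_part, sin_part)
    phi = np.arctan2(sin_part, cos_part)
    return rope_frequencies(d, base), A, phi

def attn_logit(i, j, theta, A, phi):
    return float(np.sum(A * np.cos(theta * (i - j) + phi)))

# The logit depends only on the offset i - j, which is exactly what produces a slash.
rng = np.random.default_rng(0)
q, k = rng.normal(size=128), rng.normal(size=128)
theta, A, phi = logit_decomposition(q, k)
print(attn_logit(10, 3, theta, A, phi), attn_logit(110, 103, theta, A, phi))  # identical
```

Because the query and key here are position-independent, the logit is a pure function of the lag $i-j$; whether its peaks land at a particular $\Delta$ is decided by how the different frequency components add up.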
@online{cheng2026demystifyingslashpatternattention,
  title  = {Demystifying the Slash Pattern in Attention: The Role of RoPE},
  author = {Cheng, Yuan and Zhang, Fengzhuo and Hou, Yunlong and Du, Cunxiao and Du, Chao and Pang, Tianyu and Sun, Aixin and Yang, Zhuoran},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.08297}
}
Given a prompt that contains a question, an LLM can generate a coherent and contextually appropriate answer. A crucial ingredient behind this ability is the model's capability to pass information across different tokens in the sequence, most notably from the prompt tokens to the answer tokens. In modern LLMs, this information-passing behavior is closely linked to a specific structural pattern: the slash pattern in the attention scores.
The slash pattern refers to attention scores concentrating along the $\Delta$-th sub-diagonal of the attention score matrix, thus forming a slash line (Figure 2). We refer to attention heads exhibiting slash patterns as Slash-Dominant Heads (SDHs), which are formally defined as follows.
Intuitively, the average slash score measures the average attention paid to tokens that are $\Delta$ positions before the current token.
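As a concrete reading of this quantity, the sketch below averages one head's attention matrix along its $\Delta$-th sub-diagonal and then scans offsets to find the one the head favors. The function names and the simple argmax criterion are our assumptions for illustration, not the paper's formal definition.

```python
import numpy as np

def average_slash_score(attn, delta):
    """Average attention a head pays to the token `delta` positions back.

    attn:  (T, T) row-stochastic attention matrix of one head
           (causal, so attn[i, j] = 0 for j > i).
    delta: sub-diagonal offset, 0 <= delta < T.
    """
    T = attn.shape[0]
    rows = np.arange(delta, T)                    # positions i that have a token at i - delta
    return float(attn[rows, rows - delta].mean())

def dominant_offset(attn, max_delta=1024):
    """Offset whose sub-diagonal carries the most attention mass, and that mass.
    A score close to 1 suggests a slash-dominant head (illustrative criterion)."""
    deltas = range(min(max_delta, attn.shape[0] - 1) + 1)
    scores = [average_slash_score(attn, d) for d in deltas]
    best = int(np.argmax(scores))
    return best, scores[best]
```

Applied to the averaged matrices of Figure 2, such a scan would report small offsets (0, 1, 2) for panels (a)–(c) and offsets above 500 for panels (d)–(f).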
SDHs and their slash patterns play important algorithmic roles in LLMs. For example, they enable In-context Learning (ICL) via the induction head circuit, which is a special case of an SDH with $\Delta = 1$. In addition, another line of work, exemplified by XAttention and MTraining, leverages slash patterns to accelerate long-context inference or training.

Figure 2. Average of attention score matrices in Qwen2.5-7B-Instruct with prompts from LongBench. We denote the a-th head in the b-th layer as LbHa in this blog. In panels (a)–(c), attention concentrates on sub-diagonals with small offsets 0, 1, and 2, respectively. In panels (d)–(f), it also concentrates on sub-diagonals with large offsets exceeding 500.
As shown in Figure 2, SDHs with diverse values of $\Delta$ are prevalent in modern open-source LLMs such as Qwen2.5-7B-Instruct. These SDHs enable a token at position $i$ to attend directly to the token at position $i-\Delta$, thereby passing information from earlier tokens to later ones. Their widespread presence and functional importance naturally motivate our central research question:
How do pretrained LLMs implement SDHs using their transformer architectures?
To answer this question, we first need to know what determines the attention scores in LLMs.
In this section, we introduce the backbone of modern transformers, the Causal Self-attention Layer with RoPE, from which we can directly see what determines the attention scores.
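As a running reference, here is a minimal single-head sketch of that layer in NumPy (interleaved-pair RoPE convention, no multi-head bookkeeping; all names are ours): pre-PE queries $W_Q x_i$ and keys $W_K x_j$ are rotated position-wise by RoPE, and their scaled inner products are causally masked and passed through a softmax.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of x[t] by angle positions[t] * theta_l."""
    T, d = x.shape
    theta = base ** (-2.0 * np.arange(d // 2) / d)       # RoPE frequencies
    ang = positions[:, None] * theta[None, :]            # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def causal_attention_scores(X, W_Q, W_K):
    """Attention scores of one head with RoPE.

    X: (T, d_model) token representations entering the layer; W_Q, W_K: (d_model, d).
    AttnLogit(i, j) = <RoPE_i(W_Q x_i), RoPE_j(W_K x_j)> / sqrt(d), masked to j <= i.
    """
    T, d = X.shape[0], W_Q.shape[1]
    pos = np.arange(T)
    q = apply_rope(X @ W_Q, pos)
    k = apply_rope(X @ W_K, pos)
    logits = np.where(np.tril(np.ones((T, T), dtype=bool)), q @ k.T / np.sqrt(d), -np.inf)
    logits -= logits.max(axis=-1, keepdims=True)         # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```

The scores are thus determined by the pre-PE queries and keys together with RoPE, which rotates each 2-D block by an angle proportional to the token's position.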