From Compression to Expansion:
A Layerwise Analysis of In-Context Learning

The Ohio State University

Layerwise Compression-Expansion Phenomenon
LLMs exhibiting ICL capabilities organize their layers into two parts with distinct behaviors: a compression part and an expansion part. The early layers, comprising the compression part, progressively produce compact and discriminative representations that capture task information from the input demonstrations. The later layers, forming the expansion part, apply these compact representations to the query to generate the output.
Overview Image

Layer-wise compression to expansion in ICL representations. TDNV first decreases then increases from shallow to deep layers, splitting the model into compression and expansion stages. During the compression stage, task vector accuracy increases as task information is progressively extracted from demonstration pairs. During the expansion stage, early-exit accuracy increases as output information is progressively decoded based on the input query.

TL;DR

We uncover a universal Compression-to-Expansion 🚀 pattern in ICL representations, revealing how LLMs extract and utilize task information across layers.

🔥🔥 More content coming soon: code, demos.

In-Context Learning Tasks

Tasks Visualization

TDNV: Metric for Representation Compression

We analyze model representations for in-context learning (ICL). For each task, we use the hidden representation $h_{i,t}^{(l)}$ from each sample $i$ of task $t$ in layer $l$.
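
These layerwise states can be read out directly from a pretrained model. Below is a minimal, illustrative sketch (not the paper's released code) that collects the last-token hidden state at every layer of a HuggingFace causal LM for a toy ICL prompt; the model choice and the last-token readout are assumptions made here for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: any causal LM that exposes hidden states works the same way.
model_name = "gpt2"  # assumed model choice, not necessarily one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def layerwise_last_token_states(prompt):
    """Return the last-token hidden state h^{(l)} at every layer l for one ICL prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, d)
    return [h[0, -1].numpy() for h in out.hidden_states]

# Toy antonym-style prompt: two demonstration pairs followed by a query.
states = layerwise_last_token_states("hot -> cold\nbig -> small\nfast ->")
```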

To measure representation quality, we propose the Task-Distance Normalized Variance (TDNV): the sum, over all pairs of tasks, of the ratio between within-task variance and squared between-task distance. Lower TDNV means more compressed and more discriminative task representations (a computational sketch is given below).

$$\text{TDNV}^{(l)} := \sum_{t=1}^{T} \sum_{\substack{t'=1 \\ t' \neq t}}^{T} \frac{\text{var}_t^{(l)} + \text{var}_{t'}^{(l)}}{2\|\bar{h}_t^{(l)} - \bar{h}_{t'}^{(l)}\|_2^2}$$

TDNV has two main components:

  • Within-task variance ($\text{var}_t^{(l)}$). It measures how tightly examples from the same task cluster. A smaller variance means better compression.
    $$\text{var}_t^{(l)} = \frac{1}{N} \sum_{i=1}^{N} \|h_{i,t}^{(l)} - \bar{h}_t^{(l)}\|_2^2, \quad \text{where} \quad \bar{h}_t^{(l)} = \frac{1}{N} \sum_{i=1}^{N} h_{i,t}^{(l)}.$$
  • Between-task distance ($\|\bar{h}_t^{(l)} - \bar{h}_{t'}^{(l)}\|_2$). It measures the separation between different tasks. A larger distance means better separation.
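
Given per-task collections of these hidden states, TDNV at a layer follows directly from the definition above. The NumPy sketch below assumes the states of each task are stacked into an (N, d) array; this data layout is an illustrative assumption.

```python
import numpy as np

def tdnv(hidden_states):
    """Compute layerwise TDNV from the definition above.

    hidden_states: dict mapping task id t -> array of shape (N, d) holding the
    layer-l representations h_{i,t}^{(l)}; the layout is an illustrative assumption.
    """
    means = {t: h.mean(axis=0) for t, h in hidden_states.items()}        # \bar{h}_t
    variances = {t: np.mean(np.sum((h - means[t]) ** 2, axis=1))          # var_t
                 for t, h in hidden_states.items()}
    total = 0.0
    for t in hidden_states:
        for t2 in hidden_states:
            if t2 == t:
                continue
            dist_sq = np.sum((means[t] - means[t2]) ** 2)                 # ||\bar{h}_t - \bar{h}_{t'}||^2
            total += (variances[t] + variances[t2]) / (2 * dist_sq)
    return total
```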

Experimental Results

Prevalence of Phenomenon

Take Away: The compression-expansion phenomenon is universal across model architectures and emerges naturally during training.

Results Image
Layerwise TDNV across different model architectures, including transformers and state-space models.
Additional Results
Layerwise TDNV over the course of training. The phenomenon emerges and intensifies as training progresses.

Scaling Up Model Size Leads to More Compression

Take Away: As model size increases, the phenomenon becomes more pronounced, with larger models achieving better task representation compression.

Model Scaling Results 1
Layerwise TDNV for varying model sizes.
Model Scaling Results 2
ICL performance vs. minimum TDNV for varying model sizes.

Compression-to-Expansion under Noisy Demonstrations

Take Away: As the noise ratio increases, TDNV rises, and once the within-task variance exceeds the between-task distance ($\text{TDNV} > 1$), ICL performance drops sharply.

Noisy Demonstrations Results 1
ICL Performance under different noise ratios.
Noisy Demonstrations Results 2
Layerwise TDNV under different noise ratios.
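
As an illustration of the setup, noisy demonstrations can be simulated by corrupting a fraction of the demonstration labels before building the ICL prompt. The helper below is a hypothetical sketch of such a corruption step, not necessarily the paper's exact protocol.

```python
import random

def corrupt_demonstrations(pairs, noise_ratio, label_pool, seed=0):
    """Replace a fraction of demonstration labels with random incorrect ones.

    pairs: list of (input, label) demonstration tuples.
    noise_ratio: fraction of demonstrations whose label is corrupted.
    label_pool: candidate labels to sample replacements from.
    Hypothetical helper for illustration; the paper's protocol may differ.
    """
    rng = random.Random(seed)
    n_noisy = int(round(noise_ratio * len(pairs)))
    noisy_idx = set(rng.sample(range(len(pairs)), n_noisy))
    corrupted = []
    for i, (x, y) in enumerate(pairs):
        if i in noisy_idx:
            y = rng.choice([c for c in label_pool if c != y])  # wrong label
        corrupted.append((x, y))
    return corrupted

# e.g. corrupt 50% of a toy antonym task's demonstrations
demos = [("hot", "cold"), ("big", "small"), ("fast", "slow"), ("dark", "light")]
noisy = corrupt_demonstrations(demos, noise_ratio=0.5,
                               label_pool=["cold", "small", "slow", "light"])
```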

Bias-variance Decomposition of Task Vectors

Take Away: As the number of demonstrations $K$ increases, we observe an intriguing phenomenon:
    • Task vectors from different tasks point in distinct directions, while task vectors within the same task align along a consistent direction.
    • The variance within each task decreases.

Thus, we decompose the task vector into bias and variance components:

$$h_{t,i}(K) = \mu_t(\infty) + \underbrace{\mu_t(K) - \mu_t(\infty)}_{\text{bias}} + \underbrace{h_{t,i}(K) - \mu_t(K)}_{\text{variance}}$$
Task Vector PCA
PCA visualization of task vectors from different tasks. As $K$ increases, task vectors from different tasks become more separated while the variance within each task decreases.
Distance vs ICL Length
Decrease of bias as $\mathcal{O}(1/K)$.
Variance vs ICL Length
Decrease of variance as $\mathcal{O}(1/K)$.

Decrease of bias:

$$\|\mu_t(K) - \mu_t(\infty)\|_2 = \mathcal{O}(1/K)$$

Decrease of variance:

$$\mathbb{E}\big[\|h_{t,i}(K) - \mu_t(K)\|_2^2\big] = \mathcal{O}(1/K)$$
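
In practice, both quantities can be estimated from sampled task vectors at several values of $K$, approximating $\mu_t(\infty)$ by the mean at the largest available $K$. The sketch below spells out this estimation; the data layout and the use of the largest $K$ as a proxy for infinity are assumptions, not the authors' released code.

```python
import numpy as np

def bias_variance_curves(task_vectors_by_k):
    """Estimate bias and variance of task vectors as a function of K for one task.

    task_vectors_by_k: dict mapping K -> array of shape (N, d) holding sampled
    task vectors h_{t,i}(K); illustrative layout (assumption).
    """
    k_ref = max(task_vectors_by_k)                   # largest K as a proxy for K = infinity
    mu_inf = task_vectors_by_k[k_ref].mean(axis=0)
    bias, var = {}, {}
    for k, h in task_vectors_by_k.items():
        mu_k = h.mean(axis=0)
        bias[k] = np.linalg.norm(mu_k - mu_inf)                 # ||mu_t(K) - mu_t(inf)||_2
        var[k] = np.mean(np.sum((h - mu_k) ** 2, axis=1))       # within-task variance at K
    return bias, var  # both are expected to shrink roughly as O(1/K)
```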

BibTeX

@article{jiang2025compression,
  title={From Compression to Expansion: A Layerwise Analysis of In-Context Learning},
  author={Jiang, Jiachen and Dong, Yuxin and Zhou, Jinxin and Zhu, Zhihui},
  journal={arXiv preprint arXiv:2505.17322},
  year={2025}
}