From Compression to Expansion:
A Layerwise Analysis of In-Context Learning

The Ohio State University

Layerwise Compression-Expansion Phenomenon
LLMs exhibiting ICL capabilities organize their layers into two parts with distinct behaviors: a compression part and an expansion part. The early layers, comprising the compression part, progressively produce compact and discriminative representations that capture task information from the input demonstrations. The later layers, forming the expansion part, apply these compact representations to the query to generate the output.
Overview Image

Layer-wise compression to expansion in ICL representations. TDNV first decreases then increases from shallow to deep layers, splitting the model into compression and expansion stages. During the compression stage, task vector accuracy increases as task information is progressively extracted from demonstration pairs. During the expansion stage, early-exit accuracy increases as output information is progressively decoded based on the input query.

TL;DR

We uncover a universal Compression-to-Expansion 🚀 pattern in ICL representations, revealing how LLMs extract and utilize task information across layers.

🔥🔥 More content coming soon: code, demos.

In-Context Learning Tasks

Tasks Visualization

TDNV: Metric for Representation Compression

We analyze model representations for in-context learning (ICL). For each task, we use the hidden representation $h_{i,t}^{(l)}$ from each sample $i$ of task $t$ in layer $l$.
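
These layerwise states can be read out directly from a pretrained model. Below is a minimal, illustrative sketch (not the paper's released code) that collects the last-token hidden state at every layer of a HuggingFace causal LM for a toy ICL prompt; the model choice and the last-token readout are assumptions made here for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: any causal LM that exposes hidden states works the same way.
model_name = "gpt2"  # assumed model choice, not necessarily one used in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def layerwise_last_token_states(prompt):
    """Return the last-token hidden state h^{(l)} at every layer l for one ICL prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # out.hidden_states is a tuple of (num_layers + 1) tensors of shape (1, seq_len, d)
    return [h[0, -1].numpy() for h in out.hidden_states]

# Toy antonym-style prompt: two demonstration pairs followed by a query.
states = layerwise_last_token_states("hot -> cold\nbig -> small\nfast ->")
```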

To measure representation quality, we propose the Task-Distance Normalized Variance (TDNV): the sum, over all pairs of tasks, of the ratio between within-task variance and squared between-task distance. Lower TDNV means more compressed and more discriminative task representations (a computational sketch is given below).

$$\text{TDNV}^{(l)} := \sum_{t=1}^{T} \sum_{\substack{t'=1 \\ t' \neq t}}^{T} \frac{\text{var}_t^{(l)} + \text{var}_{t'}^{(l)}}{2\|\bar{h}_t^{(l)} - \bar{h}_{t'}^{(l)}\|_2^2}$$

TDNV has two main components:

  • Within-task variance ($\text{var}_t^{(l)}$). It measures how tightly examples from the same task cluster. A smaller variance means better compression.
    $$\text{var}_t^{(l)} = \frac{1}{N} \sum_{i=1}^{N} \|h_{i,t}^{(l)} - \bar{h}_t^{(l)}\|_2^2, \quad \text{where} \quad \bar{h}_t^{(l)} = \frac{1}{N} \sum_{i=1}^{N} h_{i,t}^{(l)}.$$
  • Between-task distance ($\|\bar{h}_t^{(l)} - \bar{h}_{t'}^{(l)}\|_2$). It measures the separation between different tasks. A larger distance means better separation.
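
Given per-task collections of these hidden states, TDNV at a layer follows directly from the definition above. The NumPy sketch below assumes the states of each task are stacked into an (N, d) array; this data layout is an illustrative assumption.

```python
import numpy as np

def tdnv(hidden_states):
    """Compute layerwise TDNV from the definition above.

    hidden_states: dict mapping task id t -> array of shape (N, d) holding the
    layer-l representations h_{i,t}^{(l)}; the layout is an illustrative assumption.
    """
    means = {t: h.mean(axis=0) for t, h in hidden_states.items()}        # \bar{h}_t
    variances = {t: np.mean(np.sum((h - means[t]) ** 2, axis=1))          # var_t
                 for t, h in hidden_states.items()}
    total = 0.0
    for t in hidden_states:
        for t2 in hidden_states:
            if t2 == t:
                continue
            dist_sq = np.sum((means[t] - means[t2]) ** 2)                 # ||\bar{h}_t - \bar{h}_{t'}||^2
            total += (variances[t] + variances[t2]) / (2 * dist_sq)
    return total
```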

Experimental Results

Prevalence of Phenomenon

Take Away: The compression-expansion phenomenon is universal across model architectures and emerges naturally during training.

Results Image
Layerwise TDNV across different model architectures, including transformers and state-space models.
Additional Results
Layerwise TDNV over the course of training. The phenomenon emerges and intensifies as training progresses.

Scaling Up Model Size Leads to More Compression

Take Away: As model size increases, the phenomenon becomes more pronounced, with larger models achieving better task representation compression.

Model Scaling Results 1
Layerwise TDNV for varying model sizes.
Model Scaling Results 2
ICL performance vs. minimum TDNV for varying model sizes.

Compression-to-Expansion under Noisy Demonstrations

Take Away: As the noise ratio increases, TDNV rises, and once the within-task variance exceeds the between-task distance ($\text{TDNV} > 1$), ICL performance drops sharply.

Noisy Demonstrations Results 1
ICL Performance under different noise ratios.
Noisy Demonstrations Results 2
Layerwise TDNV under different noise ratios.
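
As an illustration of the setup, noisy demonstrations can be simulated by corrupting a fraction of the demonstration labels before building the ICL prompt. The helper below is a hypothetical sketch of such a corruption step, not necessarily the paper's exact protocol.

```python
import random

def corrupt_demonstrations(pairs, noise_ratio, label_pool, seed=0):
    """Replace a fraction of demonstration labels with random incorrect ones.

    pairs: list of (input, label) demonstration tuples.
    noise_ratio: fraction of demonstrations whose label is corrupted.
    label_pool: candidate labels to sample replacements from.
    Hypothetical helper for illustration; the paper's protocol may differ.
    """
    rng = random.Random(seed)
    n_noisy = int(round(noise_ratio * len(pairs)))
    noisy_idx = set(rng.sample(range(len(pairs)), n_noisy))
    corrupted = []
    for i, (x, y) in enumerate(pairs):
        if i in noisy_idx:
            y = rng.choice([c for c in label_pool if c != y])  # wrong label
        corrupted.append((x, y))
    return corrupted

# e.g. corrupt 50% of a toy antonym task's demonstrations
demos = [("hot", "cold"), ("big", "small"), ("fast", "slow"), ("dark", "light")]
noisy = corrupt_demonstrations(demos, noise_ratio=0.5,
                               label_pool=["cold", "small", "slow", "light"])
```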

Bias-variance Decomposition of Task Vectors

Take Away: As the number of demonstrations $K$ increases, we observe an intriguing phenomenon:
    • Task vectors from different tasks point in distinct directions, while task vectors within the same task align along a consistent direction.
    • The variance within each task decreases.

Thus, we decompose the task vector into bias and variance components:

$$h_{t,i}(K) = \mu_t(\infty) + \underbrace{\mu_t(K) - \mu_t(\infty)}_{\text{bias}} + \underbrace{h_{t,i}(K) - \mu_t(K)}_{\text{variance}}$$
Task Vector PCA
PCA visualization of task vectors from different tasks. As $K$ increases, task vectors from different tasks become more separated while the variance within each task decreases.
Distance vs ICL Length
Decrease of bias as $\mathcal{O}(1/K)$.
Variance vs ICL Length
Decrease of variance as $\mathcal{O}(1/K)$.

Decrease of bias:

$$\|\mu_t(K) - \mu_t(\infty)\|_2 = \mathcal{O}(1/K)$$

Decrease of variance:

$$\mathbb{E}\big[\|h_{t,i}(K) - \mu_t(K)\|_2^2\big] = \mathcal{O}(1/K)$$
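
In practice, both quantities can be estimated from sampled task vectors at several values of $K$, approximating $\mu_t(\infty)$ by the mean at the largest available $K$. The sketch below spells out this estimation; the data layout and the use of the largest $K$ as a proxy for infinity are assumptions, not the authors' released code.

```python
import numpy as np

def bias_variance_curves(task_vectors_by_k):
    """Estimate bias and variance of task vectors as a function of K for one task.

    task_vectors_by_k: dict mapping K -> array of shape (N, d) holding sampled
    task vectors h_{t,i}(K); illustrative layout (assumption).
    """
    k_ref = max(task_vectors_by_k)                   # largest K as a proxy for K = infinity
    mu_inf = task_vectors_by_k[k_ref].mean(axis=0)
    bias, var = {}, {}
    for k, h in task_vectors_by_k.items():
        mu_k = h.mean(axis=0)
        bias[k] = np.linalg.norm(mu_k - mu_inf)                 # ||mu_t(K) - mu_t(inf)||_2
        var[k] = np.mean(np.sum((h - mu_k) ** 2, axis=1))       # within-task variance at K
    return bias, var  # both are expected to shrink roughly as O(1/K)
```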

BibTeX

@article{jiang2025compression,
  title={From Compression to Expansion: A Layerwise Analysis of In-Context Learning},
  author={Jiang, Jiachen and Dong, Yuxin and Zhou, Jinxin and Zhu, Zhihui},
  journal={arXiv preprint arXiv:2505.17322},
  year={2025}
}