Papers
arxiv:2603.06922

NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks

Published on Mar 6 · Submitted by Nandan Kumar Jha on Mar 13

Abstract

NerVE provides a unified framework for analyzing feed-forward network dynamics in large language models through spectral analysis metrics that reveal information flow organization and optimization impacts across architectures.

AI-generated summary

We introduce NerVE, a unified eigenspectral framework for understanding how feed-forward networks (FFNs) in large language models (LLMs) organize and regulate information flow in high-dimensional latent space. Despite FFNs dominating the parameter budget, their high-dimensional dynamics remain poorly understood. NerVE addresses this gap through lightweight, memory-efficient tracking of eigenspectrum dynamics via four complementary metrics: Spectral Entropy (dispersion), Participation Ratio (effective dimensionality), Eigenvalue Early Enrichment (top-heaviness), and Jensen-Shannon divergence (distributional shifts). Our key insight is that FFN nonlinearities reinject variance across eigenmodes, fundamentally governing latent dimension utilization, and that optimizer geometry strongly modulates the extent of this variance reinjection. We validate NerVE across model scales and diverse architectural and optimizer configurations, each of which uniquely shapes FFN dynamics: normalization schemes controlling variance flow; FFN weight geometries constraining latent space; positional encoding and activation functions regulating information flow; and optimizer choices redistributing effective capacity across depth. Across these settings, NerVE consistently recovers stable spectral signatures that correlate with a model's generalization ability and respond predictably to design choices, generalize beyond transformers to MLP-Mixer architectures, and provide actionable insights for architectural and optimizer choices beyond trial and error.
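As a rough illustration, the four metrics can all be computed from the eigenvalues of an activation covariance matrix. The sketch below assumes standard textbook definitions (Shannon entropy of the normalized spectrum, the usual participation ratio, a top-k variance fraction as a stand-in for Eigenvalue Early Enrichment, and Jensen-Shannon divergence between two normalized spectra); the paper's exact estimators and normalizations may differ, and the cutoff `k` is an assumption for illustration.

```python
import numpy as np

def spectrum(x):
    """Eigenvalues of the feature covariance of activations x (n_samples, d)."""
    xc = x - x.mean(axis=0, keepdims=True)
    cov = xc.T @ xc / (len(x) - 1)
    # eigvalsh: symmetric eigensolver; clip tiny negatives from round-off
    return np.clip(np.linalg.eigvalsh(cov), 0.0, None)

def spectral_entropy(ev):
    """Shannon entropy of the normalized eigenspectrum (dispersion)."""
    p = ev / ev.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def participation_ratio(ev):
    """Effective dimensionality: (sum lambda)^2 / sum lambda^2."""
    return float(ev.sum() ** 2 / (ev ** 2).sum())

def early_enrichment(ev, k=10):
    """Hypothetical proxy for top-heaviness: variance fraction in top-k modes."""
    top = np.sort(ev)[::-1][:k]
    return float(top.sum() / ev.sum())

def js_divergence(ev_a, ev_b):
    """Jensen-Shannon divergence between two normalized spectra (same length)."""
    p, q = ev_a / ev_a.sum(), ev_b / ev_b.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

For intuition: a perfectly flat spectrum of dimension d gives entropy log(d) and participation ratio d, while a one-hot spectrum gives entropy 0 and participation ratio 1.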

Community


Hi everyone!

I am excited to share our ICLR 2026 paper, NerVE: Nonlinear Eigenspectrum Dynamics in LLM Feed-Forward Networks. Here are some interesting findings about the role of FFN nonlinearity in the transformer architecture (we also verified them on a non-transformer architecture, MLP-Mixer):

  1. FFN nonlinearities are secretly fighting a war inside your transformer. Self-attention collapses rank doubly exponentially with depth (Dong et al., ICML 2021). We find that FFN nonlinearities fight back by reinjecting variance into under-utilized dimensions, a process we call nonlinearity-induced rank inflation, which keeps the transformer network alive.

  2. AdamW makes your nonlinearities work harder but achieve less, compared to Muon. Under AdamW, FFN nonlinearities spend their capacity repairing spectral damage (an ill-conditioned pre-activation eigenspectrum). Muon, on the other hand, preserves healthy spectra (a well-conditioned pre-activation eigenspectrum), so nonlinearities only need to refine.

  3. You can predict generalization with a single forward pass, no eval set needed. Our spectral metrics (spectral entropy and participation ratio) correlate with validation loss at |r| > 0.97 throughout training. Short runs can even rank architectural configurations before training to convergence.
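The rank-inflation effect in finding 1 is easy to see in a toy setting. The sketch below is illustrative only, not the paper's setup: it builds low-rank "pre-activations" (mimicking attention-induced rank collapse) and checks that a pointwise nonlinearity (ReLU here, chosen for simplicity) spreads variance into new eigendirections, raising the numerical rank of the centered activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Low-rank "pre-activations": n tokens, d channels, but variance confined
# to an r-dimensional subspace, as after attention-induced rank collapse.
n, d, r = 2048, 64, 4
z = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))

def numerical_rank(x):
    """Numerical rank of the centered activation matrix."""
    return np.linalg.matrix_rank(x - x.mean(axis=0))

rank_pre = numerical_rank(z)                    # == r: collapsed subspace
rank_post = numerical_rank(np.maximum(z, 0.0))  # > r: the nonlinearity
                                                # reinjects variance into
                                                # previously unused directions
print(rank_pre, rank_post)
```

Because ReLU is not a linear map, its outputs are generically linearly independent functions of the r latent coordinates, so the post-activation covariance occupies many more eigendirections than the pre-activation one.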

