Papers
arxiv:2604.07394

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

Published on Apr 8 · Submitted by Quantong Qiu on Apr 10

Abstract

Flux Attention dynamically optimizes attention computation in LLMs by routing layers to full or sparse attention based on input context, achieving faster inference with minimal training overhead.

AI-generated summary

The quadratic computational complexity of standard attention mechanisms presents a severe scalability bottleneck for LLMs in long-context scenarios. While hybrid attention mechanisms combining Full Attention (FA) and Sparse Attention (SA) offer a potential solution, existing methods typically rely on static allocation ratios that fail to accommodate the variable retrieval demands of different tasks. Furthermore, head-level dynamic sparsity often introduces severe computational load imbalance and synchronization long-tails, which hinder hardware acceleration during autoregressive decoding. To bridge this gap, we introduce Flux Attention, a context-aware framework that dynamically optimizes attention computation at the layer level. By integrating a lightweight Layer Router into frozen pretrained LLMs, the proposed method adaptively routes each layer to FA or SA based on the input context. This layer-wise routing preserves high-fidelity information retrieval while ensuring contiguous memory access, translating theoretical computational reductions into practical wall-clock speedups. As a parameter-efficient approach, our framework requires only 12 hours of training on 8× A800 GPUs. Extensive experiments across multiple long-context and mathematical reasoning benchmarks demonstrate that Flux Attention achieves a superior trade-off between performance and inference speed compared with baseline models, with speed improvements of up to 2.8× and 2.0× in the prefill and decode stages, respectively.
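The layer-wise routing described above can be sketched as a small gate attached to each frozen layer. This is a minimal illustration only, assuming a mean-pooled context feature and a linear classifier; the class name, pooling choice, and feature design are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class LayerRouter(nn.Module):
    """Hypothetical lightweight per-layer router: maps a pooled context
    summary to a binary choice between Full Attention (FA) and
    Sparse Attention (SA). Illustrative sketch, not the paper's code."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Two logits: index 0 = full attention, index 1 = sparse attention.
        self.proj = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor) -> int:
        # Mean-pool over the sequence dimension: a cheap context feature
        # so routing cost stays negligible next to attention itself.
        context = hidden_states.mean(dim=1)        # (batch, hidden)
        logits = self.proj(context).mean(dim=0)    # (2,) pooled over batch
        return int(logits.argmax())                # 0 = FA, 1 = SA

# Usage: pick an attention kernel for one layer given the current context.
router = LayerRouter(hidden_size=16)
h = torch.randn(2, 8, 16)                          # (batch, seq, hidden)
mode = "sparse" if router(h) == 1 else "full"
```

Because the decision is made once per layer (not per head), every thread in the layer follows the same code path, which is what preserves contiguous memory access during decoding.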

Community

Paper author · Paper submitter

Paper Title: Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Link: arXiv:2604.07394 (Preprint)

【TL;DR / One-Sentence Summary】
⭐⭐⭐⭐⭐ (Highly Recommended). A brilliant hardware-aware co-design that optimizes Long-Context LLM inference by shifting sparse attention scheduling from the "Head-level" to the "Layer-level," effectively eliminating synchronization bottlenecks and achieving significant wall-clock speedups.

【Core Highlights】

  1. Hardware-Friendly Layer-level Routing: Unlike previous hybrid attention methods (e.g., Elastic Attention) that operate at a fine-grained head level—causing GPU thread divergence and idle time—Flux Attention introduces a lightweight Layer Router. It decides whether an entire layer uses Full Attention (FA) or Sparse Attention (SA). This coarse-grained approach ensures contiguous memory access, translating theoretical FLOPs reduction into actual speedups (up to 2.8x in Prefill and 2.0x in Decode).
  2. Efficient & Non-Invasive Training: The framework keeps the backbone LLM (e.g., Qwen-3, Llama-3.1) frozen and only trains the routing modules. By using Gumbel-Softmax for differentiable routing and a Lagrangian-based sparsity penalty, the model converges in just 12 hours on a single 8x A800 node, making it highly practical for industrial deployment.
  3. Context-Aware Adaptability: The router dynamically adjusts the sparsity ratio based on the input context (e.g., higher density for complex retrieval, higher sparsity for simple semantics), maintaining high fidelity while reducing costs.
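The training recipe in highlight 2 (frozen backbone, differentiable routing via Gumbel-Softmax, Lagrangian-style sparsity budget) can be sketched as follows. The function names and the exact penalty form are assumptions for illustration; only the router parameters and the multiplier would receive gradients:

```python
import torch
import torch.nn.functional as F

def route_gumbel(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Straight-through Gumbel-Softmax over [full, sparse] logits,
    one row per layer: hard one-hot decisions in the forward pass,
    smooth gradients in the backward pass."""
    return F.gumbel_softmax(logits, tau=tau, hard=True)

def sparsity_penalty(route_probs: torch.Tensor,
                     target: float,
                     lam: torch.Tensor) -> torch.Tensor:
    """Hypothetical Lagrangian-style term: penalize deviation of the
    mean sparse-routing rate from a target budget. `lam` is a
    learnable multiplier (updated by gradient ascent in practice)."""
    sparse_rate = route_probs[:, 1].mean()  # fraction of layers routed to SA
    return lam * (target - sparse_rate)

# Usage sketch: route 4 layers, then add the budget penalty to the task loss.
logits = torch.zeros(4, 2, requires_grad=True)   # one [FA, SA] pair per layer
decisions = route_gumbel(logits)                 # (4, 2) one-hot rows
penalty = sparsity_penalty(decisions, target=0.5, lam=torch.tensor(1.0))
```

Keeping the backbone frozen and training only the routers (plus the multiplier) is what makes the 12-hour, single-node budget plausible: the trainable parameter count is tiny relative to the LLM.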

【Limitations & Critiques】

  • Coarse-grained Trade-off: By forcing an entire layer to be either FA or SA, the model might sacrifice the functional heterogeneity of individual heads within a layer. There may be a slight performance ceiling compared to an "ideal" fine-grained routing strategy.
  • Requirement for Preliminary Tuning on OOD Tasks: For Out-of-Distribution (OOD) tasks not covered in the training set, the framework requires re-running preliminary experiments to identify the task type and calibrate appropriate budget targets, which may limit its immediate "plug-and-play" adaptability to entirely new domains.


Get this paper in your agent:

hf papers read 2604.07394
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 9

Browse 9 models citing this paper

Datasets citing this paper 0

No datasets linking to this paper

Cite arxiv.org/abs/2604.07394 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Spaces linking to this paper

Cite arxiv.org/abs/2604.07394 in a Space README.md to link it from this page.

Collections including this paper 2