Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation Paper • 2603.04971 • Published 2 days ago
NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time Paper • 2408.03675 • Published Aug 7, 2024
DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion Paper • 2406.06567 • Published Jun 3, 2024
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking Paper • 2502.13842 • Published Feb 19, 2025
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Paper • 2502.07374 • Published Feb 11, 2025