You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
Abstract
Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse m…