RADLADS (Rapid Attention Distillation to Linear Attention Decoders at Scale) presents a protocol for converting traditional softmax attention transformers into linear attention decoder models. The method dramatically reduces the required training compute, using only 350-700M tokens (less than 0.005% of the tokens used to train the original models) while maintaining near-original quality. The paper introduces two new RWKV-variant architectures and successfully converts popular Qwen2.5 models at the 7B, 32B, and 72B parameter scales.
Linear attention transformer variants have begun matching or exceeding traditional softmax attention transformers on quality metrics. This matters for long-sequence inference, since linear attention runs in O(1) time per token compared to O(N) for softmax transformers. However, training large models from scratch is prohibitively expensive, often requiring over ten trillion tokens.
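The per-token cost difference comes from how history is stored. Below is a minimal sketch (not from the paper) of an unnormalized linear-attention decoding step, which folds all past tokens into a fixed-size state instead of attending over a growing key-value cache:

```python
import torch

d = 64                                 # head dimension (illustrative)
state = torch.zeros(d, d)              # fixed-size recurrent state

def linear_attention_step(q, k, v, state):
    """One decoding step: fold the new token into the state, then read it."""
    state = state + torch.outer(v, k)  # accumulate key-value outer product
    out = state @ q                    # O(d^2) work, independent of sequence length
    return out, state

q, k, v = (torch.randn(d) for _ in range(3))
out, state = linear_attention_step(q, k, v, state)
```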
Previous conversion attempts include early work by Gerstenberger et al. (2020) that required extensive training, T2R by Kasai et al. (2021) using 3.5B tokens, and more recent projects such as SUPRA, DiJiang, and XATL that require 100B+ tokens. More efficient methods like LOLCats and MOHAWK use two-phase approaches but still have limitations.
RADLADS builds upon these foundations, combining the best practices while achieving state-of-the-art performance with minimal training requirements. The conversion protocol consists of three main steps.
In the setup phase, the attention projection weights (Wq, Wk, Wv, Wo) from the teacher model are transferred to the student model. Student parameters with a direct teacher equivalent are initialized from the teacher's weights, while the remaining parameters use standard pre-training initialization.
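A hedged sketch of this weight transfer, assuming conventional module names (attn.q_proj, mix.receptance, and so on) rather than the released code:

```python
import torch

@torch.no_grad()
def transfer_attention_weights(teacher_layer, student_layer):
    """Copy the teacher's attention projections into the student's
    sequence-mixing layer; attribute names here are illustrative."""
    student_layer.mix.receptance.weight.copy_(teacher_layer.attn.q_proj.weight)
    student_layer.mix.key.weight.copy_(teacher_layer.attn.k_proj.weight)
    student_layer.mix.value.weight.copy_(teacher_layer.attn.v_proj.weight)
    student_layer.mix.output.weight.copy_(teacher_layer.attn.o_proj.weight)
    # Parameters with no teacher equivalent (decay, gate, token shift, ...)
    # keep their standard pre-training initialization.
```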
Step 1 performs Attention Hidden State Alignment using 100M tokens. Each student sequence-mixing layer is trained to approximate the output hidden states of the corresponding teacher attention layer, using an L2 distance loss between student and teacher outputs, a sequence length of 512, and a learning rate cosine-annealed from 1e-3 to 1e-5.
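A minimal sketch of the per-layer alignment objective and schedule, assuming hooked activations from both models (the names here are illustrative):

```python
import torch
import torch.nn.functional as F

def hidden_state_alignment_loss(student_out, teacher_out):
    """L2-style loss between a student mixing layer's output and the frozen
    teacher attention layer's output at the same depth."""
    return F.mse_loss(student_out, teacher_out.detach())

# Schedule from the text: learning rate cosine-annealed from 1e-3 to 1e-5.
student_mix = torch.nn.Linear(64, 64)   # stand-in for one sequence-mixing layer
optimizer = torch.optim.AdamW(student_mix.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000, eta_min=1e-5)
```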
Step 2 performs Knowledge Distillation with 250-700M tokens. The complete student model is trained to approximate the teacher model's logits using a Kullback-Leibler divergence loss, a flat learning rate of 1e-5, and a sequence length of 512.
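A hedged sketch of the logit-level distillation loss (the shapes and reduction are assumptions, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student) over the vocabulary, averaged over tokens."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits.detach(), dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```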
Step 3 focuses on Context Length Extension using 100M tokens. The model is fine-tuned on longer sequences (16,384 tokens) with a plain cross-entropy loss to strengthen long-context capabilities; no teacher model is required.
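A minimal sketch of this stage as ordinary next-token training, assuming a causal LM that returns logits of shape (batch, seq_len, vocab):

```python
import torch
import torch.nn.functional as F

def context_extension_loss(student, input_ids):
    """Standard next-token cross-entropy on long sequences; no teacher needed."""
    logits = student(input_ids)                          # (B, T, V)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),     # predictions at position t
        input_ids[:, 1:].reshape(-1))                    # targets are token t+1
```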
The paper introduces two architectural innovations. RAD-RWKV6 ("RADFinch") is a customized RWKV6-C2 variant featuring a Gated Linear Attention kernel without the bonus term, a sigmoid gate, a state-balancing technique that allows removing state normalization, and data-dependent linear interpolation for token shift. RAD-RWKV7 ("RADGoose") is a modified RWKV-7 architecture with token shift removed for faster training and inference, RoPE applied, no bonus mechanism, and a simplified recurrent formulation.
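For intuition, here is a rough sketch of the kind of gated, decayed linear-attention recurrence these architectures build on: a per-channel decay on the state, no per-token bonus term, and a sigmoid output gate. This is an illustration under those assumptions, not the released RAD-RWKV6/7 kernel.

```python
import torch

def gated_linear_attention_step(q, k, v, gate, decay, state):
    """One recurrent step for a single head: decay the state along the key
    dimension, add the new key-value outer product, read with the query,
    then apply a sigmoid output gate."""
    state = state * decay + torch.outer(v, k)   # data-dependent per-channel decay
    out = state @ q                             # no per-token "bonus" term
    return torch.sigmoid(gate) * out, state

d = 64
state = torch.zeros(d, d)
q, k, v, gate = (torch.randn(d) for _ in range(4))
decay = torch.sigmoid(torch.randn(d))           # decay values in (0, 1)
out, state = gated_linear_attention_step(q, k, v, gate, decay, state)
```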
Critical factors for successful conversion include dataset selection (DCLM dataset proved optimal for Qwen models), hyperparameter tuning (high initial learning rate for rapid alignment, then annealing to teacher's final learning rate), architecture choice (strong RNN architectures significantly impact performance), and component testing (some features beneficial for pre-training provide minimal benefit during conversion).
Several approaches did not work effectively. Initial attention score alignment (Step 0) showed no benefits. Skipping Step 1 entirely resulted in much lower performance. De-novo initialization of attention weights performed worse than transferring from the teacher. Freezing model weights during Step 2 significantly reduced performance. Using LoRA for training caused rank reduction issues. Larger batch sizes did not improve convergence speed.
RADLADS achieves state-of-the-art results with QRWKV6-72B-Instruct showing 0.899 MMLU relative score and QRWKV7-7B-Instruct achieving 0.924 MMLU, the highest among pure RNN conversions. The method consistently outperforms other conversion methods across benchmarks and maintains performance comparable to hybrid models containing softmax attention.
Converting to a 72B linear attention model costs less than $2,000 USD at today's prices. Training times on 8x MI300X GPUs are approximately 7.25 hours for 7B models, 32.5 hours for 32B models, and 67.5 hours for 72B models.
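As a back-of-the-envelope check, assuming a rental price of roughly $2-3 per MI300X GPU-hour (an assumption, not a figure from the paper), the quoted 72B training time is consistent with that budget:

```python
# 72B conversion: ~67.5 hours on 8 GPUs = 540 GPU-hours.
hours, gpus = 67.5, 8
for usd_per_gpu_hour in (2.0, 3.0):                       # assumed rental price range
    print(f"${hours * gpus * usd_per_gpu_hour:,.0f}")     # $1,080 to $1,620, under $2,000
```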
RADLADS provides a cost-effective method for converting quadratic softmax attention transformers into linear RNN models with constant memory usage. Benefits include significant energy savings, reduced research costs, enabling rapid testing of new RNN architectures, and producing state-of-the-art open-weight linear attention models.
Limitations include that each architecture requires meticulous testing for RADLADS compatibility, RAD-RWKV7 shows reduced stability at larger scales (32B+), and impact on reasoning models requires further investigation.
Future work includes testing conversion between different architectures, optimizing dataset selection for various model types, further architectural improvements to RAD-RWKV7, and investigating implicit normalization techniques.
All models and code are released under Apache 2.0 license (with Qwen License Agreement for 72B models), enabling widespread adoption and further research in efficient linear attention architectures.