DeepSeek-V3 is a powerful Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Moreover, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for better performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse, high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully utilize its capabilities. Comprehensive evaluations show that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, the full training of DeepSeek-V3 requires only 2.788M H800 GPU hours. In addition, its training process is remarkably stable: throughout the entire training run, we did not encounter any unrecoverable loss spikes or perform any rollbacks.
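The auxiliary-loss-free load-balancing strategy mentioned above can be pictured as a small routing loop: instead of adding a balance term to the loss, each expert carries a bias that is added to the routing scores only when selecting the top-k experts, and that bias is nudged down for overloaded experts and up for underloaded ones. The sketch below is a toy illustration of that idea, not the report's implementation; the expert counts, sigmoid gating, and `update_rate` step size are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code) of bias-based,
# auxiliary-loss-free load balancing for MoE routing.
import numpy as np

rng = np.random.default_rng(0)

num_experts, top_k = 8, 2          # assumed toy sizes
d_model, num_tokens = 16, 1024     # assumed toy sizes

gate_w = rng.normal(size=(d_model, num_experts))   # router projection
bias = np.zeros(num_experts)                       # per-expert routing bias
update_rate = 0.01                                 # assumed bias step size

def route(tokens):
    """Pick top-k experts per token; the bias steers selection only,
    while the raw affinities would still be used for gating weights."""
    scores = tokens @ gate_w                       # (tokens, experts)
    affinity = 1.0 / (1.0 + np.exp(-scores))       # sigmoid affinities
    topk = np.argsort(affinity + bias, axis=1)[:, -top_k:]
    return topk, affinity

def update_bias(topk):
    """Nudge biases toward balanced load: overloaded experts get a
    lower bias, underloaded ones a higher bias (no auxiliary loss)."""
    global bias
    load = np.bincount(topk.ravel(), minlength=num_experts)
    bias -= update_rate * np.sign(load - load.mean())

tokens = rng.normal(size=(num_tokens, d_model))
for _ in range(50):                                # a few routing steps
    topk, _ = route(tokens)
    update_bias(topk)

print("per-expert load after balancing:",
      np.bincount(topk.ravel(), minlength=num_experts))
```

Running the loop shows the per-expert token counts drifting toward an even split, which is the balancing effect the strategy targets without any extra loss term.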
