
Milestone Papers
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
(2023-05) DPO by Stanford
(2024-05) DeepSeek-V2 by DeepSeek