Paper Discussion: Training Language Models to Self-Correct via RL
Details
This recent paper from DeepMind explores the self-correction ability of large language models (LLMs). The authors propose SCoRe, a multi-turn online reinforcement learning (RL) approach that improves an LLM's self-correction using entirely self-generated data. Traditional methods for training self-correction often require multiple models or external supervision and have shown limited effectiveness. SCoRe improves on this by addressing the limitations of supervised fine-tuning (SFT), which often suffers from a mismatch between the training data and the model's own response distribution, or produces correction behaviors that fail at test time.
SCoRe trains the model on its own distribution of self-generated correction traces, applying regularization to guide it toward an effective self-correction strategy. The approach involves two key phases: an initial RL phase that produces a stable policy initialization, followed by multi-turn RL with reward bonuses that amplify self-correction during training. Applied to Gemini 1.0 Pro and 1.5 Flash models, SCoRe significantly improves self-correction performance, achieving state-of-the-art results with a 15.6% improvement on the MATH benchmark and a 9.1% improvement on the HumanEval benchmark.
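To make the reward-bonus idea concrete, below is a minimal Python sketch of how a progress bonus on the second attempt could be shaped. The verifier `is_correct`, the coefficient `alpha`, and the exact form of the bonus are illustrative assumptions for this discussion, not the paper's actual implementation.

```python
# Sketch of a shaped reward for the self-correction turn: the revised answer
# earns the task reward plus a bonus proportional to its improvement over the
# first attempt. Names and values here are hypothetical, not from the paper.

def is_correct(answer: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 if the answer matches the reference, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def shaped_reward(first_attempt: str, second_attempt: str,
                  reference: str, alpha: float = 0.5) -> float:
    """Reward for the second (corrected) attempt with a progress bonus."""
    r1 = is_correct(first_attempt, reference)
    r2 = is_correct(second_attempt, reference)
    # Bonus is positive if the revision fixes a mistake,
    # negative if it breaks an answer that was already correct.
    progress_bonus = alpha * (r2 - r1)
    return r2 + progress_bonus

# Example: the first attempt was wrong, the revision is right.
print(shaped_reward("41", "42", "42"))  # 1.0 + 0.5 = 1.5
```

The intent of such shaping is to reward genuine correction rather than simply producing a good answer twice, which is one way to discourage the model from collapsing into making no meaningful edits between turns.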
paper: https://arxiv.org/abs/2409.12917