
Paper Discussion: Training Language Models to Self-Correct via RL

Hosted By
Zhengjie W. and 3 others

Details

This recent paper from Google DeepMind explores the ability of LLMs to self-correct. It proposes SCoRe, a multi-turn online reinforcement learning (RL) approach designed to improve the self-correction ability of large language models (LLMs) using entirely self-generated data. Traditional approaches to training self-correction often require multiple models or external supervision and have shown limited effectiveness. SCoRe improves on this by addressing the limitations of supervised fine-tuning (SFT), which suffers from a mismatch between the training data and the model's own response distribution and often produces correction behaviour that is ineffective at test time.

SCoRe trains the model on its own distribution of self-generated correction traces and applies regularization to steer it toward an effective self-correction strategy rather than a collapsed one. The approach has two stages: an initial RL phase that produces a policy initialization less prone to collapse, followed by RL with a reward bonus that amplifies self-correction during training. Applied to Gemini 1.0 Pro and 1.5 Flash, SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% on the MATH benchmark and 9.1% on HumanEval.
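
As a talking point for the session, here is a rough Python sketch of the reward-shaping idea behind SCoRe's second stage. It is our own illustration, not code from the paper: `is_correct`, `two_turn_rewards`, and the bonus weight `alpha` are placeholder names, and the real method applies this shaping inside a multi-turn policy-gradient loop rather than as a standalone function.

```python
# Rough sketch (our illustration, not the paper's code) of SCoRe-style
# two-turn reward shaping: the second attempt earns the task reward plus
# a bonus proportional to its improvement over the first attempt.

def is_correct(answer: str, reference: str) -> bool:
    # Placeholder verifier, e.g. exact match against a reference solution.
    return answer.strip() == reference.strip()

def two_turn_rewards(first_attempt: str, second_attempt: str,
                     reference: str, alpha: float = 1.0):
    """Return shaped rewards for a (first attempt, revised attempt) pair."""
    r1 = float(is_correct(first_attempt, reference))
    r2 = float(is_correct(second_attempt, reference))
    # The bonus pays extra when an incorrect first attempt is fixed and
    # penalises degrading a correct one, discouraging the policy from
    # simply repeating its first answer at turn two.
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus

# Toy usage: the model fixes its mistake on the second attempt.
print(two_turn_rewards("41", "42", reference="42"))  # (0.0, 2.0)
```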

Paper: https://arxiv.org/abs/2409.12917

Canberra Deep Learning Meetup
Level 3, 44 Sydney Ave · Forrest
FREE