RL^V: Unifying Reasoning and Verification in Language Models through Value-Free Reinforcement Learning
LLMs have gained strong reasoning capabilities through reinforcement learning (RL) on correctness rewards. Modern RL algorithms for LLMs, including ...
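The sketch below illustrates what "RL on correctness rewards" and "value-free" can mean in practice. It is an assumption-laden illustration, not the paper's method. It assumes a binary exact-match reward and a GRPO-style, group-relative advantage. In that scheme, each sampled answer's reward is compared against the empirical mean of the other samples for the same prompt, so no learned value (critic) network is needed. The function names `correctness_reward` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def correctness_reward(answer: str, reference: str) -> float:
    """Binary correctness reward (hypothetical exact-match grader).

    Returns 1.0 if the answer matches the reference after trimming
    whitespace, else 0.0. Real graders usually normalize answers further.
    """
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Value-free advantage estimate in the style of GRPO (an assumption here).

    Each sample's advantage is its reward minus the group's empirical mean,
    divided by the group's population standard deviation. No learned critic
    is involved. If every sample has the same reward, all advantages are 0.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "42"
    samples = ["42", "41", "42", "7"]  # several sampled answers to one prompt
    rewards = [correctness_reward(s, reference) for s in samples]
    advantages = group_relative_advantages(rewards)
    print(rewards)                             # [1.0, 0.0, 1.0, 0.0]
    print([round(a, 3) for a in advantages])   # [1.0, -1.0, 1.0, -1.0]
```

Correct samples receive positive advantages and incorrect ones receive negative advantages. The empirical group statistics stand in for the baseline that a value network would otherwise have to learn.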