RLHF: Research Digest
Literature Digest: Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning large language models (LLMs) with human preferences, using human‑labeled comparison data to train a reward model and then fine‑tuning a policy via reinforcement learning. Recent work spans theory, empirical practice, and safety‑oriented variants, revealing both the power and the limits of RLHF for AI alignment.
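The reward-model step described above can be illustrated with a minimal sketch. It assumes a Bradley-Terry preference model, a common choice that this digest does not itself specify, and it takes scalar rewards as given rather than computing them with a neural network. The function names are hypothetical and chosen only for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bradley_terry_prob(r_chosen: float, r_rejected: float) -> float:
    """Probability that the chosen response is preferred (assumed Bradley-Terry model)."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))

def pairwise_loss(pairs) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    return -sum(log_sigmoid(c - r) for c, r in pairs) / len(pairs)

if __name__ == "__main__":
    # Hypothetical reward-model scores for three labeled comparisons.
    scores = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(pairwise_loss(scores))
```

Minimizing this loss pushes the reward of the human-preferred response above that of the rejected one. In practice the rewards would come from a learned model, and the resulting reward signal would then drive the policy fine-tuning stage.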
Foundations and surveys
“Open problems and fundamental limitations of reinforcement learning from human feedback” systematizes RLHF’s shortcomings, organizing them into three stages: collecting human feedback, learning the reward model, and policy optimization, and arguing that many issues are not merely engineering but fundamental limitations. “A survey of reinforcement learning from human feedback” similarly reviews the standard pipeline—learning a reward model from human feedback and then using it to train a policy—while noting that direct policy optimization from feedback is also possible. “RLHF Deciphered: A critical analysis of reinforcement learning from human feedback for LLMs” provides a principled RL‑centric analysis of how RLHF operates in practice, highlighting gaps between textbook RL assumptions and real‑world LLM deployment.
Safety, scaling, and AI feedback
“Training a helpful and harmless assistant with reinforcement learning from human feedback” demonstrates how RLHF can be deployed iteratively, updating preference models and policies on a weekly cadence with fresh preference data to improve helpfulness and harmlessness. “Safe RLHF: Safe reinforcement learning from human feedback” proposes an explicit safety‑aware RLHF algorithm that constrains optimization to avoid unsafe behaviors. “RLAIF: Scaling reinforcement learning from human feedback with AI feedback” and “RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback” explore replacing human annotators with AI‑generated feedback. They report that policies trained with AI feedback are rated by human evaluators as comparable to RLHF‑trained policies, while reward models trained on human feedback still outperform those trained on AI feedback when evaluated against held‑out human preferences.
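One generic way to constrain optimization as described above is a Lagrangian relaxation, in which a multiplier trades reward off against a safety cost. The sketch below is an illustration under assumptions, not the Safe RLHF algorithm as published: the scalar `reward` and `cost` inputs, the `cost_limit` threshold, and the learning rate are all hypothetical.

```python
def lagrangian_objective(reward: float, cost: float, lam: float, cost_limit: float) -> float:
    """Penalized objective: reward minus lam times the constraint violation."""
    return reward - lam * (cost - cost_limit)

def update_multiplier(lam: float, cost: float, cost_limit: float, lr: float = 0.1) -> float:
    """Dual ascent step: raise lam when cost exceeds the limit, clipped at zero."""
    return max(0.0, lam + lr * (cost - cost_limit))

if __name__ == "__main__":
    lam = 0.0
    # Hypothetical average safety costs observed over successive policy updates.
    for observed_cost in [1.5, 1.0, 0.4, -0.2]:
        lam = update_multiplier(lam, observed_cost, cost_limit=0.0)
        print(lam, lagrangian_objective(1.0, observed_cost, lam, cost_limit=0.0))
```

The multiplier grows while the policy violates the cost limit, which makes unsafe behavior progressively more expensive, and it shrinks back toward zero once the constraint is satisfied.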
Open problems
- Scalable, high‑quality human feedback collection and representativeness (Open problems and fundamental limitations of RLHF).
- Reward model misspecification and over‑optimization risks (Open problems and fundamental limitations of RLHF; RLHF Deciphered).
- Safety constraints and value alignment under RLHF optimization (Safe RLHF; Open problems and fundamental limitations of RLHF).
- Theoretical limits and sample efficiency when feedback is pairwise or k‑wise (Principled RL with human feedback).
- Performance and reliability gaps between human‑ and AI‑generated feedback (RLAIF vs. RLHF; RLAIF: Scaling RLHF with AI feedback).
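For the pairwise or k‑wise feedback item in the list above, a standard way to model a ranking of k responses is the Plackett-Luce model. The sketch below assumes that model and scalar rewards listed best first; the function name is hypothetical. For k = 2 it reduces to the pairwise logistic (Bradley-Terry) form.

```python
import math

def plackett_luce_log_likelihood(rewards_ranked) -> float:
    """Log-probability of a ranking (best first) under an assumed Plackett-Luce model."""
    total = 0.0
    # The last position contributes log(1) = 0, so it is skipped.
    for i in range(len(rewards_ranked) - 1):
        remaining = rewards_ranked[i:]
        m = max(remaining)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(r - m) for r in remaining))
        total += rewards_ranked[i] - log_norm
    return total

if __name__ == "__main__":
    # Hypothetical rewards for three responses, listed in the annotator's ranked order.
    print(plackett_luce_log_likelihood([2.0, 0.5, -1.0]))
```

Each term is the log-probability of picking the top-ranked item among those still remaining, so a reward model that agrees with the annotator's ordering receives a higher likelihood.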
Key papers
- Open problems and fundamental limitations of reinforcement learning from human feedback — S Casper, X Davies, C Shi, TK Gilbert…
- Training a helpful and harmless assistant with reinforcement learning from human feedback — Y Bai, A Jones, K Ndousse, A Askell, A Chen…
- A survey of reinforcement learning from human feedback — T Kaufmann, P Weng, V Bengs…
- Principled reinforcement learning with human feedback from pairwise or k-wise comparisons — B Zhu, M Jordan, J Jiao
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback — H Lee, S Phatale, H Mansoor, KR Lu, T Mesnard…
- Safe RLHF: Safe reinforcement learning from human feedback — J Dai, X Pan, R Sun, J Ji, X Xu, M Liu…
- RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback — H Lee, S Phatale, H Mansoor, T Mesnard…
- RLHF Deciphered: A critical analysis of reinforcement learning from human feedback for LLMs — S Chaudhari, P Aggarwal, V Murahari…
- A minimaximalist approach to reinforcement learning from human feedback — G Swamy, C Dann, R Kidambi, ZS Wu…
- Policy shaping: Integrating human feedback with reinforcement learning — S Griffith, K Subramanian, J Scholz…
Papers via the AISA Scholar API; synthesis by the AISA LLM layer. 2026-06-23.