
The integration of Reinforcement Learning (RL), particularly through the lens of the Bayesian paradigm, represents a pivotal shift from static, data-driven training to dynamic, decision-theoretic frameworks for enhancing cognitive capabilities in Large Language Models (LLMs) and Generative AI. While supervised fine-tuning has endowed these models with remarkable linguistic fluency, it falls short in cultivating the goal-directed, adaptive reasoning required for complex, multi-step tasks. This survey charts the evolution from supervised methods to RL, which reframes generation as a sequential decision-making problem, enabling optimisation against outcome-based rewards. However, standard RL introduces its own challenges related to exploration, stability, and reward model fidelity. This survey argues that the Bayesian paradigm offers a principled and mathematically coherent foundation for addressing these challenges. We provide a comprehensive review of the theoretical role of Bayesian inference in interpreting LLM behaviour, particularly in resolving the paradox between their implicit Bayesian learning capabilities and their violation of the martingale property. We then examine the practical application of Bayesian methods in reward modelling to mitigate over-optimisation and in Bayes-Adaptive RL to foster reflective, uncertainty-aware exploration. The critical implications for two key areas, uncertainty quantification and safe exploration, are explored in detail, revealing a deep convergence between the two fields. This survey's primary contribution is a unified synthesis of these disparate research threads, concluding that the Bayesian approach provides a robust framework for developing more adaptive, reliable, and cognitively sophisticated AI systems.
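For reference, the martingale property mentioned above can be stated in standard notation (the notation here is illustrative, not necessarily the survey's own). Writing $p(\cdot \mid y_{1:n})$ for a model's one-step-ahead predictive distribution after observing $y_1, \dots, y_n$, an exact Bayesian learner over an exchangeable sequence satisfies

\[
p(y_{n+2} = y \mid y_{1:n})
= \mathbb{E}_{y_{n+1} \sim p(\cdot \mid y_{1:n})}
\bigl[\, p(y_{n+2} = y \mid y_{1:n}, y_{n+1}) \,\bigr],
\]

that is, averaging the updated predictive over the model's own next-observation distribution must recover the current predictive. An LLM whose in-context predictions fail this identity violates the martingale property, even while exhibiting implicit Bayesian learning behaviour in other respects.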
To request access, please email:
escolareslcd@iimas.unam.mx
Speaker
Juan Carlos Martínez Ovando
Master Expert Data Scientist at BBVA