Offline Reinforcement Learning (RL) allows policies to be trained on pre-collected datasets without further interaction with the environment. This bypasses real-time data acquisition, which is often impractical in real-world applications because of the safety risks inherent in online learning. However, offline RL faces significant challenges, such as distributional shift and extrapolation error, and the resulting policies may underperform the baseline policy that collected the data. Safe policy improvement algorithms mitigate these issues, enabling the reliable deployment of RL in real-world scenarios where historical data is available by guaranteeing, with high probability, that policy changes will not degrade performance relative to the baseline. In this paper, we propose MCTS-SPIBB, an algorithm that leverages Monte Carlo Tree Search (MCTS) to scale safe policy improvement to large domains. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations increases, to the optimal safely improved policy produced by Safe Policy Improvement with Baseline Bootstrapping (SPIBB). Additionally, we introduce SDP-SPIBB, a novel extension of SPIBB that addresses the scalability limitations of the standard algorithm via Scalable Dynamic Programming. Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments.
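To give a concrete sense of the baseline-bootstrapping idea behind SPIBB, the sketch below shows one greedy policy-improvement step for a single state: actions whose visit count in the dataset falls below a threshold keep the baseline policy's probability, while the remaining mass is assigned to the best well-sampled action according to the estimated Q-values. This is an illustrative simplification, not the paper's full algorithm; the function name, `n_wedge` threshold, and NumPy representation are our own choices.

```python
import numpy as np

def spibb_greedy_step(q, pi_b, counts, n_wedge):
    """One simplified SPIBB-style improvement step for a single state.

    q       : estimated Q-values for each action (1-D array)
    pi_b    : baseline policy probabilities for this state (1-D array)
    counts  : dataset visit counts for each (state, action) pair
    n_wedge : minimum count below which an action is "bootstrapped",
              i.e. kept at the baseline probability
    """
    pi = np.zeros_like(pi_b)
    uncertain = counts < n_wedge
    pi[uncertain] = pi_b[uncertain]      # rare actions: copy the baseline
    free_mass = 1.0 - pi.sum()           # mass available for improvement
    if np.any(~uncertain):
        # best well-sampled action receives all remaining probability
        best = np.argmax(np.where(uncertain, -np.inf, q))
        pi[best] += free_mass
    else:
        pi = pi_b.copy()                 # nothing well-sampled: fall back
    return pi

# Example: action 1 is under-sampled, so it keeps its baseline mass (0.3);
# the leftover 0.7 goes to the better of the well-sampled actions 0 and 2.
pi = spibb_greedy_step(
    q=np.array([1.0, 2.0, 0.5]),
    pi_b=np.array([0.5, 0.3, 0.2]),
    counts=np.array([10, 2, 10]),
    n_wedge=5,
)
```

Because bootstrapped actions never deviate from the baseline, the improved policy's behavior on poorly covered parts of the dataset matches the baseline's, which is the mechanism underlying SPIBB's safety guarantee.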

Citation

  Bianchi, F., Zorzi, E., Castellini, A., Simão, T. D., Spaan, M. T. J., & Farinelli, A. (2025). Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies. Journal of Artificial Intelligence Research (JAIR). https://doi.org/10.1613/jair.1.19649

@article{Bianchi2025scaling,
  author = {Bianchi, Federico and Zorzi, Edoardo and Castellini, Alberto and Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  title = {Scaling Safe Policy Improvement: {M}onte {C}arlo Tree Search and Policy Iteration Strategies},
  journal = {Journal of Artificial Intelligence Research},
  year = {2025},
  url = {https://doi.org/10.1613/jair.1.19649}
}