In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We build on recent results that use Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains by exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions with an extension of Max-Plus (or Variable Elimination) that constrains local actions so that the safety criteria are guaranteed. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods fail.
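
To illustrate the joint-action selection step, below is a minimal Variable Elimination sketch over a coordination graph whose payoff is a sum of local factors (e.g., local Q-components), with each agent restricted to a set of "allowed" local actions standing in for the paper's SPIBB-style safety constraint. This is not the authors' implementation; the function name, the factor representation, and the `allowed` sets are assumptions made for illustration.

from itertools import product

def safe_variable_elimination(allowed, factors, order):
    """Maximize a sum of local payoff factors over joint actions,
    restricting every agent to its safe local-action set.

    allowed: {agent: set of safe local actions}  # hypothetical safety constraint
    factors: list of (scope_tuple, table), where table maps a tuple of
             local actions (aligned with scope_tuple) to a float payoff
    order:   elimination order over agents
    """
    factors = [(tuple(s), dict(t)) for s, t in factors]
    best_response = {}  # agent -> (conditioning scope, best local action per assignment)

    for agent in order:
        involved = [f for f in factors if agent in f[0]]
        factors = [f for f in factors if agent not in f[0]]
        # Scope of the new factor: the agent's not-yet-eliminated neighbours.
        new_scope = tuple(sorted({v for s, _ in involved for v in s if v != agent},
                                 key=order.index))
        new_table, responses = {}, {}
        for cond in product(*(sorted(allowed[v]) for v in new_scope)):
            assign = dict(zip(new_scope, cond))
            best_val, best_act = float('-inf'), None
            for a in sorted(allowed[agent]):  # only safe local actions considered
                assign[agent] = a
                val = sum(t[tuple(assign[v] for v in s)] for s, t in involved)
                if val > best_val:
                    best_val, best_act = val, a
            new_table[cond] = best_val
            responses[cond] = best_act
        factors.append((new_scope, new_table))
        best_response[agent] = (new_scope, responses)

    # Backtrack in reverse elimination order to recover the joint action.
    joint = {}
    for agent in reversed(order):
        scope, responses = best_response[agent]
        joint[agent] = responses[tuple(joint[v] for v in scope)]
    return joint

# Usage on a 3-agent chain with pairwise payoffs; agent 1's action 0 is
# deemed unsafe (e.g., insufficiently supported by the dataset).
acts = [0, 1]
q12 = {(a, b): float(a == b) for a in acts for b in acts}
q23 = {(b, c): float(b != c) for b in acts for c in acts}
joint = safe_variable_elimination(
    allowed={1: {1}, 2: {0, 1}, 3: {0, 1}},
    factors=[((1, 2), q12), ((2, 3), q23)],
    order=[1, 2, 3],
)  # -> {1: 1, 2: 1, 3: 0}

Because the constraint is enforced per agent during elimination, the returned joint action is the exact maximizer over the product of safe local-action sets; the same restriction can be applied to the message updates of Max-Plus when approximate, anytime selection is preferred on cyclic graphs.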

Citation

  Bianchi, F., Zorzi, E., Castellini, A., Simão, T. D., Spaan, M. T. J., & Farinelli, A. (2024). Scalable Safe Policy Improvement for Factored Multi-Agent MDPs. ICML, 3952–3973.

@inproceedings{Bianchi2024scalable,
  author = {Bianchi, Federico and Zorzi, Edoardo and Castellini, Alberto and Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  title = {Scalable Safe Policy Improvement for Factored Multi-Agent {MDP}s},
  booktitle = {ICML},
  pages = {3952--3973},
  year = {2024}
}