Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) in which the transition function is unknown. When an arbitrary policy π is already being executed and the experiences with the environment have been recorded in a batch D, an RL algorithm can use D to compute a new policy π'. However, the policy computed by traditional RL algorithms may perform worse than π. Our goal is to develop safe RL algorithms, where the agent has high confidence that, given D, the performance of π' is better than the performance of π. To develop sample-efficient and safe RL algorithms, we combine ideas from exploration strategies in RL with a safe policy improvement method.
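To make the safety criterion concrete, the sketch below illustrates one common way such a high-confidence test can be instantiated (this is an illustrative assumption, not the method evaluated in the paper): the returns of the candidate policy π' are estimated from the batch D via importance sampling, and π' is accepted only if a confidence lower bound on that estimate exceeds the behavior policy's estimated return. The function names, the episode format, and the Hoeffding-style bound are all assumptions made for the example.

```python
import numpy as np

def discounted_return(episode, gamma=0.95):
    """Discounted return of one recorded episode: a list of (s, a, r) tuples."""
    return sum((gamma ** t) * r for t, (_, _, r) in enumerate(episode))

def is_return(episode, pi_new, pi_behavior, gamma=0.95):
    """Ordinary importance-sampled return of one episode under pi_new,
    using data collected by pi_behavior (policies as state -> action -> prob dicts)."""
    ratio = 1.0
    for s, a, _ in episode:
        ratio *= pi_new[s][a] / pi_behavior[s][a]
    return ratio * discounted_return(episode, gamma)

def is_safe_improvement(batch, pi_new, pi_behavior, delta=0.05, return_range=1.0):
    """Accept pi_new only if a high-confidence lower bound on its estimated
    performance exceeds the behavior policy's estimated performance.

    The Hoeffding-style bound width below assumes bounded importance-sampled
    returns; in practice tighter or better-suited bounds are used."""
    n = len(batch)
    is_returns = np.array([is_return(ep, pi_new, pi_behavior) for ep in batch])
    behavior_returns = np.array([discounted_return(ep) for ep in batch])
    bound_width = return_range * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    lower_bound = is_returns.mean() - bound_width
    return lower_bound >= behavior_returns.mean()
```

In this spirit, the agent falls back to the behavior policy π whenever the test fails, so with probability at least 1 - delta it never deploys a policy worse than π.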

Citation

  Simão, T. D., & Spaan, M. T. J. (2018). An Empirical Evaluation of Safe Policy Improvement in Factored Environments. In ICML / IJCAI / AAMAS 2018 Workshop on Planning and Learning (PAL-18).

@inproceedings{Simao2018,
  author = {Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J.},
  title = {{An Empirical Evaluation of Safe Policy Improvement in Factored Environments}},
  booktitle = {{ICML / IJCAI / AAMAS 2018 Workshop on Planning and Learning (PAL-18)}},
  year = {2018}
}