Thiago D. Simão | publications

2026

IJCAI
Missingness-MDPs: Bridging the Theory of Missing Data and POMDPs

Wendland, Joshua, Zubia, Markel, Andriushchenko, Roman, Galesloot, Maris, Ceska, Milan, von Kleist, Henrik, Simão, Thiago D., Weininger, Maximilian, and Jansen, Nils

In IJCAI 2026

We introduce missingness-MDPs (miss-MDPs), a novel subclass of partially observable Markov decision processes (POMDPs) that incorporates the theory of missing data. A miss-MDP is a POMDP whose observation function is a missingness function, specifying the probability that individual state features are missing (i.e., unobserved) at a time step. The literature distinguishes three canonical missingness types: missing (1) completely at random (MCAR), (2) at random (MAR), and (3) not at random (MNAR). Our planning problem is to compute near-optimal policies for a miss-MDP with an unknown missingness function, given a dataset of action–observation trajectories. Achieving such optimality guarantees for policies requires learning the missingness function from data, which is infeasible for general POMDPs. To overcome this challenge, we exploit the structural properties of different missingness types to derive probably approximately correct (PAC) algorithms for learning the missingness function. These algorithms yield an approximate but fully specified miss-MDP that we solve using off-the-shelf planning methods. We prove that, with high probability, the resulting policies are ε-optimal in the true miss-MDP. Empirical results confirm the theory and demonstrate superior performance of our approach over two model-free POMDP methods.
@inproceedings{Wendland2026missingnessmdps,
  author = {Wendland, Joshua and Zubia, Markel and Andriushchenko, Roman and Galesloot, Maris and Ceska, Milan and von Kleist, Henrik and Sim{\~a}o, Thiago D. and Weininger, Maximilian and Jansen, Nils},
  title = {Missingness-{MDP}s: Bridging the Theory of Missing Data and {POMDP}s},
  booktitle = {IJCAI},
  year = {2026}
}

Perception-Based Beliefs for POMDPs with Visual Observations

Ackermann, Miriam, Krale, Merlijn, Simão, Thiago D., Jansen, Nils, and Weininger, Maximilian

In AAMAS 2026

@inproceedings{Ackermann2026perception,
  author = {Ackermann, Miriam and Krale, Merlijn and Sim\~{a}o, Thiago D. and Jansen, Nils and Weininger, Maximilian},
  title = {Perception-Based Beliefs for {POMDP}s with Visual Observations},
  booktitle = {AAMAS},
  year = {2026}
}

AAMAS

Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response

Bighashdel, Ariyan, Simão, Thiago D., and Oliehoek, Frans A.

In AAMAS 2026

@inproceedings{Bighashdel2026sample,
  author = {Bighashdel, Ariyan and Sim\~{a}o, Thiago D. and Oliehoek, Frans A.},
  title = {Sample-Efficient Policy Space Response Oracles with Joint Experience Best Response},
  booktitle = {AAMAS},
  year = {2026}
}

2025

NeurIPS
On Evaluating Policies for Robust POMDPs

Krale, Merlijn, Bovy, Eline M., Galesloot, Maris, Simão, Thiago D., and Jansen, Nils

In NeurIPS 2025

Robust partially observable Markov decision processes (RPOMDPs) model sequential decision-making problems under partial observability, where an agent must be robust against a range of dynamics. RPOMDPs can be viewed as a two-player game between an agent, who selects actions, and nature, who adversarially selects the dynamics. Evaluating an agent policy requires finding an adversarial nature policy, which is computationally challenging. In this paper, we advance the evaluation of agent policies for RPOMDPs in three ways. First, we discuss suitable benchmarks. We observe that for some RPOMDPs, an optimal agent policy can be found by considering only subsets of nature policies, making them easier to solve. We formalize this concept of solvability and construct three benchmarks that are only solvable for expressive sets of nature policies. Second, we describe a new method to evaluate agent policies for RPOMDPs by solving an equivalent MDP. Third, we lift two well-known upper bounds from POMDPs to RPOMDPs, which can be used to efficiently approximate the optimality gap of a policy and serve as baselines. Our experimental evaluation shows that (1) our proposed benchmarks cannot be solved by assuming naive nature policies, (2) our method of evaluating policies is accurate, and (3) the upper bounds provide solid baselines for evaluation.
@inproceedings{Krale2025onEvaluating,
  author = {Krale, Merlijn and Bovy, Eline M. and Galesloot, Maris and Sim\~{a}o, Thiago D. and Jansen, Nils},
  title = {On Evaluating Policies for Robust {POMDP}s},
  booktitle = {NeurIPS},
  year = {2025}
}
JAIR
Scaling Safe Policy Improvement: Monte Carlo Tree Search and Policy Iteration Strategies

Bianchi, Federico, Zorzi, Edoardo, Castellini, Alberto, Simão, Thiago D., Spaan, Matthijs T. J., and Farinelli, Alessandro

JAIR 2025

Offline Reinforcement Learning (RL) allows policies to be trained on pre-collected datasets without requiring additional interactions with the environment. This approach bypasses the need for real-time data acquisition in real-world applications, which can be impractical due to the safety issues inherent in the learning process. However, offline RL faces significant challenges, such as distributional shifts and extrapolation errors, and the resulting policies might underperform compared to the baseline policy. Safe policy improvement algorithms mitigate these issues, enabling the reliable deployment of RL approaches in real-world scenarios where historical data is available, guaranteeing that any policy changes will not result in worse performance compared to the baseline policy used to collect training data. In this paper, we propose MCTS-SPIBB, an algorithm that leverages Monte Carlo Tree Search (MCTS) for scaling safe policy improvement to large domains. We theoretically prove that the policy generated by MCTS-SPIBB converges to the optimal safely improved policy produced by Safe Policy Improvement with Baseline Bootstrapping (SPIBB) as the number of simulations increases. Additionally, we introduce SDP-SPIBB, a novel extension of SPIBB designed to address the scalability limitations of the standard algorithm via Scalable Dynamic Programming. Our empirical analysis across four benchmark domains demonstrates that MCTS-SPIBB and SDP-SPIBB significantly enhance the scalability of safe policy improvement, providing robust and efficient algorithms for large-scale applications. These contributions represent a significant step towards the deployment of safe RL algorithms in complex real-world environments.
@article{Bianchi2025scaling,
  author = {Bianchi, Federico and Zorzi, Edoardo and Castellini, Alberto and Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  title = {Scaling Safe Policy Improvement: {M}onte {C}arlo Tree Search and Policy Iteration Strategies},
  journal = {{JAIR}},
  year = {2025}
}
ECAI
Pessimistic Iterative Planning with RNNs for Robust POMDPs

Galesloot, Maris, Suilen, Marnix, Simão, Thiago D., Carr, Steven, Spaan, Matthijs T. J., Topcu, Ufuk, and Jansen, Nils

In ECAI 2025

Robust POMDPs extend classical POMDPs to incorporate model uncertainty using so-called uncertainty sets on the transition and observation functions, effectively defining ranges of probabilities. Policies for robust POMDPs must be (1) memory-based to account for partial observability and (2) robust against model uncertainty to account for the worst-case probability instances from the uncertainty sets. To compute such robust memory-based policies, we propose the pessimistic iterative planning (PIP) framework, which alternates between (1) selecting pessimistic POMDPs via worst-case probability instances from the uncertainty sets, and (2) computing finite-state controllers (FSCs) for these pessimistic POMDPs. Within PIP, we propose the RFSCNET algorithm, which optimizes a recurrent neural network to compute the FSCs. The empirical evaluation shows that RFSCNET can compute better-performing robust policies than several baselines and a state-of-the-art robust POMDP solver.
@inproceedings{Galesloot2025pessimistic,
  author = {Galesloot, Maris and Suilen, Marnix and Sim{\~a}o, Thiago D. and Carr, Steven and Spaan, Matthijs T. J. and Topcu, Ufuk and Jansen, Nils},
  title = {Pessimistic Iterative Planning with {RNN}s for Robust {POMDP}s},
  booktitle = {ECAI},
  year = {2025},
  pages = {4823--4831}
}
ICLR
Safety-Prioritizing Curricula for Constrained Reinforcement Learning

Koprulu, Cevahir, Simão, Thiago D., Jansen, Nils, and Topcu, Ufuk

In ICLR 2025

Curriculum learning aims to accelerate reinforcement learning (RL) by generating curricula, i.e., sequences of tasks of increasing difficulty. Although existing curriculum generation approaches provide benefits in sample efficiency, they overlook safety-critical settings where an RL agent must adhere to safety constraints. Thus, these approaches may generate tasks that cause RL agents to violate safety constraints during training and behave suboptimally after. We develop a safe curriculum generation approach (SCG) that aligns the objectives of constrained RL and curriculum learning: improving safety during training and boosting sample efficiency. SCG generates sequences of tasks where the RL agent can be safe and performant by initially generating tasks with minimum safety violations over high-reward ones. We empirically show that compared to the state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the lowest amount of constraint violations during training.
@inproceedings{Koprulu2025safetyPrioritizing,
  title = {Safety-Prioritizing Curricula for Constrained Reinforcement Learning},
  author = {Koprulu, Cevahir and Sim{\~a}o, Thiago D. and Jansen, Nils and Topcu, Ufuk},
  booktitle = {ICLR},
  year = {2025}
}
ICLR
Robust Transfer of Safety-Constrained Reinforcement Learning Agents

Zubia, Markel, Simão, Thiago D., and Jansen, Nils

In ICLR 2025

Reinforcement learning (RL) often relies on trial and error, which may cause undesirable outcomes. As a result, standard RL is inappropriate for safety-critical applications. To address this issue, one may train a safe agent in a controlled environment (where safety violations are allowed) and then transfer it to the real world (where safety violations may have disastrous consequences). Prior work has made this transfer safe as long as the new environment preserves the safety-related dynamics. However, in most practical applications, differences or shifts in dynamics between the two environments are inevitable, potentially leading to safety violations after the transfer. This work aims to guarantee safety even when the new environment has different (safety-related) dynamics. In other words, we aim to make the process of safe transfer robust. Our methodology (1) robustifies an agent in the controlled environment and (2) provably provides, under mild assumptions, a safe transfer to new environments. The empirical evaluation shows that this method yields policies that are robust against changes in dynamics, demonstrating safety after transfer to a new environment.
@inproceedings{Zubia2025robust,
  title = {Robust Transfer of Safety-Constrained Reinforcement Learning Agents},
  author = {Zubia, Markel and Sim{\~a}o, Thiago D. and Jansen, Nils},
  booktitle = {ICLR},
  year = {2025}
}
AAMAS
Tighter Value-Function Approximations for POMDPs

Krale, Merlijn, Koops, Wietze, Junges, Sebastian, Simão, Thiago D., and Jansen, Nils

In AAMAS 2025

Solving partially observable Markov decision processes (POMDPs) typically requires reasoning about the values of exponentially many state beliefs. Towards practical performance, state-of-the-art solvers use value bounds to guide this reasoning. However, sound upper value bounds are often computationally expensive to compute, and there is a tradeoff between the tightness of such bounds and their computational cost. This paper introduces new and provably tighter upper value bounds than the commonly used fast informed bound. Our empirical evaluation shows that, despite their additional computational overhead, the new upper bounds accelerate state-of-the-art POMDP solvers on a wide range of benchmarks.
@inproceedings{Krale2025tighter,
  author = {Krale, Merlijn and Koops, Wietze and Junges, Sebastian and Sim\~{a}o, Thiago D. and Jansen, Nils},
  title = {Tighter Value-Function Approximations for {POMDP}s},
  booktitle = {AAMAS},
  year = {2025},
  pages = {1200--1208}
}

2024

ICML
Scalable Safe Policy Improvement for Factored Multi-Agent MDPs

Bianchi, Federico, Zorzi, Edoardo, Castellini, Alberto, Simão, Thiago D., Spaan, Matthijs T. J., and Farinelli, Alessandro

In ICML 2024

In this work, we focus on safe policy improvement in multi-agent domains where current state-of-the-art methods cannot be effectively applied because of large state and action spaces. We consider recent results using Monte Carlo Tree Search for Safe Policy Improvement with Baseline Bootstrapping and propose a novel algorithm that scales this approach to multi-agent domains, exploiting the factorization of the transition model and value function. Given a centralized behavior policy and a dataset of trajectories, our algorithm generates an improved policy by selecting joint actions using a novel extension of Max-Plus (or Variable Elimination) that constrains local actions to guarantee safety criteria. An empirical evaluation on multi-agent SysAdmin and multi-UAV Delivery shows that the approach scales to very large domains where state-of-the-art methods cannot work.
@inproceedings{Bianchi2024scalable,
  author = {Bianchi, Federico and Zorzi, Edoardo and Castellini, Alberto and Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J. and Farinelli, Alessandro},
  title = {Scalable Safe Policy Improvement for Factored Multi-Agent {MDP}s},
  booktitle = {ICML},
  pages = {3952--3973},
  year = {2024}
}
AAAI
Factored Online Planning in Many-Agent POMDPs

Galesloot, Maris, Simão, Thiago D., Junges, Sebastian, and Jansen, Nils

In AAAI 2024

In centralized multi-agent systems, often modeled as multi-agent partially observable Markov decision processes (MPOMDPs), the action and observation spaces grow exponentially with the number of agents, making the value and belief estimation of single-agent online planning ineffective. Prior work partially tackles value estimation by exploiting the inherent structure of multi-agent settings via so-called coordination graphs. Additionally, belief estimation has been improved by incorporating the likelihood of observations into the approximation. However, the challenges of value estimation and belief estimation have only been tackled individually, which prevents existing methods from scaling to many agents. Therefore, we address these challenges simultaneously. First, we introduce weighted particle filtering to a sample-based online planner for MPOMDPs. Second, we present a scalable approximation of the belief. Third, we bring an approach that exploits the typical locality of agent interactions to novel online planning algorithms for MPOMDPs operating on a so-called sparse particle filter tree. Our experimental evaluation against several state-of-the-art baselines shows that our methods (1) are competitive in settings with only a few agents and (2) improve over the baselines in the presence of many agents.
@inproceedings{Galesloot2024factored,
  author = {Galesloot, Maris and Sim\~{a}o, Thiago D. and Junges, Sebastian and Jansen, Nils},
  title = {Factored Online Planning in Many-Agent {POMDP}s},
  booktitle = {AAAI},
  year = {2024},
  pages = {17407--17415}
}
AAAI
Robust Active Measuring under Model Uncertainty

Krale, Merlijn, Simão, Thiago D., Tumova, Jana, and Jansen, Nils

In AAAI 2024

Partial observability and uncertainty are common problems in sequential decision-making that particularly impede the use of formal models such as Markov decision processes (MDPs). However, in practice, agents may be able to employ costly sensors to measure their environment and resolve partial observability by gathering information. Moreover, imprecise transition functions can capture model uncertainty. We combine these concepts and extend MDPs to robust active-measuring MDPs (RAM-MDPs). We present an active-measure heuristic to solve RAM-MDPs efficiently and show that model uncertainty can, counterintuitively, let agents take fewer measurements. We propose a method to counteract this behavior while only incurring a bounded additional cost. We empirically compare our methods to several baselines and show their superior scalability and performance.
@inproceedings{Krale2024robust,
  author = {Krale, Merlijn and Sim\~{a}o, Thiago D. and Tumova, Jana and Jansen, Nils},
  title = {Robust Active Measuring under Model Uncertainty},
  booktitle = {AAAI},
  year = {2024},
  pages = {21276--21284}
}

ICAART

A Supervised Learning Approach to Robust Reinforcement Learning for Job Shop Scheduling

Schmidl, Christoph, Simão, Thiago D., and Jansen, Nils

In ICAART 2024

@inproceedings{Schmidl2024supervised,
  author = {Schmidl, Christoph and Sim{\~{a}}o, Thiago D. and Jansen, Nils},
  title = {A Supervised Learning Approach to Robust Reinforcement Learning for Job Shop Scheduling},
  booktitle = {ICAART},
  year = {2024},
  pages = {1324--1335}
}

PHME
Maintenance Strategies for Sewer Pipes with Multi-State Degradation and Deep Reinforcement Learning

Jimenez-Roa, Lisandro A., Simão, Thiago D., Bukhsh, Zaharah, Tinga, Tiedo, Molegraaf, Hajo, Jansen, Nils, and Stoelinga, Marielle

In PHME 2024

Large-scale infrastructure systems are crucial for societal welfare, and their effective management requires strategic forecasting and intervention methods that account for various complexities. Our study addresses two challenges within the Prognostics and Health Management (PHM) framework applied to sewer assets: modeling pipe degradation across severity levels and developing effective maintenance policies. We employ Multi-State Degradation Models (MSDM) to represent the stochastic degradation process in sewer pipes and use Deep Reinforcement Learning (DRL) to devise maintenance strategies. A case study of a Dutch sewer network exemplifies our methodology. Our findings demonstrate the model’s effectiveness in generating intelligent, cost-saving maintenance strategies that surpass heuristics. It adapts its management strategy based on the pipe’s age, opting for a passive approach for newer pipes and transitioning to active strategies for older ones to prevent failures and reduce costs. This research highlights DRL’s potential in optimizing maintenance policies. Future research will aim to improve the model by incorporating partial observability, exploring various reinforcement learning algorithms, and extending this methodology to comprehensive infrastructure management.
@inproceedings{JimenezRoa2024maintenance,
  author = {Jimenez-Roa, Lisandro A. and Sim{\~a}o, Thiago D. and Bukhsh, Zaharah and Tinga, Tiedo and Molegraaf, Hajo and Jansen, Nils and Stoelinga, Marielle},
  title = {Maintenance Strategies for Sewer Pipes with Multi-State Degradation and Deep Reinforcement Learning},
  booktitle = {PHME},
  year = {2024}
}

2023

ECAI
Reinforcement Learning by Guided Safe Exploration

Yang, Qisong, Simão, Thiago D., Jansen, Nils, Tindemans, Simon H., and Spaan, Matthijs T. J.

In ECAI 2023

Safety is critical to broadening the application of reinforcement learning (RL). Often, we train RL agents in a controlled environment, such as a laboratory, before deploying them in the real world. However, the real-world target task might be unknown prior to deployment. Reward-free RL trains an agent without the reward to adapt quickly once the reward is revealed. We consider the constrained reward-free setting, where an agent (the guide) learns to explore safely without the reward signal. This agent is trained in a controlled environment, which allows unsafe interactions and still provides the safety signal. After the target task is revealed, safety violations are not allowed anymore. Thus, the guide is leveraged to compose a safe behavior policy. Drawing from transfer learning, we also regularize a target policy (the student) towards the guide while the student is unreliable and gradually eliminate the influence from the guide as training progresses. The empirical analysis shows that this method can achieve safe transfer learning and helps the student solve the target task faster.
@inproceedings{Yang2023reinforcement,
  title = {Reinforcement Learning by Guided Safe Exploration},
  author = {Yang, Qisong and Sim{\~a}o, Thiago D. and Jansen, Nils and Tindemans, Simon H. and Spaan, Matthijs T. J.},
  booktitle = {ECAI},
  year = {2023},
  pages = {2858--2865}
}
UAI
Risk-aware Curriculum Generation for Heavy-tailed Task Distributions

Koprulu, Cevahir, Simão, Thiago D., Jansen, Nils, and Topcu, Ufuk

In UAI 2023

Automated curriculum generation for reinforcement learning (RL) aims to speed up learning by designing a sequence of tasks of increasing difficulty. Such tasks are usually drawn from probability distributions with exponentially bounded tails, such as uniform or Gaussian distributions. However, existing approaches overlook heavy-tailed distributions. Under such distributions, current methods may fail to learn optimal policies in rare and risky tasks, which fall under the tails and yield the lowest returns, respectively. We address this challenge by proposing a risk-aware curriculum generation algorithm that simultaneously creates two curricula: (1) a primary curriculum that aims to maximize the expected discounted return with respect to a distribution over target tasks, and (2) an auxiliary curriculum that identifies and over-samples rare and risky tasks observed in the primary curriculum. Our empirical results show that the proposed algorithm achieves significantly higher returns in frequent as well as rare tasks compared to the state-of-the-art methods.
@inproceedings{Koprulu2023risk-aware,
  title = {{Risk-aware Curriculum Generation for Heavy-tailed Task Distributions}},
  author = {Koprulu, Cevahir and Sim{\~a}o, Thiago D. and Jansen, Nils and Topcu, Ufuk},
  pages = {1132--1142},
  booktitle = {UAI},
  year = {2023}
}
ICML
Scalable Safe Policy Improvement via Monte Carlo Tree Search

Castellini, Alberto, Bianchi, Federico, Zorzi, Edoardo, Simão, Thiago D., Farinelli, Alessandro, and Spaan, Matthijs T. J.

In ICML 2023

Algorithms for safely improving policies are important to deploy reinforcement learning approaches in real-world scenarios. In this work, we propose an algorithm, called MCTS-SPIBB, that computes safe policy improvement online using a Monte Carlo Tree Search based strategy. We theoretically prove that the policy generated by MCTS-SPIBB converges, as the number of simulations grows, to the optimal safely improved policy generated by Safe Policy Improvement with Baseline Bootstrapping (SPIBB), a popular algorithm based on policy iteration. Moreover, our empirical analysis performed on three standard benchmark domains shows that MCTS-SPIBB scales to significantly larger problems than SPIBB because it computes the policy online and locally, i.e., only in the states actually visited by the agent.
@inproceedings{Castellini2023scalable,
  title = {{Scalable Safe Policy Improvement via Monte Carlo Tree Search}},
  author = {Castellini, Alberto and Bianchi, Federico and Zorzi, Edoardo and Sim{\~a}o, Thiago D. and Farinelli, Alessandro and Spaan, Matthijs T. J.},
  pages = {3732--3756},
  booktitle = {ICML},
  year = {2023}
}
IJCAI
More for Less: Safe Policy Improvement with Stronger Performance Guarantees

Wienhöft, Patrick, Suilen, Marnix, Simão, Thiago D., Dubslaff, Clemens, Baier, Christel, and Jansen, Nils

In IJCAI 2023

In an offline reinforcement learning setting, the safe policy improvement (SPI) problem aims to improve the performance of a behavior policy according to which sample data has been generated. State-of-the-art approaches to SPI require a high number of samples to provide practical probabilistic guarantees on the improved policy’s performance. We present a novel approach to the SPI problem that provides the means to require less data for such guarantees. Specifically, to prove the correctness of these guarantees, we devise implicit transformations on the data set and the underlying environment model that serve as theoretical foundations to derive tighter improvement bounds for SPI. Our empirical evaluation, using the well-established SPI with baseline bootstrapping (SPIBB) algorithm, on standard benchmarks shows that our method indeed significantly reduces the sample complexity of the SPIBB algorithm.
@inproceedings{Wienhoft2023more,
  author = {Wienh{\"o}ft, Patrick and Suilen, Marnix and Sim{\~a}o, Thiago D. and Dubslaff, Clemens and Baier, Christel and Jansen, Nils},
  title = {More for Less: Safe Policy Improvement with Stronger Performance Guarantees},
  booktitle = {IJCAI},
  year = {2023},
  pages = {4406--4415}
}
IJCAI
Recursive Small-Step Multi-Agent A* for Dec-POMDPs

Koops, Wietze, Jansen, Nils, Junges, Sebastian, and Simão, Thiago D.

In IJCAI 2023

We present recursive small-step multi-agent A* (RS-MAA*), an exact algorithm that optimizes the expected reward in decentralized partially observable Markov decision processes (Dec-POMDPs). RS-MAA* builds on multi-agent A* (MAA*), an algorithm that finds policies by exploring a search tree, but tackles two major scalability concerns. First, we employ a modified, small-step variant of the search tree that avoids the double exponential outdegree of the classical formulation. Second, we use a tight and recursive heuristic that we compute on-the-fly, thereby avoiding an expensive precomputation. The resulting algorithm is conceptually simple, yet it shows superior performance on a rich set of standard benchmarks.
@inproceedings{Koops2023recursive,
  author = {Koops, Wietze and Jansen, Nils and Junges, Sebastian and Sim{\~a}o, Thiago D.},
  title = {Recursive Small-Step Multi-Agent {A*} for {Dec-POMDP}s},
  booktitle = {IJCAI},
  year = {2023},
  pages = {5402--5410}
}
ICAPS
Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring

Krale, Merlijn, Simão, Thiago D., and Jansen, Nils

In ICAPS 2023

We study Markov decision processes (MDPs) where agents have direct control over when and how they gather information, as formalized by action-contingent noiselessly observable MDPs (ACNO-MDPs). In these models, actions consist of two components: a control action which affects the environment, and a measurement action which affects what the agent can observe. For solving ACNO-MDPs, we introduce the act-then-measure (ATM) heuristic, which assumes that we can ignore future state uncertainty when choosing control actions. We show how following this heuristic may lead to shorter policy computation times and prove a bound on the performance loss incurred by the heuristic. To decide whether or not to take a measurement action, we introduce the concept of measuring value. We develop a reinforcement learning algorithm based on the ATM heuristic, using a variant of Dyna-Q modified for partially observable domains, and showcase its superior performance in comparison to prior methods on a number of partially observable environments.
@inproceedings{Krale2023act,
  title = {{Act-Then-Measure: Reinforcement Learning for Partially Observable Environments with Active Measuring}},
  author = {Krale, Merlijn and Sim{\~a}o, Thiago D. and Jansen, Nils},
  year = {2023},
  pages = {212--220},
  booktitle = {ICAPS}
}
ICLR
Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation

Hogewind, Yannick, Simão, Thiago D., Kachman, Tal, and Jansen, Nils

In ICLR 2023

We address the problem of safe reinforcement learning from pixel observations. Inherent challenges in such settings are (1) a trade-off between reward optimization and adhering to safety constraints, (2) partial observability, and (3) high-dimensional observations. We formalize the problem in a constrained, partially observable Markov decision process framework, where an agent obtains distinct reward and safety signals. To address the curse of dimensionality, we employ a novel safety critic using the stochastic latent actor-critic (SLAC) approach. The latent variable model predicts rewards and safety violations, and we use the safety critic to train safe policies. Using well-known benchmark environments, we demonstrate competitive performance over existing approaches regarding computational requirements, final reward return, and satisfying the safety constraints.
@inproceedings{Hogewind2023safe,
  title = {{Safe Reinforcement Learning From Pixels Using a Stochastic Latent Representation}},
  author = {Hogewind, Yannick and Sim{\~a}o, Thiago D. and Kachman, Tal and Jansen, Nils},
  year = {2023},
  booktitle = {ICLR}
}
AAAI
Safe Policy Improvement for POMDPs via Finite-State Controllers

Simão, Thiago D., Suilen, Marnix, and Jansen, Nils

In AAAI 2023

We study safe policy improvement (SPI) for partially observable Markov decision processes (POMDPs). SPI is an offline reinforcement learning (RL) problem that assumes access to (1) historical data about an environment, and (2) the so-called behavior policy that previously generated this data by interacting with the environment. SPI methods neither require access to a model nor the environment itself, and aim to reliably improve the behavior policy in an offline manner. Existing methods make the strong assumption that the environment is fully observable. In our novel approach to the SPI problem for POMDPs, we assume that a finite-state controller (FSC) represents the behavior policy and that finite memory is sufficient to derive optimal policies. This assumption allows us to map the POMDP to a finite-state fully observable MDP, the history MDP. We estimate this MDP by combining the historical data and the memory of the FSC, and compute an improved policy using an off-the-shelf SPI algorithm. The underlying SPI method constrains the policy-space according to the available data, such that the newly computed policy only differs from the behavior policy when sufficient data was available. We show that this new policy, converted into a new FSC for the (unknown) POMDP, outperforms the behavior policy with high probability. Experimental results on several well-established benchmarks show the applicability of the approach, even in cases where finite memory is not sufficient.
@inproceedings{Simao2023safe,
  title = {Safe Policy Improvement for {POMDP}s via Finite-State Controllers},
  author = {Sim{\~a}o, Thiago D. and Suilen, Marnix and Jansen, Nils},
  booktitle = {AAAI},
  year = {2023},
  publisher = {{AAAI} Press},
  pages = {15109--15117}
}
ICAART
Targeted Adversarial Attacks on Deep Reinforcement Learning Policies via Model Checking

Gross, Dennis, Simão, Thiago D., Jansen, Nils, and Pérez, Guillermo A.

In ICAART 2023

Deep Reinforcement Learning (RL) agents are susceptible to adversarial noise in their observations that can mislead their policies and decrease their performance. However, an adversary may be interested not only in decreasing the reward, but also in modifying specific temporal logic properties of the policy. This paper presents a metric that measures the exact impact of adversarial attacks against such properties. We use this metric to craft optimal adversarial attacks. Furthermore, we introduce a model checking method that allows us to verify the robustness of RL policies against adversarial attacks. Our empirical analysis confirms (1) the quality of our metric to craft adversarial attacks against temporal logic properties, and (2) that we are able to concisely assess a system’s robustness against attacks.
@inproceedings{Gross2023targeted, title = {{Targeted Adversarial Attacks on Deep Reinforcement Learning Policies via Model Checking}}, author = {Gross, Dennis and Sim{\~a}o, Thiago D. and Jansen, Nils and P{\'e}rez, Guillermo A.}, booktitle = {ICAART}, pages = {501--508}, year = {2023}, doi = {10.5220/0011693200003393} }
STTT
Decision-making under uncertainty: beyond probabilities. Challenges and Perspectives

Badings, Thom, Simão, Thiago D., Suilen, Marnix, and Jansen, Nils

STTT 2023

This position paper reflects on the state-of-the-art in decision-making under uncertainty. A classical assumption is that probabilities can sufficiently capture all uncertainty in a system. In this paper, the focus is on the uncertainty that goes beyond this classical interpretation, particularly by employing a clear distinction between aleatoric and epistemic uncertainty. The paper features an overview of Markov decision processes (MDPs) and extensions to account for partial observability and adversarial behavior. These models sufficiently capture aleatoric uncertainty but fail to account for epistemic uncertainty robustly. Consequently, we present a thorough overview of so-called uncertainty models that exhibit uncertainty in a more robust interpretation. We show several solution techniques for both discrete and continuous models, ranging from formal verification, over control-based abstractions, to reinforcement learning. As an integral part of this paper, we list and discuss several key challenges that arise when dealing with rich types of uncertainty in a model-based fashion.
@article{Badings2023decision, title = {{Decision-making under uncertainty: beyond probabilities. Challenges and Perspectives}}, author = {Badings, Thom and Sim{\~a}o, Thiago D. and Suilen, Marnix and Jansen, Nils}, year = {2023}, journal = {{STTT}} }

Ph.D.

Safe Online and Offline Reinforcement Learning

Simão, Thiago D.

Delft University of Technology 2023

@phdthesis{Simao2023thesis,
  title = {{Safe Online and Offline Reinforcement Learning}},
  author = {Sim{\~a}o, Thiago D.},
  year = {2023},
  school = {{Delft University of Technology}}
}

2022

NeurIPS
Robust Anytime Learning of Markov Decision Processes

Suilen, Marnix, Simão, Thiago D., Parker, David, and Jansen, Nils

In NeurIPS 2022

Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
@inproceedings{Suilen2022robust, title = {Robust Anytime Learning of {M}arkov Decision Processes}, author = {Suilen, Marnix and Sim{\~a}o, Thiago D. and Parker, David and Jansen, Nils}, booktitle = {NeurIPS}, year = {2022}, publisher = {Curran Associates, Inc.}, volume = {35}, pages = {28790--28802} }
ML
Safety-constrained reinforcement learning with a distributional safety critic

Yang, Qisong, Simão, Thiago D., Tindemans, Simon H., and Spaan, Matthijs T. J.

Machine Learning 2022

Safety is critical to broadening the real-world use of reinforcement learning. Modeling the safety aspects using a safety-cost signal separate from the reward and bounding the expected safety-cost is becoming standard practice, since it avoids the problem of finding a good balance between safety and performance. However, it can be risky to set constraints only on the expectation, neglecting the tail of the distribution, which might have prohibitively large values. In this paper, we propose a method called Worst-Case Soft Actor Critic for safe RL that approximates the distribution of accumulated safety-costs to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety constraint, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can compute policies whose worst-case performance satisfies the constraints. We investigate two ways to estimate the safety-cost distribution, namely a Gaussian approximation and a quantile regression algorithm. On the one hand, the Gaussian approximation is simple and easy to implement, but may underestimate the safety cost; on the other hand, the quantile regression leads to more conservative behavior. The empirical analysis shows that the quantile regression method achieves excellent results in complex safety-constrained environments, showing good risk control.
@article{Yang2022safety-constrained, title = {Safety-constrained reinforcement learning with a distributional safety critic}, author = {Yang, Qisong and Sim{\~a}o, Thiago D. and Tindemans, Simon H. and Spaan, Matthijs T. J.}, journal = {Machine Learning}, volume = {112}, number = {3}, pages = {859--887}, year = {2022}, publisher = {Springer} }
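The abstract above mentions a Gaussian approximation of the safety-cost distribution and a conditional Value-at-Risk (CVaR) constraint. As a rough illustration only, the sketch below uses the standard closed form for the upper-tail CVaR of a normal distribution. The function name `gaussian_cvar`, the risk-level convention (smaller `alpha` means more risk-averse), and the example numbers are assumptions made for this sketch; it is not taken from the paper's code.

```python
from statistics import NormalDist


def gaussian_cvar(mean: float, std: float, alpha: float) -> float:
    """Upper-tail CVaR of a N(mean, std^2) cost at risk level alpha in (0, 1).

    Interpreted here as the expected cost within the worst alpha fraction
    of outcomes (assumed convention: alpha close to 1 approaches the mean).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if std < 0.0:
        raise ValueError("std must be non-negative")
    standard_normal = NormalDist()
    # By symmetry of the normal density, pdf(inv_cdf(alpha)) equals
    # pdf(inv_cdf(1 - alpha)), so either quantile gives the same value.
    quantile = standard_normal.inv_cdf(alpha)
    return mean + std * standard_normal.pdf(quantile) / alpha


if __name__ == "__main__":
    # Hypothetical numbers: expected accumulated cost 20, standard deviation 5.
    for alpha in (0.9, 0.5, 0.1):
        print(alpha, gaussian_cvar(20.0, 5.0, alpha))
```

Comparing the returned value against a cost budget, instead of comparing the mean, is one way to make a constraint sensitive to the tail of the distribution.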
ITSC
A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics Using Constrained Reinforcement Learning

Kamran, Danial, Simão, Thiago D., Yang, Qisong, Ponnambalam, Canmanie T., Fischer, Johannes, Spaan, Matthijs T. J., and Lauer, Martin

In ITSC 2022

The use of reinforcement learning (RL) in real-world domains often requires extensive effort to ensure safe behavior. While this compromises the autonomy of the system, it might still be too risky to allow a learning agent to freely explore its environment. These strict impositions come at the cost of flexibility, and applying them often relies on complex parameters and hard-coded knowledge modelled by the reward function. Autonomous driving is one such domain that could greatly benefit from more efficient and verifiable methods for safe automation. We propose to approach the automated driving problem using constrained RL, a method that automates the trade-off between risk and utility, thereby significantly reducing the burden on the designer. We first show that an engineered reward function for ensuring safety and utility in one specific environment might not result in the optimal behavior when the traffic dynamics change in that same environment. Next, we show how algorithms based on constrained RL, which are more robust to environmental disturbances, can address this challenge. These algorithms use a simple and easy-to-interpret reward and cost function, and are able to maintain both efficiency and safety without requiring reward parameter tuning. We demonstrate our approach in the automated merging scenario with different traffic configurations, such as low or high chance of cooperative drivers and different cooperative driving strategies.
@inproceedings{Kamran2022modern, author = {Kamran, Danial and Sim{\~a}o, Thiago D. and Yang, Qisong and Ponnambalam, Canmanie T. and Fischer, Johannes and Spaan, Matthijs T. J. and Lauer, Martin}, title = {A Modern Perspective on Safe Automated Driving for Different Traffic Dynamics Using Constrained Reinforcement Learning}, year = {2022}, booktitle = {ITSC}, publisher = {IEEE}, pages = {4017--4023}, doi = {10.1109/ITSC55140.2022.9921907} }

2021

AAMAS
AlwaysSafe: Reinforcement Learning Without Safety Constraint Violations During Training

Simão, Thiago D., Jansen, Nils, and Spaan, Matthijs T. J.

In AAMAS 2021

Deploying reinforcement learning (RL) involves major concerns around safety. Engineering a reward signal that allows the agent to maximize its performance while remaining safe is not trivial. Safe RL studies how to mitigate such problems. For instance, we can decouple safety from reward using constrained Markov decision processes (CMDPs), where an independent signal models the safety aspects. In this setting, an RL agent can autonomously find tradeoffs between performance and safety. Unfortunately, most RL agents designed for CMDPs only guarantee safety after the learning phase, which might prevent their direct deployment. In this work, we investigate settings where a concise abstract model of the safety aspects is given, a reasonable assumption since a thorough understanding of safety-related matters is a prerequisite for deploying RL in typical applications. Factored CMDPs provide such compact models when a small subset of features describes the dynamics relevant for the safety constraints. We propose an RL algorithm that uses this abstract model to learn policies for CMDPs safely, that is, without violating the constraints. During the training process, this algorithm can seamlessly switch from a conservative policy to a greedy policy without violating the safety constraints. We prove that this algorithm is safe under the given assumptions. Empirically, we show that even if safety and reward signals are contradictory, this algorithm always operates safely and, when they are aligned, this approach also improves the agent’s performance.
@inproceedings{Simao2021alwayssafe, author = {Sim{\~a}o, Thiago D. and Jansen, Nils and Spaan, Matthijs T. J.}, title = {{AlwaysSafe}: Reinforcement Learning Without Safety Constraint Violations During Training}, year = {2021}, booktitle = {AAMAS}, publisher = {IFAAMAS}, location = {Online}, pages = {1226--1235} }
AAAI
WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning

Yang, Qisong, Simão, Thiago D., Tindemans, Simon H., and Spaan, Matthijs T. J.

In AAAI 2021

Safe exploration is regarded as a key priority area for reinforcement learning research. With separate reward and safety signals, it is natural to cast it as constrained reinforcement learning, where expected long-term costs of policies are constrained. However, it can be hazardous to set constraints on the expected safety signal without considering the tail of the distribution. For instance, in safety-critical domains, worst-case analysis is required to avoid disastrous results. We present a novel reinforcement learning algorithm called Worst-Case Soft Actor Critic, which extends the Soft Actor Critic algorithm with a safety critic to achieve risk control. More specifically, a certain level of conditional Value-at-Risk from the distribution is regarded as a safety measure to judge the constraint satisfaction, which guides the change of adaptive safety weights to achieve a trade-off between reward and safety. As a result, we can optimize policies under the premise that their worst-case performance satisfies the constraints. The empirical analysis shows that our algorithm attains better risk control compared to expectation-based methods.
@inproceedings{Yang2021wcsac, author = {Yang, Qisong and Sim{\~a}o, Thiago D. and Tindemans, Simon H. and Spaan, Matthijs T. J.}, title = {WCSAC: Worst-Case Soft Actor Critic for Safety-Constrained Reinforcement Learning}, booktitle = {AAAI}, publisher = {{AAAI} Press}, year = {2021}, pages = {10639--10646} }

2020

AAMAS
Safe Policy Improvement with an Estimated Baseline Policy

Simão, Thiago D., Laroche, Romain, and Tachet des Combes, Rémi

In AAMAS 2020

Previous work has shown the unreliability of existing algorithms in the batch Reinforcement Learning setting, and proposed the theoretically-grounded Safe Policy Improvement with Baseline Bootstrapping (SPIBB) fix: reproduce the baseline policy in the uncertain state-action pairs, in order to control the variance on the trained policy performance. However, in many real-world applications such as dialogue systems, pharmaceutical tests or crop management, data is collected under human supervision and the baseline remains unknown. In this paper, we apply SPIBB algorithms with a baseline estimate built from the data. We formally show safe policy improvement guarantees over the true baseline even without direct access to it. Our empirical experiments on finite- and continuous-state tasks support the theoretical findings. They show little loss of performance in comparison with SPIBB when the baseline policy is given and, more importantly, the approach drastically and significantly outperforms competing algorithms both in safe policy improvement and in average performance.
@inproceedings{Simao2020safe, author = {Sim{\~a}o, Thiago D. and Laroche, Romain and {Tachet des Combes}, R{\'e}mi}, title = {{Safe Policy Improvement with an Estimated Baseline Policy}}, year = {2020}, booktitle = {AAMAS}, publisher = {IFAAMAS}, pages = {1269--1277} }
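The abstract above describes the SPIBB idea as reproducing the baseline policy in the uncertain state-action pairs. A minimal sketch of that idea for a single state follows, assuming "uncertain" means a visit count below a threshold and that the remaining probability mass goes to the best well-estimated action. The function name `bootstrapped_policy`, its parameters, and the example numbers are hypothetical and not taken from the paper's code.

```python
from typing import List, Sequence


def bootstrapped_policy(
    baseline_probs: Sequence[float],
    q_values: Sequence[float],
    counts: Sequence[int],
    min_count: int,
) -> List[float]:
    """Return action probabilities for one state.

    Actions seen fewer than `min_count` times keep their baseline
    probability; the mass of the remaining actions is moved to the
    one with the highest estimated value.
    """
    uncertain = [count < min_count for count in counts]
    # Copy the baseline on uncertain actions, start from zero elsewhere.
    new_probs = [p if u else 0.0 for p, u in zip(baseline_probs, uncertain)]
    trusted = [a for a, u in enumerate(uncertain) if not u]
    if trusted:
        free_mass = sum(baseline_probs[a] for a in trusted)
        best_action = max(trusted, key=lambda a: q_values[a])
        new_probs[best_action] += free_mass
    return new_probs


if __name__ == "__main__":
    # Hypothetical state with three actions; the last one is rarely observed.
    print(bootstrapped_policy([0.5, 0.3, 0.2], [1.0, 2.0, 3.0], [10, 10, 1], 5))
```

When every action in a state is uncertain, the sketch returns the baseline distribution unchanged, which matches the intent of only deviating where enough data is available.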

2019

IJCAI DC
Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments

Simão, Thiago D.

In IJCAI 2019

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy π’. However, the policy computed by traditional RL algorithms might have worse performance compared to π. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of π’ is better than the performance of π given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method.
@inproceedings{Simao2019dc, author = {Sim{\~a}o, Thiago D.}, title = {{Safe and Sample-Efficient Reinforcement Learning Algorithms for Factored Environments}}, booktitle = {IJCAI}, publisher = {International Joint Conferences on Artificial Intelligence Organization}, pages = {6460--6461}, year = {2019} }
IJCAI
Structure Learning for Safe Policy Improvement

Simão, Thiago D., and Spaan, Matthijs T. J.

In IJCAI 2019

We investigate how Safe Policy Improvement (SPI) algorithms can exploit the structure of factored Markov decision processes when such structure is unknown a priori. To facilitate the application of reinforcement learning in the real world, SPI provides probabilistic guarantees that policy changes in a running process will improve the performance of this process. However, current SPI algorithms have requirements that might be impractical, such as: (i) availability of a large amount of historical data, or (ii) prior knowledge of the underlying structure. To overcome these limitations we enhance a Factored SPI (FSPI) algorithm with different structure learning methods. The resulting algorithms need fewer samples to improve the policy and require weaker prior knowledge assumptions. In well-factorized domains, the proposed algorithms improve performance significantly compared to a flat SPI algorithm, demonstrating a sample complexity closer to an FSPI algorithm that knows the structure. This indicates that the combination of FSPI and structure learning algorithms is a promising solution to real-world problems involving many variables.
@inproceedings{Simao2019structure, author = {Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J.}, title = {{Structure Learning for Safe Policy Improvement}}, booktitle = {IJCAI}, pages = {3453--3459}, year = {2019} }
AAAI
Safe Policy Improvement with Baseline Bootstrapping in Factored Environments

Simão, Thiago D., and Spaan, Matthijs T. J.

In AAAI 2019

We present a novel safe reinforcement learning algorithm that exploits the factored dynamics of the environment to become less conservative. We focus on problem settings in which a policy is already running and the interaction with the environment is limited. In order to safely deploy an updated policy, it is necessary to provide a confidence level regarding its expected performance. However, algorithms for safe policy improvement might require a large number of past experiences to become confident enough to change the agent’s behavior. Factored reinforcement learning can achieve a better sample complexity by exploiting independence between features of the environment, but it lacks a confidence level. We study how to improve the sample efficiency of the safe policy improvement with baseline bootstrapping algorithm by exploiting the factored structure of the environment. Our main result is a theoretical bound that is linear in the number of parameters of the factored representation instead of the number of states. The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm.
@inproceedings{Simao2019safe, author = {Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J.}, title = {{Safe Policy Improvement with Baseline Bootstrapping in Factored Environments}}, booktitle = {AAAI}, pages = {4967--4974}, publisher = {{AAAI} Press}, year = {2019} }

2018

ICML workshop
An Empirical Evaluation of Safe Policy Improvement in Factored Environments

Simão, Thiago D., and Spaan, Matthijs T. J.

2018

Reinforcement Learning (RL) deals with problems that can be modeled as a Markov Decision Process (MDP) where the transition function is unknown. In situations where an arbitrary policy π is already in execution and the experiences with the environment were recorded in a batch D, an RL algorithm can use D to compute a new policy π’. However, the policy computed by traditional RL algorithms might have worse performance compared to π. Our goal is to develop safe RL algorithms, where the agent has a high confidence that the performance of π’ is better than the performance of π given D. To develop sample-efficient and safe RL algorithms we combine ideas from exploration strategies in RL with a safe policy improvement method.
@misc{Simao2018, author = {Sim{\~a}o, Thiago D. and Spaan, Matthijs T. J.}, title = {{An Empirical Evaluation of Safe Policy Improvement in Factored Environments}}, howpublished = {{ICML / IJCAI / AAMAS 2018 Workshop on Planning and Learning (PAL-18)}}, year = {2018} }
IBERAMIA
When a Robot Reaches Out for Human Help

Andrés, Ignasi, Nunes de Barros, Leliane, Mauá, Denis D., and Simão, Thiago D.

In IBERAMIA 2018

In many realistic planning situations, any policy has a non-zero probability of reaching a dead-end. In such cases, a popular approach is to plan to maximize the probability of reaching the goal. While this strategy increases the robustness and expected autonomy of the robot, it considers that the robot gives up on the task whenever a dead-end is encountered. In this work, we consider planning for agents that pro-actively and autonomously resort to human help when an unavoidable dead-end is encountered (the so-called symbiotic agents). To this end, we develop a new class of Goal-Oriented Markov Decision Process that includes a set of human actions that ensures the existence of a proper policy, one that possibly resorts to human help. We discuss two different optimization criteria: minimizing the probability to use human help and minimizing the expected cumulative cost with a finite penalty for using human help for the first time. We show that for a large enough penalty both criteria are equivalent. We report on experiments with standard probabilistic planning domains for reasonably large problems.
@inproceedings{Andres2018when, author = {Andr{\'e}s, Ignasi and {Nunes de Barros}, Leliane and Mau{\'a}, Denis D. and Sim{\~a}o, Thiago D.}, title = {{When a Robot Reaches Out for Human Help}}, booktitle = {IBERAMIA}, pages = {277--289}, publisher = {Springer}, year = {2018} }

2016

ENIAC

Heuristics for Dead-Ends Detection in Probabilistic Planning

Simão, Thiago D., Andrés, Ignasi, Santos, Viviane B., and Nunes de Barros, Leliane

In ENIAC 2016

@inproceedings{Simao2016,
  author = {Sim{\~a}o, Thiago D. and Andr{\'e}s, Ignasi and Santos, Viviane B. and {Nunes de Barros}, Leliane},
  title = {{Heuristics for Dead-Ends Detection in Probabilistic Planning}},
  booktitle = {ENIAC},
  year = {2016},
  note = {portuguese}
}

2015

ENIAC

Probabilistic Planning with Dead-Ends

Simão, Thiago D., Nunes de Barros, Leliane, and Silva, Felipe L.

In ENIAC 2015

@inproceedings{Simao2015,
  author = {Sim{\~a}o, Thiago D. and {Nunes de Barros}, Leliane and Silva, Felipe L.},
  title = {{Probabilistic Planning with Dead-Ends}},
  booktitle = {ENIAC},
  year = {2015},
  note = {portuguese}
}

2011

ESUD

Development of 3D Games for Distance Education

Leitão, Ulisses A., Simão, Thiago D., and Neves, Jefferson A.

In ESUD 2011

@inproceedings{leitao2011,
  author = {Leit{\~a}o, Ulisses A. and Sim{\~a}o, Thiago D. and Neves, Jefferson A.},
  title = {{Development of 3D Games for Distance Education}},
  booktitle = {ESUD},
  year = {2011},
  note = {portuguese}
}