Université Libre de Bruxelles
We show how to combine techniques from formal methods and learning for online computation of a strategy that aims at optimizing the expected long-term reward in large systems modelled as Markov decision processes. This strategy is computed with a receding horizon and using Monte Carlo tree search (MCTS). We augment the MCTS algorithm with the notion of advice which guides the search in the relevant part of the tree using exact methods. Such advice can be symbolically written as a logical formula and computed on-the-fly using model-checking tools.
We show that the classical theoretical guarantees of the Monte Carlo tree search are still maintained after this augmentation. To lower the latency of MCTS algorithms with advice, we propose to replace advice coming from exact algorithms with an artificial neural network. For this purpose, we implemented an expert imitation framework to train the neural network in order to replace expert advice with neural advice. To demonstrate the practical interest of our techniques, we implemented the frameworks on different systems modelled as MDPs.