• Contract type: Fixed-term contract
  • Level of qualifications required: Graduate degree or equivalent
  • Fonction: PhD Position
  • Theme/Domain: Vision, perception and multimedia interpretation, Information system
  • Partner, country, city: Inria, France, Grenoble (Montbonnot)
  • Starting date: 2020-10-01
  • Duration of contract: 3 years
  • Deadline to apply: 2020-07-08
  • Remuneration: 1st and 2nd year: 1 982 euro gross/month; 3rd year: 2 085 euros gross/month

Further info & application: Inria


Reinforcement learning, and in particular deep reinforcement learning (DRL), became very popular in the recent, successfully addressing a wide variety of tasks such as board game playing. It has also been popular in addressing computer vision or pattern recognition tasks for which a differentiable loss function is difficult to find or does not exist [1]. The most popular methodology in DRL is to approximate the so-called action-value function, leading to deep Q networks (DQN) [2] and its derivatives. In these methods, both the action policy and the system’s transition function are implicitly learned within the neural network that approximates the action-value. This is limiting since it is unclear how to incorporate prior knowledge of the policy or of the system. Alternatives to the mainstream methodology exists and they can be based, for instance, in parametrising a deterministic policy, and optimizing the parametrisation by gradient descent [3]. Probabilistic alternatives based on the EM algorithm were first proposed in the late 90’s [4], and revisited in the 2010’s, see for instance [5].

In parallel, progress on deep learning lead to the conception of variational auto-encoders [6], which are non-linear probabilistic generative models. The use of VAE for reinforcement learning is at its very early stage [7], in which all actions, rewards and observations are jointly considered to infer a latent state, which is designed to encode the generation process of both the policy and the value function. In this PhD we would to investigate the use of VAE-based RL for audio-visual human-robot interaction. We aim to combine VAE for RL with probabilistic models developed for other tasks, such as speaker tracking or speech enhancement. VAE is a prominent research line but we are open to other ideas. Our team has expertise in probabilistic models for a variety of tasks [8,9] as well as on reinforcement learning for HRI [10].

Our team is searching for a motivated PhD candidate to investigate new approaches for deep reinforcement learning (DRL) in the field of audio-visual human-robot interaction. Although its high potential, DRL is still in its infancies when it comes to real-world applications such as robotics. Our group investigates DRL methods for the control of robots based on audio-visual inputs [10]. Our current project, in cooperation with several European partners [11], develops new approaches to enable health-care robots to interact and communicate with groups of people by providing information or guiding them. In difference to existing methods we are looking into approaches that combine visual and auditory information which improve for example the identification of an active speaker or their location.


Research Master’s degree, or equivalent, in a discipline connected to signal and information processing, computer vision and machine learning. The candidate should be willing to study independently new approaches and to develop their own ideas in this field, getting inspiration from the previous description and the progress on the literature. The candidate should have preferably a background in artificial intelligence, machine learning, computer science or applied mathematics. Moreover, a candidate should have knowledge in programming, preferable in Python.


  1. Ren, Liangliang, Jiwen Lu, Zifeng Wang, Qi Tian, and Jie Zhou. “Collaborative deep reinforcement learning for multi-object tracking.” In Proceedings of the European Conference on Computer Vision (ECCV), pp. 586-602. 2018.
  2. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. “Human-level control through deep reinforcement learning.” Nature 518, no. 7540 (2015): 529-533.
  3. Casas, Noe. “Deep deterministic policy gradient for urban traffic light control.” arXiv preprint arXiv:1703.09035 (2017).
  4. Dayan, Peter, and Geoffrey E. Hinton. “Using expectation-maximization for reinforcement learning.” Neural Computation 9, no. 2 (1997): 271-278.
  5. Vlassis, Nikos, Marc Toussaint, Georgios Kontes, and Savas Piperidis. “Learning model-free robot control by a Monte Carlo EM algorithm.” Autonomous Robots 27, no. 2 (2009): 123-130.
  6. Kingma, Diederik P., and Max Welling. “Auto-encoding variational bayes.” arXiv preprint arXiv:1312.6114 (2013).
  7. Igl, Maximilian, Luisa Zintgraf, Tuan Anh Le, Frank Wood, and Shimon Whiteson. “Deep variational reinforcement learning for pomdps.” arXiv preprint arXiv:1806.02426 (2018).
  8. Ban, Yutong, Xavier Alameda-Pineda, Laurent Girin, and Radu Horaud. “Variational bayesian inference for audio-visual tracking of multiple speakers.” IEEE transactions on pattern analysis and machine intelligence (2019).
  9. Sadeghi, Mostafa, and Xavier Alameda-Pineda. “Robust unsupervised audio-visual speech enhancement using a mixture of variational autoencoders.” In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7534-7538. IEEE, 2020.
  10. Lathuilière, Stéphane, Benoît Massé, Pablo Mesejo, and Radu Horaud. “Neural network based reinforcement learning for audio–visual gaze control in human–robot interaction.” Pattern Recognition Letters 118 (2019): 61-71.
  11. https://spring-h2020.eu