Deep reinforcement learning (DRL) is a recent yet very active area of research that combines deep learning (the use of neural networks) with reinforcement learning (solving sequential decision tasks). In DRL, the goal is to learn an optimal policy (behavior) for an agent acting in an environment, with deep neural networks serving as function approximators (for example, for the value function). DRL has achieved outstanding results so far, including surpassing human-level performance in Atari games [11] and winning DeepMind's now-famous Go match against Lee Sedol [12]. This has led to a dramatic increase in the number of applications that use this technique, most notably in video games and robotics.

Much of the successful DRL research to date has only considered single-agent environments. For instance, in Atari games, there is only one player to control. However, recent works have begun to move beyond single-agent scenarios and explore multiagent scenarios, in which the environment is populated with several learning agents at once. The presence of multiple agents makes both the environment and the agents themselves more dynamic, which complicates learning.

Recent works have reported successes in multiagent domains, such as Dota 2 [14] or Capture the flag [21], in which many agents learn to compete and/or cooperate in the same environment. Despite these promising results, many open challenges remain. This article aims to (i) provide a clear overview of current multiagent deep reinforcement learning (MDRL) trends and (ii) share both examples and lessons learned on how certain methods and algorithms from DRL and multiagent learning can be used in complementary ways to solve problems in this emerging area.

From multiagent learning to MDRL

Almost 20 years ago, Stone and Veloso's seminal survey used very intuitive and practical examples [1] to lay the groundwork for defining the area of multiagent systems (MAS) and its open problems in the context of machine learning:

“AI researchers have earned the right to start examining the implications of multiple autonomous agents interacting in the real world. In fact, they have rendered this examination indispensable. If there is one self-steering car, there will surely be more. And although each may be able to drive individually, if several autonomous vehicles meet on the highway, we must know how their behaviors interact”.

Roughly ten years later, Shoham, Powers, and Grenager [2] noted that the literature on learning in multiagent systems, or multiagent learning (MAL), was on the rise and that it was no longer possible to enumerate all relevant articles. In the decade since, the number of published MAL works has continued to rise, resulting in a series of surveys (reviews) that cover everything from the basics of MAL and its challenges [3] to specific subareas (e.g., cooperative settings and evolutionary dynamics of MAL) [4-10].

Research interest in MAL has been accompanied by a number of successes in AI: first in single-agent Atari games [11], and more recently in two-player games such as Go and poker [12-13] and in games involving two competing teams.

Deep reinforcement learning [15] plays a key role in these works and has been successfully integrated with other AI techniques such as (Monte Carlo tree) search, planning, and, more recently, multiagent systems. The result is the emerging area of multiagent deep reinforcement learning (MDRL).

Learning in multiagent settings is fundamentally more difficult than in the single-agent case because of problems [3-10] such as:

•    Non-stationarity: If all agents are learning at the same time, each agent faces an environment that keeps changing as the others update their behaviors. The dynamics become more complicated and break many standard RL assumptions.

•    Curse of dimensionality: The state-action space grows exponentially when a learning agent keeps track of the actions of all agents.

•    Multiagent credit assignment: Defining how agents should deduce their contributions when learning in a team; for example, if agents receive a team reward but it’s one agent doing most of the work, others can become “lazy.” It’s just like real life!
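
To make the curse of dimensionality concrete, here is a minimal Python sketch (not taken from the paper). It assumes every agent has the same number of actions; the action names and the helper joint_action_count are hypothetical and serve only as illustration.

```python
from itertools import product


def joint_action_count(actions_per_agent: int, num_agents: int) -> int:
    """Number of joint actions when every agent has the same action set."""
    return actions_per_agent ** num_agents


# Hypothetical action set, used only for illustration.
actions = ["left", "right", "stay"]

# Enumerate the joint actions explicitly for two agents to check the formula.
joint_actions = list(product(actions, repeat=2))
assert len(joint_actions) == joint_action_count(len(actions), 2)

# The count grows exponentially with the number of agents.
for num_agents in (1, 2, 5, 10):
    print(num_agents, joint_action_count(len(actions), num_agents))
```

With only three actions per agent, ten agents already yield 59,049 joint actions, which is why a learner that tracks all agents' actions quickly becomes impractical.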

Despite these complexities, top AI conferences such as AAAI, AAMAS, ICLR, IJCAI, and NIPS have all published works reporting MDRL successes. The validation of these top-tier conferences convinces us it is valuable to compile an overview of the recent MDRL works and understand how these works relate to the existing literature.

In this context, we have identified four prominent categories in which to group recent works, as shown in the following figure. 

Categories of different MDRL works

(a) Analysis of emergent behaviors: evaluate DRL algorithms in multiagent scenarios.

This illustration shows a smiley face and a sad face with a magnifying glass between them.

(b) Learning communication: agents learn with actions and through messages.

This illustration shows two smiley faces looking at each other. One has a lightbulb shape above its head; the other has a speech bubble with "A B C…" above its head.

(c) Learning cooperation: agents learn to cooperate using only actions and local observations.

This illustration shows two smiley faces next to each other with a puzzle piece shape connecting their heads together.

(d) Agents modeling agents: agents reason about others to fulfill a task (e.g., cooperative or competitive).

This illustration shows two smiley faces with a wall dividing them. One is wearing a hat; the other has a thought bubble above its head with the same hat inside.


Bridging multiagent learning and MDRL

The next objective of this article is to provide guidelines by showcasing how methods and algorithms from DRL and multiagent learning can complement each other to solve problems in MDRL. This occurs, for example, when: 

•    Dealing with non-stationarity

•    Dealing with multiagent credit assignment

We also present general lessons learned from these works, such as the use of:

•    Experience replay buffers in MDRL – a key component in many DRL works. These containers serve as explicit memory, storing interactions through which agents learn to improve their behaviors.

•    Recurrent neural networks (e.g., LSTMs). These networks serve as implicit memory that improves performance, particularly for partially-observable environments.

•    Centralized learning with decentralized execution: Agents can be trained with a central controller that has access to both the agents’ actions and their observations, but during deployment they operate based solely on their own observations.

•    Parameter sharing: In many tasks, it is useful to share a network’s internal layers, even when there are many outputs.
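
As an illustration of the explicit memory described in the first lesson above, here is a minimal replay buffer sketch in Python. It is not the implementation used in any of the surveyed works; the Transition fields, the ReplayBuffer class, and the choice of uniform random sampling are assumptions made for simplicity.

```python
import random
from collections import deque
from typing import List, NamedTuple, Optional


class Transition(NamedTuple):
    """One stored interaction (field names are illustrative assumptions)."""
    observation: tuple
    action: int
    reward: float
    next_observation: tuple
    done: bool


class ReplayBuffer:
    """Fixed-capacity buffer that evicts the oldest transitions first."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition: Transition) -> None:
        # When the deque is full, appending drops the oldest entry.
        self._storage.append(transition)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniform sampling without replacement from the stored transitions.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self) -> int:
        return len(self._storage)


# Usage: store five transitions in a buffer that holds only three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(Transition((step,), 0, 1.0, (step + 1,), False))

batch = buffer.sample(2)
print(len(buffer), [t.observation for t in batch])
```

Because the capacity is fixed, the two oldest transitions are discarded here, so every sampled transition comes from the three most recent steps.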

Towards the end of the article, we reflect on some open questions and challenges:

•    On the challenge of sparse and delayed rewards.

Recent MAL competitions and environments (e.g., Pommerman [24], Capture the flag [21], MarLÖ, StarCraft II, and Dota 2) have complex scenarios in which many actions must be taken before a reward signal becomes available. This is already a challenge for RL [16]; in MDRL it becomes even more problematic, since the agents need to learn not only basic behaviors (as in DRL) but also the strategic element (e.g., competitive/collaborative) embedded within the multiagent setting.

•    On the role of self-play.

Self-play (when all the agents use the same learning algorithm) is a MAL cornerstone that achieves impressive results [17-19]. While notable results have also occurred in MDRL, recent works have shown that plain self-play does not yield the best results [20, 21]. 

•    On the challenge of the combinatorial nature of MDRL.

Monte Carlo tree search (MCTS) has been the backbone of the major breakthroughs behind AlphaGo and AlphaGo Zero, both of which combined MCTS with DRL. In multiagent scenarios, however, centralized methods face an additional challenge: the joint action space grows exponentially with the number of agents. Given more scalable planners [22, 23], there is room for research on combining MCTS-like planners with DRL techniques in multiagent scenarios.


While there are a number of notable works in DRL and MDRL that represent important milestones for AI, we acknowledge there are also open questions in both single-agent learning and multiagent learning that demonstrate how much more work still needs to be done. We expect this article will help unify and motivate future research to take advantage of the abundant literature that exists in both areas (DRL and MAL) in a joint effort to promote fruitful research in the multiagent community.

Full paper: https://arxiv.org/abs/1810.05587 


[1] P. Stone, M. M. Veloso, Multiagent Systems - A Survey from a Machine Learning Perspective, Autonomous Robots 8 (2000) 345–383.
[2] Y. Shoham, R. Powers, T. Grenager, If multi-agent learning is the answer, what is the question?, Artificial Intelligence 171 (2007) 365–377. 
[3] K. Tuyls, G. Weiss, Multiagent learning: Basics, challenges, and prospects, AI Magazine 33 (2012) 41–52. 
[4] L. Busoniu, R. Babuska, B. De Schutter, A Comprehensive Survey of Multiagent Reinforcement Learning, IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews) 38 (2008) 156–172. 
[5] A. Nowé, P. Vrancx, Y.-M. De Hauwere, Game theory and multi-agent reinforcement learning, in: Reinforcement Learning, Springer, 2012, pp. 441–470. 
[6] L. Panait, S. Luke, Cooperative Multi-Agent Learning: The State of the Art, Autonomous Agents and Multi-Agent Systems 11 (2005).
[7] L. Matignon, G. J. Laurent, N. Le Fort-Piat, Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems, Knowledge Engineering Review 27 (2012) 1–31. 
[8] D. Bloembergen, K. Tuyls, D. Hennes, M. Kaisers, Evolutionary Dynamics of Multi-Agent Learning: A Survey, Journal of Artificial Intelligence Research 53 (2015) 659–697.
[9] P. Hernandez-Leal, M. Kaisers, T. Baarslag, E. Munoz de Cote, A Survey of Learning in Multiagent Environments - Dealing with Non-Stationarity (2017). arXiv:1707.09183.
[10] S. V. Albrecht, P. Stone, Autonomous agents modelling other agents: A comprehensive survey and open problems, Artificial Intelligence 258 (2018) 66–95. 
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. 
[12] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016) 484–489. 
[13] M. Moravčík, M. Schmid, N. Burch, V. Lisý, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, M. Bowling, DeepStack: Expert-level artificial intelligence in heads-up no-limit poker, Science 356 (2017) 508–513. 
[14] OpenAI Five, https://blog.openai.com/openai-five, 2018. [Online; accessed 7-September-2018].
[15] K. Arulkumaran, M. P. Deisenroth, M. Brundage, A. A. Bharath, A Brief Survey of Deep Reinforcement Learning (2017). arXiv:1708.05866v2.
[16] R. S. Sutton, A. G. Barto, Introduction to reinforcement learning, volume 135, MIT Press, Cambridge, 1998.
[17] J. Hu, M. P. Wellman, Nash Q-learning for general-sum stochastic games, Journal of Machine Learning Research 4 (2003) 1039–1069. 
[18] M. Bowling, Convergence and no-regret in multiagent learning, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2004, pp. 209–216. 
[19] J. Heinrich, D. Silver, Deep Reinforcement Learning from Self-Play in Imperfect-Information Games (2016). arXiv:1603.01121.
[20] T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, Emergent Complexity via Multi-Agent Competition, in: International Conference on Learning Representations, 2018.
[21] M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, T. Graepel, Human-level performance in first-person multiplayer games with population based deep reinforcement learning (2018).
[22] C. Amato, F. A. Oliehoek, et al., Scalable planning and learning for multiagent POMDPs, in: AAAI, 2015, pp. 1995–2002.
[23] G. Best, O. M. Cliff, T. Patten, R. R. Mettu, R. Fitch, Dec-MCTS: Decentralized planning for multi-robot active perception, The International Journal of Robotics Research (2018).
[24] C. Resnick, W. Eldridge, D. Ha, D. Britz, J. Foerster, J. Togelius, K. Cho, J. Bruna, Pommerman: A Multi-Agent Playground (2018). arXiv:1809.07124.