# 2020 End of Year Reading List

As we near the end of 2020, we thought we'd share our favorite research papers from the past year and the impact they've had on the field. Enjoy this holiday reading list!

## Navigating the landscape of multiplayer games

*Shayegan Omidshafiei, Karl Tuyls, Wojciech M. Czarnecki, Francisco C. Santos, Mark Rowland, Jerome Connor, Daniel Hennes, Paul Muller, Julien Pérolat, Bart De Vylder, Audrunas Gruslys and Rémi Munos*

by Pablo Hernandez

**Why do we recommend?**

What do Rock-Paper-Scissors, Chess, Go, and StarCraft II have in common? They are multiplayer games that have taken a central role in artificial intelligence research, in particular in reinforcement learning and multiagent learning. This paper asks how to define a measure on multiplayer games that can be used to build a taxonomy over them. Trivial proposals may fail to scale or to apply to many games, which turns this question into a genuinely difficult task. In this work, the authors use tools from game theory and graph theory to understand the structure of general-sum multiplayer games. In particular, they study graphs where the nodes correspond to strategies (or trained agents) and the edges correspond to interactions between nodes, quantified by the game’s payoffs. The paper concludes by shedding some light on another paramount and related question: how can we automatically generate interesting environments (games), i.e., ones with desirable characteristics, for learning agents?
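
The paper's graph-based view can be illustrated on the smallest cyclic game. Below is a hedged sketch (not the authors' code; the payoff matrix and edge convention are our own choices) that builds the response graph of Rock-Paper-Scissors from its payoff matrix:

```python
import numpy as np

# Nodes are pure strategies; a directed edge i -> j means strategy j
# beats strategy i (j's payoff against i is positive).
strategies = ["Rock", "Paper", "Scissors"]

# Antisymmetric payoff matrix for the row player (win = +1, loss = -1).
payoff = np.array([
    [ 0, -1,  1],   # Rock loses to Paper, beats Scissors
    [ 1,  0, -1],   # Paper beats Rock, loses to Scissors
    [-1,  1,  0],   # Scissors beats Paper, loses to Rock
])

edges = [(strategies[i], strategies[j])
         for i in range(3) for j in range(3)
         if payoff[j, i] > 0]  # j's payoff against i is positive
print(edges)  # the 3-cycle characteristic of cyclic games
```

The resulting directed 3-cycle is exactly the kind of structure (here, pure cyclicity) the paper uses to place games in its landscape.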

## Underspecification Presents Challenges for Credibility in Modern Machine Learning

*Alexander D’Amour, Katherine Heller, Dan Moldovan, Ben Adlam, Babak Alipanahi, Alex Beutel, Christina Chen, Jonathan Deaton, Jacob Eisenstein, Matthew D. Hoffman, Farhad Hormozdiari, Neil Houlsby, Shaobo Hou, Ghassen Jerfel, Alan Karthikesalingam, Mario Lucic, Yian Ma, Cory McLean, Diana Mincu, Akinori Mitani, Andrea Montanari, Zachary Nado, Vivek Natarajan, Christopher Nielson, Thomas F. Osborne, Rajiv Raman, Kim Ramasamy, Rory Sayres, Jessica Schrouff, Martin Seneviratne, Shannon Sequeira, Harini Suresh, Victor Veitch, Max Vladymyrov, Xuezhi Wang, Kellie Webster, Steve Yadlowsky, Taedong Yun, Xiaohua Zhai, D. Sculley*

by Mehran Kazemi

**Why do we recommend?**

When we solve an underdetermined system of linear equations (i.e., one with more unknowns than linearly independent equations), we obtain a class of solutions rather than a unique solution. This paper shows that a similar phenomenon often occurs in machine/deep learning: when we fit a model to a given independent and identically distributed (IID) dataset, several weight configurations achieve near-optimal held-out performance. This phenomenon is called underspecification, and it may arise from the use of models with many parameters (sometimes more parameters than data points). Our learning process effectively selects an arbitrary weight configuration from the class of configurations achieving near-optimal held-out performance, yet, as this paper shows for many models and in many domains, each of these configurations encodes different inductive biases, which result in very different (and sometimes undesired) behaviors in production when the data distribution slightly changes. This paper is a must-read before deploying any machine learning model to production.
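
The linear-algebra analogy can be made concrete in a few lines. This is an illustrative sketch of underspecification in miniature (our own toy example, not the paper's experiments): two weight vectors fit the same data exactly yet disagree once the input distribution shifts.

```python
import numpy as np

# An underdetermined system: 1 equation, 2 unknowns (w1 + w2 = 2).
A = np.array([[1.0, 1.0]])
b = np.array([2.0])

w_min_norm = np.linalg.lstsq(A, b, rcond=None)[0]  # minimum-norm solution [1, 1]
w_other = np.array([2.0, 0.0])                     # also satisfies the system

# Both fit the observed "training data" exactly ...
assert np.allclose(A @ w_min_norm, b)
assert np.allclose(A @ w_other, b)

# ... but disagree on a shifted input where the features decouple.
x_new = np.array([1.0, 0.0])
print(x_new @ w_min_norm, x_new @ w_other)  # 1.0 vs 2.0
```

The same held-out fit, two different "inductive biases", two different predictions under distribution shift.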

## End-to-End Object Detection with Transformers

*Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko*

by Thibaut Durand

**Why do we recommend?**

This paper proposes a new framework to tackle the object detection problem. The authors introduce DETR (DEtection TRansformer) which replaces the full complex hand-crafted object detection pipeline with a Transformer. Unlike most of the modern object detectors, DETR approaches object detection as a direct set prediction problem. It consists of a set-based global loss, which forces unique predictions via bipartite matching, and a Transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The authors also show that DETR can be easily extended to the panoptic segmentation task by using the same recipe for “stuff” and “things” classes.
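
The set-prediction idea hinges on bipartite matching between predictions and ground truth. Here is a hedged sketch using `scipy.optimize.linear_sum_assignment` (a Hungarian-style solver); the toy 1-D "boxes" and plain L1 cost are our simplifications of DETR's combined class-and-box cost:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Each ground-truth box is assigned to exactly one of the fixed set of
# predictions by solving a linear assignment problem over a cost matrix.
preds = np.array([0.1, 0.9, 0.5, 0.3])   # 4 predicted "boxes" (1-D toys)
targets = np.array([0.32, 0.88])         # 2 ground-truth "boxes"

cost = np.abs(preds[:, None] - targets[None, :])  # (4, 2) pairwise L1 cost
pred_idx, tgt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, tgt_idx)))  # optimal one-to-one matching
```

Unmatched predictions (here, predictions 0 and 2) would be supervised toward a "no object" class, which is how DETR avoids the hand-crafted non-maximum suppression step.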

## A Simple Framework for Contrastive Learning of Visual Representations

*Ting Chen, Simon Kornblith, Mohammad Norouzi and Geoffrey Hinton*

by Maryna Karpusha

**Why do we recommend?**

Recent natural language processing models, such as BERT and GPT, show how accuracy can be improved with unsupervised learning: the model is first pre-trained on a large unlabelled dataset and then fine-tuned on a small amount of labeled data. In this paper, Google Brain researchers show that a similar approach has great potential to improve the performance of computer vision models. SimCLR is a simple framework for contrastive visual representation learning. It differs from standard supervised learning on ImageNet in a few components: the choice of data augmentations, the use of a nonlinear projection head at the end of the network, and the choice of loss function. By carefully studying these design choices, the authors improve considerably over previous self-supervised, semi-supervised, and transfer learning methods. SimCLR has provided strong motivation for further research in this direction and has advanced self-supervised learning for computer vision.
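
The heart of SimCLR's loss can be sketched in a few lines of NumPy. This is an illustrative implementation of the NT-Xent (normalized temperature-scaled cross-entropy) objective; the batch size, embedding dimension, and temperature are arbitrary choices for the demo, not the paper's settings:

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """Sketch of SimCLR's NT-Xent loss; z1[i], z2[i] are two views of example i."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    n = len(z1)
    np.fill_diagonal(sim, -np.inf)                     # exclude self-pairs
    # The positive for row i is its other augmented view: i+n (or i-n).
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
loss_random = nt_xent(z1, rng.normal(size=(8, 16)))        # unrelated "views"
loss_aligned = nt_xent(z1, z1 + 0.01 * rng.normal(size=(8, 16)))  # near-identical views
print(loss_random, loss_aligned)  # aligned views give a lower loss
```

The loss pulls the two augmented views of each image together while pushing all other batch entries apart, which is why the choice of augmentations matters so much.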

## Stochastic Normalizing Flows

*Hao Wu, Jonas Köhler, Frank Noé*

by Andreas Lehrmann

**Why do we recommend?**

Normalizing flows are a family of invertible generative models that transform a simple base distribution to a complex target distribution. In addition to traditional density estimation, they have recently been used in conjunction with re-weighting schemes (e.g., importance sampling) to draw unbiased samples from unnormalized distributions (e.g., energy models). This paper proposes a variant of this idea in which the flow consists of an interwoven sequence of deterministic invertible functions and stochastic sampling blocks. The added stochasticity is shown to overcome the limited expressivity of deterministic flows, while the learnable bijections improve over the efficiency of traditional MCMC. Interestingly, the authors show how to compute exact importance weights without integration over all stochastic paths, enabling efficient asymptotically unbiased sampling.
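
The re-weighting setup the paper builds on can be sketched with a single affine bijection. This toy example (our own construction, not the paper's stochastic blocks) pushes a Gaussian base through a one-layer "flow" and uses self-normalized importance weights to obtain unbiased estimates under an unnormalized target:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized target density: a standard Gaussian centered at 1.
    return -0.5 * (x - 1.0) ** 2

# Affine "flow" with arbitrary demo parameters.
scale, shift = 1.2, 0.5
z = rng.normal(size=100_000)
x = scale * z + shift                      # flow: z -> x

# Change of variables: log q(x) = log N(z; 0, 1) - log|scale|
log_q = -0.5 * z**2 - 0.5 * np.log(2 * np.pi) - np.log(scale)

log_w = log_target(x) - log_q              # importance weights
w = np.exp(log_w - log_w.max())
w /= w.sum()

est_mean = (w * x).sum()                   # self-normalized estimate
print(est_mean)  # close to the target mean of 1.0
```

In the paper, the deterministic map is replaced by an alternating sequence of bijections and stochastic sampling blocks, with importance weights computed along each single path rather than by integrating over all paths.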

## SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows

*Didrik Nielsen, Priyank Jaini, Emiel Hoogeboom, Ole Winther, Max Welling*

by Marcus Brubaker

**Why do we recommend?**

Variational Autoencoders (VAEs) and Normalizing Flows are two popular classes of generative models that allow for fast sample generation and either exact or tractable approximations of probability density. This paper provides a middle ground between these two models by showing how surjective and stochastic transformations can be incorporated and mixed with bijective transformations to construct a wider and more flexible range of distributions. While this construction loses the ability to compute exact densities, it allows for more accurate "local" variational posteriors which can be derived based on the specific transformation. The resulting SurVAE Flow generative model is similar in nature to a VAE in that likelihoods can only be approximated or bounded. However, instead of trying to learn a single (potentially very complex) variational posterior, SurVAE Flows instead learn a variational posterior for each step which will generally be more accurate.

## Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data

*Emily M. Bender, Alexander Koller*

by Yanshuai Cao

**Why do we recommend?**

The words “meaning”, “semantics”, “understand” or “comprehend” are often used loosely in the broader AI literature, and sometimes even in NLP papers. This position paper precisely defines “form” and “meaning”, and argues that an NLP system trained only on form has a priori no way to learn meaning, regardless of the amount of training data or compute power. It is a modern take on the symbol grounding problem. The next paper in our list, “Experience Grounds Language” by Bisk et al., can be viewed as a follow-up and an attempt to answer some of the questions posed in this paper. Besides the clarity that Bender and Koller bring to the topic, this paper is also recommended because the authors leverage thought experiments to make convincing arguments without mathematical proofs or numerical experiments. This style is sometimes used in the physics literature but rarely in modern AI.

## Experience Grounds Language

*Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio, Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nisnevich, Nicolas Pinto, Joseph Turian*

by Daniel Recoskie

**Why do we recommend?**

This paper posits that exposure to a large textual corpus is not enough for a system to understand language. The authors argue that experience in the physical world (including social situations) is necessary for successful linguistic communication. Five "world scopes" are used as a framework in which to view NLP: corpus, internet, perception, embodiment, and social. The authors use these world scopes to create a roadmap towards true language understanding. The paper also reviews a large amount of linguistics and NLP literature in a way that is approachable even without any NLP background.

## If Beam Search is the Answer, What was the Question?

*Clara Meister, Tim Vieira, Ryan Cotterell*

by Layla El Asri

**Why do we recommend?**

In this paper, the authors analyse the success of beam search as a decoding strategy in NLP. They show that beam search in fact optimizes a regularized maximum a posteriori objective, and that the induced behaviour relates to a concept from cognitive science known as the uniform information density hypothesis. According to this hypothesis, "Where speakers have a choice between several variants to encode their message, they prefer the variant with more uniform information density". Based on this observation, the authors propose other regularizers and show how the performance of beam search with larger beams can be improved. Several decoding schemes have been proposed to alleviate some of the difficulties of language generation; this paper provides an elegant framework that helps analyze the very popular method of beam search and understand why it produces high-quality text.
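
The regularized-MAP view can be sketched on a toy model. Below, beam search scores each hypothesis by its log-probability minus a UID-style penalty on the variance of per-token surprisals; the penalty form and the toy distribution are illustrative, not the paper's exact regularizer:

```python
import numpy as np

rng = np.random.default_rng(1)
T, V = 4, 5                                  # sequence length, vocab size
logits = rng.normal(size=(T, V))
logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def score(surprisals, lam):
    s = np.array(surprisals)
    return -s.sum() - lam * np.var(s)        # log-prob minus unevenness penalty

def beam_search(logp, beam_size=3, lam=0.0):
    beams = [((), [])]                       # (token tuple, surprisal list)
    for t in range(len(logp)):
        candidates = [
            (toks + (v,), surps + [-logp[t, v]])
            for toks, surps in beams for v in range(len(logp[t]))
        ]
        candidates.sort(key=lambda c: -score(c[1], lam))
        beams = candidates[:beam_size]
    return beams[0][0]

print(beam_search(logp, lam=0.0))   # plain (approximate) MAP decoding
print(beam_search(logp, lam=5.0))   # prefers sequences with more uniform surprisal
```

With `lam=0` this reduces to ordinary beam search; a positive `lam` trades a little log-probability for a flatter surprisal profile, mirroring the paper's explanation of why beam search's implicit bias yields fluent text.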

## Learning De-biased Representations with Biased Representations

*Hyojin Bahng, Sanghyuk Chun, Sangdoo Yun, Jaegul Choo, Seong Joon Oh*

by Ga Wu

**Why do we recommend?**

While machine learning models can achieve remarkable accuracy on many classification tasks, it is hard to tell whether a model relies on dataset bias as a shortcut for successful prediction. Consequently, models can learn biased representations of their observations and fail to generalize when the bias shifts. This paper describes an application in which the Hilbert-Schmidt Independence Criterion (HSIC) is adopted for unbiased representation learning. Specifically, the authors propose to train a de-biased representation by encouraging it to be statistically independent of intentionally biased representations. While this appears similar to Noise Contrastive Estimation (NCE), which also distinguishes a desired representation from others, the proposed approach is based on information theory rather than density estimation. This paper is one of several recent works pairing HSIC with deep learning, a combination that has received a surge of attention.
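
An empirical HSIC estimate is only a few lines of NumPy. Here is a hedged sketch of the standard biased estimator (the RBF kernel bandwidth and toy data are arbitrary choices, not the paper's setup), showing that dependent variables score far higher than independent ones:

```python
import numpy as np

def rbf_kernel(x, sigma=1.0):
    d2 = (x[:, None, :] - x[None, :, :]) ** 2
    return np.exp(-d2.sum(-1) / (2 * sigma**2))

def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / (n - 1)^2."""
    n = len(x)
    K, L = rbf_kernel(x, sigma), rbf_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
indep = hsic(x, rng.normal(size=(200, 1)))          # y independent of x
dep = hsic(x, x + 0.1 * rng.normal(size=(200, 1)))  # y strongly dependent on x
print(indep, dep)  # dependence yields a much larger HSIC value
```

In the paper's setting, a penalty of this form is minimized between the learned representation and intentionally biased ones, pushing the two to be statistically independent.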

## On Adaptive Attacks to Adversarial Example Defenses

*Florian Tramèr, Nicholas Carlini, Wieland Brendel, Aleksander Madry*

by Amir Abdi

**Why do we recommend?**

The paper demonstrates that typical adaptive evaluations of adversarial defenses are incomplete. Accordingly, the authors detail the methodology of an appropriate adaptive attack and work through thirteen well-known adversarial defenses, showing that all of them can be circumvented by careful and appropriate tuning of the attacks. They propose an informal no-free-lunch theorem: "for any proposed attack, it is possible to build a non-robust defense that prevents that attack". They therefore advise the community not to overfit to any proposed adaptive attack and to use such attacks only as sanity checks. Moreover, the paper favors "simpler" but "hand-designed" attacks that stay as close as possible to straightforward gradient descent with an appropriate loss function.
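
The kind of simple, hand-designed attack the authors advocate is essentially projected gradient descent (PGD) with the right loss. Here is a hedged sketch on a toy logistic-regression model, where the gradient is exact (our own example, not one of the paper's thirteen case studies):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([2.0, -1.0]), 0.1    # toy linear classifier

def prob(x):                         # P(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def pgd_attack(x, y, eps=0.4, step=0.05, iters=40):
    """Maximize the cross-entropy loss of the true label y within an L-inf ball."""
    x_adv = x.copy()
    for _ in range(iters):
        p = prob(x_adv)
        grad = (p - y) * w                         # exact d(cross-entropy)/dx
        x_adv = x_adv + step * np.sign(grad)       # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)   # project back to the ball
    return x_adv

x = np.array([0.5, 0.2])             # confidently classified as y = 1
assert prob(x) > 0.5
x_adv = pgd_attack(x, y=1)
print(prob(x), prob(x_adv))          # the prediction flips under attack
```

Against real defenses, the paper's point is that the attack's loss, and not its optimizer, must be adapted to each defense for the evaluation to be meaningful.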