Dec. 12, 2018

We spent last week navigating a phalanx of NeurIPS 2018 talks, poster sessions, networking events, parties and workshops. After taking a few days to process our thoughts, we're excited to present a few of our favourite papers from this year's conference.

**Authors: **Yu-An Chung · Wei-Hung Weng · Schrasing Tong · James Glass

Last year, I was surprised by a paper that introduced a technique to perform word translation between two languages without parallel corpora. To clarify, you’re said to have unparalleled corpora when two sets of a text exist in two languages (e.g. a set of English words and a set of French words), but there is no information regarding which English word corresponds to the appropriate French translation.

Previously, the state-of-the-art method for learning cross-lingual word embeddings mainly relied on bilingual dictionaries, along with some help from character-level information for languages that shared a common alphabet. None of this was competitive with supervised machine translation techniques. The authors of the paper managed to pose a different question: Was it possible to do unsupervised word translation? They answered their own question by introducing a new technique that worked quite well.

Their model worked by obtaining the word embeddings space for both languages, independently, and introducing a technique for unsupervised alignment between the two embedding spaces that can achieve translations without parallel corpora. The intuition behind this technique is to rotate one embedding space to the point that the two embedding spaces are virtually indistinguishable to a classifier (i.e., adversarial training).

This year, a new set of authors presented their work regarding the task of automatic speech recognition without parallel data. This, again, means two independent sets of speech data and text data exist, but the correspondence information between them is unclear. This work stood out since it is the first successful attempt to apply the unsupervised alignment technique introduced last year on multiple modalities of data. The task involved taking a dataset of words from one language and a dataset of spoken words from either the same or a different language, and automatically identifying spoken words without parallel information.

The authors first trained an embedding space for written words and another embedding space for spoken words. They then applied the unsupervised alignment technique to the embedding spaces to align them so that spoken words could automatically be classified and translated. At test time, a speech segment is first mapped into its respective embedding space, aligned to the text embedding space, then the nearest neighbors of the text embedding are picked as the translation. The same procedure can be used for the text-to-speech conversion task.

The authors present some experiments on the spoken Wikipedia and LibriSpeech datasets that show unsupervised alignments are still not as good as supervised alignment – but they’re close. Some challenges still remain to be solved before unsupervised cross-modal alignments can be competitive with supervised ones; however, this work shows the promise of improving automatic speech recognition (ASR), text-to-speech (TTS) and even translation systems, especially in languages with a low availability of paralleled data. (/**HS**)

**Authors**: Rad Niazadeh · Tim Roughgarden · Joshua Wang

This paper was accepted as an oral presentation. The authors gave an approximation algorithm for maximizing continuous non-monotone submodular functions.

To give a brief recap, submodular functions arise in several important areas of machine learning and, in particular, around the intersection of economics and learning. They can be used to model the problem of maximizing multi-platform ad revenue, where a buyer wants to maximize their profit = revenue - cost by advertising on different platforms and there is a diminishing return of advertising on more platforms. This diminishing return is the precise property captured by the sub-modular functions. Mathematically, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is submodular if $f(S \cup \{e\}) - f(S) \geq f(T \cup \{e\}) - f(T)$ for every $S \subseteq T$ and $e \notin T $. In this setting, there is an information-theoretic lower bound of $1/2$-approximation [Feige et al.'11] and there is an optimal algorithm which matches this bound [Buchbinder et al.'15].

This paper considered the continuous submodular function where, instead of maximizing on the vertices of the hypercube $\{0,1\}^n$, we want to maximize over the full hypercube $[0,1]^n$. The main result of the paper is that they obtained a randomized algorithm for maximizing a continuous submodular and $L$-Lipschitz function over the hypercube that guarantees a $1/2$-approximation. Note that this is currently the best possible ratio that is information-theoretically achievable.

The reason this paper stood out is that the authors used the double greedy framework of Buchbinder et al.'15 to solve the coordinate-wise zero-sum game, and then use the geometry of this game to bound the value at its equilibrium. This is a nice application of game theory to maximize the value of the function. The authors also conducted experiments on 100-dimensional synthetic data and achieved comparable results as the previous work they referenced. One thing we hoped to see was that their achievement of better approximations and faster algorithms would also show a significant advantage in the experiments, but that was not the case.

In terms of the open problems, I am really excited to see the development of parallel and online algorithms for continuous sub-modular optimization. In fact, there is a recent work for parallel algorithms of Chen et al.'18 which achieves a tight $(1/2 - \epsilon)$-approximation guarantee using $\tilde{O}(\epsilon^{-1}$) adaptive rounds. (/**KJ**)

**Authors**: Jiantao Jiao · Weihao Gao · Yanjun Han

This paper focused on estimating the differential entropy of a continuous distribution $f$ given $n$ i.i.d. samples. Entropy has been a core concept of information theoretic measures and has engendered numerous important applications, such as goodness-of-fit tests, feature selection, and tests of independence. In the vast body of literature around this concept, most of the measures have appeared to take on an asymptotic flavor – that is, until several recent works.

This paper is one of those works. The authors took particular focus on the fixed-k nearest neighbor (fixed-kNN) estimator, also called the Kozachenko-Leonenko estimator. This estimator is simple; there is only one parameter to tune, and it requires no knowledge about the smoothness degree $s$ about targeted distribution $f$. Moreover, it is also computationally efficient, since $k$ is fixed (compared to other methods with similar finite sample bounds) and statistically efficient: As shown in this paper, it has a finite sample bound that is close to optimal. All of these properties make the estimator realistic and attractive in practice.

I found the paper also carried some interesting technical results. One direct approach in estimating the differential entropy is to plug in a consistent estimator, for example, based on kNN distance statistics of the density function $f$ into the formula of entropy. However, such estimators usually come with an impractical computational demand. For instance, in the kNN-based estimator, $k$ has to approach $\infty$ as the number of samples $n$ approaches $\infty$.

In a recent paper by Han et al. [2017], the authors constructed a complicated estimator that achieves a finite sample bound in the rate of $n\log(n))^{-\frac{s}{s+d}} + n^{-\frac12}$ (the optimal rate). One caveat, though, is that it requires the knowledge of the smoothness degree $s$ of the targeted distribution $f$. The last challenging part is to deal with the area where $f$ is small. A major difficulty in achieving such bounds for the entropy estimator is that the nearest neighbor estimator exhibits a huge bias in its low-density area. Most papers tend to make assumptions about the property of $f$ such that this bias is well controlled. However, this paper did not presume similar assumptions. Given all these constraints, including fixed $k$, no knowledge of $s$ and no assumptions on how $f$ is bounded from below, the authors managed to prove a nearly optimal finite sample bound for a simple estimator. According to the authors, the new technical tools here are the Besicovitch convering lemma and a generalized Hardy-Littlewood maximal inequality. This part is not yet clear to me.

Lastly, the authors also pointed out several weaknesses in their paper and their plans for future work. For example, they conjectured that both the upper bound and the lower bound in the paper could be further improved. They also hypothesized a way to extend the constraint on $s$ in the theorem so that the result can be applied to a more general setting. (/**RH**)

**Authors**: Kevin Scaman · Francis Bach · Sebastien Bubeck · Laurent Massoulié · Yin Tat Lee

This paper considered distributed optimization of non-smooth convex functions using a network of computing units. The objective of this work was to study the impact of the communication network on learning, and the tradeoff between the structure of the network and algorithmic efficiency. The network consists of a connected simple graph of nodes, each having access to a function (such as a loss function). The optimization problem exists to minimize the average of the local functions: communication between nodes takes a given length of time and computation takes one unit of time. Under a decentralized scenario, local communication is performed through gossip.

The authors give bounds on the time to reach a given precision, then provide an optimal algorithm that uses a primal-dual reformulation. They are able to show that the error due to limits in communication resources will then decrease at a fast rate. In the centralized setting, the authors provide an algorithm which achieves convergence rates within $d^{1/4}$ to the optimal, where d is the underlying dimension.

I found this paper intriguing because it considers the impact of communication and computation resources in learning, which will be increasingly important as systems we learn on become larger. It received one of the best paper awards and is one of few papers that consider such impacts. There’s an argument to be made that these two things are related; as learning systems scale up and get distributed through IoT and mobile devices, the importance of distributed learning in a setting where there is tension between communication and computation has also increased. The elegant analytical tools used in this paper – gossip methods, primal-dual formulation, Chambolle-Pock algorithm for saddle-point optimization, the combined use of optimization and graph theory, and the bounds that give insight into which resources are important at which stage of convergence – show that the award places well-deserved attention toward a growing area. (/**NH**)

**Invited talk**: Jon Kleinberg - Fairness, Simplicity, and Ranking

In Jon Kleinberg's fascinating invited talk, he addressed the effect of implicit bias on producing adverse outcomes. The specific application he referred to is that of bias in activities such as hiring, promotion, admissions. The setting is as follows: a recruitment committee is tasked with selecting a shortlist of final candidates from a given pool of applicants, but their estimates of skill, used in the selection, may be skewed by implicit bias.

The Rooney Rule is an NFL policy in effect since 2003 that requires teams to interview ethnic-minority candidates for coaching and operations positions. (Note: There is no quota or preference given in the actual hiring). Kleinberg and his co-authors showed that measures such as the Rooney Rule lead to higher payoffs for the organization. Their model is as follows: a recruiting committee must select a list of k candidates for final interviews; the set of applicants is then divided into two groups, X and Y, X being the minority group; there are $n$ Y applicants and $n\alpha$ X applicants with $\alpha \le 1$. Each candidate has a numerical value representing their skill, and there is a common distribution from which these skills are drawn.

Based on empirical studies of skills in creative and skilled workforces, the authors then modeled this distribution as a Pareto distribution (power law). The utility that the recruiting committee aims to maximize is the sum of the candidates’ skills that were selected to the list. The authors modeled the bias as a multiplicative bias in the estimation of the skills of X-candidates. So, Y candidates are estimated at their true value and an X-candidate skill is estimated to be $X_i/\beta$ for candidate i where $\beta >1$. The authors then analyzed the utility of a list of $k$ candidates where at least one must be an X candidate. Their analysis showed an increase in utility even when the list was of size 2, and for a large range of values for the bias, power law, and population parameters.

I found this to be another very interesting and important paper because it tackles the question of fairness at a very practical level and provided a tangible algorithmic framework with which to expose, then analyze the outcomes. Furthermore, the modelling assumptions were very realistic, and their results demonstrated the potential for significant impact. The particular scenario considered here may be for activities such as hiring and admissions, but the result has consequences for machine learning models. (/**NH**)