Given a fixed history of events and their corresponding times – like those shown below in Fig. 1 – multiple actions are possible in the future. In our CVPR paper of the same name, which we will be presenting this week in Long Beach, we propose a powerful generative approach that can effectively model the distribution over future actions.
To date, much of the work in this domain has focused on taking frame-level video data as input in order to predict the actions or activities that may occur in the immediate future. Time-series data often involves regularly spaced data points with interesting events occurring sparsely across time. We hypothesize that in order to model future events in such a scenario, it is beneficial to consider the history of sparse events alone (the action categories and their temporal occurrence in the above example), instead of regularly spaced frame data. This approach also allows us to model high-level semantic meaning in the time-series data that can be difficult to discern from frame-level data.
More specifically, we are interested in modeling the distribution over the future action category and action timing given the past history of sparse events. For action timing, we aim to model the distribution over the inter-arrival time, i.e., the time difference between the start times of two consecutive actions.
The contributions of this work center around the APP-VAE (Action Point Process VAE), a novel generative model for asynchronous-time action sequences. Fig. 2 shows the overall structure of our proposed framework. We formulate our model within the variational auto-encoder (VAE) paradigm, a powerful class of probabilistic models that facilitate generation and the ability to model complex distributions. We present a novel form of VAE for action sequences under a point process approach. This approach has a number of advantages, including a probabilistic treatment of action sequences that allows for likelihood evaluation, generation, and anomaly detection.
Fig. 3 shows the architecture of our model. Overall, the input sequence of action categories and inter-arrival times is encoded using a recurrent VAE model. At each step, the model uses the history of actions to produce a distribution over latent codes $z_n$, a sample of which is then decoded into two probability distributions: one over the possible action categories and another over the inter-arrival time for the next action.
Since the true distribution over the latent variables $z_n$ is intractable, we rely on a time-dependent posterior network $q_\phi(z_n|x_{1:n})$ that approximates it with a conditional Gaussian distribution $\mathcal{N}(\mu_{\phi_n}, \sigma^2_{\phi_n})$.
To prevent $z_n$ from simply copying $x_n$, we force $q_\phi(z_n|x_{1:n})$ to be close to the prior distribution $p(z_n)$ using a KL-divergence term. In order to take the history of past actions into account during the generation phase, we learn a prior that varies across time and is a function of all past actions except the current one: $p_\psi(z_n|x_{1:n-1})$.
The sequence model generates two probability distributions: i) a categorical distribution over the action categories; and ii) a temporal point process distribution over the inter-arrival time for the next action.
The distribution over action categories $a_n$ is modeled with a multinomial distribution, since $a_n$ can only take a finite number of values: \begin{equation}
p^a_\theta(a_n=k|z_n) = p_k(z_n) \quad \text{and} \quad
\sum_{k=1}^K{p_k(z_n)} = 1 \label{eq:action}
\end{equation} where $p_k(z_n)$ is the probability of occurrence of action $k$, and $K$ is the total number of action categories.
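As a concrete sketch, such a categorical head can be implemented as a linear layer followed by a softmax. This is a minimal NumPy illustration only; the parameters `W` and `b` are hypothetical stand-ins, not the paper's actual decoder:

```python
import numpy as np

def action_distribution(z, W, b):
    """Map a latent sample z_n to a categorical distribution over K action
    categories via a linear layer and a softmax (hypothetical parameters)."""
    logits = W @ z + b
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# toy example: latent dimension 4, K = 3 action categories
rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), np.zeros(3)
p_k = action_distribution(rng.normal(size=4), W, b)
```

By construction the outputs are non-negative and sum to one, satisfying the constraint in Eq. \ref{eq:action}.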
The inter-arrival time $\tau_n$ is assumed to follow an exponential distribution parameterized by $\lambda(z_n)$, similar to a standard temporal point process model:
\begin{equation}
\begin{aligned}
p^{\tau}_{\theta}(\tau_n | z_n) =
\begin{cases}
\lambda(z_n)\, e^{-\lambda(z_n)\tau_n} & \text{if}~~ \tau_n \geq 0 \\
0 & \text{if}~~ \tau_n < 0
\end{cases}
\end{aligned} \label{eq:time}
\end{equation}
where $p^{\tau}_{\theta}(\tau_n|z_n)$ is a probability density function over the random variable $\tau_n$ and $\lambda(z_n)$ is the intensity of the process, which depends on the latent variable sample $z_n$. We train the model by optimizing the variational lower bound over the entire sequence of $N$ steps:
\begin{align}
\mathcal{L}_{\theta,\phi}(x_{1:N}) = \sum_{n=1}^N \big( &{\mathop{\mathbb{E}}}_{q_\phi(z_{n}|x_{1:n})}[\log p_\theta{(x_n|z_{n})}] \\
& - D_{KL}(q_\phi(z_n|x_{1:n}) \,\|\, p_\psi(z_n|x_{1:n-1})) \big)
\nonumber
\label{eq:loss}
\end{align}
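For intuition, a single step of this objective can be sketched in NumPy. This is a simplified illustration only, assuming diagonal Gaussians for both the posterior and the learned prior (so the KL has a closed form); the function name and arguments are ours, not the paper's:

```python
import numpy as np

def step_elbo(log_p_action, lam, tau, mu_q, sig_q, mu_p, sig_p):
    """One term of the sequence ELBO (sketch): reconstruction log-likelihood
    of the observed action category and inter-arrival time, minus the
    closed-form KL between posterior q = N(mu_q, sig_q^2) and learned
    prior p = N(mu_p, sig_p^2)."""
    log_p_tau = np.log(lam) - lam * tau  # exponential log-density, tau >= 0
    kl = np.sum(np.log(sig_p / sig_q)
                + (sig_q ** 2 + (mu_q - mu_p) ** 2) / (2 * sig_p ** 2) - 0.5)
    return log_p_action + log_p_tau - kl

# when posterior and prior coincide, the KL penalty vanishes
mu, sig = np.zeros(2), np.ones(2)
val = step_elbo(np.log(0.5), lam=2.0, tau=0.25,
                mu_q=mu, sig_q=sig, mu_p=mu, sig_p=sig)
```

Summing such terms over the $N$ steps of a sequence gives the training objective above.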
We empirically validate the efficacy of APP-VAE for modeling action sequences on the MultiTHUMOS and Breakfast datasets. Experiments show that our model is effective in capturing the uncertainty inherent in tasks such as action prediction and anomaly detection.
Fig. 4 shows examples of diverse future action sequences generated by APP-VAE given the history. For different provided histories, sampled sequences of actions are shown. We note that the overall durations and sequences of actions on the Breakfast dataset are reasonable. Variations, e.g., taking the juice squeezer before using it, or adding salt and pepper before cooking eggs, are plausible alternatives generated by our model.
Fig. 5 visualizes a traversal along one of the latent dimensions for three different sequences, uniformly sampling one dimension of $z$ over $[\mu - 5\sigma, \mu + 5\sigma]$ while fixing the others to their sampled values. As shown, this dimension correlates closely with the actions add saltnpepper, stirfry egg, and fry egg.
We further qualitatively examine the ability of the model to score the likelihood of individual test samples. We sort the test action sequences according to the average per-timestep likelihood, estimated by drawing samples from the approximate posterior distribution following the importance sampling approach. High-scoring sequences should be those that our model deems “normal,” while low-scoring sequences should be those that are unusual. Tab. 1 shows some examples of sequences with low and high likelihood on the MultiTHUMOS dataset. We note that a regular, structured sequence of actions, such as jump, body roll, cliff diving for a diving activity or body contract, squat, clean and jerk for a weightlifting activity, receives high likelihood. However, repeated hammer throws or golf swings with no set-up actions receive low likelihood.
Table 1 (below): Examples of test sequences with high and low likelihood according to our learned model:
Test sequences with high likelihood
Test sequences with low likelihood
We presented a novel probabilistic model for point process data: a variational auto-encoder that captures uncertainty in action times and category labels. As a generative model, it can produce action sequences by sampling from a prior distribution whose parameters are updated based on neural networks that control the distributions over the next action type and its temporal occurrence. Our model can be used to analyze and model asynchronous data in a wide variety of domains, such as social networks, earthquake events, and health informatics.
In this work, we propose a local discriminative neural model with a much smaller negative sampling space that can efficiently learn against incorrect orderings. The proposed coherence model is simple in structure, yet it significantly outperforms previous state-of-the-art methods on a standard benchmark dataset based on the Wall Street Journal corpus, as well as in multiple new and challenging settings of transfer to unseen categories of discourse on Wikipedia articles.
Code and dataset here.
The answer boils down to trust.
Trust in a machine, or an algorithm, is difficult to quantify. It’s more than just performance — most people will not be convinced by being told research cars have driven X miles with Y crashes. You may care about when negative events happen. Were they all in snowy conditions? Did they occur at night? How robust is the system, overall?
In machine learning, we typically have a metric to optimize. This could mean we minimize the time to travel between points, maximize the accuracy of a classifier, or maximize the return on an investment. However, trust is much more subjective, domain dependent, and user dependent. We don’t know how to write down a formula for trust, much less how to optimize it.
This post argues that intelligibility is a key component of trust.^{1} The deep learning explosion has brought us many high-performing algorithms that can tackle complex tasks at superhuman levels (e.g., playing the games of Go and Dota 2, or optimizing data centers). However, a common complaint is that such methods are inscrutable “black boxes.”
If we cannot understand exactly how a trained algorithm works, it is difficult to judge its robustness. For example, one group of researchers trained a deep neural network to detect pneumonia from X-rays. The data was collected from both inpatient wards and an emergency department, which had very different rates of the disease. Upon analysis, the researchers realized that the X-ray machines added different information to the images: the network was focusing on the word “portable,” which was present only in the emergency department X-rays, rather than on the medical characteristics of the picture itself. This example highlights how understanding a model can reveal problems that would remain hidden if one focused only on the model's accuracy.
Another reason to focus on intelligibility is in cases where we have properties we want to verify but cannot easily add to the loss function, i.e., the objective we wish to optimize. One may want to respect user preferences, avoid biases, and preserve privacy. An inscrutable algorithm may be difficult to verify, whereas an intelligible algorithm is not. For example, the black-box algorithm COMPAS is being used for assessing the risk of recidivism and has been accused of racial bias by an influential ProPublica article. In Cynthia Rudin's article "Please Stop Explaining Black Box Models for High-Stakes Decisions", she argues that her model (CORELS) achieves the same accuracy as COMPAS but is fully understandable, as it consists of only three if/then rules and does not take race (or variables correlated with race) into account.
if (age = 18 − 20) and (sex = male) then predict yes 
Rule list to predict 2year recidivism rate found by CORELS.
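To make the "understandable by construction" point concrete, an ordered rule list like the one above can be evaluated by checking antecedents in order. This is a hypothetical sketch of the mechanism, not the CORELS implementation, and it encodes only the single rule reproduced above (real CORELS lists contain a few more):

```python
def rule_list_predict(person, rules, default=False):
    """Evaluate an ordered rule list: return the consequent of the first
    rule whose antecedent matches, else the default prediction."""
    for antecedent, consequent in rules:
        if antecedent(person):
            return consequent
    return default

# the single rule reproduced above, encoded as a predicate
rules = [(lambda p: 18 <= p["age"] <= 20 and p["sex"] == "male", True)]

rule_list_predict({"age": 19, "sex": "male"}, rules)    # matches: True
rule_list_predict({"age": 45, "sex": "female"}, rules)  # falls through: False
```

Because every prediction is traceable to the first matching rule, the "explanation" is simply the rule itself.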
As mentioned above, intelligibility is context-dependent. But in general, we want some construct that can be understood given a user's limited working memory (i.e., people can hold roughly 7 ± 2 concepts in mind at once). There are three different ways we can think about intelligibility, which I enumerate next.
A local explanation of a model focuses on a particular region of operation. Continuing our autonomous car example, we could consider a local explanation to be one that explains how a car made a decision in one particular instance. The reasoning for a local explanation may or may not hold in other circumstances. A global explanation, in contrast, has to consider the entire model at once and thus is likely more complicated.
A more technically inclined user or model builder may have different requirements. First, they may think about the properties of the algorithm used. Is it guaranteed to converge? Will it find a near-optimal solution? Do the hyperparameters of the algorithm make sense? Second, they may consider whether all the inputs (features) to the algorithm seem useful and understandable. Third, is the algorithm “simulatable” (i.e., can a person calculate the outputs from the inputs) in a reasonable amount of time?
If a solution is intelligible, a user should be able to generate explanations about how the algorithm works. For instance, there should be a story about how the algorithm gets from its inputs to a given output or behavior. If the algorithm makes a mistake, we should be able to understand what went wrong. And given a particular output, how would the input have to change in order to produce a different output?
There are four high-level ways of achieving intelligibility. First, the user can passively observe input/output sequences and formulate their own understanding of the algorithm. Second, a set of post-hoc explanations can be provided to the user that aim to summarize how the system works. Third, the algorithm could be designed with fewer black-box components so that explanations are easier to generate and/or are more accurate. Fourth, the model could be inherently understandable.
Observing the algorithm act may seem to be too simplistic. Where this becomes interesting is when you consider what input/output sequences should be shown to a user. The HIGHLIGHTS algorithm focuses on reinforcement learning settings and works to find interesting examples. For instance, the authors argue that in order to trust an autonomous car, one wouldn’t want to see lots of examples of driving on a highway in light traffic. Instead, it would be better to see a variety of informative examples, such as driving through an intersection, driving at night, driving in heavy traffic, etc. At the core of the HIGHLIGHTS method is the idea of state importance, or the difference between the value of the best and worst actions in the world at a given moment in time:
\begin{equation}
I(s) = \max_{a} Q^{\pi}(s,a) - \min_{a} Q^{\pi}(s,a)
\end{equation}
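A minimal sketch of this criterion in pure Python (the `q_table` structure and function names here are hypothetical illustrations, not the authors' implementation):

```python
def state_importance(q_values):
    """I(s) = max_a Q(s,a) - min_a Q(s,a): large when the choice of
    action matters a lot in state s."""
    return max(q_values) - min(q_values)

def summarize(q_table, k):
    """Return the k most 'important' states, i.e., the states most worth
    showing to a user (q_table maps state -> list of action values)."""
    return sorted(q_table, key=lambda s: state_importance(q_table[s]),
                  reverse=True)[:k]

# toy Q-table: 'intersection' is high-stakes, 'empty_highway' is not
q_table = {"empty_highway": [1.0, 1.1, 0.9],
           "intersection": [5.0, -3.0, 0.5]}
summarize(q_table, k=1)  # -> ['intersection']
```

States where all actions have similar value (cruising on an empty highway) score low, while states where the wrong action is costly score high, which matches the paper's intuition.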
In particular, HIGHLIGHTS generates a summary of trajectories that capture important states an agent encountered. To test the quality of the generated examples, a user study was performed where people watched summaries of two agents playing Ms. Pacman and were asked to identify the better agent.
This animation shows the output of the HIGHLIGHTS algorithm in the Ms. Pacman domain https://goo.gl/79dqsd
Once a model has learned to perform a task, a second model can be trained to explain it. The motivation is that a simpler model may represent most of the true model while being much more understandable. Explanations could be natural language, visualizations (e.g., saliency maps or t-SNE), rules, or other human-understandable systems. The underlying assumption is that there is a fidelity/complexity tradeoff: these explanations can help the user understand the model at some level, even if they are not completely faithful to it.
For example, the LIME algorithm works on supervised learning methods, generating a more interpretable model that is locally faithful to a classifier. The optimization problem is set up to minimize the difference between the interpretable model $g$ and the actual function $f$ in some locality $\pi_x$, while also minimizing a measure of the complexity of the model $g$:
\begin{equation}
\xi(x) = \mbox{argmin}_{g \in G} ~~\mathcal{L} (f, g, \pi_x) + \Omega(g)
\end{equation}
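The flavor of this optimization can be sketched as a weighted least-squares fit around a point. This is a rough illustration only: it uses a Gaussian locality kernel, omits the complexity penalty $\Omega(g)$, and all names are ours rather than the LIME implementation's:

```python
import numpy as np

def local_surrogate(f, x, width=0.5, n=500, seed=0):
    """Fit a linear model g that is locally faithful to f near x: sample
    points around x, weight them by proximity (the locality pi_x), and
    solve a weighted least-squares problem."""
    rng = np.random.default_rng(seed)
    X = x + rng.normal(scale=width, size=(n, x.size))
    y = np.array([f(xi) for xi in X])
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / width ** 2)  # locality weights
    A = np.hstack([X, np.ones((n, 1))])  # linear model with intercept
    coef, *_ = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)
    return coef  # per-feature weights (+ intercept): the local explanation

# sanity check: for an f that is already linear, the surrogate recovers it
f = lambda v: 3.0 * v[0] - 2.0 * v[1]
coef = local_surrogate(f, np.array([1.0, 1.0]))
```

The recovered coefficients act as the explanation: they say which features drive $f$'s prediction in the neighborhood of $x$.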
The paper also introduces SP-LIME, an algorithm that selects a set of representative instances by exploiting the submodularity principle to greedily add non-overlapping examples that cover the input space while illustrating different, relevant outputs.
A novel approach to automated rationale generation for reinforcement learning agents is presented by Ehsan et al. Many people are asked to play the game Frogger and, while playing, to explain why they executed an action in a given state. This large corpus of states, actions, and explanations is then fed into a model. The explanation model can then provide so-called rationales for actions in different states, even if the agent actually controlling the game's avatar uses something like a neural network. The explanations may be plausible, but there is no guarantee that they match the actual reasons the agent acted the way it did.
The Deep neural network Rule Extraction via Decision tree induction (DeepRED) algorithm is able to extract human-readable rules that approximate the behavior of multi-layer neural networks performing multi-class classification. The algorithm takes a decompositional approach: starting with the output layer, each layer is explained in terms of the previous layer, and then the rules (produced by the C4.5 algorithm) are merged to produce a rule set for the entire network. One potential drawback of the method is that it is not clear whether the resulting rule sets are indeed interpretable, or whether the number of terms needed to reach appropriate fidelity would overwhelm a user.
In order to produce a better explanation, or summary of a model's predictions, the learning algorithm itself could be modified. That is, rather than having bolt-on explainability, the underlying training algorithm could be enhanced so that it is easier to generate the post-hoc explanation. For example, González et al. build upon the DeepRED algorithm by sparsifying the network and driving hidden units to either maximal or minimal activations. The first goal of the algorithm is to prune connections from the network, without reducing accuracy by much, with the expectation that “rules extracted from minimally connected neurons will be simpler and more accurate.” The second goal is to use a modified loss function that drives activations to maximal or minimal values, attempting to binarize the activation values, again without reducing accuracy by much. Experiments show that models generated in this manner are more compact, both in terms of the number of terms and the number of expressions.
A more radical solution is to focus on models that are easier to understand (e.g., “white box” models like rule lists). Here, the assumption is that there is not a strong performance/complexity tradeoff: the goal is to do (nearly) as well as black-box methods while maintaining interpretability. One example of this is the recent work by Okajima and Sadamasa on “Deep Neural Networks Constrained by Decision Rules,” which trains a deep neural network to select human-readable decision rules. These rule-constrained networks make decisions by selecting “a decision rule from a given decision rule set so that the observation satisfies the antecedent of the rule and the consequent gives a high probability to the correct class.” Therefore, every decision made is supported by a decision rule, by definition.
The Certifiably Optimal RulE ListS (CORELS) method, as mentioned above, is a way of producing an optimal rule list. In practice, its branch-and-bound method solves a difficult discrete optimization problem in a reasonable amount of time. One of the take-home arguments of the article is that simple rule lists can perform as well as complex black-box models: if this is true, shouldn't white-box models be preferred?
One current research project we're working on at Borealis AI focuses on making deep reinforcement learning more intelligible. Current methods are opaque: it is difficult to explain to clients how an agent works, and difficult to explain its individual decisions. While we are still early in our research, we are investigating methods in all of the categories outlined above. The long-term goal of this research is to bring explainability into customer-facing models, ultimately helping customers understand, and trust, our algorithms.
^{1}Note that we choose the word “intelligibility” with purpose. The questions discussed in this blog are related to explainability, interpretability, and “XAI,” and more broadly to safety and trust in artificial intelligence. However, we wish to emphasize that it is important for the system to be understood and that this may take some effort on the part of the subject understanding the system. Providing an explanation may or may not lead to this outcome — the example may be unhelpful, inaccurate, or even misleading with respect to the system’s true operation.
By systematically controlling the frequency components of the perturbation and evaluating against the top-placing defense submissions in the NeurIPS 2017 competition, we empirically show that performance improvements in both the white-box and black-box transfer settings are yielded only when low-frequency components are preserved. In fact, the defended models based on adversarial training are roughly as vulnerable to low-frequency perturbations as undefended models, suggesting that the purported robustness of state-of-the-art ImageNet defenses relies on adversarial perturbations being high frequency in nature. We do find that under the competition's ℓ∞ distortion bound of ϵ = 16/255, low-frequency perturbations are indeed perceptible. This questions the use of the ℓ∞-norm, in particular, as a distortion metric and, in turn, suggests that explicitly considering the frequency space is promising for learning robust models that better align with human perception.
How do you surpass a bar that’s already high? One way is to build something in the mountains.
With a peek at the Rockies from the window, our Vancouver research centre officially opens its doors today. RBC President and CEO, Dave McKay, will formally inaugurate the space and later join an informal panel moderated by John Stackhouse with Foteini Agrafioti and Vancouver research director, Greg Mori. It's the final ribbon cutting in a jam-packed season that also saw the completion of our new Waterloo and Edmonton locations.
Partnering once again with design firm Lemay, our vision was for each office to have its own identity and proudly represent the city it inhabits. This allowed Lemay full creative freedom to come up with interesting concepts that you wouldn’t normally find in a research centre.
Take a look at what we cooked up in the lab:
The main inspiration behind this research centre came from the institution that helped put the Kitchener-Waterloo corridor on the map: the University of Waterloo. With this theme in mind, we designed rooms to pay homage to campus life. There’s a student lounge, a teacher’s lounge, a science lab, and a track and field pitch. And what’s campus life without movies about campus life? When it came down to the decorative details, we honoured retro classics like Grease, The Breakfast Club and Teachers while acknowledging modern masterpieces like The Big Bang Theory.
While our Edmonton team has been established for two years, we wanted to make sure they had a space that reinforced the excellence of their research. In early February, the team moved into a place that drew inspiration from the city's great winter escape: West Edmonton Mall. The mall is legendary not just for its size, but for providing a scope of attractions that makes it plausible to spend three entire months indoors. And when “fun” is your theme, the design possibilities create themselves. We set up a bowling alley meeting room (with pins hanging from the ceiling), a pirate ship kitchen (galley? kitchen?), mini-golf, and a water park in the living room. The pièce de résistance, and the feature that makes every other non-Edmonton-based researcher jealous, is the sprawling outdoor terrace where the team can enjoy the Aurora Borealis on nights that don’t freeze water on impact.
Vancouverites are known for being a laid-back crew, so we tried a more mellow approach to our interior design. After all, it’s impossible to improve upon the natural beauty of the West Coast landscape. Our Vancouver space was designed to evoke scenes of cozy ski lodges, wildlife and mountain climbs. There’s even a room modeled after the city’s favourite mode of transportation – the bicycle – which is the embodiment of high- and low-tech in perfect tandem.
These spaces are even better in person. Be sure to check our Careers page regularly for job openings, or our Fellowships section for application opportunities.
When the world lined up to start the AI race, Canada was out of the gate like Andre de Grasse, taking the lead on research and development.
Now, can we lead on tackling its ethical and societal implications?
The news is flooded with examples of AI fails: algorithms that favour male job applicants over women, or image recognition software failing to correctly identify people of colour.
Dr. Foteini Agrafioti, the Head of Borealis AI and one of the country’s strongest voices on ensuring AI is ethical, was announced last week as the co-chair of Canada’s new Advisory Council on AI. She led the latest RBC Disruptors conversation about battling bias in AI with Dr. Elissa Strome, Executive Director, Pan-Canadian AI Strategy at CIFAR, and Dr. Layla El Asri, Research Manager, Microsoft Research Montréal.
Here are their thoughts on what the scientific community, governments and ordinary citizens can do to confront bias in artificial intelligence, and position Canada as a leader in ethical AI.
Bias has long existed in our society – and so it exists in our data. El Asri sees this as an opportunity. Unlike our own unconscious bias, we can at least uncover bias in an algorithm. To do this, companies need to be auditing their AI for bias every step of the way, as the major labs are now doing. El Asri credited Canadian leaders, such as AI pioneer Yoshua Bengio, for developing a will in Canada’s tech community to develop AI in a responsible way.
Right now, artificial intelligence is being developed by a very narrow subset of society: mainly highly educated men who went to the same schools and now live in the same cities. Only 18% of AI researchers are women, a fact that Strome called “terrible.” Organizations like CIFAR are working to bring more voices into the development of AI, with initiatives such as the AI for Good Summer Lab, a seven-week training program for undergraduate women in AI.
AI is only as good as the data it’s trained on. “If your data is not representative enough, your model is not going to work,” El Asri said. There needs to be more vigilance in ensuring data is representative — an area where Canada has a homegrown advantage. If you’re working with data collected in a multicultural country like ours, you’re likely working with data that represents different ethnic backgrounds. This kind of data will be essential to building technology that works for everyone, especially when it comes to something like health care.
Right now, it’s really just the tech community and policymakers talking about issues that are going to transform our society. We need to broaden that perspective, building in consultation with social scientists as an integral part of the development process. A recent CIFAR initiative brought together computer scientists and social scientists for a day to discuss the social, legal and ethical implications of AI. “The computer scientists were so eager to get their advice and insights,” Strome said. Similarly, at Microsoft, El Asri noted that their AI and ethics committees are made up of people from different disciplines, including anthropologists and historians.
“There’s a lot of fear and misunderstanding and myths about AI,” Strome said. Over the next few years, it’s going to be critical to bring the public into the AI conversation. People need to be aware of the positive implications, as well as the risks, that AI will have on their lives. The better the next generation understands AI and its societal and ethical implications, the better prepared they’ll be to ask tough questions of their leaders. Agrafioti suggested that Canadian culture is particularly attuned to ensuring fairness, casting a critical eye on technology before implementing it. Our balance of technical expertise and social values is exactly what’s needed to make sure the product that gets to market is ethical.
AI has been advancing much faster than any government can regulate it, so it was big news this week when the OECD adopted a set of AI principles, which set values-based standards for developing AI. Our leaders have an incredibly important role to play in developing policy and regulations around the use of AI, both domestically and internationally. Strome noted that Canada’s solid international reputation could go a long way in urging the world to play catch-up. Last summer, Prime Minister Trudeau and President Macron announced a joint Canada-France initiative on an International Panel on AI to support and guide the responsible adoption of AI, grounded in human rights. The first symposium will be in Paris this fall.
Solving bias in machines will take a human touch — and there’s no country better positioned than Canada to take the reins.
Our team is composed of two neural networks trained with state-of-the-art deep reinforcement learning algorithms and makes use of concepts like reward shaping, curriculum learning, and an automatic reasoning module for action pruning. Here, we describe these elements and, additionally, present a collection of open-sourced agents that can be used for training and testing in the Pommerman environment. Code available here.
We discover this sensitivity by analyzing the Bayes classifier's clean accuracy and robust accuracy. Extensive empirical investigation confirms our finding. Numerous neural nets trained on MNIST and CIFAR-10 variants achieve comparable clean accuracies, but they exhibit very different robustness when adversarially trained. This counterintuitive phenomenon suggests that the input data distribution alone can affect the adversarial robustness of trained neural networks, not necessarily the tasks themselves. Lastly, we discuss practical implications for evaluating adversarial robustness and make initial attempts to understand this complex phenomenon.
Machine learning models have demonstrated a vulnerability to adversarial perturbations: minor modifications to the input data that cause a model to output completely different predictions yet are not perceptible to the human eye.
Here’s an example (with the accompanying Jupyter notebook) of what that looks like: in the figure below, after a very small perturbation (middle image) is added to the panda image (right), the neural network recognizes the perturbed image (left) as a bucket. To a human observer, however, the perturbed panda looks exactly the same as the original panda.
Adversarial perturbations pose certain risks to machine learning applications with potential real-world impact. For example, researchers have shown that by putting black and white patches on a stop sign, state-of-the-art object detection systems can no longer recognize the stop sign correctly.
This problem is not restricted to images: speech recognition systems and malware detection systems have also been shown to be vulnerable to similar attacks. More broadly, realistic adversarial attacks could target machine learning systems whenever it is profitable for adversaries, for instance, fraud detection systems, identity recognition systems, and decision-making systems.
AdverTorch (repo, report) is a tool we built at the Borealis AI research lab that implements a series of attack-and-defense strategies. The idea behind it emerged back in 2017, when my team began to do some focused research on adversarial robustness. At the time, we only had two tools at our disposal: CleverHans and Foolbox.
While these are both good tools, they had their respective limitations. Back then, CleverHans was only set up for TensorFlow, which limited its usage in other deep learning frameworks (in our case, PyTorch). Moreover, the static computational graph nature of TensorFlow makes the implementation of attacks less straightforward. For anyone new to this type of research, it can be hard to understand what’s going on if the attack is written in a static-graph language.
Foolbox, on the other hand, contains various types of attack methods, but it only supports running attacks image-by-image, not batch-by-batch. This makes it slow to run and thus only suitable for evaluation. At the time, Foolbox also lacked some important attacks, e.g., the projected gradient descent (PGD) attack and the Carlini-Wagner $\ell_2$-norm constrained attack.
In the absence of a toolbox that would serve more of our needs, we decided to implement our own. Creating a proprietary tool would also allow us to use our favorite framework – PyTorch – which was not an option with the others.
Our aim was to provide researchers with tools for conducting research in different directions on adversarial robustness. For now, we’ve built AdverTorch primarily for researchers and practitioners who have some algorithmic understanding of the methods.
We had the following design goal in mind:
Resources permitting, we are also working to make it more user-friendly in the future.
For gradient-based attacks, we have the fast gradient (sign) methods (Goodfellow et al., 2014), projected gradient descent methods (Madry et al., 2017), the Carlini-Wagner attack (Carlini and Wagner, 2017), the spatial transformation attack (Xiao et al., 2018) and more. We also implemented a few gradient-free attacks, including the single pixel attack, the local search attack (Narodytska and Kasiviswanathan, 2016), and the Jacobian saliency map attack (Papernot et al., 2016).
Besides specific attacks, we also implemented a convenient wrapper for Backward Pass Differentiable Approximation (Athalye et al., 2018), an attack technique that enhances gradient-based attacks when attacking defended models that have non-differentiable or gradient-obfuscating components.
In terms of defenses, we considered two strategies: i) preprocessing-based defenses and ii) robust training. For preprocessing-based defenses, we implemented the JPEG filter, bit squeezing, and different kinds of spatial smoothing filters.
For robust training methods, we implemented them as examples in our repo. So far, we have a script for adversarial training on MNIST, which you can access here, and we plan to add more examples with different methods on various datasets.
We use the fast gradient sign attack as an example of how to create an attack in AdverTorch. The GradientSignAttack can be found at advertorch.attacks.one_step_gradient.
To create an attack on a classifier, we’ll need Attack and LabelMixin from advertorch.attacks.base.
from advertorch.attacks.base import Attack
from advertorch.attacks.base import LabelMixin
Attack is the base class of all attacks in AdverTorch. It defines the API of an attack. The core of it looks like this:
class Attack(object):

    def __init__(self, predict, loss_fn, clip_min, clip_max):
        self.predict = predict
        self.loss_fn = loss_fn
        self.clip_min = clip_min
        self.clip_max = clip_max

    def perturb(self, x, **kwargs):
        error = "Subclasses must implement perturb."
        raise NotImplementedError(error)

    def __call__(self, *args, **kwargs):
        return self.perturb(*args, **kwargs)
An attack contains three core components: predict, the function we want to attack; loss_fn, the loss function we maximize during the attack; and perturb, the method that implements the attack algorithm. Let’s illustrate these components with GradientSignAttack as an example.
class GradientSignAttack(Attack, LabelMixin):
    """
    One step fast gradient sign method (Goodfellow et al., 2014).
    Paper: https://arxiv.org/abs/1412.6572

    :param predict: forward pass function.
    :param loss_fn: loss function.
    :param eps: attack step size.
    :param clip_min: minimum value per input dimension.
    :param clip_max: maximum value per input dimension.
    :param targeted: indicate if this is a targeted attack.
    """

    def __init__(self, predict, loss_fn=None, eps=0.3, clip_min=0.,
                 clip_max=1., targeted=False):
        """
        Create an instance of the GradientSignAttack.
        """
        super(GradientSignAttack, self).__init__(
            predict, loss_fn, clip_min, clip_max)
        self.eps = eps
        self.targeted = targeted
        if self.loss_fn is None:
            self.loss_fn = nn.CrossEntropyLoss(reduction="sum")

    def perturb(self, x, y=None):
        """
        Given examples (x, y), returns their adversarial counterparts with
        an attack length of eps.

        :param x: input tensor.
        :param y: label tensor.
            - if None and self.targeted=False, compute y as predicted labels.
            - if self.targeted=True, then y must be the targeted labels.
        :return: tensor containing perturbed inputs.
        """
        x, y = self._verify_and_process_inputs(x, y)
        xadv = x.requires_grad_()

        ###############################
        # start: the attack algorithm #
        outputs = self.predict(xadv)
        loss = self.loss_fn(outputs, y)
        if self.targeted:
            loss = -loss
        loss.backward()
        grad_sign = xadv.grad.detach().sign()
        xadv = xadv + self.eps * grad_sign
        xadv = clamp(xadv, self.clip_min, self.clip_max)
        # end: the attack algorithm #
        ###############################

        return xadv
predict is the classifier, while loss_fn is the loss function for gradient calculation. The perturb method takes x and y as its arguments, where x is the input to be attacked and y is the true label of x. predict(x) contains the “logits” of the neural network. The loss_fn could be the cross-entropy loss function or another suitable loss function that takes predict(x) and y as its arguments.
Thanks to the dynamic computation graph nature of PyTorch, the actual attack algorithm can be implemented in a straightforward way in a few lines. For other types of attacks, we just need to replace the algorithm part of the code in perturb and change what parameters to pass to __init__.
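To make this pattern concrete without any deep learning framework, here is a minimal, hypothetical sketch: the Attack skeleton from above, plus a toy subclass whose perturb estimates the loss gradient by finite differences. FixedStepAttack and the toy predict/loss_fn are our own illustrative stand-ins, not part of AdverTorch.

```python
class Attack:
    def __init__(self, predict, loss_fn, clip_min, clip_max):
        self.predict = predict
        self.loss_fn = loss_fn
        self.clip_min = clip_min
        self.clip_max = clip_max

    def perturb(self, x, **kwargs):
        raise NotImplementedError("Subclasses must implement perturb.")


class FixedStepAttack(Attack):
    """Toy attack: move each coordinate by +/- eps in the direction
    that increases the loss, estimated with finite differences."""

    def __init__(self, predict, loss_fn, eps=0.1, clip_min=0.0, clip_max=1.0):
        super().__init__(predict, loss_fn, clip_min, clip_max)
        self.eps = eps

    def perturb(self, x, y):
        xadv = list(x)
        for i in range(len(xadv)):
            base = self.loss_fn(self.predict(xadv), y)
            xadv[i] += 1e-4                      # finite-difference probe
            probed = self.loss_fn(self.predict(xadv), y)
            xadv[i] -= 1e-4                      # undo the probe
            sign = 1.0 if probed > base else -1.0
            xadv[i] = min(max(xadv[i] + sign * self.eps, self.clip_min),
                          self.clip_max)
        return xadv


# toy "classifier": the score is the sum of the inputs; the loss is
# the distance of the score from the target y
adversary = FixedStepAttack(predict=sum, loss_fn=lambda out, y: abs(out - y))
adv = adversary.perturb([0.5, 0.5], y=1.0)   # each coordinate moves by ~eps
```

Only __init__ and perturb change between attacks; predict and loss_fn keep the same contract throughout.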
Note that the decoupling of these three core components is flexible enough to allow more versatile attacks. In general, we require predict and loss_fn to be designed in such a way that loss_fn always takes predict(x) and y as its inputs. As such, no knowledge about predict and loss_fn is required by the perturb method. For example, FastFeatureAttack and PGDAttack share the same underlying perturb_iterative function, but differ in their predict and loss_fn. In FastFeatureAttack, predict(x) outputs the feature representation from a specific layer, y is the guide feature representation that we want predict(x) to match, and the loss_fn becomes the mean squared error.
More generally, y can be any target of the adversarial perturbation, while predict(x) can output more complex data structures as long as the loss_fn can take them as its inputs. For example, we might want to generate one perturbation that fools both model A’s classification result and model B’s feature representation at the same time. In this case, we just need to make y and predict(x) tuples of labels and features, and modify the loss_fn accordingly. There is no need to modify the original perturbation implementation.
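As a toy illustration of this tuple pattern (model_a, model_b and the losses below are hypothetical stand-ins, not AdverTorch components): predict bundles model A’s logits together with model B’s features, and loss_fn unpacks both.

```python
def model_a(x):          # toy classifier "logits"
    return [x[0] + x[1], x[0] - x[1]]

def model_b(x):          # toy feature extractor
    return [2.0 * v for v in x]

def predict(x):
    # bundle both models' outputs into one tuple
    return (model_a(x), model_b(x))

def loss_fn(outputs, y):
    logits, feats = outputs
    y_label, y_feats = y
    # classification loss on model A plus feature-matching loss on model B
    cls_loss = -logits[y_label]
    feat_loss = sum((f - g) ** 2 for f, g in zip(feats, y_feats))
    return cls_loss + feat_loss

y = (0, [1.0, 1.0])                    # (target label, guide features)
combined = loss_fn(predict([0.5, 0.5]), y)   # -1.0 + 0.0 = -1.0
```

A perturb routine that only ever calls loss_fn(predict(x), y) works unchanged with this combined objective.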
As mentioned above, AdverTorch provides modules for preprocessing-based defense and examples for robust training.
We use MedianSmoothing2D as an example to illustrate how to define a preprocessingbased defense.
class MedianSmoothing2D(Processor):
    """
    Median Smoothing 2D.

    :param kernel_size: aperture linear size; must be odd and greater than 1.
    :param stride: stride of the convolution.
    """

    def __init__(self, kernel_size=3, stride=1):
        super(MedianSmoothing2D, self).__init__()
        self.kernel_size = kernel_size
        self.stride = stride
        padding = int(kernel_size) // 2
        if _is_even(kernel_size):
            # both ways of padding should be fine here
            # self.padding = (padding, 0, padding, 0)
            self.padding = (0, padding, 0, padding)
        else:
            self.padding = _quadruple(padding)

    def forward(self, x):
        x = F.pad(x, pad=self.padding, mode="reflect")
        x = x.unfold(2, self.kernel_size, self.stride)
        x = x.unfold(3, self.kernel_size, self.stride)
        x = x.contiguous().view(x.shape[:4] + (-1, )).median(dim=-1)[0]
        return x
The preprocessor is simply a torch.nn.Module. Its __init__ function takes the necessary parameters, and the forward function implements the actual preprocessing algorithm. When using MedianSmoothing2D, it can be composed with the original model to become a new model:
median_filter = MedianSmoothing2D()
new_model = torch.nn.Sequential(median_filter, model)
y = new_model(x)
or it can be called sequentially:
processed_x = median_filter(x)
y = model(processed_x)
We provide an example of how to use AdverTorch to do adversarial training (Madry et al., 2018) in tutorial_train_mnist.py. Compared to regular training, we only need two changes. The first is to initialize an adversary before training starts.
if flag_advtrain:
    from advertorch.attacks import LinfPGDAttack
    adversary = LinfPGDAttack(
        model, loss_fn=nn.CrossEntropyLoss(reduction="sum"), eps=0.3,
        nb_iter=40, eps_iter=0.01, rand_init=True, clip_min=0.0,
        clip_max=1.0, targeted=False)
The second is to generate the “adversarial minibatch” during training, and use it to train the model instead of the original minibatch.
if flag_advtrain:
    advdata = adversary.perturb(clndata, target)
    with torch.no_grad():
        output = model(advdata)
    test_advloss += F.cross_entropy(
        output, target, reduction='sum').item()
    pred = output.max(1, keepdim=True)[1]
    advcorrect += pred.eq(target.view_as(pred)).sum().item()
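The snippet above appears to come from the evaluation loop; the training-side change described in the text (train on the adversarial minibatch instead of the clean one) looks roughly like this, in pseudocode rather than the verbatim tutorial script:

```
# pseudocode sketch, variable names assumed from the snippet above
for clndata, target in train_loader:
    if flag_advtrain:
        data = adversary.perturb(clndata, target)  # adversarial minibatch
    else:
        data = clndata
    optimizer.zero_grad()
    loss = F.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
```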
Since building the toolkit, we’ve already used it for two papers: i) On the Sensitivity of Adversarial Robustness to Input Data Distributions; and ii) MMA Training: Direct Input Space Margin Maximization through Adversarial Training. It’s our sincere hope that AdverTorch helps you in your research and that you find its components useful. Of course, we welcome contributions from the community and would love to hear your feedback. You can open an issue or a pull request on AdverTorch, or email me at gavin.ding@borealisai.com.
While deep learning applications have successfully integrated into multiple product categories, RL has had a slower start. The recent momentum behind RL-based commercialization has been propelled by research advancements that have naturally lent themselves to product ideas in specific sectors, like financial markets, health care and marketing. Once competition ignites, this early trickle is predicted to burst into a gushing pipeline.
But RL algorithms are not your standard, run-of-the-mill solutions, and it’s unwise to treat them as such. Most pressingly, they’re continual learning algorithms, which means the type of data they require, combined with their potential for industry disruption, demands that privacy techniques catch up with the challenges these algorithms pose. One such technique is differential privacy.
The notion of privacy is intuitively difficult to translate into a technical definition. One standard definition around which the academic community has coalesced comes from a framework called differential privacy. Differential privacy centers on the notion of whether an individual’s participation in a dataset is discernible. So, an algorithm that acts on a dataset is called differentially private if an individual’s presence in or removal from that dataset makes minimal impact on the algorithm’s output. Differential privacy is then achieved when perturbation – or “noise” – is added during the algorithmic training process. The level and location of the noise are finely calibrated according to the degree of privacy and accuracy desired and the properties of the dataset and algorithm.
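In symbols, the standard definition reads as follows (our notation, not taken from a specific paper): a randomized algorithm $M$ is $\varepsilon$-differentially private if, for every pair of datasets $D, D'$ differing in one individual’s record and every set $S$ of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S].
```

Smaller $\varepsilon$ means the two output distributions are closer, so an observer learns less about any single individual.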
Standard differential privacy techniques work on fixed datasets that are already known to researchers. This prior knowledge allows researchers to decide how much noise they want to add to the dataset in order to protect an individual’s privacy. A standard example of how this works is compiling an aggregate statistic on how many people have done activity ‘x’, then setting the parameters so that we end up with essentially the same statistical result whether we keep or remove any individual from within the dataset. But what happens when the data comes from a continuous state space that is dynamic and constantly changing, and we are continuously learning? For that, we need a new approach.
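As a minimal sketch of that standard, fixed-dataset setting (function and parameter names are ours; this is the classic Laplace mechanism, not code from our paper): a count query is released with noise whose scale is the query’s sensitivity (1 for a count) divided by the privacy budget epsilon.

```python
import math
import random

def laplace_noise(scale, rng):
    # inverse-CDF sampling of a Laplace(0, scale) random variable
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon, rng):
    true_count = sum(1 for r in records if predicate(r))
    # a count changes by at most 1 when one individual is added or
    # removed, so the query's sensitivity is 1
    scale = 1.0 / epsilon
    return true_count + laplace_noise(scale, rng)

records = [{"did_x": True}, {"did_x": False}, {"did_x": True}]
noisy = private_count(records, lambda r: r["did_x"],
                      epsilon=0.5, rng=random.Random(0))
```

Whether any one record is kept or removed, the distribution of the released count barely changes, which is exactly the differential-privacy guarantee.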
In our paper, Private Q-Learning with Functional Noise in Continuous Spaces, we focus on finding new avenues to address this complexity. We do this by taking general concepts of differential privacy, abstracting them, and applying them to a different space. So, instead of adding scalar noise to a vector, we focus on protecting the reward function, adding perturbations as it gets updated by the algorithm.
This step is important because a reward function reveals the value of actions and, therefore, the latent preferences of users. So, for example, when you click the thumbs-up button on a social media app, this action gets codified as a “reward” that informs the “policy” for what the algorithm should do the next time it identifies a similar user in a similar state. Our approach protects the “why” – the motivation or intent – of the individual’s decision. It blocks individual preferences from being identified while still allowing for the abstraction of the policy. This protects the motivation for the reward instead of the outcome. We want to protect the fact that the system has learned about your die-hard fandom for indie music, while enabling the algorithm to build intelligence so it can personalize recommendations to different users.
We applied privacy to a setting that can be generalized to a variety of learning tasks – the Q-learning framework of reinforcement learning – where the objective was to maximize the action-value function. We used function approximators (i.e. a neural network) parameterized by θ to learn the optimal action-value function. In particular, we considered the continuous state space setting, where the action-state value Q(s, a) was assumed to be a set of m functions defined on the interval [0, 1] and, similarly, the reward was a set of m functions each defined on the interval [0, 1].
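For reference, the optimal action-value function being learned satisfies the standard Bellman optimality equation (textbook form, not notation from our paper):

```latex
Q^{*}(s, a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a \,\right],
```

where $\gamma$ is the discount factor and $s'$ is the next state; Q-learning fits a parameterized $Q_{\theta}$ to this fixed point.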
Standard perturbation methods for ML models achieve DP by adding noise to vectors – the input to the algorithm, the output of the algorithm, or gradient vectors within the ML model. In our case, we aimed to protect the reward function, which can depend on a high-dimensional context. Using standard methods to add perturbation would mean the amount of noise to be added would quickly grow to infinity if the continuous state space were discretized. Since we wanted to perturb the action-value functions, we added functional noise, rather than vector-valued noise as in standard methods. This functional noise was a sample path of an appropriately parametrized Gaussian process and was added to the function released by our Q-learning algorithm. As in the standard methods, the noise was parametrized by the sensitivity of the query, which, for vector-valued noise, is the $\ell_2$ norm of the difference in output between two datasets that differ in one individual’s value. Here, since we considered reward functions that change in value according to the state of the environment (which has randomness), we used the notion of Mahalanobis distance for sensitivity, which captures the idea of a distance from one point to a set of sampled points.
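To make “functional noise as a sample path of a Gaussian process” concrete, here is a toy, self-contained sketch. It uses Brownian motion, one of the simplest Gaussian processes, as a stand-in for the appropriately parametrized process in the paper; the grid, scale, and function values are arbitrary illustrations.

```python
import random

def gp_sample_path(n_points, scale, rng):
    # Brownian motion on [0, 1]: a Gaussian process whose increments
    # are independent N(0, scale**2 * dt)
    dt = 1.0 / n_points
    path, value = [], 0.0
    for _ in range(n_points):
        value += rng.gauss(0.0, scale * dt ** 0.5)
        path.append(value)
    return path

def perturb_function(f_values, scale, rng):
    # add one sample path of functional noise to the function the
    # algorithm releases, evaluated on the same grid
    noise = gp_sample_path(len(f_values), scale, rng)
    return [f + z for f, z in zip(f_values, noise)]

q_values = [0.1 * i for i in range(10)]   # toy Q(s, a) on a grid over [0, 1]
released = perturb_function(q_values, scale=0.05, rng=random.Random(0))
```

The released values are the true function plus one correlated sample path, so the whole function, not just a single point, is perturbed.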
Say Patient Zero is exhibiting medical symptoms and goes to see the doctor. The doctor gives Patient Zero Drug A (first state). Drug A doesn’t alleviate the symptoms, so now the doctor tries Drug B (second state). Patient Zero then moves into third state, and so on, until the problem (illness) gets solved. Here, the agent is programmed to be able to take a limited number of actions, then the system observes the state of the agent (symptoms relieved? Not relieved?), and based on the observations of that state, the agent must make a decision about what to do. The algorithm observes the outcome and, depending on the results, the agent gets a reward or punishment. The quality of the reward will depend on the longterm goals it’s trying to achieve. Are you getting closer to the goal (symptom alleviation) or moving away from it (even sicker than before the drugs)?
The privacy measures applied to RL in the past have mostly centered around protecting an individual’s movement (or itinerary) within a particular state. The policy, then, would be defined around why a user took a specific action in this state. This approach works well for the above scenario, where we’re protecting the user’s movement from state to state but not protecting a policy that can be extrapolated to many other users. It falls short, however, when applied to areas like marketing, with far more dynamic datasets and continual learning.
Differential privacy in deep RL is a more general and scalable technique, as it protects a higher-level model that captures behaviors rather than limiting itself to a particular data point. This approach is important for the future as we move to continuous, online learning systems: by blocking individual preferences from being identified while allowing the policy to be abstracted, we protect the motivation for the reward instead of the outcome. These kinds of safety guarantees are vital in order to make RL practical.
On Friday, March 1, we hosted a reception in honour of our 2018-2019 Borealis AI Fellowship winners. Our fellows span the country, from universities across British Columbia, Alberta, Ontario and Quebec. We were thrilled that so many flew in to attend our ceremony, which gave us a chance to congratulate our new fellows in person and share the research we’re doing at Borealis AI.
Most importantly, the event provided a space for everyone to connect with each other. A big motivation for our fellowship awards is our ongoing commitment to supporting students across Canada. A foundational aspect of this support is nurturing networks where researchers can exchange ideas, form fruitful partnerships, and continue our national leadership in the AI space.
So much of our success in the field stems from Canada’s supportive academic culture. It’s just who we are. For every brilliant graduate, there are a host of advisors and mentors who drew out the best in them along a challenging and competitive path.
True to form, Professors Ioannis Mitliagkas, Reihaneh Rabbany, and Jackie Cheung came by to support the winners. “We’ve been a pioneer in machine learning and AI, largely due to smart investments by public funding organizations. It bodes well for the future to see industry join in this effort to help sustain our research community,” said Prof. Mitliagkas, who came to support his new graduate student, Alexia JolicoeurMartineau.
Alexia, a statistician by training, was recently accepted to study at Mila on the strength of her independent work and has been gracious about her achievements. “This [fellowship] will bring me a peace of mind, so that I can fully focus on my research,” she said. “This makes me very confident about the future!”
Gauthier Gidel, whose research proposal on Efficient Saddle-Point Optimization for Modern Machine Learning impressed the adjudication committee, explained why efforts to champion young researchers’ work matter: “It makes me feel like my research matters, that I’m moving in the right direction,” he said. “To me, winning this award is a sort of peer recognition.”
After a tour of our research centre, a catered lunch and an opportunity to meet the Borealis AI Montreal research team, we ended on a high note, with some of our guests hanging out at the lab well into the afternoon. With new friendships forged, we look forward to seeing what direction our outstanding fellows bring to Canada’s research community.
We first empirically compare different labeling strategies to show the potential of using partial labels on multi-label datasets. Then, to learn with partial labels, we introduce a new classification loss that exploits the proportion of known labels per example. Our approach allows the use of the same training settings as when learning with all the annotations. We further explore several curriculum-learning-based strategies to predict missing labels. Experiments are performed on three large-scale multi-label datasets: MS COCO, NUS-WIDE and Open Images.
Borealis AI is thrilled to announce our 2018-2019 Graduate Fellowship winners. The fellowships were awarded to exceptional students pursuing graduate-level studies in machine learning and artificial intelligence at top universities across Canada.
Each of our winners demonstrated outstanding research capabilities, provided strong references, and outlined a clear, thoughtful research focus for the current academic year. We were overwhelmed by the exceptional calibre of this year’s candidates and our adjudication committee had no easy task selecting the 10 finalists.
We’re proud to introduce our inaugural group (below) and look forward to meeting everyone in person on March 1 as we host an event in their honour in Montreal.
School: University of Alberta, Amii, PhD candidate
Research Interests: Reinforcement learning
Research Topic: The Predictive Approach to Knowledge
School: Université de Montréal, Mila, PhD candidate
Research Areas: Deep generative modelling, computational statistics
Research Topic: Understanding, improving, and extending GANs
School: McGill University, Mila/RLLab, PhD candidate
Research Interests: Language and interaction, reinforcement learning
Research Topic: Emergent Communication and Representation Learning
School: University of British Columbia, PhD candidate
Research Interests: ML on knowledge graphs
Research Topic: Improved Knowledge Graph Embedding Using Ontology, Time, and Higher Arity Relations
School: Université de Montréal, Mila, PhD candidate
Research Interest: Optimization, multiagent learning
Research Topic: Efficient Saddle-Point Optimization for Modern Machine Learning
School: University of British Columbia, MSc candidate
Research Interests: Discrete and continuous optimization, theoretical ML, ML and data mining, design and analysis of algorithms, computational neuroscience, computational biology
Research Topic: Nonhomogenous Stochastic Gradient Descent
School: University of Toronto, Vector Institute, PhD candidate
Research Interests: Optimization, regularization, Bayesian neural networks, generative models
Research Topic: Online Hyperparameter Adaptation for Improved Training and Generalization
School: University of Waterloo, PhD candidate
Research Interests: Deep generative models, mixture models, online learning, sum-product networks, optimization
Research Topic: Deep Homogenous Mixture Models: Representation, Separation
School: McGill University, Mila, PhD candidate
Research Interest: Deep reinforcement learning
Research Topic: Unifying Imitation and RL for Data-Efficient Learning
School: University of Toronto, Vector Institute, PhD candidate
Research Areas: Bayesian deep learning, generalization
Research Topic: Reliable Uncertainty Estimation in Bayesian Neural Networks
Last month, we had the chance to attend the 2019 Association for the Advancement of Artificial Intelligence (AAAI) conference in Honolulu. As one of the longest-standing conferences in the field, AAAI distinguishes itself from most leading ML events in that it focuses on far more than just deep learning. This breadth of topics encourages a vigorous interaction between some of the more classical methods in the field and a few of the modern ones. In turn, knowledge about these classical ideas can be used to inform and develop better ML approaches. Below, we’ve compiled a few of our favorite moments and standout papers, some of which clearly show the interplay between these two worlds.
AutoML proved one of the hot topics at this year’s conference, and one of the more intriguing papers on the topic was Automatic Bayesian Density Analysis by Antonio Vergari (Max Planck Institute for Intelligent Systems); Alejandro Molina (TU Darmstadt); Robert Peharz (University of Cambridge); Zoubin Ghahramani (University of Cambridge); Kristian Kersting (TU Darmstadt); Isabel Valera (MPI-IS). Inspired by advances around automatic model selection in supervised learning, the paper proposed an automated framework for tackling the problem of density estimation in unsupervised learning. A challenge here is that this problem often requires domain expertise from the sector the data comes from. Instead of going down this limiting path, the authors suggested learning a sum-product network – an architecture initially proposed by Hoifung Poon and Pedro Domingos at UAI 2011 – to model the data.
The sum and product nodes of the network allowed the authors to capture the data’s feature behaviour as a mixture of heterogeneous distributions. In other words, rather than relying solely on a mixture of Gaussians for a particular feature, the authors allowed for the same mixture to contain discrete and continuous distributions and different types of parameterizations from a preset dictionary. Once learned, the structure of the SPN allowed for efficient inference using Gibbs sampling to infer missing feature values for some of the data points in the dataset. The model is also capable of providing information with respect to how well the data points fit the mixtures and highlights which points are likely to be outliers.
The idea of automating selection and analysis is widely emergent at ML and AI conferences, and seeing it applied to the age-old statistical ML problem of density estimation is a prime example of new approaches being used to tackle longstanding classical problems.
One of the most engaging panels at the conference took the form of a debate on the “Future of AI.” Here, one team, consisting of Peter Stone and Jennifer Neville, argued for the proposition that “The AI community today should continue to focus mostly on ML methods,” while the opposing team, Michael Littman and Oren Etzioni, argued against it.
The panel was hilarious, entertaining and informative. However, the audience rightly pointed out that all the debaters were, in fact, ML experts, which made the contrarian position somewhat of an argument for the sake of debate and suggested that perhaps bias in ML is mostly inevitable. A good argument presented during the panel suggested that too much of an ML focus, like the one happening now, will create bias in current students who, due to industry demands, haven’t learned the classical AI and non-ML related topics as well as previous grads did. The challenge will arise when the trend shifts from ML to the next big thing, at which point we might end up with an insufficient number of experts in non-ML areas of AI.
Post-debate voting showed the audience still sided against the proposition that the community should focus mostly on ML methods.
One dominant theme throughout the conference was the development and usage of neural networks to address problems involving graphs. Central to this theme was William Hamilton and Jian Tang’s tutorial on Graph Representation Learning. Here, the presenters outlined current advances in Graph Neural Networks (GNNs) and also presented models capable of generating graphs. Example application domains for these approaches included operations research and biology, where the data doesn’t lend itself well to traditional architectures. An example of this less-than-ideal fit would be convolutional networks, which were designed mainly for vision applications with the image structure in mind.
Of particular note was a paper presented in the Search, Constraint Satisfaction and Optimization session, entitled Learning to Solve NP-Complete Problems – A Graph Neural Network for Decision TSP by Marcelo Prates, Pedro H C Avelar, Henrique Lemos and Luís Lamb (Federal University of Rio Grande do Sul) and Moshe Vardi (Rice University). In the paper, the authors proposed using a GNN to tackle the “traveling salesman” problem. They introduced this approach as an extension of Learning a SAT Solver from Single-Bit Supervision by Selsam et al., whose authors used a message-passing NN approach to solve SAT problems.
Unlike that problem, however, the traveling salesman problem (TSP) represents a class of NP-complete problems with weighted relations between the vertices of its graph. To handle this, the authors mapped the weighted graph to a GNN by also allowing for embeddings on the edge weights, rather than just on the vertices. The model was then trained as a classification problem by giving it a constructed graph and asking it the question: “Does a Hamiltonian path of length X or less exist in this graph?”
For each constructed graph, two training examples were fed to the model: one with cost (1-dev)·X* and the other with cost (1+dev)·X*, where X* is the known minimal length and dev is the user-desired deviation from this optimal length. In the training process, the correct labels for these two instances are “NO” and “YES”, respectively. In addition to extending GNN usage beyond binary SAT problems, this approach is of particular importance because the TSP, along with Karp’s 20 other NP-complete problems, often recurs as a reduction of many everyday ML problems. Using GNNs – a modern approach – to solve NP-complete problems from the classical computer science literature is another example of how old met new at AAAI.
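As a concrete illustration of that label construction (the function name and the example values are ours, not from the paper):

```python
def make_training_pair(x_star, dev):
    """Build the two decision-TSP training examples for one graph:
    a 'NO' instance just below the optimal tour length and a
    'YES' instance just above it."""
    return [
        ((1 - dev) * x_star, "NO"),   # no tour this short exists
        ((1 + dev) * x_star, "YES"),  # the optimal tour already qualifies
    ]

# e.g. optimal tour length 100.0 with a 2% deviation
pairs = make_training_pair(x_star=100.0, dev=0.02)
```

Each graph thus contributes one positive and one negative example straddling the optimum, which is what lets the classifier localize the decision boundary near X*.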
Overall, AAAI 2019 showed that while we are going through current research explosions in DL and RL, classical topics (along with the faculty who have worked on them for decades!) are still alive, well and reincarnating themselves through these new channels. A welcome sight and a sign of what's to come.
Borealis AI has big goals for 2019, and exceeding those goals requires exceptional leadership. Just two weeks into the new year, we’re well on our way with the addition of Dr. Kathryn Hume, who joins our team today as Director of Business Development. In her new role, Kathryn will oversee the application of our academic research within the bank, help inform our strategy, and tap into her broad experience to help drive Borealis AI’s brand profile among key audiences.
Kathryn brings an unusually rich and varied background to the AI field. In addition to holding prior leadership positions at Integrate AI and Fast Forward Labs (Cloudera), she’s a prolific speaker and author on AI, has mastered seven languages and holds a PhD in Comparative Literature from Stanford. In her spare time, she teaches courses on enterprise adoption of AI, law and ethics at Harvard, MIT, Stanford and University of Calgary (just kidding, she doesn’t have any spare time).
As this introduction barely scratches the surface, we thought it would help to let Kathryn do the talking.
People underestimate how much of our lives are touched by banking as the substrate for the entire economy. Banking has macro impact—with risk management undergirding international market stability—and micro impact, where we all entrust banks with our financial assets to support our daily needs, like food, and our life aspirations, like education.
Now with AI, we’re able to use rich data that’s far more relevant to banking. People may not be aware that one of the first production deep learning applications was the use of computer vision to automatically recognize handwritten digits on cheques. This had been a rate limiter for the ATM and now, just a few years later, a customer can easily insert up to 50 cheques at a time and have the denominations read, analyzed and deposited within seconds. And now that we can also recognize and generate speech, what else might we do? What could payments look like? I’m most interested in AI applications like this where the tech hides behind the scenes but makes our lives so much easier.
First off, I really love the team and culture. I find there’s a mixture of curiosity and pure research talent. I also love that it’s a culture grounded in integrity. Everybody here takes the time to mean what they say and that’s very important to me. Apart from the culture, it’s exciting to return to my academic roots while pursuing my longterm career ambition, which is to be at the forefront of early commercialization of academic and scientific research.
There are a lot of existing applications the team has already built, and I’m excited to bring them out of the lab and into production across the bank. I’m also looking forward to solidifying our relationships with our academic partners and to using the success of Borealis AI as an example of how academia and business can work together to holistically bridge gaps between both worlds.
My approach to responsible AI comes from a firm belief that ethics occurs in the trenches. There are, obviously, aspects of ethics that tackle large questions about AI’s impact on society, but I think the rubber really hits the road when a group of people collaborating to build a machine learning system have come together from different departments to make a series of tactical choices together. I’m excited to put this into practice here at Borealis AI. What better place to be impacting the future of responsible AI than in one of the world’s largest banks?
I was working in New York in early 2017 and it was common knowledge in the American machine learning research community that Canada was the place to be. The Vector Institute had just been established in Toronto and it was interesting to observe this experiment in building a commercialization leg from a university research department. I originally moved here to join a company called Integrate AI. What’s kept me in Toronto is the excitement of working in an ecosystem that feels similar to what Silicon Valley was like 15 years ago. There are new companies popping up everywhere and I sense the right energy flowing between groups in academia, policy, government and business. It’s a unique place in time to be. I also love Amii (in Edmonton) and Mila (in Montreal). What’s going on in the Canadian ecosystem is just amazing to behold.
I got my PhD in Comparative Literature, but I actually have a strong math and science background. In fact, my dissertation is about the use of habit (or repetitive action) as a technique to generate knowledge in 17th century mathematics, philosophy and literature. I’ve come to believe since then that I inadvertently wrote a history of supervised learning through this work. Supervised learning is an AI technique that starts with a set of labeled training examples. For example, we teach an algorithm to adequately identify that a picture of a cat is a cat by giving the images a “cat” label, then training the system over time. The “supervised learning” I wrote about in my thesis pertains to human self-transformation: that if we want to become a different type of person, we have to think a certain way, then practice those thoughts so we don’t default to our old habits.
Years ago, I gave a talk about why my background as an intellectual historian of math and philosophy actually makes me a great product marketer. My work doesn’t ask whether philosophers like Descartes or Leibniz or Newton were “right”; rather, it asks what did they think they were thinking? So, my task was to read everything they’d read and try to reconstruct what they thought so as to reinterpret what they were saying. It’s an excellent skill set for someone in business development because when you’re working as a translator between academic machine learning researchers and businesspeople, you have to do that work on both sides. How do the researchers think? What are they reading? How do they use language to express their point of view? Similarly, how do the bankers in the various divisions of the bank think? What do they read? How do they see the world? And, most importantly, can we make those two points meet at the intersection? These are the unique translation skills my background has provided, and I’ve seen it unfold to great effect in the boardroom. I’m really looking forward to adapting it to this next chapter of my career at Borealis AI.
Whether you’re training a model on an enormous dataset with an industrial scale server or deploying a small model for a cellphone application, energy is often a fundamental bottleneck. I suspect algorithmic innovations that provide greater energy efficiency will be necessary to push forward the next frontier of machine learning. It’s with this mindset (and with my own paper in tow) that I attended this workshop.
In a previous blog post, I discussed Max Welling’s Intelligence per Kilowatt-hour paper, which he presented at the ICML conference in Stockholm last summer. Machine learning models are helping to solve increasingly difficult tasks, but as a natural result of scale, some of the models are getting enormous. Often, our community creates models that work just for the particular task at hand; but if these techniques are to be widely deployable, we must work to decrease the energy of these models. Because of this problem, Prof. Welling argued that machine learning should be judged by the intelligence per unit energy. The CDNNRIA workshop I attended seemed like a very natural response to Prof. Welling’s ICML presentation, as in addition to him being one of the workshop organizers, the focus was on compact (i.e. more energy efficient) neural networks.
Since I’m personally quite passionate about this topic, I’ve spent time exploring it in various forms of scientific inquiry. My workshop paper, On Learning Wire-Length Efficient Neural Networks, which I worked on with co-authors Luyu Wang, Giuseppe Castiglione, Christopher Srinivasa, and Marcus Brubaker, attempted to tackle an aspect of this important topic. I was honoured to present it. In this post, I will summarize our paper, highlight other interesting papers that relate to the subject, present the results of an experiment inspired by some of the additional workshop papers, then draw an emergent lesson from the workshop about the value of negative results.
A classic paper in the field, called Optimal Brain Damage, first introduced the basic training and pruning pipeline. The standard technique for creating energy-efficient neural networks involves assuming some initial architecture, initializing weight and bias values of the network, then modifying those parameters so that the network closely fits training data. This step is called "training". The next step – “pruning” – involves deleting the edges of the network that are somehow deemed "unimportant," then retraining the network after the edges are removed. The way “unimportant” gets defined, in this context, can vary depending on the specific technique. The big revelation of machine learning is that this pipeline works, and when done iteratively, the number of parameters in the model can be reduced by upwards of 50 times with no decrease in accuracy. In fact, there’s often an increase in accuracy.
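To make the pipeline concrete, the pruning step can be sketched in a few lines. The snippet below uses magnitude-based pruning, one common way to operationalize “unimportant”; note this is a generic illustration, not the second-order saliency criterion of Optimal Brain Damage, and the names are my own.

```python
# A minimal sketch of the pruning step, assuming "unimportant" means
# "smallest magnitude". Real pipelines alternate this with retraining.
import numpy as np

def magnitude_prune(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(fraction * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold      # surviving connections
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))                 # stand-in for trained weights
pruned, mask = magnitude_prune(w, 0.5)      # prune half the connections
```

After a step like this, the surviving weights would be retrained with the mask held fixed, and the whole train-prune-retrain loop iterated.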
Most previous work evaluates the performance of pruning algorithms using the number of non-zero parameters as the metric. But as we shall see, this is not the only criterion. Some of the notable work at the CDNNRIA workshop considered energy consumption under an assumed three-level cache architecture (as discussed below). The existing work – both the cache-architecture model and the non-zero-parameter count – is a good model for energy consumption on general-purpose circuitry with fixed memory hierarchies. However, some machine learning applications (say, image recognition) may need to be deployed far more widely.
As with error-control coding, specialized hardware implementations of neural networks may directly implement the edges of the network as wires. In this case, however, it is the total wiring length, not the number of memory accesses, that dominates energy consumption. The immediate reason is the resistive-capacitive effects of wires; more generally, wiring length is a fundamental energy limitation of all the practical computational techniques we can conceive. This hinges upon a basic fact: real systems have friction.
With this context established, our paper introduces a simple criterion for analyzing energy, which we call the “wire-length”, or information-friction, model. It is inspired by the works of Thompson, the more recent work of Grover, and my own PhD thesis: energy is proportional to the total length of all the edges connecting the neurons of the network. The technique places the nodes of the neural network on a three-dimensional grid, with nodes at least one unit of distance apart. Then, if two nodes are connected by a wire in the neural network, the length of the wire is the Manhattan distance between the two nodes. The task we define is to find a neural network that is both accurate and admits a placement of nodes that is wire-length efficient.
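A toy sketch of the resulting cost may help, assuming nodes sit at integer coordinates on a 3-D grid and each edge contributes the Manhattan distance between its endpoints; the placement and edge list below are invented for illustration:

```python
# Toy wire-length (information-friction) cost: sum of Manhattan distances
# over all edges, given a placement of nodes on an integer 3-D grid.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_wire_length(placement, edges):
    """placement: node -> (x, y, z); edges: iterable of (u, v) pairs."""
    return sum(manhattan(placement[u], placement[v]) for u, v in edges)

placement = {"in": (0, 0, 0), "h1": (1, 0, 0), "h2": (0, 1, 0), "out": (1, 1, 1)}
edges = [("in", "h1"), ("in", "h2"), ("h1", "out"), ("h2", "out")]
cost = total_wire_length(placement, edges)
```

A wire-length-efficient network is then one that is accurate while keeping this total small under some placement.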
In our paper, we introduced three algorithms that can be combined and used at the training, pruning, and node placement steps. Our experiments show that each of our techniques is independently effective, and by combining them and using a hyperparameter search we can get even lower energy, which has allowed us to produce benchmarks for some standard problems. We also found that the techniques worked across datasets.
Several workshop papers submitted in parallel to the conference caught my attention. This one, authored by Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin and Richard Baraniuk, seemed to tackle similar themes as ours. What interested me most was that it sought a way to make energy efficient neural networks using a technique distinguished from the standard trainingpruning pipeline:
This paper is also interesting for the way it suggests minimizing total energy in a different manner than the standard pruning paradigm. The human brain is hyper-optimized for minimizing energy consumption, and if our machine learning techniques are to mimic the kinds of tasks performed by the brain, I suspect we will have to use all kinds of techniques to keep energy costs under control. The “skip policy” idea of Wang et al. may be one such technique useful on the road to more energy efficient artificial intelligence.
In the normal iterative training, pruning, retraining framework, we keep the weights of the neural network the same post-pruning (except, of course, for the weights associated with pruned edges) and then retrain the weights from this point onward. The idea behind this methodology is that the training-pruning process helps the network learn important edges and important weights.
Two similar papers submitted to the workshop added a twist to this paradigm. They showed that if the weights are randomly reinitialized after some training and pruning, and the network is then retrained from that random reinitialization, a higher accuracy can be obtained. This result suggests that pruning helps find important connections but not important weights. It contradicts what many (including myself) would have intuited: that pruning allows you to learn important weight values and important connections.
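A minimal sketch of the twist, assuming we already hold the binary connectivity mask produced by pruning; the shapes and weight scale here are illustrative only:

```python
# Keep only the pruned connectivity pattern; draw fresh random weights for
# the surviving edges before retraining. No actual training is shown.
import numpy as np

def reinitialize(mask, rng):
    """Fresh random weights on surviving connections; pruned edges stay zero."""
    fresh = rng.normal(scale=0.1, size=mask.shape)
    return np.where(mask, fresh, 0.0)

rng = np.random.default_rng(1)
mask = np.array([[True, False], [False, True]])  # connectivity kept by pruning
w_new = reinitialize(mask, rng)
```

Retraining would then proceed from `w_new` rather than from the surviving trained weights.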
The experimental evidence across these two papers could provide a very easy-to-implement tool for the machine-learning practitioner, and I’m curious to see if this technique gets widely adopted. There’s a good chance, as one of the papers, “Rethinking the Value of Network Pruning”, won the workshop’s best paper award.
However, while the papers’ results suggest a simple approach toward achieving higher accuracy on low-energy machine learning models, it raises the question of whether this tool will work on the pruned architectures we obtained when optimizing for wire-length. The fact that the result was discovered independently by two different groups gave us more confidence that it would work for us. In the spirit of learning from the workshop, we decided to try it out back at Borealis AI HQ.
We obtained the best-performing model using distance-based regularization at different target accuracies. Then, we reinitialized the weights of the resulting network before we retrained. The table below shows the results:
Accuracy Before Reinitialization (%) | Accuracy After Reinitialization (%) | Accuracy After Retraining (%)
98 | 8.92 | 97.08
97 | 10.1 | 84.64
95 | 10.32 | 61.94
90 | 10.09 | 30.76
The left column presents the accuracy before reinitialization; the middle column shows the accuracy immediately after reinitialization; and the final column reveals the accuracy after retraining the reinitialized network. As we can see, we consistently get lower accuracy than the network had before reinitialization, which suggests the reinitialization approach does not work in our setting. We also find the technique gets less effective as the networks being reinitialized get smaller, which is one possible explanation for why it didn’t work.
Does this contradict the results of the workshop papers discussed above? No. But it might reveal that the general approach is less likely to work than we might have thought. Notably, “Rethinking the Value of Network Pruning” itself presents negative results, so we might have guessed the technique wouldn’t work on our networks trained for shorter wire-length. In the next section, I’ll discuss why I think having negative results is so useful.
It’s a machine learning truism that any single result is hard to pinpoint as being independently important, but the many pieces of evidence reported across multiple papers allow us to draw an emergent lesson. I view machine learning as a grab-bag of techniques that help solve new classes of computational problems. They don’t always work, but all too often some of them do. This allows us to use the literature to produce rules of thumb about which techniques might work. For example, if we looked at the original Optimal Brain Damage paper in isolation, it might be hard to discern the paper’s broad applicability. But the fact that the standard training-pruning pipeline has been so widely used, and that so many modifications of the technique (including our wire-length pruning work) also work, gives us confidence in the idea’s ability to capture something basic and fundamental – that doing some type of pruning is appropriate if minimizing energy consumption is a major concern.
Due to the multiplicity of possible techniques, the only thing machine learning practitioners can do is test them out in the first place, set good evaluation criteria, and see if they work. Since engineering and computational resources are limited, this also means judiciously choosing which techniques to take on. This process requires a careful balancing of engineering risk and reward.
So, while it may be worth it to try a technique, the decision depends on the particulars of the problem and the probability that the technique will be successful. Presenting negative results allows our readers to intuit this probability. Moreover, negative results inform researchers about areas that have already been attempted and saves them the effort of retesting them.
The value of negative results can be further illustrated with a concrete thought experiment. Suppose your goal is to create a neural network in a place where computational resources are free and plentiful, and the tool is not going to be widely deployed. Perhaps such a network is used in an internal tool at a small company. The engineer, in this case, might ask: Should I try the reinitialization and retraining technique in order to get a more accurate small network?
Since the papers (and our experiment) suggest the techniques only work some of the time, it may not be worth the effort to give it a try. After all, there’s only a slim chance of it being successful and the reward margin is small. However, suppose the network were to be widely deployed to a billion cellphones, like, for example, in some widely deployed social media application. In this case, it makes sense to try this technique, as well as a number of others, to ensure the tool uses as little energy as possible.
Real problems may fit somewhere between these two extremes, and choosing the right approach requires having a finely tuned sense of the probabilities that they will work. Having a collection of experimental results in the literature, both positive and negative, helps the engineer make the right judgment call about whether a technique is worth the effort.
Right now, we have enormous potential in the field, but we have very limited human talent, and limited computational resources. We should take on the responsibility to ensure we draw the right lessons from the work we do and present our work in as useful a way as possible. That’s why “Rethinking the Value of Network Pruning” is a strong output – not only does it find a surprising and successful technique, it also presents negative results. The quality of the scientific analysis in the paper makes it, in my opinion, a worthy recipient of the workshop’s Best Paper Award and hopefully sets a precedent for more researchers to explore negative results for the greater good of the field.
*Special thanks to Luyu Wang for running the reinitialization experiment in this post.
The Pommerman environment [1] is based on the classic Nintendo console game, Bomberman. It was set up by a group of machine learning researchers to explore multi-agent learning and push the state of the art in reinforcement learning through competitive play.
The team competition was held on December 8, 2018 during the NeurIPS conference in Montreal. It involved 25 participants from all over the world. The Borealis AI team, consisting of Edmonton researchers Chao Gao, Pablo Hernandez-Leal, Bilal Kartal and research director, Matt Taylor, won 2nd place in the learning agents category, and 5th place in the global ranking including (non-learning) heuristic agents. As a reward, we got to haul a sweet NVIDIA Titan V GPU CEO Edition home. Here’s how we pulled it off.
The Pommerman team competition consists of four bomber agents placed at the corners of an 11 x 11 symmetrical board. There are two teams, each consisting of two agents.
Competition rules work like this:
At every timestep, each agent has the ability to execute one of six actions: they can move in any one of four cardinal directions, remain in place, or plant a bomb.
Each cell on the board can serve as a passage (the agent can walk over it), a rigid wall (the cell cannot be destroyed), or a plank of wood (the cell can be destroyed with a bomb).
The game maps, which function as individual levels, are randomly generated; however, there is always a guaranteed path between any pair of agents, so the procedurally generated maps are guaranteed to be playable.
Whenever an agent plants a bomb it explodes after 10 timesteps, producing flames that have a lifetime of two timesteps. Flames destroy wood and kill any agents within their blast radius. When wood is destroyed, the fallout reveals either a passage or a powerup (see below).
Powerups, which are items that impact a player’s abilities during the game, can be of three types: i) they increase the blast radius of bombs; ii) they increase the number of bombs the agent can place; or iii) they give the ability to kick bombs.
Each game episode lasts up to 800 timesteps, and there are two ways for it to end: if a team wins before reaching this upper bound, the game is over; otherwise, a tie is declared at 800 timesteps.
An example of a Pommerman team game.
The Pommerman team competition is a very challenging benchmark for reinforcement learning methods. Here’s why:
When an agent is in the early stage of training, it commits suicide many times.
After some training, the agent learns to place bombs near the opponent and move away from the blast.
Our learning agent (white) is highly skilled against a SimpleAgent. It avoids the blasts and also learns how to trick SimpleAgent to commit suicide in order to win without having to place any bombs.
When we examined the behavior of our learning agent against SimpleAgent, we discovered that our agent had learned how to force SimpleAgent to commit suicide. The pattern starts when SimpleAgent places a bomb and then takes a movement action toward a neighboring cell X. Our agent, having learned this opponent behaviour, simultaneously moves toward the same cell X; by the game engine’s forward model, both agents are then sent back to their original locations in the next timestep. This repeats until the bomb goes off, successfully blasting SimpleAgent. In other words, our agent had learned a flaw in SimpleAgent and exploited it to win games by forcing suicides. This policy is optimal against SimpleAgent; however, it lacks generalization against other opponents, since agents trained this way learned to stop placing bombs and so make themselves easy targets for exploitation.
Note: generalization over opponent policies is of utmost importance when dealing with dynamic multi-agent environments, and similar problems have also been encountered in Laser Tag [6].
In single-agent tasks, faulty (and strange) behaviors have also been observed [7]. A trained RL agent for CoastRunners discovered a spot in the game where, due to an unexpected mismatch between the maximum possible reward and the intended behaviour, the agent could obtain higher scores by circling that spot rather than finishing the game.
A Skynet team is composed of a single neural network and is based on five building blocks:
We make use of the parameter sharing mechanism; that is, we allow the agents to share the parameters of a single network. This allows the network to be trained with the experiences of the two agents. However, it still allows for diverse behavior between agents because each agent receives different observations.
Additionally, we added dense rewards to help the agent improve learning performance. We took inspiration from the difference-reward mechanism to provide agents with a more meaningful measure of their individual contribution, in contrast to simply using the single global reward.
Our third block, the ActionFilter module, builds on the philosophy of instilling prior knowledge in the agent by telling it what it should not do, then letting it discover what to do by trial and error, i.e., learning. The benefit is twofold: 1) the learning problem is simplified; and 2) basic skills, such as avoiding flames or evading bombs in simple cases, are acquired perfectly by the agent.
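As a rough illustration of the filtering idea (not the actual competition code; the board encoding and action names below are assumptions), a filter that rules out stepping into flames might look like:

```python
# Toy action filter: keep only movement actions that do not step into a
# flame. Board values and action names are invented for this sketch.
PASSAGE, FLAME = 0, 3
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def filter_moves(board, pos):
    """Return the actions that do not walk off the board or into a flame."""
    rows, cols = len(board), len(board[0])
    safe = ["stop"]                         # staying put is always permitted here
    for name, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < rows and 0 <= c < cols and board[r][c] != FLAME:
            safe.append(name)
    return safe

board = [[0, 3, 0],
         [0, 0, 0],
         [0, 0, 0]]
safe = filter_moves(board, (0, 0))          # flame to the right blocks "right"
```

The agent then samples only from the returned safe actions and is left to learn everything else by trial and error.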
It is worth mentioning that the above ActionFilter does not significantly slow down RL training, as it is extremely fast. Together with the neural net evaluation, each action still takes only several milliseconds – a speed almost equivalent to pure neural-net forward inference. For context, the time limit in the competition is 100 ms per move.
The neural net is trained by $\mathit{PPO}$, minimizing the following objective:
\begin{equation}
\begin{split}
o(\theta;\mathcal{D}) & = \sum_{(s_t, a_t, R_t) \in \mathcal{D}} \Bigg[ -\mathit{clip}\Big(\frac{\pi_\theta(a_t|s_t)}{\pi_\theta^{old}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\Big) A(s_t, a_t) + \\
& \frac{\alpha}{2} \max\Big[ \big(v_\theta(s_t) - R_t\big)^2, \big(v_\theta^{old}(s_t) + \mathit{clip}(v_\theta(s_t) - v_\theta^{old}(s_t), -\epsilon, \epsilon) - R_t\big)^2 \Big] \Bigg],
\end{split}
\end{equation}
where $\theta$ denotes the neural net parameters, $\mathcal{D}$ is sampled by $\pi_\theta^{old}$, and $\epsilon$ is a tuning parameter. Refer to the PPO paper [13] for details.
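To make the objective concrete, here is a small numerical sketch, treating the minimized loss as the negative clipped surrogate plus the α-weighted clipped value loss, on a toy one-element batch; this is illustrative only, not our training code:

```python
# Toy evaluation of a PPO-style loss: clipped probability ratio times
# advantage, plus a clipped value loss, summed over the batch.
import numpy as np

def ppo_objective(logp, logp_old, adv, v, v_old, returns, eps=0.2, alpha=0.5):
    ratio = np.exp(logp - logp_old)                      # pi / pi_old
    policy_term = np.clip(ratio, 1 - eps, 1 + eps) * adv
    v_clipped = v_old + np.clip(v - v_old, -eps, eps)    # clipped value update
    value_term = np.maximum((v - returns) ** 2, (v_clipped - returns) ** 2)
    # minimized loss: negative surrogate plus weighted value loss
    return np.sum(-policy_term + (alpha / 2) * value_term)

loss = ppo_objective(
    logp=np.array([-1.0]), logp_old=np.array([-1.0]),
    adv=np.array([2.0]), v=np.array([0.5]), v_old=np.array([0.4]),
    returns=np.array([1.0]), eps=0.2, alpha=0.5,
)
```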
We let our team compete against a set of curriculum opponents:
The reason we allowed the opponent not to place bombs is that the neural net can then focus on learning true “blasting” skills, rather than a skill that relies solely on the opponent’s mistakenly suicidal actions. This strategy also avoids training on the “false positive” reward signal caused by an opponent’s involuntary suicide.
As shown in the figure below, the architecture first repeats four convolution layers, followed by two policy and value heads, respectively.
Instead of using an LSTM to track the observational history, we used a “retrospective board” to keep track of the most recent value of each cell on the board. For cells outside an agent's purview, the “retrospective board” filled the unobserved elements of the board with the elements that were observed most recently. The input feature counts 14 planes in total, where the first 10 planes are extracted from the agent's current observation, while the remaining four come from the “retrospective board”.
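The “retrospective board” update itself is simple; here is a sketch, assuming unobserved cells are encoded as None (the actual encoding in our system differs):

```python
# Retrospective board: for every cell, keep the most recently observed value
# and fall back on it for cells outside the agent's current view.
def update_retrospective(retro, observation):
    """observation: 2-D list with None for cells outside the agent's view."""
    for r, row in enumerate(observation):
        for c, value in enumerate(row):
            if value is not None:
                retro[r][c] = value      # overwrite only what was observed
    return retro

retro = [[0, 0], [0, 0]]                 # initial board memory
obs1 = [[5, None], [None, None]]         # agent sees only the top-left cell
obs2 = [[None, 7], [None, None]]         # next step: only the top-right cell
update_retrospective(retro, obs1)
update_retrospective(retro, obs2)
```

Cells never yet observed simply retain their initial value; everything else holds the most recent observation.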
An example of a game between Skynet Team (Red) vs a team composed of two SimpleAgents (Blue).
The Pommerman team competition used a double-elimination format. The top three agents used tree-search methods, i.e., they actively employed the game-forward model to look ahead for each decision, guided by heuristics. During the competition, these agents seemed to perform more bomb kicking, which increased their chances of survival.
As mentioned in the introduction, our Skynet team won 2nd place in the category of learning agents, and 5th place in the global ranking including (non-learning) heuristic agents. It is worth noting that purely scripted agents were not among the top players in this competition, which shows the high quality of both the tree-search and learning entries.
Another one of our submissions, CautiousTeam, was based on SimpleAgent and – interestingly enough – wound up ranking 7th overall in the competition. CautiousTeam was submitted primarily to verify our suspicion that a SimpleAgent that never places a bomb could be as strong as (or perhaps even stronger than) the winner [3] of the first competition held in June, i.e., a fully observable free-for-all scenario. The competition results seem to support this suspicion.
Aside from being an interesting and (most importantly) fun environment, the Pommerman simulator was also designed as a benchmark for multi-agent learning. We are currently exploring multi-agent deep reinforcement learning methods [5] using Pommerman as a testbed.
We would like to thank the creators of the Pommerman testbed, the competition organizers and the growing Pommerman community on Discord. We look forward to future competitions.
[1] Cinjon Resnick, Wes Eldridge, David Ha, Denny Britz, Jakob Foerster, Julian Togelius, Kyunghyun Cho, and Joan Bruna. Pommerman: A multiagent playground. arXiv preprint arXiv:1809.07124, 2018.
[2] Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[3] Zhou, Hongwei, et al. "A hybrid search agent in Pommerman." Proceedings of the 13th International Conference on the Foundations of Digital Games. ACM, 2018.
[4] Bilal Kartal, Pablo HernandezLeal, and Matthew E. Taylor. "Using Monte Carlo Tree Search as a Demonstrator within Asynchronous Deep RL." arXiv preprint arXiv:1812.00045(2018).
[5] Pablo HernandezLeal, Bilal Kartal, and Matthew E Taylor. Is multiagent deep reinforcement learning the answer or the question? A brief survey. arXiv preprint arXiv:1810.05587, 2018.
[6] Lanctot, Marc, et al. "A unified gametheoretic approach to multiagent reinforcement learning." Advances in Neural Information Processing Systems. 2017.
[7] OpenAI. Faulty Reward Functions in the Wild. https://blog.openai.com/faulty-reward-functions/
[8] Pablo HernandezLeal, Bilal Kartal, and Matthew E Taylor. Skill Reuse in Partially Observable Multiagent Environments. LatinX in AI Workshop @ NeurIPS 2018
[9] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." nature 521.7553 (2015): 436.
[10] J. N. Foerster, Y. M. Assael, N. De Freitas, S. Whiteson, Learning to communicate with deep multiagent reinforcement learning, in: Advances in Neural Information Processing Systems, 2016,
[11] J. N. Foerster, N. Nardelli, G. Farquhar, T. Afouras, P. H. S. Torr, P. Kohli, S. Whiteson, Stabilising Experience Replay for Deep MultiAgent Reinforcement Learning., in: International Conference on Machine Learning, 2017.
[12] Devlin, Sam, et al. "Potentialbased difference rewards for multiagent reinforcement learning." Proceedings of the 2014 international conference on Autonomous agents and multiagent systems.
[13] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
[14] Bansal, Trapit, et al. "Emergent complexity via multiagent competition." arXiv preprint arXiv:1710.03748 (2017).
In this section, we provide more details on some of the aforementioned concepts as follows:
Difference rewards: This is a method to better address the credit-assignment challenge for multi-agent teams. Relying only on the external reward, both agents receive the same team (global) reward regardless of what they did during the episode. This makes multi-agent learning more difficult, as spurious actions can occasionally be rewarded, or some agents can learn to do most of the work while the rest learn to be lazy. Difference rewards [12] instead compute each agent’s individual contribution without hurting coordination performance. The main idea is very tidy: you compute an individual reward by subtracting from the external global reward signal the counterfactual reward computed without the corresponding agent’s action. Thus, team members are encouraged to optimize overall team performance while also optimizing their own contribution, so that no lazy agents can arise.
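A toy sketch of that computation, with an invented global reward function (the Pommerman reward is of course different):

```python
# Difference reward: global reward minus the counterfactual global reward
# with agent i's action replaced by a null (no-op) action.
def global_reward(actions):
    # Toy team reward: +1 for each agent that chose the useful action "work".
    return sum(1 for a in actions if a == "work")

def difference_reward(actions, i, null_action="noop"):
    counterfactual = list(actions)
    counterfactual[i] = null_action
    return global_reward(actions) - global_reward(counterfactual)

actions = ["work", "noop"]
d0 = difference_reward(actions, 0)   # agent 0 actually contributed
d1 = difference_reward(actions, 1)   # agent 1 was lazy
```

The working agent receives credit 1 while the lazy one receives 0, even though the global reward is identical for both.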
Centralized training and decentralized execution: Even though execution in the team game is partially observable, the game simulator can be hacked to access the full state during training. With full-state access, you can train a centralized value function, which serves as the critic in an actor-critic setting during training; the deployed agents, however, only use the policy network, which is trained on partial observations.
Dense rewards: There can be cases where an agent commits suicide and the remaining team member terminates the opposing team by itself. In the simplest case, both team members would get a +1 reward – i.e., the suicidal agent is reinforced! To address this issue, we altered the single stark external reward so that the first team member to die receives a smaller reward than the last surviving one. This helped our agent improve gameplay performance, although the modification comes with some expected and some unforeseen consequences. For instance, under this setting an agent is never encouraged to sacrifice itself (dying simultaneously with an enemy) for the team to win, and less credit is allotted to a hard-working agent that terminated one enemy but died battling the second.
Action filter description: We implement a filter with two categories:
For avoiding suicide
For placing bombs
Authors: YuAn Chung · WeiHung Weng · Schrasing Tong · James Glass
Last year, I was surprised by a paper that introduced a technique to perform word translation between two languages without parallel corpora. To clarify, corpora are non-parallel when two sets of text exist in two languages (e.g. a set of English words and a set of French words), but there is no information regarding which English word corresponds to the appropriate French translation.
Previously, the state-of-the-art methods for learning cross-lingual word embeddings relied mainly on bilingual dictionaries, along with some help from character-level information for languages that share a common alphabet. None of this was competitive with supervised machine translation techniques. The authors of the paper posed a different question: Is it possible to do unsupervised word translation? They answered their own question by introducing a new technique that worked quite well.
Their model worked by obtaining the word embeddings space for both languages, independently, and introducing a technique for unsupervised alignment between the two embedding spaces that can achieve translations without parallel corpora. The intuition behind this technique is to rotate one embedding space to the point that the two embedding spaces are virtually indistinguishable to a classifier (i.e., adversarial training).
This year, a new set of authors presented their work regarding the task of automatic speech recognition without parallel data. This, again, means two independent sets of speech data and text data exist, but the correspondence information between them is unclear. This work stood out since it is the first successful attempt to apply the unsupervised alignment technique introduced last year on multiple modalities of data. The task involved taking a dataset of words from one language and a dataset of spoken words from either the same or a different language, and automatically identifying spoken words without parallel information.
The authors first trained an embedding space for written words and another embedding space for spoken words. They then applied the unsupervised alignment technique to align the two embedding spaces so that spoken words could automatically be classified and translated. At test time, a speech segment is first mapped into its respective embedding space and aligned to the text embedding space; then the nearest neighbors of the text embedding are picked as the translation. The same procedure can be used for the text-to-speech conversion task.
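As a toy illustration of this test-time pipeline, here is a small numpy sketch that aligns two embedding spaces with an orthogonal rotation and translates by cosine nearest neighbor. All names are mine, and the orthogonal Procrustes solver is only a stand-in for the adversarial alignment step: the unsupervised method learns the rotation without the paired rows this solver relies on.

```python
import numpy as np

def procrustes_align(X, Y):
    # Orthogonal rotation W minimizing ||XW - Y||_F (Procrustes).
    # Stand-in for adversarial alignment: the unsupervised method
    # would learn W without the paired rows used here.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def translate(query, W, text_embeddings, k=1):
    # Map a speech embedding into the text space, then return the
    # indices of the k nearest text embeddings by cosine similarity.
    q = query @ W
    q = q / np.linalg.norm(q)
    E = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    return np.argsort(-(E @ q))[:k]
```

The nearest-neighbor lookup at the end is the "pick the translation" step; swapping the roles of the two spaces gives the text-to-speech direction.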
The authors present some experiments on the spoken Wikipedia and LibriSpeech datasets that show unsupervised alignments are still not as good as supervised alignments – but they’re close. Some challenges remain to be solved before unsupervised cross-modal alignments can be competitive with supervised ones; however, this work shows the promise of improving automatic speech recognition (ASR), text-to-speech (TTS) and even translation systems, especially for languages with little available parallel data. (/HS)
Authors: Rad Niazadeh · Tim Roughgarden · Joshua Wang
This paper was accepted as an oral presentation. The authors gave an approximation algorithm for maximizing continuous non-monotone submodular functions.
To give a brief recap, submodular functions arise in several important areas of machine learning and, in particular, around the intersection of economics and learning. They can be used to model the problem of maximizing multi-platform ad revenue, where a buyer wants to maximize their profit = revenue − cost by advertising on different platforms, and there is a diminishing return to advertising on more platforms. This diminishing return is precisely the property captured by submodular functions. Mathematically, a function $f:\{0,1\}^n \rightarrow \mathbb{R}$ is submodular if $f(S \cup \{e\}) - f(S) \geq f(T \cup \{e\}) - f(T)$ for every $S \subseteq T$ and $e \notin T$. In this setting, there is an information-theoretic lower bound of $1/2$-approximation [Feige et al.'11] and an optimal algorithm which matches this bound [Buchbinder et al.'15].
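The diminishing-returns inequality is easy to check by brute force on small ground sets. A short Python sketch (names are mine; the coverage function is a classic example of a submodular function):

```python
from itertools import combinations

def coverage(S, sets):
    # f(S) = number of elements covered by the sets indexed by S;
    # coverage functions are a textbook submodular example.
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def is_submodular(f, n):
    # Brute-force check of f(S∪{e}) - f(S) >= f(T∪{e}) - f(T)
    # for all S ⊆ T ⊆ {0,...,n-1} and e ∉ T (feasible for small n only).
    for r in range(n + 1):
        for T in combinations(range(n), r):
            for s in range(r + 1):
                for S in combinations(T, s):
                    for e in range(n):
                        if e in T:
                            continue
                        gain_S = f(set(S) | {e}) - f(set(S))
                        gain_T = f(set(T) | {e}) - f(set(T))
                        if gain_S < gain_T:
                            return False
    return True
```

Running the check on a coverage function returns True, while a supermodular function such as $|S|^2$ fails it immediately.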
This paper considered the continuous submodular setting where, instead of maximizing over the vertices of the hypercube $\{0,1\}^n$, we want to maximize over the full hypercube $[0,1]^n$. The main result of the paper is a randomized algorithm for maximizing a continuous submodular and $L$-Lipschitz function over the hypercube that guarantees a $1/2$-approximation. Note that this is currently the best ratio that is information-theoretically achievable.
The reason this paper stood out is that the authors used the double greedy framework of Buchbinder et al.'15 to solve a coordinate-wise zero-sum game, and then used the geometry of this game to bound the value at its equilibrium. This is a nice application of game theory to maximizing the value of the function. The authors also conducted experiments on 100-dimensional synthetic data and achieved results comparable to the previous work they referenced. One thing we had hoped to see was the better approximation ratios and faster algorithms translating into a significant advantage in the experiments, but that was not the case.
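For readers unfamiliar with the double greedy framework, here is a sketch of the discrete randomized double greedy of Buchbinder et al.'15 that the paper builds on; the continuous algorithm replaces the biased coin flip with a coordinate-wise zero-sum game. Function names are mine.

```python
import random

def double_greedy(f, ground, seed=0):
    # Randomized double greedy (Buchbinder et al.'15) for unconstrained
    # discrete submodular maximization; 1/2-approximation in expectation.
    # X grows from the empty set while Y shrinks from the full ground set.
    rng = random.Random(seed)
    X, Y = set(), set(ground)
    for e in ground:
        a = f(X | {e}) - f(X)        # marginal gain of adding e to X
        b = f(Y - {e}) - f(Y)        # marginal gain of removing e from Y
        ap, bp = max(a, 0.0), max(b, 0.0)
        if ap + bp == 0 or rng.random() < ap / (ap + bp):
            X.add(e)
        else:
            Y.remove(e)
    return X                         # X == Y when the loop ends
```

Graph cut functions are submodular and non-monotone, so they make a convenient sanity check for this routine.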
In terms of open problems, I am really excited to see the development of parallel and online algorithms for continuous submodular optimization. In fact, there is recent work on parallel algorithms by Chen et al.'18 which achieves a tight $(1/2 - \epsilon)$-approximation guarantee using $\tilde{O}(\epsilon^{-1})$ adaptive rounds. (/KJ)
Authors: Jiantao Jiao · Weihao Gao · Yanjun Han
This paper focused on estimating the differential entropy of a continuous distribution $f$ given $n$ i.i.d. samples. Entropy has been a core concept of information theoretic measures and has engendered numerous important applications, such as goodnessoffit tests, feature selection, and tests of independence. In the vast body of literature around this concept, most of the measures have appeared to take on an asymptotic flavor – that is, until several recent works.
This paper is one of those works. The authors took particular focus on the fixed-k nearest neighbor (fixed-kNN) estimator, also called the Kozachenko–Leonenko estimator. This estimator is simple: there is only one parameter to tune, and it requires no knowledge of the smoothness degree $s$ of the target distribution $f$. Moreover, it is computationally efficient, since $k$ is fixed (compared to other methods with similar finite-sample bounds), and statistically efficient: as shown in this paper, it has a finite-sample bound that is close to optimal. All of these properties make the estimator realistic and attractive in practice.
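For concreteness, here is a minimal numpy implementation of the Kozachenko–Leonenko estimator in its standard form, $\hat{H} = \psi(n) - \psi(k) + \log V_d + \frac{d}{n}\sum_i \log \epsilon_i$, where $\epsilon_i$ is the $k$-th nearest neighbor distance of sample $i$ and $V_d$ is the unit-ball volume. The $O(n^2)$ distance computation and the names are my own simplifications, not the paper's code.

```python
import math
import numpy as np

def kl_entropy(samples, k=1):
    # Kozachenko-Leonenko fixed-kNN differential entropy estimator:
    #   H_hat = psi(n) - psi(k) + log V_d + (d/n) * sum_i log eps_i.
    X = np.asarray(samples, dtype=float)
    n, d = X.shape
    # psi(m) for a positive integer m equals -gamma + H_{m-1}.
    psi = lambda m: -0.5772156649015329 + sum(1.0 / j for j in range(1, m))
    # All pairwise distances; O(n^2), a KD-tree would scale better.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)             # exclude the point itself
    eps = np.sort(D, axis=1)[:, k - 1]      # k-th nearest neighbor distance
    log_vd = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return psi(n) - psi(k) + log_vd + d * float(np.mean(np.log(eps)))
```

On a couple of thousand draws from a standard 1-D Gaussian, this should land close to the true entropy $\tfrac{1}{2}\log(2\pi e) \approx 1.42$.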
I found the paper also carried some interesting technical results. One direct approach to estimating the differential entropy is to plug a consistent estimator of the density function $f$, for example one based on kNN distance statistics, into the formula for entropy. However, such estimators usually come with an impractical computational demand. For instance, in the kNN-based estimator, $k$ has to approach $\infty$ as the number of samples $n$ approaches $\infty$.
In a recent paper by Han et al. [2017], the authors constructed a complicated estimator that achieves a finite sample bound at the rate $(n\log n)^{-\frac{s}{s+d}} + n^{-\frac{1}{2}}$ (the optimal rate). One caveat, though, is that it requires knowledge of the smoothness degree $s$ of the target distribution $f$. The last challenging part is dealing with the region where $f$ is small. A major difficulty in achieving such bounds for the entropy estimator is that the nearest neighbor estimator exhibits a huge bias in low-density areas. Most papers tend to make assumptions about $f$ such that this bias is well controlled. However, this paper did not make similar assumptions. Given all these constraints, including fixed $k$, no knowledge of $s$, and no assumptions on how $f$ is bounded from below, the authors managed to prove a nearly optimal finite sample bound for a simple estimator. According to the authors, the new technical tools here are the Besicovitch covering lemma and a generalized Hardy–Littlewood maximal inequality. This part is not yet clear to me.
Lastly, the authors also pointed out several weaknesses in their paper and their plans for future work. For example, they conjectured that both the upper bound and the lower bound in the paper could be further improved. They also hypothesized a way to extend the constraint on $s$ in the theorem so that the result can be applied to a more general setting. (/RH)
Authors: Kevin Scaman · Francis Bach · Sébastien Bubeck · Laurent Massoulié · Yin Tat Lee
This paper considered distributed optimization of non-smooth convex functions using a network of computing units. The objective of this work was to study the impact of the communication network on learning, and the trade-off between the structure of the network and algorithmic efficiency. The network consists of a connected simple graph of nodes, each having access to a local function (such as a loss function). The optimization problem is to minimize the average of the local functions; communication between nodes takes a given length of time, and computation takes one unit of time. In the decentralized scenario, local communication is performed through gossip.
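As a rough illustration of the gossip primitive (not the paper's algorithm), here is a sketch of decentralized averaging with Metropolis–Hastings weights; each multiplication by the gossip matrix corresponds to one round of neighbor-to-neighbor communication. Function names and the weight choice are my own.

```python
import numpy as np

def gossip_matrix(adj):
    # Symmetric doubly stochastic matrix built from an adjacency
    # matrix using Metropolis-Hastings weights.
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()
    return W

def gossip_average(values, adj, steps=100):
    # One gossip round = each node averages with its neighbors
    # (x <- W x); all values converge to the global mean.
    W = gossip_matrix(adj)
    x = np.asarray(values, dtype=float)
    for _ in range(steps):
        x = W @ x
    return x
```

The speed at which the values mix depends on the spectral gap of the gossip matrix, which is exactly where the graph structure enters the paper's bounds.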
The authors give bounds on the time to reach a given precision, then provide an optimal algorithm based on a primal-dual reformulation. They show that the error due to limited communication resources then decreases at a fast rate. In the centralized setting, the authors provide an algorithm whose convergence rate is within a multiplicative factor of $d^{1/4}$ of optimal, where $d$ is the underlying dimension.
I found this paper intriguing because it considers the impact of communication and computation resources in learning, which will be increasingly important as systems we learn on become larger. It received one of the best paper awards and is one of few papers that consider such impacts. There’s an argument to be made that these two things are related; as learning systems scale up and get distributed through IoT and mobile devices, the importance of distributed learning in a setting where there is tension between communication and computation has also increased. The elegant analytical tools used in this paper – gossip methods, primaldual formulation, ChambollePock algorithm for saddlepoint optimization, the combined use of optimization and graph theory, and the bounds that give insight into which resources are important at which stage of convergence – show that the award places welldeserved attention toward a growing area. (/NH)
Invited talk: Jon Kleinberg – Fairness, Simplicity, and Ranking
In Jon Kleinberg's fascinating invited talk, he addressed the effect of implicit bias on producing adverse outcomes. The specific application he referred to is bias in activities such as hiring, promotion, and admissions. The setting is as follows: a recruitment committee is tasked with selecting a shortlist of final candidates from a given pool of applicants, but their estimates of skill, used in the selection, may be skewed by implicit bias.
The Rooney Rule is an NFL policy in effect since 2003 that requires teams to interview ethnic-minority candidates for coaching and operations positions. (Note: there is no quota or preference given in the actual hiring.) Kleinberg and his coauthors showed that measures such as the Rooney Rule lead to higher payoffs for the organization. Their model is as follows: a recruiting committee must select a list of $k$ candidates for final interviews; the set of applicants is divided into two groups, X and Y, X being the minority group; there are $n$ Y applicants and $\alpha n$ X applicants with $\alpha \le 1$. Each candidate has a numerical value representing their skill, and these skills are drawn from a common distribution.
Based on empirical studies of skills in creative and skilled workforces, the authors modeled this distribution as a Pareto distribution (power law). The utility that the recruiting committee aims to maximize is the sum of the skills of the candidates selected for the list. The authors modeled the bias as a multiplicative bias in the estimation of the skills of X candidates: Y candidates are estimated at their true value, while the skill of X candidate $i$ is estimated to be $X_i/\beta$, where $\beta > 1$. The authors then analyzed the utility of a list of $k$ candidates where at least one must be an X candidate. Their analysis showed an increase in utility even when the list was of size 2, and for a large range of values of the bias, power law, and population parameters.
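A small Monte-Carlo sketch of the model just described (my own parameter choices and names, not the paper's analysis): draw Pareto-distributed skills, shrink the committee's estimates of X skills by $1/\beta$, and compare the true utility of the biased top-$k$ list with a Rooney-style list forced to include an X candidate.

```python
import numpy as np

def shortlist_utility(n=100, alpha=0.8, beta=4.0, k=2, shape=2.5,
                      trials=3000, seed=0):
    # Compare the expected true utility of (a) the biased top-k shortlist
    # and (b) a Rooney-style shortlist with at least one X candidate.
    # Skills ~ Pareto(shape); the committee sees X skills divided by beta.
    rng = np.random.default_rng(seed)
    nx = int(alpha * n)
    plain = rooney = 0.0
    for _ in range(trials):
        x = rng.pareto(shape, nx) + 1.0        # true X skills
        y = rng.pareto(shape, n) + 1.0         # true Y skills
        seen = np.concatenate([x / beta, y])   # biased estimates
        true = np.concatenate([x, y])
        order = np.argsort(-seen)              # ranked by perceived skill
        top = order[:k]
        plain += true[top].sum()
        if (top < nx).any():                   # list already has an X candidate
            rooney += true[top].sum()
        else:                                  # swap weakest pick for best-seeming X
            best_x = order[order < nx][0]
            rooney += true[top[:-1]].sum() + true[best_x]
    return plain / trials, rooney / trials
```

With these illustrative numbers (strong bias, a sizeable minority pool), the constrained list tends to come out ahead; whether it does in general depends on $\alpha$, $\beta$, $k$, and the power-law exponent, which is exactly what the authors' analysis characterizes.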
I found this to be another very interesting and important paper because it tackles the question of fairness at a very practical level and provides a tangible algorithmic framework with which to expose and then analyze outcomes. Furthermore, the modeling assumptions are very realistic, and the results demonstrate the potential for significant impact. The particular scenario considered here is framed around activities such as hiring and admissions, but the result has consequences for machine learning models. (/NH)