However, the properties of distributions constructed with normalizing flows remain less well understood theoretically. One important property is *tail behaviour*. We can think of a distribution as having two regions: the *typical set* and the *tails*, which are illustrated in Figure 1. The typical set is what is most often considered; it is the region where the distribution has a significant amount of density. That is, if you draw samples, or have a set of training examples, they are generally from the typical set of the distribution. How accurately a model captures the typical set matters when we want to use distributions to, for instance, generate data which looks similar to the training data. Many papers show figures like Figure 2, which showcase how well a model matches the target distribution in regions where there is lots of density.

The tails of the distribution are essentially everything else and, when working on an unbounded domain (like $\mathbb{R}^n$), correspond to asking how the probability density behaves as you go to infinity. We know that the probability density of a continuous distribution on an unbounded domain goes to zero in the limit, but the rate at which it goes to zero can vary significantly between different distributions. Intuitively, tail behaviour indicates how likely extreme events are, and this behaviour can be very important in practice. For instance, in financial modelling applications like risk estimation, return prediction and actuarial modelling, tail behaviour plays a key role.

This blog post discusses the tail behaviour of normalizing flows and presents a theoretical analysis showing that some popular normalizing flow architectures are actually unable to estimate tail behaviour. Experiments show that this is indeed a problem in practice, and a remedy is proposed for the case of estimating heavy-tailed distributions. This post will omit the proofs and other formalities and instead aims to provide a high-level overview of the results. We refer readers interested in the details to the full paper, which was recently presented at ICML 2020.

Let $\mathbf{X} \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $f_\mathbf{X} : \mathbb{R}^D \to \mathbb{R}$. Let $\mathbf{T}$ be an invertible function and $\mathbf{X} = \mathbf{T}(\mathbf{Y})$. Then using the change of variables formula, one can compute the probability density function of the random variable $\mathbf{Y}$:

\begin{align}

f_\mathbf{Y}(\mathbf{y}) & = f_\mathbf{X}(\mathbf{T}(\mathbf{y})) \left| \det \textrm{D}\mathbf{T}(\mathbf{y}) \right| , \tag{1}

\end{align}

where $\textrm{D}\mathbf{T}(\mathbf{y}) = \frac{\partial \mathbf{T}} {\partial \mathbf{y}}$ is the Jacobian of $\mathbf{T}$. Normalizing flows are constructed by defining invertible, differentiable functions $\mathbf{T}$ which can be thought of as transforming the complex distribution of the data into the simple base distribution, or "normalizing" it. The paper attempts to characterize the tail behaviour of $f_\mathbf{Y}$ in terms of $f_\mathbf{X}$ and properties of the transformation $\mathbf{T}$.
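As a quick illustration of Equation (1) (our own sketch, not code from the paper), consider a simple affine normalizing map in NumPy; the parameter values `mu` and `s` are arbitrary choices for the example:

```python
import numpy as np

def base_log_density(x):
    # Standard normal base density f_X, evaluated elementwise in log space.
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def flow_log_density(y, mu=2.0, s=3.0):
    # T normalizes the data: x = T(y) = (y - mu) / s.
    x = (y - mu) / s
    log_det_jac = -np.log(s)                  # dT/dy = 1/s, a constant here
    return base_log_density(x) + log_det_jac  # Eq. (1) in log space

# Sanity check against the closed-form N(mu, s^2) log-density.
y = np.array([-1.0, 2.0, 5.0])
closed_form = -0.5 * ((y - 2.0) / 3.0)**2 - 0.5 * np.log(2 * np.pi * 9.0)
assert np.allclose(flow_log_density(y), closed_form)
```

Here the change of variables recovers exactly the $\mathcal{N}(\mu, s^2)$ density, as it must for an affine map of a standard normal.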

Before we can do that though we need to formally define what we mean by tail behaviour. The basis for characterizing tail behaviour in 1D was provided in a paper by Emanuel Parzen. Parzen argued that tail behaviour could be characterized in terms of the *density-quantile function*. If $f$ is a probability density and $F : \mathbb{R} \to [0,1]$ is its cumulative distribution function, then the quantile function is the inverse, *i.e.*, $Q = F^{-1}$ where $Q : [0,1] \to \mathbb{R}$. The density-quantile function $fQ : [0,1] \to \mathbb{R}$ is then the composition of the density and the quantile function $fQ(u) = f(Q(u))$ and is well defined for square integrable densities. Parzen suggested that the limiting behaviour of the density-quantile function captured the differences in the tail behaviour of distributions. In particular, for many distributions

\begin{equation}

\lim_{u\rightarrow1^-} \frac{fQ(u)}{(1-u)^{\alpha}} \tag{2}

\end{equation}

converges for some $\alpha > 0$. In other words, the density-quantile function asymptotically behaves like $(1-u)^{\alpha}$, which we denote as $fQ(u) \sim (1-u)^{\alpha}$. (Note that here we consider the right tail, i.e., $u \to 1^-$, but we could just as easily consider the left tail, i.e., $u \to 0^+$.) We call the parameter $\alpha$ the *tail exponent* and Parzen noted that it characterizes how heavy the tails of a distribution are, with larger values corresponding to heavier tails. Distributions with $\alpha$ between $0$ and $1$ are called light tailed and include things like bounded distributions. A value of $\alpha=1$ corresponds to some well-known distributions like the Gaussian or Exponential distributions. Distributions with $\alpha > 1$ are called heavy tailed, *e.g.*, the Cauchy or Student-T. More fine-grained characterizations of tail behaviour are possible in some cases but we won't go into those here.
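These exponents can be checked numerically from closed-form density-quantile functions (a sketch of ours, not from the paper): the Exponential has $fQ(u) = 1-u$ exactly, so $\alpha = 1$, while the Cauchy has $\alpha = 2$, which the ratio of logs approaches slowly as $u \to 1^-$:

```python
import numpy as np

def fq_exponential(u):
    # Exponential(1): Q(u) = -log(1-u), f(x) = exp(-x), so fQ(u) = 1 - u.
    return 1.0 - u

def fq_cauchy(u):
    # Cauchy: Q(u) = tan(pi*(u - 1/2)) and f(x) = 1/(pi*(1 + x^2)),
    # equivalently fQ(u) = sin(pi*(1-u))^2 / pi.
    q = np.tan(np.pi * (u - 0.5))
    return 1.0 / (np.pi * (1.0 + q**2))

# Estimate alpha as log fQ(u) / log(1-u) for u close to 1.
u = 1.0 - 1e-8
alpha_exp = np.log(fq_exponential(u)) / np.log(1.0 - u)
alpha_cauchy = np.log(fq_cauchy(u)) / np.log(1.0 - u)
print(alpha_exp, alpha_cauchy)   # ~1.0 and ~1.9 (slowly converging to 2)
```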

Now, given the above and two 1D random variables, $\mathbf{Y}$ and $\mathbf{X}$ with tail exponents $\alpha_\mathbf{Y}$ and $\alpha_\mathbf{X}$, we can make a statement about the transformation $\mathbf{T}$ that maps between them. First, the transformation is given by $T(\mathbf{x}) = Q_\mathbf{Y}( F_\mathbf{X}( \mathbf{x} ) )$ where $F_\mathbf{X}$ denotes the CDF of $\mathbf{X}$ and $Q_\mathbf{Y}$ denotes the quantile function (i.e., the inverse CDF) of $\mathbf{Y}$. Second, we can then show that the derivative of this transformation is given by

\begin{equation}

T'(\mathbf{x}) = \frac{fQ_\mathbf{X}(u)}{fQ_\mathbf{Y}(u)} \tag{3}

\end{equation}

where $u=F_\mathbf{X}(\mathbf{x})$ and $fQ_\mathbf{X}$ and $fQ_\mathbf{Y}$ are the density-quantile functions of $\mathbf{X}$ and $\mathbf{Y}$ respectively.

Now, given our characterization of tail behaviour we get that

\begin{equation}

T'(\mathbf{x}) \sim \frac{(1-u)^{\alpha_{\mathbf{X}}}}{(1-u)^{\alpha_{\mathbf{Y}}}} = (1-u)^{\alpha_{\mathbf{X}}-\alpha_{\mathbf{Y}}} \tag{4}

\end{equation}

and now we come to a key result. If $\alpha_{\mathbf{X}} < \alpha_{\mathbf{Y}}$ then, as $u \to 1$ we get that $T'(\mathbf{x}) \to \infty$. That is, if the tails of the target distribution of $\mathbf{Y}$ are heavier than those of the source distribution $\mathbf{X}$ then the slope of the transformation must be unbounded. Conversely, if the slope of $T(\mathbf{x})$ is bounded (i.e., $T(\mathbf{x})$ is Lipschitz) then the tail exponent of $\mathbf{Y}$ will be the same as $\mathbf{X}$, i.e., $\alpha_\mathbf{Y} = \alpha_\mathbf{X}$.
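This blow-up is easy to see numerically. Below is a small sketch of ours (standard library only) of the map $T = Q_\mathbf{Y} \circ F_\mathbf{X}$ from a standard Gaussian to a Cauchy, with the derivative evaluated via Equation (3):

```python
import math

def gauss_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gauss_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def cauchy_pdf(y):
    return 1.0 / (math.pi * (1.0 + y * y))

def transport_deriv(x):
    # T = Q_Cauchy o F_Gauss, so T'(x) = f_Gauss(x) / f_Cauchy(T(x)) (Eq. 3).
    t_x = math.tan(math.pi * (gauss_cdf(x) - 0.5))
    return gauss_pdf(x) / cauchy_pdf(t_x)

for x in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(x, transport_deriv(x))   # the slope grows without bound in the tail
```

Already at $x = 4$ the slope is in the tens of thousands: to push Gaussian mass out into a Cauchy tail, the map must stretch space ever faster, which is exactly what a Lipschitz bound forbids.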

The above is an elegant characterization of tail behaviour and its relationship to the transformations between distributions, but it only applies to distributions in 1D. To generalize it to higher-dimensional distributions, we consider the tail behaviour of the norm of a random variable, i.e., $\Vert \mathbf{X} \Vert$. The degree of heaviness of $\mathbf{X}$ can then be characterized by the degree of heaviness of the distribution of the norm. Using this characterization we can then prove an analog of the above.

**Theorem 3** *Let $\mathbf{X}$ be a random variable with density function $f_\mathbf{X}$ that is light-tailed and $\mathbf{Y}$ be a target random variable with density function $f_\mathbf{Y}$ that is heavy-tailed. Let $T$ be such that $\mathbf{Y} = T(\mathbf{X})$, then $T$ cannot be a Lipschitz function.*

So what does this all mean for normalizing flows, which attempt to transform a Gaussian distribution into some complex data distribution? The results show that a Lipschitz transformation of a distribution cannot make it heavier tailed. Unfortunately, many commonly implemented normalizing flows are in fact Lipschitz. The transformations used in RealNVP and Glow are known as affine coupling layers and have the form

\begin{equation}

T(\mathbf{x}) = (\mathbf{x}^{(A)},\ \sigma(\mathbf{x}^{(A)}) \odot \mathbf{x}^{(B)} + \mu(\mathbf{x}^{(A)})) \tag{5}

\end{equation}

where $\mathbf{x} = (\mathbf{x}^{(A)},\mathbf{x}^{(B)})$ is a disjoint partitioning of the dimensions, $\odot$ is element-wise multiplication and $\sigma(\cdot)$ and $\mu(\cdot)$ are arbitrary functions. For transformations of this form, we can then prove the following:

**Theorem 4** *Let $p$ be a light-tailed density and $T(\cdot)$ be a triangular transformation such that $T_j(x_j; x_{<j}) = \sigma_{j}(x_{<j}) \cdot x_j + \mu_j(x_{<j})$. If $\sigma_j(x_{<j})$ is bounded above and $\mu_j(x_{<j})$ is Lipschitz continuous, then the distribution resulting from transforming $p$ by $T$ is also light-tailed.*

The RealNVP paper uses $\sigma(\cdot) = \exp(NN(\cdot))$ and $\mu(\cdot) = NN(\cdot)$ where $NN(\cdot)$ is a neural network with ReLU activation functions. The translation function $\mu(\cdot)$ is hence Lipschitz, since a neural network with ReLU activations is Lipschitz. The scale function $\sigma(\cdot)$, at first glance, is not bounded, because the exponential function is unbounded. However, in practice this was implemented as $\sigma(\cdot) = \exp(c\tanh(NN(\cdot)))$ for a scalar $c$. This means that, as originally implemented, $\sigma(\cdot)$ *is* bounded above, i.e., $\sigma(\cdot) < \exp(c)$. Similarly, Glow uses $\sigma(\cdot) = \mathsf{sigmoid}(NN(\cdot))$, which is also clearly bounded above.
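The structure is easy to see in a minimal NumPy sketch of an affine coupling layer (our own illustration, not the RealNVP code; the linear "networks" `W_s`, `W_t` and the clamp constant `c` are stand-ins for real neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "networks" for scale and translation; real flows use deep nets.
W_s, W_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
c = 2.0  # tanh clamp constant, so sigma < exp(c) everywhere

def sigma(xa):
    return np.exp(c * np.tanh(xa @ W_s))   # bounded above by exp(c)

def mu(xa):
    return xa @ W_t

def coupling_forward(x):
    # Split dims, transform the second half conditioned on the first (Eq. 5).
    xa, xb = x[:, :2], x[:, 2:]
    yb = sigma(xa) * xb + mu(xa)
    log_det = np.sum(np.log(sigma(xa)), axis=1)  # Jacobian is triangular
    return np.concatenate([xa, yb], axis=1), log_det

def coupling_inverse(y):
    ya, yb = y[:, :2], y[:, 2:]
    xb = (yb - mu(ya)) / sigma(ya)
    return np.concatenate([ya, xb], axis=1)

x = rng.normal(size=(5, 4))
y, log_det = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)  # exact invertibility
```

Note that with the `tanh` clamp the scale is bounded above, so by Theorem 4 stacking any number of such layers on a Gaussian base keeps the output light-tailed.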

Hence, RealNVP and Glow are unable to represent distributions with heavier tails than their base distribution. Not all architectures have this limitation though, and we point out a few that can actually change tail behaviour, for instance SOS Flows.

To address this limitation of common architectures, we proposed using a parametric base distribution which is capable of representing heavier tails, an approach we called *Tail Adaptive Flows* (TAF). In particular, we proposed using the Student-T distribution as a base distribution with learnable degrees-of-freedom parameters. With TAF the tail behaviour can be learned in the base distribution while the transformation captures the behaviour of the typical set of the distribution.
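The intuition is simple to sketch: deep in the tail a Student-T base assigns far more log-density than a Gaussian, so a Lipschitz flow on top of it can already reach heavy-tailed targets. A minimal comparison of ours (standard library only, with `nu` standing in for the learnable degrees of freedom):

```python
import math

def student_t_logpdf(x, nu):
    # Student-T log-density with nu degrees of freedom; in TAF nu is learned.
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi)
            - (nu + 1) / 2 * math.log1p(x * x / nu))

def gauss_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2 * math.pi)

# Deep in the tail the Student-T keeps substantial mass where the Gaussian
# has essentially none -- the base distribution carries the tail behaviour.
print(student_t_logpdf(10.0, nu=1.0))   # ~ -5.8  (nu=1 is the Cauchy)
print(gauss_logpdf(10.0))               # ~ -50.9
```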

We also explored these limitations experimentally. First we created a synthetic dataset using a target distribution with heavy tails. After fitting a normalizing flow, we can measure its tail behaviour. This can be done by estimating the density-quantile function and finding the value of $\alpha$ such that $(1-u)^{\alpha}$ approximates it near $u=1$. Our experimental results confirmed the theory. In particular, fitting a normalizing flow with RealNVP- or Glow-style affine coupling layers was fundamentally unable to change the tail exponent, even as more depth was added. Figure 4 shows an attempt to fit a model based on RealNVP-style affine coupling layers to a heavy-tailed distribution (Student-T). No matter how many blocks of affine coupling layers are used, it is unable to capture the structure of the distribution and the measured tail exponents remain the same as those of the base distribution.
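A simplified version of such a measurement can be sketched as follows (our own estimator, not the paper's exact procedure; the quantile range and grid size are arbitrary choices for the example):

```python
import numpy as np

def tail_exponent(samples, u_lo=0.95, u_hi=0.995, n_grid=20):
    # Estimate alpha in fQ(u) ~ (1-u)^alpha from samples: build the
    # empirical quantile function, estimate fQ(u) = du/dQ by finite
    # differences, then regress log fQ on log(1-u) over the upper tail.
    u = np.linspace(u_lo, u_hi, n_grid)
    q = np.quantile(samples, u)
    fq = np.gradient(u, q)               # density at the empirical quantiles
    slope, _ = np.polyfit(np.log(1.0 - u), np.log(fq), 1)
    return slope

rng = np.random.default_rng(0)
alpha_gauss = tail_exponent(rng.normal(size=200_000))
alpha_cauchy = tail_exponent(rng.standard_t(df=1, size=200_000))
print(alpha_gauss, alpha_cauchy)  # roughly 1.1 vs 2.0, up to sampling noise
```

The Gaussian estimate sits slightly above 1 at finite quantiles (a known slowly-varying correction), while the Cauchy estimate approaches its true exponent of 2.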

However, when using a tail adaptive flow the tail behaviour can be readily learned. Figure 5 shows the results of fitting a tail adaptive flow on the same target as above, using 5 blocks. This isn't entirely surprising as tail adaptive flows use a Student-T base distribution. However, SOS flows are also able to learn the tail behaviour, as predicted by the theory. This is shown in Figure 6.

We also evaluated TAF on a number of other datasets. For instance, Figure 7 shows tail adaptive flows successfully fitting the tails of Neal's Funnel, an important distribution which has heavier tails and exhibits some challenging geometry.

In terms of log likelihood on a test set, our experiments show that models trained with TAF perform essentially the same as those without it. However, this shouldn't be too surprising.

We know that normalizing flows are able to capture the distribution around the typical set and this is where most samples, even in the test set, are likely to be. Put another way, capturing tail behaviour is about understanding how frequently rare events happen and by definition it's unlikely that a test set will have many of these events.

This paper explored the behaviour of the tails of commonly used normalizing flows and showed that two of the most popular normalizing flow models are unable to learn tail behaviour that is heavier than that of the base distribution. It also showed that by changing the base distribution we are able to restore the ability of these models to capture tail behaviour. Alternatively, other normalizing flow models like SOS Flows are also able to learn tail behaviour.

So does any of this matter in practice? If the problem you're working on is sensitive to tail behaviour then absolutely, and our work suggests that using an adaptive base distribution with a range of tail behaviour is a simple and effective way to ensure that your flow can capture it. If your problem isn't sensitive to tail behaviour then perhaps less so. However, it is interesting to note that the seemingly minor detail of adding a $\tanh(\cdot)$, or replacing $\exp(\cdot)$ with a sigmoid, can significantly change the expressiveness of the overall model. These details have typically been motivated by empirically observed training instabilities. Our work connects them to fundamental properties of the estimated distributions, perhaps suggesting alternative explanations for why they were empirically necessary.

Interested in how ownership and copyright protection for media content is impacted by the rise of social media, Xiaohong’s research is focused on image watermarking and image forgery detection. More specifically his research involves new deep neural network architectures for blind image watermarking based on information-theoretic principles.

He is a third year Ph.D. student at McMaster University, supervised by Dr. Jun Chen. Xiaohong is interested in a career in machine learning because teaching machines complex tasks formerly only accomplished by humans excites him.

The Borealis AI fellowship has provided him with the opportunity to continue his research and broaden its impact. The fellowship also connects him with some of the most talented minds in ML and AI and provides advice on how to take his research and career further.

A fun fact about Xiaohong is that he has a musical side, knowing how to play the accordion.

Check out Xiaohong Liu’s Google Scholar.

Sedigheh is passionate about finding machine learning solutions that could positively impact important domains like healthcare. Her research involves predicting continuous-time Markov chains, with a focus on stochastic processes and simulations with applications to nucleic acid kinetics.

Sedigheh Zolaktaf received her BSc in Computer Engineering from Sharif University of Technology, Iran, in 2013, and MSc in Computer Science from the University of British Columbia, Canada, in 2015. She is currently a Ph.D. candidate in the Artificial Intelligence and Algorithms laboratories at the University of British Columbia. She chose a career in machine learning as it aligned with her interests in mathematics, coding and problem-solving.

The Borealis AI 2020 fellowship has provided support to Sedigheh by recognizing the importance of her work. This award also motivates her to continue her research in the area of stochastic processes and nucleic acid kinetics.

Outside of research, Sedigheh likes to stay active playing basketball and netball.

She is enthusiastic about the future of AI technologies and how they will intertwine with human decision making. Ibtihel Amara is focused on performing efficient analysis of Neural Network uncertainty. More specifically, her research looks into finding efficient uncertainty computation for edge devices. She also believes that ensuring trust and reliability are integrated into AI systems is paramount.

Ibtihel Amara is currently completing her Ph.D. at McGill University at the Center for Intelligent Machines (CIM). The Borealis AI fellowship has given her the opportunity to fully focus on her research goals and provided her with valuable encouragement and support that motivates her to dream big.

Ibtihel's hobbies are harmonious with her passion for technology. She enjoys spending her time gardening and finding ways to enhance urban agriculture with the help of AI.

AI is transforming industries. Whether it’s healthcare or global warming, cyber security or customer service, I’m constantly amazed and excited about the potential for machine learning to help businesses and society address some of today’s biggest challenges.

However, for modern AI to be performed properly and to succeed at scale, researchers and engineers need access to large datasets – the kind that are held by only a few companies worldwide. At the same time, the need to protect sensitive and private information is paramount.

To me, this is where the real opportunity lies. How can we ensure that AI is accessible to all in a safe and ethical manner?

At Borealis AI, we are championing the importance of Responsible AI by researching and developing practical solutions to enable a safer and more ethical adoption of AI technology. Responsible AI covers a wide range of considerations, including privacy, accountability, transparency and bias, and is critical to maintaining trust.

I recently recorded a panel discussion for Collision from Home where I touched on this opportunity, and the responsibility we have to ensure responsible AI for all. You can check out the Untapped Potential of AI recording in the video above.

Elahe received a BSc degree in Electrical Engineering from the Isfahan University of Technology in 2012, and a MASc in Electronic-Digital Systems from Amirkabir University of Technology (Tehran Polytechnic) in 2016. She is currently a second-year Ph.D. student at Concordia Institute for Information System Engineering (CIISE) in Montreal where her studies focus on machine learning and deep learning models in rehabilitation and assistive technologies under the supervision of Prof. Arash Mohammadi.

Outside of her research Elahe enjoys testing out new baking recipes in the kitchen and being out in nature.

Read more about Elahe's work on Google Scholar.

Chenyang completed his bachelor degree in mathematics at the Northwest Polytechnic in Xi’an, Shaanxi, China before moving to Canada in 2013. He studied at the University of Windsor, Ontario where he obtained a Bachelor in Computer Science before moving to the University of Alberta. Chenyang is currently studying for his PhD in Computer Science while fulfilling his passion for teaching as a teaching assistant at the U of A.

In his spare time, Chenyang enjoys watching documentaries and testing his strategy skills with online gaming. He also enjoys listening to classical music.

Read more about Chenyang's work on Google Scholar.

Canada is a pioneer of AI and Machine Learning (ML). However, there continues to be a lack of women in this field. Borealis AI’s collaboration with Athena Pathways will tackle the gender imbalance in AI and technology in general, by providing mentorship and internship opportunities to women, starting their careers in these fields.

Athena Pathways has a near-term goal of enrolling 500 women in high-school and university courses, as well as providing internships, mentorships, and other workplace opportunities to significantly increase the number of women currently working for industry in the technology sector.

As part of its support, Borealis AI will work with female students across universities in British Columbia to add industry skills and experience to their studies through internships and provide them with job-seeking advice to prepare them for their careers as soon as they graduate. Borealis AI’s support will help improve gender diversity in the field of technology and will help address Canada’s needs in AI talent.

The Athena Pathways project also aims to mitigate risks in AI technology due to the gender imbalance and misrepresentation across AI model creators.

Speaking of the project, Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI, said: “We are proud to support Athena Pathways. We share their commitment to attracting more women to the field of AI. Our collaboration with Athena will enable us to provide ongoing support and training to women at the very start of their careers and encourage a more competitive talent landscape in Canada."

This project is part of Borealis AI’s ongoing program focused on women in AI and technology. Borealis AI recently announced its support for AI4Good Lab, a 7-week summer training program that annually brings together a cohort of 30 women from across Canada.

Anna completed her bachelor’s degree in Biophysics at Goethe University in Frankfurt, Germany in 2014. A highlight of Anna’s academic career is completing her Masters at the Perimeter Scholars International program in 2017. She is now a Ph.D. student at the Perimeter Institute for Theoretical Physics and the University of Waterloo under advisor Roger Melko.

Outside of the world of research, Anna is passionate about art and sports. In her spare time, she lets her creative and adventurous side shine by painting and finding new spots to surf.

Read more about Anna's work on Google Scholar.

High Performance Computing (HPC) infrastructure, with a distributed and fully automated environment, is extremely important when building modern AI models, especially when this research is applied in production environments like RBC’s, where the size of the datasets can be massive (~10 Billion new client interactions every month).

Our objective was to build an AI infrastructure that could handle both research and production workload, ensuring that Borealis AI’s research projects could transition to production efficiently. We believe in quick iterations and therefore this infrastructure is designed to be flexible and easy to use. It encompasses two GPU clusters to accommodate the distinct needs of Borealis AI’s research and production work.

Throughout the research community, there has been a growing number of HPC clusters that use Slurm - a resource scheduler and cluster management software. AI researchers are familiar with this technology and it was adopted at Borealis AI in order to facilitate use and reduce the learning curve for new users. Our researchers coming from academia can now quickly onboard onto our platform and start their research.

Building a powerful cluster is more than just stacking together GPUs. An AI cluster requires every component, including networking and storage, to operate in harmony and at high performance. With the AI community moving towards training larger models, an integrated system became more important than piling up servers. We built our cluster using AIRI based on NVIDIA's reference architecture which provided us with a high performance integrated solution, and the flexibility to increase capacity efficiently.

Taking a machine learning model into production is not a trivial task. These applications need to handle complexities such as data reliability and stochasticity which are not there in traditional software development. In order to manage this complexity we designed a compute infrastructure based on industry standards and best practices. The emergence of Docker and Kubernetes has changed the way we build AI infrastructure and, with RBC’s vast expertise in managing the Red Hat OpenShift platform, Borealis AI built its production cluster using OpenShift and allowed developers to deploy containerized ML applications and services into production using GPUs.

Borealis AI is leveraging the power of this new infrastructure across a broad spectrum of projects, ranging from personal & commercial banking, to wealth management and capital markets. Prediction tasks in the finance industry are particularly challenging, because they are driven by massive datasets and require exhaustive analysis of multiple dependent axes, including data filtering, neural architecture search, hyperparameter optimization, dynamic targets, and path-dependent metrics. A thorough exploration of the resulting joint parameter space typically requires optimization of tens of thousands of configurations, or the equivalent of thousands of CPU years.

Our new HPC infrastructure, with a distributed and fully automated environment, enables parallel execution of the above tasks in a matter of days. In production, this infrastructure enabled parallel online computation of complex feature representations and, as a consequence, ultra-fast reactions in an environment that is primarily dominated by one factor...speed!

That’s something that the organizers of the recent AI4Good Lab Industry Night were able to recreate – albeit in a virtual world - thanks to personalized avatars, virtual meeting rooms and real-time chats.

The purpose of the event was to give the all-women students of the AI4Good Lab a stronger sense of research groups and companies that work in the AI space and an array of initiatives that they can get involved in. It also gave the partners an opportunity to provide more detailed information about themselves.

Borealis AI’s all-female team, along with other partners, including CIFAR, IVADO, Amii, DeepMind and Accenture among others, participated in the AI4Good industry event, chatting with delegates about internships, fellowships, and offering advice on how to navigate the job market in the AI space. The team shared their thoughts on a wide range of topical issues, including ethical AI. They also provided information about AI research and products at Borealis AI as well as various internship and job opportunities with the team.

The AI4Good team prepared avatars for everyone, using photos of the participants. The delegates were able to virtually walk and stand with each other while they chatted. Borealis AI’s room, designed by visual designer, April Cooper, brought some nature and light to the room with the addition of a virtual tree!

Thanks to Maya Marcus-Sells, Executive Director of AI4Good Lab, and her colleague, Yosra Kazemi, for pulling the Industry Night together and giving us a much-needed chance to chat and further build the women in AI community.

If you would like a peek inside this year’s virtual Industry Night, a tour of the 3D booths, a look at Maya’s, Eirene’s, and April’s avatars enjoying the virtual shadow of the Borealis AI tree, or just want to virtually “feel” and “smell” the breeze though the branches of the Borealis AI tree, we’ve got you covered!

Click on the gallery below to see pics from the event.


The 38th Conference on Computer Vision and Pattern Recognition was held on June 14-19 and, for the first time, was held as a fully virtual conference. The first CVPR was held in 1977 (although under a different name) and had only 63 papers which certainly would have made going through the proceedings a much easier endeavour. Since then it has grown into the premier conference in the field of computer vision with a massive technical program. This year there were the usual highlights at CVPR: the plenary talks (fireside chats with Satya Nadella and Charlie Bell), the award winning papers (presented by Greg Mori, CVPR 2020 program chair and Borealis AI’s Senior Research Director) and the retrospective Longuet-Higgins Prize. (Be sure to check out this excellent post by Michael J Black about one of the Longuet-Higgins Prize winning papers.) But there’s much more to CVPR 2020; this year alone had nearly 1,500 papers along with a wide array of tutorials and workshops. Our researchers have picked out a few of their favourites to highlight.

*Wenqian Liu, Runze Li, Meng Zheng, Srikrishna Karanam, Ziyan Wu, Bir Bhanu, Richard J. Radke, and Octavia Camps.*

**Related Papers:**

- Grad-CAM: Visual explanations from deep networks via gradient-based localization
- Disentangling by Factorising
- Adapting Grad-CAM for embedding networks

The paper proposed a gradient-based method to explain Variational Autoencoders (VAEs) for images. The proposed method is also able to localize anomalies in images, i.e., content unlike anything seen in training.

The method extends the widely-used Grad-CAM method to explain generative models, specifically VAEs. It can produce visual explanations for images that are generated from VAEs. For example, what is the most important image region for the digit 5? Which region of the digit 7 is different from that of digit 1? Visual explanations for these questions provide a clear understanding of the reasoning behind an algorithm's predictions and add robustness and performance guarantees.

The method is based on Grad-CAM in which the key technique is the choice of differentiable activation. The differentiable activation will be back-propagated to the last CNN layer to obtain channel-wise weights and thus a visual attention map. This paper uses the latent vector $\mathbf{z}$ as the differentiable activation. Specifically, each element $\mathbf{z}_i$ is backpropagated independently to generate an element-wise attention map. Then, the overall attention map is the mean of element-wise attention maps. Figure 1 illustrates the process of element-wise attention generation.
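For intuition, here is a generic Grad-CAM-style attention computation in NumPy (our own sketch, not the paper's code; the array shapes and random inputs are made up for the example):

```python
import numpy as np

def grad_cam_map(activations, gradients):
    # activations, gradients: (C, H, W) arrays at the last conv layer.
    # Channel weights are the spatially-averaged gradients (global average
    # pooling); the attention map is the ReLU of the weighted channel sum.
    weights = gradients.mean(axis=(1, 2))              # (C,)
    cam = np.tensordot(weights, activations, axes=1)   # (H, W)
    return np.maximum(cam, 0.0)

# In the paper this is computed once per latent element z_i (using the
# gradients of z_i with respect to the activations), then averaged over i.
rng = np.random.default_rng(0)
A, G = rng.normal(size=(8, 7, 7)), rng.normal(size=(8, 7, 7))
cam = grad_cam_map(A, G)
assert cam.shape == (7, 7) and (cam >= 0).all()
```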

By modifying the differentiable activation to the sum of all elements in the (inferred) mean vector, the method is also able to generate anomaly attention maps. Further, the method can help the VAE to learn improved latent space disentanglement by adding an attention disentanglement loss.

The performance of the method is demonstrated qualitatively and quantitatively.

Figure 2 shows qualitative results on the MNIST dataset for anomaly attention explanations. The method correctly highlights the difference between the training digit and the testing digits. For example, the heatmap highlights a key difference region between the "1" and the "7", which is the top horizontal bar in the "7".

The paper also shows quantitative results on a pedestrian video dataset (UCSD Ped 1) and a more comprehensive anomaly detection dataset (MVTec AD) using pixel-level segmentation scores. Its performance is significantly better than vanilla-VAE on UCSD Ped 1 and moderately better than previous work on MVTec AD.

*Boyang Deng, Kyle Genova, Soroosh Yazdani, Sofien Bouaziz, Geoffrey Hinton and Andrea Tagliasacchi.*

**Related Papers:**

- BSP-Net: Generating Compact Meshes via Binary Space Partitioning
- OctNet: Learning Deep 3D Representations at High Resolutions

This paper presents a new, differentiable representation of 3D shape based on convex polytopes.

There are numerous shape representations that are widely used in computer vision and computer graphics. However, most are extremely expensive to work with, not easily differentiable or impractical to use in some settings. A good, compact representation of 3D shape which is differentiable and efficient to work with would be a boon to a wide range of applications which work with 3D shape, e.g., 3D reconstruction from images, recognition of 3D objects, and rendering and physical simulation of 3D shapes.

There are many different shape representations in common use including voxel grids, signed distance functions, implicit surfaces, meshes, etc. In computer vision one of the most common is in terms of voxels, where a 3D object is represented by a 3D grid of values which indicate whether the object occupies a given point. Voxel grids are convenient because modern machine learning tools like convolutional networks can be naturally used, just as with images. Unfortunately, the memory requirements for voxel representations grow cubically (i.e., $O(n^3)$) with resolution $n$, quickly outgrowing the available memory on GPUs. This has led to the development of octrees and other specialized methods to attempt to save memory. Other approaches, including shape primitives, implicit shape representations and meshes have been used but come with their own challenges, for instance being too simplistic to capture real geometry, non-differentiable or computationally expensive.
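The cubic memory growth is easy to quantify: a dense float32 occupancy grid at $n = 512$ already takes half a gibibyte for a single shape, before any batching.

```python
# Memory for a dense float32 voxel occupancy grid grows as O(n^3).
for n in [64, 128, 256, 512]:
    bytes_needed = n**3 * 4           # 4 bytes per float32 voxel
    print(f"n={n}: {bytes_needed / 2**20:.0f} MiB")
# n=512 needs 512 MiB per grid -- a batch of a few dozen exhausts a GPU.
```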

Instead, this paper proposes to represent a shape by a combination of convex shapes which are themselves defined by a set of planes. The signed distance from a plane $h$ is defined as

\[

\mathcal{H}_h(\mathbf{x}) = \mathbf{n}_h \cdot \mathbf{x} + d_h

\]

where $\mathbf{n}_h$ is the (unit) normal of the plane and $d_h$ is the distance of the plane from the origin. A convex shape can then be defined as the set of all points which are on the positive side of every plane. To do this smoothly, CvxNet makes use of the Sigmoid and LogSumExp functions, i.e.,

\[

\mathcal{C}(\mathbf{x}) = \textrm{Sigmoid}(-\sigma \textrm{LogSumExp}_h \delta \mathcal{H}_h(\mathbf{x}) )

\]

where $\mathcal{C}(\mathbf{x})$ is close to 1 for points inside the shape and close to 0 for points outside. The values $\sigma$ and $\delta$ control the smoothness of the shape. This construction is shown in Figure 4. The result is a function $\mathcal{C}(\mathbf{x})$ which is fast to evaluate to determine inside/outside for a given convex shape. More complex (i.e., non-convex) shapes can then be constructed as a union of these convex parts. Further, the underlying planes themselves can be used directly to efficiently construct other representations like polygonal meshes, which are convenient for use in simulation.
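As a concrete illustration, here is a minimal NumPy sketch of this smooth indicator. The plane parameters and the smoothness values `sigma` and `delta` are hypothetical choices for the sake of the example, not the paper's settings.

```python
import numpy as np

def halfspace_sdf(x, normals, offsets):
    """Signed distance H_h(x) = n_h . x + d_h for each of H planes.
    x: (N, 3) query points; normals: (H, 3) unit normals; offsets: (H,)."""
    return x @ normals.T + offsets  # shape (N, H)

def convex_indicator(x, normals, offsets, sigma=100.0, delta=100.0):
    """Smooth indicator C(x) = Sigmoid(-sigma * LogSumExp_h(delta * H_h(x))).
    Near 1 inside the convex, near 0 outside (sigma, delta are example values)."""
    h = delta * halfspace_sdf(x, normals, offsets)            # (N, H)
    # numerically stable LogSumExp over the planes (smooth maximum)
    m = h.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(h - m).sum(axis=1, keepdims=True))).squeeze(1)
    # 0.5*(1 - tanh(z/2)) equals Sigmoid(-z) but avoids overflow in exp
    return 0.5 * (1.0 - np.tanh(0.5 * sigma * lse))
```

With six axis-aligned planes the indicator describes a cube, and a union of several such indicators (e.g., taking the maximum over convexes) would give a non-convex shape.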

The paper then goes on to train an autoencoder model of these planes on a database of shapes. The result is a low dimensional latent space and a decoder which can be used for other tasks such as reconstructing 3D shapes from depth or RGB images.

They demonstrate both the fidelity of the overall representation as well as its usefulness in 3D estimation tasks. Quantitatively, the results show significant improvement in the depth-to-3D task and competitive performance on the RGB-to-3D task. The traditional table of numbers can be found in the paper. More interestingly, however, the latent space representation finds natural and semantically meaningful parts (i.e., the individual convex pieces) in an entirely unsupervised manner. These can be seen in Figure 4.

*Organizers: Stephen Gould, Anoop Cherian, Dylan Campbell and Richard Hartley*

- Deep Declarative Networks
- Deep Equilibrium Models
- Residual Flows for Invertible Generative Modeling
- Differentiable Convex Optimization Layers

"Deep Declarative Networks" are a new class of machine learning model which indirectly defines a transformation.

The success of machine learning to date has been driven by explicitly defining parametric functions which transform inputs into the desired output. However, these models are growing increasingly large (e.g., the recently released GPT-3 has over 175 billion parameters) and data hungry. Further, some recent work has suggested that many of the parameters in these massive models are redundant. In contrast, **declarative networks** operate by defining these transformations indirectly. For instance, the transformation may be the solution to an ordinary differential equation, the minimum of an energy function or the root of a non-linear equation. The result is a class of methods which are more compact and efficient and may be the future of machine learning.

This workshop brought together researchers who have been pushing in this direction in one place to both review the results to date and discuss the outstanding problems in the area. For people unfamiliar with this exciting new direction, the talks and papers of this workshop are an excellent starting point. Further, they contain previews of exciting new work to come.

Neural ODEs were proposed in 2018, with Deep Equilibrium Models and Differentiable Convex Optimization Layers following in 2019, and this workshop is a natural outgrowth of the interest in finding new ways to define the transformations we aim to learn.

The presented papers and invited talks at this workshop contained many exciting results. One notable example was the talk by Zico Kolter, in which he previewed the latest results with Multiscale Deep Equilibrium Models, which were recently released on arXiv. The results, some of which can be seen in Figure 5, show that these models are competitive with state-of-the-art methods on image problems like classification and semantic segmentation while being smaller. Further, for size-constrained models (e.g., with limited numbers of parameters or memory usage), these models can significantly outperform existing techniques.


The partnership is part of Borealis AI’s ongoing commitment to advancing AI in Canada and fostering diversity and inclusion in the field. Borealis AI will be providing mentorship, career advice, and online workshops for the 30 women selected to participate in this year’s program as well as ongoing support for the AI4Good team.

The AI4Good Lab was founded in 2017 in Montreal by Angelique Mannella, Global Alliance Lead at Amazon Web Services, and Dr. Doina Precup, researcher at Mila, McGill University, and DeepMind. It's the first program of its kind to combine rigorous teaching in Artificial Intelligence (AI) with tackling diversity and inclusion in research and development, while promoting AI as a tool for social good.

The 2020 AI4Good Lab cohort marks the 4th year of training the next generation of diverse AI leaders, with 110 participants and alumni from across Canada. This year the lab will be held virtually from June 8th to July 28th due to the COVID-19 pandemic.

The 7-week program consists of two parts:

- intensive machine learning training through workshops and lectures by AI experts from academia and industry;
- a prototype development phase, during which the participants will work on AI products to tackle a social good problem of their choosing.

Borealis AI will actively be involved in both parts of the program with presentations and mentorship for the students as well as advice on how to navigate the job market in the AI space.

Speaking about the Lab, co-founder Angelique Mannella explained:

“Creating more diversity in technical environments is hard. While progress is being made, the only way to make sustainable, lasting change is to take an ecosystem approach, where organizations work together to surface new ways of working, new ways of knowledge sharing, and new ways of nurturing talent.”

Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI, said:

“Increasing the number of women working in technology and science is a priority for Borealis AI. We are delighted to support the AI4Good Lab program. We hope this program will provide the participants with new skills and tools to help them develop their careers in this exciting and evolving industry.”

Maya Marcus-Sells, Executive Director at the AI4Good Lab, said:

“Our partnership with Borealis AI helps bring women across Canada into the fast-moving tech ecosystem. Through mentorship, speaking, and career guidance, Borealis AI will provide the participants of the AI4Good Lab with insights and networks into AI careers that will help them grow into the AI leaders of tomorrow, ultimately leading towards a more diverse and representative AI talent pool.”

“Borealis AI's commitment from day one to foster an environment for gender diverse talent and to work with us to create new opportunities for knowledge sharing and mentorship has been invaluable to our participants and alumni and also to our ability to amplify our collective impact across Canada,” added Mannella.

“There is a lot more to be done in this area,” said Seiradaki. “Borealis AI will continue to partner with universities, government, and industry to further narrow the gender gap and improve the talent pool through a larger representation of women in AI.”

Borealis AI is a world-class AI Research center backed by RBC. Recognized for scientific excellence, Borealis AI uses the latest in machine learning capabilities to solve challenging problems in the financial industry. Led by award-winning inventor and entrepreneur, Foteini Agrafioti, and with top North America scientists and engineers, Borealis AI is at the core of the bank’s innovation strategy and benefits from RBC’s scale, data, and trusted brand.

With a focus on responsible AI, natural language processing, and reinforcement learning, Borealis AI is committed to building solutions using machine learning and artificial intelligence that will transform the way individuals manage their finances and their futures. For more information please visit www.borealisai.com.

This talk will provide an update on recent progress in this area. It will start out with novel state-of-the-art methods for the self-play setting. Next, it will introduce the Zero-Shot Coordination setting as a new frontier for multi-agent research. Finally it will introduce Other-Play as a novel learning algorithm, which allows agents to coordinate ad-hoc and biases learning towards more human compatible policies.

However, other optimization problems are much more challenging. Consider *hyperparameter search* in a neural network. Before we train the network, we must choose the architecture, optimization algorithm, and cost function. These choices are encoded numerically as a vector of hyperparameters. To get the best performance, we must find the hyperparameters for which the resulting trained network is best. This hyperparameter optimization problem has many challenging characteristics:

**Evaluation cost:** Evaluating the function that we wish to maximize (i.e., the network performance) in hyperparameter search is very expensive; we have to train the neural network model and then run it on the validation set to measure the network performance for a given set of hyperparameters.

**Multiple local optima:** The function is not convex and there may be many combinations of hyperparameters that are locally optimal.

**No derivatives:** We do not have access to the gradient of the function with respect to the hyperparameters; there is no easy and inexpensive way to propagate gradients back through the model training / validation process.

**Variable types:** There are a mixture of discrete variables (e.g., the number of layers, number of units per layer and type of non-linearity) and continuous variables (e.g., the learning rate and regularization weights).

**Conditional variables:** The existence of some variables depends on the settings of others. For example, the number of units in layer $3$ is only relevant if we already chose $\geq 3$ layers.

**Noise:** The function may return different values for the same input hyperparameter set. The neural network training process relies on stochastic gradient descent and so we typically don't get exactly the same result every time.

Bayesian optimization is a framework that can deal with optimization problems that have all of these challenges. The core idea is to build a model of the entire function that we are optimizing. This model includes both our current estimate of that function and the uncertainty around that estimate. By considering this model, we can choose where next to sample the function. Then we update the model based on the observed sample. This process continues until we are sufficiently certain of where the best point on the function is.

Let's now put aside the specific example of hyperparameter search and consider Bayesian optimization in its more general form. Bayesian optimization addresses problems where the aim is to find the parameters $\hat{\mathbf{x}}$ that maximize a function $\mbox{f}[\mathbf{x}]$ over some domain $\mathcal{X}$ consisting of finite lower and upper bounds on every variable:

\begin{equation}

\hat{\mathbf{x}} = \mathop{\rm argmax}_{\mathbf{x} \in \mathcal{X}} \left[ \mbox{f}[\mathbf{x}]\right]. \tag{1}

\label{eq:global-opt}

\end{equation}

At iteration $t$, the algorithm can learn about the function by choosing parameters $\mathbf{x}_t$ and receiving the corresponding function value $f[\mathbf{x}_t]$. The goal of Bayesian optimization is to find the maximum point on the function using the minimum number of function evaluations. More formally, we want to minimize the number of iterations $t$ before we can guarantee that we find parameters $\hat{\mathbf{x}}$ such that $f[\hat{\mathbf{x}}]$ is less than $\epsilon$ from the true maximum $\hat{f}$.

We'll assume for now that all parameters are continuous, that their existence is not conditional on one another, and that the cost function is deterministic so that it always returns the same value for the same input. We'll return to these complications later in this document. To help understand the basic optimization problem, let's consider some simple strategies:

**Grid Search:** One obvious approach is to quantize each dimension of $\mathbf{x}$ to form an input grid and then evaluate each point in the grid (figure 1). This is simple and easily parallelizable, but suffers from the curse of dimensionality; the size of the grid grows exponentially in the number of dimensions.

**Random Search:** Another strategy is to specify probability distributions for each dimension of $\mathbf{x}$ and then randomly sample from these distributions (Bergstra and Bengio, 2012). This addresses a subtle inefficiency of grid search that occurs when one of the parameters has very little effect on the function output (see figure 1 for details). Random search is also simple and parallelizable. However, if we are unlucky, we may either (i) make many similar observations that provide redundant information, or (ii) never sample close to the global maximum.
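The contrast between the two strategies can be sketched in a few lines of NumPy. The toy objective below, which depends strongly on one dimension and barely on the other, is my own illustrative choice for the inefficiency described above: with the same budget, grid search tries only a handful of distinct values along the important dimension, while random search tries a fresh value every time.

```python
import numpy as np

# Toy objective: strongly dependent on x[0], nearly flat in x[1]
# (the case where random search tends to beat grid search).
def f(x):
    return np.exp(-(x[0] - 0.3) ** 2 / 0.05) + 0.01 * x[1]

rng = np.random.default_rng(0)

# Grid search: 4 x 4 = 16 evaluations, but only 4 distinct values of x[0].
grid = [np.array([a, b])
        for a in np.linspace(0, 1, 4)
        for b in np.linspace(0, 1, 4)]
best_grid = max(grid, key=f)

# Random search: the same 16 evaluations give 16 distinct values of x[0].
samples = [rng.uniform(0, 1, size=2) for _ in range(16)]
best_rand = max(samples, key=f)
```

Neither strategy uses the previous measurements to decide where to sample next, which is exactly the gap that the sequential strategies below address.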

**Sequential search strategies:** One obvious deficiency of both grid search and random search is that they do not take into account previous measurements. If the measurements are made sequentially then we could use the previous results to decide where it might be strategically best to sample next (figure 2). One idea is that we could *explore* areas where there are few samples so that we are less likely to miss the global maximum entirely. Another approach could *exploit* what we have learned so far by sampling more in relatively promising areas. An optimal strategy would recognize that there is a trade-off between *exploration* and *exploitation* and combine both ideas.

Bayesian optimization is a sequential search framework that incorporates both exploration and exploitation and can be considerably more efficient than either grid search or random search. It can easily be motivated from figure 2; the goal is to build a probabilistic model of the underlying function that will know both (i) that $\mathbf{x}_{1}$ is a good place to sample because the function will probably return a high value here and (ii) that $\mathbf{x}_{2}$ is a good place to sample because the uncertainty here is very large.

A Bayesian optimization algorithm has two main components:

- **A probabilistic model of the function:** Bayesian optimization starts with an initial probability distribution (the prior) over the function $f[\bullet]$ to be optimized. Usually this just reflects the fact that we are extremely uncertain about what the function is. With each observation of the function $(\mathbf{x}_t, f[\mathbf{x}_t])$, we learn more and the distribution over possible functions (now called the posterior) becomes narrower.
- **An acquisition function:** This is computed from the posterior distribution over the function and is defined on the same domain. The acquisition function indicates the desirability of sampling each point next and, depending on how it is defined, can favor exploration or exploitation.

In the next two sections, we consider each of these components in turn.

There are several ways to model the function and its uncertainty, but the most popular approach is to use Gaussian processes (GPs). We will present other models (Bernoulli-Beta bandits, random forests, and Tree-Parzen estimators) later in this document.

A Gaussian Process is a collection of random variables, where any finite number of these are jointly normally distributed. It is defined by (i) a mean function $\mbox{m}[\mathbf{x}]$ and (ii) a covariance function $k[\mathbf{x},\mathbf{x}']$ that returns the similarity between two points. When we model our function as $\mbox{f}[\mathbf{x}]\sim \mbox{GP}[\mbox{m}[\mathbf{x}],k[\mathbf{x},\mathbf{x}^\prime]]$ we are saying that:

\begin{eqnarray}

\mathbb{E}[\mbox{f}[\mathbf{x}]] &=& \mbox{m}[\mathbf{x}] \tag{2}

\end{eqnarray}

\begin{eqnarray}

\mathbb{E}[(\mbox{f}[\mathbf{x}]-\mbox{m}[\mathbf{x}])(f[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}']. \tag{3}

\end{eqnarray}

The first equation states that the expected value of the function is given by some function $\mbox{m}[\mathbf{x}]$ of $\mathbf{x}$ and the second equation tells us how to compute the covariance of any two points $\mathbf{x}$ and $\mathbf{x}'$. As a concrete example, let's choose:

\begin{eqnarray}

\mbox{m}[\mathbf{x}] &=& 0 \tag{4}

\end{eqnarray}

\begin{eqnarray}

k[\mathbf{x}, \mathbf{x}']

&=&\mbox{exp}\left[-\frac{1}{2}\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)\right], \tag{5}

\end{eqnarray}

so here the expected function values are all zero and the covariance decreases as a function of distance between two points. In other words, points on the function that are very close to one another will tend to have similar values, while those further away will be less similar.

Given observations $\mathbf{f} = [f[\mathbf{x}_{1}], f[\mathbf{x}_{2}],\ldots, f[\mathbf{x}_{t}]]$ at $t$ points, we would like to make a prediction about the function value at a new point $\mathbf{x}^{*}$. This new function value $f^{*} = f[\mathbf{x}^{*}]$ is jointly normally distributed with the observations $\mathbf{f}$ so that:

\begin{equation}

Pr\left(\begin{bmatrix}\label{eq:GP_Joint}

\mathbf{f}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}] & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{6}

\end{equation}

where $\mathbf{K}[\mathbf{X},\mathbf{X}]$ is a $t\times t$ matrix where element $(i,j)$ is given by $k[\mathbf{x}_{i},\mathbf{x}_{j}]$, $\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]$ is a $t\times 1$ vector where element $i$ is given by $k[\mathbf{x}_{i},\mathbf{x}^{*}]$ and so on.

Since the function values in equation 6 are jointly normal, the conditional distribution $Pr(f^{*}|\mathbf{f})$ must also be normal, and we can use the standard formula for the mean and variance of this conditional distribution:

\begin{equation}\label{eq:gp_posterior}

Pr(f^*|\mathbf{f}) = \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{7}

\end{equation}

where

\begin{eqnarray}\label{eq:GP_Conditional}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}]&=&\mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{8}

\end{eqnarray}

Using this formula, we can estimate the distribution of the function at any new point $\mathbf{x}^{*}$. The best estimate of the function value is given by the mean $\mu[\mathbf{x}]$, and the uncertainty is given by the variance $\sigma^{2}[\mathbf{x}]$. Figure 3 shows an example of measuring several points on a function sequentially and showing how the predicted mean and variance changes for other points.

Now that we have a model of the function and its uncertainty, we will use this to choose which point to sample next. The *acquisition* function takes the mean and variance at each point $\mathbf{x}$ on the function and computes a value that indicates how desirable it is to sample next at this position. A good acquisition function should trade off exploration and exploitation.

In the following sections we'll describe four popular acquisition functions: the upper confidence bound (Srinivas *et al.*, 2010), expected improvement (Močkus, 1975), probability of improvement (Kushner, 1964), and Thompson sampling (Thompson, 1933). Note that there are several other approaches which are not discussed here, including those based on entropy search (Villemonteix *et al.*, 2009; Hennig and Schuler, 2012) and the knowledge gradient (Wu *et al.*, 2017).

**Upper confidence bound:** This acquisition function (figure 4a) is defined as:

\begin{align}

\mbox{UCB}[\mathbf{x}^{*}] = \mu[\mathbf{x}^{*}] + \beta^{1/2} \sigma[\mathbf{x}^{*}]. \label{eq:UCB-def} \tag{9}

\end{align}

This favors either (i) regions where $\mu[\mathbf{x}^{*}]$ is large (for exploitation) or (ii) regions where $\sigma[\mathbf{x}^{*}]$ is large (for exploration). The positive parameter $\beta$ trades off these two tendencies.

**Probability of improvement:** This acquisition function computes the likelihood that the function at $\mathbf{x}^{*}$ will return a result higher than the current maximum $\mbox{f}[\hat{\mathbf{x}}]$. For each point $\mathbf{x}^{*}$, we integrate the part of the associated normal distribution that is above the current maximum (figure 4b) so that:

\begin{equation}

\mbox{PI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} \mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{10}

\end{equation}

**Expected improvement:** The main disadvantage of the probability of improvement function is that it does not take into account how much the improvement will be; we do not want to favor small improvements (even if they are very likely) over larger ones. Expected improvement (figure 4c) takes this into account. It computes the expectation of the improvement $f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}]$ over the part of the normal distribution that is above the current maximum to give:

\begin{equation}

\mbox{EI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} (f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}])\mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{11}

\end{equation}

There also exist methods to allow us to trade-off exploitation and exploration for probability of improvement and expected improvement (see Brochu *et al.*, 2010).
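The integrals in equations 10 and 11 have well-known closed forms in terms of the standard normal CDF $\Phi$ and PDF $\phi$, so none of these acquisition functions requires numerical integration. A stdlib-only sketch (function names are my own):

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def norm_pdf(z):
    """Standard normal PDF."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def ucb(mu, sigma, beta=2.0):
    """Upper confidence bound (equation 9)."""
    return mu + sqrt(beta) * sigma

def prob_improvement(mu, sigma, f_best):
    """Closed form of equation 10: Gaussian tail mass above the current best."""
    return norm_cdf((mu - f_best) / sigma)

def expected_improvement(mu, sigma, f_best):
    """Closed form of equation 11."""
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm_cdf(z) + sigma * norm_pdf(z)
```

In a full algorithm these would be evaluated on the GP posterior mean and standard deviation over a candidate set, and the next sample taken at the argmax of the chosen acquisition function.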

**Thompson sampling:** When we introduced Gaussian processes, we only talked about how to compute the probability distribution for a single new point $\mathbf{x}^{*}$. However, it's also possible to draw a sample from the joint distribution of many new points that could collectively represent the entire function. Thompson sampling (figure 4d) exploits this by drawing such a sample from the posterior distribution over possible functions and then chooses the next point $\mathbf{x}$ according to the position of the maximum of this sampled function. To draw the sample, we append an equally spaced set of points to the observed ones as in equation 6, use the conditional formula to find a Gaussian distribution over these points as in equation 8, and then draw a sample from this Gaussian.
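The procedure just described can be sketched as follows for a 1-D problem: condition the GP on the observations over a dense grid of candidate points, draw one joint sample from the resulting Gaussian, and take its argmax. The grid discretization and jitter are my own implementation choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def sq_exp(A, B):
    """Squared exponential kernel for 1-D inputs (equation 5)."""
    return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)

def thompson_next(X, f, grid, jitter=1e-6):
    """Draw one function from the GP posterior on a dense grid (equation 8)
    and return the grid point where that draw is maximal."""
    K = sq_exp(X, X) + jitter * np.eye(len(X))
    Ks = sq_exp(grid, X)
    Kss = sq_exp(grid, grid)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ f
    cov = Kss - Ks @ Kinv @ Ks.T
    # sample the whole function jointly, then pick its maximizer
    sample = rng.multivariate_normal(mu, cov + jitter * np.eye(len(grid)))
    return grid[np.argmax(sample)]
```

Because each call draws a fresh function from the posterior, repeated calls naturally propose different points, which is what makes Thompson sampling convenient for the parallel setting discussed later.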

Figure 5 shows a complete worked example of Bayesian optimization in one dimension using the upper confidence bound. As we sample more points, the function becomes steadily more certain. The method explores the function but also focuses on promising areas, exploiting what it has already learned.

In the previous section, we summarized the main ideas of Bayesian optimization with Gaussian processes. In this section, we'll dig a bit deeper into some of the practical aspects. We consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, how to exploit parallel sampling of the function, and finally we'll discuss some limitations of the approach.

Until this point, we have assumed that the function that we are estimating is noise-free and always returns the same value $\mbox{f}[\mathbf{x}]$ for a given input $\mathbf{x}$. To incorporate a stochastic output with variance $\sigma_{n}^{2}$, we add an extra noise term to the expression for the Gaussian process covariance:

\begin{eqnarray}

\mathbb{E}[(y[\mathbf{x}]-\mbox{m}[\mathbf{x}])(y[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}'] + \sigma^{2}_{n}. \tag{12}

\end{eqnarray}

We no longer observe the function values $\mbox{f}[\mathbf{x}]$ directly, but observe noisy corruptions $y[\mathbf{x}] = \mbox{f}[\mathbf{x}]+\epsilon$ of them. The joint distribution of previously observed noisy function values $\mathbf{y}$ and a new unobserved point $f^{*}$ becomes:

\begin{equation}

Pr\left(\begin{bmatrix}

\mathbf{y}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I} & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{13}

\end{equation}

and the conditional probability of a new point becomes:

\begin{eqnarray}\label{eq:noisy_gp_posterior}

Pr(f^{*}|\mathbf{y}) &=& \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{14}

\end{eqnarray}

where

\begin{eqnarray}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{y}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}] &=& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{15}

\end{eqnarray}

Incorporating noise means that there is uncertainty about the function even where we have already sampled points (figure 6), and so sampling twice at the same position or at very similar positions could be sensible.
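The noisy case of equations 14-15 is a one-line change to the noise-free predictor: the observation variance $\sigma_n^2$ is added to the diagonal of the kernel matrix. A minimal 1-D sketch:

```python
import numpy as np

def gp_posterior_noisy(X, y, Xstar, noise_var=0.1):
    """GP posterior with observation noise (equations 14-15), squared
    exponential kernel, zero prior mean; 1-D inputs for simplicity."""
    def k(A, B):
        return np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
    K = k(X, X) + noise_var * np.eye(len(X))   # noise enters on the diagonal
    Ks = k(Xstar, X)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y
    var = np.diag(k(Xstar, Xstar) - Ks @ Kinv @ Ks.T)
    return mu, var
```

Unlike the noise-free case, the posterior mean no longer interpolates the data exactly and the variance stays strictly positive even at an observed point, which is why re-sampling near a previous observation can still be informative.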

When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.

We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1},& \mathbf{x}_{2},&\cdots,& \mathbf{x}_{I}\end{bmatrix}$, drawing a sample from $\mbox{Norm}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]]$ and then plotting the results. In this section, we'll consider several different choices of covariance function, and use this method to visualize each.

**Squared Exponential Kernel:** In our example above, we used the squared exponential kernel, but more properly we should have included the amplitude $\alpha$ which controls the overall amount of variability and the length scale $\lambda$ which controls the amount of smoothness:

\begin{equation}\label{eq:bo_squared_exp}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \mbox{exp}\left[-\frac{d^{2}}{2\lambda^{2}}\right],\nonumber

\end{equation}

where $d$ is the Euclidean distance between the points:

\begin{equation}

d = \sqrt {\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)}. \tag{16}

\end{equation}

When the amplitude $\alpha^{2}$ is small, the function does not vary too much in the vertical direction. When it is larger, there is more variation. When the length scale $\lambda$ is small, the function is assumed to be less smooth and we quickly become uncertain about the state of the function as we move away from known positions. When it is large, the function is assumed to be more smooth and we are increasingly confident about what happens away from these observations (figure 7). Samples from the squared exponential kernel are visualized in figure 8a-c.

**Matérn kernel:** The squared exponential kernel assumes that the function is infinitely differentiable. The Matérn kernel (figure 8d-l) relaxes this assumption via a smoothness parameter $\nu$. The Matérn kernel with $\nu=0.5$ yields functions that are continuous but not differentiable and is defined as

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \exp\left[-\frac{d}{\lambda}\right], \tag{17}

\end{equation}

where once again, $d$ is the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}'$, $\alpha$ is the amplitude, and $\lambda$ is the length scale. The Matérn kernel with $\nu=1.5$ gives functions that are once differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{3}d}{\lambda}\right)\exp\left[-\frac{\sqrt{3}d}{\lambda}\right]. \tag{18}

\end{equation}

The Matérn kernel with $\nu=2.5$ gives functions that are twice differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{5}d}{\lambda} + \frac{5d^{2}}{3\lambda^{2}}\right)\exp\left[-\frac{\sqrt{5}d}{\lambda}\right]. \tag{19}

\end{equation}

The Matérn kernel with $\nu=\infty$ is infinitely differentiable and is identical to the squared exponential kernel.
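The three Matérn cases above (equations 17-19) are simple enough to implement directly; a small sketch, with the half-integer $\nu$ values hard-coded as in the equations:

```python
import numpy as np

def matern(d, amp=1.0, ls=1.0, nu=1.5):
    """Matérn kernel value for distance d, amplitude amp and length scale ls,
    for the half-integer smoothness values covered in equations 17-19."""
    r = d / ls
    if nu == 0.5:
        return amp**2 * np.exp(-r)
    if nu == 1.5:
        return amp**2 * (1 + np.sqrt(3) * r) * np.exp(-np.sqrt(3) * r)
    if nu == 2.5:
        return amp**2 * (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)
    raise ValueError("nu must be 0.5, 1.5 or 2.5 in this sketch")
```

All three variants equal $\alpha^2$ at zero distance and decay monotonically, differing only in how sharply correlation falls off; larger $\nu$ gives smoother sampled functions.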

**Periodic Kernel:** If we believe that the underlying function is oscillatory, we use the periodic function:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}^\prime] = \alpha^{2} \cdot \exp \left[ \frac{-2(\sin[\pi d/\tau])^{2}}{\lambda^2} \right], \tag{20}

\end{equation}

where $\tau$ is the period of the oscillation and the other parameters have the same meanings as before.

A common application for Bayesian optimization is to search for the best hyperparameters of a machine learning model. However, in an ironic twist, the kernel functions used in Bayesian optimization themselves contain unknown hyper-hyperparameters like the amplitude $\alpha$, length scale $\lambda$ and noise $\sigma^{2}_{n}$. There are several possible approaches to choosing these hyperparameters:

**1. Maximum likelihood:** similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):

\begin{eqnarray}\label{eq:bo_learning}

Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)&=&\int Pr(\mathbf{y}|\mathbf{f},\mathbf{x},\boldsymbol\theta)Pr(\mathbf{f}|\mathbf{x},\boldsymbol\theta)d\mathbf{f}\nonumber\\

&=& \mbox{Norm}_{y}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I}], \tag{21}

\end{eqnarray}

where $\boldsymbol\theta$ contains the unknown parameters in the kernel function and the measurement noise $\sigma^{2}_{n}$.
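In practice we maximize the log of equation 21, which for a Gaussian marginal has a standard closed form (quadratic data-fit term plus log-determinant complexity penalty). A 1-D sketch for the squared exponential kernel; in a real system this would be fed to a numerical optimizer over `(amp, ls, noise_var)`:

```python
import numpy as np

def log_marginal_likelihood(X, y, amp, ls, noise_var):
    """Log of equation 21 for 1-D inputs with a squared exponential kernel:
    log Norm_y[0, K + noise_var*I]."""
    d2 = (X[:, None] - X[None, :]) ** 2
    K = amp**2 * np.exp(-0.5 * d2 / ls**2) + noise_var * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    alpha = np.linalg.solve(K, y)           # K^{-1} y without an explicit inverse
    return -0.5 * y @ alpha - 0.5 * logdet - 0.5 * len(X) * np.log(2 * np.pi)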

In Bayesian optimization, we are collecting the observations sequentially, and where we collect them will depend on the kernel parameters, and we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.

**2. Full Bayesian approach:** here we would choose a prior distribution $Pr(\boldsymbol\theta)$ on the kernel parameters of the Gaussian process and combine this with the likelihood in equation 21 to compute the posterior. We then weight the acquisition functions according to this posterior:

\begin{equation}\label{eq:snoek_post}

\hat{a}[\mathbf{x}^{*}]\propto \int a[\mathbf{x}^{*}|\boldsymbol\theta]Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)Pr(\boldsymbol\theta). \tag{22}

\end{equation}

In practice this would usually be done using an Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek *et* al., 2012) and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).

For practical applications like hyperparameter search, we would want to make multiple function evaluations in parallel. In this case, we must consider how to prevent the algorithm from starting a new function evaluation in a place that is already being explored by a parallel thread.

One solution is to use a stochastic acquisition function. For example, Thompson sampling draws from the posterior distribution over the function and samples where this sample is maximal (figure 4d). When we sample several times, we will get different draws from the posterior and hence different values of $\mathbf{x}$.

A more sophisticated approach is to treat the problem in a fully Bayesian way (Snoek et al., 2012). The optimization algorithm keeps track of both the points that have been evaluated and the points that are pending, marginalizing over the uncertainty in the pending points. This can be done using a sampling approach similar to the method in figure 9 for incorporating different length scales. We draw samples from the Gaussians representing the possible pending results and build an acquisition function for each. We then average together these acquisition functions weighted by the probability of observing those results.

In practice, Bayesian optimization with Gaussian Processes works best if we start with a number of points from the function that have already been evaluated. A rule of thumb might be to use random sampling for $\sqrt{d}$ iterations where $d$ is the number of dimensions and then start the Bayesian optimization process. A second useful trick is to occasionally incorporate a random sample into the scheme. This can stop the Bayesian optimization process getting waylaid examining unproductive regions of the space and forces a certain degree of exploration. A typical approach might be to use a random sample every 10 iterations.
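These two tricks can be sketched in a simple driver loop. The objective and the `suggest_bayes_opt` stand-in below are placeholders invented for illustration; only the scheduling logic ($\sqrt{d}$ random initial points, a forced random sample every 10 iterations) reflects the rules of thumb above.

```python
import math
import random

random.seed(0)

def objective(x):
    # Toy stand-in for the expensive black-box function (maximized at 0.3).
    return -(x - 0.3) ** 2

def suggest_random():
    return random.random()

def suggest_bayes_opt(history):
    # Placeholder for a real acquisition-function maximizer:
    # perturb the best point seen so far.
    best_x, _ = max(history, key=lambda h: h[1])
    return min(1.0, max(0.0, best_x + random.gauss(0, 0.05)))

d = 4                                   # number of dimensions
n_init = max(2, round(math.sqrt(d)))    # rule of thumb: sqrt(d) random starts
history = []

for t in range(20):
    if t < n_init or t % 10 == 0:       # forced random sample every 10 steps
        x = suggest_random()
    else:
        x = suggest_bayes_opt(history)
    history.append((x, objective(x)))

best = max(history, key=lambda h: h[1])
```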

The main limitation of Bayesian optimization with GPs is efficiency. As the dimensionality increases, more points need to be evaluated. Unfortunately, the cost of exact inference in the Gaussian process scales as $\mathcal{O}[n^3]$ where $n$ is the number of data points. There has been some work to reduce this cost through different approximations such as:

**Inducing points:** This approach summarizes the large number of observed points into a smaller subset known as inducing points (Snelson *et al.*, 2006).

**Decomposing the kernel:** This approach decomposes the "big" kernel in high dimensions into "small" kernels that each act on a few dimensions (Duvenaud *et al.*, 2011).

**Using random projections:** This approach relies on a random embedding to solve the optimization problem in a lower dimension (Wang *et al.*, 2013).

So far, we have considered optimizing continuous variables. What does Bayesian optimization look like in the discrete case? Perhaps we wish to choose which of $K$ discrete conditions (parameter values) yields the best output. In the absence of noise, this problem is trivial; we simply try all $K$ conditions in turn and choose the one that returns the maximum. However, when there is noise on the output, we can use Bayesian optimization to find the best condition efficiently.

The basic approach is to model each condition independently. For continuous observations, we could model each output $f_{k}$ with a normal distribution, choose a prior over the mean of the normal, and then use the measurements to compute a posterior over this mean. We'll leave developing this model as an exercise for the reader. Instead, and for a bit of variety, we'll move to a different setting where the observations are binary and we wish to find the configuration that produces the highest proportion of '1's in the output. This setting motivates the Beta-Bernoulli bandit model.

Consider the problem of choosing which of $K$ graphics to present to the user for a web-advert. We assume that for the $k^{th}$ graphic, there is a fixed probability $f_{k}$ that the person will click, but these parameters are unknown. We would like to efficiently choose the graphic that prompts the most clicks.

To solve this problem, we treat the parameters $f_{1}\ldots f_{K}$ as uncertain and place an uninformative Beta distribution prior with $\alpha,\beta=1$ over their values:

\begin{equation}

Pr(f_{k}) = \mbox{Beta}_{f_{k}}\left[1.0, 1.0\right]. \tag{23}

\end{equation}

The likelihood of showing the $k^{th}$ graphic $n_{k}$ times and receiving $c_{k}$ clicks is then

\begin{equation}

Pr(c_{k}|f, n_{k}) = f_{k}^{c_{k}}(1-f_{k})^{n_{k}-c_{k}}, \tag{24}

\end{equation}

and we can combine these two equations via Bayes' rule to compute the posterior distribution over the parameter $f_{k}$ (see chapter 4 of Prince, 2012), which is given by

\begin{equation}

Pr(f_{k}|c_{k},n_{k}) = \mbox{Beta}_{f_{k}}\left[1.0 + c_{k}, 1.0 + n_{k}-c_{k} \right]. \tag{25}

\end{equation}

Now we must choose which value of $k$ to try next given the $k$ posterior distributions over the probabilities $f_{k}$ of getting a click (figure 10). As before, we choose an acquisition function and sample the value of $k$ that maximizes this. In this case, the most practical approach is to use Thompson sampling. We sample from each posterior distribution separately (they are independent) and choose $k$ based on the highest sampled value.
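The whole Beta-Bernoulli bandit with Thompson sampling fits in a few lines of numpy. The click probabilities below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3
true_rates = np.array([0.04, 0.06, 0.10])   # hidden click probabilities f_k

# Beta(1, 1) priors: alpha accumulates clicks, beta accumulates non-clicks.
alpha = np.ones(K)
beta = np.ones(K)

for _ in range(5000):
    # Thompson sampling: draw one sample from each posterior and show
    # the graphic whose sampled click rate is highest.
    k = np.argmax(rng.beta(alpha, beta))
    click = rng.random() < true_rates[k]
    alpha[k] += click            # posterior update (equation 25)
    beta[k] += 1 - click

posterior_means = alpha / (alpha + beta)
```

As the posteriors sharpen, the samples for clearly inferior graphics almost never win the argmax, so the algorithm automatically shifts from exploration to exploitation.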

As in the continuous case, this method will trade off exploiting existing knowledge by showing graphics that it knows will generate a high rate of clicks and exploring graphics where the click rate is very uncertain. This model and algorithm are part of a more general literature on bandit algorithms. More information can be found in this book.

When we have many discrete variables (e.g., the orientation, color, font size in an advert graphic), we could treat each combination of variables as one value of $k$ and use the above approach in which each condition is treated independently. However, the number of combinations may be very large and so this is not necessarily practical.

If the discrete variables have a natural order (e.g., font size) then one approach is to treat them as continuous. We amalgamate them into an observation vector $\mathbf{x}$ and use a Gaussian process model. The only complication is that we now only compute the acquisition function at the discrete values that are valid.

If the discrete variables have no natural order then we are in trouble. Gaussian processes depend on the kernel 'distance' between points and it is hard to define such kernels for discrete variables. One approach is to use a one-hot encoding, apply a kernel for each dimension and let the overall kernel be defined by the product of these sub-kernels (Duvenaud *et al.*, 2014). However, this is not ideal because there is no way for the model to know about the invalid input values, which will be assigned some probability and may be selected as new points to evaluate. One way to move forward is to consider a different underlying probabilistic model.

The approaches up to this point can deal with most of the problems that we outline in the introduction, but are not suited to the case where there are many discrete variables (possibly in combination with continuous variables). Moreover, they cannot elegantly handle the case of conditional variables where the existence of some variables is contingent on the settings of others. In this section we consider random forest models and tree-Parzen estimators, both of which can handle these situations.

The *Sequential Model-based Algorithm Configuration* (SMAC) algorithm uses a random forest as an alternative to Gaussian processes. Consider the case where we have made some observations and trained a regression forest. For any point, we can measure the mean of the trees' predictions and their variance (figure 11). This mean and variance are then treated similarly to the equivalent outputs from a Gaussian process model. We apply an acquisition function to choose which of a set of candidate points to sample next. In practice, the forest must be updated as we go along and a simple way to do that is just to split a leaf when it accumulates a certain number of training examples.
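A sketch of this idea using scikit-learn's random forest: the per-tree predictions supply the mean and variance that stand in for a GP's predictive distribution. The data, candidate set, and UCB acquisition below are illustrative choices, not SMAC's exact details:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Observations collected so far (2D toy function).
X = rng.uniform(0, 1, size=(30, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Per-tree predictions give a mean and variance at each candidate point.
X_cand = rng.uniform(0, 1, size=(100, 2))
per_tree = np.stack([tree.predict(X_cand) for tree in forest.estimators_])
mu = per_tree.mean(axis=0)
sigma2 = per_tree.var(axis=0)

# Upper-confidence-bound acquisition over the candidate set.
x_next = X_cand[np.argmax(mu + 2.0 * np.sqrt(sigma2))]
```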

Random forests based on binary splits can easily cope with combinations of discrete and continuous variables; it is just as easy to split the data by thresholding a continuous value as it is to split it by dividing a discrete variable into two non-overlapping sets. Moreover, the tree structure makes it easy to accommodate conditional parameters: we do not consider splitting on contingent variables until they are guaranteed by prior choices to exist.

The Tree-Parzen estimator (Bergstra *et al.*, 2011) works quite differently from the models that we have considered so far. It describes the likelihood $Pr(\mathbf{x}|y)$ of the data $\mathbf{x}$ given the noisy function value $y$ rather than the posterior $Pr(y|\mathbf{x})$.

More specifically, the goal is to build two separate models $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ where the set $\mathcal{L}$ contains the lowest values of $y$ seen so far and the set $\mathcal{H}$ contains the highest. These sets are created by partitioning the values according to whether they fall below or above some fixed quantile.

The likelihoods $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ are modelled with kernel density estimators; for example, we might describe the likelihood as a sum of Gaussians with a mean on each observed data point $\mathbf{x}$ and fixed variance (figure 12). It can be shown that expected improvement is then maximized by choosing the point that maximizes the ratio $Pr(\mathbf{x}|y\in\mathcal{H})/Pr(\mathbf{x}|y\in\mathcal{L})$.
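A minimal one-dimensional numpy sketch of this procedure: fixed-bandwidth Gaussian kernel density estimates for the high and low sets, with the next point chosen to maximize their ratio. The data, quantile, and bandwidth are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed points and noisy function values.
x_obs = rng.uniform(0, 1, 40)
y_obs = np.sin(3 * x_obs) + 0.1 * rng.standard_normal(40)

# Split at a quantile: H holds the highest y values, L the lowest.
gamma = 0.25
cut = np.quantile(y_obs, 1 - gamma)
x_high = x_obs[y_obs >= cut]
x_low = x_obs[y_obs < cut]

def kde(points, x, bandwidth=0.1):
    # Sum of Gaussians centred on each observed point, fixed variance.
    diffs = (x[:, None] - points[None, :]) / bandwidth
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(points) * bandwidth)

# Choose the candidate that maximizes Pr(x|H) / Pr(x|L).
x_cand = np.linspace(0, 1, 200)
ratio = kde(x_high, x_cand) / (kde(x_low, x_cand) + 1e-12)
x_next = x_cand[np.argmax(ratio)]
```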

Tree-Parzen estimators work when we have a mixture of discrete and continuous spaces, and when some parameters are contingent on others. Moreover, the computation scales linearly with the number of data points as opposed to with their cube as for Gaussian processes.

In this tutorial, we have discussed Bayesian optimization, its key components, and its applications. For further information, consult the recent surveys by Shahriari *et al.* (2016) and Frazier (2018). Python packages for Bayesian optimization include BoTorch, Spearmint, GPFlow, and GPyOpt.

Code for hyperparameter optimization can be found in the Hyperopt and HPBandSter packages. A popular application of Bayesian optimization is AutoML, which broadens the scope of hyperparameter optimization to compare different model types as well as choosing their parameters and hyperparameters. Python packages for AutoML include Auto-sklearn, Hyperopt-sklearn, and NNI.
