June 3, 2020

Optimization is at the heart of machine learning. When we fit a model, we rely on finding the minimum of the associated cost function. Fortunately, many optimization problems are relatively easy. For example, the cost function for a logistic regression model is convex with a single minimum and closed-form derivatives and so any sensible gradient-based method will find the minimum.

However, other optimization problems are much more challenging. Consider *hyperparameter search* in a neural network. Before we a train the network, we must choose the architecture, optimization algorithm, and cost function. These choices are encoded numerically as a vector of hyperparameters. To get the best performance, we must find the hyperparameters for which the resulting trained network is best. This hyperparameter optimization problem has many challenging characteristics:

**Evaluation cost:** Evaluating the function that we wish to maximize (i.e., the network performance) in hyperparameter search is very expensive; we have to train the neural network model and then run it on the validation set to measure the network performance for a given set of hyperparameters.

**Multiple local optima:** The function is not convex and there may be many combinations of hyperparameters that are locally optimal.

**No derivatives:** We do not have access to the gradient of the function with respect to the hyperparameters; there is no easy and inexpensive way to propagate gradients back through the model training / validation process.

**Variable types:** There are a mixture of discrete variables (e.g., the number of layers, number of units per layer and type of non-linearity) and continuous variables (e.g., the learning rate and regularization weights).

**Conditional variables:** The existence of some variables depends on the settings of others. For example, the number of units in layer $3$ is only relevant if we already chose $\geq 3$ layers.

**Noise:** The function may return different values for the same input hyperparameter set. The neural network training process relies on stochastic gradient descent and so we typically don't get exactly the same result every time.

Bayesian optimization is a framework that can deal with optimization problems that have all of these challenges. The core idea is to build a model of the entire function that we are optimizing. This model includes both our current estimate of that function and the uncertainty around that estimate. By considering this model, we can choose where next to sample the function. Then we update the model based on the observed sample. This process continues until we are sufficiently certain of where the best point on the function is.

Let's now put aside the specific example of hyperparameter search and consider Bayesian optimization in its more general form. Bayesian optimization addresses problems where the aim is to find the parameters $\hat{\mathbf{x}}$ that maximize a function $\mbox{f}[\mathbf{x}]$ over some domain $\mathcal{X}$ consisting of finite lower and upper bounds on every variable:

\begin{equation}

\hat{\mathbf{x}} = \mathop{\rm argmax}_{\mathbf{x} \in \mathcal{X}} \left[ \mbox{f}[\mathbf{x}]\right]. \tag{1}

\label{eq:global-opt}

\end{equation}

At iteration $t$, the algorithm can learn about the function by choosing parameters $\mathbf{x}_t$ and receiving the corresponding function value $f[\mathbf{x}_t]$. The goal of Bayesian optimization is to find the maximum point on the function using the minimum number of function evaluations. More formally, we want to minimize the number of iterations $t$ before we can guarantee that we find parameters $\hat{\mathbf{x}}$ such $f[\hat{\mathbf{x}}]$ is less than $\epsilon$ from the true maximum $\hat{f}$.

We'll assume for now that all parameters are continuous, that their existences are not conditional on one another, and that the cost function is deterministic so that it always returns the same value for the same input. We'll return these complications later in this document. To help understand the basic optimization problem let's consider some simple strategies:

**Grid Search:** One obvious approach is to quantize each dimension of $\mathbf{x}$ to form an input grid and then evaluate each point in the grid (figure 1). This is simple and easily parallelizable, but suffers from the curse of dimensionality; the size of the grid grows exponentially in the number of dimensions.

**Random Search:** Another strategy is to specify probability distributions for each dimension of $\mathbf{x}$ and then randomly sample from these distributions (Bergstra and Bengio, 2012). This addresses a subtle inefficiency of grid search that occurs when one of the parameters has very little effect on the function output (see figure 1 for details). Random search is also simple and parallelizable. However, if we are unlucky, we can may either (i) make many similar observations that provide redundant information, or (ii) never sample close to the global maximum.

**Sequential search strategies:** One obvious deficiency of both grid search and random search is that they do not take into account previous measurements. If the measurements are made sequentially then we could use the previous results to decide where it might be strategically best to sample next (figure 2). One idea is that we could *explore* areas where there are few samples so that we are less likely to miss the global maximum entirely. Another approach could *exploit* what we have learned so far by sampling more in relatively promising areas. An optimal strategy would recognize that there is a trade-off between *exploration* and *exploitation* and combine both ideas.

Bayesian optimization is a sequential search framework that incorporates both exploration and exploitation and can be considerably more efficient than either grid search or random search. It can easily be motivated from figure 2; the goal is to build a probabilistic model of the underlying function that will know both (i) that $\mathbf{x}_{1}$ is a good place to sample because the function will probably return a high value here and (ii) that $\mathbf{x}_{2}$ is a good place to sample because the uncertainty here is very large.

A Bayesian optimization algorithm has two main components:

**A probabilistic model of the function:**Bayesian optimization starts with an initial probability distribution (the prior) over the function $f[\bullet]$ to be optimized. Usually this just reflects the fact that we are extremely uncertain about what the function is. With each observation of the function $(\mathbf{x}_t, f[\mathbf{x}_t])$, we learn more and the distribution over possible functions (now called the posterior) becomes narrower.**An acquisition function:**This is computed from the posterior distribution over the function and is defined on the same domain. The acquisition indicates the desirability of sampling each point next and depending on how it is defined it can favor exploration or exploitation.

In the next two sections, we consider each of these components in turn.

There are several ways to model the function and its uncertainty, but the most popular approach is to use Gaussian processes (GPs). We will present other models (Bernoulli-Beta bandits, random forests, and Tree-Parzen estimators) later in this document.

A Gaussian Process is a collection of random variables, where any finite number of these are jointly normally distributed. It is defined by (i) a mean function $\mbox{m}[\mathbf{x}]$ and (ii) a covariance function $k[\mathbf{x},\mathbf{x}']$ that returns the similarity between two points. When we model our function as $\mbox{f}[\mathbf{x}]\sim \mbox{GP}[\mbox{m}[\mathbf{x}],k[\mathbf{x},\mathbf{x}^\prime]]$ we are saying that:

\begin{eqnarray}

\mathbb{E}[\mbox{f}[\mathbf{x}]] &=& \mbox{m}[\mathbf{x}] \tag{2}

\end{eqnarray}

\begin{eqnarray}

\mathbb{E}[(\mbox{f}[\mathbf{x}]-\mbox{m}[\mathbf{x}])(f[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}']. \tag{3}

\end{eqnarray}

The first equation states that the expected value of the function is given by some function $\mbox{m}[\mathbf{x}]$ of $\mathbf{x}$ and the second equation tells us how to compute the covariance of any two points $\mathbf{x}$ and $\mathbf{x}'$. As a concrete example, let's choose:

\begin{eqnarray}

\mbox{m}[\mathbf{x}] &=& 0 \tag{4}

\end{eqnarray}

\begin{eqnarray}

k[\mathbf{x}, \mathbf{x}']

&=&\mbox{exp}\left[-\frac{1}{2}\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)\right], \tag{5}

\end{eqnarray}

so here the expected function values are all zero and the covariance decreases as a function of distance between two points. In other words, points very close to one another of the function will tend to have similar values and those further away will be less similar.

Given observations $\mathbf{f} = [f[\mathbf{x}_{1}], f[\mathbf{x}_{2}],\ldots, f[\mathbf{x}_{t}]]$ at $t$ points, we would like to make a prediction about the function value at a new point $\mathbf{x}^{*}$. This new function value $f^{*} = f[\mathbf{x}^{*}]$ is jointly normally distributed with the observations $\mathbf{f}$ so that:

\begin{equation}

Pr\left(\begin{bmatrix}\label{eq:GP_Joint}

\mathbf{f}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}] & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{6}

\end{equation}

where $\mathbf{K}[\mathbf{X},\mathbf{X}]$ is a $t\times t$ matrix where element $(i,j)$ is given by $k[\mathbf{x}_{i},\mathbf{x}_{j}]$, $\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]$ is a $t\times 1$ vector where element $i$ is given by $k[\mathbf{x}_{i},\mathbf{x}^{*}]$ and so on.

Since the function values in equation 6 are jointly normal, the conditional distribution $Pr(f^{*}|\mathbf{f})$ must also be normal, and we can use the standard formula for the mean and variance of this conditional distribution:

\begin{equation}\label{eq:gp_posterior}

Pr(f^*|\mathbf{f}) = \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{7}

\end{equation}

where

\begin{eqnarray}\label{eq:GP_Conditional}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}]&=&\mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{8}

\end{eqnarray}

Using this formula, we can estimate the distribution of the function at any new point $\mathbf{x}^{*}$. The best estimate of the function value is given by the mean $\mu[\mathbf{x}]$, and the uncertainty is given by the variance $\sigma^{2}[\mathbf{x}]$. Figure 3 shows an example of measuring several points on a function sequentially and showing how the predicted mean and variance changes for other points.

Now that we have a model of the function and its uncertainty, we will use this to choose which point to sample next. The *acquisition* function takes the mean and variance at each point $\mathbf{x}$ on the function and computes a value that indicates how desirable it is to sample next at this position. A good acquisition function should trade off exploration and exploitation.

In the following sections we'll describe four popular acquisition functions: the upper confidence bound (Srinivas *et* al., 2010), expected improvement (Močkus, 1975), probability of improvement (Kushner, 1964), and Thompson sampling (Thompson, 1933). Note that there are several other approaches which are not discussed here including those based on entropy search (Villemonteix *et* al., 2009, Hennig and Schuler, 2012) and the knowledge gradient (Wu *et* al., 2017).

**Upper confidence bound:** This acquisition function (figure 4a) is defined as:

\begin{align}

\mbox{UCB}[\mathbf{x}^{*}] = \mu[\mathbf{x}^{*}] + \beta^{1/2} \sigma[\mathbf{x}^{*}]. \label{eq:UCB-def} \tag{9}

\end{align}

This favors either (i) regions where $\mu[\mathbf{x}^{*}]$ is large (for exploitation) or (ii) regions where $\sigma[\mathbf{x}^{*}]$ is large (for exploration). The positive parameter $\beta$ trades off these two tendencies.

**Probability of improvement:** This acquisition function computes the likelihood that the function at $\mathbf{x}^{*}$ will return a result higher than the current maximum $\mbox{f}[\hat{\mathbf{x}}]$. For each point $\mathbf{x}^{*}$, we integrate the part of the associated normal distribution that is above the current maximum (figure 4b) so that:

\begin{equation}

\mbox{PI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} \mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}] \tag{10}

\end{equation}

**Expected improvement:** The main disadvantage of the probability of improvement function is that it does not take into account how much the improvement will be; we do not want to favor small improvements (even if they are very likely) over larger ones. Expected improvement (figure 4c) takes this into account. It computes the expectation of the improvement $f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}]$ over the part of the normal distribution that is above the current maximum to give:

\begin{equation}

\mbox{EI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} (f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}])\mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{11}

\end{equation}

There also exist methods to allow us to trade-off exploitation and exploration for probability of improvement and expected improvement (see Brochu *et* al., 2010).

**Thompson sampling:** When we introduced Gaussian processes, we only talked about how to compute the probability distribution for a single new point $\mathbf{x}^{*}$. However, it's also possible to draw a sample from the joint distribution of many new points that could collectively represent the entire function. Thompson sampling (figure 4d) exploits this by drawing such a sample from the posterior distribution over possible functions and then chooses the next point $\mathbf{x}$ according to the position of the maximum of this sampled function. To draw the sample, we append an equally spaced set of points to the observed ones as in equation 6, use the conditional formula to find a Gaussian distribution over these points as in equation 8, and then draw a sample from this Gaussian.

Figure 5 shows a complete worked example of Bayesian optimization in one dimension using the upper confidence bound. As we sample more points, the function becomes steadily more certain. The method explores the function but also focuses on promising areas, exploiting what it has already learned.

In the previous section, we summarized the main ideas of Bayesian optimization with Gaussian processes. In this section, we'll dig a bit deeper into some of the practical aspects. We consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, how to exploit parallel sampling of the function, and finally we'll discuss some limitations of the approach.

Until this point, we have assumed that the function that we are estimating is noise-free and always returns the same value $\mbox{f}[\mathbf{x}]$ for a given input $\mathbf{x}$. To incorporate a stochastic output with variance $\sigma_{n}^{2}$, we add an extra noise term to the expression for the Gaussian process covariance:

\begin{eqnarray}

\mathbb{E}[(y[\mathbf{x}]-\mbox{m}[\mathbf{x}])(y[\mathbf{x}]-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}'] + \sigma^{2}_{n}. \tag{12}

\end{eqnarray}

We no longer observe the function values $\mbox{f}[\mathbf{x}]$ directly, but observe noisy corruptions $y[\mathbf{x}] = \mbox{f}[\mathbf{x}]+\epsilon$ of them. The joint distribution of previously observed noisy function values $\mathbf{y}$ and a new unobserved point $f^{*}$ becomes:

\begin{equation}

Pr\left(\begin{bmatrix}

\mathbf{y}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I} & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{13}

\end{equation}

and the conditional probability of a new point becomes:

\begin{eqnarray}\label{eq:noisy_gp_posterior}

Pr(f^{*}|\mathbf{y}) &=& \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{14}

\end{eqnarray}

where

\begin{eqnarray}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}] &=& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{15}

\end{eqnarray}

Incorporating noise means that there is uncertainty about the function even where we have already sampled points (figure 6), and so sampling twice at the same position or at very similar positions could be sensible.

When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.

We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1},& \mathbf{x}_{2},&\cdots,& \mathbf{x}_{I}\end{bmatrix}$, drawing a sample from $\mbox{Norm}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]]$ and then plotting the results. In this section, we'll consider several different choices of covariance function, and use this method to visualize each.

**Squared Exponential Kernel:** In our example above, we used the squared exponential kernel, but more properly we should have included the amplitude $\alpha$ which controls the overall amount of variability and the length scale $\lambda$ which controls the amount of smoothness:

\begin{equation}\label{eq:bo_squared_exp}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \mbox{exp}\left[-\frac{d^{2}}{2\lambda}\right],\nonumber

\end{equation}

where $d$ is the Euclidean distance between the points:

\begin{equation}

d = \sqrt {\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)}. \tag{16}

\end{equation}

When the amplitude $\alpha^{2}$ is small, the function does not vary too much in the vertical direction. When it is larger, there is more variation. When the length scale $\lambda$ is small, the function is assumed to be less smooth and we quickly become uncertain about the state of the function as we move away from known positions. When it is large, the function is assumed to be more smooth and we are increasingly confident about what happens away from these observations (figure 7). Samples from the squared exponential kernel are visualized in figure 8a-c.

**Matérn kernel:** The squared exponential function assumes that the function is infinitely differentiable. The Matérn kernel (figure 8d-l) relaxes this constraint by assuming a certain degree of smoothness $\nu$. The Matérn kernel with $\nu=0.5$ is once differentiable and is defined as

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \exp\left[-\frac{d}{\lambda^{2}}\right], \tag{17}

\end{equation}

where once again, $d$ is the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}'$, $\alpha$ is the amplitude, and $\lambda$ is the length scale. The Matérn kernel with $\nu=1.5$ is twice differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{3}d}{\lambda}\right)\exp\left[-\frac{\sqrt{3}d}{\lambda}\right]. \tag{18}

\end{equation}

The Matérn kernel with $\nu=2.5$ is three times differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{5}d}{\lambda} + \frac{5d^{2}}{3\lambda^{2}}\right)\exp\left[-\frac{\sqrt{5}d}{\lambda}\right]. \tag{19}

\end{equation}

The Matérn kernel with $\nu=\infty$ is infinitely differentiable and is identical to the squared exponential kernel (equation 16).

**Periodic Kernel:** If we believe that the underlying function is oscillatory, we use the periodic function:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}^\prime] = \alpha^{2} \cdot \exp \left[ \frac{-2(\sin[\pi d/\tau])^{2}}{\lambda^2} \right], \tag{20}

\end{equation}

where $\tau$ is the period of the oscillation and the other parameters have the same meanings as before.

A common application for Bayesian optimization is to search for the best hyperparameters of a machine learning model. However, in an ironic twist, the kernel functions used in Bayesian optimization themselves contain unknown hyper-hyperparameters like the amplitude $\alpha$, length scale $\lambda$ and noise $\sigma^{2}_{n}$. There are several possible approaches to choosing these hyperparameters:

**1. Maximum likelihood:** similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):

\begin{eqnarray}\label{eq:bo_learning}

Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)&=&\int Pr(\mathbf{y}|\mathbf{f},\mathbf{x},\boldsymbol\theta)d\mathbf{f}\nonumber\\

&=& \mbox{Norm}_{y}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I}], \tag{21}

\end{eqnarray}

where $\boldsymbol\theta$ contains the unknown parameters in the kernel function and the measurement noise $\sigma^{2}_{n}$.

In Bayesian optimization, we are collecting the observations sequentially, and where we collect them will depend on the kernel parameters, and we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.

**2. Full Bayesian approach:** here we would choose a prior distribution $Pr(\boldsymbol\theta)$ on the kernel parameters of the Gaussian process and combine this with the likelihood in equation 21 to compute the posterior. We then weight the acquisition functions according to this posterior:

\begin{equation}\label{eq:snoek_post}

\hat{a}[\mathbf{x}^{*}]\propto \int a[\mathbf{x}^{*}|\boldsymbol\theta]Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)Pr(\boldsymbol\theta). \tag{22}

\end{equation}

In practice this would usually be done using an Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek *et* al., 2012) and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).

For practical applications like hyperparameter search, we would want to make multiple function evaluations in parallel. In this case, we must consider how to prevent the algorithm from starting a new function evaluation in a place that is already being explored by a parallel thread.

One solution is to use a stochastic acquisition function. For example, Thompson sampling draws from the posterior distribution over the function and samples where this sample is maximal (figure 4d). When we sample several times, we will get different draws from the posterior and hence different values of $\mathbf{x}$.

A more sophisticated approach is to treat the problem in a fully Bayesian way (Snoek et al., 2012). The optimization algorithm keeps track of both the points that have been evaluated and the points that are pending, marginalizing over the uncertainty in the pending points. This can be done using a sampling approach similar to the method in figure 9 for incorporating different length scales. We draw samples from the Gaussians representing the possible pending results and build an acquisition function for each. We then average together these acquisition functions weighted by the probability of observing those results.

In practice, Bayesian optimization with Gaussian Processes works best if we start with a number of points from the function that have already been evaluated. A rule of thumb might be to use random sampling for $\sqrt{d}$ iterations where $d$ is the number of dimensions and then start the Bayesian optimization process. A second useful trick is to occasionally incorporate a random sample into the scheme. This can stop the Bayesian optimization process getting waylaid examining unproductive regions of the space and forces a certain degree of exploration. A typical approach might be to use a random sample every 10 iterations.

The main limitation of Bayesian optimization with GPs is efficiency. As the dimensionality increases, more points need to be evaluated. Unfortunately, the cost of exact inference in the Gaussian process scales as $\mathcal{O}[n^3]$ where $n$ is the number of data points. There has been some work to reduce this cost through different approximations such as:

**Inducing points:**This approach tries to summarize the large number of observed points into a smaller subset known as inducing points (Snelson*et*al., 2006).**Decomposing the kernel:**This approach decomposes the "big" kernel in high dimension into "small" kernels that act on small dimensions (Duvenaud*et*al., 2011).**Using random projections:**This approach relies on random embedding to solve the optimization problem in a lower dimension (Wang*et*al., 2013).

So far, we have considered optimizing continuous variables. What does Bayesian optimization look like in the discrete case? Perhaps we wish to choose which of $K$ discrete conditions (parameter values) yields the best output. In the absence of noise, this problem is trivial; we simply try all $K$ conditions in turn and choose the one that returns the maximum. However, when there is noise on the output, we can use Bayesian optimization to find the best condition efficiently.

The basic approach is model each condition independently. For continuous observations, we could model each output $f_{k}$ with a normal distribution, choose a prior over the mean of the normal and then use the measurements to compute a posterior over this mean. We'll leave developing this model as an exercise for the reader. Instead and for a bit of variety, we'll move to a different setting where the observations are binary we wish to find the configuration that produces the highest proportion of '1's in the output. This setting motivates the Beta-Bernoulli bandit model.

Consider the problem of choosing which of $K$ graphics to present to the user for a web-advert. We assume that for the $k^{th}$ graphic, there is a fixed probability $f_{k}$ that the person will click, but these parameters are unknown. We would like to efficiently choose the graphic that prompts the most clicks.

To solve this problem, we treat the parameters $f_{1}\ldots f_{K}$ as uncertain and place an uninformative Beta distribution prior with $\alpha,\beta=1$ over their values:

\begin{equation}

Pr(f_{k}) = \mbox{Beta}_{f_{k}}\left[1.0, 1.0\right]. \tag{23}

\end{equation}

The likelihood of showing the $k^{th}$ graphic $n_{k}$ times and receiving $c_{k}$ clicks is then

\begin{equation}

Pr(c_{k}|f, n_{k}) = f_{k}^{c_{k}}(1-f_{k})^{n_{k}-c_{k}}, \tag{24}

\end{equation}

and we can combine these two equations via Bayes rule to compute the posterior distribution over the parameter $f_{k}$ (see chapter 4 of Prince, 2012) which will be given by

\begin{equation}

Pr(f_{k}|c_{k},n_{k}) = \mbox{Beta}_{f_{k}}\left[1.0 + c_{k}, 1.0 + n_{k}-c_{k} \right]. \tag{25}

\end{equation}

Now we must choose which value of $k$ to try next given the $k$ posterior distributions over the probabilities $f_{k}$ of getting a click (figure 10). As before, we choose an acquisition function and sample the value of $k$ that maximizes this. In this case, the most practical approach is to use Thompson sampling. We sample from each posterior distribution separately (they are independent) and choose $k$ based on the highest sampled value.

As in the continuous case, this method will trade off exploiting existing knowledge by showing graphics that it knows will generate a high rate of clicks and exploring graphics where the click rate is very uncertain. This model and algorithm are part of a more general literature on bandit algorithms. More information can be found in this book.

When we have many discrete variables (e.g., the orientation, color, font size in an advert graphic), we could treat each combination of variables as one value of $k$ and use the above approach in which each condition is treated independently. However, the number of combinations may be very large and so this is not necessarily practical.

If the discrete variables have a natural order (e.g., font size) then one approach is to treat them as continuous. We amalgamate them into an observation vector $\mathbf{x}$ and use a Gaussian process model. The only complication is that we now only compute the acquisition function at the discrete values that are valid.

If the discrete variables have no natural order then we are in trouble. Gaussian processes depend on the kernel 'distance' between points and it is hard to define such kernels for discrete variables. One approach is to use a one-hot encoding, apply a kernel for each dimension and let the overall kernel be defined by the product of these sub-kernels (Duvenaud *et* al., 2014). However, this is not ideal because there is no way for the model to know about the invalid input values which will be assigned some probability and may be selected as new points to evaluate. One way to move forward is to consider a different underling probabilistic model.

The approaches up to this point can deal with most of the problems that we outline in the introduction, but are not suited to the case where there are many discrete variables (possibly in combination with continuous variables). Moreover, they cannot elegantly handle the case of conditional variables where the existence of some variables is contingent on the settings of others. In this section we consider random forest models and tree-Parzen estimators, both of which can handle these situations.

The *Sequential Model-based Algorithm Configuration* (SMAC) algorithm uses a random forest as an alternative to Gaussian processes. Consider the case where we have made some observations and trained a regression forest. For any point, we can measure the mean of the trees' predictions and their variance (figure 11). This mean and variance are then treated similarly to the equivalent outputs from a Gaussian process model. We apply an acquisition function to choose which of a set of candidate points to sample next. In practice, the forest must be updated as we go along and a simple way to do that is just to split a leaf when it accumulates a certain number of training examples.

Random forests based on binary splits can easily cope with combinations of discrete and continuous variables; it is just as easy to split the data by thresholding a continuous value as it is to split it by dividing a discrete variable into two non-overlapping sets. Moreover, the tree structure makes it easy to accommodate conditional parameters: we do not consider splitting on contingent variables until they are guaranteed by prior choices to exist.

The Tree-Parzen estimator (Bergstra *et* al., 2011) works quite differently from the models that we have considered so far. It describes the likelihood $Pr(\mathbf{x}|y)$ of the data $\mathbf{x}$ given the noisy function value $y$ rather than the posterior $Pr(y|\mathbf{x})$.

More specifically, the goal is to build two separate models $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ where the set $\mathcal{L}$ contains the lowest values of $y$ seen so far and the set $\mathcal{H}$ contains the highest. These sets are created by partitioning the values according to whether they fall below or above some fixed quantile.

The likelihoods $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ are modelled with kernel density estimators; for example, we might describe the likelihood as a sum of Gaussians with a mean on each observed data point $\mathbf{x}$ and fixed variance (figure 12). It can be shown that expected improvement is then maximized by choosing the point that maximizes the ratio $Pr(\mathbf{x}|y\in\mathcal{H})/Pr(\mathbf{x}|y\in\mathcal{L})$.

Tree-Parzen estimators work when we have a mixture of discrete and continuous spaces, and when some parameters are contingent on others. Moreover, the computation scales linearly with the number of data points as opposed to with their cube as for Gaussian processes.

In this tutorial, we have discussed Bayesian optimization, its key components, and applications. For further information, consult the recent surveys by Shahriari *et* al. (2016) and Frazier 2018. Python packages for Bayesian optimization include BoTorch, Spearmint, GPFlow, and GPyOpt.

Code for hyperparameter optimization can be found in the Hyperopt and HPBandSter packages. A popular application of Bayesian optimization is for AutoML which broadens the scope of hyperparameter optimization to also compare different model types as well as choosing their parameters and hyperparameters. Python packages for AutoML include Auto-sklearn, Hyperop-sklearn, and NNI.