In part I of this blog, we introduced out-of-distribution (OOD) detection, and discussed the special case of anomaly detection. In this scenario, we receive an unlabelled, but clean training set of in-distribution data $\{\mathbf{x}_{i}\}_{i=1}^{I}$. For a new example $\mathbf{x}^{*}$, we are required to determine whether it belongs to the same distribution as the training set or not. We described several families of methods for this task, including those based on one-class classification, probability models, and reconstruction quality.

In part II of this blog, we consider other OOD problems. First, we consider multi-class OOD detection (open-set recognition). Here, we receive a labelled dataset of in-distribution training data $\{\mathbf{x}_{i},y_{i}\}_{i=1}^{I}$. For a new example $\mathbf{x}^{*}$ we must classify it into one of the existing classes $y\in\{1,\ldots,K\}$ or decide to reject it.

Second, we consider the case where we are also given a set of known OOD examples $\{\tilde{\mathbf{x}}_{j}\}_{j=1}^J$. Usually, this set is too small to be used to build an in-distribution/out-of-distribution classifier directly. However, we can still exploit the OOD data to calibrate or regularize an OOD detector. Third, we briefly consider the case of outlier detection, in which we receive a single monolithic dataset containing both in-distribution and out-of-distribution examples and must attempt to remove the unwanted outliers.

OOD detection with class labels

When we train a system to distinguish between $K$ classes in an academic setting, we usually make the closed-world assumption. This implies that we only expect those same $K$ classes to occur when we run the system. However, when we deploy a classifier in the real world, this is unrealistic; it is almost inevitable that the system will encounter data from previously unseen classes.

Figure 1. Closed set recognition vs. open set recognition. a) Closed-set recognition. We train our system to distinguish between a fixed set of pre-determined classes based on training examples. Different colored circles represent training examples from three classes in 2D space $\mathbf{x}=[x_{1},x_{2}]^{T}$. Typically, the decision boundaries (grey lines) will fall between the clusters of training data. b) Open-set recognition. We now admit the possibility that samples from other classes (squares) will be present at test time, although we do not have access to these examples at training time. Open-set recognition methods implicitly define a tight decision boundary around the true classes to try to distinguish them from OOD examples.

In open-set recognition we acknowledge this possibility and build models that assign the input to one of $K+1$ classes, where the additional category indicates that we believe this data is from a previously unseen class. Multi-class OOD detection solves essentially the same problem, but with a slightly different emphasis; here the non-typical data are corrupted or unusual samples that we cannot confidently assign to one of the predetermined classes, rather than samples from new unseen classes.

The challenge in both cases is that the possible variation in the observed OOD examples is huge. To paraphrase Donald Rumsfeld, we must learn to cope with “unknown-unknowns”. Regardless of the method, we are implicitly defining extra decision boundaries around the true categories that prevent OOD examples being wrongly categorized (figure 1).

Methods for OOD detection with class labels can be divided into two categories. In the first case, we use a conventional classifier but modify it to return $K+1$ classes where the last class represents ‘none of the above’. Some of these techniques add OOD detection to an existing classifier, whereas others require partial or complete retraining. In the second case, we train the models with bespoke architectures that are specialized to the OOD detection task. We’ll now consider each category in turn.

Figure 2. Thresholding softmax probabilities. The data is passed through the classifier to produce $K$ activations corresponding to the $K$ classes. These are passed through the softmax function to produce $K$ non-negative numbers that sum to one and represent the class probabilities. Under the assumption that the classification will be uncertain for previously unseen classes, the maximum softmax probability can be used to identify if an example is OOD.

Adding OOD capabilities to standard classifiers

First, we note that we are not obliged to use the labels; we can still apply any of the methods from part I of this blog to the raw data to find outliers based on the input data $\{\mathbf{x}_{i}\}_{i=1}^{I}$ alone. Another simple approach is to use the labels to build a deep neural network classifier, and then apply the methods from part I to the representation at any level of this classifier. However, there are several other prominent methods that make explicit use of the class labels.

Threshold maximum softmax score: Perhaps the most obvious approach to detecting OOD examples using an existing classifier is to consider the final class probabilities themselves. Hendrycks and Gimpel (2017) conjecture that if the classifier has never seen the class before, it will be uncertain about its decision (figure 2). It follows that a simple OOD detector can be constructed by thresholding the maximum softmax probability. This is commonly used as a baseline in many OOD tasks, but unfortunately, experiments by Bendale and Boult (2016) show that it does not work very well.
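To make the baseline concrete, here is a minimal sketch of thresholding the maximum softmax probability. The function names and the threshold value of 0.7 are illustrative assumptions rather than anything prescribed by Hendrycks and Gimpel (2017); in practice the threshold would have to be tuned for the task.

```python
import math

def softmax(activations):
    """Convert K activations into K probabilities that sum to one."""
    m = max(activations)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def classify_or_reject(activations, threshold=0.7):
    """Return the predicted class index, or None if the example is flagged as OOD.

    The threshold is a hypothetical value chosen for this illustration.
    """
    probs = softmax(activations)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None

confident = classify_or_reject([6.0, 1.0, 0.5])  # one activation dominates
uncertain = classify_or_reject([1.1, 1.0, 0.9])  # activations are nearly equal
print(confident, uncertain)
```

The first input is assigned to a class because one probability dominates, whereas the second is rejected because no class is clearly preferred.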

ODIN: Liang et al. (2018) made two adjustments to this baseline method and termed their approach ODIN (Out-of-DIstribution detector for Neural networks). Firstly, they add a temperature scaling factor $T$ to the softmax function so that the probability that the label $y$ is class $c$ is given by:

\begin{equation}
    Pr(y=c|\mathbf{x}) = \frac{\exp[\mbox{f}_{c}[\mathbf{x}]/T]}{\sum_{c'=1}^{K}\exp[\mbox{f}_{c'}[\mathbf{x}]/T]}
\end{equation}

Figure 3. ODIN. The data is passed through the classifier to produce $K$ activations corresponding to the $K$ classes. These are divided by a positive temperature constant, which makes them more similar and then passed through the softmax function to produce $K$ non-negative numbers that sum to one and represent the class probabilities. The maximum softmax probability can be used to identify if an example is OOD. In the full system, the input $\mathbf{x}$ is also manipulated to try to improve the log of the maximum softmax probability.

Here, $\mathbf{x}$ is the input and $\mbox{f}[\mathbf{x}]$ is a network that computes the activations that go into the softmax function (figure 3). Surprisingly, a large value of $T$ improves the performance of methods based on thresholding the maximum probability. The authors use a Taylor expansion to show that, for large $T$, this modified criterion effectively compares (i) the extent to which the maximum score differs from the remaining scores to (ii) the variation within those scores.
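The effect of the temperature is easy to see numerically. The sketch below applies the temperature-scaled softmax to a fixed set of activations; the activation values and the temperatures are arbitrary choices for illustration.

```python
import math

def softmax_with_temperature(activations, T=1.0):
    """Softmax of the activations after dividing each one by the temperature T."""
    scaled = [a / T for a in activations]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

activations = [4.0, 2.0, 1.0]  # hypothetical activations for K = 3 classes
for T in (1.0, 10.0, 1000.0):
    probs = softmax_with_temperature(activations, T)
    print(T, [round(p, 4) for p in probs])
```

As $T$ grows, the probabilities approach $1/K$ while their ordering is preserved, so any threshold on the maximum probability has to be chosen relative to this compressed range.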

ODIN’s second modification is to add noise to the input by taking a single step of size $\epsilon$ that increases the log of the maximum softmax probability using techniques similar to those in adversarial learning. This is empirically shown to improve the softmax score more for in-distribution examples than out-of-distribution examples and so the subsequent thresholding is more effective. One possible explanation is that for in-distribution inputs, there are nearby examples that are more prototypical of the chosen class, and so it is easier to increase the softmax score.
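The perturbation step can be sketched for a toy model. Here we assume a linear classifier $\mbox{f}[\mathbf{x}] = \mathbf{W}\mathbf{x}+\mathbf{b}$ purely so that the gradient of the log softmax can be written by hand; a real system would obtain it by backpropagation through the network. The weights, the input, and the default values of `T` and `eps` are all illustrative assumptions.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def logits(W, b, x):
    """Activations of a linear classifier (a stand-in for a deep network)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]

def sign(v):
    return (v > 0) - (v < 0)

def odin_score(W, b, x, T=1000.0, eps=0.01):
    """Maximum temperature-scaled softmax probability after one perturbation step."""
    probs = softmax([f / T for f in logits(W, b, x)])
    c = max(range(len(probs)), key=probs.__getitem__)
    # Gradient of log Pr(y=c|x) with respect to x for the linear model:
    # (w_c - sum_k Pr(y=k|x) w_k) / T
    grad = [(W[c][d] - sum(p * row[d] for p, row in zip(probs, W))) / T
            for d in range(len(x))]
    # Single step of size eps in the direction that increases the log probability
    x_perturbed = [xd + eps * sign(g) for xd, g in zip(x, grad)]
    return max(softmax([f / T for f in logits(W, b, x_perturbed)]))

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # hypothetical weights, K = 3
b = [0.0, 0.0, 0.0]
x = [2.0, 0.5]
baseline = odin_score(W, b, x, eps=0.0)   # no perturbation
perturbed = odin_score(W, b, x, eps=0.01)
print(baseline, perturbed)
```

The final score would then be thresholded exactly as in the baseline method.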

One of the drawbacks of ODIN is that it needs at least some OOD data to set the hyperparameters (temperature $T$ and step size $\epsilon$ of the noise). Hsu et al. (2020) introduced Generalized ODIN, which addresses this issue. They replace the last layer of the network with a new layer that implicitly learns the temperature, and they choose $\epsilon$ so that it increases the maximum softmax score the most on in-distribution data.

OpenMax: Part of the problem with basing a test on thresholding softmax scores is that in some circumstances, we might expect the top few softmax scores to be similar. For example, in an image classification task, there might be several visually similar classes representing different species of fish, but other classes such as zebras that are quite visually distinct.

Motivated by this observation, Bendale and Boult (2016) propose treating the pattern of pre-softmax activations as a random variable and using the distribution of this variable as a basis for deciding whether a new example is out-of-distribution. They calculate the mean pre-softmax activation vector for each class from the subset of training data that was correctly classified. They then model the distribution of distances from each of these means using a Weibull distribution.

Figure 4. OpenMax. The data is passed through the classifier to produce $K$ activations corresponding to the $K$ classes. These are treated as a random variable and their distance from the mean activation vector for each class is measured. The softmax probabilities are computed as usual, but the top $N$ probabilities are attenuated by a factor derived from their rank and from the distance to the corresponding mean activation vector. A $(K+1)^{th}$ class is added that represents unknown classes and has a probability based on the total amount of attenuation.

They remove the softmax layer for the $K$ classes and replace it with a new softmax layer with $K+1$ classes, where the extra class represents unknown images. They attenuate the activations for the top $N$ classes according to (i) the index of the class (higher ranked classes are attenuated more) and (ii) the probability that the distance from the associated mean activation vector is likely to be at least as large as observed. They then construct an activation for the ‘other’ class based on the total amount of attenuation. They term this modified and calibrated softmax layer an ‘OpenMax’ layer (figure 4).
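The following is a simplified sketch of this recalibration, not a faithful reimplementation of Bendale and Boult (2016). It assumes that the class mean activation vectors and the Weibull parameters have already been fitted, uses the Weibull cumulative distribution function of the distance as the attenuation probability, and uses one particular rank weighting; all names and numbers are hypothetical.

```python
import math

def weibull_cdf(d, scale, shape):
    """Probability under the fitted Weibull model of a distance smaller than d."""
    return 1.0 - math.exp(-((d / scale) ** shape))

def openmax(activations, class_means, weibull_params, top_n=2):
    """Return K+1 probabilities; the last entry is the 'unknown' class."""
    K = len(activations)
    ranked = sorted(range(K), key=lambda k: activations[k], reverse=True)
    weights = [1.0] * K
    for rank, k in enumerate(ranked[:top_n]):
        dist = math.dist(activations, class_means[k])
        scale, shape = weibull_params[k]
        # Higher ranked classes (rank 0 is the top class) are attenuated more
        weights[k] = 1.0 - ((top_n - rank) / top_n) * weibull_cdf(dist, scale, shape)
    revised = [a * w for a, w in zip(activations, weights)]
    # The activation for the unknown class collects the total attenuation
    unknown = sum(a * (1.0 - w) for a, w in zip(activations, weights))
    all_activations = revised + [unknown]
    m = max(all_activations)
    exps = [math.exp(a - m) for a in all_activations]
    total = sum(exps)
    return [e / total for e in exps]

class_means = [[8.0, 1.0, 1.0], [1.0, 8.0, 1.0], [1.0, 1.0, 8.0]]  # assumed
weibull_params = [(2.0, 2.0)] * 3                                   # assumed (scale, shape)
typical = openmax([7.5, 1.2, 0.9], class_means, weibull_params)
unusual = openmax([7.5, 7.0, 6.5], class_means, weibull_params)
print(typical.index(max(typical)), unusual.index(max(unusual)))
```

In this toy setting, the activation pattern that lies close to a class mean keeps its original label, whereas the pattern that is far from every class mean is assigned to the extra class (index $K$ in the returned list).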

Membership loss: Perera and Patel (2019) replace the standard softmax layer and cross-entropy loss with membership loss. Here the output probabilities are generated using $K$ sigmoid functions that indicate whether each class is present or not. During training they use both membership loss and the standard cross entropy loss, but during inference, they only use the sigmoids. They choose the maximum sigmoid response if it is large enough or classify as OOD if it is too small (figure 5). Essentially, this system means that the $K$ network outputs do not compete with each other; an acceptable answer is that the probability for every known class is low.

Figure 5. Membership loss. The data is passed through the classifier to produce $K$ activations corresponding to the $K$ classes. These are individually passed through sigmoid functions to produce $K$ numbers between 0 and 1 that are estimates of whether the input belongs to each of the $K$ classes. Because there is no softmax function, it is now possible for all $K$ classes to be rejected.
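The inference rule for this kind of model is short enough to sketch directly. The snippet below only illustrates the decision at test time, not the membership loss used in training, and the threshold of 0.5 is an assumed value.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_decision(activations, threshold=0.5):
    """Pick the class with the largest sigmoid response, or None (OOD) if it is too small."""
    scores = [sigmoid(a) for a in activations]
    best = max(range(len(scores)), key=scores.__getitem__)
    label = best if scores[best] >= threshold else None
    return label, scores

known, known_scores = sigmoid_decision([3.0, -2.0, -4.0])
rejected, rejected_scores = sigmoid_decision([-3.0, -2.5, -4.0])
print(known, rejected)
```

Unlike softmax outputs, the scores are not forced to sum to one, so all of them can be small at the same time.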

Building multi-class OOD detectors

The methods in the previous section all rely on a standard classifier backbone. However, there are also algorithms and architectures that have been developed specifically for multi-class OOD detection. Many of these mirror the principles of OOD detection without labels (see part I of this blog). For example, there are approaches based on measuring reconstruction quality and consistency across different models. Other techniques, such as those based on prototypes, are specific to the multi-class setting. In this section, we’ll consider a cross-section of these methods.

Classification: There are various classical (i.e., non-deep learning) approaches to the multi-class OOD problem. For example, Scheirer et al. (2013), who first defined the open-set recognition problem, use a method based on support vector machines in which they add another hyperplane to account for parts of the space beyond the reasonable support of the known classes. Scheirer et al. (2014) present another SVM-based method.

Distance-based methods: A second family of approaches focuses on distances. For example, Bendale and Boult (2015) detect unknown classes based on the maximum distance from the centroids of known classes. Mendes Júnior et al. (2017) use the nearest neighbor distance ratio method, which compares the distance to examples of the two most similar classes. If this ratio is close to one, then the new example is neither closer to one class nor the other and so might correspond to an unseen class.
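A minimal sketch of the nearest neighbor distance ratio idea is given below. It compares the distance to the nearest training example with the distance to the nearest example from a different class; the ratio threshold of 0.7 and the toy data are assumptions made for illustration.

```python
import math

def nndr_classify(train_points, train_labels, x, ratio_threshold=0.7):
    """Return the label of the nearest neighbor, or None if the ratio suggests an unseen class."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], x))
    nearest = order[0]
    # Closest training example that belongs to a different class
    other = next(i for i in order if train_labels[i] != train_labels[nearest])
    d_nearest = math.dist(train_points[nearest], x)
    d_other = math.dist(train_points[other], x)
    if d_other == 0.0:
        return None  # x coincides with examples of two classes; treat as ambiguous
    ratio = d_nearest / d_other
    return train_labels[nearest] if ratio <= ratio_threshold else None

points = [[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.3, 4.8]]
labels = ["a", "a", "b", "b"]
near_a = nndr_classify(points, labels, [0.3, 0.1])   # clearly closer to class "a"
between = nndr_classify(points, labels, [2.6, 2.5])  # roughly equidistant
print(near_a, between)
```

The second query lies about as far from one class as from the other, so the ratio is close to one and the example is rejected.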

Chen et al. (2020a) learn a model where the $K$ activations that feed into the softmax function are based on the distance from $K$ learned “reciprocal points” in feature space (figure 6). Each reciprocal point is a sort of anti-prototype that represents data points that do not belong to the $k^{th}$ class. The system is trained with a loss function that consists of two components. The first minimizes the cross-entropy loss of the classification task. The second encourages each of the $K$ reciprocal points to be within a learned margin of the data examples for the remaining $K-1$ classes. In other words, it aims to make all the examples that do not belong to a class lie close to its reciprocal point, and hence have a low softmax activation for that class.

Figure 6. Reciprocal points approach of Chen et al. (2020a). a) Data for three classes represented by circles. The probability of being assigned to the green class increases with the distance from the associated reciprocal point (green square). Here, the distance has been thresholded at a fixed value. b) The probability of belonging to the red class increases with the distance from the red reciprocal point. c) Likewise, the probability of belonging to the yellow class increases with the distance from the yellow point. d) Putting this all together, each position in space is classified according to the reciprocal point that it is furthest from. Positions which are not far enough from all of the reciprocal points (white region in center) are classified as OOD.
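The decision rule illustrated in figure 6 can be sketched as follows. This covers inference only and assumes that the reciprocal points have already been learned; the use of plain Euclidean distance as the activation, the point locations, and the distance threshold are all simplifying assumptions for this illustration.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reciprocal_point_predict(feature, reciprocal_points, min_distance=2.0):
    """Classify by the reciprocal point that is furthest away, or return None for OOD."""
    # Activation for class k grows with the distance from the k-th reciprocal point
    distances = [math.dist(feature, p) for p in reciprocal_points]
    probs = softmax(distances)
    best = max(range(len(distances)), key=distances.__getitem__)
    if distances[best] < min_distance:
        return None, probs  # not far enough from any reciprocal point
    return best, probs

reciprocal_points = [[-1.0, 0.0], [0.5, 0.9], [0.5, -0.9]]  # hypothetical, K = 3
label_far, _ = reciprocal_point_predict([4.0, 0.0], reciprocal_points)
label_center, _ = reciprocal_point_predict([0.0, 0.0], reciprocal_points)
print(label_far, label_center)
```

A feature vector in the central region is close to every reciprocal point and is therefore rejected, which corresponds to the white region in panel d.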

Probability and generative models: OpenHybrid (Zhang et al. 2020) simultaneously trains a classifier and flow-based density estimator. Both of these share a common feature extraction backbone. The classifier uses these features to decide which of the $K$ classes is present. The flow-based density estimator models the probability of the features for the in-distribution classes. At test time, the classifier is used if the estimated probability density is above a threshold, or the example is identified as OOD if not.

Reconstruction: A number of multi-class OOD systems are built on modified versions of the VAE architecture; the latent representation is used to drive the in-distribution classification performance, and the reconstruction quality is used to help understand whether the sample is OOD. This approach has the advantage that the model is less likely to throw away information that could be used to distinguish between in-distribution and out-of-distribution examples, since it must retain enough information to reconstruct a wide variety of data.

A typical example of a reconstruction-based method is that of Sun et al. (2020) who built a modified auto-encoder that learns a class-conditional prior so that different classes are mapped to different positions in the embedding space. In addition to the usual loss terms that encourage the posterior distributions of the data to match their respective priors, and the reconstruction error to be low, there is an additional loss term that penalizes classification errors in the latent space. The data is classified as unknown if its estimated posterior distribution in the embedding space does not match any of the class-conditional means closely enough or if the reconstruction error is unusually large.

Other work by Yoshihashi et al. (2019) modified the OpenMax method in the context of an autoencoder. The encoder produces both an activation vector for classification and an embedding for reconstruction. The system is trained to minimize both the classification and reconstruction loss. The activations are modified as in OpenMax, but now based on both the distance from the mean activation vector and the reconstruction error. Oza and Patel (2019) present a different variation of combining classification in the embedding space of an autoencoder and rejecting OOD examples using reconstruction error.

Consistency: Bergman et al. (2020) and Salehi et al. (2020) both build systems that are based on the discrepancy between a teacher network (which has been trained on another general task like object classification) and a student network, which attempts to clone the activations of the teacher network for in-distribution samples. Since the student network has only seen in-distribution examples, there is no reason why its activations should be similar for out-of-distribution samples and this disparity is used as a signal to detect OOD examples.

Discussion

In this section we’ve provided a brief overview of methods for OOD detection in the presence of class labels. One family of methods uses the basic classifier architecture and adds a mechanism to identify if an example is OOD. A second family of methods builds bespoke models that are specialized to the OOD task. These rely on many of the same principles as OOD detection without class labels; they categorize inputs as OOD based on the distance from in-distribution examples, the estimated probability distributions over real examples, reconstruction quality, and consistency between different models.

The performance of these OOD methods depends to some extent on the performance of the underlying classifier; Vaze et al. (2021) provide evidence that OOD detection performance is positively correlated with classification performance for both thresholding the highest softmax response and for the reciprocal points method of Chen et al. (2020a) (figure 7). This is perhaps unsurprising; if the classes are well separated in feature space, it should be easier to identify examples that lie between these clusters.

Figure 7. Performance of open-set recognition increases as underlying classification performance increases. Circles represent classifying examples as OOD by thresholding softmax probabilities. Squares represent the reciprocal point learning method of Chen et al. (2020a). Adapted from Vaze et al. (2021).

OOD detection with labelled OOD examples

In this section, we consider how we can exploit known OOD examples to support OOD detection. This is sometimes termed outlier exposure. The main difficulty is that OOD examples are by their very nature extremely varied, and so it is difficult to collect a sufficiently representative set.

Approaches to using OOD examples

In principle, we could just build a binary classifier to distinguish outliers from real examples. If we already have class labels, then we can add one or more classes to the classifier that represent the OOD examples and train as normal (see, e.g., Abbasi et al., 2018; Da et al., 2014; Mohseni et al., 2020).

Unfortunately, in practice, directly building an OOD classifier does not work well because the example outliers do not span the entire possible space (Shafaei et al. 2019). Consequently, labelled OOD examples are usually employed to regularize, expand the feature space, or calibrate existing methods. We consider each of these cases in turn.

Regularization: In many cases, we only have a few OOD examples, but nonetheless we can exploit these to regularize or improve OOD methods that do not fundamentally require this data by adding extra terms to the loss function. For example, Ruff et al. (2019) adapted the deep one-class SVM method to the situation where there are a few examples of outliers by adding an extra term to the objective function that encourages these to be far from the center of the hypersphere.
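To illustrate what such an extra term might look like, the sketch below computes a one-class objective on embeddings that have already been produced by a network. Inliers are pulled towards the center of the hypersphere, and known outliers are pushed away by penalizing the inverse of their squared distance. This inverse-distance form, the weight `eta`, and the small constant `eps` are assumptions of this sketch, written in the spirit of Ruff et al. (2019) rather than as their exact objective.

```python
def squared_distance(z, center):
    return sum((a - b) ** 2 for a, b in zip(z, center))

def one_class_loss_with_outliers(inlier_embeddings, outlier_embeddings,
                                 center, eta=1.0, eps=1e-6):
    """One-class loss with an extra term that rewards outliers for being far from the center."""
    n = len(inlier_embeddings) + len(outlier_embeddings)
    # Standard one-class term: inliers should lie close to the center
    pull = sum(squared_distance(z, center) for z in inlier_embeddings)
    # Extra term: small when the known outliers are far from the center
    push = sum(1.0 / (squared_distance(z, center) + eps) for z in outlier_embeddings)
    return (pull + eta * push) / n

center = [0.0, 0.0]
inliers = [[0.1, 0.0], [0.0, -0.2]]
loss_far = one_class_loss_with_outliers(inliers, [[3.0, 4.0]], center)
loss_near = one_class_loss_with_outliers(inliers, [[0.3, 0.4]], center)
print(loss_far, loss_near)
```

The loss is lower when the known outlier is embedded far from the center, so minimizing it encourages exactly the behaviour described above.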

Lee et al. (2017) and Hendrycks et al. (2019) both modify OOD detectors based on probability by adding an extra regularization term that ensures that the probability of the OOD examples is low. Vyas et al. (2018) modified a method based on the consistency across an ensemble of classifiers by adding a term that encourages inconsistency for known OOD examples.

Expanding feature space: One possible problem with modifying a standard classifier to perform multi-class OOD is that the features in the model may not even be able to represent OOD examples. For example, if we built a classifier that distinguishes between jungle, desert, and snowy scenes, it might throw away all information except for the fact that these images are on average, green, yellow, and white respectively. Consequently, if we now present the classifier with a picture of a white house, there is nothing left in the representation to help distinguish it from the snowy scene. To counter this problem, Perera and Patel (2019) also train the common backbone of the system (i.e., the convolutional layers) with out-of-distribution classes, to ensure that the system has extra features that respond to a variety of other things without triggering the in-distribution response.

Calibration: OOD data is sometimes used to calibrate or select hyperparameters for existing OOD systems. For example, the temperature scaling parameter $T$ and step size $\epsilon$ in ODIN (Liang et al., 2018) are chosen using a validation set of OOD examples. Similarly, Kong and Ramanan (2021) use a set of OOD examples as a validation set to select the number of training iterations in OpenGAN (see below).

Synthesizing data

OOD data is hard to acquire in large quantities, and so several studies have experimented with using generative models to synthesize relevant OOD examples. To this end, Neal et al. (2018) use an encoder-decoder GAN to generate images that are similar to those in a multi-class training set, but do not belong to known classes. They synthesize examples that are close in the latent space to in-distribution examples but have low confidence scores on a trained classifier. When they have generated enough OOD examples, they use these to train a $(K+1)$-class classifier.

Ge et al. (2016) generate OOD examples in a multi-class setting by building a conditional GAN to describe the dataset and then generating OOD examples by interpolating between examples from different classes in the latent space (figure 8). Chen et al. (2021) improved their reciprocal point method by using a GAN to generate adversarial examples that were similar to one of the classes, but also close to the associated reciprocal point.

Figure 8. Synthesizing OOD examples. Ge et al. (2016) generate examples by building a conditional GAN and interpolating between known classes. a) Examples of interpolating between classes from MNIST. b) Examples of interpolating between classes from ImageNet.

Kong and Ramanan (2021) build a novel multi-class OOD system called OpenGAN that uses known OOD examples. They exploit the idea that if we build a GAN that can generate the in-distribution dataset, then the discriminator from the GAN might be able to act as an OOD classifier. However, as the generator gets better and better, the classifier will cease to work. Consequently, they train the discriminator to distinguish in-distribution examples both from synthesized examples and from real OOD examples.

Discussion

Since OOD data is by its nature extremely diverse, it is difficult in practice to collect enough data to build a simple in-distribution/out-of-distribution classifier. Consequently, OOD data is more usually used to expand the feature space, regularize, or calibrate existing solutions. Other work has experimented with synthesizing OOD examples with particular properties to enhance the performance of existing OOD systems.

Outlier detection

Finally, we briefly consider outlier detection. Here, we do not have access to a clean in-distribution dataset, but receive data that is predominantly in-distribution with a minority of outliers. The focus is simply on identifying and removing these outliers with a view to using the remaining dataset for another task.

There are a number of classical methods in this area that precede the deep learning era. For example, it’s possible to build a one-class support vector machine that trades off the compactness of the decision boundary (so that it fits tightly around most of the data) against the number of items that are outside that boundary (i.e., the outliers). Alternatively, we can fit a robust probability model such as a t-distribution and simply reject any samples that have very low probability. Liu et al. (2008) proposed the isolation forest. Here the data is partitioned using a set of trees with random splits on random features; outliers should be easy to separate from the rest of the data and so will on average be partitioned at higher levels in the trees.
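The isolation forest idea can be sketched in a few lines. The version below is deliberately simplified: every tree is built on the full dataset rather than a subsample, and the score is the raw average path length without the normalization used by Liu et al. (2008). The toy data, the number of trees, and the depth limit are illustrative assumptions.

```python
import random

def build_tree(points, depth, max_depth, rng):
    """Recursively partition the points with random splits on random features."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}  # leaf node
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}  # cannot split on this feature
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(tree, x):
    """Number of splits needed before x reaches a leaf."""
    depth = 0
    while "split" in tree:
        tree = tree["left"] if x[tree["feature"]] < tree["split"] else tree["right"]
        depth += 1
    return depth

def mean_path_length(forest, x):
    return sum(path_length(tree, x) for tree in forest) / len(forest)

rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(100)]
data.append([8.0, 8.0])  # a single obvious outlier
forest = [build_tree(data, 0, 10, rng) for _ in range(50)]

scores = [mean_path_length(forest, x) for x in data]
outlier_score = scores[-1]
center_score = mean_path_length(forest, [0.0, 0.0])
print(outlier_score, center_score)
```

Points with a short average path length are the candidates for removal; here the isolated point is separated after far fewer splits than a point in the middle of the cluster.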

The above methods work for simple datasets but are unlikely to be successful in high-dimensional complex datasets like those used in modern image classification tasks. Contemporary methods are usually built along similar principles to OOD detectors where we do have access to a clean dataset; however, they either (i) associate a hidden variable with each data point that indicates whether it is an outlier or not and pay a fixed cost for assigning each outlier, or (ii) simply train the model and reject samples that do not conform.

An example of the first strategy is Zhou and Paffenroth (2017), who built a system based on reconstruction within an autoencoder. The cost function is the reconstruction error for inliers or a fixed cost for outliers. The system alternates between optimizing the parameters of the autoencoder and setting the hidden variables for each data point that indicate whether it is an inlier or an outlier.
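The alternating scheme can be illustrated with a deliberately trivial model. In the sketch below, the autoencoder is replaced by the mean of the current inliers, so the "reconstruction error" is just the squared distance to that mean. This stand-in, the fixed outlier cost, and the toy data are assumptions made to keep the example short; it is not the method of Zhou and Paffenroth (2017) itself.

```python
def alternating_outlier_fit(data, outlier_cost=4.0, n_iters=10):
    """Alternate between fitting the model on inliers and reassigning the hidden variables."""
    is_outlier = [False] * len(data)
    center = None
    for _ in range(n_iters):
        inliers = [x for x, flag in zip(data, is_outlier) if not flag]
        if not inliers:
            break  # everything was rejected; keep the previous model
        # Step 1: fit the model (here, just a mean) to the current inliers
        center = sum(inliers) / len(inliers)
        # Step 2: a point becomes an outlier if paying the fixed cost is cheaper
        # than paying its reconstruction error
        new_flags = [(x - center) ** 2 > outlier_cost for x in data]
        if new_flags == is_outlier:
            break  # assignments have converged
        is_outlier = new_flags
    return center, is_outlier

data = [0.9, 1.1, 1.0, 0.8, 1.2, 9.0]
center, flags = alternating_outlier_fit(data)
print(center, flags)
```

After the extreme value is flagged, the model is refit on the remaining points and is no longer distorted by it.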

An example of the second strategy is Wang et al. (2019), who use an auxiliary task in which the data is transformed and the transformation must be estimated. They argue that the inliers will dominate this procedure and it will tend to fail on the outliers.

Conclusion

In this two-part blog, we have considered out-of-distribution detection in a number of different scenarios. In part I, we considered the case where we have a clean set of unlabelled data and must determine if a new sample comes from the same set. In part II, we considered the open-set recognition scenario where we also have class labels. This is particularly relevant to the real-world deployment of classifiers, which will inevitably encounter OOD data. Finally, we considered the cases where we have access to a source of known OOD data, and where the OOD data is mixed in with the main dataset.

In all these situations, there are many competing approaches, and no definitive solution exists at this time. For more information about out-of-distribution detection, the interested reader can consult the surveys by Geng et al. (2020), Ruff et al. (2021), and Yang et al. (2021).