The cost of error is often asymmetric in real-world systems that involve rare classes or events. For example, in medical imaging, incorrectly diagnosing a tumor as benign (a false negative) could lead to cancer being detected later at a more advanced stage, when survival rates are much worse. This would be a higher cost of error than incorrectly diagnosing a benign tumor as potentially cancerous (a false positive). In banking, misclassifying a fraudulent transaction as legitimate may be more costly in terms of financial losses or legal penalties than misclassifying a legitimate transaction as fraudulent (a false positive). In both examples, the critical class is rare, and a missed detection carries a disproportionately high cost.

In such situations, systems are often operated at high true positive rates, even though this may require tolerating high false positive rates. Unfortunately, false positives can undermine user confidence in the system and responding to them could incur other costs (e.g., additional medical imaging tests).

In recent years, considerable progress has been made in imbalanced learning, for example through improved learning objectives that increase the impact of under-represented classes during training [1], re-balancing of classification results with class co-occurrence frequencies [2], and data augmentation [3]. However, the critical-positive setting has rarely been studied.

Our Method

In this work, we present a novel approach to the challenge of minimizing false positives for systems that need to operate at a high true positive rate. We propose a ranking-based regularization (RankReg) approach that is easy to implement, and we show empirically that it not only effectively reduces false positives but also complements conventional imbalanced learning losses.

Motivations

Which optimization option is preferred in operational contexts with critical positives, as shown in Figure 1? The answer is Option 2: with a suitable threshold, depicted by the dashed line, all positives can be detected (100% TPR) while incurring only one false positive (i.e., 25% FPR at 100% TPR), whereas Option 1 requires tolerating two false positives. Therefore, a desirable learning objective should distinguish between the middle column (Option 1) and the right column (Option 2) of Figure 1 and assign a higher loss to the middle column.

Figure 1. Illustration of two different optimization options. Option 2 is preferred in an operational context with critical positives, and our proposed method, RankReg, can capture such an option.

Ranking-based Regularizer

We present a novel, plug-and-play regularization loss as a generic method for inducing a neural network to prioritize minimizing false positives at a high true positive rate. Our key insight is that the false positive rate at a high true positive rate is determined by how the least confident positives are ranked by the network. Given the ranking scores of predicted outputs, the ranking regularizer is simply computed as the normalized sum over the squared rank values of the positive samples.

The proposed regularization term distinguishes between the middle and right columns of Figure 1 and assigns a higher loss to the middle column. In the middle column, the positives have the first and fourth-highest classification scores, producing a regularization loss of 1² + 4² = 17. In the right column, the positives have the second and third-highest classification scores, producing a regularization loss of 2² + 3² = 13. The proposed regularization therefore favours the right column, as desired. Note that if we used the ranks directly instead of squaring them, the regularization loss would be 5 in both cases (1 + 4 = 2 + 3 = 5). Squaring places an increasing penalty on positive samples the lower they are ranked in a sorted list of the network's output scores, which works to push up the scores of the least confident positive samples.
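As a concrete illustration of the quantity being penalized, the sketch below computes the sum of squared positive ranks for a batch of scores. This is a simplification: the function name is our own, the paper's normalization is omitted, and hard ranks are non-differentiable, so the actual training objective would rely on a differentiable ranking surrogate rather than this exact computation.

```python
import numpy as np

def rank_reg_value(scores, labels):
    """Illustrative (non-differentiable) value of the ranking regularizer.

    scores: 1-D array of classification scores (higher = more positive).
    labels: 1-D array in {0, 1}, where 1 marks a critical positive.
    Returns the sum of squared ranks of the positive samples, where rank 1
    is the highest-scoring sample. (The paper additionally normalizes this
    sum; the exact normalization is omitted here.)
    """
    order = np.argsort(-scores)                   # indices sorted by descending score
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)  # rank 1 = most confident sample
    pos_ranks = ranks[labels == 1]
    return int(np.sum(pos_ranks ** 2))

# Worked example from the text: positives at ranks {1, 4} vs. ranks {2, 3}
print(rank_reg_value(np.array([0.9, 0.6, 0.5, 0.7]), np.array([1, 0, 1, 0])))  # 17
print(rank_reg_value(np.array([0.9, 0.7, 0.6, 0.5]), np.array([0, 1, 1, 0])))  # 13
```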

External Memory

In practice, since positive samples may be severely under-represented in the dataset, we compute the regularization term over the union of the batch and an external memory that caches previous positive samples, as illustrated in Figure 2. The buffer enables the regularization term to be computed per batch, even in datasets with severe imbalance ratios, as a batch may contain few (or no) positive samples.

At the start of training, positive samples are accumulated from the incoming batches and added to the buffer up to a fixed maximum capacity. Afterwards, as batches are processed, new positive samples replace the samples in the buffer for which the model is the most certain, i.e., the buffered samples with the maximum scores (see Figure 2). This replacement strategy keeps the hard positives in the buffer and removes positives for which the classifier is already confident.
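Below is a minimal sketch of this replacement policy, assuming a fixed-capacity buffer of (score, sample) pairs; the class name and interface are hypothetical and are not the authors' implementation. At each training step, the regularizer would then be evaluated over the current batch together with the cached positives.

```python
class PositiveBuffer:
    """Illustrative fixed-capacity cache of positive samples (hypothetical helper)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (score, sample) pairs

    def add(self, score, sample):
        if len(self.items) < self.capacity:
            # Fill the buffer until it reaches capacity.
            self.items.append((score, sample))
        else:
            # Replace the buffered positive the model is most certain about,
            # i.e. the one with the maximum score, keeping hard positives around.
            i_max = max(range(len(self.items)), key=lambda i: self.items[i][0])
            self.items[i_max] = (score, sample)

    def samples(self):
        return [s for _, s in self.items]
```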

Figure 2. An illustration of the regularization term and the external positive buffer by example.

Main Results

Extensive experiments are conducted on three public image-based benchmarks: binary imbalanced CIFAR-10, binary imbalanced CIFAR-100, and Melanoma. Table 1 highlights the reduction of FPR at various TPRs on the imbalanced CIFAR-10 dataset. The baselines cover most of the imbalance-oriented loss objectives as well as the existing state-of-the-art method, ALM [4]. It can be seen that our proposed RankReg reduces FPR by margins ranging from 2% to 9%. Similar improvements can be observed on the Melanoma dataset in Table 2.

Binary CIFAR-10, imb. 1:100

| Methods | FPR@98%TPR ↓ | FPR@95%TPR ↓ | FPR@92%TPR ↓ | AUC ↑ |
|---|---|---|---|---|
| BCE | 56.0 | 45.0 | 29.0 | 91.2 |
| +ALM | 52.0 | 35.0 | 21.0 | 93.1 |
| +RankReg | 47.1 | 26.2 | 20.6 | 94.3 |
| S-ML | 59.0 | 40.0 | 26.0 | 91.7 |
| +ALM | 50.0 | 37.0 | 24.0 | 92.5 |
| +RankReg | 45.6 | 31.4 | 29.7 | 93.9 |
| S-FL | 59.0 | 40.0 | 27.0 | 91.7 |
| +ALM | 55.0 | 39.0 | 25.0 | 91.5 |
| +RankReg | 53.3 | 35.4 | 20.7 | 92.8 |
| A-ML | 54.0 | 36.0 | 23.0 | 92.4 |
| +ALM | 45.0 | 35.0 | 23.0 | 92.8 |
| +RankReg | 47.8 | 28.9 | 21.4 | 94.1 |
| A-FL | 50.0 | 38.0 | 24.0 | 92.3 |
| +ALM | 49.0 | 37.0 | 23.0 | 92.8 |
| +RankReg | 50.5 | 28.7 | 20.9 | 94.3 |
| CB-BCE | 89.0 | 72.0 | 59.0 | 78.0 |
| +ALM | 67.0 | 51.0 | 36.0 | 88.1 |
| +RankReg | 48.8 | 29.9 | 24.6 | 93.2 |
| W-BCE | 69.0 | 52.0 | 37.0 | 87.4 |
| +ALM | 69.0 | 48.0 | 31.0 | 89.3 |
| +RankReg | 60.0 | 39.4 | 29.6 | 92.1 |
| LDAM | 65.0 | 48.0 | 34.0 | 89.0 |
| +ALM | 60.0 | 42.0 | 31.0 | 91.0 |
| +RankReg | 42.8 | 25.6 | 23.8 | 95.0 |
| Avg. Δ | 6.0 | 9.7 | 2.8 | 2.3 |
Table 1. Comparison results for binary imbalanced CIFAR-10 showing FPRs at {98%, 95%, 92%} TPRs. “+ALM” and “+RankReg” are shorthand for BaseLoss+ALM and BaseLoss+RankReg, respectively.
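For readers wishing to reproduce metrics of this kind, FPR at a fixed TPR can be read off the ROC curve. The hedged sketch below shows one straightforward way to do so with scikit-learn; it is not necessarily the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_curve

def fpr_at_tpr(labels, scores, target_tpr=0.95):
    """FPR at the operating point where the TPR first reaches target_tpr."""
    fpr, tpr, _ = roc_curve(labels, scores)
    idx = int(np.searchsorted(tpr, target_tpr))   # tpr is non-decreasing
    return float(fpr[min(idx, len(fpr) - 1)])
```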

Melanoma, imb. 1:170

| Methods | FPR@98%TPR ↓ | FPR@95%TPR ↓ | FPR@92%TPR ↓ | FPR@90%TPR ↓ | AUC ↑ |
|---|---|---|---|---|---|
| BCE | 49.8 | 45.9 | 38.6 | 35.5 | 85.7 |
| +ALM | 49.9 | 41.8 | 40.0 | 37.7 | 85.6 |
| +RankReg | 49.4 | 37.9 | 33.9 | 31.6 | 86.8 |
| S-ML | 46.6 | 42.8 | 38.4 | 37.4 | 85.3 |
| +ALM | 51.3 | 40.5 | 39.8 | 36.2 | 83.5 |
| +RankReg | 54.6 | 42.4 | 36.1 | 34.4 | 86.3 |
| S-FL | 59.0 | 47.3 | 44.4 | 39.5 | 83.8 |
| +ALM | 47.8 | 42.7 | 39.2 | 38.1 | 84.0 |
| +RankReg | 56.6 | 37.8 | 31.2 | 29.8 | 86.1 |
| A-ML | 47.5 | 42.9 | 40.4 | 36.6 | 85.4 |
| +ALM | 51.0 | 41.5 | 37.5 | 37.1 | 83.7 |
| +RankReg | 58.3 | 40.8 | 36.7 | 33.9 | 86.2 |
| A-FL | 55.6 | 45.0 | 42.7 | 41.2 | 84.4 |
| +ALM | 49.0 | 42.4 | 40.1 | 38.1 | 83.6 |
| +RankReg | 48.0 | 36.2 | 30.7 | 28.8 | 86.3 |
| CB-BCE | 67.2 | 59.5 | 35.7 | 33.2 | 82.6 |
| +ALM | 60.8 | 59.5 | 46.3 | 45.8 | 81.5 |
| +RankReg | 57.8 | 44.9 | 35.7 | 34.7 | 83.7 |
| W-BCE | 69.0 | 52.0 | 37.0 | 32.1 | 87.4 |
| +ALM | 66.0 | 48.0 | 31.0 | 30.7 | 89.3 |
| +RankReg | 56.4 | 41.1 | 33.0 | 30.5 | 90.9 |
| LDAM | 59.7 | 48.2 | 46.2 | 39.0 | 83.4 |
| +ALM | 62.7 | 47.7 | 43.3 | 40.7 | 81.5 |
| +RankReg | 65.6 | 47.5 | 45.7 | 43.9 | 81.7 |
Table 2. Comparison results for Melanoma dataset showing FPRs at {98%, 95%, 92%, 90%} TPRs. “+ALM” and “+RankReg” are shorthand for BaseLoss+ALM and BaseLoss+RankReg, respectively.

On Multi-class Classification

RankReg can be used in multi-class settings by ranking the critical samples higher than others based on the output probability for each class. Table 3 shows additional results in the multi-class setting using long-tailed CIFAR-10. We report the average error rate on the other classes after setting thresholds that achieve {80, 90}% TPR on the critical class. Our method outperforms the baselines under the 1:100 imbalance ratio and performs comparably under the 1:200 setting.

| Methods | LT-CIFAR10 imb. 100: Error@80%TPR ↓ | Error@90%TPR ↓ | Acc. | LT-CIFAR10 imb. 200: Error@80%TPR ↓ | Error@90%TPR ↓ | Acc. |
|---|---|---|---|---|---|---|
| CE | 29.8 | 34.7 | 70.4 | 37.8 | 42.4 | 64.0 |
| CE+ALM | 28.9 | 33.9 | 70.9 | 36.1 | 39.9 | 65.1 |
| CE+RankReg | 26.7 | 29.3 | 71.6 | 36.7 | 37.8 | 65.0 |
Table 3. Multi-class experiments using long-tailed CIFAR-10.
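As a rough illustration of how such a metric could be computed, the sketch below thresholds the critical class's probability to reach a target TPR and then measures the error rate on the remaining samples. The helper name and the exact thresholding rule are our own assumptions and reflect our reading of the protocol, not necessarily the paper's evaluation code.

```python
import numpy as np

def error_at_tpr(probs, labels, critical_class, target_tpr=0.9):
    """Illustrative metric (hypothetical helper): error rate on non-critical
    samples once the critical-class threshold reaches target_tpr.

    probs:  (N, C) array of per-class probabilities.
    labels: (N,) array of ground-truth class indices.
    """
    crit_scores = probs[:, critical_class]
    is_crit = labels == critical_class
    # Threshold chosen so that roughly target_tpr of critical samples exceed it.
    thresh = np.quantile(crit_scores[is_crit], 1.0 - target_tpr)
    flagged = crit_scores >= thresh
    # Samples not flagged as critical get their best non-critical class.
    masked = probs.copy()
    masked[:, critical_class] = -np.inf
    other_pred = np.argmax(masked, axis=1)
    preds = np.where(flagged, critical_class, other_pred)
    # Average error over the non-critical samples only.
    return float(np.mean(preds[~is_crit] != labels[~is_crit]))
```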

Robustness to Label Noise

Real-world datasets often contain mislabeled data. To evaluate the robustness of our approach in the presence of label noise, we perform additional experiments in which we incrementally flip a proportion η of the training labels. Figure 3 shows how FPR@{98, 95, 92}%TPR (left to right) degrades as a function of η in the range [0, 0.5], using BCE as the base loss. These results suggest that RankReg is as robust to label noise as the state-of-the-art approach [4].

Figure 3. Label noise experiments using BCE as base loss on CIFAR10. We report FPR@{98, 95, 92}%TPR (left to right) with varied noise ratios.
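For completeness, symmetric label noise of this kind can be injected with a small helper like the following; the function is hypothetical and only illustrates the experimental setup, not the authors' code.

```python
import numpy as np

def flip_binary_labels(labels, eta, seed=0):
    """Flip a random fraction eta of binary labels to simulate symmetric label noise."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    n_flip = int(round(eta * len(noisy)))
    idx = rng.choice(len(noisy), size=n_flip, replace=False)
    noisy[idx] = 1 - noisy[idx]  # 0 -> 1 and 1 -> 0
    return noisy
```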

Conclusion

The problem setting of critical rare positives is of great practical importance yet has been surprisingly under-studied by the research community. Our proposed RankReg is a general, plug-and-play method for this problem and has demonstrated promising results on public benchmarks.

Going forward, we would like to explore the combined impact of using RankReg with other lines of research, such as data augmentation and weighted sampling, test its applicability to other data domains (e.g., time series), and extend it to other tasks (e.g., imbalanced regression). If you are interested, we would be excited to chat more about it at CVPR 2023.

References