Borealis AI and RBC are researching tools that can help developers govern machine learning models in a standardized, scalable, robust and automated way.
With the explosive growth of AI products, we are beginning to see how AI can transform people’s lives. From online shopping to healthcare, and from banking to climate change, the potential for AI is almost limitless. However, examples of misused data, biased algorithms and inaccurate decisions are fueling a general mistrust and skepticism of the technology. For society to embrace AI and ensure broader adoption of these technologies, we need to effectively evaluate model reliability and safety. Outside the lab, the range of consequences is far wider if our models don’t perform as intended.
Building robust model governance tools is critical to model performance. They give us more ways to assess behaviour, and to ultimately gain a greater level of trust in AI. One example of this can be found in facial recognition.
A recent RBC Disruptors podcast exposed the dangers presented by biased facial recognition systems. It is well documented that machine learning algorithms tend to perform worse on images of women and people of colour (Buolamwini and Gebru, 2018). To some extent this is due to biased datasets; however, we must be vigilant at all stages of the ML lifecycle. We need to rethink how we design, test, deploy, and monitor machine learning algorithms. Performance across diverse groups should be a property that models satisfy before they are deployed.
This article explains how we are researching automated model validation tools at Borealis AI. While facial recognition tasks are not currently in our model validation pipeline, the examples shown can be generalized across datasets and machine learning tasks: you just need to define the property you want to detect.
Defining and testing properties
We build AI systems with the intent of capturing patterns in datasets. The patterns they learn, however, are determined by the dataset and the training algorithm used. Without explicit regularization, models can take shortcuts to achieve their learning objectives, and the result is illusory performance and undesirable behaviours.
One solution is to find more data, or find cleaner data. But that can be expensive, even if it is possible. What’s more, we don’t always recognize when data is contaminated.
The next best option is to ensure your model will not behave adversely across a range of potential scenarios. But this is a bit of a “chicken and egg” situation: how can you do that without deploying your model in the real world first, if you only have so much data? The proactive answer is to run extensive tests. We begin from community-accepted definitions of desirable model behaviour: for instance, good models should make consistent predictions around an input, and avoid making predictions based on protected attributes. We then run a search over the inputs and outputs to find violations of these properties, and return them for analysis. Actions can then be taken to improve the model, for example by retraining it to account for the violations.
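The workflow above (define a property, search for violations, return them for analysis) can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `run_property_tests`, `toy_model` and `ignores_protected_attribute` are hypothetical names, the search is a plain loop over a handful of inputs rather than a real solver, and the toy model is deliberately flawed so that the test has something to find.

```python
from typing import Callable, Dict, List, Sequence

Input = Sequence[float]
Model = Callable[[Input], int]
# A property returns True when the model behaves acceptably on input x.
Property = Callable[[Model, Input], bool]


def run_property_tests(
    model: Model,
    properties: Dict[str, Property],
    inputs: List[Input],
) -> Dict[str, List[Input]]:
    """Map each property name to the inputs on which the model violates it."""
    return {
        name: [x for x in inputs if not prop(model, x)]
        for name, prop in properties.items()
    }


def toy_model(x: Input) -> int:
    """Hypothetical model: approves (1) above an income threshold that,
    undesirably, depends on a binary protected attribute."""
    protected, income = x
    threshold = 50.0 if protected == 0 else 60.0
    return int(income > threshold)


def ignores_protected_attribute(model: Model, x: Input) -> bool:
    """Holds if flipping the protected attribute leaves the prediction unchanged."""
    flipped = (1 - x[0], x[1])
    return model(x) == model(flipped)


inputs = [(0, 40.0), (0, 55.0), (1, 55.0), (1, 70.0)]
report = run_property_tests(
    toy_model,
    {"ignores_protected_attribute": ignores_protected_attribute},
    inputs,
)
print(report)
# {'ignores_protected_attribute': [(0, 55.0), (1, 55.0)]}
```

The returned inputs are the ones a practitioner would inspect, and possibly feed back into retraining.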
At a high level, that is how our validation platform is being developed. Each test is essentially a mathematical expression which consists of the model, plus the desired property for which it is being assessed. One example is a test for adversarial robustness, as shown in Figure 1.A. Here we are interested in knowing whether a tiny nudge (of size epsilon) to an input data point X can completely change a prediction made by the model. Having defined our property, we then run a solver over the expression to see if any failure examples can be found. If so, we return them to the user as counterexamples that fail this adversarial robustness test.
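To make the adversarial robustness test concrete, here is a minimal sketch. It swaps the solver for naive random sampling inside a box of half-width epsilon around the input, which is our simplification and not the method used in the platform. The function `find_adversarial_example` and the linear `toy_classifier` are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence

Model = Callable[[Sequence[float]], int]


def find_adversarial_example(
    model: Model,
    x: Sequence[float],
    epsilon: float,
    n_trials: int = 1000,
    seed: int = 0,
) -> Optional[List[float]]:
    """Randomly sample points within epsilon of x (per coordinate).

    Returns a perturbed point whose prediction differs from model(x),
    or None if no such point was sampled. None does not prove robustness;
    it only means this search found no counterexample.
    """
    rng = random.Random(seed)
    original = model(x)
    for _ in range(n_trials):
        candidate = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
        if model(candidate) != original:
            return candidate
    return None


def toy_classifier(x: Sequence[float]) -> int:
    """Hypothetical linear classifier: class 1 when x0 + x1 > 1."""
    return int(x[0] + x[1] > 1.0)


near_boundary = [0.5, 0.49]      # sits 0.01 below the decision boundary
far_from_boundary = [0.0, 0.0]   # no nudge of size 0.05 can reach the boundary

failure = find_adversarial_example(toy_classifier, near_boundary, epsilon=0.05)
safe = find_adversarial_example(toy_classifier, far_from_boundary, epsilon=0.05)
print("counterexample near boundary:", failure)
print("counterexample far from boundary:", safe)
```

A real solver differs in one important way: it can certify that no counterexample exists in the region, whereas sampling can only report that it did not find one.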
Tests for other properties can be crafted in the same way, where the underlying theme always relies on a region around a point. Varying the shape of the region corresponds to different properties, such as in Figure 1.B. Our current research work involves developing methods capable of constructing these shapes to test for notions such as fairness.
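One way to write down this shared structure is shown below. The notation is ours and is not taken from the platform: f is the model, R(x) is the region around the point x, the norm is left unspecified, and G and the set of protected-attribute values \mathcal{A} follow the description in Figure 1.B.

```latex
% Generic property: the prediction is constant over the region R(x).
\[
\forall x' \in R(x): \quad f(x') = f(x)
\]

% Two choices of region, giving two different properties.
\[
R_{\text{robust}}(x) = \{\, x' : \lVert x' - x \rVert \le \epsilon \,\},
\qquad
R_{\text{fair}}(x) = \{\, G(x, a) : a \in \mathcal{A} \,\}
\]
```

Under this reading, a failed test is a witness x' in R(x) with f(x') different from f(x), and changing the property only changes the definition of R.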
Changing the shape of the region results in a more complex search space for our solver to explore. As such, future research may involve looking into more powerful solvers: for instance, by using symmetry to avoid redundant areas of the search space.
Figure 1. Testing properties of models over different neighbourhoods. (A) An adversarially robust classifier should have its decisions be stable against small perturbations. To test this, we define a neighbourhood around a point x, and look for points inside this neighbourhood that change the model’s outputs. (B) One definition of fairness states that similar individuals should be treated similarly, regardless of protected attributes like race and sex. For a given input x, we can test this by transforming only the protected attributes, and then examining if a model’s output changes. Because there may be no simple mathematical expression for transformation of race, we can construct an auxiliary model G to generate these samples. This G implicitly transforms the neighbourhood we search over (Robey et al., 2020).
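The fairness test in panel (B) can be sketched as follows. Everything here is a toy assumption: `toy_generator` stands in for the auxiliary model G with a hand-written rule (in practice G would be learned), the data has a single proxy feature tied to group membership, and `toy_model` is a hypothetical classifier that never reads the group directly.

```python
from typing import Callable, List, Sequence, Tuple

Individual = Tuple[int, float, float]  # (group, proxy_feature, income)
Model = Callable[[Individual], int]
Generator = Callable[[Individual, int], Individual]


def toy_model(x: Individual) -> int:
    """Hypothetical model under test: ignores `group` but relies on the proxy."""
    _group, proxy, income = x
    return int(income + 10.0 * proxy > 60.0)


def toy_generator(x: Individual, target_group: int) -> Individual:
    """Hypothetical stand-in for G: rewrites the group and the proxy tied to it.

    Assumes, for illustration only, that moving to group 1 shifts the proxy
    by +1.0 and moving to group 0 shifts it by -1.0.
    """
    group, proxy, income = x
    if target_group == group:
        return x
    shift = 1.0 if target_group == 1 else -1.0
    return (target_group, proxy + shift, income)


def naive_flip(x: Individual, target_group: int) -> Individual:
    """Changes only the group label and leaves every other feature alone."""
    return (target_group, x[1], x[2])


def find_fairness_violations(
    model: Model,
    generator: Generator,
    individuals: Sequence[Individual],
    groups: Sequence[int],
) -> List[Tuple[Individual, Individual]]:
    """Return (original, counterfactual) pairs that receive different predictions."""
    violations = []
    for x in individuals:
        for g in groups:
            counterfactual = generator(x, g)
            if model(counterfactual) != model(x):
                violations.append((x, counterfactual))
    return violations


people = [(0, 0.0, 55.0), (1, 1.0, 55.0), (0, 0.0, 80.0)]
with_g = find_fairness_violations(toy_model, toy_generator, people, groups=(0, 1))
with_flip = find_fairness_violations(toy_model, naive_flip, people, groups=(0, 1))
print("violations found with G:", with_g)
print("violations found with a naive flip:", with_flip)
```

In this toy setup the naive flip finds nothing because the model never reads the group label, while the generator exposes the dependence that runs through the proxy feature.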
As an ML lab within the highly regulated financial services industry, we need to demonstrate that we meet a set of strict compliance and regulatory standards. RBC’s long-standing model risk management policies and procedures form a great basis for managing the risks of AI. However, our testing methodologies have to keep pace with AI’s rapid evolution to ensure that we continue to deploy cutting-edge models safely.
Borealis AI recently launched a new online hub, RESPECT AI, providing resources to the AI community and business leaders to help them adopt responsible, safe and ethical AI practices. This program includes a focus on model governance, and features interviews with leading industry experts as well as lessons from our own research. We will continue to share our findings with our peers in the AI community, particularly in non-regulated industries where governance is far less mature.
AI is undeniably one of the biggest opportunities for our economy and innovation. However, with this opportunity come risks that are too great to ignore. To meet these challenges, the AI community needs to work alongside industry, regulators, and government to push the boundaries and adapt or develop new AI validation methods tailored to the complexities of this space.
Buolamwini, Joy and Timnit Gebru. “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification.” FAT (2018).
Robey, Alexander et al. “Model-Based Robust Deep Learning: Generalizing to Natural, Out-of-Distribution Data.” (2020).