With the explosive growth of AI products, we are beginning to see how AI can transform people’s lives. From online shopping to healthcare, and banking to climate change, the potential for AI is almost limitless. However, examples of misuse of data, biased algorithms and inaccurate decisions are fueling a general mistrust and skepticism of the technology. For society to embrace AI and ensure broader adoption of these technologies, we need to effectively evaluate model reliability and safety. Outside the lab, the range of consequences is far wider if our models don’t perform as intended.

Building robust model governance tools is critical to model performance. They give us more ways to assess behaviour, and to ultimately gain a greater level of trust in AI. One example of this can be found in facial recognition.

A recent RBC Disruptors podcast exposed the dangers presented by biased facial recognition systems. It is widely known that machine learning algorithms tend to perform worse on images of women and people of colour [1]. To some extent this is due to biased datasets; however, we must be vigilant at all stages of the ML lifecycle. We need to rethink how we design, test, deploy, and monitor machine learning algorithms. Performance across diverse groups should be a property that models satisfy before they are deployed.

This article explains how we are researching automated model validation tools at Borealis AI. While facial recognition tasks are not currently in our model validation pipeline, the examples shown can be generalized across datasets and machine learning tasks: you just need to define the property you want to detect.

We build AI systems with the intent of capturing patterns inside of datasets. The patterns they learn, however, are determined by the dataset and the training algorithm used. Without explicit regularization, models can take shortcuts to achieve their learning objectives, and the result is illusory performance and undesirable behaviors.

One solution is to find more data, or find cleaner data. But that can be expensive, even if it is possible. What’s more, we don’t always recognize when data is contaminated.

The next best option is to ensure your model will not act adversely against a range of potential scenarios. But this is a bit of a “chicken and egg” situation: how can you do that without deploying your model in the real world first, if you only have so much data? The proactive answer is to run extensive tests. We begin from community-accepted definitions of desirable model behavior: for instance, good models should have consistent predictions around an input, and avoid making predictions based on protected attributes. We then run a search over the inputs and outputs to find violations of these properties, and return them for analysis. Actions can then be taken to improve the model, for example by retraining it to account for the violations.

At a high level, that is how our validation platform is being developed. Each test is essentially a mathematical expression which consists of the model, plus the desired property for which it is being assessed. One example is a test for adversarial robustness, as shown in figure 1.A. Here we are interested in knowing whether a tiny nudge (of size epsilon) to an input data point X can completely change a prediction made by the model. Having defined our property, we then run a solver over the expression to see if any failure examples can be found. If so, we return them to the user as examples of having failed this adversarial robustness test.
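As a rough illustration of such a test, the sketch below checks adversarial robustness for a toy linear classifier. Everything in it is hypothetical: the `predict` model, the weights, and the choice of an L-infinity ball of radius `epsilon` are assumptions made for the example, and an exhaustive search over the corners of the ball stands in for the solver. That search is exhaustive only because the model is linear, so its score is extremised at a corner of the box.

```python
from itertools import product

def predict(weights, bias, x):
    """Hypothetical linear classifier: class 1 if w.x + b > 0, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def find_adversarial_example(weights, bias, x, epsilon):
    """Look for a point within epsilon of x (L-infinity) that flips the prediction.

    Only the corners of the box are tested, which is sufficient for a linear
    model. Returns a violating input, or None if the property holds.
    """
    original = predict(weights, bias, x)
    for signs in product((-1.0, 1.0), repeat=len(x)):
        candidate = [xi + s * epsilon for xi, s in zip(x, signs)]
        if predict(weights, bias, candidate) != original:
            return candidate
    return None

weights, bias = [1.0, -2.0], 0.0
x = [0.5, 0.2]  # score is 0.1, so the model predicts class 1

print(find_adversarial_example(weights, bias, x, epsilon=0.01))  # None: robust at this radius
print(find_adversarial_example(weights, bias, x, epsilon=0.1))   # a failure example
```

For non-linear models the corner check is no longer sufficient, which is why a dedicated solver is needed in practice.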

Tests for other properties can be crafted in the same way, where the underlying theme always relies on a region around a point [2]. Varying the shape of the region corresponds to different properties such as in Figure 1.B. Our current research work involves developing methods capable of coming up with these shapes to test for notions such as fairness.

Changing the shape of the region results in a more complex search space for our solver to explore. As such, future research may involve looking into more powerful solvers: for instance, by using symmetry to avoid redundant areas of the search space.

As an ML lab within the highly regulated financial services industry, we need to demonstrate that we meet a set of strict compliance and regulatory standards. RBC’s long-standing model risk management policies and procedures form a great basis for managing the risks of AI. However, our testing methodologies have to keep pace with AI’s rapid evolution to ensure that we continue to deploy cutting-edge models safely.

Borealis AI recently launched a new online hub, RESPECT AI, providing resources to the AI community and business leaders to help them adopt responsible, safe and ethical AI practices. This program includes a focus on model governance and includes interviews with leading industry experts as well as learnings from our own research. We will continue to share our findings with our peers in the AI community, particularly in non-regulated industries where governance is far less mature.

AI is undeniably one of the biggest opportunities for our economy and innovation. However, with this opportunity comes risks that are too great to ignore. In order to meet these challenges, the AI community needs to work alongside industry, regulators, and government to push the boundaries and adapt or develop new AI validation methods tailored to the complexities of this space.

[1] *Buolamwini, Joy and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." FAT (2018).*

[2] *Robey, Alexander et al. “Model-Based Robust Deep Learning: Generalizing to Natural, Out-of-Distribution Data.” (2020).*

*The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.*

**Valentine Goddard (VG):**

The debate about the social impacts of AI has evolved as technology has evolved.

However, the pandemic is rapidly increasing the Digital Gap, which is often linked to socioeconomic inequities. The conversation has shifted from an overemphasis on the economic efficiencies of AI to one about social resilience in the age of AI. We’re facing a historical opportunity to look into what kind of New Normal we want.

**VG:**

To start, I think diversity is key. We need to improve the diversity of perspectives that go into the development of AI and data analytics. I think awareness about the importance of this is slowly increasing.

Meanwhile, we’re also seeing social issues rise up the corporate agenda. An increasing number of organizations have created roles responsible for AI Ethics, but there is still much debate about what that means exactly.

I’m also encouraged by efforts to ensure that developers and employees are able to raise concerns about the fairness of the algorithms they are creating. That said, we still encounter businesses that could do more to properly support their applied research teams in this area.

**VG:**

I have noticed an increasing desire on the part of businesses to participate in ‘AI for Good’ or ‘Data for Good’ initiatives. These are a great first step towards improving awareness around the social impacts of AI, but they are often just ‘one-offs’. What we need is a more sustainable approach.

That is why I have been advocating for greater partnership between the AI ecosystem and Civil Society Organizations (CSOs) as a way to drive participation and fairness in the development and implementation of AI. It will likely require more funding for CSOs as they move up the AI maturity curve, but this is not just about funding.

For those at the leading edge of AI development and governance, the inclusion of CSOs can deliver massive benefits – they can help bring diversity of perspective (seven in ten employees at socially focused non-profits are women); they can help socialize the wider benefits of AI within their communities (often those most underserved by technology), and they can help identify emerging social issues related to the use of AI (and maybe even help solve them before they become problems).

There is also a growing body of research on the value of including CSOs in business decision-making. Academics suggest it can increase trust and accelerate the responsible adoption of new technologies. Others stress the value inherent in working with CSOs to collect relevant, high-quality data that can lead to more robust and socially-beneficial results. Many simply highlight the need to include more democratic processes in regulatory innovation.

**VG:**

That largely depends on the sector you are working in and the stakeholders you touch, but the net should be cast wide. Before the pandemic disrupted our normal lives, my organization – AI Impact Alliance – conducted workshops throughout the year that led to the adoption of international public policy recommendations on the role of civil society, the Arts, in digital governance by the United Nations. It’s essential to work towards more inclusive AI policy recommendations.

**VG:**

Two approaches are gaining greater adoption in the market. The first is to conduct ongoing social impact assessments on the AI models and technologies you are developing and implementing. There is no single guide to what goes into a social impact assessment, so companies will need to work with stakeholders and others to define what standards and KPIs they will measure.

We are also seeing greater adoption of social return on investment criteria, both within decision-making and in corporate reporting. Again, the standards vary depending on the sector and market. But they can be a useful tool for measuring progress.

**VG:**

I would argue that public institutions need to take a lead role in supporting education and digital literacy – particularly around the ethical, social, legal, economic and political implications of AI. They need to be encouraging the adoption of democratic AI processes and normative frameworks. They need to be tackling the roots of the digital divide. And they need to be supporting new forms of democratic participation and civic engagement in the field of AI.

The government can also play a role in incentivising businesses for responsible behaviour. They could take a heavy hand – making social impact assessments compulsory, for example – or they could take a more collaborative approach by ensuring the right stakeholders are at the table and that social return on investment is recognized and valued.

**VG:**

At an individual level, we’ve enjoyed tremendous support from developers, researchers and scientists who want to contribute to the debate about the impact of AI on society. I have also seen AI developers and researchers donate their time directly to CSOs to help build data literacy and management capacity.

I think more broadly, the AI community needs to continue to focus on addressing the root causes of digital inequality. I think we need to be aware of how the models we are developing for the digital economy can sometimes become a driver of ethical problems. And we need to be more supported when we see problems emerging.

I think the AI community wants to build socially responsible models and technologies. They just need the tools, frameworks and encouragement to go do it.

Valentine Goddard is the founder and executive director of AI Impact Alliance, an independent non-profit organization operating globally, whose mission is to facilitate an ethical and responsible implementation of artificial intelligence. She is a member of the United Nations Expert Groups on *The Role of Public Institutions in the Transformative Impact of New Technologies*, and on a *“Socially just transition towards sustainable development: The role of digital technologies on social development and well-being of all”*. Ms. Goddard sits on several committees related to the ethical and social impact of AI and contributes to public policy recommendations related to the ethical and normative framework of AI.

This tutorial concerns the *Boolean satisfiability* or *SAT* problem. We are given a formula containing binary variables that are connected by logical relations such as $\text{OR}$ and $\text{AND}$. We aim to establish whether there is any way to set these variables so that the formula evaluates to $\text{true}$. Algorithms that are applied to this problem are known as *SAT solvers*.

The tutorial is divided into three parts. In part I, we introduce Boolean logic and the SAT problem. We discuss how to transform SAT problems into a standard form that is amenable to algorithmic manipulation. We categorize types of SAT solvers and present two naïve algorithms. We introduce several SAT constructions, which can be thought of as common sub-routines for SAT problems. Finally, we present some applications; the Boolean satisfiability problem may seem abstract, but as we shall see it has many practical uses.

In part II of the tutorial, we will dig more deeply into the internals of modern SAT solver algorithms. In part III, we recast SAT solving in terms of message passing on factor graphs. We also discuss satisfiability modulo theory (SMT) solvers, which extend the machinery of SAT solvers to solve more general problems involving continuous variables.

The relevance of SAT solvers to machine learning is not immediately obvious. However, there are two direct connections. First, machine learning algorithms rely on optimization. SAT can also be considered an optimization problem and SAT solvers can find global optima without relying on gradients. Indeed, in this tutorial, we'll show how to fit both neural networks and decision trees using SAT solvers.

Second, machine learning techniques are often used as components of SAT solvers; in part II of this tutorial, we'll discuss how reinforcement learning can be used to speed up SAT solving, and in part III we will show that there is a close connection between factor graphs and SAT solvers and that belief propagation algorithms can be used to solve satisfiability problems.

In this section, we define a set of *Boolean operators* and show how they are combined into *Boolean logic formulae*. Then we introduce the *Boolean satisfiability problem*.

*Boolean operators* are standard functions that take one or more binary variables as input and return a single binary output. Hence, they can be defined by *truth tables* in which we enumerate every combination of inputs and define the output for each (figure 1). Common logical operators include:

- The $\text{OR}$ operator is written as $\lor$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if one or both of the inputs are $\text{true}$ and returns $\text{false}$ otherwise.
- The $\text{AND}$ operator is written as $\land$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if both the inputs are $\text{true}$ and $\text{false}$ otherwise.
- The $\text{IMPLICATION}$ operator is written as $\Rightarrow$ and evaluates whether the two inputs are consistent with the statement 'if $x_{1}$ then $x_{2}$'. The statement is only disobeyed when $x_{1}$ is $\text{true}$ and $x_{2}$ is $\text{false}$ and so implication returns $\text{false}$ for this combination of inputs and $\text{true}$ otherwise.
- The $\text{EQUIVALENCE}$ operator is written as $\Leftrightarrow$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if the two inputs are the same and returns $\text{false}$ otherwise.
- The $\text{NOT}$ operator is written as $\lnot$ and takes one input. It returns $\text{true}$ if the input $x_{1}$ is $\text{false}$ and vice-versa. We refer to $\lnot x_{1}$ as the *complement* of $x_{1}$.
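The operators above can be written as small Python functions and their truth tables enumerated directly. This is only an illustrative sketch; the function names and the `truth_table` helper are chosen for this example.

```python
from itertools import product

def OR(x1, x2):
    return x1 or x2

def AND(x1, x2):
    return x1 and x2

def IMPLICATION(x1, x2):
    # Only false when x1 is true and x2 is false.
    return (not x1) or x2

def EQUIVALENCE(x1, x2):
    return x1 == x2

def NOT(x1):
    return not x1

def truth_table(op):
    """Return [(x1, x2, op(x1, x2)), ...] for all four input combinations."""
    return [(x1, x2, op(x1, x2)) for x1, x2 in product((False, True), repeat=2)]

for op in (OR, AND, IMPLICATION, EQUIVALENCE):
    print(op.__name__)
    for x1, x2, out in truth_table(op):
        print(f"  {x1!s:5} {x2!s:5} -> {out}")
```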

A *Boolean logic formula* $\phi$ takes a set of $I$ variables $\{x_{i}\}_{i=1}^{I}$ with $x_{i}\in\{\text{false},\text{true}\}$ and combines them using Boolean operators, returning $\text{true}$ or $\text{false}$. For example:

\begin{equation}

\phi:= (x_{1}\Rightarrow (\lnot x_{2}\land x_{3})) \land (x_{2} \Leftrightarrow (\lnot x_{3} \lor x_{1})). \tag{1}

\end{equation}

For any combination of input variables $x_{1},x_{2},x_{3}\in\{\text{false},\text{true}\}$, we could evaluate this formula and see if it returns $\text{true}$ or $\text{false}$. Notice that even for this simple example with three variables it is hard to see what the answer will be by inspection.
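With only three variables we can simply try every input. The short sketch below evaluates equation 1 for all eight combinations, reading the second bracket as $x_{2} \Leftrightarrow (\lnot x_{3} \lor x_{1})$, and lists those for which it returns $\text{true}$.

```python
from itertools import product

def phi(x1, x2, x3):
    """Equation 1: (x1 => (not x2 and x3)) and (x2 <=> (not x3 or x1))."""
    left = (not x1) or ((not x2) and x3)
    right = x2 == ((not x3) or x1)
    return left and right

satisfying = [values for values in product((False, True), repeat=3) if phi(*values)]
print(satisfying)
# [(False, False, True), (False, True, False)]
```

Two of the eight combinations make the formula $\text{true}$, which would be difficult to spot by eye.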

The *Boolean satisfiability problem* asks whether there is *at least one* combination of binary input variables $x_{i}\in\{\text{false},\text{true}\}$ for which a Boolean logic formula returns $\text{true}$. When this is the case, we say the formula is *satisfiable*.

A SAT solver is an algorithm for establishing satisfiability. It takes the Boolean logic formula as input and returns $\text{SAT}$ if it finds a combination of variables that can satisfy it or $\text{UNSAT}$ if it can demonstrate that no such combination exists. In addition, it may sometimes return without an answer if it cannot determine whether the problem is $\text{SAT}$ or $\text{UNSAT}$.

To solve the SAT problem, we first convert the Boolean logic formula to a standard form that is more amenable to algorithmic manipulation. Any formula can be re-written as a conjunction of disjunctions (i.e., the logical $\text{AND}$ of statements containing $\text{OR}$ relations). This is known as *conjunctive normal form*. For example:

\begin{equation}\label{eq:example_cnf}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\lnot x_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \lnot x_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \lnot x_{3}). \tag{2}

\end{equation}

Each term in brackets is known as a *clause* and combines together variables and their complements with a series of logical $\text{OR}$s. The clauses themselves are combined via $\text{AND}$ relations.
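One reason this form is convenient is that it maps onto a very simple data structure. The sketch below stores equation 2 as a list of clauses, where each clause is a list of signed integers: `k` stands for $x_{k}$ and `-k` for $\lnot x_{k}$. This is the convention used by the DIMACS CNF file format, but it is an assumption of this example rather than something the tutorial prescribes, and `evaluate_cnf` is a hypothetical helper.

```python
# Equation 2 as a list of clauses; literal k means x_k and -k means NOT x_k.
cnf = [[1, 2, 3], [-1, 2, 3], [1, -2, 3], [1, 2, -3]]

def evaluate_cnf(cnf, assignment):
    """Return True if every clause contains at least one true literal.

    `assignment` maps each variable index to a bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

print(evaluate_cnf(cnf, {1: True, 2: True, 3: True}))     # True
print(evaluate_cnf(cnf, {1: False, 2: False, 3: False}))  # False: the first clause fails
```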

The *Tseitin transformation* converts an arbitrary logic formula to conjunctive normal form. The approach is to (i) associate new variables with sub-parts of the formula using logical equivalence relations, (ii) restate the formula by logically $\text{AND}$-ing these new variables together, and finally (iii) manipulate each of the equivalence relations so that they themselves are in conjunctive normal form.

This process is most easily understood with a concrete example. Consider the conversion of the formula:

\begin{equation}

\phi:= ((x_{1} \lor x_{2}) \Leftrightarrow x_{3}) \Rightarrow (\lnot x_{4}). \tag{3}

\end{equation}

**Step 1:** We associate new binary variables $y_{i}$ with the sub-parts of the original formula using the $\text{EQUIVALENCE}$ operator:

\begin{eqnarray}\label{eq:tseitin}

y_{1} &\Leftrightarrow &(x_{1} \lor x_{2})\nonumber \\

y_{2} &\Leftrightarrow &(y_{1} \Leftrightarrow x_{3}) \nonumber \\

y_{3} &\Leftrightarrow &\lnot x_{4}\nonumber \\

y_{4} &\Leftrightarrow &(y_{2} \Rightarrow y_{3}). \tag{4}

\end{eqnarray}

We work from the inside out (i.e., from the deepest brackets to the least deep) and choose sub-formulae that contain a single operator ($\lor, \land, \lnot, \Rightarrow$ or $\Leftrightarrow$).

**Step 2:** We restate the formula in terms of these relations. The full original statement is now represented by $y_{4}$ together with the definitions of $y_{1},y_{2},y_{3},y_{4}$ in equation 4. So the statement is $\text{true}$ when we combine all of these relations with logical $\text{AND}$ relations. Working backwards we get:

\begin{eqnarray}\label{eq:tseitin_stage2}

\phi&=& y_{4} \land (y_{4} \Leftrightarrow (y_{2} \Rightarrow y_{3})) \nonumber \\&&\hspace{0.4cm}\land (y_{3} \Leftrightarrow \lnot x_{4})\nonumber\\&& \hspace{0.4cm}\land (y_{2} \Leftrightarrow (y_{1} \Leftrightarrow x_{3}))\nonumber

\\&&\hspace{0.4cm}\land (y_{1} \Leftrightarrow (x_{1} \lor x_{2})). \tag{5}

\end{eqnarray}

This is getting closer to the conjunctive normal form as it is now a conjunction (logical $\text{AND}$) of different terms.

**Step 3:** We convert each of these individual terms to conjunctive normal form. In practice, there is a recipe for each type of operator:

\begin{eqnarray}

a \Leftrightarrow (\lnot b) & = & (a \lor b) \land (\lnot a \lor \lnot b) \nonumber \\

a \Leftrightarrow (b \lor c) &= & (a\lor \lnot b) \land (a \lor \lnot c) \land (\lnot a \lor b \lor c) \nonumber \\

a \Leftrightarrow (b \land c) & = & (\lnot a \lor b) \land (\lnot a \lor c) \land (a \lor \lnot b \lor \lnot c) \nonumber \\

a \Leftrightarrow (b \Rightarrow c) & = & (a \lor b) \land (a \lor \lnot c) \land (\lnot a \lor \lnot b \lor c) \nonumber \\

a \Leftrightarrow (b \Leftrightarrow c) & = & (\lnot a \lor \lnot b \lor c)\land (\lnot a \lor b \lor \lnot c) \land (a \lor \lnot b \lor \lnot c) \land (a\lor b\lor c). \tag{6}

\end{eqnarray}

The first of these recipes is easy to understand. If $a$ is $\text{true}$ then the first clause is satisfied, but the second can only be satisfied by having $\lnot b$. If $a$ is $\text{false}$ then the second clause is satisfied, but the first clause can only be satisfied by $b$. Hence when $a$ is $\text{true}$, $\lnot b$ is $\text{true}$ and when $a$ is $\text{false}$, $\lnot b$ is $\text{false}$ and so $a \Leftrightarrow (\lnot b)$ as required.

The remaining recipes are not obvious, but you can confirm that they are correct by writing out the truth tables for the left and right sides of each expression and confirming that they are the same. Applying the recipes to equation 5 we get the final expression in conjunctive normal form:

\begin{eqnarray}\label{eq:tseitin_stage3}

\phi\!\!&\!\!:=& y_{4} \land (y_4\lor y_2) \land (y_4 \lor \lnot y_3) \land (\lnot y_4 \lor \lnot y_2 \lor y_3)\nonumber \\

&&\hspace{0.4cm}\land (y_3 \lor x_4) \land (\lnot y_3 \lor \lnot x_4)\nonumber\\

&& \hspace{0.4cm}\land (\lnot y_2 \lor \lnot y_1 \lor x_3)\land (\lnot y_2 \lor y_1 \lor \lnot x_3) \land (y_2 \lor \lnot y_1 \lor \lnot x_3) \land (y_2\lor y_1\lor x_3)\nonumber \\&&

\hspace{0.4cm}\land (y_1\lor \lnot x_1) \land (y_1 \lor \lnot x_2) \land (\lnot y_1 \lor x_1 \lor x_2). \tag{7}

\end{eqnarray}
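The result of the transformation can be checked by brute force. The sketch below confirms that, for every setting of $x_{1},\ldots,x_{4}$, the original formula in equation 3 is $\text{true}$ exactly when some setting of the new variables $y_{1},\ldots,y_{4}$ satisfies equation 7. The integer numbering of the variables and the `satisfied` helper are conventions assumed for this example.

```python
from itertools import product

def original(x1, x2, x3, x4):
    """Equation 3: ((x1 or x2) <=> x3) => (not x4)."""
    return (not ((x1 or x2) == x3)) or (not x4)

# Equation 7 with x1..x4 numbered 1..4 and y1..y4 numbered 5..8.
# A negative number denotes the complement of that variable.
tseitin_cnf = [
    [8], [8, 6], [8, -7], [-8, -6, 7],                     # y4 and y4 <=> (y2 => y3)
    [7, 4], [-7, -4],                                      # y3 <=> not x4
    [-6, -5, 3], [-6, 5, -3], [6, -5, -3], [6, 5, 3],      # y2 <=> (y1 <=> x3)
    [5, -1], [5, -2], [-5, 1, 2],                          # y1 <=> (x1 or x2)
]

def satisfied(cnf, values):
    """values[i - 1] holds the value of variable i."""
    return all(any(values[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf)

for xs in product((False, True), repeat=4):
    exists_y = any(
        satisfied(tseitin_cnf, xs + ys) for ys in product((False, True), repeat=4)
    )
    assert exists_y == original(*xs)

print("equation 7 agrees with equation 3 on all 16 inputs")
```

The new formula has more variables than the original, but its size grows only linearly with the number of operators, which is the main appeal of this transformation.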

In the conjunctive normal form, each clause is a disjunction (logical $\text{OR}$) of variables and their complements. For neatness, we will write the complement $\lnot x$ of a variable as $\overline{x}$, so instead of writing:

\begin{equation}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\lnot x_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \lnot x_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \lnot x_{3}), \tag{8}

\end{equation}

we write:

\begin{equation}\label{eq:example_cnf2}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\overline{x}_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \overline{x}_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \overline{x}_{3}). \tag{9}

\end{equation}

We collectively refer to the variables and their complements as *literals* and so this formula contains literals $x_{1},\overline{x}_{1},x_{2},\overline{x}_{2}, x_{3}$ and $\overline{x}_{3}.$

When expressed in conjunctive normal form, we can characterise the problem in terms of the number of variables, the number of clauses and the size of those clauses. To facilitate this we introduce the following terminology:

- A clause that contains $k$ variables is known as a $k$*-clause*. When a clause contains only a single variable, it is known as a *unit clause*.
- When all the clauses contain $k$ variables, we refer to a problem as $k$*-SAT*. Using this nomenclature, we see that equation 9 is a 3-SAT problem.

SAT solvers are algorithms that establish whether a Boolean expression is satisfiable and they can be classified into two types. *Complete* algorithms guarantee to return $\text{SAT}$ or $\text{UNSAT}$ (although they may take an impractically long time to do so). *Incomplete* algorithms return $\text{SAT}$ or return $\text{UNKNOWN}$ (i.e. return without providing an answer). If they find a solution that satisfies the expression then all is good, but if they don't then we can draw no conclusions.

Here are two naïve algorithms that will help you understand the difference:

- An example of a complete algorithm is *exhaustive search*. If there are $V$ variables, we evaluate the expression with all $2^{V}$ combinations of literals and see if any combination returns $\text{true}$. Obviously, this will take an impractically long time when the number of variables is large, but nonetheless it is guaranteed to return either $\text{SAT}$ or $\text{UNSAT}$ eventually.
- An example of an incomplete algorithm is *Schöning's random walk*. This is a Monte Carlo solver in which we repeatedly (i) randomly choose an unsatisfied clause and (ii) choose one of the variables in this clause at random and set it to the opposite value. At each step we test if the formula is now satisfied and if so return $\text{SAT}$. After $3V$ iterations, we return $\text{UNKNOWN}$ if we have not found a satisfying configuration.
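Both naïve algorithms fit in a few lines of Python. In the sketch below, clauses are lists of signed integers (`k` for $x_{k}$, `-k` for $\overline{x}_{k}$), which is a convention assumed for this example. The random walk is assumed to start from a uniformly random assignment, and the `rng` argument is a hypothetical parameter added so that runs are reproducible. Both functions return a satisfying assignment, or `None` to signal $\text{UNSAT}$ (exhaustive search) or $\text{UNKNOWN}$ (random walk).

```python
import random
from itertools import product

def satisfied(clause, assignment):
    """True if at least one literal in the clause is true under the assignment."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def exhaustive_search(cnf, num_vars):
    """Complete: try all 2^V assignments. None means UNSAT."""
    for values in product((False, True), repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if all(satisfied(c, assignment) for c in cnf):
            return assignment
    return None

def schoening_walk(cnf, num_vars, rng):
    """Incomplete: random walk for 3V steps. None means UNKNOWN."""
    # Assumption: begin from a uniformly random assignment.
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    for _ in range(3 * num_vars):
        unsatisfied = [c for c in cnf if not satisfied(c, assignment)]
        if not unsatisfied:
            return assignment
        # (i) pick an unsatisfied clause, (ii) flip one of its variables.
        lit = rng.choice(rng.choice(unsatisfied))
        assignment[abs(lit)] = not assignment[abs(lit)]
    if all(satisfied(c, assignment) for c in cnf):
        return assignment
    return None

# Equation 9 in signed-integer form.
cnf = [[1, 2, 3], [-1, 2, 3], [1, -2, 3], [1, 2, -3]]
print(exhaustive_search(cnf, 3))
print(schoening_walk(cnf, 3, random.Random(0)))
```

In practice the random walk would be restarted many times from fresh random assignments; a single run that returns `None` tells us nothing about satisfiability.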

When a solver returns $\text{SAT}$ or $\text{UNSAT}$, it also returns a *certificate*, which can be used to check the result with a simpler algorithm. If the solver returns $\text{SAT}$, then the certificate will be an assignment of the variables that satisfies the formula. This can be checked by simply evaluating the formula with this assignment and confirming that it returns $\text{true}$. If it returns $\text{UNSAT}$ then the certificate will usually be a complex data structure that depends on the solver.

First, the bad news. The SAT problem is proven to be NP-complete and it follows that there is no known polynomial-time algorithm for establishing satisfiability in the general case. An important exception to this statement is 2-SAT, for which a polynomial-time algorithm is known. However, for 3-SAT and above the problem is very difficult.

The good news is that modern SAT solvers are very efficient and can often solve problems involving tens of thousands of variables and millions of clauses in practice. In part II of this tutorial we will explain how these algorithms work.

Until now we have focused on the satisfiability problem in which we try to establish if there is at least one set of literals that makes a given statement evaluate to $\text{true}$. We note that there are also a number of closely related problems:

**UNSAT:** In the UNSAT problem we aim to show that there is no combination of literals that satisfies the formula. This is subtly different from SAT where algorithms return as soon as they find literals that show the formula is $\text{SAT}$, but may take exponential time if they cannot find a solution. For the UNSAT problem, the converse is true. The algorithm will return as soon as it establishes the formula is *not* $\text{UNSAT}$, but may take exponential time to show that it is $\text{UNSAT}$.

**Model counting:** In model counting (sometimes referred to as #SAT or #CSP), our goal is to count the number of distinct sets of literals that satisfy the formula.

**Max-SAT:** In Max-SAT, it may be the case that a formula is $\text{UNSAT}$ but we aim to find a solution that minimizes the number of clauses that are invalid.

**Weighted Max-SAT:** This is a variation of Max-SAT in which we pay a different penalty for each clause when it is invalid. We wish to find the solution that incurs the least penalty.
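For small formulas, model counting and (weighted) Max-SAT can both be illustrated by brute force. The sketch below is not how practical solvers work; it simply enumerates every assignment. Clauses are lists of signed integers (`k` for $x_{k}$, `-k` for its complement), and the helper names are hypothetical.

```python
from itertools import product

def clause_true(clause, values):
    """values[i - 1] holds the value of variable i."""
    return any(values[abs(l) - 1] == (l > 0) for l in clause)

def count_models(cnf, num_vars):
    """Model counting: number of assignments that satisfy every clause."""
    return sum(
        all(clause_true(c, values) for c in cnf)
        for values in product((False, True), repeat=num_vars)
    )

def weighted_max_sat(cnf, weights, num_vars):
    """Return (penalty, values) minimising the total weight of violated clauses.

    Setting every weight to 1 gives plain Max-SAT.
    """
    best = None
    for values in product((False, True), repeat=num_vars):
        penalty = sum(w for c, w in zip(cnf, weights) if not clause_true(c, values))
        if best is None or penalty < best[0]:
            best = (penalty, values)
    return best

# Equation 9: each clause rules out exactly one of the eight assignments.
cnf = [[1, 2, 3], [-1, 2, 3], [1, -2, 3], [1, 2, -3]]
print(count_models(cnf, 3))  # 4

# An unsatisfiable formula: x1 AND (NOT x1) AND x2, with clause penalties 3, 1, 1.
print(weighted_max_sat([[1], [-1], [2]], [3, 1, 1], 2))  # (1, (True, True))
```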

For the rest of this tutorial, we'll concentrate on the main SAT problem, but we'll return to these related problems in part III of this tutorial when we discuss factor graph methods.

Most of the remainder of part I of this tutorial is devoted to discussing practical applications of satisfiability problems. Based on the discussion thus far, the reader would be forgiven for being sceptical about how this rather abstract problem can find real-world uses. We will attempt to convince you that it can! However, before we can do this, it will be helpful to review commonly-used *SAT constructions*.

SAT constructions can be thought of as subroutines for Boolean logic expressions. A common situation is that we have a set of variables $x_{1},x_{2},x_{3},\ldots$ and we want to enforce a collective constraint on their values. In this section, we'll discuss how to enforce the constraints that they are all the same, that exactly one of them is $\text{true}$, that at least $K$ (or fewer than $K$) of them are $\text{true}$, or that exactly $K$ of them are $\text{true}$.

To enforce the constraint that a set of variables $x_{1},x_{2}$ and $x_{3}$ are either all $\text{true}$ or all $\text{false}$ we simply take the logical $\text{OR}$ of these two cases so we have:

\begin{equation}

\mbox{Same}[x_{1},x_{2},x_{3}]:= (x_{1}\land x_{2}\land x_{3})\lor(\overline{x}_{1}\land \overline{x}_{2}\land \overline{x}_{3}). \tag{10}

\end{equation}

Note that this is not in conjunctive normal form (the $\text{AND}$ and $\text{OR}$s are the wrong way around) but could be converted via the Tseitin transformation.
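For this particular constraint there is also a direct route to conjunctive normal form that avoids new variables: a cycle of implications $x_{1}\Rightarrow x_{2}\Rightarrow x_{3}\Rightarrow x_{1}$, each of which is a two-literal clause. This alternative is not the construction described above; the sketch below simply checks by truth table that it agrees with equation 10.

```python
from itertools import product

def same(x1, x2, x3):
    """Equation 10: all true or all false."""
    return (x1 and x2 and x3) or ((not x1) and (not x2) and (not x3))

def same_cnf(x1, x2, x3):
    """Cycle of implications written as three clauses in conjunctive normal form."""
    return ((not x1) or x2) and ((not x2) or x3) and ((not x3) or x1)

agree = all(
    same(*values) == same_cnf(*values)
    for values in product((False, True), repeat=3)
)
print(agree)  # True
```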

To enforce the constraint that only one of a set of variables $x_{1},x_{2}$ and $x_{3}$ is true and the other two are false, we add two constraints. First we ensure that at least one variable is $\text{true}$ by logically $\text{OR}$ing the variables together:

\begin{equation}

\phi_{1}:= x_{1}\lor x_{2} \lor x_{3}. \tag{11}

\end{equation}

Then we add a constraint that indicates that both members of any pair of variables cannot be simultaneously $\text{true}$:

\begin{equation}\label{eq:exactly_one}

\mbox{ExactlyOne}[x_{1},x_{2},x_{3}]:= \phi_{1}\land \lnot (x_{1}\land x_{2}) \land \lnot (x_{1}\land x_{3}) \land \lnot (x_{2}\land x_{3}) . \tag{12}

\end{equation}
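Since $\lnot(a \land b)$ is the same as $(\overline{a} \lor \overline{b})$, equation 12 is already a conjunction of clauses. The sketch below generates those clauses for any list of variables, using signed integers for literals (a convention assumed for this example), and checks by brute force that they hold exactly when one variable is $\text{true}$.

```python
from itertools import combinations, product

def exactly_one(variables):
    """Clauses enforcing that exactly one of the given variables is true."""
    clauses = [list(variables)]                                   # at least one is true
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]  # no pair is both true
    return clauses

clauses = exactly_one([1, 2, 3])
print(clauses)
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]

# Brute-force check against the intended meaning.
for values in product((False, True), repeat=3):
    holds = all(any(values[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
    assert holds == (sum(values) == 1)
```

Note that the number of pairwise clauses grows quadratically with the number of variables, which is one reason more compact encodings exist.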

There are many standard ways to enforce the constraint that at least $K$ of a set of variables are $\text{true}$. We'll present one method which is a simplified version of the sequential counter encoding.

The idea is straightforward. If we have $J$ variables $x_{1},x_{2},\ldots x_{J}$ and wish to test if $K$ or more are true, we construct a $J\times K$ matrix containing new binary variables $r_{j,k}$ (figures 2b and d). The $j^{th}$ row of the table contains a count of the number of $\text{true}$ elements we have seen in $x_{1\ldots j}$. So, if we have seen 3 variables that are $\text{true}$ in the first $j$ elements, the $j^{th}$ row will start with 3 $\text{true}$ elements and finish with $K-3$ $\text{false}$ elements.

If at least $K$ of the variables are $\text{true}$, then the bottom right variable $r_{J,K}$ in this table must be $\text{true}$, so in practice we add a clause $(r_{J,K})$ stating that this bottom right element must be $\text{true}$ to enforce the constraint. When this element is $\text{false}$, the solver will search for a different solution where $\mathbf{x}$ does have at least $K$ $\text{true}$ elements or return $\text{UNSAT}$ if it cannot find one. By the same logic, to enforce the constraint that fewer than $K$ elements are $\text{true}$, we add a clause $(\overline{r}_{J,K})$ stating that the bottom right-hand variable is $\text{false}$.

The table constructed in figure 2d also shows us how to constrain the data to have exactly $K$ $\text{true}$ values. In this case, we expect the bottom right element to be $\text{true}$, but the element above this to be $\text{false}$, so we add the clause $(r_{J,K}\land \overline{r}_{J-1,K})$. Figure 3 provides more detail about how we add extra clauses to the SAT formula that build these tables.
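To make the meaning of the table concrete, the sketch below computes the values that the variables $r_{j,k}$ are intended to take for a given input, rather than emitting SAT clauses. It assumes the recurrence $r_{j,k} = r_{j-1,k} \lor (x_{j} \land r_{j-1,k-1})$, with the first column needing no earlier $\text{true}$ elements. That recurrence is an assumption consistent with the description above, not a transcription of the clauses used in the tutorial's figures.

```python
from itertools import product

def counter_table(x, K):
    """Return a J x K table where entry [j][k] (0-indexed) is True
    iff at least k + 1 of x[0..j] are True."""
    J = len(x)
    r = [[False] * K for _ in range(J)]
    for j in range(J):
        for k in range(K):
            reached_before = r[j - 1][k] if j > 0 else False
            if k == 0:
                one_short_before = True   # zero true elements are always available
            else:
                one_short_before = r[j - 1][k - 1] if j > 0 else False
            r[j][k] = reached_before or (x[j] and one_short_before)
    return r

def at_least_k(x, K):
    """The bottom right entry of the table."""
    return counter_table(x, K)[-1][-1]

for row in counter_table([True, False, True, True], 3):
    print(row)
# [True, False, False]
# [True, False, False]
# [True, True, False]
# [True, True, True]

# The bottom right entry matches a direct count for every input of length 4.
for x in product((False, True), repeat=4):
    for K in (1, 2, 3):
        assert at_least_k(list(x), K) == (sum(x) >= K)
```

Each row begins with as many $\text{true}$ entries as have been counted so far (capped at $K$), matching the description of the table.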

Armed with these SAT constructions, we'll now present two complementary ways of thinking about SAT applications. The goal is to inspire the novice reader to see the applicability to their own problems. In the next section, we'll consider SAT in terms of constraint satisfaction problems and in the section following that, we'll discuss it in terms of model fitting.

The constraint satisfaction viewpoint considers combinatorial problems where there are a very large number of potential solutions, but most of those solutions are ruled out by some pre-specified constraints. To make this explicit, we'll consider the two examples of graph coloring and scheduling.

In the graph coloring problem (figure 4) we are given a graph consisting of a set of vertices and edges. We want to associate each vertex with a color in such a way that every pair of vertices connected by an edge has different colors. We might also want to know how many colors are necessary to find a valid solution. Note that this maps to our description of the generic constraint satisfaction problem; there are a large number of possible assignments of colors, but many of these are ruled out by the constraint that neighboring vertices must have different colors.

To encode this as a SAT problem, we'll choose the number of colors $C$ to test. Then we create binary variables $x_{c,v}$ which will be $\text{true}$ if vertex $v$ is colored with color $c$. We then encode the constraint that each vertex has exactly one color using the construction $\mbox{ExactlyOne}[x_{\bullet, v}]$ from equation 12. We also add constraints to ensure that neighbours have different colors. Formally, this means that $x_{c,v}\Rightarrow \lnot x_{c,v'}$ for every color $c$ and every neighbour $v'$ of vertex $v$.

Having set up the problem, we run the SAT solver. If it returns $\text{UNSAT}$ this means we need more colors. If it returns $\text{SAT}$ with a concrete coloring, then we have an answer. We can find the minimum number of colors required by using binary search over the number of colors to find the point where the problem changes from $\text{SAT}$ to $\text{UNSAT}$.
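The encoding above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production solver: the function names `coloring_cnf`, `solve` and `min_colors` are hypothetical, the variable numbering is our own choice, and the tiny recursive search stands in for a real SAT solver.

```python
from itertools import combinations

def coloring_cnf(num_vertices, edges, num_colors):
    """Build CNF clauses for graph coloring.

    The variable x_{c,v} is given the (assumed) integer id v * num_colors + c + 1;
    a negative integer denotes the negated variable.
    """
    var = lambda c, v: v * num_colors + c + 1
    clauses = []
    for v in range(num_vertices):
        # ExactlyOne, part 1: each vertex has at least one color.
        clauses.append([var(c, v) for c in range(num_colors)])
        # ExactlyOne, part 2: no vertex has two colors.
        for c1, c2 in combinations(range(num_colors), 2):
            clauses.append([-var(c1, v), -var(c2, v)])
    for u, v in edges:
        for c in range(num_colors):
            # x_{c,u} => not x_{c,v}, written as the clause (not x_{c,u} or not x_{c,v}).
            clauses.append([-var(c, u), -var(c, v)])
    return clauses

def solve(clauses, assignment=None):
    """Small backtracking search. Returns a dict {variable: bool} or None for UNSAT."""
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        remaining, satisfied = [], False
        for lit in clause:
            value = assignment.get(abs(lit))
            if value is None:
                remaining.append(lit)
            elif value == (lit > 0):
                satisfied = True
                break
        if satisfied:
            continue
        if not remaining:
            return None  # every literal is false: conflict
        simplified.append(remaining)
    if not simplified:
        return assignment  # all clauses satisfied
    # Branch on a literal from the shortest clause (unit clauses come first).
    lit = min(simplified, key=len)[0]
    for value in (lit > 0, lit < 0):
        result = solve(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None

def min_colors(num_vertices, edges):
    """Binary search for the SAT/UNSAT boundary over the number of colors."""
    lo, hi = 1, num_vertices  # one color per vertex always suffices
    while lo < hi:
        mid = (lo + hi) // 2
        if solve(coloring_cnf(num_vertices, edges, mid)) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo

# A cycle of five vertices cannot be colored with two colors.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(min_colors(5, cycle))
```

Variables that the search never assigns can be read as $\text{false}$, so the color of vertex $v$ is the single $c$ for which $x_{c,v}$ is $\text{true}$ in the returned assignment.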

The graph coloring problem is a rather artificial computer science example, but many real-world problems can similarly be expressed in terms of satisfiability. For example, consider scheduling courses in a university. We have a number of professors, each of whom teaches several different courses. We have a number of classrooms. We have a number of possible time-slots in each classroom. Finally, we have the students themselves, who are each signed up to a different subset of courses. We can use the SAT machinery to decide which course will be taught in which classroom and in what time-slot so that no clashes occur.

In practice, this is done by defining binary variables describing the known relations between the real world quantities. For example, we might have variables $x_{i,j}$ indicating that student $i$ takes course $j$. Then we encode the relevant constraints: no teacher can teach two classes simultaneously, no student can be in two classes simultaneously, no room can host more than one class simultaneously, and so on. The details are left as an exercise to the reader, but the similarity to the graph coloring problem is clear.

A second way to think about satisfiability is in terms of function fitting. Here, there is a clear connection to machine learning in which we fit complex functions (i.e., models) to training data. In fact there is a simple relationship between function-fitting and constraint satisfaction; when we fit a model, we can consider the parameters as unknown variables, and each training data/label pair represents a constraint on the values those parameters can take. In this section, we'll consider fitting binary neural networks and decision trees.

Binary neural networks are nets in which both the weights and activations are binary. Their performance can be surprisingly good, and their implementation can be extremely efficient. We'll show how to fit a binary neural network using SAT.

Following Mezard and Mora (2008) we consider a one layer binary network with $K$ neurons. The network takes a $J$ dimensional data example $\mathbf{x}$ with elements $x_{j}\in\{-1,1\}$ and computes a label $y\in\{-1,1\}$, using the function:

\begin{equation}\label{eq:one_layer}

y = \mbox{sign}\left[\sum_{j=1}^{J}\phi_{j}x_{j}\right] \tag{13}

\end{equation}

where the unknown model parameters $\phi_{j}$ are also binary and the function $\mbox{sign}[\bullet]$ returns -1 or 1 (figure 5) based on the sign of the summed terms.

Given a training set of $I$ data/label pairs $\{\mathbf{x}_{i},y_{i}\}$, our goal is to choose the model parameters $\phi_{j}$. We'll force all of the training examples to be classified correctly and so each training example/label pair can be considered a hard constraint on the parameters.

To encode these constraints, we create new variables $z_{i,j}$ that indicate whether the product $\phi_{j}x_{i,j}$ is positive. This happens when either both elements are positive or both are negative, so we can use the $\mbox{Same}[\phi_{j},x_{i,j}]$ construction. Note that for the rest of this discussion we'll revert to the convention that $x_{i,j}, y_{i}\in\{\text{false},\text{true}\}$.

The predicted label is the sign of the sum of the product terms, and will be positive when more than half of the product terms $z_{i,\bullet}$ evaluate to $\text{true}$. Likewise, it will be negative if fewer than half are $\text{true}$. Hence, for the network to predict the correct output label $y_{i}$ we require

\begin{equation}

\left(y_{i} \land \mbox{AtLeastK}[\mathbf{z}_{i}]\right)\lor \left(\overline{y}_{i} \land \lnot\mbox{AtLeastK}[\mathbf{z}_{i}]\right) \tag{14}

\end{equation}

where $K=J/2$ and the vector $\mathbf{z}_{i}$ contains the product terms $z_{i,\bullet}$.

We have one such constraint for each training example and we logically $\text{AND}$ these together. When we run the SAT solver we are asking whether it is possible to find a set of parameters $\boldsymbol\phi$ for which all of these constraints are met.
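To make the constraint in equation 14 concrete, here is a small sketch that checks it directly. It makes several assumptions for illustration: the names `predicts_correctly` and `fit_binary_neuron` are hypothetical, $J$ is taken to be odd with $K=(J+1)/2$ so that ties cannot occur, and an exhaustive search over $\boldsymbol\phi$ stands in for the SAT solver (which is only feasible for very small $J$).

```python
from itertools import product

def predicts_correctly(phi, x, y, k):
    """Evaluate equation 14 for one training example (all values are booleans)."""
    # z_j = Same[phi_j, x_j]: true when the product phi_j * x_j is positive.
    z = [p == xj for p, xj in zip(phi, x)]
    at_least_k = sum(z) >= k
    return (y and at_least_k) or (not y and not at_least_k)

def fit_binary_neuron(data, labels):
    """Return parameters phi that satisfy every constraint, or None if none exist."""
    j = len(data[0])
    k = (j + 1) // 2  # assumption: J odd, so "at least K" means "more than half"
    for phi in product([False, True], repeat=j):
        # The per-example constraints are ANDed together.
        if all(predicts_correctly(phi, x, y, k) for x, y in zip(data, labels)):
            return phi
    return None

# Generate labels from a known parameter vector, then recover it.
true_phi = (True, False, True)
data = list(product([False, True], repeat=3))
labels = [sum(p == xj for p, xj in zip(true_phi, x)) >= 2 for x in data]
print(fit_binary_neuron(data, labels))
```

If two identical examples carry opposite labels, no parameter setting can satisfy both constraints and the search returns `None`, which corresponds to the solver returning $\text{UNSAT}$.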

It is easy to extend this example to multi-layer networks and to allow a certain amount of training error and we leave these extensions as exercises for the reader.

A binary decision tree also classifies data $\mathbf{x}_{i}$ into binary labels $y_{i}\in\{0,1\}$. Each data example $\mathbf{x}_{i}$ starts at the root. It then passes to either the left or right branch of the tree by testing one of its elements $x_{i,j}$. We'll consider binary data $x_{i,j}\in\{\text{false}, \text{true}\}$ and adopt the convention that the data example passes left if $x_{i,j}$ is $\text{false}$ and right if $x_{i,j}$ is $\text{true}$. This procedure continues, testing a different value of $x_{i,j}$ at each node in the tree until we reach a leaf node at which a binary output label is assigned.

Learning the binary decision tree can also be framed as a satisfiability problem. From a training perspective, we would like to select the tree structure so that the training examples $\mathbf{x}_{i}$ that reach each leaf node have labels $y_{i}$ that are all $\text{true}$ or all $\text{false}$ and hence the training classification performance is 100%.

We'll develop a simplified version of the approach of Narodytska et al. (2018). Incredibly, we can learn both the structure of the tree and which features to branch on simultaneously. When we run the SAT solver for a given number $N$ of tree nodes, it will search over the space of all tree structures and branching features and return $\text{SAT}$ if it is possible to classify all the training examples correctly and provide a concrete example in which this is possible. By changing the number of tree nodes, we can find the point at which this problem turns from $\text{SAT}$ to $\text{UNSAT}$ and hence find the smallest possible tree that classifies the training data correctly.

We'll describe the SAT construction in two parts. First we'll describe how to encode the structure of the tree as a set of logical relations and then we'll discuss how to choose branching features that classify the data correctly.

**Tree structure:** We create $N$ binary variables $v_{n}$ that indicate if each of the $N$ nodes is a leaf. Similarly we create $N^{2}$ binary variables $l_{m,n}$ indicating if node $n$ is the left child of node $m$ and $N^{2}$ binary variables $r_{m,n}$ indicating if node $n$ is the right child of node $m$. Then we build Boolean expressions to enforce the following constraints:

- The root node (node 1) is not a leaf.
- Leaf nodes have neither left nor right children.
- Non-leaf nodes have exactly one left and one right child.
- Every node except the root is either the left child or the right child of exactly one other node.

Any set of variables $v_{n}$, $l_{m,n}$, $r_{m,n}$ that obeys these constraints forms a valid tree, and we can find such a configuration with a SAT solver. Two such trees are illustrated in figure 6.

**Classification:** The second part of the construction ensures that the data examples $\mathbf{x}_{i}$ are classified correctly (figure 7). We introduce variables $f_{n,j}$ that indicate that node $n$ branches on feature $x_j$. We'll adopt the convention that when the branching variable $x_{j}$ is $\text{false}$ we will always branch left and when it is $\text{true}$ we will always branch right. In addition, we introduce variables $\hat{y}_{n}$ that will indicate if each leaf node classifies the data as $\text{true}$ or $\text{false}$ (their values will be arbitrary for non-leaf nodes).

We'll also create several book-keeping variables that are needed to set this up as a SAT problem, but are not required to run the model once trained. We introduce ancestor variables $a^{l}_{nj}$ at each node $n$ which are $\text{true}$ if we branched left on feature $j$ at node $n$ or at any of its ancestors and similarly $a^{r}_{nj}$ if we branched right on feature $j$ at this node or any of its ancestors. Finally, we introduce variables $e_{i,n}$ that indicate that training example $\mathbf{x}_{i}$ reached leaf node $n$. Notice that this happens when $x_{ij}$ is $\text{false}$ everywhere $a^{l}_{nj}$ is $\text{true}$ (i.e., we branched left somewhere above on these left ancestor features) and $x_{ij}$ is $\text{true}$ everywhere $a^{r}_{nj}$ is $\text{true}$ (i.e., we branched right somewhere above on these right ancestor features).
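The condition under which an example reaches a leaf can be written as a short predicate. This is only a sketch of the logic described above, evaluated on known values rather than encoded as clauses; the function name `reaches_leaf` and the list-based representation of the ancestor variables are assumptions made for illustration.

```python
def reaches_leaf(x, left_ancestors, right_ancestors):
    """Return True if example x is routed to the leaf with these ancestor variables.

    x[j] is the j-th binary feature, left_ancestors[j] plays the role of a^l_{nj}
    and right_ancestors[j] plays the role of a^r_{nj} for a fixed leaf n.
    """
    return all(
        # Wherever we branched left on feature j, x[j] must be false;
        # wherever we branched right on feature j, x[j] must be true.
        (not a_left or not x_j) and (not a_right or x_j)
        for x_j, a_left, a_right in zip(x, left_ancestors, right_ancestors)
    )

# A leaf reached by branching left on feature 0 and right on feature 1.
left_anc = [True, False, False]
right_anc = [False, True, False]
print(reaches_leaf([False, True, True], left_anc, right_anc))
print(reaches_leaf([True, True, True], left_anc, right_anc))
```

Feature 2 does not appear among the ancestors of this leaf, so its value has no influence on whether the example arrives there.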

Using these variables, we build Boolean expressions to enforce the following constraints:

- Each non-leaf node must branch on exactly one feature.
- The left and right ancestor variables at the root are all $\text{false}$.
- The left ancestor variables $a^{l}_{n,\bullet}$ at a node $n$ are the same as the parent's, except that the entry for the parent's branching feature is also $\text{true}$ if we branched left to get here.
- The right ancestor variables $a^{r}_{n,\bullet}$ at a node $n$ are the same as the parent's, except that the entry for the parent's branching feature is also $\text{true}$ if we branched right to get here.
- You can't branch on a variable twice in any one path to a leaf.
- A data example reaches a leaf node if the left and right ancestors match its pattern of $\text{true}$ and $\text{false}$ elements as described above.
- All data that reach a given leaf node must have the same class.

Collectively, these constraints mean that all of the data must be correctly classified. When we logically $\text{AND}$ all of these constraints together, and find a solution that is $\text{SAT}$ we retrieve a tree that classifies the data 100% correctly. By reducing the number of nodes until the point that the problem becomes $\text{UNSAT}$, we can find the most efficient tree that partitions the training data exactly.

This concludes part I of this tutorial on SAT solvers. We've introduced the SAT problem, shown how to convert it to conjunctive normal form and presented some standard SAT constructions. Finally, we've described several different applications which we hope will inspire you to see SAT as a viable approach to your own problems.

In the next part of this tutorial, we'll delve into how SAT solvers actually work. In the final part, we'll elucidate the connections between SAT solving and factor graphs. For those readers who still harbor reservations about the applicability of a method based purely on Boolean variables, we'll also consider (i) how to convert non-Boolean variables to binary form and (ii) methods to work with them directly using SMT solvers.

If you want to try working with SAT algorithms, then this tutorial will help you get started. For an extremely comprehensive list of applications of satisfiability, consult SAT/SMT by example. This may give you more inspiration for how to re-frame your problems in terms of satisfiability.

This execution problem sounds straightforward but there are several complications. A naive approach might be to wait until the price seems "low enough" and then buy all the shares at once. Putting aside the question of how to define "low enough", this method has a huge drawback. Executing a large order all at once creates a great deal of demand and the effect of this is to increase the price (market impact). Unfortunately, this action has an undesirable effect on the final price achieved. Consequently, it could be more sensible to buy the shares gradually through the specified time period. But how many should the broker buy, and when?

To the savvy machine learning researcher, it will be obvious that this problem lends itself to a reinforcement learning formulation. The execution algorithm must make a series of sequential decisions about how many shares to buy at each time step and receives rewards in the form of low execution prices.

The structure of the rest of this article is as follows. First, we describe the order execution problem in more detail. This will necessitate a discussion of how modern financial markets work in practice and the limit order book. Then we provide a brief review of reinforcement learning and describe how it maps to this problem. Finally, we describe the practical details of the Aiden system.

Contemporary financial markets such as the TSX/NYSE/NASDAQ are *limit order markets*. This means that traders who wish to purchase shares can specify not only the volume they wish to purchase, but also the maximum price (*limit*) that they are prepared to pay. More formally, a *limit order* can be specified by the tuple $\{\tau, p, n\}$ where $\tau\in\{0,1\}$ specifies whether this is a buy or sell order, $p$ is the specified price limit, and $n$ is the maximum number of shares to be traded. The possible prices of the shares in the order book are discrete, and the smallest difference allowable between them is a *tick*.

The *limit order book* consists of the set of all current limit orders. It can be visualized as two histograms (figure 1). The first consists of the volume of the buy orders at each discrete price level and the second consists of volumes of the sell orders. The highest buy order is known as the current *bid* price and the lowest sell order is known as the current *ask* price. The difference between the two is known as the *bid-ask spread* and the average of the two is known as the *mid-price*.
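As a small illustration of these definitions, the sketch below computes the bid, ask, bid-ask spread and mid-price from a toy order book. The representation (two dictionaries mapping price to volume, with prices expressed as integer ticks) and the function name `top_of_book` are assumptions made for this example.

```python
def top_of_book(bids, asks):
    """Return (bid, ask, spread, mid_price) from two price -> volume maps."""
    # The bid is the highest price with buy volume; the ask is the lowest with sell volume.
    bid = max(price for price, volume in bids.items() if volume > 0)
    ask = min(price for price, volume in asks.items() if volume > 0)
    return bid, ask, ask - bid, (bid + ask) / 2

# Toy book: prices are integer ticks (for example, cents).
bids = {999: 300, 1000: 150}
asks = {1002: 100, 1003: 250}
bid, ask, spread, mid = top_of_book(bids, asks)
print(bid, ask, spread, mid)
```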

When a trader enters a buy limit order that is at or above the current ask price, the order will receive executions. The first trades will occur at the current ask price, but if the volume of the buy order exceeds the volume available at that price, the order will continue at the next price level. This process continues until either the entire order has been fulfilled, or it reaches the specified limit. In the latter case, there are insufficient shares available for sale at or below this limit and so the order is only partially filled. Hence, the overall effect of placing a limit order is that the price is guaranteed to be within the specified limit, but the volume is not.

Any remaining unfulfilled part of the order is then added to the buy side of the limit order book and remains there until it is either (i) matched by a new sell-side order (ii) its time limit expires or (iii) it is removed by the trader. Orders are typically matched based on a first-in / first-out basis for most trading venues; in this instance, any order placed below the current ask price will be placed last in the queue for that particular price level. A worked example of a limit order is given in figure 2.

In addition to limit orders it is possible to place a *market order* which is specified by the volume $n$ of shares that we wish to buy. Essentially, this means that the trader will buy all $n$ shares now at whatever prices are available on the sell side of the limit order book. So first the trader buys shares at the current ask price. If the volume of the buy order exceeds the volume available at the current ask price, the trader will continue fulfilling the order at the next best price and so on. Effectively, a market order is hence a limit order where the limit is $+\infty$. A worked example of a market order is given in figure 3.
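The matching behaviour for buy orders can be sketched as a walk up the sell side of the book. This is a simplified illustration under stated assumptions: the function name `match_buy_order` is hypothetical, the book is a dictionary from integer tick prices to volumes, queue priority within a price level is ignored, and the book itself is not modified.

```python
def match_buy_order(asks, volume, limit=float("inf")):
    """Match a buy order against the ask side of the book.

    Returns (fills, remaining), where fills is a list of (price, volume) trades
    and remaining is the unfilled volume. A market order is treated as a limit
    order whose limit is +infinity.
    """
    fills = []
    remaining = volume
    for price in sorted(asks):  # best (lowest) ask first
        if remaining == 0 or price > limit:
            break
        traded = min(remaining, asks[price])
        if traded > 0:
            fills.append((price, traded))
            remaining -= traded
    return fills, remaining

asks = {1001: 100, 1002: 200, 1003: 300}

# Limit order: the price is guaranteed, but the volume is not.
limit_fills, resting = match_buy_order(asks, 400, limit=1002)
print(limit_fills, resting)

# Market order: the volume is filled at whatever prices are available.
market_fills, unfilled = match_buy_order(asks, 400)
print(market_fills, unfilled)
```

In the limit-order case, the remaining volume would rest on the buy side of the book at the limit price; in the market-order case, the order consumes volume at progressively worse prices.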

Notice that for both the limit order and the market order, a large volume affects the current ask price. As the volume at one price level is exhausted, the ask price increases to the next level where there is non-zero volume. Hence, ceteris paribus, a large-volume buy order may have a large impact on the market, and the mid-price (a proxy for the current stock price) correspondingly increases.

Now that the limit order book has been explained, let's return to the problem of order execution. The goal is to buy a known volume $V$ of shares within a given time window $[0,T]$. This is typically referred to as the parent order (or meta order). At each time step $0\leq t < T$ the trader can place a limit order, remove an existing limit order, or place a market order, tranching the meta order into smaller parts (child orders) so as to minimize market impact. As the trader reaches the end of the execution timeframe, they can make use of more aggressively priced orders to complete their order, potentially at a higher cost.

How can a trader decide which action to take at each time step? Electronic markets release in real-time the contents of the limit order book and the trader can use this data as a basis for their decisions. Such *market micro-structure data* comes in two different resolutions which are referred to as level I and level II data respectively. Level I data includes the current bid price and associated volume, the current ask price and associated volume and the price and volume of the last transaction. Level II data includes more details about the contents of the limit order book; for example, it might include the top ten current bid and ask orders and their associated volumes.

It's clear that this market micro-structure data contains clues about when it might be a good idea to place an order. For example, if the ask-price is decreasing over time, it might be worth using this *momentum* signal to delay buying shares. Similarly, if there is a lot more volume on the sell side than the buy side of the limit order book then this gives an insight into the current levels of supply and demand and this may similarly affect the decision to execute an order at this time or not. In addition, the time stamp and volume already executed should feed into the decision. If time is running out, the trader needs to place more aggressive orders to fulfil their obligation.

In this section we provide a brief recap of reinforcement learning (RL). RL is concerned with an agent that is interacting with an environment. At each time-step $t$, the state of the environment is captured in a state vector $\mathbf{s}_{t}$. The agent observes this state and chooses an action which is parameterized by the vector $\mathbf{a}_{t}$. Taking an action triggers two events. First the state changes to a new state $\mathbf{s}_{t+1}$ via the stochastic *transition function* $Pr(\mathbf{s}_{t+1}|\mathbf{s}_{t}, \mathbf{a}_{t})$. Second, a reward $r_{t}$ may be issued to the agent, where this reward depends on the unseen *reward function* $Pr(r_{t}|\mathbf{s}_{t}, \mathbf{a}_{t})$. The basic RL setup is shown in figure 4.

At any time $t'$ the agent might wish to maximize the total sum of future rewards $\sum_{t=t'}^{T}r_{t}$. However, rewards that happen sooner in time are often considered more important, and so instead it maximizes the discounted sum of rewards $\sum_{t=t'}^{T}\gamma^{t-t'}r_{t}$. Here $\gamma\in(0,1]$ controls how the rewards decrease in importance as they stretch into the future. So the goal of reinforcement learning is to learn how to choose actions that maximize the sum of the future discounted rewards.
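The discounted sum is easy to compute by working backwards through the rewards. The following is a minimal sketch; the function name `discounted_return` is our own, and the list `rewards` is assumed to hold $r_{t'},\ldots,r_{T}$ in time order.

```python
def discounted_return(rewards, gamma):
    """Compute sum_{t=t'}^{T} gamma^(t - t') * r_t for rewards listed from t' to T."""
    total = 0.0
    # Accumulate from the final reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # undiscounted sum
```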

Reinforcement learning is challenging for a number of reasons ranging from practical considerations and design choices to inherent limitations of the RL framework. First, the agent does not know either the transition function or the reward function and it must either implicitly or explicitly learn these. Second, these functions are stochastic, and so it may take a lot of experience to understand them. Third, the reward for an action may be temporally very distant from the action that caused it. This is known as *the temporal credit assignment problem*. For example, a win in chess may have been largely due to a brilliant move (action) that was made much earlier in the game, yet its effect is only observed through the reward (winning the game) at the end.

Finally, reinforcement learning algorithms must balance *exploration* and *exploitation*. On the one hand, if the agent does not explore the state space and try different actions, it cannot get enough experience to learn a good strategy. On the other, once it has figured out how to receive a respectable reward, it might want to exploit this knowledge rather than explore the other regions of the state-action space. A trade-off between these two tendencies is inherent in any reinforcement learning algorithm.

*Model-based methods* try to predict what the next state and/or reward will be (i.e., the transition function and the reward function), so that they can look into the future and make sensible decisions that will ultimately result in high cumulative rewards. In contrast, *model-free methods* do not build a model of the environment or reward, but just directly map states to actions. Model-free methods can be divided into *policy-based methods* which directly predict a probability distribution over the possible actions from the state and *value-based methods* which compute the relative value of every possible state-action pair and hence indirectly specify the best action for any state.

The Aiden system described in this article is a policy-based model-free method and so it aims to take the state $\mathbf{s}_{t}$ and predict a probability distribution $Pr(\mathbf{a}_{t}|\mathbf{s}_{t}, \boldsymbol\theta)$ over which action $\mathbf{a}$ to take. Since the state space is high-dimensional and data is very limited, Aiden approximates this mapping using a neural network, with parameters $\boldsymbol\theta$. The goal of learning is to ensure that these parameters lead to actions that result in high cumulative rewards.

Hopefully, it is becoming increasingly clear why reinforcement learning is a suitable way to carry out the order execution problem. There is a reward (the average price at which the agent bought the shares), but the agent does not know the extent of this reward until it has completely fulfilled the order. There is a partially observed state which includes the market micro-structure data, the elapsed time, and the remaining volume. Finally, there are a number of actions that can be taken at any time (placing limit orders, removing limit orders, placing market orders). It's clear that these actions affect the state by changing the configuration of the market and depleting the remaining volume.

In this context the goal of the reinforcement learning algorithm is to learn the policy; for a given observed state (market micro-structure, elapsed time and remaining volume), it must learn to output a probability distribution over the possible actions (types of order). The algorithm draws from this distribution to determine what to do next. This in turn changes the state and so on.

In this section we describe the main features of the Aiden reinforcement learning setup: the action space, the state and the reward functions. In the subsequent section we discuss the reinforcement learning algorithm itself.

In practice, Aiden does not directly select the details of each order. Instead, at each time step it chooses between high-level actions that correspond to different levels of aggressiveness as it works through the parent order using child orders. These range from crossing the spread (and so immediately executing some of the order) at one end of the spectrum to doing nothing or removing existing orders at the other. These actions form the input to a system that translates them into concrete limit orders.

Aiden's state is currently composed of several hundred market features and self-aware features. The market features consist of hand-crafted functions that compute quantities of interest from the market micro-structure data. Examples might include measurements of the liquidity, recent price changes, or whether there is an imbalance between the bid and ask volumes. The self-aware features relate to the history of previous actions that Aiden has taken. For example, they might include measurements of how aggressive Aiden has been in recent time steps, and how many shares Aiden still has to execute.

The rewards are chosen so that Aiden optimizes around a core trading objective, such as a benchmark. One such commonly used benchmark to measure performance is the volume-weighted average price (VWAP) of the market for the asset over the whole period. As the name suggests, this is the average price of all transactions in the limit order book, weighted by volume. Consequently, rewards are designed based on the difference between this market VWAP and the actual prices Aiden achieved. Of course, Aiden will not know the market VWAP price until the end of the period and so as is typical in reinforcement learning, the feedback is delayed.
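The VWAP benchmark itself is a simple weighted average. The sketch below computes it for a list of trades and compares it with the prices achieved by an execution. The function name `vwap`, the toy numbers, and the sign convention for the comparison (positive means the buyer paid less than the market average) are assumptions for illustration; the article does not specify the exact form of Aiden's reward.

```python
def vwap(trades):
    """Volume-weighted average price of a list of (price, volume) trades."""
    total_volume = sum(volume for _, volume in trades)
    return sum(price * volume for price, volume in trades) / total_volume

# All trades in the market over the period (toy numbers).
market_trades = [(10.0, 100), (11.0, 300)]
# The subset of trades achieved by our own buy order (toy numbers).
own_trades = [(10.0, 50), (11.0, 50)]

market_vwap = vwap(market_trades)
own_vwap = vwap(own_trades)
# Assumed convention for a buy order: positive when we beat the benchmark.
outperformance = market_vwap - own_vwap
print(market_vwap, own_vwap, outperformance)
```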

Aiden is trained using policy gradient algorithms. As the name suggests, these compute the gradient of the expected discounted reward with respect to the parameters $\boldsymbol\theta$ of the network that takes the state $\mathbf{s}_{t}$ and outputs the policy $Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta)$ over actions $\mathbf{a}_{t}$. The gradient is used to find parameters that give better rewards. In practice, the aim is to maximize the following objective:

\begin{equation}

J[\boldsymbol\theta] = \mathbb{E}\left[\log[Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta)]\Psi_{t} \right], \tag{1}

\end{equation}

where the expectation denotes an empirical average over samples. For the simplest policy gradient algorithms, the function $\Psi_{t}$ might just be the total observed rewards.
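To illustrate equation 1, the sketch below evaluates the objective as an empirical average for a policy with a soft-max output. It only computes the value of $J[\boldsymbol\theta]$ for given network outputs; in practice the gradient with respect to $\boldsymbol\theta$ would be obtained by automatic differentiation. The function names and the use of raw logits as inputs are assumptions made for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the soft-max probabilities."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]

def policy_gradient_objective(logits_batch, actions, psis):
    """Empirical average of log Pr(a_t | s_t, theta) * Psi_t over samples."""
    terms = [
        log_softmax(logits)[action] * psi
        for logits, action, psi in zip(logits_batch, actions, psis)
    ]
    return sum(terms) / len(terms)

# Two samples: network outputs (logits), the action taken, and the weight Psi_t.
logits_batch = [[0.0, 0.0], [2.0, 0.0, 0.0]]
actions = [0, 0]
psis = [2.0, 1.0]
print(policy_gradient_objective(logits_batch, actions, psis))
```

Increasing the log-probability of actions with large positive $\Psi_{t}$ raises this objective, which is what the gradient update exploits.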

Unfortunately, this basic policy gradient algorithm is notoriously unstable and so Aiden uses an *actor-critic* approach (see Sutton and Barto, 2018) to decrease the variance of the learning procedure. Here, the function $\Psi$ is changed so that it measures the difference between the observed rewards and the *value function*, which is essentially a prediction of what the total reward will be given that we are in the current state. The network that produces the policy is known as the actor (since it directly affects the environment) and the network that produces the value function is known as the critic (since it evaluates the actor's choices).

The Aiden architecture mainly consists of fully connected layers. However, in partially observable environments like a financial market we do not expect to observe the complete state of the world at each timestep. Therefore, it is common to add a recurrent layer to help deal with this problem. To this end, Aiden uses a recurrent architecture; at each time step it takes as input the market features, self-aware features and the input from the recurrent connection. From these Aiden produces three outputs. First, it produces a soft-max output with probabilities over the action space (i.e., the actor). Second, it produces a single scalar output representing the value function (the critic), and third it produces a new recurrent vector to be passed to the next time step (figure 5).

Aiden exploits another trick to make learning more stable, in that it uses proximal policy optimization. This method changes the objective function to:

\begin{equation}

J[\boldsymbol\theta] = \mathbb{E}\left[\frac{Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta)}{Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta_{old})}\Psi_{t} \right], \tag{2}

\end{equation}

where the term $\boldsymbol\theta_{old}$ represents the parameters before the update. The method then clips this function to prevent very large changes in the policy (hence making learning more stable). Defining:

\begin{equation}

f[\boldsymbol\theta] = \frac{Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta)}{Pr(\mathbf{a}_{t}|\mathbf{s}_{t},\boldsymbol\theta_{old})}, \tag{3}

\end{equation}

proximal policy optimization maximizes the following surrogate objective:

\begin{equation}

J[\boldsymbol\theta] = \begin{cases}\mathbb{E}\left[\min\left[ f[\boldsymbol\theta], 1+\epsilon\right]\Psi_{t}\right] &\quad \Psi_{t} > 0 \\

\mathbb{E}\left[\max\left[ f[\boldsymbol\theta], 1-\epsilon\right]\Psi_{t}\right] &\quad \Psi_{t} \leq 0,

\end{cases} \tag{4}

\end{equation}

where $\epsilon$ is a pre-defined threshold.
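The effect of clipping can be checked numerically for a single sample. The sketch below follows the commonly used clipped form of proximal policy optimization, in which the probability ratio $f[\boldsymbol\theta]$ is limited to $1+\epsilon$ when $\Psi_{t}$ is positive and to $1-\epsilon$ otherwise, and compares it with the equivalent formulation that takes the minimum of the clipped and unclipped terms. The function names are our own.

```python
def ppo_term(ratio, psi, eps):
    """Per-sample clipped surrogate, written as a case analysis on the sign of psi."""
    if psi > 0:
        # No extra credit for raising the ratio beyond 1 + eps.
        return min(ratio, 1 + eps) * psi
    # No extra credit for lowering the ratio below 1 - eps.
    return max(ratio, 1 - eps) * psi

def ppo_term_clip_form(ratio, psi, eps):
    """Equivalent form: min of the unclipped term and the term with a clipped ratio."""
    clipped = max(1 - eps, min(ratio, 1 + eps))
    return min(ratio * psi, clipped * psi)

for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
    for psi in (-2.0, 3.0):
        print(ratio, psi, ppo_term(ratio, psi, eps=0.2))
```

Once the ratio moves outside the interval $[1-\epsilon, 1+\epsilon]$ in the direction that would improve the objective, the term becomes constant, so the gradient no longer pushes the policy further from $\boldsymbol\theta_{old}$.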

In this section we discuss a few of the challenges of training a production-level system for the order execution problem like Aiden.

**Generality:** The algorithm is required to work in many situations. Hence, the Aiden algorithm only uses input features that can be found in any market. Moreover, different stocks vary in price, liquidity, volatility and other quantities. To this end, the Aiden algorithm must normalize the input features so that the absolute magnitudes of price and volume observed in the market micro-structure data are factored out.

**Simulation:** Reinforcement learning algorithms are notorious for the amount of data that they must consume to learn a successful strategy. Obviously, it's not practical to wait many years for the algorithm to train. Furthermore, we cannot put the algorithm into the real marketplace before it has learned which decisions to make to achieve good performance.

The solution to both of these problems is to build a training environment in which the market can be simulated based on observations of historical trading data. In this way, Aiden can train much faster than real-time and learn sensible policies without risking financial loss. This procedure can be sped up even more by training multiple RL agents who compete with one another in the same simulated market and learn from one another.

In this article we introduced the order execution problem and showed how it could be mapped to a reinforcement learning problem. We then described some features of Aiden, RBC's electronic execution platform. The scope for reinforcement learning in finance is huge since there are often many rapid decisions that need to be made and these need to take into account the present condition of the market. Aiden is one of the first steps in RBC's adoption of these technologies.

Now, can we lead on tackling its ethical and societal implications?

The news is flooded with examples of AI fails: algorithms that favour male job applicants over women, or image recognition software failing to correctly identify people of colour.

**Dr. Foteini Agrafioti**, the Head of Borealis AI and one of the country’s strongest voices on ensuring AI is ethical, is also the co-chair of Canada’s Advisory Council on AI. She led the RBC Disruptors conversation about battling bias in AI with **Dr. Elissa Strome**, Executive Director, Pan-Canadian AI Strategy at CIFAR, and **Dr. Layla El Asri**, Research Team Lead at Borealis AI and formerly Research Manager at Microsoft Research Montréal.

Here are their thoughts on what the scientific community, governments and ordinary citizens can do to confront bias in artificial intelligence, and position Canada as a leader in ethical AI.

Bias has long existed in our society – and so it exists in our data. El Asri sees this as an opportunity. Unlike our own unconscious bias, we can at least uncover bias in an algorithm. To do this, companies need to be auditing their AI for bias every step of the way, as the major labs are now doing. El Asri credited Canadian leaders, such as AI pioneer **Yoshua Bengio**, for developing a will in Canada’s tech community to develop AI in a responsible way.

Right now, artificial intelligence is being developed by a very narrow subset of society: mainly highly-educated men who went to the same schools, and now live in the same cities. Only 18% of AI researchers are women, a fact that Strome called “terrible.” Organizations like CIFAR are working to bring more voices into the development of AI, with initiatives such as the AI for Good Summer Lab, a seven-week training program for undergraduate women in AI.

AI is only as good as the data it’s trained on. “If your data is not representative enough, your model is not going to work,” El Asri said. There needs to be more vigilance in ensuring data is representative — an area where Canada has a homegrown advantage. If you’re working with data collected in a multicultural country like ours, you’re likely working with data that represents different ethnic backgrounds. This kind of data will be essential to building technology that works for everyone, especially when it comes to something like health care.

Right now, it’s really just the tech community and policy-makers talking about issues that are going to transform our society. We need to broaden that perspective, building in consultation with social scientists as an integral part of the development process. A recent CIFAR initiative brought together computer scientists and social scientists for a day to discuss the social, legal and ethical implications of AI. “The computer scientists were so eager to get their advice and insights,” Strome said. Similarly, El Asri noted that at her former employer, Microsoft, the AI and ethics committees are made up of people from different disciplines, including anthropologists and historians.

“There’s a lot of fear and misunderstanding and myths about AI,” Strome said. Over the next few years, it’s going to be critical to bring the public into the AI conversation. People need to be aware of the positive implications, as well as the risks, that AI will have on their lives. The better the next generation understands AI and its societal and ethical implications, the better prepared they’ll be to ask tough questions of their leaders. Agrafioti suggested that Canadian culture is particularly attuned to ensuring fairness, casting a critical eye on technology before implementing it. Our balance of technical expertise and social values is exactly what’s needed to make sure the product that gets to market is ethical.

AI has been advancing much faster than any government can regulate it — so it was big news this week when the OECD adopted a set of AI principles, which set values-based standards for developing AI. Our leaders have an incredibly important role to play in developing policy and regulations around the use of AI, both domestically and internationally. Strome noted that Canada’s solid international reputation could go a long way in urging the world to play catch-up. Last summer, Prime Minister Trudeau and President Macron announced a joint Canada-France initiative on an International Panel on AI to support and guide the responsible adoption of AI, grounded in human rights. The first symposium will be in Paris this fall.

Solving bias in machines will take a human touch — and there’s no country better positioned than Canada to take the reins.

Facial recognition technology is now a part of our daily lives, personalizing our services and making identity verification easier. Yet a lack of clear restrictions on its usage creates ambiguity for Canadian businesses that are constantly seeking data insights to drive growth and maintain relevance with consumers in a platform-based world.

The pandemic has fuelled an explosion in the use of video, as we try to stay connected during the global lockdown. Workers are using Zoom and Webex daily, families are catching up over FaceTime or Skype, and we’re turning ever more frequently to social media like Instagram and TikTok for entertainment. Our faces are travelling everywhere, even when we’re not.

At the same time, Artificial Intelligence is becoming ever present in our lives as we spend more time online; sending us shopping and podcast recommendations, predicting our upcoming bills, and learning which shows we like to watch. Our faces have become a central part of this data wave, as we teach our phones to recognize us, sort our photos and even interpret our emotions. But facial recognition creates a different sort of data tool from web traffic and credit card histories – one that can assess identities, behaviours and social interactions. We’re no longer anonymous, whether sitting at our computers or taking a walk downtown.

When paired with AI, facial recognition offers incredible commercial applications that could increase the personalization of services and reduce friction in the verification of payments, health records or even voting. How will Canadian firms – big and small – choose to employ the potential of facial recognition, as we all strive to leverage technology and consumer insights? Are there clear regulations on the use of this data? On these questions, we don’t operate in a vacuum; the technology is being developed and the data put to use in various ways around the world. The drive for innovation in this space will test our resolve to ensuring AI is used for good.

To do this right, Canadian businesses should avoid working in isolation. Canada is home to the world’s leaders in developing ethical AI. It’s here that the Montreal Declaration for the Responsible Development of AI was signed, the Privacy by Design certification was developed, and CIFAR’s AI & Society program was born. In this spirit, RBC and Borealis AI have launched RESPECT AI, a hub for firms to gain practical solutions for the responsible adoption of AI.

The number of global patents referencing “facial recognition” stands at 1,617, with over 100 new patents so far in 2020. Tech giants Google, Samsung and IBM dominate the filings (Apple ranks 7th). Only 27 of these patents are held by Canadian applicants.

Despite the data revolution, 53% of Canadian companies aren’t using AI to inform their business decisions, and among them, 6 out of 10 have no plans to do so soon. The other half feel that AI is central to their business growth, and most plan to expand their usage over the next 2 years.

Each day, over 100 million hours of video are streamed on Facebook, while more than 95 million uploads are made to Instagram. These images are tagged with names and locations, providing better training data for algorithms. Google is building software that can crawl all social media sites to identify a person’s face (and their associated activity) across all platforms.

Apple FaceID claims a 1 in 1,000,000 chance someone else could unlock your device with their face. Google’s FaceNet achieved 99.63% accuracy against a benchmark image data set, surpassing Facebook’s DeepFace at 97.35%. By comparison, the human eye is accurate 97.53% of the time.

Market for this technology could double within five years

The facial recognition market had revenue of about US$3.2 billion in 2019, with some forecasts calling for it to reach US$7 billion by 2024. Key sectors of growth will continue to be government and security, with rising usage among retail and ecommerce.

As a tool, facial recognition provides rapid identification of an individual. This can help companies provide a personalized experience to consumers, reduce friction points on verification to access secure materials, or assist law enforcement by rapidly identifying possible suspects from video.

Similar image recognition technology is already used to reduce pedestrian accidents in cities by monitoring traffic patterns. It’s also being applied in agriculture to distinguish weeds from crops for precision pesticide use. Yet, we’re likely most accustomed to using facial recognition on our phones to verify banking or email passwords, or to automatically sort our photos.

Increasing confidence in the accuracy of this technology has furthered its public uses. The US agency NIST estimates that accuracy improved 20-fold between 2014 and 2018, with the failure rate falling from 4.0% to 0.2%.

We can expect the list of potential applications to grow:

- Attendance tracking at school and work could be automated by face scans, and could provide added verification in exam halls and polling stations.
- Loyalty members of a retailer could receive virtual coupons and recommendations by text message upon entering a store.
- Security access to buildings or bank machines using facial recognition could reduce barriers for persons with disabilities in the workplace.

A subset of research is developing measures to counter the misuse of people’s images, such as deep fakes or identity fraud. And liveness detection software is aiming to identify whether an image or video is true to the subject involved; in essence, a good AI that can identify bad AI.

Many countries have no explicit legal or regulatory requirements related to facial recognition within their privacy regimes. In some places, this leaves interpretation open to discretion or abuse by government and business; as such, surveillance has become synonymous with facial recognition.

Here in Canada, we’ve seen pushback when its use has exceeded public comfort. When the RCMP’s association with Clearview AI – a firm with a database of 3 billion photos from Facebook and Instagram – became public, the Mounties had to set limits on its use. When Vancouver police attempted to use driver’s license photos to identify suspects in the Stanley Cup riot in 2011, the privacy commissioner required a court order on future use of such technology.

As this technology develops and pushes the limits of privacy, countries are navigating these challenges in real time. The adoption of Canada’s Digital Charter in 2019 – the federal government’s statement of intentions on digital security – suggests individuals can anticipate increased control over personal data and images under its “control and consent” principles. However, the roadmap remains unclear. In July 2020, 77 Canadian civil society groups called on the Trudeau government to: (i) ban use of the tech by federal law enforcement for surveillance and intelligence, (ii) launch public consultations on use of facial recognition, and (iii) update PIPEDA protections to specifically cover biometric data.

China, with its estimated 626 million surveillance cameras, has perhaps the strictest restrictions on private business use of biometric data. The central government, however, is exempt. Facial recognition is a key tool in its emerging national “social credit” system that scores personal public behaviour and penalizes “bad” practices (e.g. jaywalking or smoking in the wrong spot). These major investments have made China the world’s capital of facial recognition; since 2015, the majority of related patents on facial recognition and surveillance have come from Chinese applicants.

The European Union’s GDPR is the most advanced data privacy regime. It classifies the data harvested from facial recognition technology as biometric, a category that requires explicit consent from the subject prior to its collection.

In the US, four states – Washington, Illinois, Texas and California – have adopted laws on the protection of biometric data including explicit opt-in clauses, while numerous cities have banned the use of facial recognition in public services, including policing. The most recent to do so was Portland, amid civil unrest. However, the Trump administration has sought rulings in federal court to proceed with facial scans at airport entry for all passengers, including non-US citizens. Currently being piloted at Los Angeles International Airport and Dallas-Fort Worth International Airport, facial recognition is used to identify potential criminals, passport fraud and people on no-fly lists.

Facial recognition differs from other biometrics – DNA, fingerprints – for two big reasons. First, it’s easy to collect; your image can be captured on video by anyone, anywhere. Second, it’s increasingly easy to verify against online images and with deep learning tools.

Think about it: how many times have you provided your fingerprint? Now how many images of your face are online? Governments alone have an enormous trove of reliable, labelled images, from your health card to your passport photos. Meanwhile, our penchant for posting images to social media – with our names, friends and locations – has created inadvertent, massive datasets for facial recognition.

A couple of years ago, Google stunned many observers by announcing development of an algorithm to track people’s social media activity across all platforms, simply by following their face. Google can do this with its proprietary “reverse image search” combined with its massive scale in crawling millions of websites at once. The ease of access to people’s faces makes all of this possible.

Despite advances in the technology, facial recognition wears a mask of mistrust, particularly along racial lines. Some of the original facial recognition systems and algorithms were proven to contain ethnic bias with high levels of inaccuracy for non-white faces, due largely to the input data that skewed to white males.

Any application of this technology must appreciate the potential for bias in the underlying data, particularly given its potential to negatively impact people. Trust in these tools remains divided; when asked about personal verification methods to access health records, 58% of White respondents in the US were comfortable using facial recognition. But this figure fell to 50% and 41% among Hispanic and Black respondents, respectively.

Put simply, tread lightly. Misuse of personal information can carry massive reputational and legal risks.

An RBC Borealis AI survey of Canadian businesses revealed their top motivations to invest in AI programs were (i) to reduce costs, (ii) to increase productivity and (iii) to increase sales. This is increasingly relevant amid the current economic recovery, as firms look to leverage any data advantages to create consumer relevance and new revenue.

Tech giants have inherent advantages in developing this technology, and have staked their claims in hundreds of global patents. In turn, they are seeking consumers of these data tools; for example, a retailer looking for information on shoppers who browse, but don’t buy; a restaurateur aiming to track frequency of visits to grant loyalty points; or a construction firm interested in gathering insights on worker activity on job sites.

Comfort with all these uses, however, is not yet widespread. A 2019 Pew research study found that only 36% of US consumers trust tech companies to use facial recognition software responsibly, and just 17% trust advertisers. When individuals are unsure of how their data is being used, firms risk running afoul of privacy and ethical practices.

Any organization or entrepreneur should consider:

- Educating themselves and business leaders on the risks from resources like RESPECT AI.
- Understanding and creating an awareness of the biases that may be present in their practices.
- Conducting due diligence on vendors that supply personal data to their business.
- Supporting business councils, and other advocacy groups, that call for clear, federal regulatory guidance on the use of data from facial recognition in business.
- Making a public commitment to the ethical use of AI and respect for people’s autonomy, by following the principles of Privacy by Design or signing onto the Montreal Declaration.

Uncertainty about how to use AI responsibly could account for the stark divide among Canadian firms adopting it. Six in ten Canadian businesses feel that AI is mostly for larger organizations.

Consumers should have the right to know why and how firms use their likeness, and governments are responsible for ensuring it is done legally. Businesses that engage in facial recognition applications without appreciating the associated ethical questions risk strong blow-back from consumers.

Canada has been a leader in supporting AI for good. How facial recognition technology is deployed will be an important test of adherence to such ideals. The pandemic has accelerated how Canada moves from conversation towards action around digital ethics.

*This article originally appeared as part of the RBC Disruptors series, which offers insights about social, economic and technological trends in an age of disruption.*

Bias is nothing new. In fact, a recent Borealis AI/RBC survey found that 88 per cent of businesses believe bias exists within their own organization. Addressing this is a critical component of building corporate culture. But we have to eliminate bias in our technology, too.

Bias in AI has serious consequences. From wrongful arrests to unfair recruitment policies, a biased algorithm has the ability to negatively impact the freedom, privacy and security of individuals and society as a whole.

AI is not neutral. Bias usually exists because algorithms have been trained using inadequate or biased data or the architectures are skewed towards specific outcomes. As machine learning algorithms are increasingly used to determine important real-world outcomes such as loan approval, pay rates, and parole decisions, the AI community has a responsibility to account for that discrimination. But how?

Listen to our new podcast, led by RBC’s John Stackhouse and featuring Saadia Muzzaffar, entrepreneur and founder of TechGirls Canada; Ruha Benjamin, Associate Professor of African American Studies at Princeton University; and Foteini Agrafioti, Head of Borealis AI and Chief Science Officer at RBC. The discussion offers an enlightening and impassioned view on how society and businesses must tackle bias to ensure a fair, safe and trustworthy approach to AI.

Ruha Benjamin is Associate Professor of African American Studies at Princeton University, Founding Director of the Ida B. Wells JUST Data Lab, and author of the award-winning book *Race After Technology*: *Abolitionist Tools for the New Jim Code* (Polity 2019) and editor of *Captivating Technology* (Duke 2019), among many other publications. Ruha’s work investigates the social dimensions of science, medicine, and technology with a focus on the relationship between innovation and inequity, health and justice, knowledge and power. She is the recipient of numerous awards and fellowships including from the American Council of Learned Societies, National Science Foundation, Institute for Advanced Study and the President’s Award for Distinguished Teaching at Princeton. For more info, please visit ruhabenjamin.com

Saadia Muzzaffar is a tech entrepreneur, author and passionate advocate of responsible innovation, decent work for everyone, and prosperity of immigrant talent in STEM. She is the founder of TechGirls Canada, the hub for Canadian women in science, technology, engineering and math - and co-founder of Tech Reset Canada, a group of business people, technologists, and other residents advocating for innovation that is focused on the public good. In 2017, she was featured in Canada 150 Women, a book about 150 of the most influential and groundbreaking women in Canada. She is honoured to serve on the board of Women's Shelters Canada and the Advisory board for the University of Guelph's Centre for Advancing Responsible and Ethical Artificial Intelligence (CARE-AI).

- Evaluating Lossy Compression Rates of Deep Generative Models by Sicong Huang, Alireza Makhzani, Yanshuai Cao and Roger Grosse
- On Variational Learning of Controllable Representations for Text without Supervision by Peng Xu, Jackie CK Cheung, Yanshuai Cao
- Tails of Lipschitz Triangular Flows by Priyank Jaini, Ivan Kobyzev, Yaoliang Yu and Marcus A. Brubaker

Many members of the research team took the time to virtually attend ICML 2020. Now that the conference content is freely available online, it's a great time to look back and check out some of the highlights. In this post, four Borealis AI researchers describe the papers that they found most interesting or significant from the conference.

*Hrayr Harutyunyan, Kyle Reing, Greg Ver Steeg, Aram Galstyan*

by Peng Xu

**Related Papers:**

- Emergence of Invariance and Disentanglement in Deep Representations
- Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.
- $\mathcal{L}_{\text{DMI}}$: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise

**What problem does it solve?** Neural networks have an undesirable tendency to memorize information about noisy labels. This paper shows that, for any algorithm, low values of the mutual information between weights and training labels given inputs, $I(w : \pmb{y}|\pmb{x})$, correspond to reduced memorization of label noise and better generalization bounds. Novel training algorithms are proposed to optimize for this, and they achieve impressive empirical performance on noisy data.

**Why is this important?** Even in the presence of noisy labels, deep neural networks tend to memorize the training labels. This hurts generalization in general and is particularly undesirable when the labels are noisy. Poor generalization due to label memorization is a significant problem because many large, real-world datasets are imperfectly labeled. From an information-theoretic perspective, this paper reveals the root of the memorization problem and proposes an approach that directly addresses it.

**The approach taken and how it relates to previous work**: Given a labeled dataset $S=(\pmb{x}, \pmb{y})$ for data $\pmb{x}=\{x^{(i)}\}_{i=1}^n$ and categorical labels $\pmb{y}=\{y^{(i)}\}_{i=1}^n$ and learning weights $w$, Achille & Soatto present a decomposition of the expected cross-entropy $H(\pmb{y}|\pmb{x}, w)$:

\[ H(\pmb{y} | \pmb{x}, w) = \underbrace{H(\pmb{y} | \pmb{x})}_{\text{intrinsic error}} + \underbrace{\mathbb{E}_{\pmb{x}, w}D_{\text{KL}}[p(\pmb{y}|\pmb{x}, w)||f(\pmb{y}|\pmb{x}, w)]}_{\text{how good is the classifier}} - \underbrace{I(w : \pmb{y}|\pmb{x})}_{\text{memorization}}. \]

If the labels contain information beyond what can be inferred from inputs, the model may lower its training loss by memorizing the labels through the third term of the above equation. To demonstrate that $I(w:\pmb{y}|\pmb{x})$ is directly linked to memorization, this paper proves that any algorithm with small $I(w:\pmb{y}|\pmb{x})$ overfits less to label noise in the training set. This theoretical result is also verified empirically, as shown in Figure 1. In addition, the information that weights contain about a training dataset $S$ has previously been linked to generalization bounds (Xu & Raginsky), which can be tightened with small values of $I(w:\pmb{y}|\pmb{x})$.
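As a sanity check on the decomposition, the following sketch evaluates each term on a toy discrete problem with a single fixed input, so the conditioning on $\pmb{x}$ can be dropped. The distributions `p_y`, `p_w_given_y` and `f_y_given_w` are made-up numbers chosen for illustration, not values from the paper, and the divergence term is computed against the label distribution given the weights.

```python
import numpy as np

# Toy setting (all numbers are illustrative assumptions).
p_y = np.array([0.7, 0.3])              # label distribution p(y)
p_w_given_y = np.array([[0.9, 0.1],     # p(w | y): rows index y, columns index w;
                        [0.1, 0.9]])    # the "algorithm" mostly copies the label into w
f_y_given_w = np.array([[0.8, 0.2],     # classifier f(y | w): rows index w, columns index y
                        [0.4, 0.6]])

joint = p_y[:, None] * p_w_given_y      # p(y, w), indexed [y, w]
p_w = joint.sum(axis=0)                 # marginal p(w)
p_y_given_w = (joint / p_w).T           # p(y | w), indexed [w, y]

# Left-hand side: expected cross-entropy of the classifier on the training labels.
cross_entropy = -np.sum(joint * np.log(f_y_given_w.T))

# Right-hand side terms.
intrinsic = -np.sum(p_y * np.log(p_y))                                   # H(y)
kl_term = np.sum(p_w[:, None] * p_y_given_w
                 * np.log(p_y_given_w / f_y_given_w))                    # E_w KL[p(y|w) || f(y|w)]
memorization = np.sum(joint * np.log(joint / (p_y[:, None] * p_w)))      # I(w : y)

print(cross_entropy, intrinsic + kl_term - memorization)
```

The two printed numbers agree, and making `p_w_given_y` closer to the identity (more label information stored in the weights) increases `memorization` while the intrinsic term stays fixed.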

To limit $I(w:\pmb{y}|\pmb{x})$, this paper first shows that the information in weights can be replaced by information in the gradients, and then introduces a variational bound on the information in gradients. The bound employs an auxiliary network that predicts gradients of the original loss without label information. Two ways of incorporating predicted gradients are explored: (a) using them in a regularization term for gradients of the original loss, and (b) using them to train the classifier.

**Results**: The authors set up experiments with noisy datasets to see how well the proposed methods perform for different types and amounts of label noise. The simplest baselines are standard cross-entropy (CE) and mean absolute error (MAE) loss functions. The next baseline is the forward correction approach (FW) proposed by Patrini *et al.*, where the label-noise transition matrix is estimated and used to correct the loss function. Finally, they include the recently proposed determinant mutual information (DMI) loss of Xu *et al.*, which is the log-determinant of the confusion matrix between predicted and given labels. The proposed algorithms prove effective on versions of MNIST, CIFAR-10 and CIFAR-100 corrupted with various noise models, and on the large-scale Clothing1M dataset, which has noisy labels, as shown in Figure 2.
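To make the forward correction baseline concrete, here is a minimal NumPy sketch. It assumes the transition matrix is already known, with `transition[i, j]` read as the probability that clean class `i` is observed as noisy class `j`; in practice this matrix has to be estimated. The function names are hypothetical and this is not the code used in the paper's experiments.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the given integer labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

def forward_corrected_loss(logits, noisy_labels, transition):
    """Cross-entropy between noisy labels and predictions pushed through the noise model."""
    clean_probs = softmax(logits)            # model's belief about the clean label
    noisy_probs = clean_probs @ transition   # implied distribution over the noisy label
    return cross_entropy(noisy_probs, noisy_labels)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
noisy_labels = np.array([0, 2, 1, 1])

# Symmetric noise: each label is flipped to one of the other classes with total probability 0.2.
transition = np.full((3, 3), 0.1) + 0.7 * np.eye(3)

loss_plain = cross_entropy(softmax(logits), noisy_labels)
loss_identity = forward_corrected_loss(logits, noisy_labels, np.eye(3))
loss_corrected = forward_corrected_loss(logits, noisy_labels, transition)
print(loss_plain, loss_identity, loss_corrected)
```

With an identity transition matrix the corrected loss reduces to ordinary cross-entropy, which is a quick way to check the wiring.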

*Kei Ota, Tomoaki Oiki, Devesh K. Jha, Toshisada Mariyama and Daniel Nikovski*

by Pablo Hernandez-Leal

**Related Papers:**

- Learning state representation for deep actor-critic control.
- State representation learning for control: An overview.
- Densely Connected Convolutional Networks.

**What problem does it solve?** This paper starts from the question of whether learning good representations for states and using larger networks can help in learning better policies in deep reinforcement learning.

The paper mentions that many dynamical systems can be described succinctly by *sufficient statistics* which can be used to accurately predict their future. However, there is still the question of whether RL problems with an intrinsically low-dimensional state (*i.e.,* with simple sufficient statistics) can benefit from *intentionally increasing* the state dimensionality using a neural network with a good feature representation.

**Why is this important?** One of the major successes of neural networks in supervised learning is their ability to automatically acquire representations from raw data. However, in reinforcement learning the task is more complicated since policy learning and representation learning happen at the same time. For this reason, deep RL usually requires a large amount of data, potentially millions of samples or more. This limits the applicability of RL algorithms to real-world problems, for example, continuous control and robotics where that amount of data may not be practical to collect.

One might assume that increasing the dimensionality of the input would further complicate the learning process of RL agents. This paper argues this is not the case and that agents can learn more efficiently with high-dimensional representations than with lower-dimensional state observations. The authors hypothesize that larger networks (with a larger search space) are one of the reasons agents can learn more complex functions of states, ultimately improving sample efficiency.

**The approach taken and how it relates to previous work**: The area of state representation learning focuses on representation learning where learned features are in low dimension, evolve through time, and are influenced by actions of an agent. In this context, the authors highlight a previous work by Munk *et al.* where the output of a neural network is used as input for a deep RL algorithm. The main difference is that the goal of Munk *et al.* is to learn a *compact* representation, in contrast to the idea of this paper which is learning good higher-dimensional representations of state observations.

The paper proposes an Online Feature Extractor Network (OFENet) that uses neural networks to produce good representations that are used as inputs to a deep RL algorithm, see Figure 3.

OFENet is trained with the goal of preserving a sufficient statistic via an auxiliary task to predict future observations of the system. Formally, OFENet trains a feature extractor network for the states, $z_{o_t}=\phi_o(o_t)$, a feature extractor for the state-action, $z_{o_t,a_t}=\phi_{o,a}(o_t,a_t)$, and a prediction network $f_{pred}$ parameterized by $\theta_{pred}$. The parameters $\{\theta_o, \theta_{o,a}, \theta_{pred}\}$ are optimized to minimize the loss:

$$L=\mathbb{E}_{(o_t,a_t)\sim p,\pi} [||f_{pred}(z_{o_t},a_t) - o_{t+1}||^2]$$

which is interpreted as minimizing the prediction error of the next state.

The authors highlight the need for a network that can be optimized easily and produce meaningful high-dimensional representations. Their proposal is a variation of DenseNet, a densely connected convolutional network whose output is the concatenation of the previous layers' outputs. OFENet uses a DenseNet architecture and is learned in an online fashion, at the same time as the agent's policy, receiving the observation and action as input, as depicted in Figure 4.
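The following NumPy sketch illustrates the two ideas above: DenseNet-style concatenation, which makes the representation grow with every layer, and the auxiliary loss that predicts the next observation. It is a simplified illustration rather than the authors' implementation: each extractor has a single layer with a `tanh` activation, the state-action extractor is fed the state features together with the action, the prediction network is linear, and all names and sizes are assumptions.

```python
import numpy as np

def dense_block(x, weight):
    """DenseNet-style layer: keep the input and append the layer's output to it."""
    return np.concatenate([x, np.tanh(x @ weight)], axis=-1)

def ofenet_features_and_loss(obs, act, next_obs, w_o, w_oa, w_pred):
    """Return state features, state-action features and the next-observation prediction loss."""
    z_o = dense_block(obs, w_o)                                      # state features
    z_oa = dense_block(np.concatenate([z_o, act], axis=-1), w_oa)    # state-action features
    prediction = z_oa @ w_pred                                       # linear prediction network
    loss = np.mean(np.sum((prediction - next_obs) ** 2, axis=-1))    # squared prediction error
    return z_o, z_oa, loss

rng = np.random.default_rng(0)
batch, obs_dim, act_dim, hidden = 16, 3, 2, 8
obs = rng.normal(size=(batch, obs_dim))
act = rng.normal(size=(batch, act_dim))
next_obs = rng.normal(size=(batch, obs_dim))

w_o = rng.normal(scale=0.1, size=(obs_dim, hidden))
w_oa = rng.normal(scale=0.1, size=(obs_dim + hidden + act_dim, hidden))
w_pred = rng.normal(scale=0.1, size=(obs_dim + 2 * hidden + act_dim, obs_dim))

z_o, z_oa, loss = ofenet_features_and_loss(obs, act, next_obs, w_o, w_oa, w_pred)
print(z_o.shape, z_oa.shape, loss)
```

Because every block concatenates its input, the raw observation survives unchanged in the first coordinates of the features, so the higher-dimensional representation never discards information that the original state contained. In a full system the three weight matrices would be trained by gradient descent on `loss` alongside the RL agent.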

**Results**: The paper evaluates 60 different architectures with varying connectivity, sizes and activation functions. The results showed that an architecture similar to DenseNet consistently achieved higher scores than the rest.

OFENet was evaluated with both on-policy (PPO) and off-policy (SAC and TD3) reinforcement learning algorithms on continuous control tasks. With all three algorithms, adding OFENet produced better results than omitting it.

Ablation experiments were performed to verify that just increasing the dimensionality of the state representation is not sufficient to improve performance. The key point is that generating *effective* higher dimensional representations, for example with OFENet, is required to obtain better performance.

*Rob Cornish, Anthony L. Caterini, George Deligiannidis, and Arnaud Doucet*

by Ivan Kobyzev

**Related Papers:**

- SurVAE Flows: Surjections to Bridge the Gap between VAEs and Flows
- A RAD approach to deep mixture models
- Augmented Neural ODEs

**What problem does it solve?** The key ingredient of a normalizing flow is a diffeomorphism (*i.e.,* an invertible, differentiable function whose inverse is also differentiable). To model a complex target distribution, a normalizing flow transforms a simple base measure via multiple diffeomorphisms stacked together. However, diffeomorphisms preserve topology; hence, the topologies of the supports of the base distribution and target distribution must be the same. This is problematic for real-world data distributions, which can have complicated topology (*e.g.,* they can be disconnected, have holes, etc.). The paper proposes a way to replace a diffeomorphic map with a continuous family of diffeomorphisms to solve this problem.

**Why is this important?** It is generally believed that many distributions exhibit complex topology. Generative methods which are unable to learn different topologies will, at the very least, be less sample efficient in learning and potentially fail to learn important characteristics of the target distribution.

**The approach taken and how it relates to previous work**: Given a latent space $\mathcal{Z}$ and a target space $\mathcal{X}$, the paper considers a continuous family of diffeomorphisms $\{ F(\cdot, u): \mathcal{Z} \to \mathcal{X} \}_{u \in \mathcal{U}}$. The generative process of this model is given by

$$z \sim P_Z, \quad u \sim P_{U|Z}(\cdot|z), \quad x = F(z,u),$$

which is illustrated in Figure 5. There is no closed-form expression for the likelihood $p_X(x)$, so training the model requires variational inference. This introduces an approximate posterior $q_{U|X} \approx p_{U|X}$ and constructs a variational lower bound on $\log p_X(x)$ that can be used for training. To increase expressiveness, one can then stack several layers of this generative process.

The authors proved that, under some conditions on the family $F_u$, the model can represent a target distribution well, even if its topology is irregular. The downside, compared to other normalizing flows, is that the model doesn't allow for exact density computation. However, estimates can be computed through the use of importance sampling.
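A toy one-dimensional sketch shows how such an importance sampling estimate can be formed. Everything here is an assumption chosen so the answer can be checked in closed form: the family is $F(z,u) = 2z + u$, the index $u$ is standard normal and independent of $z$, and the proposal `q` is a hand-picked Gaussian rather than a learned approximate posterior.

```python
import numpy as np

SCALE = 2.0  # toy family F(z, u) = SCALE * z + u, a diffeomorphism in z for every u

def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def flow_inverse(x, u):
    """Inverse of F(., u): recover z from x for a given index u."""
    return (x - u) / SCALE

def estimate_density(x, q_mean, q_std, num_samples, rng):
    """Importance sampling estimate of p_X(x) with proposal u ~ N(q_mean, q_std^2)."""
    u = rng.normal(q_mean, q_std, size=num_samples)
    # Joint density p(x, u) = p_U(u) * p_Z(F^{-1}(x; u)) * |dF^{-1}/dx|.
    joint = normal_pdf(u, 0.0, 1.0) * normal_pdf(flow_inverse(x, u), 0.0, 1.0) / SCALE
    weights = joint / normal_pdf(u, q_mean, q_std)
    return weights.mean()

rng = np.random.default_rng(0)
x = 1.0

# For this toy model x = SCALE * z + u is Gaussian with variance SCALE^2 + 1.
exact = normal_pdf(x, 0.0, np.sqrt(SCALE ** 2 + 1.0))

rough = estimate_density(x, q_mean=0.0, q_std=1.5, num_samples=200_000, rng=rng)

# With the exact posterior over u as proposal, every importance weight equals p_X(x).
post_var = SCALE ** 2 / (SCALE ** 2 + 1.0)
ideal = estimate_density(x, q_mean=x / (SCALE ** 2 + 1.0), q_std=np.sqrt(post_var),
                         num_samples=1_000, rng=rng)
print(exact, rough, ideal)
```

The closer the proposal is to the true posterior over $u$, the lower the variance of the estimate, which is why a good approximate posterior matters for evaluation as well as for training.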

**Results**: The performance of the method is demonstrated quantitatively and compared against Residual Flows, on which its architecture is based. On MNIST and CIFAR-10 in particular it performs better than Residual Flow (Figure 6), improving the bits per dimension on the test set by a small but notable margin. On other standard datasets the improvements are even larger and, in some cases, state-of-the-art.

*Florian Wenzel, Kevin Roth, Bastiaan S. Veeling, Jakub Świątkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, Sebastian Nowozin*

by Mohamed Osama Ahmed

**Related Papers:**

- Simple and scalable predictive uncertainty estimation using deep ensembles.
- Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It
- Bayesian deep learning and a probabilistic perspective of generalization.

**What problem does it solve?** The paper studies the performance of Bayesian neural network (BNN) models and why they have not been adopted in industry. BNNs promise better generalization, better uncertainty estimates of predictions, and should enable new deep learning applications such as continual learning. But despite these potentially promising benefits, they remain widely unused in practice. Most recent work in BNNs has focused on better approximations of the posterior. However this paper asks whether the actual posterior itself is the problem, *i.e.,* is it even worth approximating?

**Why is this important?** If the actual posterior learned by BNN is poor then efforts to construct better approximations are unlikely to produce better results and could actually hurt performance. Instead this would suggest that more efforts should be directed towards fixing the posterior itself before attempting to construct better approximations.

**The approach taken and how it relates to previous work**: Many recent BNN papers use the "cold posterior" trick. Instead of using the posterior $p(\theta|D) \propto \exp( -U(\theta) )$, where $U(\theta)= -\sum_{i=1}^{n} \log p(y_i|x_i,\theta)-\log p(\theta)$, they use $p(\theta|D) \propto \exp(-U(\theta)/T)$ where $T$ is a temperature parameter. If $T=1$, then we recover the original posterior distribution. However, recent papers report good performance with a "cold posterior" where $T<1$. This causes the posterior to become sharper around the modes, and the limiting case $T \to 0$ corresponds to the maximum a posteriori (MAP) point estimate.
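The effect of the temperature can be illustrated on a toy problem. The sketch below is not from the paper: it assumes a hypothetical one-dimensional quadratic energy $U(\theta)$ evaluated on a grid (the names `tempered_posterior` and `posterior_std` are ours), and simply normalizes $\exp(-U(\theta)/T)$ for a few temperatures to show the distribution sharpening around the mode as $T$ decreases.

```python
import numpy as np

def tempered_posterior(energy, temperature):
    """Normalize exp(-U/T) over a discrete grid of parameter values."""
    logits = -energy / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical toy energy: quadratic in theta with its minimum (the MAP) at theta = 1.
theta = np.linspace(-4.0, 6.0, 2001)
energy = 0.5 * (theta - 1.0) ** 2

def posterior_std(p):
    """Standard deviation of theta under the grid distribution p."""
    mean = np.sum(p * theta)
    return np.sqrt(np.sum(p * (theta - mean) ** 2))

# Spread of the tempered posterior at T = 1 (original posterior) and two colder settings.
stds = {T: posterior_std(tempered_posterior(energy, T)) for T in (1.0, 0.5, 0.1)}
theta_map = theta[np.argmin(energy)]

for T, s in stds.items():
    print(f"T={T}: posterior std = {s:.3f}")
print("MAP estimate:", theta_map)
```

For this quadratic energy the tempered distribution is Gaussian with variance $T$, so the spread shrinks towards the MAP point as the temperature is lowered.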

This paper studies why the cold posterior trick is needed, that is, why the original posterior learned by a BNN is not good enough on its own. The paper investigates three factors:

- **Inference**: Monte Carlo methods are needed for posterior inference. Could the errors and approximations induced by the Monte Carlo methods cause problems? In particular, the paper studies issues such as inaccurate SDE simulation and minibatch noise.
- **Likelihood**: Since the likelihood function used for training BNNs is the same as the one used for SGD models, this should not be a problem. However, the paper raises the point that "Dirty Likelihoods" are used in recent deep learning models. For example, batch normalization, dropout, and data augmentation may be causing problems.
- **Prior**: Most BNN work uses a Normal prior over the weights. The paper raises the question of whether this is a good prior, which they call the "Bad Prior Hypothesis". Specifically, the hypothesis is that the current priors used for the parameters of BNNs may be inadequate, unintentionally introducing an incorrect bias into the posterior and potentially being too strong and overruling the data as model complexity increases. To study this the authors draw samples of the BNN parameters $\theta$ from the prior distribution and examine the predictive distribution that results with these randomly generated parameters.
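The prior check described in the last item can be sketched as follows. This is a minimal illustration, not the authors' experimental setup: the small ReLU network, the layer sizes, the unit prior standard deviation and the random stand-in inputs are all assumptions made for the example. It draws parameters from an isotropic Normal prior and counts which class the resulting network predicts for each input.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior_mlp(n_in, n_hidden, n_classes, prior_std=1.0):
    """Draw one set of MLP parameters from an isotropic Normal prior."""
    return {
        "W1": rng.normal(0.0, prior_std, size=(n_in, n_hidden)),
        "b1": rng.normal(0.0, prior_std, size=n_hidden),
        "W2": rng.normal(0.0, prior_std, size=(n_hidden, n_classes)),
        "b2": rng.normal(0.0, prior_std, size=n_classes),
    }

def predicted_classes(params, x):
    """Class with the largest logit for each row of x."""
    hidden = np.maximum(x @ params["W1"] + params["b1"], 0.0)  # ReLU layer
    logits = hidden @ params["W2"] + params["b2"]
    return logits.argmax(axis=1)

# Stand-in inputs; in practice these would be the training images.
n_inputs, n_features, n_classes = 1000, 20, 10
x = rng.normal(size=(n_inputs, n_features))

class_counts = []
for i in range(2):  # two independent samples from the prior
    params = sample_prior_mlp(n_features, 50, n_classes)
    counts = np.bincount(predicted_classes(params, x), minlength=n_classes)
    class_counts.append(counts)
    print(f"prior sample {i}: class counts = {counts.tolist()}")
```

A prior that encodes no preference between classes would be expected to give roughly balanced counts; strongly skewed counts indicate that the prior over parameters is informative about the predictions.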

**Results**: The experiments find that, consistent with previous work, the best posteriors are achieved with cold posteriors, *i.e.,* at temperatures $T<1$. This can be seen in Figure 7. While it's still not fully understood why, cold posteriors are needed to get good performance with BNNs.

Further, results suggest that neither inference nor the likelihood is the problem. Rather, the prior seems likely to be, at best, unintentionally and misleadingly informative. Indeed, current priors generally map all images to a single class. This is clearly unrealistic and undesirable behaviour for a prior. This effect can be seen in Figure 8, which shows the class distribution over the training set for two different samples from the prior.

**Discussion**

To date there has been a significant amount of work on better approximations for the posterior in BNNs. While this is an important research direction for a number of reasons, this paper suggests that there are other directions that we should be pursuing. This is highlighted clearly by the fact that the performance of BNNs is worse than that of single point estimates trained by SGD and that, to improve performance, cold posteriors are currently required. While this paper hasn't given a definitive answer to the question of why cold posteriors are needed or why BNNs are not more widely used, it has clearly indicated some important directions for future research.

Foteini Agrafioti, Head, Borealis AI, explains why she believes Aiden, the AI-powered electronic trading platform developed by RBC Capital Markets and Borealis AI, is a scientific milestone for reinforcement learning and AI.

*The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.*

**Jodie Wallis (JW)**:

Very simply put, explainability is about being able to detail how the AI came to the decision that it did in a given scenario, and what the drivers were behind that decision. Being able to explain how decisions are being made has always been important. But as the algorithms become more sophisticated and as AI starts to reach deeper and deeper into our decision-making processes, the need for explainability has become much more acute.

**JW**:

No. And that’s an important distinction. Explainability really comes in when we are using AI to make decisions or recommendations that affect people’s lives in some material way. If an algorithm is being used to make a credit decision on a customer, for example, or to decide who to hire or promote – that is a decision that will require explainability. But if I’m using AI and a recommendation engine to decide which pair of shoes to offer you in an online store, I don’t believe that kind of algorithm necessarily needs explaining.

**JW**:

I think one of the issues with explainability in AI is that it feels overwhelming and limiting at the same time. Many execs and IT leaders worry about the complexity and overhead they will need to create if they must explain all of their new models to numerous stakeholders before launching.

The problem with explainability is that the ease or difficulty with which you produce an explanation varies greatly with the type of algorithm you are using. The deeper the algorithm, the more difficult explainability is; the shallower the algorithm, the easier explainability becomes. And I think this has led some organizations to shy away from using certain types of deep learning algorithms.

**JW**:

It all starts with understanding which decisions and algorithms need to be explained and which do not. Right from the outset of the research, you need to know how important explainability is to the issue you are addressing. Does the action taken have a material impact on the life of an individual or individuals? If it’s not important, then the researcher or developer is free to explore any and all algorithms that might best fit their problem. But if explainability is going to be important, you will likely be limited in the types of algorithms you can use to solve that problem.

When we work with clients, that is almost always our first step – creating a framework to help decision-makers understand which actions require explainability and which do not.

**JW**:

No. And, frankly, I think the market is currently very immature in terms of the technical tools to help manage these aspects of responsible AI.

There are a few different schools of thought as to how you do explainability of deep algorithms. Some researchers and scientists are using reverse engineering techniques where they study the outputs and patterns of a sophisticated deep learning algorithm in order to create a less sophisticated model that is able to simulate those outputs in a more explainable way. The problem is that they are trading off a certain amount of accuracy in order to achieve explainability. But in some circumstances, that may be a worthwhile trade-off to make.

Ultimately, every situation will be different and there are no tools that truly ‘solve’ the explainability challenge. That’s why it is so important that designers and developers understand the need for explainability at the very start of the project – at the point where they can build it into the design.

**JW**:

I think governments and privacy commissioners will need to play a key role in this area. Some are already making inroads. In Europe, for example, the General Data Protection Regulation (GDPR) talks about a person’s right to “meaningful information about the logic” when automated decisions are being made about them. Individual regulators are also looking at the challenge – Singapore’s monetary authority, for example, has published guidelines around explainability. But, currently, regulation is still pretty nascent.

**JW**:

This is about putting explainability at the very start of the process. Before you go and start solving for a particular business problem, you really need to understand the ultimate need for explainability. There’s no use developing a cool and sophisticated new tool if the business is unable to use it because they can’t explain it to stakeholders. So it is critical that developers and designers understand what will require explaining and select their tools accordingly.

**JW**:

I believe business leaders recognize that explainability is one element of their responsible AI strategy and framework. If they are not already thinking about this, I would suggest the business community spend a bit of time creating smart policies around the explainability of algorithms and extending existing frameworks – like their Code of Business Ethics – into AI development.

That will lead to two key value drivers for businesses. The first is that organizations will be freer to develop really interesting value through AI solutions. But, at the same time, they will be contributing to the societal discourse around the need for explainability. And, given the growing importance of the topic to consumers, regulators and oversight authorities, that can only be a good thing.

Jodie Wallis is the managing director for Artificial Intelligence (AI) in Canada at Accenture. In her role, Jodie works with clients across all of Canada’s industries to develop AI strategies, discover sources of value and implement AI solutions. She also leads Accenture’s collaboration with business partners, government and academia and oversees Accenture’s investments in the Canadian AI ecosystem.

However, the pace of this change has brought with it some tough challenges, with recent failures in AI systems leading to mistrust and fear of the technology. In some instances, even among some of the world’s leading technology companies, it has led to a costly removal of AI products from the market. Many businesses are realizing that they need to slow down and invest in more responsible AI product development.

Building AI responsibly comes with numerous tradeoffs. A recent Borealis AI/RBC* survey found that while 77% of those currently using AI believe it is important for businesses to implement it in an ethical way, 93% say they experience barriers such as cost and lack of understanding when attempting to do so.

In putting issues such as fairness, stability, bias and explainability at the top of their agenda, business leaders are investing in a trusted partnership with their clients at the expense of speed to market. Doing the right thing comes at a cost; and in unregulated environments, businesses could be free to take risks that compromise society.

This is why I believe it is so important that the public, businesses and governments are educated about the risks involved in AI technologies and that product owners are held to account for ethical and transparent deployment of these technologies.

One particular area of concern to me is bias. I’ve seen too many examples of companies perpetuating racial or gender discrimination through poorly executed technologies such as facial recognition, and violating human rights through biased algorithms. In fact, our survey found that 88% of companies believe bias exists in their organization, but almost half (44%) do not understand the challenges that bias presents in AI. The most important thing to understand is that this technology is not neutral, and that we are responsible for removing bias at every step.

Companies should review every level of AI development to ensure that any potential bias has been addressed. The different levels could include:

- **Data level**: The data that serves as input to AI models for training may be collected in a way that under-represents certain groups. This is often the problem with face recognition systems, which are sometimes designed for, and thus better able to serve, individuals of the races represented in the training data, though this problem is pervasive and not limited to face recognition.
- **Model level**: Bias can be introduced at any time during the development of an AI model through architecture decisions made by engineers. It is important to note that these biases may be unintentional, yet the impact on specific groups is the same. For instance, a model can be tuned to be more receptive to English accents to the detriment of other languages.
- **Application level**: Even when a completely unbiased model can be engineered, there is still risk in how this AI is applied in the real world. The ethical considerations of the product owners, together with regulation or internal controls (or the lack of them), can play a major role in tipping the scale.

While AI is finding applications across different sectors, each industry is unique and AI’s impact on people’s lives and freedoms can vary widely.

As part of the Royal Bank of Canada (RBC), Borealis AI’s mandate is to advance the field of machine learning by bringing products to life for the financial services industry. Banking is a fundamental aspect of our society and one that plays a major role in helping people achieve financial health and stability. The economic prosperity of our communities is partially the responsibility of this sector. As such, any technological misstep may mean that people don’t reach their full potential - in starting a business, sending children to university, or building a house, for instance. Banks have a contract with society that requires them to be a fair and vested partner in its success.

Borealis AI has the privilege and responsibility of building products that touch the lives of millions of clients. As part of RBC, we are driven by the mission to help our clients thrive and communities prosper, and when it comes to AI this means putting human integrity first.

Over the years, we have developed research practices that ensure that AI is developed responsibly and are supported by RBC’s data and model governance rules. Whether we work with our regulators to understand risks, or we scrutinize our own AI systems with thorough validation, building things the right way means that we routinely trade off speed for considerate and equitable innovation.

It is also our belief that knowledge and opportunity should be shared and, for this reason, we have made the decision to contribute our research, publications and scientific code in this area to the community, as well as share RBC’s approach and expertise in governing and securing AI models which has evolved over decades of practice. Under the RESPECT AI program we are also convening a number of industry and academic leaders who are contributing their experience and offer practical advice on how to approach building AI responsibly.

At a time when technology evolves fast and puts pressure on the ability to govern and secure, it is imperative that we slow our pace down and come together in order to develop robust solutions to the new challenges we are presented with. We hope that RESPECT AI is a step in this direction and that this series opens up some honest dialogue, exchange and sharing of our collective experiences in building AI responsibly.

*\*Data were collected as part of the Maru BizPulse program, operated by Maru/Reports and Maru/Matchbox, which collects and tracks key metrics describing how Canadian businesses are feeling, thinking and behaving. The survey audience was made up of owners and senior decision-makers with Canadian businesses, with a particular focus on small and mid-sized businesses. The survey was fielded in September 2020. All sample was sourced through the Maru/Blue proprietary business panel and partners. A total of 622 responses were collected for this portion of the survey. For more information please visit www.marureports.com.*

However, along with its myriad benefits, AI brings a host of new challenges which require the enhancing of governance processes and validation tools to ensure it is deployed safely and effectively within the enterprise.

With our combined expertise in AI safety, regulation, and model governance, Borealis AI and RBC have been navigating the complexities of this space to develop a robust, comprehensive AI validation process.

Model validation has played an integral role in banks’ traditional data analytics for many years. It helps to ensure that models perform as expected, identifies potential limitations and assumptions, and assesses possible negative impacts. Guidance from the US Federal Reserve dictates that “all model components—inputs, processing, outputs, and reports—should be subject to validation.”[1] Banks in Canada have to adhere to similar regulations[2] and have already developed extensive validation processes to meet these requirements and ensure that they manage model risk appropriately. However, the advent of AI poses a number of challenges for traditional validation techniques.

First, it is costly to validate the large volume and variety of data used by AI models. AI models can make use of significantly more variables—referred to as “features” in AI parlance—than conventional quantitative models, and ensuring the integrity and suitability of these large datasets requires more computational power and more attention from validators. This challenge is particularly acute for AI models that use unstructured natural-language data like news feeds and legal or regulatory filings, which require new validation tools as well as more resources. Moreover, AI modelers often use “feature engineering” to transform raw data prior to training, which further increases the dimensionality of the data that must be validated.

Second, the complexity of AI methodologies makes it more difficult for validators to predict how AI models will perform after they are deployed. Compared to conventional models with relatively few features, it is harder to determine how AI models will behave—and why they behave this way—across the full range of inputs these models could face once deployed. AI models’ complexity can also make it more difficult to explain the reasons behind these models’ behavior, which in turn can make it harder to identify biased or unfair predictions. Ensuring that models do not lead some groups of customers to be treated unfairly is an important part of the validation process.

Finally, the dynamic nature of many AI models also creates unique validation challenges. Conventional models are typically calibrated once using a fixed training dataset before being deployed. AI models, on the other hand, often continue to learn after deployment as more data become available, and model performance may degrade over time if these new data are distributed differently or are of lower quality than the data used during development. These models must be validated in a way that takes their adaptiveness into account and frequently monitored to ensure that they remain robust and reliable.
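One simple way to monitor whether incoming data are distributed differently from the development data is to compare binned feature distributions. The sketch below uses the population stability index (PSI), a metric commonly used in credit risk monitoring. It is our own illustration, not a description of the Borealis AI or RBC tooling, and the function name, bin count and synthetic data are assumptions.

```python
import numpy as np

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a current sample of one feature."""
    # Interior bin edges are reference quantiles, so each bin holds roughly
    # equal reference mass; values outside the reference range fall into the
    # first or last bin.
    interior = np.quantile(reference, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(interior, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(interior, current), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty in one of the samples.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Synthetic stand-ins for a feature at development time and after deployment.
rng = np.random.default_rng(0)
development = rng.normal(size=5000)
stable = rng.normal(size=5000)              # same distribution as development
drifted = rng.normal(loc=1.0, size=5000)    # mean has shifted by one unit

psi_stable = population_stability_index(development, stable)
psi_drifted = population_stability_index(development, drifted)
print(f"PSI, no drift: {psi_stable:.4f}")
print(f"PSI, shifted:  {psi_drifted:.4f}")
```

A PSI near zero suggests the feature is distributed as it was during development, while larger values flag a shift worth investigating. Any alerting threshold is a policy choice rather than something fixed by the metric.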

To meet these challenges, banks must develop new validation methods that are better equipped to deal with the scale, complexity, and dynamism of AI. Borealis AI and RBC’s model governance team have joined forces to research and develop a new toolkit that automates key parts of the validation process, provides a more comprehensive view of model performance, and explores new approaches in areas like adversarial robustness and fairness. This pathbreaking technology is designed from the ground up to overcome the unique challenges of AI. AI safety is central to everything we do at Borealis AI, much like strong governance and risk management practices are central to RBC. This research will help to support faster AI deployment and more agile model development, and it will provide validators with more comprehensive and systematic assessments of model performance.

The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.

Are current privacy laws and regulations enough to ensure data privacy in AI?

**Ann Cavoukian (AC)**:

The challenge with many data privacy laws is that they do not reflect the dynamic and evolving nature of today's technology. In this era of AI, social media, phishing expeditions and data leaks, I would argue that what we really need are proactive measures around privacy.

I think we are just starting to reach the tip of the iceberg on data privacy and protection. The majority of the iceberg is still unknown and, in many cases, unregulated. And that means that, rather than waiting for the safety net of regulation to kick in, we need to be thinking more about algorithmic transparency and designing privacy into the process.

**AC**:

I mean baking privacy protective measures right into the code and algorithms. It’s really about designing programs and models with privacy as the default setting.

During my time as Data Privacy Commissioner for Ontario, I created ‘Privacy by Design’, a framework for helping organizations prevent privacy breaches by embedding privacy into the design process. More recently, I created an extensive module called ‘AI Ethics by Design’ which was specifically intended to deal with the need for algorithmic transparency and accountability. There are seven key principles that underpin the framework, supported by strong documentation to facilitate ethical design and data symmetry. These principles, based on the original privacy by design framework, include respect for privacy as a fundamental human right.

**AC**:

Absolutely. And I’m happy to see that facial recognition tools are routinely banned in various US states and across Canada. Your face is your most sensitive personal information. And, more often than not with these applications, nobody is obtaining individual consent before capturing facial images; there may not even be visible notification that facial recognition tools are being used.

From a privacy perspective, that’s terrible. The point of privacy laws is to provide people with control over their personal data. Applications like facial recognition take away all of that control. All that aside, the technology has also proven to be highly inaccurate and frequently biased; time and again, its use has been struck down in the courts of justice and public opinion.

**AC**:

I think it is absolutely critical to consumers; virtually every study and survey confirms that. Consider what happened early on in the pandemic. A number of governments tried to launch so-called ‘contact tracing’ apps that offered fairly weak privacy controls. Uptake was dismal. Even though the apps could be potentially life-saving for users, few were willing to share their personal information with the government or put it into a centralized repository.

What worked well, on the other hand, was the Apple/Google exposure notification API. In part, it was well adopted because it works on the majority of smart phones in use in North America today. But, more importantly, it is fully privacy protected. I have personally had a number of 1-on-1 briefings from Apple and was highly confident that the API collected no personally identifiable information or geolocation data. Around the world, Canada included, apps based on that API have seen tremendous uptake within the population.

Now, remember, this is for an app that helps people avoid the biggest health crisis to face modern civilization. If they are not willing to trade their privacy for that, you would be crazy to assume consumers would trade it away simply for convenience or service.

**AC**:

Not at all. We need to get away from this view where privacy must be traded for something. It’s not an either/or, zero-sum proposition, involving trade-offs. Far better to enjoy multiple positive gains by embedding both privacy AND AI measures — not one to the exclusion of the other.

I also think the environment is rapidly changing. Consider, for example, the efforts being made by the Decentralized Identity Foundation, a global technology consortium that is working to find new ways to ensure privacy while allowing data to be commercialized. Efforts like these suggest we are moving towards a world where privacy can be embedded into AI by default.

**AC**:

The AI community needs to remember that – above all – transparency is essential. People need to be able to see that their privacy has been baked into the code and program by design. I would argue that public trust in AI is currently very low. The only way to build that trust is by embedding privacy by design.

I think the same advice goes for business executives and privacy oversight leaders: don’t just accept algorithms without looking under the hood first. There are all kinds of potential issues – privacy and ethics related – that can arise when applying AI. As an executive, you need to be sure your organization and people are always striving to protect personal data.

Dr. Ann Cavoukian is recognized as one of the world’s leading privacy experts. Appointed as the Information and Privacy Commissioner of Ontario, Canada in 1997, Dr. Cavoukian served an unprecedented three terms as Commissioner. There she created Privacy by Design, a framework that seeks to proactively embed privacy into the design specifications of information technologies, networked infrastructure and business practices, thereby achieving the strongest protection possible. Dr. Cavoukian is presently the Executive Director of the Global Privacy and Security by Design Centre.


Why should the AI community be focused on model validation?

**Sander Klous (SK)**:

It’s very tempting to run off and create all sorts of futuristic solutions using AI and Machine Learning. But without proper model validation and governance processes, creativity can very quickly turn into risk.

For example, I have seen examples of healthcare organizations launching algorithms that mysteriously strip people of their healthcare allowances; suddenly patients had to jump through hoops to demonstrate they were eligible, basically reversing the burden of proof. There are also examples of fraud detection algorithms at banks that generated too many false positives; fraud departments quickly became overwhelmed without the means to address the problem and it created frustration for customers.

In both cases, these unexpected outcomes should have been uncovered in the model validation process. Especially with these kinds of new technological developments, where trustworthiness is still a fragile concept, positive experiences are the key to success.

**SK**:

Actually, quite the opposite. I’m not worried about AI becoming too smart. I’m worried about AI being too stupid right now. We tend to think AI can do a lot, but often it’s not as smart as we give it credit for. It does not magically solve any issues for you. It requires thorough process, robust validation, governance, controls and risk frameworks – amongst other things – to ensure we remove the ‘stupidity’ from these models. That’s not a future risk, it is something that needs to be addressed right now.

**SK**:

There are actually three lines of defense that normally come into play when we talk about model validation and risk management. The first line consists of the developers and designers themselves. They are the ones that need to be following the controls and considering the implications at the design level. The second line of defense is the risk function; it’s the risk function that needs to develop and drive adherence to those controls. There is also often a third line of defense that is served by an independent validator. These validators may be internal to the company or external advisors, depending on the circumstances. All three lines of defense need to work together to ensure proper model validation.

**SK**:

There are two big challenges. The first is a general lack of global standards around AI model validation. We see lots of different standards bodies working to come up with practical frameworks. But nothing is really mature yet. So it is very difficult for organizations to assess what ‘good’ looks like and then have that validated in the same way they would a financial statement, for example.

The other challenge comes down to process. Typically, the three lines of defense would work in a waterfall approach – design, followed by risk validation, followed by periodic independent auditing. But AI isn’t developed using a waterfall approach. And that means that it is becoming increasingly difficult to draw the lines of separation of duties around the three lines of defense.

**SK**:

That is certainly an ongoing problem and one that will take some time to resolve. As we saw with other similar regulation – like GDPR in Europe – it takes a lot of case law and a lot of collaboration to come up with a set of global standards. That can take years.

In the meantime, most organizations are creating their own set of validation standards and controls, largely based on industry good practices and evolving current standards. The problem is that the environment is continuously evolving and – until we have a set of global standards that can be audited – ‘good’ will continue to be a moving target.

Some organizations are creating their own ecosystems by collaborating with third parties and industry peers where there are common areas that they can all benefit from. For example, manufacturing companies who want to take a combined approach to validating specific parts of the process that other manufacturers would also use. This means that there could be standard validation for key aspects but not one overall industry approach that everyone adheres to.

**SK**:

To be successful at rapidly adopting AI model solutions, the second and third lines of defense need to reinvent themselves. Risk managers and oversight professionals are starting to rethink their approach to model validation in an agile environment. Unfortunately for them, this may result in a reduction in efficiency as validation processes are run and re-run as the models evolve. But this is not a bad thing; risk managers tell me they understand the trade-off between their own efficiency and that of the business. Some would be willing to see their efficiency cut in half just to deliver a 10 percent efficiency boost to their data scientists.

I think we are also starting to see an interesting evolution in the accounting and auditing professions around this issue. KPMG firms have been working with a range of clients to help develop their own internal standards and controls. The experiences we gained in these activities are the foundation of our "AI in Control framework" – it helps organizations build and evaluate sound AI models, driving better adoption, confidence, and compliance. I believe that eventually – once there is a set of global standards – the auditing profession will play an essential role in providing the same type of independent validation they already deliver on financial statements.

**SK**:

I think we all really need to keep challenging each other. You can’t just accept models at face value; you need to stay sharp and have rock-solid processes and frameworks for model validation. This is all new territory and we don’t really know what the ultimate standards and frameworks will look like. And that means it requires more thought and more caution than other areas where the roadmap has already been created.

I would argue that the greatest challenge is doing all of that while still encouraging the type of creativity, innovation and problem solving that drew you to consider an AI solution in the first place. Balancing that need for creativity against the controls of model validation can be extremely difficult.

Sander is a Professor of Big Data ecosystems for business and society at the University of Amsterdam and D&A Leader for KPMG in the Netherlands. He has a PhD in high energy physics and worked for over a decade on a number of projects for CERN, the world’s largest physics institute in Geneva. His best-selling book, 'We are Big Data', was runner-up for the management book of the year award in 2015. His new book, Trust in a Smart Society, is a top selling management book in the Netherlands.


However, properties of distributions constructed with normalizing flows remain less well understood theoretically. One important property is that of *tail behaviour*. We can think about a distribution as having two regions: the *typical set* and the *tails*, which are illustrated in Figure 1. The typical set is what is most often considered; it's the area where the distribution has a significant amount of density. That is, if you draw samples or have a set of training examples, they're generally from the typical set of the distribution. How accurately a model captures the typical set is important when we want to use distributions to, for instance, generate data which looks similar to the training data. Many papers show figures like Figure 2, which showcase how well a model matches the target distribution in regions where there's lots of density.

The tails of the distribution are basically everything else and, when working on an unbounded domain (like $\mathbb{R}^n$), studying them corresponds to asking how the probability density behaves as you go to infinity. We know that the probability density of a continuous distribution on an unbounded domain goes to zero in the limit, but the rate at which it goes to zero can vary significantly between different distributions. Intuitively, tail behaviour indicates how likely extreme events are, and this behaviour can be very important in practice. For instance, in financial modelling applications like risk estimation, return prediction and actuarial modelling, tail behaviour plays a key role.

This blog post discusses the tail behaviour of normalizing flows and presents a theoretical analysis showing that some popular normalizing flow architectures are actually unable to estimate tail behaviour. Experiments show that this is indeed a problem in practice and a remedy is proposed for the case of estimating heavy-tailed distributions. This post will omit the proofs and other formalities and instead will aim at providing a high level overview of the results. For readers interested in the details we refer them to the full paper which was recently presented at ICML 2020.

Let $\mathbf{X} \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $f_\mathbf{X} : \mathbb{R}^D \to \mathbb{R}$. Let $\mathbf{T}$ be an invertible function and $\mathbf{X} = \mathbf{T}(\mathbf{Y})$. Then using the change of variables formula, one can compute the probability density function of the random variable $\mathbf{Y}$:

\begin{align}

f_\mathbf{Y}(\mathbf{y}) & = f_\mathbf{X}(\mathbf{T}(\mathbf{y})) \left| \det \textrm{D}\mathbf{T}(\mathbf{y}) \right| , \tag{1}

\end{align}

where $\textrm{D}\mathbf{T}(\mathbf{y}) = \frac{\partial \mathbf{T}} {\partial \mathbf{y}}$ is the Jacobian of $\mathbf{T}$. Normalizing Flows are constructed by defining invertible, differentiable functions $\mathbf{T}$ which can be thought of as transforming the complex distribution of data into the simple base distribution, or "normalizing" it. The paper attempts to characterize the tail behaviour of $f_\mathbf{Y}$ in terms of $f_\mathbf{X}$ and properties of the transformation $\mathbf{T}$.
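
As a quick numerical check of Equation (1), the sketch below works in 1D with a standard Gaussian base density and the hypothetical choice $T(y) = \sinh^{-1}(y)$, which is an illustration and not a transformation taken from the paper. It compares the change-of-variables density with a finite-difference derivative of the CDF $F_\mathbf{Y}(y) = F_\mathbf{X}(T(y))$, an identity that holds here because this $T$ is increasing.

```python
import math
from statistics import NormalDist

base = NormalDist()  # base density f_X: standard Gaussian (an assumption of this sketch)


def transform(y):
    """Hypothetical invertible map T(y) = asinh(y), chosen only for illustration."""
    return math.asinh(y)


def transform_derivative(y):
    """T'(y) = 1 / sqrt(1 + y^2); in 1D the Jacobian determinant is this derivative."""
    return 1.0 / math.sqrt(1.0 + y * y)


def density_y(y):
    """Equation (1): f_Y(y) = f_X(T(y)) * |T'(y)|."""
    return base.pdf(transform(y)) * abs(transform_derivative(y))


def density_y_finite_difference(y, h=1e-5):
    """Central difference of F_Y(y) = F_X(T(y)), valid because T is increasing."""
    return (base.cdf(transform(y + h)) - base.cdf(transform(y - h))) / (2.0 * h)


for y in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    print(y, density_y(y), density_y_finite_difference(y))
```

The two columns should agree to several decimal places, which is all the formula claims: the density of $\mathbf{Y}$ is the base density evaluated at $T(y)$, rescaled by how much $T$ stretches space locally.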

Before we can do that though we need to formally define what we mean by tail behaviour. The basis for characterizing tail behaviour in 1D was provided in a paper by Emanuel Parzen. Parzen argued that tail behaviour could be characterized in terms of the *density-quantile function*. If $f$ is a probability density and $F : \mathbb{R} \to [0,1]$ is its cumulative distribution function then the quantile function is the inverse, *i.e.*, $Q = F^{-1}$ where $Q : [0,1] \to \mathbb{R}$. The density-quantile function $fQ : [0,1] \to \mathbb{R}$ is then the composition of the density and the quantile function $fQ(u) = f(Q(u))$ and is well defined for square integrable densities. Parzen suggested that the limiting behaviour of the density-quantile function captured the differences in the tail behaviour of distributions. In particular, for many distributions

\begin{equation}

\lim_{u\rightarrow1^-} \frac{fQ(u)}{(1-u)^{\alpha}} \tag{2}

\end{equation}

converges for some $\alpha > 0$. In other words, the density-quantile function asymptotically behaves like $(1-u)^{\alpha}$ and we denote this as $fQ(u) \sim (1-u)^{\alpha}$. (Note that here we consider the right tail, i.e., $u \to 1^-$, but we could just as easily consider the left tail, i.e., $u \to 0^+$.) We call the parameter $\alpha$ the *tail exponent* and Parzen noted that it characterizes how heavy a distribution is with larger values having heavier tails. Values of $\alpha$ between $0$ and $1$ are called light tailed and include things like bounded distributions. A value of $\alpha=1$ corresponds to some well known distributions like the Gaussian or Exponential distributions. Distributions with $\alpha > 1$ are called heavy tailed, *e.g.*, a Cauchy or student-T. More fine-grained characterizations of tail behaviour are possible in some cases but we won't go into those here.
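
To make the tail exponent concrete, here is a minimal sketch that estimates $\alpha$ from the log-log slope of $fQ(u)$ against $(1-u)$ at two points close to $u = 1$. The helper name `tail_exponent` and the evaluation points are assumptions made for this illustration; they are not from Parzen's paper or from ours.

```python
import math
from statistics import NormalDist


def tail_exponent(density, quantile, u1=1 - 1e-4, u2=1 - 1e-6):
    """Estimate alpha as the slope of log fQ(u) versus log(1 - u) near u = 1."""
    log_fq1 = math.log(density(quantile(u1)))
    log_fq2 = math.log(density(quantile(u2)))
    return (log_fq1 - log_fq2) / (math.log(1 - u1) - math.log(1 - u2))


std_normal = NormalDist()

# Exponential(1): f(x) = exp(-x), Q(u) = -log(1 - u), so fQ(u) = 1 - u exactly.
alpha_exponential = tail_exponent(lambda x: math.exp(-x), lambda u: -math.log(1 - u))

# Standard Gaussian: the slope approaches 1 slowly because of a logarithmic factor.
alpha_gaussian = tail_exponent(std_normal.pdf, std_normal.inv_cdf)

# Standard Cauchy: f(x) = 1 / (pi (1 + x^2)), Q(u) = tan(pi (u - 1/2)).
alpha_cauchy = tail_exponent(
    lambda x: 1.0 / (math.pi * (1.0 + x * x)),
    lambda u: math.tan(math.pi * (u - 0.5)),
)

print(alpha_exponential, alpha_gaussian, alpha_cauchy)
```

The exponential and Cauchy estimates land very close to $1$ and $2$ respectively, while the Gaussian estimate sits slightly below $1$ at these evaluation points and only reaches it in the limit.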

Now, given the above and two 1D random variables, $\mathbf{Y}$ and $\mathbf{X}$ with tail exponents $\alpha_\mathbf{Y}$ and $\alpha_\mathbf{X}$, we can make a statement about the transformation $\mathbf{T}$ that maps between them. First, the transformation is given by $T(\mathbf{x}) = Q_\mathbf{Y}( F_\mathbf{X}( \mathbf{x} ) )$ where $F_\mathbf{X}$ denotes the CDF of $\mathbf{X}$ and $Q_\mathbf{Y}$ denotes the quantile function (i.e., the inverse CDF) of $\mathbf{Y}$. Second, we can then show that the derivative of this transformation is given by

\begin{equation}

T'(\mathbf{x}) = \frac{fQ_\mathbf{X}(u)}{fQ_\mathbf{Y}(u)} \tag{3}

\end{equation}

where $u=F_\mathbf{X}(\mathbf{x})$ and $fQ_\mathbf{X}$ and $fQ_\mathbf{Y}$ are the density-quantile functions of $\mathbf{X}$ and $\mathbf{Y}$ respectively.
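
For readers who want the intermediate step, Equation (3) follows from the chain rule together with the inverse function theorem. The derivation below is a sketch that assumes both densities are strictly positive, so that the quantile functions are differentiable:

\begin{align}
T'(x) &= \frac{d}{dx}\, Q_\mathbf{Y}\big(F_\mathbf{X}(x)\big) = Q_\mathbf{Y}'(u)\, f_\mathbf{X}(x), \qquad u = F_\mathbf{X}(x), \\
Q_\mathbf{Y}'(u) &= \frac{1}{f_\mathbf{Y}\big(Q_\mathbf{Y}(u)\big)} = \frac{1}{fQ_\mathbf{Y}(u)}, \\
f_\mathbf{X}(x) &= f_\mathbf{X}\big(Q_\mathbf{X}(u)\big) = fQ_\mathbf{X}(u) \quad \Longrightarrow \quad T'(x) = \frac{fQ_\mathbf{X}(u)}{fQ_\mathbf{Y}(u)} .
\end{align}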

Now, given our characterization of tail behaviour we get that

\begin{equation}

T'(\mathbf{x}) \sim \frac{(1-u)^{\alpha_{\mathbf{X}}}}{(1-u)^{\alpha_{\mathbf{Y}}}} = (1-u)^{\alpha_{\mathbf{X}}-\alpha_{\mathbf{Y}}} \tag{4}

\end{equation}

and now we come to a key result. If $\alpha_{\mathbf{X}} < \alpha_{\mathbf{Y}}$ then, as $u \to 1$ we get that $T'(\mathbf{x}) \to \infty$. That is, if the tails of the target distribution of $\mathbf{Y}$ are heavier than those of the source distribution $\mathbf{X}$ then the slope of the transformation must be unbounded. Conversely, if the slope of $T(\mathbf{x})$ is bounded (i.e., $T(\mathbf{x})$ is Lipschitz) then the tail exponent of $\mathbf{Y}$ will be the same as $\mathbf{X}$, i.e., $\alpha_\mathbf{Y} = \alpha_\mathbf{X}$.
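
The following sketch illustrates this result numerically under assumed choices of distributions: a standard Gaussian source, a Cauchy target (heavier tails) and a wider Gaussian target (same tail exponent). It evaluates the slope from Equation (3) as $T'(x) = f_\mathbf{X}(x) / f_\mathbf{Y}(T(x))$; the function names are hypothetical.

```python
import math
from statistics import NormalDist

source = NormalDist()  # X: standard Gaussian


def cauchy_pdf(y):
    return 1.0 / (math.pi * (1.0 + y * y))


def cauchy_quantile(u):
    return math.tan(math.pi * (u - 0.5))


def slope(x, target_pdf, target_quantile):
    """T'(x) = f_X(x) / f_Y(T(x)) with T(x) = Q_Y(F_X(x)), i.e. Equation (3)."""
    t_x = target_quantile(source.cdf(x))
    return source.pdf(x) / target_pdf(t_x)


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
wide = NormalDist(0.0, 3.0)  # Gaussian target with standard deviation 3

to_cauchy = [slope(x, cauchy_pdf, cauchy_quantile) for x in xs]
to_wide_gaussian = [slope(x, wide.pdf, wide.inv_cdf) for x in xs]

print(to_cauchy)         # grows rapidly with x
print(to_wide_gaussian)  # stays near 3, since T(x) = 3x here
```

Mapping a Gaussian to a Cauchy requires a slope that keeps growing as we move into the tail, whereas mapping between two Gaussians only needs a constant slope.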

The above is an elegant characterization of tail behaviour and its relationship to the transformations between distributions but it only applies to distributions in 1D. To generalize it to higher dimensional distributions, we consider the tail behaviour of the norm of a random variable, i.e., $\Vert \mathbf{X} \Vert$. Then the degree of heaviness of $\mathbf{X}$ can be characterized by the degree of heaviness of the distribution of its norm. Using this characterization we can then prove an analog of the above.

**Theorem 3** *Let $\mathbf{X}$ be a random variable with density function $f_\mathbf{X}$ that is light-tailed and $\mathbf{Y}$ be a target random variable with density function $f_\mathbf{Y}$ that is heavy-tailed. Let $T$ be such that $\mathbf{Y} = T(\mathbf{X})$, then $T$ cannot be a Lipschitz function.*

So what does this all mean for normalizing flows which are attempting to transform a Gaussian distribution into some complex data distribution? The results show that a Lipschitz transformation of a distribution cannot make it heavier tailed. Unfortunately, many commonly implemented normalizing flows are actually Lipschitz. The transformations used in RealNVP and Glow are known as affine coupling layers and they have the form

\begin{equation}

T(\mathbf{x}) = \left(\mathbf{x}^{(A)},\ \sigma(\mathbf{x}^{(A)}) \odot \mathbf{x}^{(B)} + \mu(\mathbf{x}^{(A)})\right) \tag{5}

\end{equation}

where $\mathbf{x} = (\mathbf{x}^{(A)},\mathbf{x}^{(B)})$ is a disjoint partitioning of the dimensions, $\odot$ is element-wise multiplication and $\sigma(\cdot)$ and $\mu(\cdot)$ are arbitrary functions. For transformations of this form, we can then prove the following:

**Theorem 4** *Let $p$ be a light-tailed density and $T(\cdot)$ be a triangular transformation such that $T_j(x_j; ~x_{<j}) = \sigma_{j}(x_{<j})\cdot x_j + \mu_j(x_{<j})$. If $\sigma_j(x_{<j})$ is bounded above and $\mu_j(x_{<j})$ is Lipschitz continuous then the distribution resulting from transforming $p$ by $T$ is also light-tailed.*

The RealNVP paper uses $\sigma(\cdot) = \exp(NN(\cdot))$ and $\mu(\cdot) = NN(\cdot)$ where $NN(\cdot)$ is a neural network with ReLU activation functions. The translation function $\mu(\cdot)$ is hence Lipschitz since a neural network with ReLU activation is Lipschitz. At first glance the scale function $\sigma(\cdot)$ is not bounded, because the exponential function is unbounded. In practice, however, it was implemented as $\sigma(\cdot) = \exp(c\tanh(NN(\cdot)))$ for a scalar $c$. This means that, as originally implemented, $\sigma(\cdot)$ *is* bounded above, i.e., $\sigma(\cdot) < \exp(c)$. Similarly, Glow uses $\sigma(\cdot) = \mathsf{sigmoid}(NN(\cdot))$ which is also clearly bounded above.
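
A minimal sketch of such a coupling layer for a 2D input is shown below. The one-hidden-unit ReLU "network", its weights and the constant `C` are all hypothetical stand-ins chosen for illustration; this is not the RealNVP or Glow implementation. It is only meant to show the form of Equation (5) and the effect of clamping the scale with $\tanh$.

```python
import math

C = 2.0  # hypothetical clamp constant c; not a value from the paper


def relu(v):
    return max(0.0, v)


def tiny_net(x_a, w1, b1, w2, b2):
    """One-hidden-unit ReLU network standing in for NN(.)."""
    return w2 * relu(w1 * x_a + b1) + b2


def scale(x_a):
    """sigma(x_a) = exp(c * tanh(NN(x_a))), which is bounded by exp(-c) and exp(c)."""
    return math.exp(C * math.tanh(tiny_net(x_a, 1.5, -0.2, 0.8, 0.1)))


def shift(x_a):
    """mu(x_a) = NN(x_a): a ReLU network, hence Lipschitz."""
    return tiny_net(x_a, -0.7, 0.3, 1.2, 0.0)


def coupling_forward(x_a, x_b):
    """Equation (5): x_a passes through unchanged, x_b is scaled and shifted."""
    sigma = scale(x_a)
    return (x_a, sigma * x_b + shift(x_a)), math.log(sigma)  # output, log|det Jacobian|


def coupling_inverse(y_a, y_b):
    """Invert the layer; y_a equals x_a, so sigma and mu can be recomputed exactly."""
    return (y_a, (y_b - shift(y_a)) / scale(y_a))


(y_a, y_b), log_det = coupling_forward(0.4, -1.3)
print(coupling_inverse(y_a, y_b), log_det)
```

However large the input to the network becomes, the scale never leaves the interval $[\exp(-c), \exp(c)]$, which is exactly the boundedness condition that Theorem 4 relies on.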

Hence, RealNVP and Glow are unable to represent heavier tailed distributions. Not all architectures have this property though and we point out a few that can actually change tail behaviour, for instance SOS Flows.

To address this limitation with common architectures, we proposed *Tail Adaptive Flows* (TAF), which use a parametric base distribution that is capable of representing heavier tails. In particular, we proposed the use of the student-T distribution as a base distribution with learnable degree-of-freedom parameters. With TAF the tail behaviour can be learned in the base distribution while the transformation captures the behaviour of the typical set of the distribution.
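
As a rough illustration of why a learnable degree-of-freedom parameter helps, the sketch below scores a 1D student-T base density on heavy-tailed points for two candidate values of $\nu$. It is a simplified stand-in that uses the identity transformation and a deterministic grid of Cauchy quantiles instead of real data; the function names and candidate values are assumptions, and this is not the training procedure from the paper.

```python
import math


def student_t_logpdf(x, nu):
    """Log-density of a standard student-T distribution with nu degrees of freedom."""
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - (nu + 1.0) / 2.0 * math.log1p(x * x / nu)
    )


def mean_log_likelihood(points, nu):
    """Average log-likelihood of the points under the base density alone."""
    return sum(student_t_logpdf(x, nu) for x in points) / len(points)


# Deterministic stand-in for heavy-tailed data: evenly spaced Cauchy quantiles.
n = 1000
heavy_points = [math.tan(math.pi * ((i + 0.5) / n - 0.5)) for i in range(n)]

# nu = 1 is the Cauchy distribution; nu = 30 is already close to a Gaussian.
scores = {nu: mean_log_likelihood(heavy_points, nu) for nu in (1.0, 30.0)}
print(scores)
```

The small-$\nu$ base scores far better on these points because the near-Gaussian base assigns almost no density to the extreme values. A maximum likelihood fit would therefore push $\nu$ towards heavier tails.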

We also explored these limitations experimentally. First we created a synthetic dataset using a target distribution with heavy tails. After fitting a normalizing flow, we can measure its tail behaviour. Measuring tail behaviour can be done by estimating the density-quantile function and finding the value of $\alpha$ such that $(1-u)^{\alpha}$ approximates it near $u=1$. Our experimental results confirmed the theory. In particular, a normalizing flow with RealNVP or Glow style affine coupling layers was fundamentally unable to change the tail exponent, even as more depth was added. Figure 4 shows an attempt to fit a model based on RealNVP-style affine coupling layers to a heavy tailed distribution (student T). No matter how many blocks of affine coupling layers are used, it is unable to capture the structure of the distribution and the measured tail exponents remain the same as the base distribution.

However, when using a tail adaptive flow the tail behaviour can be readily learned. Figure 5 shows the results of fitting a tail adaptive flow on the same target as above but with 5 blocks. This isn't entirely surprising as tail adaptive flows use a student T base distribution. However, SOS Flows is also able to learn the tail behaviour as predicted by the theory. This is shown in Figure 6.

We also evaluated TAF on a number of other datasets. For instance, Figure 7 shows tail adaptive flows successfully fitting the tails of Neal's Funnel, an important distribution which has heavier tails and exhibits some challenging geometry.

In terms of log likelihood on a test set, our experiments show that using TAF is effectively equivalent to not using TAF. However, this shouldn't be too surprising.

We know that normalizing flows are able to capture the distribution around the typical set and this is where most samples, even in the test set, are likely to be. Put another way, capturing tail behaviour is about understanding how frequently rare events happen and by definition it's unlikely that a test set will have many of these events.

This paper explored the behaviour of the tails of commonly used normalizing flows and showed that two of the most popular normalizing flow models are unable to learn tail behaviour that is heavier than that of the base distribution. It also showed that by changing the base distribution we are able to restore the ability of these models to capture tail behaviour. Alternatively, other normalizing flow models like SOS Flows are also able to learn tail behaviour.

So does any of this matter in practice? If the problem you're working on is sensitive to tail behaviour then absolutely and our work suggests that using an adaptive base distribution with a range of tail behaviour is a simple and effective way to ensure that your flow can capture tail behaviour. If your problem isn't sensitive to tail behaviour then perhaps less so. However, it is interesting to note that the seemingly minor detail of adding a $\tanh(\cdot)$ or replacing $\exp(\cdot)$ with sigmoid could significantly change the expressiveness of the overall model. These details have typically been motivated by empirically observed training instabilities. However our work connects these details to fundamental properties of the estimated distributions, perhaps suggesting alternative explanations for why they were empirically necessary.

Interested in how ownership and copyright protection for media content is impacted by the rise of social media, Xiaohong’s research is focused on image watermarking and image forgery detection. More specifically, his research involves new deep neural network architectures for blind image watermarking based on information-theoretic principles.

He is a third year Ph.D. student at McMaster University, supervised by Dr. Jun Chen. Xiaohong is interested in a career in machine learning because teaching machines complex tasks formerly only accomplished by humans excites him.

The Borealis AI fellowship has provided him with the opportunity to continue his research and broaden its impact. The fellowship also lets him join some of the most talented minds in ML and AI and offers advice on how to take his research and career further.

A fun fact about Xiaohong is that he has a musical side, knowing how to play the accordion.

Check out Xiaohong Liu’s Google Scholar.

Sedigheh is passionate about finding machine learning solutions that could positively impact important domains like healthcare. Her research involves predicting continuous-time Markov chains, with a focus on stochastic processes and simulations with applications to nucleic acid kinetics.

Sedigheh Zolaktaf received her BSc in Computer Engineering from Sharif University of Technology, Iran, in 2013, and MSc in Computer Science from the University of British Columbia, Canada, in 2015. She is currently a Ph.D. candidate in the Artificial Intelligence and Algorithms laboratories at the University of British Columbia. She chose a career in machine learning as it aligned with her interests in mathematics, coding and problem-solving.

The Borealis AI 2020 fellowship has provided support to Sedigheh by recognizing the importance of her work. This award also motivates her to continue her research in the area of stochastic processes and nucleic acid kinetics.

Outside of research, Sedigheh likes to stay active playing basketball and netball.

She is enthusiastic about the future of AI technologies and how they will intertwine with human decision making. Ibtihel Amara is focused on performing efficient analysis of Neural Network uncertainty. More specifically, her research looks into finding efficient uncertainty computation for edge devices. She also believes that ensuring trust and reliability are integrated into AI systems is paramount.

Ibtihel Amara is currently completing her Ph.D. at McGill University at the Center for Intelligent Machines (CIM). The Borealis AI fellowship has given her the opportunity to fully focus on her research goals and provided her with valuable encouragement and support that motivates her to dream big.

Ibtihel's hobbies are harmonious with her passion for technology. She enjoys spending her time gardening and finding ways to enhance urban agriculture with the help of AI.

AI is transforming industries. Whether it’s healthcare or global warming, cyber security or customer service, I’m constantly amazed and excited about the potential for machine learning to help businesses and society address some of today’s biggest challenges.

However, for modern AI to be performed properly and to succeed at scale, researchers and engineers need access to large datasets – the kind that are held by only a few companies worldwide. At the same time, the need to protect sensitive and private information is paramount.

To me, this is where the real opportunity lies. How can we ensure that AI is accessible to all in a safe and ethical manner?

At Borealis AI, we are championing the importance of Responsible AI by researching and developing practical solutions to enable a safer and more ethical adoption of AI technology. Responsible AI covers a wide range of considerations, including privacy, accountability, transparency and bias, and is critical to maintaining trust and accountability.

I recently recorded a panel discussion for Collision from Home where I touched on this opportunity, and the responsibility we have to ensure responsible AI for all. You can check out the Untapped Potential of AI recording in the video above.

Elahe received a BSc degree in Electrical Engineering from the Isfahan University of Technology in 2012, and a MASc in Electronic-Digital Systems from Amirkabir University of Technology (Tehran Polytechnic) in 2016. She is currently a second-year Ph.D. student at Concordia Institute for Information Systems Engineering (CIISE) in Montreal where her studies focus on machine learning and deep learning models in rehabilitation and assistive technologies under the supervision of Prof. Arash Mohammadi.

Outside of her research Elahe enjoys testing out new baking recipes in the kitchen and being out in nature.

Read more about Elahe's work on Google Scholar.

Chenyang completed his bachelor's degree in mathematics at the Northwest Polytechnic in Xi’an, Shaanxi, China before moving to Canada in 2013. He studied at the University of Windsor, Ontario, where he obtained a bachelor's degree in Computer Science before moving to the University of Alberta. Chenyang is currently studying for his PhD in Computer Science while fulfilling his passion for teaching as a teaching assistant at the U of A.

In his spare time, Chenyang enjoys watching documentaries and testing his strategy skills with online gaming. He also enjoys listening to classical music.

Read more about Chenyang's work on Google Scholar.