The AI4Good Lab aims to increase gender diversity in AI by providing training and mentorship for people of marginalized gender identities, including but not limited to transgender women, cisgender women, and nonbinary people, helping them start building their careers in the field. Established in 2017 in Montreal, the Lab has recently expanded to Edmonton and Toronto, welcoming nearly 90 students from across the country to its latest cohort. Participants go through an intense machine learning bootcamp that includes lectures, special topic workshops, mentorship and skill-based tutorials, and work together on self-identified team projects, developing their ideas and concepts for a Demo Day presentation.
As key sponsors of this year’s AI4Good Lab program, Borealis AI researchers and RBC business experts from Canada and the US will take an active role in helping participants to gain new skills and explore career pathways in AI and ML.
“AI4Good Lab is a triple-win for RBC. We get to encourage greater diversity in the field of AI. We help progress projects that deliver social good. And we have an opportunity to meet and work with some of the world’s future female and female-identified AI leaders. We are eager to see what this year’s cohort of students will do,” said Bob Blainey, RBC Fellow & Head, RBC Brain.
AI4Good Lab is one of several programs that RBC and Borealis AI support, aimed at encouraging diversity in AI. The suite of programs includes Borealis AI internships, Let’s SOLVE it and RBC Amplify, offering more opportunities to work with subject-matter experts, advance socially responsible development and adoption of AI, and build careers in the field.
“RBC is dedicated to encouraging diversity, inclusion and equity in everything we do. That means helping diverse talent to understand the opportunities in the field of AI and to figure out what they want to do next. AI4Good Lab gives organizations like RBC an opportunity to help mentor, shape and support a diverse group of students as they explore their future potential,” added Eddy Ortiz, VP, Solution Acceleration & Innovation, RBC.
With our support for the AI4Good Lab, we are building on the long history of collaboration with CIFAR and our partners at RBC. AI4Good has also joined CIFAR’s suite of national training programs, alongside Borealis AI’s Let’s SOLVE it mentorship program for undergraduate students, announced earlier this year.
“It is programs like the AI4Good Lab – initiatives that bring ecosystems together to inspire and encourage diverse talent – that allow us to really unlock the potential of Canada’s ML and AI research. We’re proud of the work we do with CIFAR on this and other similar projects. And we look forward to seeing what this year’s cohorts will contribute to the field of AI and ML with our support,” noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.
“With a new program now expanding to Ontario, it is clear there is great demand for initiatives like the AI4Good Lab right across the country. It is deeply encouraging to see more diversity than ever, including new regions, represented in this year’s program. And it is gratifying to see leading Canadian businesses partner with us to support programs like the AI4Good Lab,” said Dr. Elissa Strome, Executive Director of Pan-Canadian AI Strategy at CIFAR and Advisory Board member at AI4Good Lab.
Machine learning certification is the process by which we ensure that an ML model is robust to one or more of these failures. While ML certification is currently in its infancy with many open questions, it has many potential benefits, such as mitigating the harmful effects of the aforementioned failures, establishing trust in the model, improving explainability, and helping comply with any regulatory and legal requirements. This can be of paramount importance in numerous settings, especially those where ML models have life-changing impact, such as autonomous vehicles and the medical and financial domains. ML certification is a holistic process that should be ingrained into the entire pipeline of an ML project, from the requirements gathering phase to monitoring the deployed model in production. At its core, ML certification can be regarded as filling out the statement below:
“The submitted machine learning model is known to abide by the property of ...(a)... up to a threshold of ...(b)... over the following input range: ...(c)... .”
Here, property (a) in the template above refers to the absence of a failure. For instance, adversarial attacks are a well-studied failure mode in ML models, and the corresponding property of adversarial robustness refers to robustness to these attacks. While in an ideal world we would wish to certify the model for each property against all possible thresholds and input ranges, certifying an ML model is computationally expensive. Therefore, we identify one or more thresholds (b) and one or more input ranges (c) in the template above, making the process more tractable. As a result, we end up with several filled templates, otherwise known as specifications, each with a unique combination of property, threshold and input range against which to certify the ML model. Figure 2 outlines the steps in the ML certification process. Broadly, this process consists of two steps:
In step 1 of the ML certification process, we fill in the blanks (a, b, and c) in the template above, resulting in one or more specifications (filled statements) that we wish to certify an ML model against. Subsequently, in step 2 we verify whether the ML model satisfies the specifications.
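As a concrete, purely illustrative sketch, a specification from step 1 can be captured as a simple record. The class and field names below are our own invention, not part of any certification standard:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Specification:
    """One filled-in certification statement: the model abides by
    `property_name` up to `threshold` over `input_range`."""
    property_name: str                 # (a) e.g. "adversarial robustness"
    threshold: float                   # (b) e.g. epsilon = 0.3
    input_range: Tuple[float, float]   # (c) e.g. (0.7, 1.3)

# The example specification used later in this post.
spec = Specification("adversarial robustness", 0.3, (0.7, 1.3))
print(spec)
```

In practice a project accumulates several such records, one per combination of property, threshold and input range to be verified in step 2.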
The first step of certification consists of determining the property, threshold and input range (collectively known as the specification) against which to certify the ML model. It is a holistic process integrated into the project lifecycle, and should begin very early, ideally during the requirements gathering phase, so that the security requirements of the desired ML model are made as clear as all other requirements. In addition, specifications may need to be added or revised based on any changes during project development.
Determining a specification requires significant subject matter expertise on the use case, the limitations of the ML model architecture (including the parameters and hyperparameters used), and the limitations of the deployment environment (including the types of users the model is exposed to). As a consequence, identifying these specifications, including the careful selection of thresholds, should be done in collaboration with all stakeholders of the project. Currently, there is limited literature on how to identify specifications, including ways of explaining the significance of the chosen threshold and input range.
As an example to motivate the next part of our post, consider a simple case where the ‘model’ is represented by the equation $y = x^2$, and a specification that defines the property of adversarial robustness, which states that a small change in the input should not drastically change the output. More specifically, given a model $M$, input value $x$ and threshold $\epsilon$, this simplistic definition of adversarial robustness states that $\forall x' \text{ s.t. } |x - x'| < \epsilon,\ |M(x) - M(x')| < \epsilon$. In other words, all possible inputs $x'$ within $\epsilon$ of $x$ should result in the model’s output being within $\epsilon$ of its output on $x$. We use a threshold of 0.3 and an input value of 1, which results in the input range of [0.7, 1.3]. Figure 3 provides a visual of the ‘model’, and outlines the bounding box that represents the input and output values for which we wish to certify the property of adversarial robustness.
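Before turning to the formal approaches, a naive way to probe this specification is to sample perturbed inputs and check the robustness condition directly. This is an illustration only (all names below are our own): sampling can find violations, but it can never certify their absence.

```python
def model(x):
    return x ** 2  # the toy 'model' y = x^2

def find_violation(model, x0, eps, num_samples=10_000):
    """Scan a grid of perturbed inputs x' in [x0 - eps, x0 + eps] for
    one where |M(x0) - M(x')| >= eps (a robustness violation).
    Sampling can only find violations; it cannot prove their absence."""
    y0 = model(x0)
    for i in range(num_samples + 1):
        xp = (x0 - eps) + 2 * eps * i / num_samples
        if abs(y0 - model(xp)) >= eps:
            return xp  # counterexample found
    return None

cex = find_violation(model, x0=1.0, eps=0.3)
print(cex)  # 0.7: |M(1) - M(0.7)| = |1 - 0.49| = 0.51 >= 0.3
```

Here the property actually fails: inputs near the edge of the range already move the output by more than ε, which matches the bounding-box picture in Figure 3.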
The second step of certification consists of verifying whether the ML model satisfies the specifications that have been determined in the previous step. In other words, we check if the ML model is robust to the failures identified in the property for the given thresholds and input ranges in this step. This should usually be done after model training, but before the final evaluation of the model on the test set to ensure that we are able to modify the model without introducing bias, since typically the model should not be modified after the final evaluation. There are three main approaches identified in the literature that can verify whether a model satisfies a given specification:
We next provide an overview of each of these approaches.
Main idea: Figure 4 shows the main idea of the constraint-based verification approach. In this approach, the ML model as well as the specification are converted to a set of constraints, often using Boolean logic or another formalism such as mixed integer linear programming. The resulting set of constraints is designed such that finding a solution corresponds to finding an input to the ML model that is within the identified input range but violates the specification (in other words, a counterexample proving that the model does not satisfy the specification).
These constraints are provided to a formal logic solver, which exhaustively searches over the input range provided in the specification to find whether violations exist. If a violation is found, that is proof that the model does not satisfy the specification. However, if the solver is unable to find a violation, that is proof that one does not exist, since the solver exhaustively searches over the input region. While exhaustively searching over the input range requires exponential time with respect to the size of the input region in the worst case, solvers use many optimization techniques to search the input region more efficiently in practice. To get a better understanding of this approach, see our tutorials covering SAT solvers.
Verifying the example specification: To verify the example specification using the constraint-based approach, the model $y = x^2$ as well as the specification of adversarial robustness are first converted to a set of constraints. Let us denote the constraints encoding the model as F and the constraints encoding the specification as G. We provide as input to the formal solver the formula F ∧ ¬G, where the variables are defined over the input range identified in the specification. The solver exhaustively searches over this range for a satisfying assignment. If it finds one, that assignment is an input constituting a successful adversarial attack on the model, proving that the model does not satisfy the property. If the solver does not find a satisfying input to this formula, this is a guarantee that one does not exist.
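To make the F ∧ ¬G formulation concrete, here is a minimal pure-Python stand-in for the solver (our own sketch). A real constraint-based verifier, such as a SAT/SMT or MILP solver, reasons over the continuous range symbolically and exhaustively; the grid search below only illustrates what a satisfying assignment looks like.

```python
EPS = 0.3
X0 = 1.0
Y0 = X0 ** 2  # M(1) = 1 for the model M(x) = x^2

def F(x, y):
    """Constraints encoding the model: y = x^2 (up to float tolerance)."""
    return abs(y - x * x) < 1e-12

def not_G(x, y):
    """Negation of the specification: the output leaves the eps-band."""
    return abs(y - Y0) >= EPS

def solve(lo, hi, steps=100_000):
    """Search for an assignment (x, y) satisfying F AND NOT G over the
    input range [lo, hi] -- i.e. a counterexample to the specification."""
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        y = x * x  # propagate the model constraint
        if F(x, y) and not_G(x, y):
            return x, y
    return None

print(solve(X0 - EPS, X0 + EPS))  # a satisfying assignment exists here
```

A real solver would return the same kind of (x, y) witness, but with a guarantee of exhaustiveness over the continuous range rather than a finite grid.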
Advantages: There are several advantages to this approach. Firstly, formal logic solvers are used in many settings, and there is a lot of literature and tooling available that can be leveraged to verify ML models. Secondly, this approach can provide guarantees, since it exhaustively searches over the entire input space. Finally, since formal logic is a rich language, we can encode a broad range of specifications against which to verify an ML model.
Challenges: There are, however, some key limitations to this approach currently. The formula encoding the ML model and specification can be very large and pose significant scalability and storage issues. Further, in the worst case, this approach requires exponential time with respect to the input range, and can therefore take an exceptionally long time to solve (on the order of many years). Finally, while in theory many types of ML models can be converted to constraints, there is a considerable gap in the available tools for doing so (e.g. no tools exist to convert networks with certain activation functions, such as sigmoid).
Further reading: An introduction to constraint-based verification of machine learning (specifically neural networks) can be found here.
Main idea: Figure 5 shows the main idea of the abstraction-based verification approach. First, an abstraction of the input region identified in the specification is defined using a geometric object. Then, functions known as abstract transformers are used to propagate this geometric region through the ML model, which results in a corresponding output region. This (potentially infinite) output region is searched for any violations of the specification. Because this approach uses an abstraction of the input region and the model, it usually over-approximates the search area. This means that if a violation is found in the output region, it is not guaranteed to be caused by a value in the true input region (identified by the specification), since it may have come from an input in the over-approximated region. In such a case, a refinement of the abstraction is necessary, which increases the computational overhead. However, if no violation is found in the output region, then we can guarantee that the ML model satisfies the specification. Several types of geometric regions are used in this approach, which balance the degree of over-approximation against the computational resources required.
Verifying the example specification: To verify the example specification using the abstraction-based approach, we construct an approximation of the identified input region using a geometric object. This region is propagated through the model to obtain the corresponding abstract output region, which is then checked for violations of the specification. Figure 6 shows the result for three types of geometric regions that can be used to create this abstraction: the interval, zonotope, and polyhedron domains. These regions are ordered by increasing computational cost, which buys increasing precision of the abstraction, as can be seen in the figure. Given the definition of adversarial robustness used in the specification, we can check for violations by identifying whether there are any values in the output region that differ from the expected output by more than ϵ. In other words, we check whether the output region obtained via these geometric regions is a strict superset of the expected output region (which is a square).
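The interval domain, the simplest of the three, can be sketched in a few lines: an abstract transformer for $y = x^2$ maps an input interval to an output interval, which we then compare against the expected output band. This is our own illustrative sketch, not a production abstract interpreter.

```python
def square_interval(lo, hi):
    """Abstract transformer for y = x^2 over an interval [lo, hi].
    The extrema occur at the endpoints, plus 0 if the interval
    straddles zero."""
    cands = [lo * lo, hi * hi]
    if lo <= 0.0 <= hi:
        cands.append(0.0)
    return min(cands), max(cands)

eps, x0 = 0.3, 1.0
in_lo, in_hi = x0 - eps, x0 + eps               # input region [0.7, 1.3]
out_lo, out_hi = square_interval(in_lo, in_hi)  # abstract output region
expected = (x0 ** 2 - eps, x0 ** 2 + eps)       # allowed band [0.7, 1.3]

# If the abstract output region escapes the expected band, the
# specification *may* be violated (over-approximation caveat).
violation_possible = out_lo < expected[0] or out_hi > expected[1]
print((out_lo, out_hi), violation_possible)  # roughly (0.49, 1.69), True
```

Here the abstract output region [0.49, 1.69] clearly escapes the allowed band [0.7, 1.3], so the interval domain flags a possible violation, consistent with the counterexamples found by the other approaches.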
Advantages: Since the propagation is done in one forward pass of the model, abstraction-based verification is much faster than the constraint-based approach. In addition, this approach allows us to balance the trade-off between computation and precision effectively, by first using simpler geometric objects to check whether violations exist, and refining the region with more complex objects as needed.
Challenges: This approach does pose certain challenges. Due to the possible over-approximation, the identified output region can deviate substantially from the true output region, resulting in situations where this approach may not reach a conclusion. Further, the types of specifications it supports are restricted to ones where the input regions are contiguous, so not all specifications can be verified using this approach.
Further reading: An introduction to abstraction-based verification of machine learning models (particularly neural networks) can be found here.
Main idea: Figure 7 provides an outline of duality-based verification. In this approach, verification is viewed as an optimization problem, and a dual function is first constructed. The dual – more specifically, the Lagrangian dual – relates to the original ‘primal’ function in such a way that the decision variables in the primal problem correspond to constraints in the dual problem, and every constraint in the primal problem corresponds to a decision variable in the dual problem. The optimal solution to the dual serves as a bound on the primal optimum (a lower bound, for a minimization primal). Because the dual function is constructed to be simpler to solve, we find its optimal solution and compare that with the threshold provided in the specification.
Verifying the example specification: To verify the example specification using the duality-based approach, we first construct the corresponding dual to the primal problem of verifying the specification. Consider Figure 8, which showcases two possible dual functions. Note that the axes for the primal and dual functions differ ($x$ versus $\lambda$); however, we plot both functions on the same axes for clarity. While the model in the specification is the equation $y = x^2$, the primal and dual functions need not be the same equation; nevertheless, for the sake of simplicity, we assume that both the primal and dual problems here are quadratic. In the left figure, the optimal solution to the dual function is lower than the optimal solution to the primal problem. This situation is known as weak duality, and the gap between the two optima is known as the duality gap. In the figure on the right, the optimal solutions of the primal and dual functions are equal, which is known as strong duality. Ideally, we wish to construct a strong dual in order to get a tighter bound, but depending on the use case, we may allow a duality gap for the sake of finding a simpler dual function.
Once we compute the optimal solution for the dual function, we compare it against the specification. If the optimal solution is below the threshold, the model satisfies the given specification, since the threshold exceeds even the worst-case bound. On the other hand, if the worst-case bound is greater than the threshold, we cannot be sure that the model violates the specification, since there may be a significant duality gap.
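As a worked toy example of this bounding logic (using our own hypothetical primal, not the one in Figure 8): for the convex problem of minimizing $x^2$ subject to $x \ge 1$, the Lagrangian dual works out to $g(\lambda) = -\lambda^2/4 + \lambda$, and any $\lambda \ge 0$ gives a lower bound on the primal optimum of 1.

```python
# Hypothetical primal (our own example): minimize x^2 subject to x >= 1.
# The primal optimum is x = 1 with value 1. The Lagrangian is
# L(x, lam) = x^2 - lam * (x - 1) with lam >= 0, minimized at x = lam / 2,
# giving the dual function g(lam) = -lam**2 / 4 + lam.
def dual(lam):
    return -lam ** 2 / 4.0 + lam

# Weak duality: any feasible lam lower-bounds the primal optimum.
print(dual(1.0))  # 0.75 <= 1

# Maximizing g over a grid recovers the primal optimum exactly
# (strong duality holds for this convex problem): 1.0 at lam = 2.
best = max(dual(l / 100.0) for l in range(0, 401))
print(best)
```

In a verification setting, a bound like `best` would be compared against the specification's threshold exactly as described above.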
Advantages: There are several advantages to the duality-based verification approach. Firstly, it is computationally efficient, especially when the dual function is strongly convex and gradient-based optimization techniques can be used. In addition, this approach allows us to trade computation for precision by controlling the complexity of the dual function as well as the duality gap. Finally, since this approach uses optimization to find a solution, we can leverage algorithms that speed up optimization, such as branch-and-bound techniques.
Challenges: As with other approaches, duality-based verification is not without its challenges. Similar to abstraction-based verification, this approach can over-approximate the solution due to the duality gap, and as a result it can falsely state that there is a violation of the specification when that is not the case. In addition, depending on the choice of solver, this approach may not provide complete guarantees. Finally, we may not be able to construct dual functions for all types of constraints, and may only be able to apply the approach to a limited number of specifications.
Further reading: An introduction to duality-based verification of machine learning (emphasizing neural networks) can be found here.
This concludes our post on machine learning certification, a process that ensures that an ML model is robust to one or more failures. The process consists of first determining the specification – the property, threshold and input range against which to certify the model – and then verifying whether the model satisfies the identified specifications using one of the verification approaches described above.
ML certification is an active area of research, with a rich array of new works that aim to expand the range of properties, as well as the size and types of models, that can be certified. A notable initiative here is the Verification of Neural Networks Competition (VNN-COMP), which aims to bring together the ML verification community by letting researchers submit tools that compete on verifying a benchmark of ML models and specifications. In addition, a recent line of research (such as Adel et al., 2019) aims to either correct the ML model or replace it with another of similar performance that is known to abide by the desired specifications.
However, there are several open problems that need to be addressed in ML certification. Firstly, there are limited systematic ways of determining the specification against which to certify the ML model. Relatedly, there is a gap between the mathematical definitions of specifications required to certify a model and explanations of their real-world significance. For instance, effectively explaining to stakeholders the significance of the threshold and the definition used for a given property is an area that needs significant improvement. Further, due to the dangers of domain shift and drift – where the distribution of input data or the relationship between data and labels alters over time – the ML model may need to be re-certified for some or all specifications periodically, and we currently do not know how to determine the frequency of re-certification, or what that process should entail. Finally, if the verification process fails (i.e. the model is found not to abide by the specification), it is not clear what the best remediation process is. Perhaps the model architecture does not allow it to satisfy both the business and security objectives, in which case a new architecture should be selected instead of modifying the current model. Or perhaps the specification is too stringent and needs to be adjusted. Identifying the root cause of a failure is non-trivial, and requires more investigation.
With governments and regulatory boards providing new guidelines on developing ML models, the landscape of ML certification is constantly shifting. ML certification is important because it helps improve the explainability of a model’s properties, establishes trust in the model, and helps comply with internal and external regulations. More importantly, as firm believers in responsible AI, we see ML certification as simply the right thing to do.
In this blog post we outlined approaches that provide some form of guarantee that a model satisfies the identified specifications. There are other approaches that do not offer any such guarantees, known as testing approaches. These approaches search for counterexamples: violations that show that the model does not satisfy a specification. The search can range from the simple (random fuzzing) to the sophisticated (natural evolutionary strategies). A survey of machine learning testing approaches can be found here.
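As a flavour of the simple end of that spectrum, random fuzzing of our running example can be sketched as follows (our own illustration; real testing tools are far more sophisticated):

```python
import random

def model(x):
    return x ** 2  # the running toy 'model'

def fuzz_for_violations(model, x0, eps, trials=1_000, seed=0):
    """Random fuzzing: sample perturbed inputs uniformly and collect
    any that break the robustness property. No guarantees either way:
    an empty result does not mean the model is robust."""
    rng = random.Random(seed)
    y0 = model(x0)
    return [xp for xp in (rng.uniform(x0 - eps, x0 + eps)
                          for _ in range(trials))
            if abs(model(xp) - y0) >= eps]

cexs = fuzz_for_violations(model, x0=1.0, eps=0.3)
print(len(cexs) > 0)  # True: fuzzing easily finds violations here
```

The contrast with the verification approaches above is exactly the one drawn in the text: fuzzing can demonstrate failure cheaply, but only a verifier can demonstrate its absence.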
While fairness regulations are still an emerging area of AI policy, organizations working with AI in regulated industries like finance, transportation, construction, pharmaceuticals, and healthcare have a vested interest in staying ahead of the curve and building trust in their AI models. For larger enterprises, this typically means hiring AI-fairness experts as part of a dedicated risk-management group, whereas startups will typically opt to train existing AI talent to work with an AI fairness toolkit.
The notion of fairness is difficult to define because it depends on the context of the problem domain. This has a few repercussions. First, there is no consensus on a well-defined mathematical objective over which to optimize. Creating such a metric requires deep understanding of the problem domain, the social context, and the mathematical underpinnings of fairness. Second, policymakers shy away from making broad statements, which means there is no consensus on standards for AI in production. More information on the different definitions of fairness can be found here.
While AI fairness toolkits can help streamline the workflow of AI-fairness experts, we believe their most important use cases are in enabling AI developers without specific fairness expertise to audit their models and mitigate bias. We divide the landscape of existing AI fairness toolkits into three general types:
We observe that the AI fairness toolkit landscape currently lacks automated mitigation toolkits. Automated mitigation is technologically and strategically challenging at present. Specifically, the complexity of the trade-offs among different fairness metrics requires case-specific intervention. This may change in the future as evolving norms and regulations in different enterprise-AI domains mature and reach consensus, thus allowing for domain-specific automated mitigation.
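To make "fairness metric" concrete, here is one common group metric, the demographic parity difference, computed on toy data. The function and data are our own illustration; real toolkits implement many such metrics, which generally cannot all be optimized at once:

```python
def demographic_parity_diff(preds, groups):
    """Difference between the highest and lowest positive-prediction
    rates across groups. 0.0 means all groups receive positive
    predictions at the same rate."""
    rates = {}
    for g in set(groups):
        group_preds = [p for p, gg in zip(preds, groups) if gg == g]
        rates[g] = sum(group_preds) / len(group_preds)
    vals = sorted(rates.values())
    return vals[-1] - vals[0]

# Toy binary predictions (1 = approve) for two groups "A" and "B".
preds  = [1, 1, 0, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
print(demographic_parity_diff(preds, groups))  # 0.75 - 0.25 = 0.5
```

Reducing this number by adjusting the model can move other metrics (e.g. error-rate balance) in the opposite direction, which is precisely why automated mitigation requires case-specific intervention.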
A more subtle gap in the current landscape is the absence of a perfect fit for the needs of working AI developers who are novices in AI fairness. The toolkits geared towards mitigation require a degree of prior fairness expertise to use effectively, while toolkits with didactic and exploratory functionality lack mitigation options. Keeping this target audience in mind, we find different advantages and drawbacks among the leading AI fairness toolkits:
Advantages:
Drawbacks:
Advantages:
Drawbacks:
Advantages:
Drawbacks:
Advantages:
Drawbacks:
Advantages:
Drawbacks:
Advantages:
Drawbacks:
The field of applied AI fairness is rife with deep technical, ethical, and strategic challenges, so it is no surprise that even partial automation of the work of AI fairness experts should be challenging. Still, we believe that even in the short term there is room for innovation in combining the best features of exploratory and exhaustive toolkits.
The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.
Karthik Ramakrishnan (KR):
The broadest definition of Responsible AI is ensuring that the models do what we want them to do, and don’t do what we don’t want them to do. And by models, I don’t just mean AI and ML models; I mean any type of automated decisionmaking tool – from regression systems right the way through to neural networks.
Responsible AI means different things to different stakeholders. Take explainability, for example. It means very different things to your data scientists, your CEO, your Chief Risk Officer, your regulators, and the business team that owns the model. To the business, it’s more about justifiability – can you explain which features combined to trigger the outcome, and justify that outcome in the context of the business decision? To the data scientist, it’s more about understanding which neuron fired at what point in the process. And that means it’s often easy to get caught up in the semantics.
At Armilla, we believe that Responsible AI isn’t a nebulous thing that we ‘eventually’ need to work out. It’s in the everyday practice of thinking about what you are building – thinking about how you ensure it does what it is supposed to do and doesn’t do what it isn’t supposed to do. That level of safety engineering practice needs to be brought in and ingrained into the DNA of the ML development team.
KR:
The challenge is that machine learning and data science are still in their infancy. We’re still defining what responsibility means and what practices we can develop to ensure responsible AI. Data scientists do all sorts of testing, of course – unit testing, regression testing, integration testing and so on. Financial services firms will have groups of model validators to ensure their ML models conform to regulations like the E-23 guideline on model risk.
However, very few data scientists tend to undertake rigorous stress testing for business or regulatory requirements. In traditional software development, you have very established and mature practices around testing and QA. We need those types of established practices around ML, so we can ensure businesses and data scientists are building robust, transparent, and reliable systems. It’s a big gap in the industry’s practices.
KR:
We really came to the idea in a roundabout way. It all started with work we were doing with the Monetary Authority of Singapore. Regulators wanted to understand how they could update existing regulation on model validation. We published a paper that outlined a principles-based approach identifying the types of things you should do when building an AI or ML model. We quickly realized we needed to get more granular, so we then looked at each pillar in much more detail and offered suggestions on what likely tests would look like.
For the banks involved, however, the process was starting to add much more complexity. Risk and compliance teams were doing the testing manually. Production and development times were slowing down. Highly material models could spend upwards of six months in validation. Seeing this first-hand made us realize it was something we could automate.
We bring together the various stakeholders at the beginning of the development process and work with them to specify what they want the model to do and what they don’t want it to do. And then we define those tests that allow us to measure statistically how the model behaves for these various conditions. As your team develops and iterates the model, they can run the various tests and the platform tells them which test cases are passing and which are failing. The platform goes into various scenarios to see how the model is behaving against the business criteria and the technical criteria, giving all of the different stakeholders visibility into how the model is doing.
As a result, everyone knows what data was used, how it was used, how it was tested, and so on, so that businesses can quickly move their models forward into development. It can be a complex set of processes, and we’ve made it a lot more efficient.
KR:
We’re rapidly evolving the platform, and in addition to structured time-sequence data models, we’ll soon have the ability to test vision-based and text-based models. NLP models are a bit more complex to test, but testing of text-based models will become even more important as more and more chatbots and interactive models are developed. We also see demand for responsible AI spreading into other industries like manufacturing. Manufacturers want to know that the models they are developing for processes like QA are as accurate as (or more accurate than) the manual status quo. Establishing trust with the business around AI models is important in almost every sector. That’s why we’re expanding the types of models we deal with all the time.
KR:
There is certainly lots of regulation in the pipeline. But I don’t believe we necessarily need regulation to solve this. What we need is more of the selfregulation that comes from teams who sit down and think hard about what kinds of issues could arise from that system, and then test for it with every version of that model. We’re not there yet. But I believe there will come a time where that kind of approach just becomes second nature in ML model development.
Karthik Ramakrishnan is the co-founder and Chief Product Officer at Armilla AI, based in Toronto, Canada. Prior to founding Armilla, Karthik headed up the Industry Solutions & Advisory activities at Element AI and led the Decision Science, Cognitive and IoT Analytics practice at Deloitte Canada. Karthik holds a Master of Applied Science from the University of Waterloo and an MBA from the Ivey Business School at the University of Western Ontario. He and his co-founders also participated in the Y Combinator accelerator program in spring 2022.
Let us define ‘malicious inputs’ as inputs whose values are outside the generally expected range for the model’s domain. The issue with such inputs is that, in general, even models that are genuinely well-optimized for their domain may experience catastrophic failure on them. Work on robustness typically splits this issue into two kinds of use cases: handling malicious inputs designed by malicious users, and handling inputs that indicate changes in the data-generating environment. In the first use case, known as adversarial robustness, we are concerned with inputs designed to elicit false predictions or unsound decisions from the model. In the second use case, known as robustness to data drift, we are concerned with unexpected changes to the target distribution that make the trained model obsolete.
The study of adversarial robustness is a thriving branch of fundamental research in AI, though the relationship between the formal problem of adversarial robustness and the applied problem of dealing with malicious inputs is complex. Considered formally, the adversarial robustness of a model is defined by its insensitivity to small changes in an input’s value: a model is robust on input x if no possible small change to x can radically change the model’s output.
The motivation for this standard formal definition of robustness comes from the discovery that ordinary neural network models are extremely sensitive to targeted small perturbations of natural inputs (‘adversarial examples’). Within a fundamental research context, studying a model’s adversarial robustness is one promising way to study the relationship of the model’s ‘reasoning’ to human frames of reference. A growing body of research even suggests that training models against adversarial examples improves other measures of interpretability and alignment with human judgment.
All that said, the benefit of this improved human alignment for a model’s reliability or its capacity to handle unexpected inputs is generally difficult to quantify, and adversarial robustness training can seriously cost a model’s test accuracy. What makes robustness to adversarial examples crucial is the direct ability to exploit real-world AI systems via small perturbations in the input.
While it’s important to remember that (despite its name) adversarial robustness doesn’t guarantee robustness to malicious inputs in general, it can deliver certain guarantees against malicious inputs generated with small-perturbation methods. These methods form the basis of the most reliable, cost-effective, and adaptable procedures for crafting malicious inputs, so countering them goes some way towards mitigating cyber threats in general. Furthermore, attacks based on small perturbations are especially worrisome because of their potential long-term utility for an attacker: since sufficiently small perturbations are invisible to human eyes, attackers may be able to continuously manipulate a system (e.g. a video recommendation engine) by infusing seemingly normal inputs with hidden signals that induce a judgment of their choice. For all these reasons, training for adversarial robustness is increasingly necessary for production models despite its limited scope.
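To make the small-perturbation idea concrete, here is a minimal sketch of a sign-gradient perturbation (in the spirit of the Fast Gradient Sign Method) against a toy linear classifier. The model, the numbers, and the `fgsm_perturb` helper are all illustrative assumptions, not part of any toolkit discussed here:

```python
import numpy as np

def fgsm_perturb(x, grad, epsilon):
    """Move each input feature by epsilon in the direction that increases
    the loss (sign of the gradient). A hypothetical minimal illustration,
    not a production attack."""
    return x + epsilon * np.sign(grad)

# Toy logistic classifier: score = w.x, probability p = sigmoid(w.x)
w = np.array([2.0, -1.0, 0.5])
x = np.array([0.1, 0.2, -0.3])
p = 1.0 / (1.0 + np.exp(-w @ x))   # predicted probability
grad = w * (p - 1.0)               # gradient of log-loss w.r.t. x for true label y = 1
x_adv = fgsm_perturb(x, grad, epsilon=0.05)

# The perturbation is tiny per feature, yet chosen to lower the correct-class score.
assert np.max(np.abs(x_adv - x)) <= 0.05 + 1e-12
assert (w @ x_adv) < (w @ x)
```

The point of the sketch is the asymmetry it exposes: each coordinate moves by at most 0.05, an imperceptible change, but every coordinate moves in the single worst direction at once.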
In contrast to adversarial inputs, the problem of data drift concerns ‘natural’ model failure: genuinely unpredictable changes in the domain that generates the data. The shifts in travel, shopping, and other behaviour during Covid are a prime example of data drift, degrading models despite best practices in sampling, training, and testing before production.
Notice that outside of such cases, the intuitive idea of ‘drift robustness’ quickly becomes ill-defined: to the degree that an instance of data drift can be mitigated prior to production, there is not much to distinguish optimizing ‘drift robustness’ from good machine-learning practice at large. There is no method for predicting what is unpredictable, and making the most of what’s predictable is simply the work of machine learning.
While there might not be any such thing as a drift-robust model, this does not mean that there’s no such thing as drift robustness. Rather than being a property of a model or a training practice, robustness to data drift is best seen as an important property of active AI platforms. An organization’s AI platform is robust to data drift to the extent that a well-coordinated team continues overseeing it after production, monitoring for signs of data drift and reserving resources and infrastructure for retraining models and acquiring additional data as needed. It is here that specialized toolkits can play a crucial role, by offering data scientists a suite of automated monitoring methods for detecting and reporting signs of data drift.
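One simple monitor of the kind such toolkits automate is a two-sample statistic comparing a training-time reference window of a feature against a live production window. The sketch below implements the Kolmogorov–Smirnov statistic by hand; the threshold and the feature names are illustrative assumptions, not calibrated values from any product:

```python
import numpy as np

def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of a reference window and a live window. A large gap
    suggests the feature's distribution has drifted."""
    grid = np.sort(np.concatenate([reference, live]))
    cdf_ref = np.searchsorted(np.sort(reference), grid, side="right") / len(reference)
    cdf_live = np.searchsorted(np.sort(live), grid, side="right") / len(live)
    return np.max(np.abs(cdf_ref - cdf_live))

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=2000)    # distribution seen at training time
drifted_feature = rng.normal(1.5, 1.0, size=2000)  # shifted production traffic

assert ks_statistic(train_feature, train_feature) == 0.0
assert ks_statistic(train_feature, drifted_feature) > 0.3  # large gap flags drift
```

In practice a monitoring platform would run such a check per feature on a schedule and page the team when the statistic crosses a calibrated alert threshold.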
Though the two aspects or use-cases of robustness we’ve discussed in this blog post do not have much overlap with one another, they each represent an effort to find a discrete technical niche within the complex real-world problem of dealing with unexpected inputs. Each use-case is therefore closely associated with a family of measurable quantities, and subsequently with a suite of techniques that benefits from partial automation through a toolkit. We end by surveying some of the leading toolkits currently on offer, including Borealis’ own AdverTorch adversarial robustness research framework:
Adversarial Robustness Toolbox (ART)
Optimizing for adversarial robustness is an open research area defined by complex trade-offs between provability, efficiency, coverage, and side-effects. This simply means that businesses need to understand that building and testing robust machine-learning systems will remain a challenge, especially where a high degree of accuracy truly matters. The toolkits that we’ve reviewed provide a selection of generally reliable techniques for enforcing adversarial robustness in a small radius around a model’s training data. As ML and AI become increasingly integral to different facets of society, we must keep the existing risks and limitations in mind.
With the increasing use of ML to power solutions has come intense scrutiny of whether the algorithms used in those solutions are safe for end users. In line with Borealis’ mission towards the responsible and ethical development of ML algorithms, the institute has had a project team dedicated to building tools for validating these very models. In this blog, we detail the full validation pipeline as well as the contributions this project team made to each of the steps in the pipeline.
The validation pipeline, shown in figure 1, comprises four main steps, which we explore in the following sections.
Before a model can even be assessed with regard to safety, we first need to make sure it is in a good state for assessment. If this is not the case, the assessment mechanisms may flag issues incorrectly. For example, if an ML practitioner has failed to account for an imbalanced dataset when training their model, then an adversarial robustness test applied to the resulting model may not even function properly to begin with. One can think of pre-checks as sanity checks to be done before even entering the validation pipeline. These pre-checks apply to both data and model. Examples of good pre-checks can be found in our upcoming pre-checks blog post.
Once pre-checks are complete, a submitted model can be assessed with respect to various safety criteria. A model is assessed against a safety property, usually expressed as a mathematical formula. The equation below shows one such example for adversarial robustness, where we do not want a change to the input x smaller than ε to change the output of the model f by more than δ.
\begin{align}
\text{property} = \|x^\prime - x\| < \epsilon \land \|f(x^\prime) - f(x)\| < \delta \label{eq:robustness} \tag{1}
\end{align}
The assessment itself may come in different forms. At one end of the spectrum is testing. In this scenario, the model is assessed against a property such as the one above and, if it fails the assessment, an example of this failure, known as a counterexample, is returned to the end user. At the other end of the spectrum is certification. Unlike testing, certification methods are capable of returning a signature ensuring that no counterexamples exist. While certification is more thorough than testing, it is also more time-consuming. A full survey of certification methods can be found in our blog ‘Machine Learning Certification: Approaches and Challenges’.
When it comes to properties, there is as much debate in the field about what they should cover in terms of safety considerations as about how they should be expressed. The example given above for adversarial robustness is a handcrafted mathematical translation of desired behaviour. However, much debate surrounds this translation and handcrafting, particularly around whether there is bias in how the expression is obtained in the first place. This is due to the difficulty of taking a qualitative assessment of desired behaviour and forming a representative mathematical expression for a software system to test or verify. One area where this difficulty has been especially evident is fairness. To date, there exist multiple handcrafted definitions of fairness; a few examples include demographic parity and equal opportunity, as detailed in this blog. However, there is still no consensus as to which should be used, and some definitions even conflict with each other. To circumvent the paradigm of handcrafted definitions and their limited applicability, one of Borealis’ biggest contributions in this space has been an attempt to bridge this gap by coming up with a data-driven way to define fairness. More details can be found in our upcoming blog post ‘fAux: Testing Individual Fairness via Gradient Alignment’.
Once a model has been assessed, it is important to relay the results to the end user in a way they can understand. For example, one question domain experts often ask us is: how likely is the failure to happen in practice? To answer this, we leaned on methods from out-of-distribution (OOD) detection. Using these methods, we can take the distribution of counterexamples output from a test and compare it against the training data distribution. A high overlap indicates that the failures are fairly realistic and likely to happen, while a low overlap indicates the opposite. You can find a full survey of OOD methods in our upcoming blog post ‘Out-of-distribution detection I: anomaly detection’.
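One simple way to quantify that overlap is to score each counterexample by its distance from the training distribution. The sketch below uses the Mahalanobis distance, a common OOD heuristic; the `ood_score` helper and the synthetic data are illustrative assumptions, and production validators use richer density models and detectors:

```python
import numpy as np

def ood_score(counterexamples, train_data):
    """Mahalanobis distance of each counterexample from the training mean.
    Higher scores mean the failure is less realistic, i.e. further out of
    distribution."""
    mu = train_data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train_data, rowvar=False))
    diff = counterexamples - mu
    # Per-row quadratic form diff_i^T . cov_inv . diff_i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(5000, 2))
realistic = rng.normal(0.0, 1.0, size=(100, 2))    # failures near the training data
unrealistic = rng.normal(6.0, 1.0, size=(100, 2))  # failures far from it

assert ood_score(realistic, train).mean() < ood_score(unrealistic, train).mean()
```

A validator could then report the average score (or the fraction below a threshold) alongside each failed test, giving the domain expert a rough answer to “how likely is this in practice?”.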
Assessing the model is half the battle; fixing it is the other. One of the biggest issues in model remediation is the lack of an intuitive way to perform the fix itself. Suggested approaches range from retraining the model to changing the architecture. Complicating things further, there is no consistency in how the fix is communicated. In this blog, we address this problem by providing a mechanism to fix models by up-weighting and down-weighting the training points. Even better, the procedure always remains the same: whether the safety criterion is fairness, robustness, or something else, the fix is always provided to the end user in the same language of weights to be applied to the data points to remedy the issue at hand.
Throughout this blog, we’ve shown you the steps involved in building a model validation pipeline. By doing so, our hope is to bring awareness to practitioners everywhere on what validators look at when assessing their models so that safety can be top of mind when developing these solutions in the first place. If you’d like to join such initiatives, be sure to check out our careers page.
IVADO is a Quebec-based collaborative institute focused on transforming new digital intelligence discoveries into concrete applications and benefits for all of society. Driven by a deep conviction that data and algorithms can change the world we live in, IVADO aims to help train future talent, develop leading-edge knowledge, support the co-development of ideas, and build communities of interest around digital intelligence.
At Borealis AI, we are equally convinced of the power of Machine Learning to catalyze positive societal change. And we know that collaboration is the best way to create value from Machine Learning research. That is why we are partnering with IVADO. Through our partnership, we hope to encourage greater academic collaboration, help diversify the Montreal ML talent pool, and make a positive impact on society.
“Our Partnership with IVADO isn’t just about getting closer to the Montreal ML community. It’s about working together to accelerate new discoveries that can benefit all of society,”
noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.
“At Borealis AI, we look forward to working closely with IVADO and its partners to help accelerate Quebec’s digital transformation and create positive social value.”
To learn more about IVADO, click here.
– Elissa Strome, Executive Director of the Pan-Canadian Artificial Intelligence Strategy at CIFAR.
Speaking on a recent RBC LinkedIn Live ‘Diversity in AI’ panel moderated by Natasha Kassim, Director of Diversity and Inclusion at RBC, CIFAR’s Dr. Elissa Strome joined Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI, to talk about the diversity gap in the AI and ML industry, to share ideas to help close the gap, and to explore some of the programs they are creating and leading to drive greater diversity and inclusion in the sector. Here are some of the highlights from that panel discussion.
Given the ubiquitous role AI currently plays in our personal and professional lives, the field of AI must include as many voices and perspectives as possible.
“Canada is a huge and diverse country that offers great diversity, but this also means that not everyone has equal access to AI training and education. We need to start unlocking the field of AI for people from an early age to help encourage diversity in Canada’s future AI talent,”
noted Eirene Seiradaki. Advancing diversity – including geographical diversity – would go a long way to increase the responsible adoption of AI.
“We don’t want bias in our algorithms.”
In 2019, estimates suggested that around one in five tech workers identified as women, while fewer than 2% of Canadian tech workers surveyed identified as Indigenous (1). There's a pressing need to get better at collecting data that captures the scope of the problem, so that the AI ecosystem can develop better and more tailored solutions and measure progress.
“We need to think about how we collect good data around diversity and inclusion in AI,”
argued Elissa Strome.
“Without it, we can't measure the impact of the changes we are making and the initiatives that we have.”
RBC and RBC Foundation have recently announced a transformative donation to CIFAR to help advance the understanding of AI’s impact on society, ensure a more diverse representation of scientific talent across CIFAR's research programs and support research to create a more sustainable environment.
As part of the ongoing partnership, Borealis AI’s new Let’s SOLVE it mentorship and societal impact initiative will become part of a suite of training programs that CIFAR has launched across Canada, called CIFAR’s National AI Training Programs. Like many other top-tier training programs at CIFAR, Let's SOLVE it aims to develop the next generation of machine learning talent. With a focus on diversity, Let's SOLVE it encourages applications from undergraduate students enrolled in a wide variety of Canadian academic institutions and programs, and from groups underrepresented in AI, to solve real problems in their communities using machine learning (ML).
“We've been privileged to partner with wonderful organizations to deliver training programs and help young people develop AI skills and expertise at all levels,”
noted Elissa Strome.
The panel highlighted several programs designed to break down barriers, improve access, and drive inclusion. CIFAR, for example, has partnered with Simon Fraser University on a program called Invent the Future, which targets high-school-aged women. Together with Borealis AI, they also support the AI4Good Lab, which focuses on undergraduate women.
Recognizing the significant barriers facing Indigenous students, CIFAR has recently launched a program developed with the advice of community leaders and scientists aimed at increasing the participation of Indigenous students in the field of AI.
“CIFAR is pleased to cover the full costs of participation of Indigenous students in any of our partnered AI training programs. We are also working with Indigenous and STEM education leaders to develop pilot programs and curricula that incorporate Indigenous knowledge and experience into how we teach AI,”
added Elissa Strome.
Borealis AI is also committed to building diversity and inclusion in the global AI ecosystem. For example, last November, the organization was a Diversity and Inclusion sponsor at the EMNLP 2021 conference and supported the Women in Machine Learning Workshop at the 2021 NeurIPS Conference.
What do you do to improve diversity and inclusion at your organization? It will take a village, and every program, study, initiative, and project can make a real difference, helping shape the future of AI in Canada. The panel wrapped up with the speakers challenging everyone who tuned in to do more to help solve these critical issues and make AI more accessible to all.
If you missed the event, please check out this recording and share it widely to help raise awareness and advance diversity and inclusion in AI.
References
(1) Lamb, Creig; Vu, Viet; and Zafar, Asher (2019). "Who Are Canada’s Tech Workers?" Brookfield Institute for Innovation + Entrepreneurship.
At Borealis AI, the Product Management & Business Development team moves quickly, challenges the status quo, and owns the delivery of AI-based products that aim to shape the future of banking. Here’s a summary of what that looks like on the ground, day-to-day:
PMs at Borealis AI work collaboratively with researchers, machine learning engineers, and business stakeholders at RBC to tackle challenges and opportunities where AI can drive substantial value for the bank. Borealis AI builds cutting-edge machine learning products that often commercialize novel research within various domains of finance. This means no day is the same. From RBC Capital Markets to personal and commercial banking, Borealis AI PMs connect the dots to bring ML expertise to real problems in finance and deliver value and impact at scale.
At Borealis AI, the Business Development team works closely with top executives and stakeholders at RBC to zero in on some of the most interesting problems. Business Developers (BDs) and PMs have a hand in building our product and business strategy roadmaps, considering not only what we are doing today but where we are going in the future as an organization. They ensure we continue scaling our products and our impact. The team then works with researchers and engineers to deliver products that are quite literally shaping the future of financial services. While this sounds like a tall claim, PMs here have seen how just a single model can bring significant positive change. This level of impact drives PMs to tackle new and challenging problems. We conduct discovery and proofs of concept (POCs) to make sure a particular problem or domain is a good fit for machine learning and to determine how the team can best add value. Product managers can also help shape the work by identifying interesting and worthy projects: at Borealis AI, we have an incubator program that encourages PMs and others on the team to bring, develop, and commercialize their own ideas. Given RBC’s 17 million clients and Canadian market leadership, the potential for impact is enormous and exciting.
Delivering successful, industry-leading machine learning products takes a diverse set of skills and expertise. PMs get to collaborate with researchers and engineers and tap into deep domain expertise at RBC, whether in Capital Markets, P&CB, or other areas of the bank: we win together as one RBC. There are seemingly endless opportunities to learn about new domains and business lines, so it helps to work closely with subject-matter experts who have deep experience in finance, banking, and financial products, developing a better awareness of users’ needs and of where other processes and ideas can be leveraged.
Banking lends itself to time-series modelling for machine learning. Most interactions with banks – things like market trades, transactions, and bill payments, for example – happen as a time series. For those with a technical background: we’re focused on asynchronous, multi-source time-series data and on learning representations of users to help predict next events, transactions, or client needs.
PMs have many exciting opportunities to build products with outsized impact and take them to market. An example is NOMI Forecast, embedded into RBC’s NOMI platform used by millions of Canadians across the country. Its underlying machine learning model allows RBC clients to manage their cash flow and account activity more confidently.
Borealis just turned five, and has already delivered IP, cutting-edge research, and innovative products that improve people’s financial lives. There’s a lot more work ahead, so a commitment to constant learning, on the job and in the fields of AI and finance, is incredibly important to keeping PMs at the top of their game.
Each day our PMs are exposed to new ideas and perspectives, accelerating their learning and growth. The PM team works closely with top talent at RBC every single day across various domains, sharpening their skills and deepening their domain understanding. Mentorship can be very valuable, both in providing PMs with opportunities to learn from more senior people and in providing them with opportunities to grow their own leadership skills by mentoring those more junior. More importantly, PMs tend to have some leeway for moving around the organization, allowing professionals to really develop their own path to growth and development as a PM.
Want to explore the opportunities and challenges facing AI and ML Product Managers? Interested in learning what it takes to deliver business value from machine learning? Want to find out what a day-in-the-life looks like for a Product Manager at Borealis? Check out this recording of the LinkedIn Live event.
“We want to empower the next generation of AI leaders to build technologies that move society forward. Encouraging new and diverse perspectives, backgrounds, and communities to participate in the Canadian AI ecosystem,”
noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.
“By bringing together the great ideas and creativity of students with the catalytic power of AI and ML, we have an opportunity to advance positive outcomes for society.”
The Let’s SOLVE it mentorship program is open to all undergraduate students at all Canadian universities. It’s free and will be conducted virtually – teams don’t need to be in the same location or at the same school in order to participate. You don’t even need to be enrolled in a Computer Science program – each team member should have some basic programming knowledge, but specific experience using AI or ML isn’t necessary.
“There are students across the country who are interested in AI, have big ideas, but don’t have access to the skills, tools and capabilities they need in order to realize their dreams,”
noted Dr. Elissa Strome, Executive Director of PanCanadian AI Strategy at CIFAR.
“Let’s SOLVE it encourages students from all walks of life and all regions of the country to not only take a greater interest in AI and ML, but also to achieve positive social outcomes in their communities.”
The Let’s SOLVE it program builds on a successful pilot that included 26 students from universities across Canada. The seven teams – including two all-female teams – focused on a range of important societal problems, including accessibility, mental health, and climate change. The program also helped participants improve their AI literacy skills and their practical understanding of building AI solutions.
“Participating in the Let’s SOLVE it program changed the way I see the field of AI and ML,”
said Wanqing Li, an undergrad at Queen’s University and a participant in the pilot project.
“I used to think AI/ML was hardcore. But the LSi program showed me that, while it is a complex and difficult field, with the right support it is possible to get a clearer and deeper understanding that can help solve important problems.”
We first review some state-of-the-art data valuation algorithms, which can be used to answer the first question and relate issues in different model performance metrics to the training examples. In the second part, we review some state-of-the-art machine unlearning algorithms, which could be used to unlearn/remove the influence of the problematic training examples. However, as we will see, none of these algorithms can be used as a post-hoc modification algorithm, and they are mostly limited to the training objective. At the end, we introduce PUMA, our novel model patching algorithm that addresses the shortcomings of the available algorithms. PUMA reduces the negative impact of the problematic training examples on the model parameters, without retraining, while preserving the model’s performance.
With the recent successful application of machine learning models in critical domains such as healthcare and banking, model interpretability has become an active research topic. It is necessary to understand how models make their decisions and to make sure these decisions are aligned with business or legal regulations. Moreover, in the case of a failure, it is essential to understand the reason for the failure. Many algorithms exist that provide different insights into the model prediction process, with which we may find different issues in a model. However, they usually do not suggest a mitigation process for the discovered issues. Here, we propose a data-centric model debugging process that helps model developers and model validators better understand failures and mitigate them.
In a data-centric machine learning model debugging framework, we should answer the following two questions: 1) can we use data valuation to find the training examples causing the model performance degradation? and 2) how can we modify the model parameters to remove the negative impact of the problematic training examples and improve the performance of the model? Answering these two questions is the focus of this blog post.
To answer the first question, data valuation algorithms can be used [9, 10, 18, 19, 5]. Data-centric model interpretation algorithms are a class of algorithms that explain model behavior through its training examples, e.g., by decomposing the model prediction into the influence of each training example on the prediction. In other words, data valuation algorithms relate the model performance to the training examples.
To answer the second question, machine unlearning algorithms can be used to remove the negative effect of the detected problematic data points [6, 3, 15].
In the remainder of this post, we first review some papers that study the relation between model failures and issues in the training examples. We then briefly review state-of-the-art data valuation and unlearning algorithms. At the end, we introduce PUMA, our novel model patching algorithm.
To motivate the rest of our discussion, in this section we review papers that relate the quality of training data to model performance. The papers mostly concern noisy labels, but the discussion extends to confusing labels as well, e.g., ImageNet samples containing multiple targets and MNIST digits that resemble other digits (see figure 1) [16].
Generalization– The generalization of deep learning has been studied extensively from different perspectives, such as finding tight theoretical [7, 8] and numerical [7, 2] bounds for the generalization gap, and finding approaches that improve the generalization of models [14, 4]. Among papers that study the causes of a large generalization gap, [20, 1] show that the effective capacity of neural networks is sufficient for memorizing the entire data set, even with noisy labels. This memorization severely affects the generalization of the models.
Calibration– Miscalibration is a mismatch between the prediction confidence score of a model and its correctness, which causes mistrust in the model’s predictions: the confidence score can be over- or under-confident for different points. This is especially problematic for out-of-distribution generalization, e.g., the model could be wrong but very confident about its prediction. A potential cause of miscalibration is that the Negative Log-Likelihood (NLL), commonly used in training classification models, is not a proper score, as there is a gap between 0-1 accuracy and the loss measured by NLL [11]. A shortcut to reduce the NLL during training is to overfit to the easier examples. As discussed in [11], this phenomenon increases miscalibration.
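The confidence/correctness mismatch described above is commonly summarized by the Expected Calibration Error (ECE): the bin-weighted average gap between mean confidence and accuracy. Here is a minimal sketch using the usual equal-width binning convention; the toy numbers are illustrative assumptions, and other variants of the measure exist:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error: for each confidence bin, take the gap
    between average confidence and average accuracy, then weight by the
    fraction of points in the bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = np.abs(confidences[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece

conf = np.array([0.9, 0.9, 0.9, 0.9])  # model claims 90% confidence on every point
hit = np.array([1.0, 1.0, 0.0, 0.0])   # but it is right only half the time

assert abs(expected_calibration_error(conf, hit) - 0.4) < 1e-9  # overconfident by 0.4
```

A perfectly calibrated model would have an ECE of zero; the toy model above illustrates the overconfidence the NLL shortcut tends to produce.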
Adversarial Robustness– A connection between adversarial robustness and noisy labels is studied in [13]. Label noise is identified as one of the causes of adversarial vulnerability, and the paper provides theoretical and empirical evidence for this relation.
A recurring theme in the papers reviewed above is the effect of noisy labels and confusing training examples on model performance, including robustness and generalization. This is mostly due to overfitting to the simpler examples. However, this is not always the case.
Fairness– Learning easier concepts faster and overfitting to simple concepts in the training examples is studied in [12] as a cause of bias and unfairness in the model. Thus, in fairness, unlike the other reviewed cases, the issue is caused by learning and overfitting to simpler concepts as a shortcut for making predictions.
We use these studies as motivation for our work, in which we want to improve model performance by removing the negative influence of problematic training examples from the parameters of a trained model without retraining it.
Now that we know problematic training examples can affect the performance of a model, we will review some algorithms that help us understand the effect of each training example on the model parameters and predictions. We review these algorithms under the name of data valuation: algorithms that assign an importance score reflecting the influence of each training example on a performance metric.
Data Shapley– Data Shapley, introduced in [5], is based on a method from coalitional game theory: a prediction can be explained by assuming that each training example is a “player” in a game where the model performance, e.g., accuracy, is the payout. Data Shapley values tell us how to fairly distribute the “payout” among the training data points. To calculate the exact Shapley value of an example, a model should be trained for each possible combination of that example with the other training examples in order to measure its performance:
\begin{equation}
\phi_{i}=C \sum_{S \subseteq D \setminus\{i\}} \frac{V(S \cup\{i\})-V(S)}{\left(\begin{array}{c}
n-1 \\
|S|
\end{array}\right)}
\end{equation}
We call $\phi_i$ the Data Shapley value of source $i$. The sum is over all subsets $S$ of the training data $D$ not containing training example $i$, and $C$ is an arbitrary constant. This computation is exponentially complex, so two estimation algorithms are proposed in [5]: Monte Carlo Shapley and Gradient Shapley. Gradient Shapley is only applicable to differentiable models, while Monte Carlo Shapley is more general and can be used with any model.
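The Monte Carlo estimator can be sketched in a few lines: average, over random orderings of the data, each point's marginal contribution to the performance $V$ of the coalition that precedes it. The sketch below is a minimal illustration with an assumed additive `value_fn` standing in for retraining and evaluating a model, which is where the real cost lies:

```python
import numpy as np

def monte_carlo_shapley(n_points, value_fn, n_permutations=200, seed=0):
    """Monte Carlo estimate of Data Shapley values: for each random
    permutation, add points one at a time and credit each point with the
    change in performance V it causes."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(n_points)
    for _ in range(n_permutations):
        order = rng.permutation(n_points)
        coalition, prev_value = [], value_fn([])
        for i in order:
            coalition.append(i)
            v = value_fn(coalition)
            phi[i] += v - prev_value   # marginal contribution of point i
            prev_value = v
    return phi / n_permutations

# Toy performance: each of points 0-2 adds 1 to V, the mislabeled point 3 subtracts 1.
value_fn = lambda idx: sum(1.0 if i < 3 else -1.0 for i in idx)
phi = monte_carlo_shapley(4, value_fn)

assert phi[3] < 0 < phi[0]   # the harmful example receives a negative value
```

Because the toy game is additive, the estimate is exact here; with a real `value_fn` that retrains a model per coalition, truncation and early stopping are what make the estimator practical.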
Reinforcement Learning based Data Valuation– A reinforcement-learning-based data valuation technique called DVRL is proposed in [19]. The main idea (shown in figure 2) is to use the REINFORCE algorithm [17] to learn a masking agent that chooses the data examples that increase the performance of the model. The trained agent can then be used as a data valuation model that assigns an importance score to each training data point. The major advantage of this algorithm is its applicability to any model, even if the model is not differentiable.
Representer Point Selection (RPS)– RPS [18] decomposes the pre-activation predictions of a model into a linear combination of the pre-activations of its training data points. Denoting the pre-activation feature by $\mathbf{f}_i = \Phi(\mathbf{x}_i, \Theta)$ and the prediction by $\hat{y}_i = \sigma(\Phi(\mathbf{x}_i, \Theta))$, where $\sigma$ is the activation function and $\Theta$ denotes the model parameters, the RPS decomposition gives:
\begin{equation}
\Phi\left(\mathbf{x}_{t}, \Theta^{*}\right)=\sum_{i=1}^{n} k\left(\mathbf{x}_{t}, \mathbf{x}_{i}, \alpha_{i}\right)
\end{equation}
where $\alpha_{i}=-\frac{1}{2 \lambda n} \frac{\partial L\left(\mathbf{x}_{i}, \mathbf{y}_{i}, \boldsymbol{\Theta}\right)}{\partial \Phi\left(\mathbf{x}_{i}, \boldsymbol{\Theta}\right)}$ ($n$ is the number of training examples and $L$ is the loss function) and $k\left(\mathbf{x}_{t}, \mathbf{x}_{i}, \alpha_{i}\right)=\alpha_{i} \mathbf{f}_{i}^{T} \mathbf{f}_{t}$. The weights $\alpha_{i}$ of this linear combination can be used as a measure of the importance of the training examples to the model predictions. More precisely, each weight can be seen as the resistance of a training example's feature towards minimizing the norm of the weight matrix, and can therefore be used to evaluate that example's importance to the model.
It should be noted that equation (2) holds only if the training objective includes an ℓ2-norm weight decay with coefficient λ, and that this algorithm is only applicable to differentiable neural networks.
Influence Functions– Influence functions are a classic technique from robust statistics that traces a model’s prediction through the learning algorithm back to its training examples. They can therefore be used to identify the training examples most responsible for a given prediction. Influence functions require expensive second-derivative calculations and assume model differentiability and convexity. In [10], efficient second-order optimization techniques are used to overcome some of these computational problems and apply influence functions to explaining deep neural networks.
The idea is to compute the parameter change if a sample z were upweighted by some small ε:
\begin{equation}
\hat{\theta}_{\epsilon, z} \stackrel{\text { def }}{=} \arg \min _{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} L\left(z_{i}, \theta\right)+\epsilon L(z, \theta)
\end{equation}
The influence of this upweighting on the model parameters is given by:
\begin{equation}
\mathcal{I}_{\text {up,params }}(z) \stackrel{\text { def }}{=} \left.\frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon}\right|_{\epsilon=0}=-H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta})
\end{equation}
where $H_{\hat{\theta}} \stackrel{\text { def }}{=} \frac{1}{n} \sum_{i=1}^{n} \nabla_{\theta}^{2} L\left(z_{i}, \hat{\theta}\right)$ is the Hessian matrix. Using the chain rule we can calculate the influence of this upweighting on the loss function $L$, which can be used as the importance score of the training data examples:
$$
\begin{aligned}
\mathcal{I}_{\text {up,loss }}\left(z, z_{\text {test }}\right) & \stackrel{\text { def }}{=} \left.\frac{d L\left(z_{\text {test }}, \hat{\theta}_{\epsilon, z}\right)}{d \epsilon}\right|_{\epsilon=0} =\left.\nabla_{\theta} L\left(z_{\text {test }}, \hat{\theta}\right)^{\top} \frac{d \hat{\theta}_{\epsilon, z}}{d \epsilon}\right|_{\epsilon=0} \\
&=-\nabla_{\theta} L\left(z_{\text {test }}, \hat{\theta}\right)^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta})
\end{aligned}
$$
However, calculating the inverse Hessian is computationally very expensive. Two estimation methods are suggested in [10]: 1) conjugate gradients, and 2) stochastic estimation, which uses Hessian-vector products (HVPs) to estimate the inverse Hessian.
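For intuition, here is a small sketch that computes exact influence scores for a toy L2-regularized logistic regression, where the Hessian is small enough to invert directly. The dataset and model are illustrative stand-ins; [10]'s estimation methods exist precisely to avoid this explicit inverse at scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny L2-regularized logistic regression, trained to (near) convergence
# with Newton's method, followed by exact influence-score computation.
n, d, lam = 40, 3, 0.1
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
for _ in range(50):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n + lam * w
    H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
    w -= np.linalg.solve(H, grad)

# Per-example loss gradients and the (damped) Hessian at the optimum.
p = sigmoid(X @ w)
grads = X * (p - y)[:, None]            # shape (n, d)
H = (X.T * (p * (1 - p))) @ X / n + lam * np.eye(d)
H_inv = np.linalg.inv(H)

# Influence of each training point z_i on a test point's loss:
# I(z_i, z_test) = -grad L(z_test)^T  H^{-1}  grad L(z_i)
x_test, y_test = rng.normal(size=d), 1.0
g_test = x_test * (sigmoid(x_test @ w) - y_test)
influence = -grads @ H_inv @ g_test     # shape (n,): one score per training point
```

Large negative scores flag training points whose upweighting would increase the test loss, i.e., harmful examples for this particular prediction.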
Summary– The most important advantage of Data Shapley (specifically Monte Carlo Shapley) and DVRL over Influence functions and RPS is that they can use performance measures other than the model's training loss. Influence functions and representer point selection explain training data importance through its contribution to the training loss objective. As a result, finding the training examples most influential on, say, fairness or calibration cannot be done with RPS or Influence functions. In the next section you will see that our proposed algorithm, PUMA, modifies the Influence functions formulation to address this issue. Using PUMA’s modified formulation, we can estimate the influence of the training examples on the model parameters in terms of different performance metrics, even when they differ from the training loss objective.
Up to here, we have reviewed papers showing that some problems in model performance can be traced to problematic training examples, and data valuation algorithms that find the training examples most influential on the model parameters and performance. The question now is whether we can remove the negative influence of these problematic data points from the model parameters. This concept is closely related to unlearning. In this section, we first review a few papers focusing on unlearning some of the training examples from the model parameters.
SISA– To remove the influence of specific training examples, we could simply retrain the model after removing those examples. However, if the model and training set are large, retraining can be computationally very expensive. In addition, besides its negative impacts, every data point also has useful influence on the model parameters, which is lost when retraining from scratch. Sharded, Isolated, Sliced, and Aggregated (SISA) training [3] avoids full retraining by training sub-models on data shards and voting among the sub-model decisions to make the final decision. With data shards, we only need to retrain the sub-models affected by the removed data points.
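The shard-and-vote idea can be sketched in a few lines. The trivial majority-class "model", the helper names, and the round-robin sharding are all hypothetical simplifications; real SISA also slices each shard and checkpoints training.

```python
from collections import Counter

def fit(shard):
    # Toy sub-model: predicts the shard's majority label (ignores features).
    labels = [y for _, y in shard]
    return Counter(labels).most_common(1)[0][0]

def train_sisa(data, n_shards=3):
    # Round-robin sharding; each shard trains its own isolated sub-model.
    shards = [data[i::n_shards] for i in range(n_shards)]
    models = [fit(s) for s in shards]
    return shards, models

def predict(models, x):
    # Aggregate by majority vote among sub-models (toy models ignore x).
    return Counter(m for m in models).most_common(1)[0][0]

def unlearn(shards, models, point):
    # Retrain only the shard(s) that contain the removed point.
    for i, s in enumerate(shards):
        if point in s:
            shards[i] = [p for p in s if p != point]
            models[i] = fit(shards[i])
    return shards, models

data = [(x, x % 2) for x in range(12)]
shards, models = train_sisa(data)
shards, models = unlearn(shards, models, (3, 1))
```

The cost of unlearning a point is one sub-model retrain instead of a full retrain, which is the core efficiency argument of [3].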
Amnesiac Machine Learning (AML)– To remove the influence of training data points from the learned parameters, [6] tracks the gradients of each data point (or batch of data points) during the learning process and subtracts from the model parameters the accumulated effect of the tracked gradients associated with the removed data points (or the batches containing them). Since this gradient removal has no constraint to preserve model performance after unlearning, the updated model is fine-tuned for a few epochs to recover the degraded performance.
Unrolling SGD (UrSGD)– Building on the idea in Amnesiac Machine Learning, [15] removes the effect of the gradients from the model parameters. To improve memory efficiency, instead of tracking the gradients during training, a recursive gradient estimation is proposed that uses the first-order gradients w.r.t. the initial weights/parameters and ignores the higher-order gradients. To improve this recursive estimation, a regularization term is added to the training objective that minimizes the effect of the higher-order gradients.
Summary– There are a limited number of studies on model unlearning that could serve our goal of removing the negative impact of problematic training data points. The available algorithms have one or more shortcomings that make them unsuitable for this goal (figure 3).
To estimate the influence of the training examples on the model parameters, we follow the influence functions derivation. Our goal is to modify the model parameters without seriously hurting prediction performance with respect to various performance criteria C (or Lc for an individual sample), such that
\begin{equation}
\left|\underbrace{\frac{1}{\left|D_{\mathrm{tn}}\right|} \sum_{i=1}^{\left|D_{\mathrm{tn}}\right|} \mathcal{L}_{c}\left(\mathbf{x}_{i}, y_{i}, \theta_{\mathrm{mod}}\right)}_{\mathcal{C}\left(\theta_{\mathrm{mod}}\right)}-\underbrace{\frac{1}{\left|D_{\mathrm{tn}}\right|} \sum_{i=1}^{\left|D_{\mathrm{tn}}\right|} \mathcal{L}_{c}\left(\mathbf{x}_{i}, y_{i}, \theta_{\mathrm{org}}\right)}_{\mathcal{C}\left(\theta_{\mathrm{org}}\right)}\right| \leq \delta
\end{equation}
where $\delta>0$ is the performance drop limit, $\theta_{\mathrm{mod}}$ is the modified model parameters, $\theta_{\mathrm{org}}$ is the original model parameters, $\left|D_{\mathrm{tn}}\right|$ is the number of training data examples, and $\left(\mathbf{x}_{i}, y_{i}\right)$ is the $i$th training example and its corresponding target.
Model modification in PUMA is done by reweighting the training examples, where the weights λj control the contribution of each training example to the model modification:
\begin{equation}
\theta_{\mathrm{mod}}=\underset{\theta}{\operatorname{argmin}}\, \mathcal{J}_{\mathrm{mod}}(\theta)=\underset{\theta}{\operatorname{argmin}} \underbrace{\frac{1}{\left|D_{\mathrm{tn}}\right|} \sum_{i=1}^{\left|D_{\mathrm{tn}}\right|} \mathcal{L}_{t}\left(\mathbf{x}_{i}, y_{i}, \theta\right)}_{\mathcal{J}_{\mathrm{org}}(\theta)}+\underbrace{\frac{1}{\left|D_{\mathrm{up}}\right|} \sum_{j=1}^{\left|D_{\mathrm{up}}\right|} \lambda_{j} \mathcal{L}_{t}\left(\mathbf{x}_{j}, y_{j}, \theta\right)}_{\mathcal{J}_{\mathrm{add}}(\theta)}
\end{equation}
where J is the training objective, Lt denotes the individual prediction loss, and Dup ⊆ Dtn is a subset of training examples used to compensate for the effect of the removed examples.
Using this definition of the modified parameters and a Taylor expansion of the gradient of $\mathcal{J}_{\mathrm{mod}}(\theta)$ as the estimated update, we have:
$$
\begin{gathered}
\phi\left(\mathbf{x}_{i}, y_{i}\right)=\left(\nabla^{2} \mathcal{J}_{\mathrm{org}}\left(\theta_{\mathrm{org}}\right)\right)^{-1} \nabla \mathcal{L}_{t}\left(\mathbf{x}_{i}, y_{i}, \theta_{\mathrm{org}}\right) \\
\theta_{\mathrm{mod}}-\theta_{\mathrm{org}}=-\sum_{j=1}^{\left|D_{\mathrm{up}}\right|} \lambda_{j} \phi\left(\mathbf{x}_{j}, y_{j}\right)
\end{gathered}
$$
Now we can use this relation to estimate the change in any performance metric C, which may differ from the training objective. Let’s introduce the following notation:
\begin{equation}
\psi\left(\mathbf{x}_{j}, y_{j}\right)=\nabla \mathcal{C}\left(\theta_{\mathrm{org}}\right)\left(\nabla^{2} \mathcal{J}_{\mathrm{org}}\left(\theta_{\mathrm{org}}\right)\right)^{-1} \nabla \mathcal{L}_{t}\left(\mathbf{x}_{j}, y_{j}, \theta_{\mathrm{org}}\right)
\end{equation}
Data Valuation– ψ(xj, yj) can be used to value the importance of the training examples to the performance of the model, as measured by C. Using this notation and a Taylor expansion, we have the following estimate of the change in model performance after modifying its parameters:
\begin{equation}
\mathcal{C}\left(\theta_{\mathrm{mod}}\right)-\mathcal{C}\left(\theta_{\mathrm{org}}\right)=\nabla \mathcal{C}\left(\theta_{\mathrm{org}}\right)\left(\theta_{\mathrm{mod}}-\theta_{\mathrm{org}}\right)+\epsilon \approx-\sum_{j=1}^{\left|D_{\mathrm{up}}\right|} \lambda_{j} \psi\left(\mathbf{x}_{j}, y_{j}\right)
\end{equation}
The above equation enables us to not only estimate the effect of parameter modifications on the training objective, but also on any differentiable performance metric.
Now we have the tools for the most important step of PUMA. To remove the negative effect of a problematic training example from the model parameters, its weight is set to λj = −1. To preserve the performance of the model as much as possible, PUMA compensates for the influence of the removed training data examples by upweighting the influence of the remaining data points (or a subset of them). To do this upweighting optimally, we solve the following optimization problem for the optimal weights:
\begin{equation}
\boldsymbol{\lambda}^{*}=\underset{\boldsymbol{\lambda}}{\operatorname{argmin}}\left\|\sum_{j \notin D_{\mathrm{mk}}}^{\left|D_{\mathrm{up}}\right|} \lambda_{j} \psi\left(\mathbf{x}_{j}, y_{j}\right)-\sum_{k=1}^{\left|D_{\mathrm{mk}}\right|} \psi\left(\mathbf{x}_{k}, y_{k}\right)\right\|^{2}+\Omega(\boldsymbol{\lambda})
\end{equation}
where $D_{\mathrm{mk}}$ is the set of data points marked for removal, $D_{\mathrm{up}}$ is the subset of data to be upweighted, and $\Omega$ is a regularization term that sparsifies the weights. Figure 4 depicts the importance of reweighting.
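This reweighting step reduces to a regularized least-squares problem. Here is a minimal numerical sketch with random placeholder ψ vectors standing in for the gradient-based quantities, and a ridge penalty as one possible choice of Ω (the sparsifying regularizer in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example influence vectors psi; in PUMA these come from
# gradients and an inverse Hessian, here they are random placeholders.
d, n_up, n_mk = 5, 20, 4
Psi_up = rng.normal(size=(d, n_up))     # psi(x_j) for up-weighted points
Psi_mk = rng.normal(size=(d, n_mk))     # psi(x_k) for points marked for removal
b = Psi_mk.sum(axis=1)                  # total influence to compensate

# Ridge-regularized least squares:
#   argmin_lam ||Psi_up @ lam - b||^2 + gamma * ||lam||^2
gamma = 0.1
lam = np.linalg.solve(Psi_up.T @ Psi_up + gamma * np.eye(n_up), Psi_up.T @ b)

residual = np.linalg.norm(Psi_up @ lam - b)
baseline = np.linalg.norm(b)            # residual with lam = 0
```

The optimal weights should bring the compensating influence close to that of the removed points, i.e., `residual` well below `baseline`.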
Model Patching– Knowing the optimal weights, we can use the following equation to update the model parameters:
$$
\theta_{\mathrm{mod}}=\theta_{\mathrm{org}}+\eta\left[\sum_{k=1}^{\left|D_{\mathrm{mk}}\right|} \phi\left(\mathbf{x}_{k}, y_{k}\right)-\sum_{j \notin D_{\mathrm{mk}}}^{\left|D_{\mathrm{up}}\right|} \lambda_{j}^{*} \phi\left(\mathbf{x}_{j}, y_{j}\right)\right]
$$
Here we present a few experimental results showing the data valuation and model patching performance of PUMA. For extensive experimental results and comparisons with state-of-the-art algorithms, refer to the paper and its appendix.
Data Valuation– PUMA extends and generalizes Influence functions, which are limited to the training objective, to incorporate different performance metrics. In PUMA’s formulation, ψ(xj, yj) can be used as a data importance value. Figure 5 compares PUMA with some state-of-the-art data valuation algorithms at detecting noisy labels. In this experiment, we flip the labels of 10% of the training data points. The goal is to find as many noisy labels as possible while investigating as little data as possible. We expect the data points with the lowest data values to be the ones with noisy labels, so we sort the data points by their data values and investigate different portions of the data. As can be seen in figure 5, PUMA outperforms the others or is at least as good as them. It should be noted that this is not the main goal of PUMA, but a by-product of its formulation.
Model Patching: Data Removal– The main goal of PUMA is to remove the effect of data points. To show PUMA’s success in removing training points from the model parameters, we present two examples here, using 1) a synthetic dataset (the Radial dataset), and 2) MNIST.
In the first example, figure 6, we want to remove a cluster of the data points in the green class. The results show that those data points are successfully removed, without affecting the accuracy of the model in classifying the remaining data points in both classes.
In the second example we use the MNIST dataset and aim to remove some of the images in class “4” with a specific style (styles are clustered using K-means). As can be seen in figure 7, those “4”s are removed successfully (they are now classified as “6”) without affecting the other “4”s or the other digits.
Model Patching: Calibration– We aim to improve the calibration of the model as measured by Expected Calibration Error (ECE). ECE measures the gap between the model's average confidence and its accuracy within bins of the model's confidence.
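ECE can be computed in a few lines; a sketch with equal-width confidence bins (the binning scheme and the toy confidence values are illustrative assumptions):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """ECE: bin-weight-averaged gap between per-bin accuracy and confidence."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Perfectly calibrated toy example: 75% accuracy at 75% confidence.
conf = [0.75, 0.75, 0.75, 0.75]
correct = [1, 1, 1, 0]
print(expected_calibration_error(conf, correct))  # → 0.0
```

An overconfident model, e.g., confidence 0.9 everywhere but only 50% accuracy, would instead score an ECE of 0.4.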
In this example we use a two-moon synthetic dataset, shown in figure 8, with 10% random label noise. First, PUMA’s data importance values are used to detect the training points most involved in increasing miscalibration. Figure 8 shows three different types of detected data points: 1) examples with noisy labels, 2) overconfident examples, and 3) underconfident examples. Then, PUMA is used to remove the data points with noisy labels and a subset of the overconfident data points, while the underconfident points are upweighted. The ECE improvement after patching the model parameters is evident in figure 9.
References
[1] Devansh Arpit et al. “A closer look at memorization in deep networks”. In: International Conference on Machine Learning. PMLR. 2017, pp. 233–242.
[2] Peter Bartlett, Dylan J Foster, and Matus Telgarsky. “Spectrally-normalized margin bounds for neural networks”. In: arXiv preprint arXiv:1706.08498 (2017).
[3] Lucas Bourtoule et al. “Machine unlearning”. In: arXiv preprint arXiv:1912.03817 (2019).
[4] Pierre Foret et al. “Sharpness-aware minimization for efficiently improving generalization”. In: arXiv preprint arXiv:2010.01412 (2020).
[5] Amirata Ghorbani and James Zou. “Data Shapley: Equitable valuation of data for machine learning”. In: International Conference on Machine Learning. PMLR. 2019, pp. 2242–2251.
[6] Laura Graves, Vineel Nagisetty, and Vijay Ganesh. “Amnesiac Machine Learning”. In: arXiv preprint arXiv:2010.10981 (2020).
[7] Yiding Jiang et al. “Fantastic generalization measures and where to find them”. In: arXiv preprint arXiv:1912.02178 (2019).
[8] Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. “Generalization in deep learning”. In: arXiv preprint arXiv:1710.05468 (2017).
[9] Been Kim, Rajiv Khanna, and Oluwasanmi O Koyejo. “Examples are not enough, learn to criticize! Criticism for Interpretability”. In: Advances in Neural Information Processing Systems. Ed. by D. Lee et al. Vol. 29. Curran Associates, Inc., 2016. URL: https://proceedings.neurips.cc/paper/2016/file/5680522b8e2bb01943234bce7bf84534-Paper.pdf.
[10] Pang Wei Koh and Percy Liang. “Understanding black-box predictions via influence functions”. In: International Conference on Machine Learning. PMLR. 2017, pp. 1885–1894.
[11] Jishnu Mukhoti et al. “Calibrating deep neural networks using focal loss”. In: arXiv preprint arXiv:2002.09437 (2020).
[12] Junhyun Nam et al. “Learning from failure: Training debiased classifier from biased classifier”. In: arXiv preprint arXiv:2007.02561 (2020).
[13] Amartya Sanyal et al. “How benign is benign overfitting?” In: arXiv preprint arXiv:2007.04028 (2020).
[14] David Stutz, Matthias Hein, and Bernt Schiele. “Relating adversarially robust generalization to flat minima”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021, pp. 7807–7817.
[15] Anvith Thudi et al. “Unrolling SGD: Understanding factors influencing machine unlearning”. In: arXiv preprint arXiv:2109.13398 (2021).
[16] Sunil Thulasidasan et al. “Combating label noise in deep learning using abstention”. In: arXiv preprint arXiv:1905.10964 (2019).
[17] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine Learning 8.3 (1992), pp. 229–256.
[18] Chih-Kuan Yeh et al. “Representer point selection for explaining deep neural networks”. In: arXiv preprint arXiv:1811.09720 (2018).
[19] Jinsung Yoon, Sercan Arik, and Tomas Pfister. “Data valuation using reinforcement learning”. In: International Conference on Machine Learning. PMLR. 2020, pp. 10842–10851.
[20] Chiyuan Zhang et al. “Understanding deep learning (still) requires rethinking generalization”. In: Communications of the ACM 64.3 (2021), pp. 107–115.
Graph neural networks (GNNs) excel at performing semi-supervised classification with the aid of a graph structure over the set of input examples. In recent years, researchers have begun applying GNNs to semi-supervised datasets that don’t include graph-structure information, by generating a graph structure from the data. The paper identifies and solves a ‘supervision starvation’ problem in a promising approach called latent graph learning, which jointly optimizes a graph structure and a GNN classifier.
Much of the great success of modern neural network methods comes from applying architectural inductive biases like convolution and attention to structured inputs. One reason that graph neural networks (GNNs) are an exciting area of research is the promise of applying similar architectural techniques to structure in the input domain as a whole, by treating input examples as nodes in a graph. Previous work demonstrates that it’s sometimes possible to leverage the strengths of GNNs even when no graph information is available, by using label and feature information to construct an input-example graph with GNN-friendly properties. Since GNN classifiers are known to excel on graphs with high label homophily (i.e., connected nodes often belong to the same class), graph construction typically focuses on label homophily, with feature homophily as a proxy. The simplest methods pre-construct a graph with a kNN algorithm, while more sophisticated methods try to infer a graph during training.
In existing methods for latent graph learning, a neural network that assigns a graph structure to the input set is optimized together with a graph convolutional network (GCN) that classifies input examples based on their graph neighbors. The paper’s central observation is that in this setting, training for classification can’t properly optimize the graph: some edges, which the paper calls ‘starved edges’, have no impact on the training loss but do affect the classifier’s predictions at test time. Since the values of these edges are learned without any training feedback, the model is at risk of making poor predictions at test time.
The existence of starved edges follows from the fact that an n-layer GCN makes predictions for input examples based on their n-hop neighbors. Edges between examples that are each ≥ n hops from any labeled example cannot affect any supervised prediction, and will therefore be ‘starved’ for supervision. Furthermore, it turns out that in typical latent-graph learning scenarios we should expect most edges to be starved: the authors observe that in a random graph (aka Erdős–Rényi graph) or scale-free network with the statistics of graph-structured datasets like Cora, Citeseer, and Pubmed, a random edge is more likely to be starved than unstarved. In particular, they give a simple proof that for an Erdős–Rényi graph with n nodes and m edges, if we have labels for q nodes selected uniformly at random, then the probability of an edge being a starved edge is:
\begin{equation}\left ( 1-\frac{q}{n} \right )\left ( 1-\frac{q}{n-1} \right )\prod_{i=1}^{2q}\left ( 1- \frac{m-1}{\binom{n}{2}-i} \right)\end{equation}
Furthermore, the proportion of starved edges in the actual Cora, Citeseer, and Pubmed datasets is very close to their probability in the analogous Erdős–Rényi graphs, so there is reason to believe that Erdős–Rényi graphs are a good model of natural graphs in this regard.
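The formula above is straightforward to evaluate. A small sketch, plugging in approximate Cora-like statistics (the node/edge/label counts below are approximations, not exact values from the paper):

```python
from math import comb

def starved_edge_probability(n, m, q):
    """Probability that a random edge of an Erdos-Renyi graph with n nodes and
    m edges is starved, given q labeled nodes chosen uniformly at random."""
    p = (1 - q / n) * (1 - q / (n - 1))
    for i in range(1, 2 * q + 1):
        p *= 1 - (m - 1) / (comb(n, 2) - i)
    return p

# Roughly Cora-sized: ~2708 nodes, ~5429 edges, 140 labeled training nodes.
p = starved_edge_probability(2708, 5429, 140)
print(p > 0.5)  # most edges are expected to be starved
```

With these statistics the probability works out to roughly 0.6, matching the paper's claim that a random edge is more likely starved than not.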
The paper proposes to solve the starved-edge problem with SLAPS (Simultaneous Learning of Adjacency and GNN Parameters with Self-supervision), a multi-task learning framework that supplements the classification task with a self-supervised task. The self-supervised task is based on the hypothesis that a graph structure suitable for predicting the features of input examples is also suitable for predicting their labels. The authors add a denoising autoencoder GNN downstream of the graph generator, optimizing the full system (graph generator, classifier, and denoising autoencoder) with a mixture of autoencoder loss and classifier loss. The training process thus encourages the graph generator to produce a graph structure that provides the denoising autoencoder with useful auxiliary information for denoising the input examples.
The authors compare their multi-task framework with existing graph-construction and latent-graph learning methods on a range of semi-supervised classification datasets. The main experiment uses a ‘graphless’ version of the graph-structured semi-supervised classification benchmarks Cora, Citeseer, and Pubmed, showing that SLAPS recovers a useful graph structure from input examples alone. SLAPS strongly outperforms the baseline MLP, and significantly outperforms GNN methods that rely on pre-constructed graphs or supervised latent graph learning:
| Model | Cora | Citeseer | Cora390 | Citeseer370 | Pubmed | ogbn-arxiv |
|---|---|---|---|---|---|---|
| MLP | 56.1 ± 1.6† | 56.7 ± 1.7† | 65.8 ± 0.4 | 67.1 ± 0.5 | 71.4 ± 0.0 | 51.7 ± 0.1 |
| MLP-GAM* | 70.7‡ | 70.3‡ | – | – | 71.9‡ | – |
| LP | 37.6 ± 0.0 | 23.2 ± 0.0 | 36.2 ± 0.0 | 29.1 ± 0.0 | 41.3 ± 0.0 | OOM |
| kNN-GCN | 66.5 ± 0.4† | 68.3 ± 1.3† | 23.2 ± 0.0 | 71.8 ± 0.8 | 70.4 ± 0.4 | 49.1 ± 0.3 |
| LDS | – | – | 71.5 ± 0.8† | 71.5 ± 1.1† | OOM | OOM |
| GRCN | 67.4 ± 0.3 | 67.3 ± 0.8 | 71.3 ± 0.9 | 70.9 ± 0.7 | 67.3 ± 0.3 | OOM |
| DGCNN | 56.5 ± 1.2 | 55.1 ± 1.4 | 67.3 ± 0.7 | 66.6 ± 0.8 | 70.1 ± 1.3 | OOM |
| IDGL | 70.9 ± 0.6 | 68.2 ± 0.6 | 73.4 ± 0.5 | 72.7 ± 0.4 | 72.3 ± 0.4 | OOM |
| kNN-GCN + AdaEdge | 67.7 ± 1.0 | 68.8 ± 0.3 | 72.2 ± 0.4 | 71.8 ± 0.6 | OOT | OOM |
| kNN-GCN + self-training | 67.3 ± 0.3 | 69.8 ± 0.3 | 71.1 ± 0.3 | 72.4 ± 0.2 | 72.7 ± 0.1 | NA |
| SLAPS (FP) | 72.4 ± 0.4 | 70.7 ± 0.4 | 76.6 ± 0.4 | 73.1 ± 0.6 | OOM | OOM |
| SLAPS (MLP) | 72.8 ± 0.8 | 70.5 ± 1.1 | 75.3 ± 1.0 | 73.0 ± 0.9 | 74.4 ± 0.6 | 56.6 ± 0.1 |
| SLAPS (MLP-D) | 73.4 ± 0.3 | 72.6 ± 0.6 | 75.1 ± 0.5 | 73.9 ± 0.4 | 73.1 ± 0.7 | 52.9 ± 0.1 |
| SLAPS (MLP) + AdaEdge | 72.8 ± 0.7 | 72.6 ± 1.5 | 75.2 ± 0.6 | 72.6 ± 1.4 | OOT | OOT |
| SLAPS (MLP) + self-training | 74.2 ± 0.5 | 73.1 ± 1.0 | 75.5 ± 0.7 | 73.3 ± 0.6 | 74.3 ± 1.4 | NA |
The authors also study the application of SLAPS to semi-supervised classification on datasets with noisy or corrupted graph-structure information, and find that using the noisy graph information to initialize SLAPS is greatly preferable to feeding it directly to a classifier GNN.
The paper studies the use of latent stochastic differential equations (SDEs) together with normalizing flows to learn continuous time-series dynamics. When learning continuous time-series dynamics, the objective is to maximize the observational log-likelihood of an inhomogeneous collection of training sequences with varying lengths and time stamps. At test time, in addition to maximizing observational log-likelihoods, we are also interested in sampling trajectories in a manner consistent with these log-likelihoods.
The authors improve on the state of the art in the field by employing a normalizing flow as a time-dependent decoder for a flexible latent SDE, achieving greater expressivity than methods that rely on a normalizing flow alone. The price of this increase in expressivity is that the observational log-likelihood becomes intractable, making variational approximations necessary. The authors formulate a principled variational approximation of the observational log-likelihood, based on a piecewise construction of the posterior distribution of the latent SDE.
Sparse and irregular observations of continuous dynamics are common in many areas of science, including finance, healthcare, and physics. Time-series models driven by stochastic differential equations provide an elegant framework for this challenging scenario and have recently gained popularity in the machine learning community. The SDEs are typically implemented by neural networks with trainable parameters, and the latent processes defined by the SDEs are then decoded into an observable space with complex structure.
Despite great recent progress in the field, it remains challenging to produce models that are both computationally tractable and flexible. Cutting-edge methods built around combining a simple latent process with invertible transformations have the benefit of giving exact and efficient likelihood evaluation of observations, but can only model a limited class of stochastic processes. In particular, the otherwise highly effective flow-based method CTFP (‘continuous-time flow process’) is provably incapable of modeling some commonplace stochastic processes, from simple ones like the Ornstein–Uhlenbeck (OU) process to more complex non-Markov processes. A more formal difficulty with models like CTFP is that standard neural network practice for constructing reversible transformations implies Lipschitz continuity, and Lipschitz-continuous reversible transformations of simple processes are especially limited. Transforming a simple stochastic process into Brownian motion, for example, requires a non-Lipschitz function and therefore a non-standard network architecture.
The paper introduces Continuous Latent Process Flows (CLPF). CLPF treats an observed sequence as a partial realization of a continuous-time observable stochastic process Xₜ, and treats Xₜ in turn as a function of the trajectory of a flexible SDE process Zₜ together with the trajectory of a simple stochastic process Oₜ (e.g., an OU process).
Concretely, the CLPF framework models the evolution of an m-dimensional time-continuous latent state Zₜ over the time interval [0, T] using a flexible stochastic differential equation
\begin{equation}\mathrm{d} \boldsymbol{Z}_{t}=\boldsymbol{\mu}_{\gamma}\left(\boldsymbol{Z}_{t}, t\right) \mathrm{d} t+\sigma_{\gamma}\left(\boldsymbol{Z}_{t}, t\right) \mathrm{d} \boldsymbol{W}_{t}\end{equation}
where Wₜ is an m-dimensional Wiener process and γ denotes the (shared) learnable parameters of the drift function µ and variance function σ.
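Such an SDE generally has no closed-form solution, so trajectories are simulated numerically. A minimal Euler–Maruyama sketch with hand-written stand-ins for the learned drift and variance functions (the solver choice and the toy mean-reverting dynamics are assumptions for illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the learned networks mu_gamma and sigma_gamma.
def mu(z, t):
    return -z            # toy mean-reverting drift

def sigma(z, t):
    return 0.5           # toy constant diffusion

def euler_maruyama(z0, t_grid):
    """Simulate dZ = mu(Z,t) dt + sigma(Z,t) dW on a 1-D state."""
    z = np.full(len(t_grid), float(z0))
    for k in range(1, len(t_grid)):
        dt = t_grid[k] - t_grid[k - 1]
        dW = rng.normal(scale=np.sqrt(dt))  # Brownian increment ~ N(0, dt)
        z[k] = z[k - 1] + mu(z[k - 1], t_grid[k - 1]) * dt \
                        + sigma(z[k - 1], t_grid[k - 1]) * dW
    return z

t = np.linspace(0.0, 1.0, 101)
path = euler_maruyama(1.0, t)
```

Each call produces one sampled latent trajectory; in CLPF such trajectories are then pushed through the decoder described next.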
The latent SDE dynamics then produce an observable process Xₜ as follows:
\begin{equation}X_{t}=F_{\theta}\left(O_{t} ; Z_{t}, t\right)\end{equation}
where Oₜ is a ddimensional Ornstein–Uhlenbeck process with closedform transition density and Fθ( · ; zₜ, t) is a normalizing flow parameterized by θ for any zₜ, t.
Because CLPF latent dynamics follow a generic, flexible SDE process, exact computation of the observational log-likelihood is generally intractable. It’s therefore necessary to use a variational approximation to compute the training gradient of a CLPF model or to perform inference. To this end, the authors construct a novel evidence lower bound (ELBO)
\begin{equation}\mathbb{E}_{\omega^{(1)}, \ldots, \omega^{(n)} \sim W_{t}^{(1)} \times \cdots \times W_{t}^{(n)}}\left[\sum_{i=1}^{n} \log p\left(x_{t_{i}} \mid x_{t_{i-1}}, \tilde{z}_{t_{i}}, \tilde{z}_{t_{i-1}}, \omega^{(i)}\right)+\sum_{i=1}^{n} \log M^{(i)}\left(\omega^{(i)}\right)\right]\end{equation}
where ω⁽¹⁾, …, ω⁽ⁿ⁾ are independent Wiener processes that (speaking informally for brevity) construct Wₜ piecewise, and M⁽ⁱ⁾ is an importance weight between the prior and posterior latent SDE.
The authors compare CLPF with previous continuous-dynamics methods for modelling irregular observations of a continuous process, as well as with a non-continuous Variational RNN (VRNN) baseline that excels at likelihood estimation but cannot properly generate trajectories.
The authors first evaluate CLPF on synthetic data sampled from known stochastic processes, to verify its ability to capture a variety of continuous dynamics. They compare CLPF with previous continuous methods on the synthetic cases Geometric Brownian Motion (GBM), Linear SDE (LSDE), Continuous AR(4) Process (CAR), and Stochastic Lorenz Curve (SLC):
| Model | GBM λ=2 | GBM λ=20 | LSDE λ=2 | LSDE λ=20 | CAR λ=2 | CAR λ=20 | SLC λ=20 | SLC λ=40 |
|---|---|---|---|---|---|---|---|---|
| VRNN | 0.425 | 0.650 | 0.634 | 1.665 | 1.832 | 2.675 | 2.237 | 1.753 |
| Latent ODE [33] | 1.916 | 1.796 | 0.900 | 0.847 | 4.872 | 4.765 | 9.117 | 9.115 |
| CTFP [12] | 2.940 | 0.678 | 0.471 | 1.778 | 383.593 | 51.950 | 0.489 | 0.586 |
| Latent CTFP [12] | 1.472 | 0.158 | 0.468 | 1.784 | 249.839 | 43.007 | 1.419 | 0.077 |
| Latent SDE [25] | 1.243 | 1.778 | 0.082 | 0.217 | 3.594 | 3.603 | 7.740 | 8.256 |
| CLPF (ours) | 0.444 | 0.698 | 0.831 | 1.939 | 1.322 | 0.077 | 2.620 | 3.963 |
To evaluate CLPF on real data, the authors generate irregular time-series data by sampling from the Mujoco-Hopper, Beijing Air-Quality (BAQD), and PTB Diagnostic (PTBDB) datasets at irregular time intervals. The authors find that CLPF outperforms existing continuous-dynamics models at likelihood estimation, and nearly closes the gap with VRNN in sequential prediction:
| Model | Mujoco [33] | BAQD [37] | PTBDB [5] |
|---|---|---|---|
| VRNN [10] | 15876 | 1.204 | 2.035 |
| Latent ODE [33] | 23551 | 2.540 | 0.533 |
| Latent SDE [25] | 3071 | 1.512 | 1.358 |
| CTFP [12] | 7598 | 0.170 | 1.281 |
| Latent CTFP [12] | 12693 | 0.480 | 1.659 |
| CLPF-ANODE (ours) | 14694 | 0.619 | 1.575 |
| CLPF-iRes (ours) | 10873 | 0.486 | 1.519 |
| Model | GBM | LSDE | CAR | SLC |
|---|---|---|---|---|
| CLPF-Global | 0.447 | 0.821 | 1.552 | 3.304 |
| CLPF-Independent | 0.800 | 0.326 | 4.970 | 7.924 |
| CLPF-Wiener | 0.390 | 0.790 | 1.041 | 1.885 |
| Latent SDE | 1.243 | 0.082 | 3.594 | 7.740 |
| CLPF | 0.444 | 0.831 | 1.322 | 2.620 |
| Model | Mujoco [33] | BAQD [37] | PTBDB [5] |
|---|---|---|---|
| VRNN [10] | 1.599, [0.196, 1.221] | 0.519, [0.168, 0.681] | 0.037, [0.005, 0.032] |
| Latent ODE [33] | 13.959, [9.857, 15.673] | 1.416, [0.936, 1.731] | 0.224, [0.114, 0.322] |
| Latent SDE [25] | 7.627, [2.384, 8.381] | 0.848, [0.454, 1.042] | 0.092, [0.032, 0.111] |
| CTFP [12] | 1.969, [0.173, 1.826] | 0.694, [0.202, 10966] | 0.055, [0.006, 0.046] |
| Latent CTFP [12] | 1.983, [0.167, 1.744] | 0.680, [0.189, 0.943] | 0.065, [0.007, 0.059] |
| CLPF-ANODE (ours) | 1.629, [0.149, 1.575] | 0.542, [0.150, 0.726] | 0.048, [0.005, 0.041] |
| CLPF-iRes (ours) | 1.846, [0.177, 1.685] | 0.582, [0.183, 0.805] | 0.055, [0.006, 0.049] |
It is often desirable for a neural network to be monotonic: informally, to have an output function that is non-decreasing with respect to certain features. This paper identifies problems with existing general methods for training a neural network to be monotonic, and proposes a superior general method.
Monotonicity is a common requirement in real-life applications of prediction models. For example, in the case of models used to accept or reject job applications, we expect acceptance scores to be monotonically non-decreasing with respect to features such as a candidate’s years of experience. Such expectations of monotonicity often reflect accepted institutional best practices, or ethical or legal norms, so predictors that fail on monotonicity are unacceptable for real-life use.
While it is possible to guarantee monotonicity by defining a ‘monotonic by construction’ model class, such model classes have limited use, since they exclude many commonly used neural network architectures. More generally applicable approaches focus on finding monotonic candidates within a general class of models while performing empirical risk minimization. Since verifying a model’s monotonicity can be extremely computationally expensive, these approaches typically do not provide a guarantee. Instead, they rely on regularization penalties that bias a learning algorithm towards monotonic predictors.
The general form of the regularization penalty is:
\begin{equation}\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{i \in M} \max \left(0,-\frac{\partial h(x)}{\partial x_{i}}\right)^{2}\right]\end{equation}
where M is the set of input dimensions with respect to which monotonicity is desired, and ∂h(x)/∂x_i denotes the gradient of the predictor h with respect to input dimension i ∈ M. In other words, we penalize h for behaving non-monotonically at points sampled from some distribution D.
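The penalty is straightforward to estimate on any sample of points. Below is a minimal numpy sketch for non-decreasing monotonicity, using central finite differences in place of the automatic differentiation a real implementation would use; the toy predictors and data are invented for illustration:

```python
import numpy as np

def monotonicity_penalty(h, X, mono_dims, eps=1e-4):
    """Empirical estimate of E_x[ sum_i max(0, -dh/dx_i)^2 ].

    h: predictor mapping (n, d) arrays to (n,) outputs.
    X: points sampled from the chosen distribution D.
    mono_dims: indices of features h should be non-decreasing in.
    """
    penalty = 0.0
    for i in mono_dims:
        step = np.zeros(X.shape[1])
        step[i] = eps
        # Central finite-difference estimate of dh/dx_i at each point.
        grad_i = (h(X + step) - h(X - step)) / (2 * eps)
        penalty += np.mean(np.maximum(0.0, -grad_i) ** 2)
    return penalty

# A predictor that decreases in feature 0 is penalized;
# one that increases in it is not.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
decreasing = lambda x: -2.0 * x[:, 0] + x[:, 1]
increasing = lambda x: 2.0 * x[:, 0] + x[:, 1]
print(monotonicity_penalty(decreasing, X, [0]))  # ≈ 4.0 (gradient is -2)
print(monotonicity_penalty(increasing, X, [0]))  # 0.0
```

For the linear toy predictors the finite differences are exact, so the penalty is simply the squared magnitude of the violating gradient.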
The paper’s novel contribution concerns the choice of distribution D. In previous work, the chosen D was either the empirical distribution of the training sample or a uniform distribution over the input space. The paper demonstrates that both choices have serious shortcomings. When D is the empirical training distribution, monotonicity is only enforced close to the training data, and may fail on the test data in the case of a covariate shift. When D is the uniform distribution, the sampled points will likely lie far from the training data, thus failing to enforce monotonicity around the training data. (This is particularly likely in a high-dimensional input space, where uniformly sampled points are likely to lie close to the input space’s boundary.)
The paper’s solution is to compute the regularization penalty on points generated by mixing up points from the training data and random points. To sample from the ‘mixup’ distribution D, the regularizer augments a minibatch of N training examples with N random points, then samples random interpolations of random pairs of points from the augmented minibatch. The authors hypothesize that mixup enforces monotonicity in parts of the space that are disregarded if one focuses only on observed data or on draws from uninformed choices of distribution such as the uniform.
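The sampling procedure just described can be sketched in a few lines of numpy; the input range for the uniform random points is an assumption made for illustration:

```python
import numpy as np

def sample_mixup_points(X_batch, rng, low=-3.0, high=3.0):
    """Sample regularization points from the 'mixup' distribution:
    augment a minibatch with as many uniform random points, then
    interpolate random pairs from the augmented pool."""
    n, d = X_batch.shape
    X_rand = rng.uniform(low, high, size=(n, d))
    pool = np.vstack([X_batch, X_rand])        # 2N candidate points
    i = rng.integers(0, 2 * n, size=n)         # random pair indices
    j = rng.integers(0, 2 * n, size=n)
    lam = rng.uniform(0.0, 1.0, size=(n, 1))   # interpolation weights
    return lam * pool[i] + (1 - lam) * pool[j]

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 2))
Z = sample_mixup_points(X, rng)
print(Z.shape)  # (32, 2)
```

The regularization penalty would then be evaluated on `Z` rather than on the training batch or on purely random points.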
The authors first test their hypotheses on synthetic data manifolds with a covariate shift between the training/validation data and the test data. The results show that applying the monotonicity regularizer to the mixup distribution enforces monotonicity on both the test data and random points, whereas applying the regularizer to the training data or to random points is effective in at most one of the two conditions. In addition, the results suggest that when scaling to high dimensions, the covariate shift weakens the effect of training-set enforcement but not of mixup enforcement.
| Regularizer | 20/100: ρ_random | 20/100: ρ_test | 40/200: ρ_random | 40/200: ρ_test | 80/400: ρ_random | 80/400: ρ_test | 100/500: ρ_random | 100/500: ρ_test |
|---|---|---|---|---|---|---|---|---|
| Non-mon. | 99.90% | 99.99% | 97.92% | 94.96% | 98.47% | 96.56% | 93.98% | 90.01% |
| Ω_random | 0.00% | 3.49% | 0.00% | 4.62% | 0.01% | 11.36% | 0.02% | 19.90% |
| Ω_train | 1.30% | 0.36% | 4.00% | 0.58% | 9.67% | 0.25% | 9.25% | 5.57% |
| Ω_mixup | 0.00% | 0.35% | 0.00% | 0.44% | 0.00% | 0.26% | 0.00% | 0.42% |
The results generalize to real datasets, where mixup regularization achieves the best monotonicity under every evaluation condition. The authors additionally observe that successfully enforcing monotonicity has little effect on prediction performance, suggesting that monotonic predictors are viable as predictors:
| Metric | Non-mon. | Ω_random | Ω_train | Ω_mixup |
|---|---|---|---|---|
| Validation RMSE | 0.213±0.000 | 0.223±0.002 | 0.222±0.002 | 0.235±0.001 |
| Test RMSE | 0.221±0.001 | 0.230±0.001 | 0.229±0.002 | 0.228±0.001 |
| ρ_random | 99.11%±1.70% | 0.00%±0.00% | 14.47%±7.55% | 0.00%±0.00% |
| ρ_train | 100.00%±0.00% | 7.23%±7.76% | 0.01%±0.01% | 0.00%±0.00% |
| ρ_test | 100.00%±0.00% | 6.94%±7.43% | 0.04%±0.03% | 0.00%±0.00% |
The authors note that the mixup strategy introduces no computational overhead over the existing strategies, and is therefore strictly preferable. They propose that in future work the mixup strategy could be used to improve interpretability in complex neural network models, by enforcing homogeneity between a network’s outputs and a subset of its high-level representations.
OOD / distribution shift / domain adaptation
by Jiawei He
Time series research remains a cutting-edge field in the machine learning community. It is especially important in finance applications, where we face stock price data, credit card transaction data, and so on. A common assumption when training neural networks via maximum likelihood estimation is that the errors across time steps are uncorrelated, and this assumption still underlies almost all machine learning optimization. However, in time series problems, errors can intrinsically be autocorrelated in many cases due to the temporal nature of the data, which makes such maximum likelihood estimation inaccurate. Although adjusting for autocorrelated errors in linear or nonlinear time series data has been studied extensively, especially in econometrics, those methods are applicable only when the exact (and correct) form of the underlying system is known. On the other hand, neural networks for time-series-related tasks have become a popular research direction due to their effectiveness in approximating unknown, nonlinear systems.
To adjust for autocorrelated errors, this paper proposes a method to jointly learn the autocorrelation coefficient with the model parameters via gradient descent. Extensive simulations verify the effectiveness of the proposed approach for time series forecasting. Results across a wide range of real-world datasets with various state-of-the-art models show that the proposed method enhances performance in almost all cases. Based on these results, the authors suggest empirical critical values to determine the severity of autocorrelated errors. The limitations mentioned in the paper include (1) the method is not applicable to probabilistic forecasting, and (2) if the underlying time series can be modelled well by a known process, the benefit of adopting this approach diminishes. For future research directions, the authors suggest exploring more complex, higher-order autocorrelated errors with quantile regression and probabilistic forecasting.
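To make the core idea concrete, here is a deliberately simplified sketch of learning a first-order autocorrelation coefficient by gradient descent on AR(1)-whitened residuals. This is an illustration of the general idea, not the paper’s exact formulation, and the forecasting model itself is held fixed here so that only the coefficient is learned:

```python
import numpy as np

def ar1_adjusted_loss(errors, rho):
    """With residuals e_t = y_t - f(x_t), minimize the mean of
    (e_t - rho * e_{t-1})^2, so rho can be learned jointly with
    the model parameters by gradient descent."""
    return np.mean((errors[1:] - rho * errors[:-1]) ** 2)

# Residuals with genuine autocorrelation: e_t = 0.8 e_{t-1} + noise.
rng = np.random.default_rng(0)
e = np.zeros(2000)
for t in range(1, 2000):
    e[t] = 0.8 * e[t - 1] + rng.normal(scale=0.1)

# Gradient descent on rho alone (in the paper, the model's
# parameters would be updated in the same loop).
rho = 0.0
for _ in range(200):
    grad = -2 * np.mean(e[:-1] * (e[1:] - rho * e[:-1]))
    rho -= 2.0 * grad
print(rho)  # converges to the least-squares estimate, close to 0.8
```

The loss is quadratic in `rho`, so gradient descent recovers the coefficient that best whitens the residuals.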
by Jiawei He
Many complex time series can be effectively subdivided into distinct regimes that exhibit persistent dynamics. Discovering the switching behaviour and the statistical patterns in these regimes is important for understanding the underlying dynamical system. State Space Models (SSMs) are a powerful tool for such tasks—especially when combined with neural networks—since they provide a principled framework for time series modelling. One of the most popular SSMs is the Linear Dynamical System (LDS), which models the dynamics of the data using a continuous latent variable, called the state, that evolves with Markovian linear transitions. The assumptions of the LDS allow for exact inference of the states; however, they are too restrictive for real-world systems, which often exhibit piecewise linear or nonlinear hidden dynamics with a finite number of operating modes or regimes.
In this paper, the Recurrent Explicit Duration Switching Dynamical System (RED-SDS) is proposed. RED-SDS is a nonlinear state space model that is capable of identifying both state- and time-dependent switching dynamics. State-dependent switching is enabled by a recurrent state-to-switch connection, and an explicit duration count variable is used to improve the time-dependent switching behaviour. The authors also propose an efficient hybrid inference and learning algorithm that combines approximate inference for states with conditionally exact inference for switches and counts. The model is trained by maximizing a Monte Carlo lower bound on the marginal log-likelihood that can be computed efficiently as a by-product of the inference routine. Thorough evaluation on a number of benchmark datasets for time series segmentation and forecasting demonstrates that RED-SDS can learn meaningful duration models, identify both state- and time-dependent switching patterns, and extrapolate the learned patterns consistently into the future. Future research directions include semi-supervised time series segmentation: for time steps where the correct regime label is known, it is straightforward to condition on this additional information rather than performing inference, which may improve segmentation accuracy while providing an inductive bias that corresponds to an interpretable segmentation.
by Nazanin Mehrasa
Event sequences, a special form of time-series data, are discrete events in continuous time, meaning that events happen asynchronously. This type of data is prevalent in a wide variety of applications, such as social networks, the stock market, healthcare, and seismology. To analyze event sequences and perform tasks such as future prediction, it is crucial to understand the complex influences of events on each other, including excitation, inhibition, and how the strength of these influences varies with time.
In this work, the authors propose a temporal point process framework for modeling event sequences. A temporal point process (TPP) is a mathematical framework for characterizing and modeling event sequences. A TPP is usually defined by specifying an intensity function, which encodes the expected rate of events. To define the intensity, most previous neural point processes use recurrent neural networks and often couple all the temporal dependencies in a black box, which lacks interpretability regarding how events influence each other. In addition, existing work often assumes simple functional forms for the influence strength, which limits the model’s expressiveness (e.g., exponential time decay of the influence strength). In this paper, the authors propose SPRITE, short for Self-adaptable Point pRocess wIth nonparametric Time dEcays, which defines the intensity by decoupling the influences between every pair of events in the history and models each influence via a nonparametric function of event types and timing. They introduce a general construction that covers all possible time-decaying functions of the influence strength, resulting in a more flexible and expressive model that also provides more interpretability. The proposed model outperforms baseline models on synthetic and real-world datasets, demonstrating the effectiveness of the proposed approach.
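For context, the kind of fixed functional form that SPRITE generalizes can be illustrated with the classic exponential-decay (Hawkes-style) intensity, where every past event adds an exponentially fading bump to the base rate; the parameter values below are arbitrary:

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Classic exponential-decay intensity: each past event t_j
    contributes alpha * exp(-beta * (t - t_j)) on top of the base
    rate mu. SPRITE instead learns the decay shape nonparametrically
    per pair of events, rather than fixing it to this form."""
    history = np.asarray(history)
    past = history[history < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

# Intensity just after a burst of three events.
print(round(hawkes_intensity(2.0, [0.5, 1.0, 1.8]), 3))  # 1.328
```

The fixed `exp(-beta * dt)` decay is exactly the restriction the paper relaxes.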
by Siqi Liu
Out-of-Distribution (OOD) detection aims to detect examples in the test data that are not from the same distribution as the training data. Detecting these anomalous instances not only has great value on its own for applications like alerting systems, where the purpose is to discover such instances, but also helps to avoid or reduce the risks of applying machine learning models, especially in risk-averse settings such as healthcare and finance. In this work, the authors study the problem of OOD detection for data generated by temporal point processes (TPPs), i.e., event sequences. They connect OOD detection with goodness-of-fit (GOF) tests for TPPs and propose a new statistic, Sum-of-Squared-Spacings (3S), for GOF tests that addresses some limitations of existing widely used methods, such as insensitivity to the total number of events. In the experiments, their method shows strong and stable performance across different types of generative processes and real-world datasets.
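A rough sketch of a sum-of-squared-spacings statistic on a compensator-transformed sequence follows. The exact normalization and boundary handling follow the paper; this version is an illustrative assumption, but it shows the key property that both the placement of events and their total number affect the value:

```python
import numpy as np

def sum_of_squared_spacings(arrival_times, T):
    """Sum of squared spacings of a transformed event sequence that,
    under the null model, is a unit-rate Poisson process on [0, T].
    The spacings include the gaps to 0 and T, so a long empty
    stretch (or too few events) inflates the statistic."""
    t = np.concatenate([[0.0], np.sort(arrival_times), [T]])
    spacings = np.diff(t)
    return np.sum(spacings ** 2) / T

rng = np.random.default_rng(0)
T = 100.0
# Well-specified case: roughly unit-rate Poisson arrivals on [0, T].
poisson_times = np.cumsum(rng.exponential(1.0, size=120))
poisson_times = poisson_times[poisson_times < T]
# Misspecified case: all events clustered at the start.
burst_times = rng.uniform(0.0, 5.0, size=100)

print(sum_of_squared_spacings(poisson_times, T))  # small: spread out
print(sum_of_squared_spacings(burst_times, T))    # large: long empty gap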
by Siqi Liu and Yik Chau Liu
Empirical Risk Minimization (ERM) is commonly used to train models in machine learning, but in practice, distribution shifts (or domain shifts) can cause problems for these models and result in suboptimal performance. Previously, researchers have studied similar problems in several related areas, such as domain adaptation, domain generalization, and meta-learning. In this work, the authors combine ideas from meta-learning and domain adaptation and propose a generic framework termed Adaptive Risk Minimization (ARM). In this framework, the model meta-learns on the training data so that it can adapt to distribution shifts at test time using only unlabeled data. The model consists of an adaptation model and a prediction model, and is optimized for post-adaptation performance. The authors develop several methods within this framework, using either a contextual approach or a gradient-based approach; in the experiments these outperform previous methods that focus on either test-time domain adaptation or training-time domain generalization alone, demonstrating the benefits of combining adaptation with meta-learning.
by Ruizhi Deng
How should we fit partial observations of continuous time-series dynamics on discrete time grids? Using a probabilistic model with continuous dynamics is an intuitively promising idea: defining continuous dynamics permits us to sample trajectories over a continuous time range and perform inference at arbitrary time points. Deep learning models equipped with continuous dynamics were not actively studied until recently. Continuous Latent Process Flows (CLPF) can be viewed as an extension of two recent models: the latent SDE and the Continuous Time Flow Process (CTFP). CLPF combines the expressive power of the latent SDE with the time-dependent decoding of CTFP as a better inductive bias for generating trajectories that are continuous in time. In addition, CLPF proposes a flexible approach to the posterior process for variational approximation in a principled, piecewise manner. CLPF demonstrates competitive performance on both synthetic and real-world data.
by Ruizhi Deng
Normalizing flows are generative models that transform a simple base distribution into a complex target distribution using invertible mappings. The affine coupling layer is a popular choice of building block in normalizing flows, as the determinant of the transformation’s triangular Jacobian can be computed in linear time. Normalizing flows using affine coupling layers have also demonstrated promising success when scaled up to high-dimensional data like images. However, understanding of affine coupling flows’ theoretical properties, especially their representational power, remained limited until recently. Previous works studying the universal approximation property of affine coupling flows rely on constructions leading to ill-behaved Jacobians that are nearly singular and cause difficulties in practice. Under mild assumptions, the reviewed work employs a different construction to show that the standard Gaussian can be transformed by affine coupling flows to approximate a target distribution arbitrarily well in the Wasserstein distance. The construction pads the target distribution with standard Gaussian noise, and the determinant of the transformation’s Jacobian is bounded above and below by constants. The proposed construction is supported by practices in previous works that improve the training of normalizing flows, and has broader implications for the universal approximation power and training of other types of normalizing flows.
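To make the linear-time Jacobian claim concrete, here is a minimal affine coupling layer in numpy; the toy scale/shift functions stand in for the learned neural networks a real flow would use:

```python
import numpy as np

def affine_coupling_forward(x, scale_net, shift_net):
    """One affine coupling layer: split x into halves (x1, x2),
    keep x1, and transform x2 elementwise as
        y2 = x2 * exp(s(x1)) + t(x1).
    The Jacobian is triangular, so its log-determinant is just the
    sum of s(x1) -- computable in linear time."""
    d = x.shape[-1] // 2
    x1, x2 = x[..., :d], x[..., d:]
    s, t = scale_net(x1), shift_net(x1)
    y = np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)
    log_det = np.sum(s, axis=-1)
    return y, log_det

def affine_coupling_inverse(y, scale_net, shift_net):
    """Exact inverse: y1 is untouched, so s and t can be recomputed."""
    d = y.shape[-1] // 2
    y1, y2 = y[..., :d], y[..., d:]
    s, t = scale_net(y1), shift_net(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)

# Toy scale/shift functions (stand-ins for learned networks).
scale = lambda h: np.tanh(h)
shift = lambda h: 0.5 * h

x = np.array([[0.3, -1.2, 0.7, 2.0]])
y, log_det = affine_coupling_forward(x, scale, shift)
x_rec = affine_coupling_inverse(y, scale, shift)
print(np.allclose(x, x_rec))  # True
```

Note that invertibility holds for any scale/shift functions, which is why the half that conditions the transformation can be an arbitrary neural network.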
by Alex Radovic
This NeurIPS, a number of papers have drawn exciting connections between diffusion-, normalizing-flow-, and variational-autoencoder-based generative models. These connections, motivated by theory, are allowing for improved optimization of diffusion models, an extremely exciting and performant family of generative models. This paper specifically uses these connections to motivate a new optimization strategy that improves likelihood estimation with score-based diffusion models. Diffusion models learn a process that transforms data samples into pure noise, and a reversal of that same process, which allows them to act as powerful generative models, creating convincing data samples from noise. Score-based diffusion models are trained to minimize a weighted combination of score matching losses, and are defined by an SDE. These score-based models can be interpreted as continuous normalizing flows, allowing for exact likelihood calculations while still being trained with a score matching loss. Training with score matching is much more efficient than training a continuous normalizing flow, which requires expensive calls to an ODE solver at every step of training. However, this objective provides no guarantee that likelihood scores are improved. This paper provides a new weighted score loss which is shown to upper-bound the negative log-likelihood, analogous to the lower bound used when training variational autoencoders. This novel, theory-motivated loss is then shown to empirically improve likelihood estimation across a variety of score-based diffusion models and datasets. Broadly, this work and others at NeurIPS suggest that score-based diffusion models, with appropriate optimization choices, can provide likelihood-estimation performance competitive with continuous normalizing flows, but with far more efficient training.
by Andreas Lehrmann
Neural ordinary differential equations (Neural ODEs) are a popular class of statistical models based on a continuous-depth parametrization of a hidden state’s derivative. Extensions of this idea form the basis for a variety of latent variable models for asynchronous time-series data, including models based on latent ODEs, continuously-indexed normalizing flows, and neural stochastic differential equations. The optimization of Neural ODEs (i.e., computing gradients of network parameters w.r.t. a loss) is based on the adjoint sensitivity method, which includes expensive calls to a black-box ODE solver. Neural flows circumvent this problem by directly modeling the solution curves (the flow) instead of the original ODE and, as a result, do not have to rely on ODE solvers. One technical challenge is that the architecture representing the flow must be a valid solution to an ODE (e.g., the solution curves corresponding to different initial values cannot intersect). The paper formalizes these constraints and demonstrates how popular neural network layers, such as residual layers, GRU cells, or coupling layers, can be adapted accordingly. Applications of this approach include flow versions of encoder-decoder architectures like ODE-RNNs for filtering/smoothing, as well as flow versions of normalizing flows for time-dependent density estimation (1). Comprehensive experiments show that neural flows not only outperform their ODE counterparts in terms of raw performance but, depending on the task and architecture, can also be up to an order of magnitude faster.
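A toy example of the distinction between modeling the ODE and modeling its flow: for dx/dt = a·x, the flow F(t, x0) = x0·exp(a·t) can be evaluated directly at any time with no solver, and it satisfies the constraints named above (identity at t = 0, non-intersecting solution curves). A neural flow parametrizes F with a network built to respect those same constraints:

```python
import numpy as np

def linear_flow(x0, t, a=0.5):
    """The flow of the ODE dx/dt = a*x: F(t, x0) = x0 * exp(a*t).
    Evaluating F directly replaces numerically integrating the ODE
    from 0 to t -- the core idea behind neural flows, where F is a
    constrained neural network instead of a closed form."""
    return x0 * np.exp(a * t)

print(linear_flow(1.5, 0.0))            # 1.5 -- identity at t = 0
print(round(linear_flow(1.5, 2.0), 3))  # state at t = 2, no solver call
```

Distinct initial values map to distinct states at every time, so the solution curves never intersect, which is exactly the validity condition the paper formalizes for learned flows.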
^{1 }Note that the use of flow in this sentence is overloaded, with “normalizing flows” and “neural flows” referring to two completely different concepts.
by Matthew Schlegel
As more machine learning models are applied in high-stakes applications, explaining a model’s predictions is a necessary part of responsible use of these models. Reasoning about how changes in the input change a model’s prediction is known as a counterfactual explanation. This paper extends the framework of counterfactual explanations to sequences of decisions, finding optimal counterfactual policies that maximize an outcome while remaining close to the observed action sequence. The policies returned by their polynomial-time algorithm improve outcomes on a series of synthetic and real datasets. The authors posit that counterfactual policies can be used to further elucidate complex decision-making processes and, specifically, give insight when counterfactual actions are concentrated on a few critical decision points. Looking beyond one-step decisions to multi-step action sequences is critical for explaining complex decision-making algorithms, and this paper provides excellent groundwork for building counterfactual explanations along trajectories.
by Yanshuai Cao
Many have hypothesized that deep learning and causal inference could complement each other well. On the one hand, understanding cause and effect could help fix some known issues with deep learning, such as the poor ability to generalize out of distribution and lack of robustness to adversarial attacks. On the other hand, the power of representation learning could scale causal inference to high-dimensional problems. However, most existing works that employ causal inference with deep neural networks use them as separate stages. For example, following Pearl’s do-calculus, a symbolic computation step is first executed for causal identification, turning a causal question into a statistical estimation problem, which can then be solved by fitting deep neural nets.
In this work, the authors combine causal inference and neural networks on a more fundamental level with the proposed neural causal models (NCMs) and perform the causal identification via gradient descent in the same process as neural net parameter learning. No more symbolic computation is needed, just the structural knowledge expressed through the design of the neural net, which deep learning researchers already spend lots of time engineering. The paper also has theoretical results about expressivity and identifiability of NCMs, which follow from the universal approximation theorem of feedforward neural nets and the “Causal No Free Lunch” principle entailed by Pearl’s Causal Hierarchy.
by Peng Xu
What problem does it solve?  Leveraging sparsity to make large Transformer models scale efficiently. To be specific, the goal is to perform inference faster than the standard Transformer as the model size scales up, while retaining empirical performance on real tasks.
Why is this important?  The Transformer architecture has achieved huge successes in the field of natural language processing in recent years, and lately it has gained great popularity in other fields as well. At the same time, the size of Transformer models keeps growing, along with the huge costs such models incur. As a result, it is increasingly important to make them scale efficiently.
The approach taken  This paper addresses the problem by proposing Scaling Transformers, which use a separate sparse mechanism for the query, key, value, and output layers (Sparse QKV for short) and combine it with sparse feedforward blocks (Sparse FF for short) to get a fully sparse Transformer architecture.
Results:
Scaling Transformers also yield competitive results on challenging real-world tasks, such as summarizing arXiv articles, compared to state-of-the-art approaches.
by Thibaut Durand
Predicting the future is a fundamental research problem with a wide range of applications, including demand forecasting, autonomous driving, robotics, and health care. However, this research problem is very challenging because the future is uncertain. Probabilistic generative models have shown promising results for this problem. This paper introduces the Probabilistic Transformer (ProTran) model, a state space model (SSM) based on transformer architectures for multivariate time series. Unlike existing models, ProTran does not rely on recurrent neural networks but on the attention mechanism, since attention has shown promising results in modelling long-range dependencies. Compared to other transformer-based models, ProTran is capable of generating diverse long-term forecasts with uncertainty estimates, and it shows very good performance on several tasks, such as time series forecasting and human motion prediction. Probabilistic time series forecasting is an active research problem at Borealis AI. I really like the ProTran model because it combines the strengths of state space models and transformer architectures. I think that capturing the uncertainty inherent in the future can lead to strong time series forecasting models that will help us make better financial decisions.
by Peng Xu
In the past years, Transformers have shown great success in multiple domains. However, training a Transformer can be expensive, in particular for the large models designed recently. This paper proposes to reduce the costs of Transformers by searching for a more efficient variant. To find Transformer alternatives, the authors designed a new search space, then used the Regularized Evolution with hurdles search algorithm to find the most training-efficient architecture in that space. They discovered a new model called Primer (PRIMitives searched transformER). The main finding is that the compute savings of Primer over Transformers increase as training cost grows, when controlling for model size and quality. The authors also found that the improvements of Primer over the Transformer can mostly be attributed to two modifications: squaring ReLU activations and adding a depthwise convolution layer after each Q, K, and V projection in self-attention. It is interesting that these modifications are easy to implement and can easily be added to existing Transformer codebases. The authors observed that these changes can significantly speed up the training of existing Transformer models without additional tuning. Improving the training of Transformers is an active research area at Borealis AI. Making training efficient can be critical for models working on non-stationary data, like time series forecasting models. I like this paper because it shows that some small architectural changes can substantially improve the training of Transformers.
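The two modifications are indeed simple to write down. A minimal numpy sketch follows; Primer applies the depthwise convolution along the sequence axis after the Q/K/V projections, and the causal padding and kernel width of 3 follow the paper, while the toy inputs are arbitrary:

```python
import numpy as np

def squared_relu(x):
    """Primer's activation: max(0, x)^2 in place of ReLU."""
    return np.maximum(0.0, x) ** 2

def depthwise_conv1d(x, kernel):
    """Causal depthwise convolution over the sequence axis, with one
    filter per channel, as applied after each Q/K/V projection.
    x: (seq_len, d_model); kernel: (k, d_model)."""
    k, d = kernel.shape
    padded = np.vstack([np.zeros((k - 1, d)), x])  # causal left-padding
    return np.stack([
        np.sum(padded[i:i + k] * kernel, axis=0) for i in range(x.shape[0])
    ])

x = np.array([[1.0, -2.0], [3.0, 0.5]])
print(squared_relu(x))  # [[1. 0.] [9. 0.25]]
kernel = np.ones((3, 2)) / 3.0
print(depthwise_conv1d(x, kernel).shape)  # (2, 2)
```

Each function is a drop-in, elementwise or per-channel change, which is why the paper reports the modifications transferring so easily to existing codebases.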
by Amir Abdi
One of the challenges of representation learning with Variational Autoencoders (VAEs) is a model identification issue related to rotational symmetries of the latent space, caused by the rotational invariance of the standard Gaussian prior. These rotational symmetries in the latent space can cause strong correlations between latent variables, which, in turn, hinder disentanglement. Moreover, because of the lack of rotational constraints, high variation in disentanglement metrics is observed between experiments with different seeds.
In this work, inspired by Independent Component Analysis (ICA), the authors propose the Jacobian-L1-regularized VAE, an extension of Beta-VAE with an added L1 norm on the Jacobian of the generator function, to address the rotational identifiability issue. The L1 loss encourages local alignment of the axes of the latent representation with individual factors of variation. The authors demonstrate improvements on extended versions of disentanglement metrics (i.e., MIG and Modularity) that focus on local disentanglement across factors of variation, compared to Beta-VAE, FactorVAE, DIP-VAE-I, DIP-VAE-II, Beta-TCVAE, and annealed VAE. This solution helps with local alignment of the factors of variation, yet does not address global alignment. Because the full Jacobian of the generator is calculated during training, compute time scales linearly with the number of latent dimensions.
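A sketch of the added penalty term follows, using finite differences in place of the autograd a real implementation would use. The toy linear decoder makes the expected value easy to check, since its Jacobian is just the weight matrix; note how one pass per latent dimension produces the linear scaling mentioned above:

```python
import numpy as np

def jacobian_l1_penalty(decoder, z, eps=1e-4):
    """L1 norm of the decoder's Jacobian dG/dz at a latent point z,
    estimated column-by-column with central finite differences.
    One decoder evaluation pair per latent dimension, hence cost
    linear in the latent dimensionality."""
    penalty = 0.0
    for i in range(z.shape[0]):
        dz = np.zeros_like(z)
        dz[i] = eps
        col = (decoder(z + dz) - decoder(z - dz)) / (2 * eps)
        penalty += np.sum(np.abs(col))
    return penalty

# Toy linear decoder: the Jacobian is W, so the penalty is sum(|W|).
W = np.array([[1.0, 0.0], [-2.0, 3.0]])
decoder = lambda z: W @ z
print(jacobian_l1_penalty(decoder, np.array([0.1, -0.4])))  # ≈ 6.0
```

In the paper this term is added to the Beta-VAE objective; the L1 norm (rather than L2) is what encourages each output to depend on few latent axes, giving the local axis alignment.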
The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.
We offer the tools to solve any natural language processing problem that a developer might have through the use of large language models. We have an API that allows people to access, fine-tune and deploy these state-of-the-art models, giving them the ability to solve pretty much any problem they can formulate.
(NF):
We're doing something that is transformative; getting computers to understand language has really broad impacts. We know we need to respect the power of the technology and understand the ways in which it could be used for good and for bad.
As the builder of that tool, you want to enable the good things that can be done with it while, at the same time, make the bad things that can be done with it more difficult to do and less effective.
(NF):
There's no silver bullet for this, even though there's a lot of people working on it. Languages continuously change – they're living things – so there will never be a complete lock on this.
If you take an extreme view, one way to address safety concerns would be to limit access to just the handful of companies that have the resources to create their own large language models. But we think the technology's really good. We think it's really transformative. And we want people to have access to it. So limiting access as a way to improve security is obviously not ideal.
The middle ground is that you make the technology as good as you can, and as ethical and responsible as you can. You then deploy it in a way that gives as many people access to it as possible, while balancing the risk and ensuring it is deployed responsibly.
(NF):
Let’s use hateful content as an example. Prior to deploying the model, we spend a lot of time trying to reduce the likelihood of it generating hateful or identity-based hate content.
The most straightforward way is by changing the distribution of the training data. And that can be done with some really simple techniques like word-level filtration – where documents are removed from the training data if they contain a word from a prepopulated list of slurs, for example. But that obviously doesn’t catch everything.
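A toy version of that word-level filtration might look like the following; the blocklist and documents here are invented placeholders, and, as noted above, an approach this simple misses misspellings, phrases, and context:

```python
def filter_documents(documents, blocklist):
    """Word-level filtration sketch: drop any training document that
    contains a token from a prepopulated blocklist (after lowercasing
    and stripping basic punctuation)."""
    blocked = {w.lower() for w in blocklist}
    return [
        doc for doc in documents
        if not any(tok.lower().strip(".,!?") in blocked for tok in doc.split())
    ]

docs = ["a harmless sentence", "contains badword here", "another clean one"]
print(filter_documents(docs, ["badword"]))  # drops the middle document
```

This is exactly the kind of crude first pass that the more sophisticated, model-driven techniques described next are meant to improve on.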
Some techniques are much more sophisticated. For example, we recently posted a paper that described how we are using our model to self-identify words and text that should be added to the list. In other words, we are using earlier versions of our big language model to remove harmful data for the next iteration of the model.
(NF):
Not really. We haven’t seen a drop in performance when filtering out identity-based hate speech, for example. If we do see a drop in performance, the impact is generally on the model’s ability to generate identity-based hate speech. So it’s really a win-win.
(NF):
I think it’s tempting to just say that morality is subjective. To ask, “Who are we to make the decisions?” It’s easy to abdicate the decision. But I don't agree with that at all.
I think it’s far better to recognize that it’s subjective, and then to work really hard to make the right decisions based on input from as many smart people as possible. And I think founders of startups have an even greater responsibility to ensure the technologies they are building are contributing to the good in the world. We cannot just simply abdicate that responsibility to users.
(NF):
I’ll be the first to admit that I'm not an expert in ethics. That's not my background. And I know that. So it's really helpful to have a group of people who have studied that area and its intersections with technology.
We set up a Responsibility Council at Cohere. And when we’re faced with a complicated problem, we can reach out to this group of diverse people to get their input. They give us suggestions. They pay attention to how we're doing things. And they give us advice and recommendations and tell us if we're doing the right stuff.
I think in the technology sector, we often think most problems can be addressed by applying more tech. But the reality is that there are a whole bunch of complicated problems that can't be addressed with pure tech solutions. These are problems that require people who have spent a lot of time thinking about a bunch of the other domains of research that are not the hard sciences.
(NF):
We take a holistic and distributed approach to this. Alongside our Responsibility Council, we have our own internal experts who are largely dedicated to working on responsibility. We also want these concepts and ideas to be flowing across the organization and through the culture. So we try to distribute some of the responsibilities across the whole team, encouraging as many people as possible to work on it.
The point is to ensure the idea of responsibility doesn’t get stuck in siloed thinking – that people are engaged on these topics as much as possible, and you are making sure it is spread out across the organization. Responsibility can’t just exist on a slide in the organizational mission statement.
(NF):
We really need to respect the technology that we work with. Machine learning can work. It can be transformative. It can have a massive impact on people’s lives. So you need to make sure you are building something that is having a positive impact and minimizing the potential for negative impact.
At Cohere, we try to think about these issues as early as possible in the development cycle. And we are working with a bunch of really smart people to help ensure we don’t allow a blind spot to emerge down the road.
My advice would be to get as much input from as many different people as possible. And to start thinking about it from the very start. Other than that, just try to do your best.
Nick Frosst is the co-founder of Cohere. Prior to founding Cohere, Nick worked on neural network research as part of Geoffrey Hinton’s Toronto Google Brain team, focusing on capsule networks, adversarial examples, and explainability. Nick holds a BSc from the University of Toronto, with a double major in Computer Science and Cognitive Science. Nick co-founded Cohere in January 2019 with Aidan Gomez and Ivan Zhang.
At Borealis AI, we firmly believe that the development of responsible ML requires diverse views, research and talent. And we are committed to encouraging greater diversity and inclusion in our actions, our research and our collaborative partnerships.
That is why Borealis AI is proud to support the 2021 Women in Machine Learning (WiML) Workshop. This important event gives female-identified faculty, research scientists, and graduate students in the machine learning community an opportunity to meet, exchange ideas and learn from each other. In doing so, WiML is on a mission to increase gender diversity in ML, help women-identified individuals in ML succeed professionally, and increase their impact within their communities.
“At Borealis AI, we are committed to empowering and engaging female-identified researchers in the field of ML,”
noted Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI.
“Alongside our range of other diversity and inclusion initiatives, we hope our support of the 2021 WiML Workshop at NeurIPS provides those researchers – and those aspiring to join the field of ML – with the role models, ideas and inspiration to drive their career in ML forward.”
Hosted virtually within the 2021 Conference on Neural Information Processing Systems (NeurIPS), this year’s event builds on 15 years of programs designed around substantive technical and professional conversations held within positive, supportive environments. To learn more about WiML and the WiML Workshop, visit the WiML website.
The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) is the preeminent forum for collaboration around computational linguistics and natural language processing. This year’s conference is expected to attract around 4,000 attendees, both in person and virtual. But, for a wide variety of reasons, forums like this can often be difficult to access for some researchers. And that directly impacts diversity.
As a Diversity and Inclusion sponsor of EMNLP, we aim to support researchers facing various types of hardship. We are helping provide accommodations for researchers with disabilities. We are helping to subsidize attendance for those dealing with financial hardship, those with family or childcare responsibilities, and first-time attendees from underrepresented regions or groups. And we are helping to enable remote participation for researchers unable to travel to the conference.
“Borealis AI is dedicated to growing, strengthening and diversifying the global machine learning talent pool through innovative and smart partnerships like our Diversity and Inclusion sponsorship of EMNLP 2021,”
“We look forward to meeting the attendees at our virtual booth and we are excited to see what new ideas, models and technologies will emerge from the event.”
In contrast to single-task learning (STL), multi-task learning (MTL) optimizes a single model to perform multiple related tasks simultaneously, aiming to improve generalization and parameter efficiency across tasks. In this case, two or more output targets are associated with the same input data. Effective multi-task learning typically requires task balancing to prevent one or more tasks from dominating the optimization, to reduce negative transfer, and to avoid overfitting. Standard MTL settings usually assume a homogeneous set of tasks, for example all tasks are classification or regression tasks, and the data are usually non-sequential. This scenario can greatly benefit MTL approaches with strong shared representations. In contrast, heterogeneous multi-task learning is defined by multiple classes of tasks, such as classification and regression with single- or multi-label characteristics and temporal data, being optimized simultaneously. The latter setting is more realistic but remains underexplored. In this post, we share a novel method that we recently developed for heterogeneous MTL.
Hard-parameter sharing networks [1], shown in Figure 1.b, are one of the pillars of multi-task learning. These networks are composed of a shared bottom and task-specific branches. Ma et al. [2] suggested that a single shared bottom might not be enough to generalize for all tasks in an application, and proposed to use several shared bottoms, which they call experts. The experts are combined using gate functions, and their combination is forwarded to the towers. The final architecture is called Multi-gate Mixture-of-Experts (MMoE), and is shown in Figure 1.c. MMoE generalizes better than its traditional hard-parameter sharing counterpart, but it has two weaknesses: first, it lacks a task-balancing mechanism; second, the only source of diversity among the experts is their random initialization. Although the experts can indeed become diverse if they specialize in different tasks, there is no guarantee that this will happen in practice. We propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) (Figure 1.d), a model that induces more diversity among the experts and has a task-balancing component.
Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx) takes its inspiration from ensemble learning, where more diverse learners tend to generalize better. MMoEEx can be divided into three parts: gates, experts and towers. Considering an application with $K$ tasks and input data $x \in \mathbb{R}^d$, the gate function $g^k()$ is defined as:
\begin{equation}
\label{eq:g}
g^k(x) = \text{softmax}(W^k x), \forall k \in \{1,\dots,K\} \tag{1}
\end{equation}
where $W^k \in \mathbb{R}^{E \times d}$ are learnable weights and $E$ is the number of experts, defined by the user. The gates control the contribution of each expert to each task.
The experts are $f_e(), \forall e\in \{1,\dots,E\}$. Our implementation is flexible enough to accept several expert architectures, which is essential for applications with different data types. For example, with temporal data the experts can be LSTMs, GRUs or RNNs; for non-temporal data, the experts can be dense layers. The experts' and gates' outputs are combined as follows:
\begin{equation}
\label{eq:f}
f^k(x) = \sum_{e=1}^{E} g^k_e(x)\, f_e(x), \forall k \in \{1,\dots,K\} \tag{2}
\end{equation}
The $f^k()$ are input to the towers, the task-specific part of the architecture. Their design depends on the data type and tasks. The towers $h^k$ output the task predictions as follows:
\begin{equation}
y^k = h^k(f^k(x)), \forall k \in \{1,\dots,K \} \tag{3}
\end{equation}
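Equations 1-3 can be sketched end to end in a few lines of numpy. This is a minimal illustration rather than the MMoEEx implementation: the single-dense-layer experts, the dimensions, and all variable names below are our own assumptions.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d, E, K, h = 8, 4, 3, 16                    # input dim, experts, tasks, expert width

W_gate = rng.normal(size=(K, E, d))         # gate weights W^k, Eq. (1)
W_exp = rng.normal(size=(E, h, d))          # one dense-layer expert f_e each
W_tow = rng.normal(size=(K, 1, h))          # task towers h^k, Eq. (3)

x = rng.normal(size=(d,))
expert_out = np.stack([W_exp[e] @ x for e in range(E)])   # (E, h)

y = []
for k in range(K):
    g = softmax(W_gate[k] @ x)              # distribution over experts, Eq. (1)
    f_k = (g[:, None] * expert_out).sum(axis=0)           # mixture, Eq. (2)
    y.append((W_tow[k] @ f_k).item())       # tower prediction, Eq. (3)
```

Each task gets its own gate, so the same pool of experts can be weighted differently per task.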
Previous Mixture-of-Experts models like [2] leverage several experts to make their final predictions; however, they rely on indirect approaches, such as random initialization, to foster diversity among the experts, and on the expectation that the gate function will learn how to combine them. Here we propose a mechanism to directly induce diversity among the experts, which we call $\textit{exclusivity}$.
Exclusivity: We set $\alpha E$ experts to be exclusively connected to one task. The value $\alpha\in[0,1]$ controls the proportion of experts that will be $\textit{exclusive}$. If $\alpha=1$, all experts are exclusive, and if $\alpha=0$, all experts are shared (same as MMoE). An exclusive expert is randomly assigned to one of the tasks $T_k$, but the task $T_k$ can still be associated with other exclusive experts and shared experts.
MMoEEx, similarly to MMoE, relies on the expectation that gate functions will learn how to combine the experts. Our approach induces more diversity by forcing some of these gates to be 'closed' to some experts, and the exclusivity mechanism is used to close part of the gates. The remaining nonclosed gates learn to combine the output of each expert based on the input data, according to Equation 1.
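The exclusivity assignment can be sketched as a binary gate mask, as below. The function name and mask representation are our own assumptions, not the paper's code.

```python
import numpy as np

def exclusivity_mask(E, K, alpha, seed=0):
    """Binary mask M[k, e] = 1 if expert e is visible to task k.

    alpha*E experts are made exclusive: each is randomly assigned to a
    single task and its gate is 'closed' for every other task. The
    remaining experts stay shared across all tasks (alpha=0 recovers MMoE).
    """
    rng = np.random.default_rng(seed)
    mask = np.ones((K, E), dtype=int)
    n_exclusive = int(round(alpha * E))
    exclusive = rng.choice(E, size=n_exclusive, replace=False)
    for e in exclusive:
        owner = rng.integers(K)   # the one task this expert serves
        mask[:, e] = 0
        mask[owner, e] = 1
    return mask

M = exclusivity_mask(E=8, K=4, alpha=0.5)
```

Multiplying the gate outputs by this mask (and renormalizing) closes the corresponding gates while leaving the shared experts untouched.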
Competing task optimization is another challenge of optimizing heterogeneous tasks. The goal of the MAML-MTL optimization is to balance the tasks at the gradient level. Finn et al. [3] proposed Model-Agnostic Meta-Learning (MAML), a two-step optimization approach originally intended for transfer learning and few-shot learning due to its fast convergence. Initial attempts to apply MAML to MTL show that it can balance the tasks at the gradient level and yield better results than some existing task-balancing approaches [4]. The core idea is that MAML's temporary update yields smoothed losses, which also smooth the gradients in direction and magnitude. However, unlike [4], we do not freeze task-specific layers during the intermediate/inner update. The MAML-MTL approach is shown in Figure 2: each task loss is evaluated, each loss is used to temporarily update the network, the tasks are re-evaluated at the temporary parameters, and the resulting task-specific losses are aggregated into the final loss that provides the actual network update.
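The two-step pattern can be illustrated on a toy problem. The sketch below assumes a scalar shared parameter and quadratic task losses purely for readability; the real method applies the same inner/outer steps to network weights.

```python
import numpy as np

# Toy MAML-MTL step: one shared scalar parameter w and K quadratic
# task losses L_k(w) = (w - t_k)^2, where the targets t_k stand in
# for different tasks. All values here are illustrative.
targets = np.array([1.0, 3.0, -2.0])
w, lr_in, lr_out = 0.0, 0.1, 0.05

def loss_grad(w, t):
    # dL_k/dw for L_k(w) = (w - t)^2
    return 2.0 * (w - t)

# 1) temporary (inner) per-task updates on copies of w
w_tmp = np.array([w - lr_in * loss_grad(w, t) for t in targets])

# 2) re-evaluate each task at its temporary parameters and aggregate;
#    the chain rule through step 1 contributes the (1 - 2*lr_in) factor
outer_grad = sum((1.0 - 2.0 * lr_in) * loss_grad(wt, t)
                 for wt, t in zip(w_tmp, targets))
w_new = w - lr_out * outer_grad            # the actual update
```

Because the outer gradient is taken at the temporarily updated parameters, each task's contribution is smoothed in direction and magnitude before aggregation.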
The Medical Information Mart for Intensive Care (MIMIC-III) database was proposed by [5] as a benchmark dataset for MTL on time-series data. It contains metrics of patients from over 40,000 intensive care unit (ICU) stays. This dataset has 4 tasks: two binary classification tasks, one temporal multi-label task, and one temporal classification task. Figure 3 shows the neural network adopted in our work and where each task is computed.
The full set of results for the MIMIC-III dataset is presented in Table 1. We compared MMoEEx with the multi-task channel-wise LSTM (MCW-LSTM) [6], a single-task trained network, a shared-bottom network, and MMoE [2].
MMoEEx outperforms all the compared approaches except on the Phenotype (Pheno) task. For both time-series tasks (LOS and Decomp), our approach outperforms all baselines. It is worth noting that for the LOS task, the hardest task in MIMIC-III, we achieve a relative improvement of more than $40$ percentage points over the multi-task channel-wise LSTM [6] and of more than $16$ percentage points over MMoE.
| Method | Pheno | LOS | Decomp | Ihm | $\Delta$ |
|---|---|---|---|---|---|
| MCW-LSTM [6] | **77.4** | 45.0 | 90.5 | 87.0 | +0.28% |
| Single Task [6] | 77.0 | 45.0 | 91.0 | 86.0 | – |
| Shared Bottom | 73.36 | 30.60 | 94.12 | 82.71 | -9.28% |
| MMoE | 75.09 | 54.48 | 96.20 | 90.44 | +7.36% |
| MMoEEx | 72.44 | **63.45** | **96.82** | **90.73** | **+11.74%** |
We measured how diverse MMoEEx experts are compared to traditional MMoE.
The diversity among experts can be scored by the distance between the experts' outputs $f_e, \forall e\in\{1,\dots, E\}$. Considering a pair of experts $i$ and $j$, the distance between them is defined as:
\begin{equation}
d_{i,j} = \sqrt{\sum_{n=1}^{N}\left(f_i(x_n)-f_j(x_n)\right)^2} \tag{4}
\end{equation}
where $N$ is the number of samples in the dataset and $d_{i,j} = d_{j,i}$; a matrix $D \in \mathbb{R}^{E\times E}$ keeps all the distances. To scale the distances into $d_{i,j}\in [0,1]$, we divide the raw entries of the distance matrix $D$ by the maximum distance observed, $\text{max}(D)$. A pair of experts $i,j$ with $d_{i,j} = 0$ are identical, and expert distances $d_{i,j}$ close to 0 indicate very similar experts; analogously, experts with $d_{i,j}$ close to 1 are very dissimilar. To compare the overall distance between the experts of a model, we define the $\textit{diversity score}$ $\bar{d}$ as the mean entry in $D$.
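Equation 4 and the diversity score are straightforward to compute. Below is a small numpy sketch; the helper name and the toy expert outputs are our own.

```python
import numpy as np

def diversity(expert_outputs):
    """Pairwise distance matrix D (Eq. 4), scaled into [0, 1] by its
    maximum entry, plus the diversity score (the mean entry of D)."""
    E = expert_outputs.shape[0]
    D = np.zeros((E, E))
    for i in range(E):
        for j in range(E):
            # Euclidean distance between the two experts' outputs
            D[i, j] = np.linalg.norm(expert_outputs[i] - expert_outputs[j])
    D /= D.max()                      # scale distances into [0, 1]
    return D, D.mean()

# toy outputs f_e(x_n) for E=3 experts on N=2 samples
outs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
D, d_bar = diversity(outs)
```

The matrix is symmetric with a zero diagonal, so identical experts pull the score toward 0 and dissimilar experts push it toward 1.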
We analyze the diversity score of the MMoE and MMoEEx experts on MIMIC-III. The two models have the same neural network structure, but MMoEEx uses the MAML-MTL optimization and enforces diversity. The MMoEEx model in Figure 4 was created with $\alpha = 0.5$: half of the experts were randomly assigned to be exclusive to one of the tasks, while in the MMoE model all experts are shared among all tasks. Figure 4 shows a heatmap of the distances $D^{MMoE}$ and $D^{MMoEEx}$ calculated on the MIMIC-III test set with 12 experts. MMoE's heatmap has overall lighter colors, indicating smaller diversity scores, than MMoEEx's. Quantitatively, MMoEEx produces a relative lift of $43\%$ in diversity score.
We presented a novel multitask learning approach called Multigate MixtureofExperts with Exclusivity (MMoEEx), which extends previous methods by introducing an exclusivity mechanism that induces more diversity among experts, allowing the network to learn representations that are more effective for heterogeneous MTL. We also introduce a twostep optimization approach called MAMLMTL, which balances tasks at the gradient level and enhances MMoEEx's capability to optimize imbalanced tasks.
MTL has achieved critical mass in multiple areas such as natural language processing [7, 8, 9], computer vision [10, 11, 12], reinforcement learning [13, 14] and multi-modal learning [15, 16]. Standard soft/hard parameter-sharing approaches are a well-established technique for handling multiple tasks. While they show improvements over single-task learning for tasks with similar characteristics, it is not fully explored how MTL can further improve heterogeneous task scenarios. Hybrid approaches like mixture-of-experts can mitigate several limitations of standard approaches and further extend their capabilities when coupled with specialized optimization methods. Optimization methods for MTL are in their infancy, and more research on meta-learning task balancing can greatly benefit MTL research. We hope this work inspires the community to further investigate multi-task learning at the network architecture and optimization levels.
^{[1] } Rich Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, 1993.
^{[2] } Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
^{[3] } Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
^{[4] } Sungjae Lee and Youngdoo Son. Multi-task learning with single gradient step update for task balancing. arXiv preprint arXiv:2005.09910, 2020.
^{[5] } Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific Data, 6(1):1–18, 2019.
^{[6] } Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9, 2016.
^{[7] } Victor Sanh, Thomas Wolf, and Sebastian Ruder. A hierarchical multi-task approach for learning embeddings from semantic tasks. In AAAI Conference on Artificial Intelligence, volume 33, 2019.
^{[8] } Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. In Annual Meeting of the Association for Computational Linguistics, 2019.
^{[9] } Cagla Aksoy, Alper Ahmetoglu, and Tunga Güngör. Hierarchical multitask learning approach for BERT. arXiv preprint arXiv:2011.04451, 2020.
^{[10] } Ximeng Sun, Rameswar Panda, Rogerio Feris, and Kate Saenko. AdaShare: Learning what to share for efficient deep multi-task learning. In Advances in Neural Information Processing Systems, 2020.
^{[11] } Shikun Liu, Edward Johns, and Andrew J Davison. End-to-end multi-task learning with attention. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
^{[12] } Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. In European Conference on Computer Vision, 2020.
^{[13] } Lerrel Pinto and Abhinav Gupta. Learning to push by grasping: Using multiple tasks for effective learning. arXiv preprint arXiv:1609.09025, 2016.
^{[14] } Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado van Hasselt. Multi-task deep reinforcement learning with PopArt. Technical report, DeepMind, 2019.
^{[15] } Subhojeet Pramanik, Priyanka Agrawal, and Aman Hussain. OmniNet: A unified architecture for multi-modal multi-task learning. arXiv preprint arXiv:1907.07804, 2019.
^{[16] } Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 12-in-1: Multi-task vision and language representation learning. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
Reinforcement learning (RL) has moved from toy domains to real-world applications such as navigation [4], software engineering [2], industrial design [11], and finance [10]. Each of these applications has inherent difficulties that correspond to long-standing fundamental challenges in RL: limited training time, partial observability, large action or state spaces, costly exploration and safety considerations, among others. Similar problems occur when using RL in trading markets; here, we focus on three aspects that we consider highly relevant for financial applications: risk-awareness, variance reduction, and robustness.
Risk is a common term that takes on many definitions in different scenarios. Our first question then is: what is risk? In the context of trading, risk is the potential that your chosen investments may fail to deliver your anticipated outcome. That could mean getting lower returns than expected, or losing your original investment; in certain forms of trading, it can even mean a loss that exceeds your deposit.
Our second question is: how do we measure risk? Risk assessment is a cornerstone of financial applications, and a well-known approach is to consider risk while assessing the performance (profit) of a trading strategy. Here, risk is a quantity related to the variance (or standard deviation) of the profit and is commonly referred to as "volatility". In particular, the Sharpe ratio [15], a commonly used measure in trading markets, considers both the profit generated by a trading strategy and the risk (variance) associated with it. The Sharpe ratio is commonly defined as the asset return divided by the standard deviation of the asset return.
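As a quick illustration, the Sharpe ratio of a return series can be computed in a few lines of numpy. The sample returns below are made up, and conventions vary in practice (e.g. annualization and the choice of risk-free rate):

```python
import numpy as np

def sharpe_ratio(returns, risk_free=0.0):
    """Sharpe ratio: mean excess return over its sample standard deviation."""
    excess = np.asarray(returns) - risk_free
    return excess.mean() / excess.std(ddof=1)

# made-up per-period returns of a strategy
r = [0.02, -0.01, 0.03, 0.01, 0.00]
sr = sharpe_ratio(r)
```

Two strategies with the same mean return can thus have very different Sharpe ratios if one is far more volatile than the other.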
Traditional RL aims to optimize the expected return, usually without consideration of risk. Risk-averse RL, in contrast, is a recent area that proposes to optimize an objective function that incorporates risk.
Risk-averse Q-learning (RAQL): Shen et al. [16] proposed a Q-learning algorithm that is shown to converge to the optimum of a risk-sensitive objective function:
\begin{align}
\label{eq:Risk_Averse_Objective}
\tilde{J}_{\pi}= \frac{1}{\beta}\log\mathbb{E}_{\pi}\left[\exp\left(\beta\sum_{t=0}^{\infty}\gamma^t r_t\right)\right]=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + \frac{\beta}{2}\mathrm{Var}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^t r_t\right] + O(\beta^2).
\end{align}
The training scheme is the same as Q-learning, except that in each iteration a utility function is applied to the TD error. A utility function is a monotonically increasing function. A concave utility function is applied when we want to optimize a risk-averse objective function; in contrast, a convex utility function is applied when we want to optimize a risk-seeking objective function. To summarize, applying a utility function to Q-learning is a concise way to incorporate risk in RL.
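The scheme can be sketched as follows. The exponential utility below is one standard concave choice and is our illustrative assumption, not necessarily the exact function used in [16]:

```python
import numpy as np

def utility(td, beta=0.5):
    """Concave (risk-averse) utility applied to the TD error.

    u(0) = 0 and u'(0) = 1, so small TD errors behave as in plain
    Q-learning while large positive surprises are damped."""
    return (1.0 - np.exp(-beta * td)) / beta

def raql_update(q, reward, q_next_max, lr=0.1, gamma=0.99, beta=0.5):
    """One tabular update: the ordinary TD error is passed through
    the utility before being applied to the Q estimate."""
    td = reward + gamma * q_next_max - q
    return q + lr * utility(td, beta)
```

Because the utility is concave, large positive TD errors contribute less than they would in vanilla Q-learning, biasing the agent away from high-variance actions.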
In trading markets, we care not only about the expected return but also about how 'safe' a strategy is. In RL, one common approach to measuring 'safety' is the variance of the return [7]. Here we mention two recent works.
Averaged-DQN [1]: This approach reduces training variance by maintaining multiple Q estimates in parallel and averaging previously learned Q-value estimates, which leads to a more stable training procedure and improved performance by reducing the approximation error variance in the target values. Averaged-DQN is theoretically shown to reduce the training variance, but it has no convergence guarantee.
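The target-averaging idea can be sketched in a few lines (the function name, shapes and values are illustrative, not the paper's implementation):

```python
import numpy as np

def averaged_target(q_snapshots, s_next, reward, gamma=0.9):
    """Averaged-DQN-style target: average several stored Q estimates
    before taking the max, lowering the variance of the target value."""
    q_avg = np.mean(q_snapshots, axis=0)      # (|S|, |A|) averaged estimate
    return reward + gamma * q_avg[s_next].max()

# two stored Q estimates for a single state with two actions;
# each is individually noisy, but their average is stable
snapshots = [np.array([[1.0, 3.0]]), np.array([[3.0, 1.0]])]
target = averaged_target(snapshots, s_next=0, reward=1.0)
```

Here each individual estimate would have produced an overestimated max of 3, while the averaged estimate yields 2, illustrating the variance (and overestimation) reduction.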
Variance-reduced Q-learning (VQL): Wainwright [18] proposed a variance-reduced Q-learning algorithm that can be seen as a variant of the SVRG algorithm in stochastic optimization [9]. Given an algorithm that converges to $Q^*$, one of its iterates $\bar{Q}$ can be used as a proxy for $Q^*$, and the ordinary Q-learning updates are then recentered by the quantity $\hat{\mathcal{T}}_k(\bar{Q}) - \mathcal{T}(\bar{Q})$, where $\hat{\mathcal{T}}_k$ is an empirical Bellman operator and $\mathcal{T}$ is the population Bellman operator, which is not computable but can be replaced by an unbiased approximation. This algorithm is shown to converge to the optimum of the expected return and enjoys minimax optimality up to a logarithmic factor.
Novel proposed algorithm: RA2Q [6]: Since RAQL converges to the optimum of a risk-averse objective function, we can use it as a building block to design novel risk-averse RL algorithms. The idea of training multiple Q tables in parallel can be integrated with the utility-function technique: we train $k$ Q tables in parallel using the RAQL update rule, estimate the variance via the sample variance of those $k$ tables to select more stable actions, and then compute a risk-averse $\hat{Q}$ table and select actions according to it. We name this algorithm RA2Q; it preserves the convergence property of RAQL.
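A toy sketch of the action-selection idea: combine $k$ parallel Q tables into a risk-averse estimate by penalizing the sample variance across tables. The penalized-mean form and the names below are illustrative; the actual RA2Q update follows RAQL's utility rule.

```python
import numpy as np

def risk_averse_q(q_tables, beta=0.5):
    """Combine k parallel Q tables into a single risk-averse table:
    the mean estimate penalized by the sample variance across tables."""
    q = np.asarray(q_tables)                       # (k, |S|, |A|)
    return q.mean(axis=0) - beta * q.var(axis=0, ddof=1)

# k=3 Q tables for a single state with two actions; action 1 has the
# higher mean but wildly disagreeing estimates (high variance)
qs = [np.array([[1.0, 2.0]]), np.array([[1.2, 0.0]]), np.array([[0.8, 4.0]])]
q_hat = risk_averse_q(qs)
action = int(q_hat[0].argmax())
```

In this toy example the higher-mean but high-variance action is rejected in favour of the stable one, which is exactly the behaviour we want from a risk-averse trader.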
Novel proposed algorithm: RA2.1Q [6]: We can also combine the 'recenter' technique of VQL with the utility-function technique in a novel algorithm, RA2.1Q. For each empirical Bellman operator $\hat{\mathcal{T}}_k$, we apply a risk-averse utility function to the TD error. Although we cannot show a convergence guarantee for RA2.1Q, empirically it obtained better results than RA2Q and RAQL in a multi-agent evaluation.
What is robustness? We usually say an algorithm is robust if it is stable under different challenging scenarios. Recent works have improved the robustness of algorithms with adversarial learning by assuming two opposing learning processes: one that aims to disturb the most, and another that tries to control the perturbations [12].
Risk-Averse Robust Adversarial Reinforcement Learning (RARL): The same concept has been adapted to neural networks in the context of deep RL [14], and RARL [13] in particular extended this idea by combining it with Averaged-DQN. RARL trains two agents, a protagonist and an adversary, in parallel; their goals are, respectively, to maximize/minimize the expected return and to minimize/maximize its variance. RARL showed good experimental results, enhancing stability and robustness, without providing theoretical guarantees.
Novel proposed algorithm: RA3Q [6]: The idea of having a protagonist and an adversary in the same environment lends itself to multi-agent learning algorithms. In this context, Nash Q-learning [8] is a well-known multi-agent algorithm that can obtain the optimal strategy when there exists a unique Nash equilibrium in general-sum stochastic games. Our last proposal takes inspiration from multi-agent learning algorithms and adversarial RL. To achieve a robust risk-averse agent, we combine the idea of adversarial learning with RA2Q. We assume two opposing learning processes: a protagonist aims to maximize the expected reward and minimize the variance, while an adversary aims to disturb the protagonist by minimizing the expected reward and maximizing the variance. We name this adversarial learning algorithm RA3Q; although it has no convergence guarantee, empirically RA3Q shows better robustness than RA2Q.
How do we measure the superiority of RL agents in trading markets? We use game theory and treat each agent as a player in a stochastic game. In empirical game theory, a meta-game payoff table can be seen as a combination of two matrices $(N, R)$. Each row $N_i$ contains a discrete distribution of $p$ players over $k$ strategies, yielding a discrete profile $(n_{\pi_1}, ..., n_{\pi_k})$ that indicates exactly how many players play each strategy, with $\sum_{j}n_{\pi_j} = p$; the corresponding strategy profile is $\mathbf{u} = \left(\frac{n_{\pi_1}}{p}, ..., \frac{n_{\pi_k}}{p}\right)$. Each row $R_i$ captures the rewards corresponding to the rows in $N$. Once we have a meta-game payoff table, to view the dominance of different strategies one can plot a directional field of the payoff table, where arrows in the strategy space indicate the direction of flow of the population composition over the strategies [17].
In our first experiment with the open-sourced ABIDES [5] market simulator, our setting consisted of one non-learning agent that replays the market deterministically [3] and several learning agents. The learning agents considered are RAQL, RA2Q and RA2.1Q. The measure we use is the Sharpe ratio, a commonly used risk-aware measure in financial markets. The results are shown in the figure below.
Our second experiment tested robustness. As a first step, we trained RA2Q and RA3Q agents under the same conditions. Then, in the testing phase, we added two types of perturbation: an adversarial agent (trained within RA3Q), or noise (a.k.a. zero-intelligence) agents in the environment. In both cases, the agents act in a perturbed environment. The results presented in Table 1 show that RA3Q obtained better results than RA2Q, highlighting its robustness.
| Algorithm | Adversarial Perturbation | ZI Agents Perturbation |
|---|---|---|
| RA2Q | 0.5269 | 0.9538 |
| RA3Q | 0.9347 | 1.0692 |
We have argued that risk-awareness, variance reduction and robustness are relevant characteristics for RL agents, since they can serve as building blocks for constructing new algorithms. For example, by using utility functions, parallel training of Q tables, and adversarial learning, different algorithms can be constructed, as shown in Fig. 2.
Table 2 presents a summary of the properties of the algorithms mentioned in this post; those in bold are our novel algorithms [6].
| Algorithm | Risk-awareness | Variance reduction | Robustness |
|---|---|---|---|
| RAQL | ● | | |
| Averaged DQN | | ● | |
| VQL | | ● | |
| RARL | ● | | ● |
| **RA2Q** | ● | ● | |
| **RA2.1Q** | ● | ● | |
| **RA3Q** | ● | ● | ● |
^{[1] } Oron Anschel, Nir Baram, and Nahum Shimkin. Averaged-DQN: Variance reduction and stabilization for deep reinforcement learning. In International Conference on Machine Learning, pages 176–185. PMLR, 2017.
^{[2] } Mojtaba Bagherzadeh, Nafiseh Kahani, and Lionel Briand. Reinforcement learning for test case prioritization. arXiv preprint arXiv:2011.01834, 2020.
^{[3] } Tucker Hybinette Balch, Mahmoud Mahfouz, Joshua Lockhart, Maria Hybinette, and David Byrd. How to evaluate trading strategies: Single agent market replay or multiple agent interactive simulation? arXiv preprint arXiv:1906.12010, 2019.
^{[4] } Marc G Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C Machado, Subhodeep Moitra, Sameera S Ponda, and Ziyu Wang. Autonomous navigation of stratospheric balloons using reinforcement learning. Nature, 588(7836):77–82, 2020.
^{[5] } David Byrd, Maria Hybinette, and Tucker Hybinette Balch. ABIDES: Towards high-fidelity market simulation for AI research. arXiv preprint arXiv:1904.12066, 2019.
^{[6] } Yue Gao, Kry Yik Chau Lui, and Pablo Hernandez-Leal. Robust risk-sensitive reinforcement learning agents for trading markets. In Reinforcement Learning for Real Life (RL4RealLife) Workshop at ICML, 2021.
^{[7] } Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
^{[8] } Junling Hu and Michael P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML ’98, pages 242–250, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
^{[9] } Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26. Curran Associates, Inc., 2013.
^{[10] } Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
^{[11] } Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Sungmin Bae, et al. Chip placement with deep reinforcement learning. arXiv preprint arXiv:2004.10746, 2020.
^{[12] } Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural Computation, 17(2):335–359, 2005.
^{[13] } Xinlei Pan, Daniel Seita, Yang Gao, and John Canny. Risk averse robust adversarial reinforcement learning. In 2019 International Conference on Robotics and Automation (ICRA), pages 8522–8528. IEEE, 2019.
^{[14] } Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826. PMLR, 2017.
^{[15] } William F Sharpe. The Sharpe ratio. Journal of Portfolio Management, 21(1):49–58, 1994.
^{[16] } Yun Shen, Michael J Tobia, Tobias Sommer, and Klaus Obermayer. Risk-sensitive reinforcement learning. Neural Computation, 26(7):1298–1328, 2014.
^{[17] } Karl Tuyls, Julien Perolat, Marc Lanctot, Edward Hughes, Richard Everett, Joel Z Leibo, Csaba Szepesvári, and Thore Graepel. Bounds and dynamics for empirical game theoretic analysis. Autonomous Agents and Multi-Agent Systems, 34(1):1–30, 2020.
^{[18] } Martin J Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.
Nassim Abdi (NA):
Businesses have been talking about diversity and inclusion for a long time. Unfortunately, most of the time, they were doing little more than completing a list of ‘checkbox’ items. More recently, however, we have started to see companies thinking more clearly about the business case behind diversity and inclusion. Businesses are starting to realize that they are losing high quality talent and slowing their rate of innovation and creativity simply because they do not have diverse voices at the table. You can’t create the most innovative tool if you aren’t looking at it from all angles and from different perspectives.
The data proves it. Recent research shows that companies with an inclusive workplace enjoy a 42% increase in collaboration. Those with diverse teams report a 1.7x increase in innovation. And there is emerging data suggesting that a lack of diversity leads to a US$8 billion loss in productivity. Those are numbers businesses can’t simply ignore.
(NA):
I think Microsoft’s experience with Tay demonstrated that AI is only as good as the data you put into it. Recent studies suggest that facial recognition tends to be much, much more accurate when it comes to white males than other groups – particularly women of colour – largely because there is much more test data for white males than other populations. So I would argue that bias is already in the system and in the data before we even start.
That is why it is so critically important that AI developers and researchers pay attention to this issue. You must be able to bring the diversity of views and the empathetic mindset that only come from understanding other people’s perspectives and walking in their shoes. Otherwise, you are allowing your decisions to be influenced by implicit biases.
(NA):
It really comes down to the relationship between bias and privilege. As developers, you have the privilege of being the decision-makers – you are the ones who control whether you are creating something that will ultimately be inclusive or exclusive. And that influences the ultimate structure of power. It’s not easy. But it starts with understanding that privilege and really checking our biases.
Machine learning models love binary decisions. Yet diversity does not lend itself well to binary thinking. How can we tackle this idea of ‘intersectionality’ as we think about our models?
(NA):
That is one of the big challenges when it comes to how we use and define technology. For now, it really comes down to ensuring you have real diversity and inclusion in your teams, and that you define standards that are more aligned to the human world we actually live in. That means that the governance of these platforms is going to become much more important.
(NA):
The first thing is just being aware that we all have implicit bias. It’s human nature; for millions of years, humans were trained to avoid things that were unfamiliar. So the big question is how we can avoid it, particularly in a world of social media echo chambers. The real challenge here is to help people walk in someone else’s shoes – to really start to understand their situation and the world from their perspective.
(NA):
The first step for executives and business leaders is to be willing to address the problem. It isn’t always easy to change the status quo. The good news is that the new generation of workers is really starting to change the conversation. They care about where they work and the vision of the organization. They want to see the bigger picture and they want to have a positive impact.
One of the more successful approaches that businesses are adopting is ‘reverse mentorship’. As a practice, it’s been around for a while. But now we are seeing a lot of success from companies using reverse mentorship to create safe spaces for conversations about diversity, equity and inclusion.
(NA):
I believe the key is in helping people walk in other people’s shoes and get a real understanding of their perspectives and lived experiences. And that’s really the foundation of the company I founded – StoryBolt. Simply put, we use the power of storytelling to help organizations unpack implicit bias, gender equality, mental health and more.
As a teacher, I realized very quickly that people learn better from visual stories. Different parts of the brain start working; language comprehension, visual cues and sounds all fire up the neurons. And it creates an experience that can stay with us for the rest of our lives. We don’t just show smart documentaries that unpack an issue; we also invite the filmmaker into the room for a Q&A and a discussion of the issues. It’s amazing how that sparks a kind of awareness in people that simply does not go away.
Nassim Abdi, Ph.D., is the CEO and Co-founder of StoryBolt. She is a storyteller and evangelist for finding the intersection of entertainment and learning in the area of diversity, equity, and inclusion. She has 12 years of academic experience in the field of intersectionalities of gender, race, and other identities as they relate to systems of discrimination or disadvantage. Nassim is also the lead actress of the Netflix-featured film Secret Ballot (Sony Pictures). Her vision for StoryBolt was shaped by the life-changing experience of the film, as it engaged her in Q&A sessions and exposed her to the power of movies and how candid human connections can change perspectives and facilitate courageous conversations in the workplace.
If you’ve got an idea and the passion to explore what you can do with it, we want to help you solve it.
Let’s SOLVE it is a new Borealis AI mentorship program that aims to help undergraduate students use AI to make a difference and solve real problems in their communities. You bring the idea and the team; we’ll provide the industry exposure, mentorship, contacts and training you need to make the project a reality.
We’re looking for teams of 3 to 5 undergraduate students (enrolled at Canadian universities) with ideas on how AI and ML could be used to tackle a specific community problem. With this program, we hope to reach as many communities as possible, so we’re encouraging teams from across the country and from every walk of life to apply.
The mentorship program is free and will be conducted virtually. It runs from October 1st to December 2nd, so it falls within the school semester, and you’d need to allocate about 10 hours per week during that time.
In return, you’ll get all the support you need to turn your idea into a proof-of-concept implementation. You’ll gain valuable experience, skills and ecosystem contacts that will help you launch your career in AI. Think of it as a ‘fast-track’ accelerator for your idea, your skills and your capabilities.
This program is open to undergraduate students at all Canadian universities. You don’t need to be enrolled in a Computer Science program – each team member should have some basic programming knowledge, but specific experience with AI or ML isn’t necessary. What you do need, however, is a strong sense of curiosity, a passion for solving problems using AI and a burning desire to accelerate your personal development.
Let’s SOLVE it is one of a handful of initiatives that Borealis AI and the Royal Bank support in order to encourage diversity, skills development and innovation at Canadian universities. Along with initiatives like our Borealis AI Fellowships program (aimed at postgraduate students) and our Internships program (focused on Master’s-level students), our goal is to help nurture the AI leaders of tomorrow.
If you are an undergraduate student with dreams of solving real problems in your community using AI, we want to help you get there. We look forward to seeing your team’s application!