RBC, NVIDIA and Red Hat today announced a new Borealis AI infrastructure to support machine learning research and development at RBC. Our team of over 100 engineers and researchers now has access to state-of-the-art computing resources that significantly speed up our work in domains such as natural language processing, reinforcement learning, time series modeling and AI ethics.

High Performance Computing (HPC) infrastructure, with a distributed and fully automated environment, is essential when building modern AI models, especially when the research is applied in production environments like RBC’s, where datasets can be massive (roughly 10 billion new client interactions every month).

Our objective was to build an AI infrastructure that could handle both research and production workloads, ensuring that Borealis AI’s research projects could transition to production efficiently. We believe in quick iterations, so this infrastructure is designed to be flexible and easy to use. It comprises two GPU clusters to accommodate the distinct needs of Borealis AI’s research and production work.

Across the research community, a growing number of HPC clusters use Slurm, a resource scheduler and cluster management tool. AI researchers are already familiar with this technology, so we adopted it at Borealis AI to make the platform easier to use and to shorten the learning curve for new users. Our researchers coming from academia can now onboard quickly onto our platform and start their research.
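For readers who have not used Slurm, here is a minimal sketch of what a GPU job submission can look like. It is illustrative only and does not describe Borealis AI's actual configuration: the helper `render_sbatch_script`, the job name, and the `train.py` command are hypothetical, while the `#SBATCH` directives shown are standard Slurm options.

```python
def render_sbatch_script(job_name, command, gpus=1, cpus=4, time_limit="04:00:00"):
    """Return the text of a Slurm batch script for a single GPU job.

    All argument values are illustrative assumptions, not site settings.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",          # request GPUs as a generic resource
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --time={time_limit}",        # wall-clock limit, HH:MM:SS
        f"#SBATCH --output={job_name}-%j.out",  # %j expands to the job ID
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical training command; write the result to a file and pass it to `sbatch`.
script = render_sbatch_script("train-demo", "python train.py --epochs 10", gpus=2)
print(script)
```

A script like this is typically saved to disk and submitted with `sbatch`, after which the scheduler queues it until the requested GPUs are free.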

Building a powerful cluster is more than just stacking together GPUs. An AI cluster requires every component, including networking and storage, to operate in harmony and at high performance. With the AI community moving towards training larger models, an integrated system has become more important than piling up servers. We built our cluster using AIRI, based on NVIDIA’s reference architecture, which provided us with a high-performance integrated solution and the flexibility to increase capacity efficiently.

Taking a machine learning model into production is not a trivial task. These applications need to handle complexities, such as data reliability and stochasticity, that are absent from traditional software development. To manage this complexity, we designed a compute infrastructure based on industry standards and best practices. The emergence of Docker and Kubernetes has changed the way we build AI infrastructure. Drawing on RBC’s vast expertise in managing the Red Hat OpenShift platform, Borealis AI built its production cluster on OpenShift, allowing developers to deploy containerized ML applications and services into production using GPUs.
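As a rough illustration of what deploying a containerized workload with a GPU involves, the sketch below builds a Kubernetes Pod manifest as a Python dictionary. It is not Borealis AI's deployment code: the helper `gpu_pod_manifest`, the pod name, and the image URL are hypothetical, and the `nvidia.com/gpu` resource name assumes the cluster exposes GPUs through the NVIDIA device plugin.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Assumption: GPUs are advertised as the extended
                    # resource "nvidia.com/gpu" by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Placeholder name and image; Kubernetes accepts manifests in JSON as well as YAML.
manifest = gpu_pod_manifest("scoring-demo", "registry.example.com/ml/scoring:latest")
print(json.dumps(manifest, indent=2))
```

The scheduler then places the pod only on a node with an unallocated GPU, which is what lets GPU-backed services share a cluster with other workloads.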

Borealis AI is leveraging the power of this new infrastructure across a broad spectrum of projects, ranging from personal & commercial banking to wealth management and capital markets. Prediction tasks in the finance industry are particularly challenging because they are driven by massive datasets and require exhaustive analysis along multiple dependent axes, including data filtering, neural architecture search, hyperparameter optimization, dynamic targets, and path-dependent metrics. A thorough exploration of the resulting joint parameter space typically requires optimizing tens of thousands of configurations, or the equivalent of thousands of CPU years.

Our new HPC infrastructure, with a distributed and fully automated environment, enables parallel execution of the above tasks in a matter of days. In production, this infrastructure gives us parallel online computation of complex feature representations and, as a consequence, ultra-fast reactions in an environment that is dominated by one factor: speed!
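To make the scale of such a search concrete, here is a small sketch of how a joint parameter space multiplies out and how its configurations can be split across parallel workers. The search space, its values, and the helper names are invented for illustration and are far smaller than the searches described above.

```python
from itertools import product

# Hypothetical search axes; real searches would have more axes and values.
search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "hidden_units": [64, 128, 256, 512],
    "num_layers": [2, 4, 6, 8],
    "lookback_window": [5, 10, 20, 40, 80],
    "filter_threshold": [0.0, 0.1, 0.2, 0.3, 0.4],
}


def enumerate_configs(space):
    """Return every combination of values in the space as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def shard(configs, num_workers, worker_id):
    """Return the configurations assigned to one worker (round-robin split)."""
    return configs[worker_id::num_workers]


configs = enumerate_configs(search_space)
print(len(configs))                  # 4 * 4 * 4 * 5 * 5 = 1600
print(len(shard(configs, 100, 0)))   # 16 configurations for worker 0 of 100
```

Even five modest axes yield 1,600 configurations, so adding axes or values quickly reaches the tens of thousands; because each configuration can be evaluated independently, the work divides cleanly across however many workers the cluster can provide.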