Introduction

At Borealis AI, a research institute backed by RBC, we conduct machine learning (ML) research to bring state-of-the-art practices to our financial use cases, all the while adhering to strict data governance and risk management protocols. Our data and engineering teams are constantly looking to improve the iteration speed for machine learning workflows and optimize our processes.

This commitment to excellence in data handling extends to our choice of tools. Among the AI vendors we work with is ClickHouse, a fast, open-source, column-oriented database management system that enables real-time analytical data reporting via SQL queries [I]. In this blog post, we’ll explore how Borealis AI has integrated ClickHouse into our operations and what drove us to adopt it.

The role of ClickHouse at Borealis AI

Figure 1 below shows the data and compute ecosystem at RBC and Borealis AI. With data residing across multiple sources and exceeding hundreds of gigabytes, our teams face the challenge of building data sets that advance our learning objectives.

This process is inherently interactive, iterative, and compute intensive. Any improvements to the speed of iteration directly contribute to faster delivery of results.

Figure 1: The data and compute ecosystem at RBC and Borealis AI.

The following three key areas highlight the challenges within our ecosystem and how ClickHouse helps address them:

  1. The Extract-Load-Transform (ELT) Challenge:
    • Challenge: Our data tends to exist in many disparate data systems, which slows down data exploration across data sources.
    • Solution: Our data engineers ingest data from relevant sources into an on-premises ClickHouse instance, employing ClickHouse’s MergeTree Engine. This gives research teams the power to explore and transform cross-source data rapidly.
  2. Data Exploration Challenge:
    • Challenge: Many teams require the ability to conduct ad-hoc analysis to move projects forward. This often involves running iterative queries with quick feedback loops (under 10 seconds). In-memory tools like Pandas or Polars are effective for small data sets, but as the volume grows, the feedback loop becomes sluggish without a more optimized execution engine. For context, our largest tables consist of hundreds of billions of rows and tens of terabytes of data.
    • Solution: ClickHouse allows hundreds of users to process large amounts of data concurrently.
  3. Data Reconciliation Challenge:
    • Challenge: The Enterprise Data Lake (EDL) is sourced from raw data, whereas the Enterprise Data Warehouse (EDW) exposes data as views, in some cases with business logic applied. Teams must therefore reconcile the two to ensure we are using the same logic that the EDW applies.
    • Solution: We need to redefine data as a service, complete with Service Level Agreements (SLAs) for metadata quality, update frequency, data type accuracy, and more. This requires a responsive platform support network capable of addressing user needs daily, not weekly. In addition, ClickHouse’s ability to aggregate and reconcile data performantly at scale has helped Borealis AI surface inconsistencies between the data lake and the enterprise data warehouse.
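
To make the reconciliation idea concrete, here is a minimal sketch in plain Python of the aggregate-and-compare pattern described above. It is an illustration under stated assumptions, not our production pipeline: the column names (account_id, amount_cents) and the sample rows are hypothetical, and in practice the grouping would run inside ClickHouse as a GROUP BY on each source, followed by a join on the grouping key.

```python
from collections import defaultdict


def aggregate(rows, key, value):
    """Group rows by `key` and return {key: (row_count, sum_of_value)}."""
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        bucket = totals[row[key]]
        bucket[0] += 1
        bucket[1] += row[value]
    return {k: tuple(v) for k, v in totals.items()}


def reconcile(lake_rows, warehouse_rows, key, value):
    """Return the keys whose aggregates differ between the two sources.

    A key missing from one side is reported with None for that side,
    similar to what a full outer join on the aggregates would show.
    """
    lake = aggregate(lake_rows, key, value)
    warehouse = aggregate(warehouse_rows, key, value)
    mismatches = {}
    for k in sorted(lake.keys() | warehouse.keys()):
        if lake.get(k) != warehouse.get(k):
            mismatches[k] = {"lake": lake.get(k), "warehouse": warehouse.get(k)}
    return mismatches


# Hypothetical sample data; amounts are integer cents to avoid float noise.
lake_rows = [
    {"account_id": "A", "amount_cents": 1000},
    {"account_id": "A", "amount_cents": 250},
    {"account_id": "B", "amount_cents": 500},
    {"account_id": "C", "amount_cents": 75},
]
warehouse_rows = [
    {"account_id": "A", "amount_cents": 1000},
    {"account_id": "A", "amount_cents": 200},  # value altered by view logic
    {"account_id": "B", "amount_cents": 500},
    # account "C" is filtered out by the (assumed) view logic
]

mismatches = reconcile(lake_rows, warehouse_rows, "account_id", "amount_cents")
for account, sides in mismatches.items():
    print(account, sides)
```

Comparing row counts alongside sums helps separate two kinds of discrepancy: rows that a view filters out, and values that a view transforms.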

Why we chose ClickHouse

At RBC, data resides across multiple sources, and enterprise-wide systems like Teradata and Spark-based solutions have performed adequately for relatively stable, non-iterative solutions that require production-grade SLAs. However, many of Borealis AI’s workloads are iterative, involving repeated cycles of training and testing. These processes require handling hundreds of gigabytes of feature transformations interactively, making speed a critical factor. Public benchmarks [I] show ClickHouse to be faster than many well-known databases, and our own experience has borne this out.

Impact and improvements

Since adopting ClickHouse, we’ve been able to iterate on ML workloads 5-10x faster than with Spark or Teradata-based solutions, thanks to the faster compute performance and the ability to consolidate data in one location. User feedback has been overwhelmingly positive, particularly among researchers, who appreciate ClickHouse’s quality-of-life features such as time series functions, ASOF joins, and parameterized views.
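
For readers unfamiliar with ASOF joins, the sketch below illustrates the semantics in plain Python: each row on the left is matched to the most recent row on the right that shares its key and has a timestamp at or before its own. This only shows the idea and is not how ClickHouse implements it. The trades and quotes tables, the symbol names, and the integer timestamps are all hypothetical. In ClickHouse itself this would be written with ASOF JOIN and an inequality condition on the timestamp column. Unlike a plain ASOF JOIN, the sketch keeps unmatched left rows, as a left join would.

```python
from bisect import bisect_right
from collections import defaultdict


def asof_join(left, right, key, ts):
    """Pair each left row with the latest right row that has the same `key`
    and a timestamp <= the left row's timestamp (None if no such row)."""
    by_key = defaultdict(list)
    for row in right:
        by_key[row[key]].append(row)

    # Sort each key's rows by time and keep a parallel list of timestamps
    # so that bisect can search them.
    times = {}
    for k, rows in by_key.items():
        rows.sort(key=lambda r: r[ts])
        times[k] = [r[ts] for r in rows]

    joined = []
    for row in left:
        candidates = by_key.get(row[key], [])
        idx = bisect_right(times.get(row[key], []), row[ts])
        match = candidates[idx - 1] if idx > 0 else None
        joined.append((row, match))
    return joined


# Hypothetical sample data with integer timestamps.
trades = [
    {"symbol": "AAA", "ts": 10, "qty": 100},
    {"symbol": "AAA", "ts": 25, "qty": 50},
    {"symbol": "BBB", "ts": 5, "qty": 10},
]
quotes = [
    {"symbol": "AAA", "ts": 8, "price": 130},
    {"symbol": "AAA", "ts": 20, "price": 131},
    {"symbol": "AAA", "ts": 30, "price": 132},
    {"symbol": "BBB", "ts": 7, "price": 80},
]

for trade, quote in asof_join(trades, quotes, "symbol", "ts"):
    print(trade, "->", quote)
```

This kind of join is common when preparing time series features, for example attaching the latest known value of a signal to each event without leaking future information.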

The adoption of AI vendors such as ClickHouse is part of our digital transformation journey, enabling us to leverage advanced data analytics and machine learning capabilities to drive innovation and efficiency.

If you are interested in learning more about how Borealis AI uses ClickHouse, our engineers will be delivering an in-depth talk on September 10, 2024 at Shopify in Toronto. To learn more and register, visit: https://www.meetup.com/clickhouse-toronto-user-group/events/301490855.

References

[I] ClickHouse. (2024). Real-time data analytics platform. https://clickhouse.com/clickhouse

[I] ClickBench – a benchmark for analytical DBMS. (2024). https://benchmark.clickhouse.com/