Managing data throughout the machine learning product lifecycle
Data has been a catalyst for innovation for centuries. When Herman Hollerith, a former employee of the US Census Office, invented his tabulating machine in the late 1880s (1), he was not only solving a massive resource challenge (in those days, it took the Census Office more than seven years to do their tabulations), he was launching an era of data-driven innovation.
More than a century later, data remains at the very heart of innovation. Today’s data systems are far more sophisticated than Hollerith’s basic counting machine. Modern-day systems are driven by technologies such as machine learning (ML) that are designed to uncover new insights and value from data.
Most businesses around the world now recognize that data is the foundation for future competitive advantage. But there's so much hype around data being the ‘new oil’ that business leaders often miss the nuances of managing data in a way that can actually create value from algorithms.
The adage ‘garbage in, garbage out’ is as true for basic analytics as it is for ML. As teams build ML products, it's important they focus not only on the aptness of the algorithm, but also on the quality of the data throughout the development lifecycle. At Borealis AI, we often say that data must be a first-class citizen.
Understanding the limitations, requirements and complexities of dealing with data is critical for ML product managers, given the cost, time and resource requirements involved.
In this article, we walk through each step of the typical ML project (as outlined in this article by Sam Chow) to reflect on the nuances of managing data in the ML product lifecycle. We start by exploring the properties of real data and reviewing the more common data design patterns of data-driven systems.
Properties of real data
Real data is messy. It’s inconsistent. And it’s poorly organized. In large part, that is because most data is collected to serve a distinct application and is not necessarily intended for future data-driven analytics.
Consider, for example, how a simple concept such as counting ‘customer orders’ might result in different data across various company functions. For a sales department, the data would represent the number of contracts signed, regardless of whether or not the invoices have been paid. For an accounting function, the data would reflect the number of paid sales, whether or not the product has been shipped. Supply chain, on the other hand, is concerned with the number of customer orders delivered. A simple query therefore – something like number_of_orders_per_day – could result in a number of different data points, adding to the uncertain nature of ML systems.
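To make the ambiguity concrete, the sketch below counts the same three hypothetical orders under each department's definition. The `Order` record, its field names, and the sample dates are assumptions made for illustration; only the query name `number_of_orders_per_day` comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Hypothetical order record; the field names are illustrative only."""
    order_id: str
    signed_on: date                      # contract signed (sales view)
    paid_on: Optional[date] = None       # invoice paid (accounting view)
    delivered_on: Optional[date] = None  # goods delivered (supply chain view)


def number_of_orders_per_day(orders, date_field):
    """Count orders per day, using date_field as the definition of 'an order'."""
    counts = Counter()
    for order in orders:
        day = getattr(order, date_field)
        if day is not None:  # the event has not happened yet for this order
            counts[day] += 1
    return dict(counts)


ORDERS = [
    Order("A1", date(2021, 3, 1), paid_on=date(2021, 3, 1), delivered_on=date(2021, 3, 2)),
    Order("A2", date(2021, 3, 1), paid_on=date(2021, 3, 2)),
    Order("A3", date(2021, 3, 1)),
]

sales_view = number_of_orders_per_day(ORDERS, "signed_on")
accounting_view = number_of_orders_per_day(ORDERS, "paid_on")
supply_chain_view = number_of_orders_per_day(ORDERS, "delivered_on")

march_1 = date(2021, 3, 1)
print(sales_view.get(march_1, 0),
      accounting_view.get(march_1, 0),
      supply_chain_view.get(march_1, 0))
```

With this sample, March 1 shows three orders for sales, one for accounting and none for supply chain, even though every function is answering the same question.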
Part of the challenge is that ML algorithms are generally downstream consumers of real data. Data is typically stored for one purpose, and then reformatted for an ML algorithm. Even in applications designed to feed data to ML models, the data often has a transactional role beyond its potential use in ML models.
For example, Twitter stores tweets and social interactions in a format that keeps its service responsive and interesting to its users. Behind the scenes, the company then transforms this data into a format appropriate for its Who-To-Follow recommendation system.
The process of converting data is known as the Extract-Transform-Load (ETL) process. At its most basic, ETL is used to extract the right data from one (or more) system(s), transform it into a format usable by another system, and then load it into that other system.
While it may seem like a fairly simple process, it is widely recognized to be the most expensive and time-consuming step in the development of an ML system. It is also critically important, as it ends up influencing the models’ success, fairness, scalability, and reliability (2).
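The three ETL steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production pipeline: the CSV export, the column names, and the `orders` table are all hypothetical, and an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical export from a transactional system (assumed layout).
SOURCE_EXPORT = """order_id,customer,amount,ordered_at
1001,Alice ,19.99,2021-03-01
1002,bob,5.00,2021-03-01
1003,alice,12.50,2021-03-02
"""


def extract(raw_text):
    """Extract: read rows out of the source system's export."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: convert types and normalize values for the target system."""
    return [
        (
            int(row["order_id"]),
            row["customer"].strip().lower(),    # unify customer spelling
            round(float(row["amount"]) * 100),  # store money as integer cents
            row["ordered_at"],
        )
        for row in rows
    ]


def load(records, connection):
    """Load: write the transformed records into the target system."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, "
        "amount_cents INTEGER, ordered_at TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_EXPORT)), connection)
row_count = connection.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(row_count)
```

Even in this toy version, most of the decisions sit in the transform step (how to spell a customer, how to represent money), which hints at why the real process is so costly.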
Data pipeline for machine learning products
Given the costs and importance of gathering the right data, it is critical that there are enough controls to ensure it is done correctly and that the limitations of the data are well understood. Figure 1 shows a simple heuristic to break down data preparation into three stages: raw data, clean data, and aggregate data.
Raw data is the dirtiest (or at least has the greatest potential to be dirty). It's the original data collected through a data-generating business process – most often from business transactions – with all its potential quality issues (duplication, transcription errors, etc.).
Clean data is a verified and unified version of the raw data. The data types and business definitions of the data are known and documented, making them legible to ML scientists and engineers. However, clean data does not include new information from data aggregation or embedded representations (such as vectorized, mathematical representations of latent properties in the data that can be used as inputs to a machine learning algorithm).
Aggregate data includes any derived representation of the clean data. For example, it may include total_number_orders for each day or month. It may also include text or customer transaction embeddings, which are often used for deep learning applications.
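A minimal sketch of the three stages, assuming a hypothetical list of raw order records, might look like the following. The record layout, the cleaning rules (drop duplicates, reject unparseable values) and the function names are illustrative assumptions; only `total_number_orders` per day comes from the text above.

```python
from collections import defaultdict
from datetime import date, datetime

# Raw data: as collected, including a duplicate and a transcription error.
RAW_ORDERS = [
    {"order_id": "1", "ordered_at": "2021-03-01", "amount": "19.99"},
    {"order_id": "1", "ordered_at": "2021-03-01", "amount": "19.99"},  # duplicate
    {"order_id": "2", "ordered_at": "2021-03-01", "amount": "5"},
    {"order_id": "3", "ordered_at": "01/03/2021", "amount": "7.25"},   # wrong date format
    {"order_id": "4", "ordered_at": "2021-03-02", "amount": "12.50"},
]


def clean(raw_records):
    """Return (clean_records, rejected_records) with verified, unified types."""
    cleaned, rejected, seen_ids = [], [], set()
    for record in raw_records:
        if record["order_id"] in seen_ids:
            continue  # drop exact re-submissions of an order already accepted
        try:
            row = {
                "order_id": int(record["order_id"]),
                "ordered_at": datetime.strptime(record["ordered_at"], "%Y-%m-%d").date(),
                "amount": float(record["amount"]),
            }
        except ValueError:
            rejected.append(record)  # keep for review instead of guessing a fix
            continue
        seen_ids.add(record["order_id"])
        cleaned.append(row)
    return cleaned, rejected


def aggregate(clean_records):
    """Aggregate data: total_number_orders for each day."""
    totals = defaultdict(int)
    for record in clean_records:
        totals[record["ordered_at"]] += 1
    return dict(totals)


clean_orders, rejected_orders = clean(RAW_ORDERS)
total_number_orders = aggregate(clean_orders)
print(total_number_orders)
```

Rejected records are returned rather than silently discarded, so that the limitations of the data stay visible to whoever consumes the clean set.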
A data scientist might say that clean and aggregate data together are the main components of the feature store used as inputs to an ML algorithm. For those without a data science background, a ‘feature’ is a piece of information that the system uses to describe a data entry (so, for example, a customer can be viewed as a group of features including name, date of birth, residency, and so on).
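As a rough illustration of how clean and aggregate data combine into features, the sketch below builds one feature row for a customer. The two lookup tables, the customer identifier, and the choice to derive an age from the date of birth are assumptions for this example, not a description of any particular feature store.

```python
from datetime import date

# Clean data: verified customer attributes (hypothetical values).
CLEAN_CUSTOMERS = {
    "c-001": {"name": "Ada", "date_of_birth": date(1985, 6, 1), "residency": "ON"},
}

# Aggregate data: values derived from clean order records (hypothetical values).
AGGREGATE_ORDERS = {
    "c-001": {"total_number_orders": 3},
}


def age_in_years(date_of_birth, as_of):
    """Whole years between date_of_birth and as_of."""
    had_birthday = (as_of.month, as_of.day) >= (date_of_birth.month, date_of_birth.day)
    return as_of.year - date_of_birth.year - (0 if had_birthday else 1)


def build_feature_row(customer_id, as_of):
    """Join clean and aggregate data into a single row of model inputs."""
    customer = CLEAN_CUSTOMERS[customer_id]
    aggregates = AGGREGATE_ORDERS.get(customer_id, {})
    return {
        "customer_id": customer_id,
        "age_years": age_in_years(customer["date_of_birth"], as_of),
        "residency": customer["residency"],
        "total_number_orders": aggregates.get("total_number_orders", 0),
    }


feature_row = build_feature_row("c-001", as_of=date(2021, 3, 12))
print(feature_row)
```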
Data sits at the heart of the product management process
As teams advance through the ML product development lifecycle (as aptly described by Sam Chow in this introductory post), it is critical that data be a first-class citizen that is carefully considered at each stage.
Discovery
There's no machine learning without data. During the discovery stage, when exploring business problems and impact with executive stakeholders, it's important that teams also connect with data stewards to understand what data is available as input to the algorithmic system that will solve the problem. The following aspects of the data should be identified:
- What does the process that generates the data for the machine learning system look like?
- What data sets will be used to solve the problem?
- Where is the data stored and how difficult is it to access?
- How is the system going to consume the data and what results will it provide to the rest of the environment (Data Input/Output format)?
Feasibility
During the feasibility stage, it is important to zoom out to reassess project assumptions, review data discovery activities and think through any new data sources that may need to be added. This is the time to mitigate any potential risk from a data perspective. Here is what you can do at this stage:
- Access the data identified in the discovery phase and add new data sources as necessary.
- Conduct data discovery and data cleaning activities for a data sample.
- Create the required aggregate data and train a baseline model using sample data to identify issues in data sources or the data cleaning process.
- Verify that the model's output format matches the format expected by the receiving application, regardless of model accuracy.
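The last two steps can be sketched together: train a trivial baseline on a data sample, then check the shape of its output against what the receiving application expects. The schema, the field names, and the choice of a mean predictor as the baseline are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical contract agreed with the receiving application.
EXPECTED_OUTPUT_SCHEMA = {"customer_id": str, "predicted_orders": float}


def train_baseline(sample_targets):
    """Baseline model: always predict the average of the sample targets."""
    average = float(mean(sample_targets))

    def predict(customer_id):
        return {"customer_id": customer_id, "predicted_orders": average}

    return predict


def matches_schema(output, schema):
    """True if output has exactly the expected fields with the expected types."""
    return set(output) == set(schema) and all(
        isinstance(output[field], expected_type)
        for field, expected_type in schema.items()
    )


predict = train_baseline([2, 4, 6])
prediction = predict("c-001")

# The check concerns the format only; accuracy is evaluated separately.
format_ok = matches_schema(prediction, EXPECTED_OUTPUT_SCHEMA)
print(format_ok)
```

A format check like this can surface integration issues early, while a weak baseline is still cheap to change.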
Keep in mind that a data sample will probably come with just a sample of the data problems that will need to be resolved. Yet it could be expensive to clean the entire data set at this stage, so teams need to weigh that trade-off. Our upcoming post on the Feasibility stage will cover data preparation techniques, such as normalization, clipping outliers, data encoding, and so on, in further detail.
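For readers unfamiliar with the techniques named above, here is a generic sketch of each one. These are common textbook versions (clipping to fixed bounds, min-max normalization, one-hot encoding) chosen for illustration; the bounds and sample values are assumptions, and other variants may suit a given project better.

```python
def clip_outliers(values, lower, upper):
    """Clipping: pull values outside [lower, upper] back to the nearest bound."""
    return [min(max(value, lower), upper) for value in values]


def min_max_normalize(values):
    """Normalization: rescale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant column carries no information
    return [(value - lowest) / (highest - lowest) for value in values]


def one_hot_encode(values):
    """Encoding: turn categories into 0/1 columns, one per sorted category."""
    categories = sorted(set(values))
    return [[1 if value == category else 0 for category in categories] for value in values]


order_amounts = [5.0, 10.0, 15.0, 500.0]  # 500.0 looks like an outlier
clipped = clip_outliers(order_amounts, lower=0.0, upper=25.0)
normalized = min_max_normalize(clipped)
encoded = one_hot_encode(["ON", "QC", "ON"])

print(clipped)
print(normalized)
print(encoded)
```

Clipping before normalizing matters here: without it, the single 500.0 value would squeeze the other three amounts into a narrow band near zero.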
Design & Implementation
Here, we need to start looking at entire data sets, in addition to investigating any new machine learning models and application integrations. This is your chance to:
- Collect the entire data set as raw data.
- Clean and aggregate the data, as identified in the feasibility phase.
Deployment & Support
The importance of data does not diminish at this stage. The ‘data as a first-class citizen’ rule applies here too, as you prepare the data for feedback loops and any future retraining. ML systems in production need two things: efficient access to the data features used in the downstream task, and continuous preparation of new real data to be fed into the system's feature store. Business applications that use ML systems continuously receive new data from different sources, so the data preparation pipelines used during training must be deployed with the ML system. The following considerations are generally important at this stage:
- Appropriate data governance for consumed and generated data (including masking personal information, data storage and retention, data encryption, and others).
- Continuous addition of the collected data to the feature store, to prepare data for future updates and retraining.
- Evaluation of the system feedback loop to continuously assess the quality of the deployed model.
- Development of model KPIs to identify potential data drift.
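One simple way to turn the last point into a KPI is to compare a feature's recent production values against the values seen during training. The sketch below measures how far the production mean has moved, in units of the training standard deviation. It is a basic heuristic assumed for illustration, not a formal statistical test, and the threshold of 2.0 is an arbitrary example value.

```python
from statistics import mean, stdev


def drift_score(training_values, production_values):
    """Shift of the production mean, in training standard deviations."""
    baseline_mean = mean(training_values)
    baseline_std = stdev(training_values)
    shift = abs(mean(production_values) - baseline_mean)
    if baseline_std == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / baseline_std


def has_drifted(training_values, production_values, threshold=2.0):
    """Flag drift when the score exceeds the (assumed) threshold."""
    return drift_score(training_values, production_values) > threshold


training_orders_per_day = [10.0, 12.0, 11.0, 9.0, 13.0]
stable_week = [11.0, 10.0, 12.0]
shifted_week = [20.0, 22.0, 21.0]

print(has_drifted(training_orders_per_day, stable_week))
print(has_drifted(training_orders_per_day, shifted_week))
```

A flagged feature is a prompt to investigate the upstream data-generating process and to consider retraining, not proof that the model has degraded.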
Respect the data
Experienced product managers help their organizations look at their data-generating processes with an eye towards future AI and machine learning innovation. Many of the common data challenges product managers face could be solved by ensuring data is captured and treated appropriately from the very beginning.
Yet this does not absolve project and product managers from understanding the nuances of their data. In fact, as machine learning systems become increasingly important parts of the business problem-solving equation, a sound understanding of data and data quality will become an increasingly valuable capability.
Expect to see more organizations building systems designed to democratize data, including unified data governance, data catalogues that simplify the discovery of new data sources, data preparation pipelines, continuous data integration, and knowledge graphs.
This post is part of an ongoing series developed by the product managers at Borealis AI. In future posts, my colleagues will delve deeper into other aspects of the PM process. But don’t be surprised to see a continuous return to the theme of data as a first-class citizen. That is a fundamental part of the ML PM process.
This is part 2 in our Field Guide to Machine Learning Product Development: What Product Managers Need to Know series. Read the Introduction here, and stay tuned for deep dives into the six main stages of the machine learning PM process.
(1) Tabulation and Processing history at the United States Census. https://www.census.gov/history/www/innovations/technology/tabulation_and_processing.html (accessed March 12, 2021).
(2) Sambasivan, Nithya, et al. "Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI. CHI 2021.