This is the final instalment in our Field Guide to Machine Learning Product Development: What Product Managers Need to Know series. Read the Introduction; learn how to manage data throughout the machine learning product lifecycle; read up on the Discovery and Feasibility, Design and Requirements, Model and Development, and Production phases; then read about the final phase, Support and Maintenance, below.
In the world of stable software releases, the launch day was often an event not to be missed. Development teams would gather (pre-COVID) for a ‘shipping party’ to celebrate the end of a long and arduous, yet ultimately successful journey (rumour has it that, at the Microsoft Windows 95 event, one developer rode his motorcycle through the office halls in celebration… much to the chagrin of the facilities manager).
The advent of web-based systems put an end to the ‘shipping parties’. The world has discovered agile. However, for the most part, the PM’s work did not naturally extend to include support and maintenance of a software product, especially in more standard SaaS products. Traditionally, this responsibility sat with DevOps/SRE/Engineering teams. In the context of machine learning, it is much more important for PMs to get involved in support & maintenance, and to put plans in place early in the product lifecycle.
Support & Maintenance are certainly not an afterthought for ML product managers.
In ML, product managers tend to consider support and maintenance much more deeply, and here’s why. For the vast majority of ML models, there will never be a state in which the model’s behaviour isn’t changing, even if no new lines of code are added. ML models are best considered as living, evolving organisms that require continuous support and maintenance (often right until the end of their lives). To complicate matters further, they are highly susceptible to forces outside of their control – changes in user patterns, malicious actors, or business needs, for example – which can affect input data patterns. That, in turn, can impact the model’s accuracy in production.
It all comes down to developing a continuous support and maintenance mindset. This requires PMs and their research teams to start thinking about the challenges that may come up throughout the active lifespan of the product as early as possible. For example, if you can calculate how quickly the model accuracy will likely degrade in production, you can start building the right pipelines and infrastructure to trigger retraining at the appropriate intervals. Similarly, if you can test for bias, adversarial robustness, and fairness earlier in the lifecycle, you can reduce the risk and complexity of monitoring and alerting on the appropriate metrics.
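As a rough illustration of the retraining-interval idea above, the sketch below fits a straight line to a series of weekly accuracy measurements and estimates how many weeks remain before accuracy drops below an acceptable floor. The function name `weeks_until_threshold`, the sample numbers and the assumption that degradation is roughly linear are all hypothetical; real decay curves may look quite different, so treat this as a starting point for a conversation with your research team rather than a recipe.

```python
from statistics import mean


def weeks_until_threshold(weekly_accuracy, threshold):
    """Estimate weeks (counted from the latest observation) until accuracy
    falls below `threshold`, using a least-squares linear trend.

    Returns None when there is no downward trend to extrapolate.
    """
    n = len(weekly_accuracy)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")

    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(weekly_accuracy)

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, weekly_accuracy))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    if slope >= 0:
        return None  # accuracy is flat or improving

    # Week index at which the fitted line crosses the threshold.
    crossing_week = (threshold - intercept) / slope
    # Clamp at zero: the threshold may already have been crossed.
    return max(0.0, crossing_week - (n - 1))


# Hypothetical measurements: accuracy drops about one point per week.
observed = [0.92, 0.91, 0.90, 0.89, 0.88]
remaining = weeks_until_threshold(observed, threshold=0.85)
print(f"Schedule retraining within about {remaining:.1f} weeks")
```

An estimate like this can inform how often a retraining pipeline should run, but it only works where ground truth arrives quickly enough to measure accuracy week by week.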
In this – our final post in our Field Guide to ML Product Development – we’ll explore some of the unusual complexities of planning for and managing support and maintenance in an ML project environment. We’ll pick up on some consistent themes that have been raised throughout this series: the uncertainty of dealing with evolving models; the complexity of aligning technology and business demands; and the critical importance of understanding the data.
We’ll also offer a few tips and insights from our work at Borealis AI where, through our relationship with RBC, we have learned a lot about deploying, supporting and maintaining ML and AI models within large enterprise governance, control and technology environments.
Know what you are getting into
ML systems tend to require more attention than your typical SaaS product, especially when in production. Think of all the usual maintenance problems that come with traditional software systems, plus new ML-specific issues that are continuously evolving. Here are five key areas that should be top of mind for PMs as they plan for the support and maintenance stage of the ML product lifecycle:
- Go beyond the code. The reality is that the ML model makes up only a small fraction of a real-world ML system. The model will need to integrate into a vast and complex surrounding infrastructure. It’s not just the layers of enterprise technology that need to be considered; it’s also the ML-specific tooling that may be required down the road. That raises questions such as whether you need GPUs for inference to meet the SLA, what tools are available to monitor model accuracy and data drift, and whether you will need to develop retraining pipelines and feedback loops.
- Think about the data. ML systems have an extraordinarily high dependency on data. At the front end, data access and quality are critical to getting an ML project off the ground. But when it comes to the support and maintenance stage, the focus shifts to ensuring you have the right monitoring and controls in place to detect changes in data that could have downstream effects on model performance and user experience. Often, complexities and risks arise from having different data sources for research and model training on the one side, and production ML inference systems on the other. In each environment, there are different constraints to data availability, stability, and quality (in some cases, the data may not be available at all). PMs may want to consider attaching a ‘cost’ to each data item based on what it takes to acquire it in production and how much value it adds to model and product performance.
- Know what you are monitoring for. The biggest question here, perhaps, is whether you want to optimize model performance or product performance. In some situations, a 10% loss in model accuracy may have very little impact on the end user experience. In other situations, product performance may be entirely predicated on model performance. To complicate matters further, in some scenarios product performance is easy to measure (for example, in a bill payment prediction situation where the ‘ground truth’ is observable and can be used for comparisons in the near future), but in others it could take months or years to receive signals correlated to the output of the model (like in a credit adjudication context, where defaults only happen years down the road). Knowing what is important to you to monitor for will help PMs to prepare a plan to interpret and act on degraded performance when it happens.
- Understand the real-world context. It is now widely recognized that ML models can influence changes in user behaviour. Combined with the natural social ‘shift’ that influences user expectations at a macro level, these forces can mean that the user context changes during the lifecycle of the ML product. What’s more, these shifts can be unexpected and unpredictable, making them very hard to plan for. There are ways to reduce the complexity, however: consider running the model over your historical data to specifically identify how these changes manifested in the past. In particular, monitor for prediction bias, action limits and changes in behaviours in upstream producers. To ensure you remain on top of evolving user expectations, try capturing changes in user behaviour through data feedback loops that are then funnelled back into either retraining pipelines or further research, thereby enabling your system to adapt and learn.
- Recognize the limitations of ML models. Just because a model has been retrained with new or more data does not mean it will necessarily perform better than it did before. ML models can be unpredictable that way. PMs may want to consider how they will get production data and production models into non-production environments for testing throughout the lifecycle of the product. And they will want to consider how to test the new models (before and after deployment) for regression. Inference services set up to run champion/challenger A/B tests, along with canary deployments, will likely become increasingly important.
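To make the data-monitoring point in the list above more concrete, here is a minimal sketch of one common drift check, the Population Stability Index (PSI), which compares the distribution of a feature in training data against what the model sees in production. The function name, the bin count and the sample data are illustrative assumptions, and the 0.1 / 0.25 cut-offs mentioned in the comments are a widely quoted rule of thumb rather than a standard; your team may well prefer a different statistic or different thresholds.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one numeric feature using PSI.

    `expected` is the reference sample (e.g. training data) and `actual`
    is the sample under test (e.g. last week's production inputs).
    Bin edges are derived from the reference sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate case: the reference feature is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the reference range go into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature samples: production values have shifted upwards.
training = [i / 100 for i in range(100)]
production = [0.5 + i / 200 for i in range(100)]

psi = population_stability_index(training, production)
# Rule of thumb (an assumption, not a standard): below 0.1 is stable,
# above 0.25 suggests a shift worth investigating.
print(f"PSI = {psi:.2f}", "-> investigate" if psi > 0.25 else "-> stable")
```

A check like this needs no ground-truth labels, which makes it useful in settings such as credit adjudication where outcome signals take months or years to arrive.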
These are just some of the considerations that PMs should be thinking about early as they plan for their ML product lifecycle. But, depending on your situation, model and context, the range of issues could be much broader.
Keep your focus: monitoring & stakeholder expectations
While there is much a PM can be doing to prepare for this support & maintenance phase, it helps to have these two key elements figured out very early in the process – the monitoring process and the stakeholder expectations.
Given how quickly the surrounding environment changes and the ongoing maintenance costs associated with ML products, PMs will need a comprehensive live monitoring framework that enhances responsiveness and ensures long-term system reliability. And it’s best to be thinking about this early on, when a PM can still influence the data streams, production impact, and stakeholder expectations.
Start thinking about stakeholder expectations early in the process. It may be helpful to build your plan around SLAs, SLOs and SLIs. SLAs (service level agreements) represent your agreement with your users. SLOs (service level objectives) are the goals you aim to achieve in pursuit of the SLA. Service level indicators (SLIs), on the other hand, allow you to measure how well your team is performing against those SLOs.
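The relationship between SLIs and SLOs can be sketched in a few lines. In the example below, the SLI is the share of inference requests answered within a latency limit, the SLO is a target for that share, and the "error budget" is how much of the allowed shortfall is still unused. All names (`Slo`, `latency_sli`, `error_budget_remaining`), the 300 ms limit and the 99% target are hypothetical values chosen for illustration; the same pattern could be applied to an ML-specific indicator such as prediction accuracy.

```python
from dataclasses import dataclass


@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.99 means 99% of requests must be "good"


def latency_sli(latencies_ms, threshold_ms):
    """SLI: fraction of requests served within `threshold_ms`."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(sli, slo):
    """Fraction of the SLO's allowed failures that is still unused.

    1.0 means no budget spent; 0.0 means the SLO is only just met;
    a negative value means the SLO has been breached.
    """
    allowed_bad = 1.0 - slo.target
    if allowed_bad <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    actual_bad = 1.0 - sli
    return 1.0 - actual_bad / allowed_bad


# Hypothetical week of traffic: 995 fast requests and 5 slow ones.
latencies = [100] * 995 + [500] * 5
slo = Slo(name="inference latency under 300 ms", target=0.99)

sli = latency_sli(latencies, threshold_ms=300)
budget = error_budget_remaining(sli, slo)
print(f"SLI = {sli:.3f}, error budget remaining = {budget:.0%}")
```

Tracking the remaining budget, rather than only pass or fail, gives stakeholders an early warning before the SLA itself is at risk.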
In conclusion, as this Field Guide to Machine Learning Product Development has made clear, ML products are complex, unpredictable and extraordinarily difficult to manage. This is as true in the Support and Maintenance stage as it is throughout the lifecycle. It is an exciting challenge, however, and one of the reasons we are sharing more of what we know about doing this right.
Looking back across the articles in this Field Guide, the key takeaways for PMs would be to start by thinking about the end-to-end lifecycle, identify where early planning and research can help reduce some of that uncertainty, and start planning to get your arms around the complexity.
Those who miss the old camaraderie of the ‘shipping party’ may also want to schedule some time to celebrate milestones with their teams (motorcycles to remain in the parking lots, please).