Machine Learning Observability


Despite significant investments in Machine Learning (ML), the field is still highly experimental, and the success rates of ML models vary significantly from one application to another. A common question is: once models are deployed, how can we ensure that they actually work? Delivering high-quality ML models on a continuous basis is hard, and making sure those models keep performing well long into their life in production is harder.

What Can Go Wrong?

  • Training/Serving Skew: When a model is deployed, it may not perform as it did during offline validation. Handoffs to production do not always go smoothly, and the resulting mismatch is commonly termed training/serving skew.
  • Changing Data Distributions: The distribution of the data the model is exposed to can change over time, which is often known as feature drift or data drift.
  • Messy Data: Unlike code, data does not stay constant. In a research lab, many hours often go into producing high-quality data sets with negligible noise and accurate labels. In the real world, there is no such guarantee of quality.
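The data drift described in the list above can be checked numerically by comparing a feature's distribution at training time with its distribution in production. The following is a minimal sketch, not a prescribed method: it uses the Population Stability Index (PSI), one of several possible drift measures, and the function name, bin count, and smoothing constant are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    production: Sequence[float],
    bins: int = 10,      # assumed bin count, tune per feature
    eps: float = 1e-6,   # assumed floor to avoid log(0) for empty bins
) -> float:
    """PSI of one numeric feature between a baseline and a production sample."""
    if not baseline or not production:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(baseline), max(baseline)
    # Equal-width bins over the baseline range; fall back to 1.0 if constant.
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Clamp so values outside the baseline range land in the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(production)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Usage: a production sample shifted toward higher values than the baseline.
training_sample = [i / 100 for i in range(100)]
production_sample = [0.5 + i / 200 for i in range(100)]
psi = population_stability_index(training_sample, production_sample)
print(f"PSI = {psi:.3f}")
```

A PSI of zero means the two binned distributions match; larger values indicate a larger shift. Any alerting threshold is a judgment call that depends on the feature and the application. The same comparison can be run between offline validation data and the first batch of serving data as a rough check for training/serving skew.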

How Can ML Observability Help?

ML Observability is the practice of gaining a deep understanding of a model’s performance across the stages of the model development cycle: as it is being built, after it has been deployed, and long into its life in production. Observability is what separates a team flying blind after deploying a model from a team that can improve its models quickly.

As tools are built to facilitate the three stages of the ML workflow (data preparation, model building, and production), it is common for teams to develop misconceptions about the complex ML infrastructure space.

Misconceptions of ML Observability and the Facts

  • Discovering the problem means half the battle is won.

With ML observability, teams can expedite time-to-resolution by moving beyond knowing a problem exists to comprehending why the issue arose in the first place and how to resolve it.

  • The ML lifecycle is static.

In ML environments, the data fed to models is not a static input. Some models, commonly referred to as online models, are also designed to keep evolving while in production. Furthermore, the task that a model is trying to perform may change over time. Because ML models and the environments they operate in are dynamic, ML teams need to observe their models’ performance to understand how the models respond to changing data and tasks.

  • ML observability is just about production.

Though many of the problems that teams face while deploying ML solutions are found in production, ML observability principles can help stamp out some of these problems earlier in the model development process. Observability tools enable ML teams to set a baseline reference against which production performance can be compared.
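The idea of a baseline reference can be illustrated with a short sketch. It assumes a classification model, accuracy as the metric, and a hypothetical tolerance of five percentage points. None of these choices come from a specific tool, and the names `Baseline` and `compare_to_baseline` are made up for this example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Baseline:
    """Reference performance recorded before deployment (hypothetical type)."""
    name: str        # e.g. "offline validation"
    accuracy: float


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and aligned")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def compare_to_baseline(
    baseline: Baseline,
    predictions: Sequence[int],
    labels: Sequence[int],
    max_drop: float = 0.05,  # assumed tolerance, not a recommended value
) -> dict:
    """Report how far production accuracy has fallen below the baseline."""
    production_accuracy = accuracy(predictions, labels)
    drop = baseline.accuracy - production_accuracy
    return {
        "baseline": baseline.name,
        "baseline_accuracy": baseline.accuracy,
        "production_accuracy": production_accuracy,
        "drop": drop,
        "degraded": drop > max_drop,
    }


# Usage: the model scored 0.90 offline but gets 2 of 4 production cases right.
report = compare_to_baseline(
    Baseline(name="offline validation", accuracy=0.90),
    predictions=[1, 0, 1, 1],
    labels=[1, 0, 0, 0],
)
print(report)
```

Recording the baseline during model building, rather than after an incident, is what lets the comparison catch a regression as soon as production labels are available.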

  • ML observability only matters when you have real-time models or real-time serving.

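While many applications enjoy real-time ground truth for their model’s predictions, performance data for some models is not immediately available due to the nature of the application. Observability matters in both cases, because a model can degrade whether its feedback arrives immediately or much later.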

  • Production ground truth is required.

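For many production ML models, ground truth surfaces for every prediction, providing real-time visibility into model performance. For others, ground truth is delayed or never arrives, and teams can still monitor proxy signals such as drift in the model’s inputs and predictions.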
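When ground truth arrives late or only for some predictions, performance can still be tracked on the predictions whose labels have come in, alongside a count of how many are still pending. The sketch below assumes that predictions and labels share a prediction ID. The function name and the returned fields are hypothetical.

```python
from typing import Dict, Hashable, Optional, Union


def evaluate_with_delayed_labels(
    predictions: Dict[Hashable, int],
    labels: Dict[Hashable, int],
) -> Dict[str, Optional[Union[int, float]]]:
    """Score only the predictions whose ground truth has arrived so far."""
    # Join predictions to labels on the shared prediction ID.
    matched = [pid for pid in predictions if pid in labels]
    correct = sum(predictions[pid] == labels[pid] for pid in matched)
    return {
        # Share of predictions that already have ground truth.
        "coverage": len(matched) / len(predictions) if predictions else 0.0,
        # Accuracy is undefined until at least one label has arrived.
        "accuracy": correct / len(matched) if matched else None,
        "pending": len(predictions) - len(matched),
    }


# Usage: four predictions were logged, but only two labels have come back.
logged_predictions = {"a": 1, "b": 0, "c": 1, "d": 1}
arrived_labels = {"a": 1, "b": 1}
status = evaluate_with_delayed_labels(logged_predictions, arrived_labels)
print(status)
```

Reporting coverage next to accuracy makes it visible when a metric rests on only a small, possibly unrepresentative, share of the traffic.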


As ML observability emerges as the missing foundational piece of ML infrastructure, its applications and benefits are continuously being revealed. ML observability can be used to deliver and continuously improve models with confidence and gain a competitive ML advantage.