Despite the overall positive effects of this decision, on-call events more than doubled: ML models break more often than ever, and engineers need to figure out what went wrong, causing extensive delays and many hours of engineering work.
Your Eng director comes up with a solution to this issue: they build a meta-predictive framework that rejects data snapshots likely to cause on-call events (i.e., it does not allow a model version into production if its performance during a short testing period falls outside a pre-defined range).
As the genius data scientist of the team, you are tasked with measuring the impact of this meta-predictive framework. How would you do it? How would you test whether your metrics are accurate?
Problem statement: We need to measure the impact of this meta-predictive framework. In simple terms, the framework prevents bad models from going into production.
Problem understanding: Let’s see some example clarification questions we could ask on this problem:
Q: How many models are being filtered through this framework every hour?
A: You can assume that every hour roughly 1000 models are being filtered.
Q: What percentage of these models are being filtered out?
A: Right now, you can assume that the framework filters out roughly 10% of models.
Q: What is the precision and the recall of the system?
A: For simplicity, you can assume that the system has recall = 1 and precision = 0.1.
Q: Is it reasonable to assume that filtering these models introduces an additional delay in pushing models to production? How long does it take for each model to be tested and then potentially filtered out?
A: It takes roughly 10 minutes for each model to be tested.
Q: In that case, is it reasonable to assume that our measurement should account for the revenue loss caused by this additional delay?
A: Correct, this is a trade-off you will need to take into account.
We could keep asking questions, but let's assume that we now have a good understanding of what the interviewer is asking: we need to find a way to measure the counterfactual impact of a system that, on one hand, prevents bad models from going into production (i.e., models that would have caused revenue losses) and, on the other hand, introduces delays that cost money.
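To ground these assumptions, here is a quick back-of-the-envelope tally of what the clarification answers imply per hour of traffic (recall = 1 means no bad model slips through the filter):
\[ \begin{align} \text{Models filtered out} &= 1000 \times 0.10 = 100 \\ \text{True positives (would have broken)} &= 100 \times 0.10 = 10 \\ \text{False positives (good models rejected)} &= 100 \times 0.90 = 90 \end{align} \]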
Metric definition: Now that we have a good understanding of the problem, how could we measure the impact of this system? When filtered models are true positives (i.e., they would have broken and caused on-call events), the system saves revenue losses. On average, we can assume that:
\[ \text{Revenue savings} = \Pr(\text{filtered and positive}) * \text{Expected cost of positive} \]
On the other hand, the framework introduces a fixed delay cost, since every model's release to production is delayed by 10 minutes:
\[ \text{Fixed delay cost} = 10 * \text{Expected revenue increase of a new model per minute} \]
Finally, the framework introduces an unnecessary delay of roughly an hour when a good model gets filtered out (i.e., a false positive):
\[ \begin{align} \text{False Positives delay cost} &= \Pr(\text{filtered and negative}) \\ &* \text{Expected revenue increase of a new model per hour} \end{align} \]
As a result, the total impact of the system will be:
\[ \text{Total impact} \propto \text{Revenue savings} - \text{Fixed delay cost} - \text{False Positives delay cost} \]
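As a rough illustration, here is a minimal sketch of how the per-hour net impact could be computed under the assumptions above. The throughput, filter rate, precision, and delays come from the clarification answers; the dollar figures (`EXPECTED_COST_OF_POSITIVE`, `REVENUE_PER_MODEL_PER_MINUTE`) are hypothetical placeholders, not values given in the problem.

```python
# Back-of-the-envelope estimate of the filter's net impact per hour of traffic.
# Throughput, filter rate, precision, and delays come from the clarification
# answers above; the dollar amounts are made-up placeholders for illustration.

MODELS_PER_HOUR = 1000          # models passing through the filter each hour
FILTER_RATE = 0.10              # fraction of models rejected
PRECISION = 0.10                # fraction of rejected models that truly would have broken
FIXED_DELAY_MIN = 10            # extra testing delay applied to every model (minutes)
FALSE_POSITIVE_DELAY_MIN = 60   # ~1 hour extra delay for a wrongly rejected model

# Hypothetical dollar values (placeholders, not from the problem statement).
EXPECTED_COST_OF_POSITIVE = 5_000.0   # average revenue loss of one on-call event
REVENUE_PER_MODEL_PER_MINUTE = 1.0    # average revenue lift of shipping a model one minute earlier


def hourly_impact() -> float:
    """Estimated net impact (in dollars) of the filter per hour of traffic."""
    filtered = MODELS_PER_HOUR * FILTER_RATE          # 100 models/hour rejected
    true_positives = filtered * PRECISION             # 10 models/hour that would have broken
    false_positives = filtered * (1 - PRECISION)      # 90 good models/hour wrongly rejected

    revenue_savings = true_positives * EXPECTED_COST_OF_POSITIVE
    fixed_delay_cost = MODELS_PER_HOUR * FIXED_DELAY_MIN * REVENUE_PER_MODEL_PER_MINUTE
    false_positive_delay_cost = (
        false_positives * FALSE_POSITIVE_DELAY_MIN * REVENUE_PER_MODEL_PER_MINUTE
    )

    return revenue_savings - fixed_delay_cost - false_positive_delay_cost


if __name__ == "__main__":
    print(f"Estimated net impact per hour: ${hourly_impact():,.0f}")
```

Plugging in different placeholder values makes the trade-off explicit: the filter pays off only when the expected cost of an on-call event is large relative to the revenue lost by delaying every model and by holding back the false positives.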
This is one way that you can solve this problem, but there are potentially several others.
You might think that the above question is extreme because it focuses on a niche topic. That may be true; however, questions that focus on counterfactual impact are very common.