
Measuring counterfactual impact

Metrics · Hard · Seen in real interview

Assume that you are working in Ads, where model freshness is particularly important: new information significantly improves the performance of your models. Assume that your company deploys hundreds of ML Ad models, and the Eng ranking team decides to update all of them constantly (e.g., every hour) with new data snapshots.

Despite the overall positive effects of this decision, on-call events more than doubled: ML models break more often than ever, and engineers need to figure out what went wrong, causing extensive delays and many Eng hours of work.

Your Eng director comes up with a solution to this issue: they build a meta-predictive framework that rejects data snapshots that are likely to cause on-call events (i.e., it does not allow a model version to go into production if its performance during a short testing period falls outside a pre-defined range).

As the genius data scientist of the team, you are tasked with measuring the impact of this meta-predictive framework. How would you do it? How would you test whether your metrics are accurate?

We will solve this problem step by step, working through the problem statement, problem understanding, metric definition, and finally an A/B test:

  • Problem statement: We need to measure the impact of this meta-predictive framework. The framework, in simple terms, prevents bad models from going into production.

  • Problem understanding: Let’s see some example clarification questions we could ask about this problem:

    • Q: How many models are being filtered through this framework every hour?

    • A: You can assume that every hour roughly 1000 models are being filtered.

    • Q: What percentage of these models are being filtered out?

    • A: Right now, you can assume that the framework filters out roughly 10% of models.

    • Q: What is the precision and the recall of the system?

    • A: For simplicity, you can assume that the system has recall = 1 and precision = 0.1.

    • Q: Is it rational to assume that in order to filter out these models we introduce an additional delay in pushing models to production? How long does it take for each model to be tested and then potentially be filtered out?

    • A: It takes roughly 10 minutes for each model to be tested.

    • Q: In this case, is it rational to assume that our measurement should include some sort of revenue loss due to this additional delay that the filter introduces?

    • A: Correct, this is a trade-off you will need to take into account.

    We can keep asking questions, but let’s assume that we now have a good understanding of what the interviewer is asking: we need to figure out a way to measure the counterfactual impact of a system that, on the one hand, prevents bad models from going into production (i.e., models that would have caused revenue loss) and, on the other hand, introduces delays that cost money. The quick calculation below shows the scale these assumptions imply.
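    Treating the interviewer’s numbers as exact, each hour we have:

\[ \text{filtered models} = 1000 \times 0.10 = 100, \quad \text{true positives} = 0.1 \times 100 = 10, \quad \text{false positives} = 100 - 10 = 90 \]

    In other words, with precision = 0.1 and recall = 1, roughly 10 genuinely bad models are caught every hour while about 90 good models are held back, which is exactly the trade-off our metric needs to capture.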

  • Metric definition: Now that we have a good understanding of the problem, how could we measure the impact of this system? When filtered models are true positives (i.e., they would have ended up breaking and causing on-call events), the system saves revenue losses. On average, we can assume that:

\[ \text{Revenue savings} = \Pr(\text{filtered and positive}) * \text{Expected cost of positive} \]

       On the other hand, the framework introduces a fixed delay cost by delaying every model from reaching production by 10 minutes:

\[ \text{Fixed delay cost} = 10 * \text{Expected revenue increase of a new model per minute} \]

       Finally, the framework introduces an unnecessary delay of roughly an hour when a good model gets filtered out (i.e., a false positive):

\[ \begin{align} \text{False Positives delay cost} &= \Pr(\text{filtered and negative}) \\ &* \text{Expected revenue increase of a new model per hour} \end{align} \]

       As a result, the total impact of the system will be:

\[ \text{Total impact} \propto \text{Revenue savings} - \text{Fixed delay cost} - \text{False Positives delay cost} \]
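       To make the formula concrete, here is a minimal Python sketch that plugs the interviewer’s assumptions into the three terms above. The dollar figures (cost of an on-call event, revenue lift of a fresh model per minute) are hypothetical placeholders, not numbers from the problem:

```python
# Back-of-the-envelope impact estimate for the meta-predictive filter.
# The rates come from the interviewer's assumptions; the dollar values
# are hypothetical placeholders that would have to be estimated from
# historical data.

models_per_hour = 1000          # models passing through the framework each hour
filter_rate = 0.10              # fraction of models the framework rejects
precision = 0.10                # fraction of rejected models that are truly bad
test_delay_minutes = 10         # fixed testing delay applied to every model
fp_delay_minutes = 60           # a wrongly rejected model waits ~1 hour for the next snapshot

# Hypothetical business parameters (placeholders).
cost_of_oncall_event = 5_000.0          # expected revenue loss per bad model in production ($)
revenue_gain_per_model_minute = 0.50    # expected revenue lift of a fresh model per minute ($)

filtered = models_per_hour * filter_rate            # 100 models/hour
true_positives = precision * filtered               # 10 bad models caught/hour
false_positives = filtered - true_positives         # 90 good models held back/hour

# Term 1: revenue saved by keeping bad models out of production.
revenue_savings = true_positives * cost_of_oncall_event

# Term 2: every model reaches production 10 minutes later than it would have.
fixed_delay_cost = models_per_hour * test_delay_minutes * revenue_gain_per_model_minute

# Term 3: good models that were wrongly rejected lose roughly an hour of freshness.
fp_delay_cost = false_positives * fp_delay_minutes * revenue_gain_per_model_minute

total_impact = revenue_savings - fixed_delay_cost - fp_delay_cost
print(f"Estimated hourly impact: ${total_impact:,.0f}")
```

       Whether the total comes out positive hinges entirely on how those expected costs and revenue lifts are estimated, which is exactly the counterfactual problem discussed next.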

  • A/B test: Note that our metric definition contains several counterfactual components:
    • We never observe the actual delay cost since those models never reach production
    • We never observe the actual revenue increase that we have missed since these models never reach production
    • We never observe the actual cost of an on-call event since the true positives never reach production.
  • Then how do we know that our estimates are correct? In practice, we don’t. In theory, we could design an A/B test that evaluates the whole system; in practice, the necessary sample size might be too large (a rough power calculation is sketched below), or the cost of allowing faulty models into production too high, for a director to approve such an A/B test. Perhaps the best thing we can do is to run A/B tests of different versions of models and keep recalculating the expected revenue boost from each new model…
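    To see why the sample size can be prohibitive, here is a rough power calculation for such an A/B test, assuming a two-sided z-test on per-model revenue impact; the minimum detectable effect and standard deviation below are hypothetical placeholders:

```python
# Rough sample-size estimate for an A/B test of the whole filtering system,
# assuming a two-sided z-test on per-model revenue impact.
# delta and sigma are hypothetical placeholders; in practice they would be
# estimated from historical revenue data.
from scipy.stats import norm

alpha = 0.05       # significance level
power = 0.80       # desired statistical power
delta = 50.0       # minimum detectable revenue difference per model ($)
sigma = 5_000.0    # standard deviation of per-model revenue impact ($)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Classic two-sample formula for the required sample size per arm.
n_per_arm = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
print(f"Models needed per arm: {n_per_arm:,.0f}")   # ~157,000 with these placeholders
```

    At roughly 500 models per arm per hour, placeholders like these translate into weeks of traffic, on top of the risk of knowingly letting faulty models ship to the control arm.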


  • This is one way that you can solve this problem, but there are potentially several others.

  • You might think that the above question is extreme because it focuses on a niche topic. That might be true; however, questions that focus on counterfactual impact are very common.


Topics: Problem solving, Counterfactual