
NDCG vs. Mean Average Precision (MAP) vs. Mean Reciprocal Rank (MRR)

Machine Learning · Medium · Seen in real interview

Describe and compare NDCG and mean average precision (MAP).

NDCG (normalized discounted cumulative gain) is based on the idea that items higher in the ranking should be given more credit than items lower in the ranking. As a result, it penalizes ranking errors in accordance with their position (i.e., misplacing the first item as last is penalized more than misplacing the first item as second). To calculate NDCG we need to know the ground-truth (ideal) ranking of the items in the list – a serious drawback in cases where such a ranking cannot be observed (e.g., binary decisions such as choosing a movie from a list). Its formula is the following:

\[ DCG_p = \sum_{i=1}^p \frac{rel_i}{\log_2 (i+1)} \]

where \(p\) is the final position of the ranking that we care about (e.g., \(p=10\) if we are looking at the quality of a top-10 ranking), and \(rel_i\) is the relevance of the result at position \(i\), which is given by the user/evaluator.

NDCG compares the DCG of the resulting ranking with the DCG of an ideal ranking (\(IDCG_p\)) as follows:

\[ NDCG_p = \frac{DCG_p}{IDCG_p} \]

NDCG ranges between 0 and 1, with higher numbers indicating better rankings.
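
As a concrete illustration, here is a minimal NumPy sketch of DCG and NDCG following the formulas above (the function names and example relevance values are illustrative, not part of the original question):

import numpy as np

def dcg_at_p(relevances, p):
    # DCG over the top-p positions: sum of rel_i / log2(i + 1), with i starting at 1
    rel = np.asarray(relevances, dtype=float)[:p]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(i + 1) for i = 1..p
    return float(np.sum(rel / discounts))

def ndcg_at_p(relevances, p):
    # NDCG: DCG of the given ranking divided by the DCG of the ideal ranking
    ideal = sorted(relevances, reverse=True)  # ideal ranking: most relevant first
    idcg = dcg_at_p(ideal, p)
    return dcg_at_p(relevances, p) / idcg if idcg > 0 else 0.0

# Graded relevances in ranked order (made-up example values)
print(ndcg_at_p([3, 2, 3, 0, 1, 2], p=6))  # ≈ 0.96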

MAP (mean average precision), on the other hand, takes into account the number of relevant items in the ranked list. For a single query, average precision (AP) averages the precision over the positions where relevant items appear, where precision at position \(k\) is the number of relevant items among the top \(k\) divided by \(k\). MAP is then the mean of AP across all queries. MAP ranges from 0 to 1, with higher values indicating better performance.

Formally, we first calculate the average precision for each query using the formula:

\[ AP = \frac{1}{RD} \sum_{k=1}^n P(k) r(k) \]

where \(RD\) is the number of relevant documents for the query, \(n\) is the total number of documents, \(P(k)\) is the precision at position \(k\), and \(r(k)\) is the relevance of the \(k^{th}\) retrieved document (0 if not relevant, 1 if relevant).
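
A minimal Python sketch of AP and MAP under this convention (the function names and example relevance lists are illustrative; it assumes every relevant document appears somewhere in the ranked list, so \(RD\) equals the number of 1s in the list):

def average_precision(relevances):
    # AP for one query; relevances is the ranked list of binary labels r(k)
    num_relevant = sum(relevances)  # RD, assuming all relevant docs are in the list
    if num_relevant == 0:
        return 0.0
    ap, hits = 0.0, 0
    for k, rel in enumerate(relevances, start=1):
        if rel:  # only positions with r(k) = 1 contribute
            hits += 1
            ap += hits / k  # P(k): precision at position k
    return ap / num_relevant

def mean_average_precision(relevance_lists):
    # MAP: mean of AP over all queries
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)

# Made-up binary relevance lists for two queries
print(mean_average_precision([[1, 0, 1, 0, 1], [0, 1, 1, 0, 0]]))  # ≈ 0.67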


Note that MAP is designed for binary relevance; i.e., it works well when each item is either relevant or not. When relevance is graded (i.e., we care about how relevant each item is, not just whether it is relevant, as well as its position), NDCG is a much better measure.

 

MRR (mean reciprocal rank) measures the rank of the first relevant item in a ranked list. It is calculated by taking the reciprocal of the rank of the first relevant item, and averaging this value across all queries or users. For example, if the first relevant item for a given query has a rank of 3, the MRR for that query would be 1/3. MRR ranges from 0 to 1, with higher values indicating better performance.
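
A corresponding sketch for MRR (again, the function names and example inputs are illustrative):

def reciprocal_rank(relevances):
    # 1 / rank of the first relevant item; 0 if nothing in the list is relevant
    for rank, rel in enumerate(relevances, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(relevance_lists):
    # MRR: average reciprocal rank across queries/users
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)

# First relevant item at rank 3 for query 1 and rank 1 for query 2 (made-up)
print(mean_reciprocal_rank([[0, 0, 1, 1], [1, 0, 0, 0]]))  # (1/3 + 1) / 2 ≈ 0.67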

From sklearn: AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:

\[ AP = \sum_n (R_n - R_{n-1}) P_n \]

where \(R_n\) and \(P_n\) are the recall and precision at the \(n\)-th threshold. This implementation is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic.
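
For binary labels and real-valued model scores, scikit-learn exposes this as average_precision_score; a minimal usage example (the labels and scores below are made up):

from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0, 0]                # hypothetical binary ground-truth labels
y_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]  # hypothetical model scores

print(average_precision_score(y_true, y_scores))  # ≈ 0.81 for this toy example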


Some additional nuances across several ranking metrics:

  • Precision@N does not care about position: as long as an item is ranked in the top N, it receives equal weight.

  • MRR only focuses on the first relevant item that is found. Hence this metric gives the same result in both of the following scenarios (a numerical comparison of the two rankings follows this list):

    • Ranking 1: Not relevant, Not relevant, Relevant, Relevant, Relevant (Reciprocal Rank = 1/3)

    • Ranking 2: Not relevant, Not relevant, Relevant, Not Relevant, Not relevant (Reciprocal Rank = 1/3)

  • Finally, Recall@k is typically problematic when there are many relevant items (e.g., web search): the denominator (the total number of relevant items) is always huge, even though we do not really care about retrieving all of them.
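
To make these bullets concrete, the following self-contained sketch recomputes reciprocal rank, average precision, and Precision@5 for the two rankings from the MRR bullet above (compact, illustrative re-implementations of the helpers sketched earlier). MRR cannot tell the two rankings apart, while AP and Precision@5 can:

def reciprocal_rank(rels):
    # 1 / rank of the first relevant item; 0 if none is relevant
    return next((1.0 / k for k, r in enumerate(rels, start=1) if r), 0.0)

def average_precision(rels):
    # AP over a binary relevance list
    hits, precisions = 0, []
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(sum(rels), 1)

def precision_at_n(rels, n):
    return sum(rels[:n]) / n

ranking_1 = [0, 0, 1, 1, 1]  # Ranking 1 from the bullet above
ranking_2 = [0, 0, 1, 0, 0]  # Ranking 2 from the bullet above

for name, r in [("Ranking 1", ranking_1), ("Ranking 2", ranking_2)]:
    print(name,
          f"RR={reciprocal_rank(r):.2f}",    # identical: 0.33 for both
          f"AP={average_precision(r):.2f}",  # 0.48 vs 0.33
          f"P@5={precision_at_n(r, 5):.2f}") # 0.60 vs 0.20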



Topics

Ranking metrics, NDCG, Mean average precision (MAP), Mean Reciprocal Rank (MRR), Recommender Systems