\[ DCG_p = \sum_{i=1}^p \frac{rel_i}{\log_2 (i+1)} \]
where \(p\) is the final position of the ranking that we care about (e.g., \(p = 10\) if we are looking at the quality of a top-10 ranking), and \(rel_i\) is the relevance of the result at position \(i\), which is given by the user/evaluator.
NDCG normalizes the DCG of the resulting ranking by the DCG of an ideal ranking, \(IDCG_p\):
\[ NDCG_p = \frac{DCG_p}{IDCG_p} \]
NDCG ranges between 0 and 1, with higher numbers indicating better rankings.
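As a quick illustration, here is a minimal Python sketch of these two formulas; the function names and the example relevance scores are made up for this example:

```python
import math

def dcg_at_p(relevances, p):
    """DCG_p = sum over i = 1..p of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))

def ndcg_at_p(relevances, p):
    """Normalize DCG by the ideal DCG (relevance scores sorted in decreasing order)."""
    ideal = dcg_at_p(sorted(relevances, reverse=True), p)
    return dcg_at_p(relevances, p) / ideal if ideal > 0 else 0.0

# Graded relevance judgments for one ranked result list, best rank first
rels = [3, 2, 3, 0, 1, 2]
print(ndcg_at_p(rels, p=5))  # ≈ 0.86
```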
MAP (mean average precision), on the other hand, takes into account the number and positions of the relevant items in the ranked list. For each query we compute the average precision (AP): the mean of the precision values at each position where a relevant item appears, where precision at a position is the number of relevant items retrieved up to that position divided by the total number of items retrieved up to that position. MAP is then the mean of these AP values across all queries. MAP ranges from 0 to 1, with higher values indicating better performance.
Concretely, the average precision for a single query is given by:
\[ AP = \frac{1}{RD} \sum_{k=1}^n P(k) r(k) \]
where \(RD\) is the number of relevant documents for the query, \(n\) is the total number of documents, \(P(k)\) is the precision at position \(k\), and \(r(k)\) is the relevance of the \(k^{th}\) retrieved document (0 if not relevant, 1 if relevant).
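A minimal sketch of AP and MAP under this definition, assuming each query comes as a list of binary relevance labels ordered by rank (the helper names and example data below are illustrative):

```python
def average_precision(relevances):
    """AP = (1 / RD) * sum over k of P(k) * r(k), with r(k) in {0, 1}."""
    num_relevant = sum(relevances)
    if num_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, rel in enumerate(relevances, start=1):
        if rel:
            hits += 1
            total += hits / k  # precision at position k, added only at relevant positions
    return total / num_relevant

def mean_average_precision(relevances_per_query):
    """MAP: the mean of AP over all queries."""
    return sum(average_precision(r) for r in relevances_per_query) / len(relevances_per_query)

# Two queries with binary relevance labels, ordered by rank
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]]))  # ≈ 0.74
```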
MRR (mean reciprocal rank) measures the rank of the first relevant item in a ranked list. It is calculated by taking the reciprocal of the rank of the first relevant item, and averaging this value across all queries or users. For example, if the first relevant item for a given query has a rank of 3, the MRR for that query would be 1/3. MRR ranges from 0 to 1, with higher values indicating better performance.
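A corresponding sketch for MRR, again assuming one list of binary relevance labels per query:

```python
def mean_reciprocal_rank(relevances_per_query):
    """MRR: the mean of 1 / rank of the first relevant item (0 if no item is relevant)."""
    total = 0.0
    for rels in relevances_per_query:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(relevances_per_query)

# First relevant item at rank 3 for query 1 and rank 1 for query 2
print(mean_reciprocal_rank([[0, 0, 1, 1], [1, 0, 0, 0]]))  # (1/3 + 1) / 2 ≈ 0.67
```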
From sklearn: AP summarizes a precision-recall curve as the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight:
\[ AP = \sum_n (R_n - R_{n-1}) P_n \]
where \(R_n\) and \(P_n\) are the recall and precision at the \(n\)-th threshold. This implementation is not interpolated and is different from computing the area under the precision-recall curve with the trapezoidal rule, which uses linear interpolation and can be too optimistic.
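This is what sklearn.metrics.average_precision_score computes from binary relevance labels and model scores; a small usage example with made-up labels and scores:

```python
from sklearn.metrics import average_precision_score

# Binary relevance labels and the model scores used to rank the items (illustrative data)
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
print(average_precision_score(y_true, y_score))  # ≈ 0.81
```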
Some additional nuances across several ranking metrics:
- Precision @ top N does not care about position: as long as an item is ranked within the top N, it is given equal weight.
- MRR only considers the first relevant item that is found, so it gives the same result in both of the following scenarios (see the sketch after this list):
  - Ranking 1: Not relevant, Not relevant, Relevant, Relevant, Relevant (Reciprocal Rank = 1/3)
  - Ranking 2: Not relevant, Not relevant, Relevant, Not relevant, Not relevant (Reciprocal Rank = 1/3)
- Finally, Recall @ k is typically problematic when there are a huge number of relevant items (e.g., web search): the denominator (the total number of relevant items) is then very large, so recall stays low no matter how good the ranking is, even though we rarely care about retrieving every relevant item.
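To make the MRR point concrete, the sketch below reuses the average_precision and mean_reciprocal_rank helpers defined above: both rankings get the same reciprocal rank, while AP tells them apart.

```python
# Rankings 1 and 2 from the MRR example above (1 = relevant, 0 = not relevant)
ranking_1 = [0, 0, 1, 1, 1]
ranking_2 = [0, 0, 1, 0, 0]

# MRR only sees the first relevant item, so both rankings score 1/3
print(mean_reciprocal_rank([ranking_1]), mean_reciprocal_rank([ranking_2]))

# AP rewards the extra relevant items in ranking 1 (≈ 0.48 vs ≈ 0.33)
print(average_precision(ranking_1), average_precision(ranking_2))
```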