A. Is this a problem? Why?
B. What are some things we could do to address the issue?
C. What are some metrics that you would use to evaluate the offline predictive performance of your model?
A: There are several reasons why class imbalance might be problematic when trying to build predictive models. Some of them include:
There might be insufficient signal for your model to learn to detect the minority classes. If the minority class has only a small number of instances, the problem effectively becomes a few-shot learning problem: the model only gets to see the minority class a few times before having to make decisions about it. If the rare class has no instances in the training sample at all, the model might assume that the class doesn’t exist.
Class imbalance makes it easier for your model to get stuck in a non-optimal solution that exploits a simple heuristic (e.g., always predicting the majority class) instead of learning anything useful about the underlying pattern of the data.
The cost of misclassification is asymmetric, as the cost of misclassifying a minority class instance tends to be much higher than the cost of misclassifying a majority class instance (e.g., labeling a fraudulent transaction as non-fraudulent has a much higher cost than labeling a legitimate transaction as fraudulent).
As we will discuss in C, evaluating the performance of the model becomes non-trivial as several metrics might be overly optimistic or completely misleading.
Despite these issues, problems with class imbalance tend to be among the more interesting, challenging, and rewarding ones to solve (e.g., predicting the likelihood of ad conversions, detecting credit card fraud, etc.).
B: Unfortunately, there is no perfect solution to class imbalance. Some mitigation techniques we could employ include (brief code sketches for several of these follow the list):
SMOTE (see here: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
Use XGBoost (or boosting in general), as it tends to work relatively well on imbalanced datasets
Get more data if possible
Easy ensemble (https://machinelearningmastery.com/bagging-and-random-forest-for-imbalanced-classification/)
Use class weights in the loss function (it penalizes mistakes in samples of class[i] with class\_weight[i] instead of 1)
Use a cost-sensitive loss (p. 111, Huyen (2022)):
\[ L(x; \theta) = \sum_j C_{ij} \, P(j \mid x; \theta) \]
where \(L\) is the loss of a single instance \(x\), \(i\) is its true class, \(j\) ranges over the predicted classes, and \(C_{ij}\) is the cost of predicting class \(j\) when the true class is \(i\); in other words, the loss is the expected misclassification cost under the model’s predicted probabilities.
Ensemble models trained on different class ratios
Bagging with random undersampling for imbalanced classification (BalancedBaggingClassifier, https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedBaggingClassifier.html)
Two-phase learning: first train a model on resampled, balanced data, then fine-tune it on the original data.
Focal loss (adjusts the loss so that samples the model classifies with lower confidence receive a higher weight, down-weighting the easy examples).
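As a concrete starting point, here is a minimal sketch of SMOTE oversampling with imbalanced-learn; the synthetic dataset and its roughly 99:1 class ratio are illustrative assumptions, not part of the original question.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative dataset with a ~99:1 class ratio (an assumption for this sketch).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
print("before:", Counter(y))

# SMOTE interpolates between minority-class neighbours to create synthetic samples.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))  # classes are now (approximately) balanced
```

Note that resampling should be applied to the training split only; the validation and test sets should keep the original class distribution.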
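For class weights in the loss function, a minimal scikit-learn sketch (the dataset and the 99:1 ratio are again assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' sets class_weight[i] inversely proportional to the
# frequency of class i, so mistakes on the minority class are penalized more;
# an explicit dict such as {0: 1.0, 1: 99.0} expresses the same idea by hand.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)
```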
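And a toy worked example of the cost-sensitive loss above, with made-up numbers for the cost matrix and the predicted probabilities:

```python
import numpy as np

# C[i, j]: cost of predicting class j when the true class is i (made-up values);
# here a missed minority-class instance costs 50x a false alarm.
C = np.array([[0.0, 1.0],
              [50.0, 0.0]])

p = np.array([0.9, 0.1])  # model's predicted P(j | x) for one instance
i = 1                     # true (minority) class of this instance

# L = sum_j C[i, j] * P(j | x): the expected misclassification cost.
loss = np.dot(C[i], p)
print(loss)  # 50 * 0.9 + 0 * 0.1 = 45.0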
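Bagging with random undersampling takes only a few lines with imbalanced-learn's BalancedBaggingClassifier; the dataset is again an illustrative assumption:

```python
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Each base estimator (a decision tree by default) is trained on a bootstrap
# sample in which the majority class is randomly undersampled to match the
# minority class.
clf = BalancedBaggingClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)
```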
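Finally, a sketch of the focal loss (Lin et al. (2017)) for the binary case in PyTorch; the alpha and gamma values below are the commonly used defaults, not values from the original text:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: the (1 - p_t)**gamma factor down-weights samples the
    model already classifies confidently, so hard examples dominate the loss."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # per-class weighting
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()

logits = torch.tensor([2.0, -1.0, 0.1])   # raw model outputs (toy values)
targets = torch.tensor([1.0, 0.0, 1.0])   # ground-truth labels
print(focal_loss(logits, targets))
```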
Some additional points on class imbalance worth noting:
Class imbalance in binary classification is a much easier problem than class imbalance in multiclass classification (p. 105, Huyen (2022))
Ding et al. (2017) showed that very deep neural networks (with “very deep” meaning over 10 layers back in 2017) performed much better on imbalanced data than shallower neural nets.
C: In terms of metrics, accuracy will be misleading, since in our example a majority classifier that blindly predicts “negative” will achieve 99% accuracy. Checking the confusion matrix is a good start, but the confusion matrix (along with precision, recall, and F-scores) is threshold-based.
A better option is to evaluate the predictive performance of our model probabilistically, using threshold-free metrics:
The area under the ROC curve is a good start, but it tends to be overly optimistic on highly imbalanced datasets because it relies on the False Positive Rate, whose denominator is the (very large) total number of negatives.
The area under the precision-recall curve (AUPR) is a better metric, as it focuses directly on how the model is doing on the minority class through precision and recall.
Finally, besides AUPR, accuracy can still be a useful metric if we compute it for each class individually. Similarly, precision, recall, and F-scores measured with respect to the minority class can also be informative, as the sketch below illustrates.
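A minimal sketch of computing these metrics with scikit-learn; the model and the dataset are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]   # probabilistic scores, not hard labels

print("ROC AUC:", roc_auc_score(y_test, scores))            # can look optimistic
print("AUPR:   ", average_precision_score(y_test, scores))  # focuses on the minority class
# Per-class precision, recall, and F1 at the default 0.5 threshold.
print(classification_report(y_test, clf.predict(X_test), digits=3))
```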