Class imbalance
A. Is this a problem? Why?
B. What are some things we could do to address the issue?
C. What are some metrics that you would use to evaluate the offline predictive performance of your model?
A: There are several reasons why class imbalance can be problematic when building predictive models, including:
There might be insufficient signal for your model to learn to detect the minority classes. If the minority class has only a small number of instances, the problem becomes a few-shot learning problem, where your model only gets to see the minority class a few times before having to make decisions on it. If there are no instances of the rare class in the training sample, the model might assume that the class doesn’t exist.
Class imbalance makes it easier for your model to get stuck in non-optimal solutions by exploiting a simple heuristic instead of learning anything useful about the underlying pattern of the data.
The cost of misclassification is asymmetric, as the cost of misclassifying a minority class instance tends to be much higher than the cost of misclassifying a majority class instance (e.g., labeling a fraudulent transaction as non-fraudulent has a much higher cost than labeling a legitimate transaction as fraudulent).
As we will discuss in C, evaluating the performance of the model becomes non-trivial as several metrics might be overly optimistic or completely misleading.
Despite these difficulties, problems with class imbalance tend to be among the more interesting, challenging, and rewarding to solve (e.g., predicting the likelihood of ad conversions, predicting credit card fraud, etc.).
B: Unfortunately, there is no perfect solution to class imbalance. Some mitigation techniques we could employ include:
SMOTE (see here: https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/)
Use XGBoost (boosting in general), as it tends to work relatively well on imbalanced datasets
Get more data if possible
Easy ensemble (https://machinelearningmastery.com/bagging-and-random-forest-for-imbalanced-classification/)
Use class weights in the loss function (mistakes on samples of class i are penalized with class\_weight[i] instead of 1); see the first sketch after this list
Use a cost-sensitive loss (p. 111, Huyen (2022)):
\[ L = \sum_{j} C_{ij} \, P(j \mid x; \theta) \]
where \(L\) is the loss of a single instance, \(i\) is its true class, \(j\) ranges over the possible predicted classes, and \(C_{ij}\) is the cost of predicting class \(j\) when the true class is \(i\) (see the second sketch after this list)
Ensemble models trained with different class ratios
Bagging with random undersampling for imbalanced classification (BalancedBaggingClassifier, https://imbalanced-learn.org/stable/references/generated/imblearn.ensemble.BalancedBaggingClassifier.html)
Two-phase learning: we first train a model on resampled balanced data. Then, we finetune our model on the original data.
Focal loss (adjusts the loss so that samples the model finds harder, i.e., those with a lower predicted probability of the correct class, receive a higher weight).
Some additional points on class imbalance worth noting:
Class imbalance in binary classification is a much easier problem than class imbalance in multiclass classification (p. 105, Huyen (2022))
Ding et al. (2017) showed that very deep neural networks (with "very deep" meaning more than 10 layers back in 2017) performed much better on imbalanced data than shallower neural nets.
C: In terms of metrics, accuracy will be misleading, since a majority classifier in our example that blindly predicts “negative” will have an accuracy of 99%. Checking the confusion matrix is a good start, but such a matrix (along with precision, recall, and F-scores) is threshold-based.
A better approach is to evaluate the predictive performance of our model with probability-based, threshold-free metrics:
The area under the ROC curve is a good start, but it tends to be overly optimistic in highly imbalanced datasets as it includes the False Positive Rate in its calculation (i.e., it includes the total number of negatives in the denominator of False Positive Rate).
The area under the precision-recall curve (AUPR) is a better metric, as it focuses directly on how the model is doing on the minority class through precision and recall.
Finally, besides AUPR, accuracy can still be a good metric if we compute it for each class individually. Similarly, F-score, precision, and recall measured with respect to the minority class can also be good metrics.
The following is an AI-enhanced solution that took the original solution above as input.
This question highlights the implications of class imbalance in data science and machine learning tasks. Below are detailed responses for each part.
A: Class imbalance can indeed be problematic for various reasons:
Lack of sufficient signal for identifying minority classes can turn the problem into few-shot learning, making it challenging for models to learn meaningful patterns.
Relying on simple heuristics due to imbalance might result in models not capturing true data patterns.
Misclassification costs are often asymmetric, particularly when errors involve minority classes, such as wrongly classifying fraudulent transactions.
Performance evaluation becomes complex, as traditional metrics may give a false sense of model accuracy.
B: There is no one-size-fits-all strategy, but possible methods include:
Increasing the minority class instances using SMOTE.
Applying boosting techniques like XGBoost, which handle imbalance reasonably well.
Acquiring more data to balance classes.
Implementing an Easy Ensemble approach or using class weighting in loss functions.
Cost-sensitive learning and utilizing specific loss functions like Focal Loss.
Employing ensemble models with varied class ratios or bagging with random undersampling.
Introducing two-phase learning strategies.
C: For evaluating model performance:
Standard accuracy can be misleading; instead, assess individual class metrics.
AUC-PR (Area under Precision-Recall curve) effectively measures minority class performance.
Incorporating metrics like class-specific accuracy, precision, recall, and F-score provides a more comprehensive evaluation.
F1, precision, and recall vary depending on which class is designated as positive, highlighting their asymmetric nature.