1 Balancing Misclassification Rates

For this part of the assignment we observe the misclassification rates at various thresholds on an unbalanced dataset. In particular, we use the sentiment_25_75.arff file, which is included in the data/assignment2data folder of the course repository.

1.2 Max accuracy

The maximum accuracy of 0.7503 is obtained at a threshold of 0.5693; a threshold of 0.5551 ties it with exactly the same accuracy. However, the confusion matrix shows that the model reaches this accuracy by classifying almost everything as negative: only 25 of the 3,000 instances are predicted positive, and since 75% of the data is negative, that alone is enough to score about 0.75.

Confusion matrix for a threshold of 0.5693
            Pred N   Pred P
Actual N      2238       12
Actual P       737       13
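
To make the point about accuracy concrete, the rates can be recomputed from the four cells of the confusion matrix above. This is a minimal sketch in Python; the variable names are chosen here for illustration and are not taken from the original analysis.

```python
# Cells of the confusion matrix at threshold 0.5693 (rows = actual class).
tn, fp = 2238, 12   # actual negatives: predicted N, predicted P
fn, tp = 737, 13    # actual positives: predicted N, predicted P

total = tn + fp + fn + tp
accuracy = (tn + tp) / total   # share of correct predictions
fnr = fn / (fn + tp)           # positives that were missed
fpr = fp / (fp + tn)           # negatives flagged as positive

print(f"accuracy={accuracy:.4f} fnr={fnr:.4f} fpr={fpr:.4f}")
```

The accuracy matches the reported 0.7503, but the FNR is about 0.98: nearly every positive instance is missed.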

1.3 Selecting the “optimal” threshold

The best threshold is defined as the one that balances the FNR and FPR while keeping the FNR as low as possible. We first visualize both rates as the threshold increases.

Plot of FPR and FNR against increasing values of threshold


As observed earlier, the FNR increases and the FPR decreases as the threshold increases. The two curves cross at a threshold just over 0.2, which is the region we want. We now look at the actual data to get exact numbers.
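
The opposite movement of the two rates can be reproduced with a small threshold sweep. The sketch below uses toy scores invented purely for illustration (not the assignment data) and assumes an instance is predicted positive when its score is at or above the threshold; `fnr_fpr` is a hypothetical helper name.

```python
def fnr_fpr(scores, labels, threshold):
    """Predict positive when score >= threshold; return (FNR, FPR)."""
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return fn / n_pos, fp / n_neg

# Toy data, invented for illustration only.
scores = [0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.50, 0.60]
labels = [0, 0, 1, 0, 0, 1, 1, 0]

# One (threshold, FNR, FPR) triple per threshold in the sweep.
curve = [(t, *fnr_fpr(scores, labels, t)) for t in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)]
for t, fnr, fpr in curve:
    print(f"threshold={t:.1f} fnr={fnr:.3f} fpr={fpr:.3f}")
```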

Top 20 closest values of FNR and FPR
threshold fnr fpr AbsDif
0.2180 0.4253 0.3947 0.0306
0.2189 0.4280 0.3942 0.0338
0.2200 0.4320 0.3911 0.0409
0.2210 0.4387 0.3822 0.0565
0.2218 0.4400 0.3804 0.0596
0.2229 0.4400 0.3796 0.0604
0.2249 0.4400 0.3787 0.0613
0.2261 0.4400 0.3729 0.0671
0.2273 0.4427 0.3698 0.0729
0.2292 0.4467 0.3667 0.0800
0.2171 0.3693 0.4622 0.0929
0.2300 0.4853 0.3089 0.1764
0.2309 0.4867 0.3076 0.1791
0.2313 0.4893 0.3071 0.1822
0.2166 0.3040 0.5307 0.2267
0.2152 0.3040 0.5324 0.2284
0.2147 0.3027 0.5338 0.2311
0.2139 0.3000 0.5351 0.2351
0.2124 0.2973 0.5387 0.2414
0.2119 0.2907 0.5431 0.2524

Based on this data, I would simply select the top row, a threshold of 0.2180. A threshold of 0.2171 gives a lower FNR (0.3693), but the gap between its FNR and FPR is 0.0929, almost 10 percentage points, compared with 0.0306 at 0.2180.
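
The selection rule can be sketched as picking the row with the smallest absolute difference between FNR and FPR. The example below uses a handful of rows copied from the table above; it is one plausible way to express the rule in Python, not necessarily how the table was originally produced.

```python
# (threshold, fnr, fpr) rows copied from the table above.
rows = [
    (0.2180, 0.4253, 0.3947),
    (0.2189, 0.4280, 0.3942),
    (0.2200, 0.4320, 0.3911),
    (0.2171, 0.3693, 0.4622),
    (0.2166, 0.3040, 0.5307),
]

def abs_dif(row):
    """Absolute gap between FNR and FPR for one row."""
    _, fnr, fpr = row
    return abs(fnr - fpr)

# The threshold whose FNR and FPR are closest to each other.
best = min(rows, key=abs_dif)
print(best[0], round(abs_dif(best), 4))
```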

threshold f1 f2 f0point5 accuracy precision absolute_MCC min_per_class_accuracy tns fns fps tps tnr fnr fpr tpr idx eAccuracy AbsDif
0.218 0.4166 0.499 0.3576 0.5977 0.3268 0.157 0.5747 1362 319 888 431 0.6053 0.4253 0.3947 0.5747 258 0.5976667 0.0306
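
For readers unfamiliar with the column names, the F-scores and the MCC in this row can be recomputed from the four counts (tns, fns, fps, tps). The sketch below uses the standard F-beta and Matthews correlation formulas; `f_beta` is a helper name introduced here for illustration.

```python
import math

def f_beta(precision, recall, beta):
    """Standard F-beta score: beta > 1 weights recall more heavily."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Counts from the row for threshold 0.218.
tns, fns, fps, tps = 1362, 319, 888, 431

precision = tps / (tps + fps)
recall = tps / (tps + fns)          # same as tpr
f1 = f_beta(precision, recall, 1.0)
f2 = f_beta(precision, recall, 2.0)
f0point5 = f_beta(precision, recall, 0.5)

# Matthews correlation coefficient from the confusion-matrix counts.
mcc = (tps * tns - fps * fns) / math.sqrt(
    (tps + fps) * (tps + fns) * (tns + fps) * (tns + fns)
)

print(f"f1={f1:.4f} f2={f2:.4f} f0.5={f0point5:.4f} mcc={mcc:.4f}")
```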

2 Random Undersampling

For this part of the assignment we look at ways to offset some of the effects of class imbalance. In particular, we use Random Undersampling (RUS) to make each dataset more balanced and observe the resulting AUC when the resampled datasets are used to train a Random Forest classifier.
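
As a rough illustration of what RUS does, the sketch below keeps every positive instance and randomly drops negatives until the positives reach a target share. It assumes that a ratio such as 35/65 means 35% positive and 65% negative, and it uses synthetic labels sized like the 25/75 dataset (750 positives, 2,250 negatives); `random_undersample` is a hypothetical helper, not the routine used to produce the results in this section.

```python
import random

def random_undersample(labels, pos_share, seed=0):
    """Return indices keeping all positives and a random subset of negatives
    so that positives make up roughly `pos_share` of the result."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of negatives needed to hit the target share, capped at what exists.
    n_neg = min(round(len(pos) * (1 - pos_share) / pos_share), len(neg))
    kept = pos + rng.sample(neg, n_neg)
    rng.shuffle(kept)
    return kept

# Synthetic labels shaped like a 25/75 dataset of 3,000 instances.
labels = [1] * 750 + [0] * 2250

idx_50 = random_undersample(labels, 0.50)   # 50/50 ratio
idx_35 = random_undersample(labels, 0.35)   # 35/65 ratio
print(len(idx_50), len(idx_35))
```

Only the training data would be resampled this way; the classifier would then be fit on the rows selected by the returned indices.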

Full results grouped by dataset and ordered by AUC
Dataset Rus Ratio AUC
05/95 50/50 0.614501
05/95 35/65 0.606096
05/95 NA 0.581108
15/85 50/50 0.653665
15/85 35/65 0.630403
15/85 NA 0.594640
25/75 50/50 0.666885
25/75 35/65 0.662549
25/75 NA 0.632107

On every dataset, RUS with a 50/50 ratio performs best, followed by 35/65, with no sampling last. The 50/50 ratio improves AUC over no sampling by between about 0.03 and 0.06, so RUS helps on all of these datasets.

Full results ordered by AUC
Dataset Rus Ratio AUC
25/75 50/50 0.666885
25/75 35/65 0.662549
15/85 50/50 0.653665
25/75 NA 0.632107
15/85 35/65 0.630403
05/95 50/50 0.614501
05/95 35/65 0.606096
15/85 NA 0.594640
05/95 NA 0.581108

Ordering all combinations by AUC, 50/50 sampling on the 25/75 dataset achieves the highest AUC (0.6669), followed closely by 35/65 sampling on the same dataset (0.6625) and 50/50 sampling on the 15/85 dataset (0.6537). Interestingly, 50/50 sampling on the 15/85 dataset achieves a higher AUC than no sampling on the 25/75 dataset (0.6537 versus 0.6321). So even though the 25/75 dataset has 300 additional positive instances on which to train, RUS on the 15/85 dataset performed better. This illustrates how much RUS can help on imbalanced datasets.