For this part of the assignment we will observe the misclassification rates at various thresholds using an unbalanced dataset. In particular, we use the sentiment_25_75.arff file, which is included in the data/assignment2data folder of the course repository.
In general, as the threshold increases the FNR increases while the FPR decreases.
The maximum accuracy of 0.7503 is obtained at a threshold of 0.5693 (a threshold of 0.5551 ties with the exact same accuracy). However, if we look at the confusion matrix we can see that it achieves this essentially by classifying almost everything as negative, since 75% of the data is negative.
|  | Pred N | Pred P |
|---|---|---|
| Actual N | 2238 | 12 |
| Actual P | 737 | 13 |
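As a rough sketch of how a confusion matrix like this can be produced for a given threshold (assuming we already have the true 0/1 labels `y_true` and the predicted positive-class probabilities `p_pos` as NumPy arrays; the helper name is illustrative and not part of the assignment code):

```python
import numpy as np

def confusion_at_threshold(y_true, p_pos, threshold):
    """Predict positive when p_pos >= threshold and tally the confusion matrix."""
    y_pred = (p_pos >= threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    accuracy = (tn + tp) / len(y_true)
    return (tn, fp, fn, tp), accuracy

# With a high threshold on a 75/25 negative/positive dataset, almost every
# instance falls below the cutoff and is predicted negative, so an accuracy
# near 0.75 can hide a classifier that finds almost no positives.
# counts, acc = confusion_at_threshold(y_true, p_pos, 0.5693)
```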
The best threshold is defined as the one that balances the FNR and FPR while keeping the FNR as low as possible. We will first visualize these rates as the threshold increases.
*Plot of FPR and FNR against increasing values of the threshold.*
As observed earlier, as the threshold increases the FNR increases while the FPR decreases. We can see that the value we want lies somewhere just over 0.2. We will look at the actual data to get the exact numbers.
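The table below is sorted by the absolute difference between FNR and FPR (the AbsDif column). A minimal sketch of how that ranking could be computed, assuming the same `y_true` and `p_pos` arrays as above and a list of candidate thresholds (in practice these values come straight out of the model's threshold metrics):

```python
import numpy as np

def fnr_fpr_ranking(y_true, p_pos, thresholds):
    """Return (threshold, FNR, FPR, |FNR - FPR|) rows, smallest gap first."""
    rows = []
    for t in thresholds:
        y_pred = (p_pos >= t).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fnr = fn / (fn + tp)   # miss rate on the positive class
        fpr = fp / (fp + tn)   # false-alarm rate on the negative class
        rows.append((t, fnr, fpr, abs(fnr - fpr)))
    return sorted(rows, key=lambda r: r[3])
```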
| Threshold | FNR | FPR | AbsDif |
|---|---|---|---|
| 0.2180 | 0.4253 | 0.3947 | 0.0306 |
| 0.2189 | 0.4280 | 0.3942 | 0.0338 |
| 0.2200 | 0.4320 | 0.3911 | 0.0409 |
| 0.2210 | 0.4387 | 0.3822 | 0.0565 |
| 0.2218 | 0.4400 | 0.3804 | 0.0596 |
| 0.2229 | 0.4400 | 0.3796 | 0.0604 |
| 0.2249 | 0.4400 | 0.3787 | 0.0613 |
| 0.2261 | 0.4400 | 0.3729 | 0.0671 |
| 0.2273 | 0.4427 | 0.3698 | 0.0729 |
| 0.2292 | 0.4467 | 0.3667 | 0.0800 |
| 0.2171 | 0.3693 | 0.4622 | 0.0929 |
| 0.2300 | 0.4853 | 0.3089 | 0.1764 |
| 0.2309 | 0.4867 | 0.3076 | 0.1791 |
| 0.2313 | 0.4893 | 0.3071 | 0.1822 |
| 0.2166 | 0.3040 | 0.5307 | 0.2267 |
| 0.2152 | 0.3040 | 0.5324 | 0.2284 |
| 0.2147 | 0.3027 | 0.5338 | 0.2311 |
| 0.2139 | 0.3000 | 0.5351 | 0.2351 |
| 0.2124 | 0.2973 | 0.5387 | 0.2414 |
| 0.2119 | 0.2907 | 0.5431 | 0.2524 |
Based on this data I would simply select the top row, a threshold of 0.2180. While a threshold of 0.2171 results in a lower FNR, its gap between FNR and FPR is nearly 0.10 (almost ten percentage points), roughly three times the gap at 0.2180. The full set of metrics at the selected threshold is shown below.
| threshold | f1 | f2 | f0point5 | accuracy | precision | absolute_MCC | min_per_class_accuracy | tns | fns | fps | tps | tnr | fnr | fpr | tpr | idx | eAccuracy | AbsDif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.218 | 0.4166 | 0.499 | 0.3576 | 0.5977 | 0.3268 | 0.157 | 0.5747 | 1362 | 319 | 888 | 431 | 0.6053 | 0.4253 | 0.3947 | 0.5747 | 258 | 0.5976667 | 0.0306 |
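As a quick sanity check on this row, the rates follow directly from the raw counts: FNR = fns / (fns + tps) = 319 / 750 ≈ 0.4253, FPR = fps / (fps + tns) = 888 / 2250 ≈ 0.3947, and accuracy = (tns + tps) / 3000 = 1793 / 3000 ≈ 0.5977.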
For this part of the assignment we will look at ways to offset some of the effects of class imbalance. In particular, we will use Random Under-Sampling (RUS) to make the dataset more balanced and observe the resulting AUC when the resampled datasets are used to train a Random Forest classifier.
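A minimal sketch of what the RUS-then-train pipeline could look like with pandas and scikit-learn (the assignment's actual tooling may differ; `df`, its `label` column, the 70/30 split, and all other parameters below are illustrative assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def random_under_sample(df, label_col="label", pos_frac=0.5, seed=42):
    """Randomly drop majority-class rows until positives make up pos_frac of the data."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Keep n_neg negatives so that len(pos) / (len(pos) + n_neg) == pos_frac.
    n_neg = int(len(pos) * (1 - pos_frac) / pos_frac)
    neg_sampled = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    return pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)

# Under-sample the training split only, then score AUC on the untouched test split.
train, test = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
train_rus = random_under_sample(train, pos_frac=0.5)   # 50/50 ratio; 0.35 gives 35/65
X_tr, y_tr = train_rus.drop(columns=["label"]), train_rus["label"]
X_te, y_te = test.drop(columns=["label"]), test["label"]
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```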
| Dataset | RUS Ratio | AUC |
|---|---|---|
| 05/95 | 50/50 | 0.614501 |
| 05/95 | 35/65 | 0.606096 |
| 05/95 | NA | 0.581108 |
| 15/85 | 50/50 | 0.653665 |
| 15/85 | 35/65 | 0.630403 |
| 15/85 | NA | 0.594640 |
| 25/75 | 50/50 | 0.666885 |
| 25/75 | 35/65 | 0.662549 |
| 25/75 | NA | 0.632107 |
We can see that for every dataset, RUS with a 50/50 ratio performs best, followed by 35/65, and finally no sampling at all. This indicates that RUS helps across all of these class distributions.
| Dataset | RUS Ratio | AUC |
|---|---|---|
| 25/75 | 50/50 | 0.666885 |
| 25/75 | 35/65 | 0.662549 |
| 15/85 | 50/50 | 0.653665 |
| 25/75 | NA | 0.632107 |
| 15/85 | 35/65 | 0.630403 |
| 05/95 | 50/50 | 0.614501 |
| 05/95 | 35/65 | 0.606096 |
| 15/85 | NA | 0.594640 |
| 05/95 | NA | 0.581108 |
When ordering all combinations by AUC, we see that 50/50 sampling on the 25/75 dataset achieves the highest AUC, followed closely by 35/65 sampling on the same dataset and then 50/50 sampling on the 15/85 dataset. Interestingly, 50/50 sampling on the 15/85 dataset yields a higher AUC than no sampling on the 25/75 dataset. So even though the 25/75 dataset has 300 additional instances of the positive class on which to train, applying RUS to the 15/85 dataset still performed better. This illustrates the value of RUS on imbalanced datasets quite well.