For this part of the assignment we will observe the misclassification rates at various thresholds using an unbalanced dataset. In particular, we use the sentiment_25_75.arff file, which is included in the data/assignment2data folder of the course repository.
In general, as the threshold increases the FNR increases while the FPR decreases.
The maximum accuracy of 0.7503 is obtained at a threshold of 0.5693 (a threshold of 0.5551 ties with the exact same accuracy). However, if we look at the confusion matrix we can see that it achieves this essentially by classifying almost everything as negative, since 75% of the data is negative.
|  | Pred N | Pred P |
|---|---|---|
| Actual N | 2238 | 12 |
| Actual P | 737 | 13 |
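As a rough sketch of how a confusion matrix like this can be produced for a given threshold (assuming we already have the true 0/1 labels `y_true` and the predicted positive-class probabilities `p_pos` as NumPy arrays; the helper name is illustrative and not part of the assignment code):

```python
import numpy as np

def confusion_at_threshold(y_true, p_pos, threshold):
    """Predict positive when p_pos >= threshold and tally the confusion matrix."""
    y_pred = (p_pos >= threshold).astype(int)
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    accuracy = (tn + tp) / len(y_true)
    return (tn, fp, fn, tp), accuracy

# With a high threshold on a 75/25 negative/positive dataset, almost every
# instance falls below the cutoff and is predicted negative, so an accuracy
# near 0.75 can hide a classifier that finds almost no positives.
# counts, acc = confusion_at_threshold(y_true, p_pos, 0.5693)
```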
The best threshold is defined as the one that balances the FNR and FPR while keeping the FNR as low as possible. We will first visualize these rates as the threshold increases.
*Plot of FPR and FNR against increasing values of the threshold.*
As observed earlier, as the threshold increases the FNR increases while the FPR decreases. We can see that the value we want lies somewhere just over 0.2. We will look at the actual data to get the exact numbers.
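The table below is sorted by the absolute difference between FNR and FPR (the AbsDif column). A minimal sketch of how that ranking could be computed, assuming the same `y_true` and `p_pos` arrays as above and a list of candidate thresholds (in practice these values come straight out of the model's threshold metrics):

```python
import numpy as np

def fnr_fpr_ranking(y_true, p_pos, thresholds):
    """Return (threshold, FNR, FPR, |FNR - FPR|) rows, smallest gap first."""
    rows = []
    for t in thresholds:
        y_pred = (p_pos >= t).astype(int)
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fnr = fn / (fn + tp)   # miss rate on the positive class
        fpr = fp / (fp + tn)   # false-alarm rate on the negative class
        rows.append((t, fnr, fpr, abs(fnr - fpr)))
    return sorted(rows, key=lambda r: r[3])
```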
| Threshold | FNR | FPR | AbsDif |
|---|---|---|---|
| 0.2180 | 0.4253 | 0.3947 | 0.0306 |
| 0.2189 | 0.4280 | 0.3942 | 0.0338 |
| 0.2200 | 0.4320 | 0.3911 | 0.0409 |
| 0.2210 | 0.4387 | 0.3822 | 0.0565 |
| 0.2218 | 0.4400 | 0.3804 | 0.0596 |
| 0.2229 | 0.4400 | 0.3796 | 0.0604 |
| 0.2249 | 0.4400 | 0.3787 | 0.0613 |
| 0.2261 | 0.4400 | 0.3729 | 0.0671 |
| 0.2273 | 0.4427 | 0.3698 | 0.0729 |
| 0.2292 | 0.4467 | 0.3667 | 0.0800 |
| 0.2171 | 0.3693 | 0.4622 | 0.0929 |
| 0.2300 | 0.4853 | 0.3089 | 0.1764 |
| 0.2309 | 0.4867 | 0.3076 | 0.1791 |
| 0.2313 | 0.4893 | 0.3071 | 0.1822 |
| 0.2166 | 0.3040 | 0.5307 | 0.2267 |
| 0.2152 | 0.3040 | 0.5324 | 0.2284 |
| 0.2147 | 0.3027 | 0.5338 | 0.2311 |
| 0.2139 | 0.3000 | 0.5351 | 0.2351 |
| 0.2124 | 0.2973 | 0.5387 | 0.2414 |
| 0.2119 | 0.2907 | 0.5431 | 0.2524 |
Based on this data I would simply select the top row, a threshold of 0.2180. While a threshold of 0.2171 results in a lower FNR, its gap between FNR and FPR is nearly 0.10 (almost ten percentage points), roughly three times the gap at 0.2180. The full set of metrics at the selected threshold is shown below.
| threshold | f1 | f2 | f0point5 | accuracy | precision | absolute_MCC | min_per_class_accuracy | tns | fns | fps | tps | tnr | fnr | fpr | tpr | idx | eAccuracy | AbsDif |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.218 | 0.4166 | 0.499 | 0.3576 | 0.5977 | 0.3268 | 0.157 | 0.5747 | 1362 | 319 | 888 | 431 | 0.6053 | 0.4253 | 0.3947 | 0.5747 | 258 | 0.5976667 | 0.0306 |
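As a quick sanity check on this row, the rates follow directly from the raw counts: FNR = fns / (fns + tps) = 319 / 750 ≈ 0.4253, FPR = fps / (fps + tns) = 888 / 2250 ≈ 0.3947, and accuracy = (tns + tps) / 3000 = 1793 / 3000 ≈ 0.5977.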
For this part of the assignment we will look at ways to offset some of the effects of class imbalance. In particular, we will use Random Under-Sampling (RUS) to make the dataset more balanced and observe the resulting AUC when the resampled datasets are used to train a Random Forest classifier.
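A minimal sketch of what the RUS-then-train pipeline could look like with pandas and scikit-learn (the assignment's actual tooling may differ; `df`, its `label` column, the 70/30 split, and all other parameters below are illustrative assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def random_under_sample(df, label_col="label", pos_frac=0.5, seed=42):
    """Randomly drop majority-class rows until positives make up pos_frac of the data."""
    pos = df[df[label_col] == 1]
    neg = df[df[label_col] == 0]
    # Keep n_neg negatives so that len(pos) / (len(pos) + n_neg) == pos_frac.
    n_neg = int(len(pos) * (1 - pos_frac) / pos_frac)
    neg_sampled = neg.sample(n=min(n_neg, len(neg)), random_state=seed)
    return pd.concat([pos, neg_sampled]).sample(frac=1, random_state=seed)

# Under-sample the training split only, then score AUC on the untouched test split.
train, test = train_test_split(df, test_size=0.3, stratify=df["label"], random_state=42)
train_rus = random_under_sample(train, pos_frac=0.5)   # 50/50 ratio; 0.35 gives 35/65
X_tr, y_tr = train_rus.drop(columns=["label"]), train_rus["label"]
X_te, y_te = test.drop(columns=["label"]), test["label"]
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```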
| Dataset | RUS Ratio | AUC |
|---|---|---|
| 05/95 | 50/50 | 0.614501 |
| 05/95 | 35/65 | 0.606096 |
| 05/95 | NA | 0.581108 |
| 15/85 | 50/50 | 0.653665 |
| 15/85 | 35/65 | 0.630403 |
| 15/85 | NA | 0.594640 |
| 25/75 | 50/50 | 0.666885 |
| 25/75 | 35/65 | 0.662549 |
| 25/75 | NA | 0.632107 |
We can see that for every dataset, RUS with a 50/50 ratio performs best, followed by 35/65, and finally no sampling at all. This indicates that RUS helps across all of these class distributions.
| Dataset | RUS Ratio | AUC |
|---|---|---|
| 25/75 | 50/50 | 0.666885 |
| 25/75 | 35/65 | 0.662549 |
| 15/85 | 50/50 | 0.653665 |
| 25/75 | NA | 0.632107 |
| 15/85 | 35/65 | 0.630403 |
| 05/95 | 50/50 | 0.614501 |
| 05/95 | 35/65 | 0.606096 |
| 15/85 | NA | 0.594640 |
| 05/95 | NA | 0.581108 |
When ordering all combinations by AUC, we see that 50/50 sampling on the 25/75 dataset achieves the highest AUC, followed closely by 35/65 sampling on the same dataset and then 50/50 sampling on the 15/85 dataset. Interestingly, 50/50 sampling on the 15/85 dataset yields a higher AUC than no sampling on the 25/75 dataset. So even though the 25/75 dataset has 300 additional instances of the positive class on which to train, applying RUS to the 15/85 dataset still performed better. This illustrates the value of RUS on imbalanced datasets quite well.