Prepare two data frames with skewed class distributions to demonstrate the impact of imbalance. The training and test datasets below are both heavily skewed toward Class1:
##
## Class1 Class2
## 14278 722
## Training Dataset Nature: 0.9518667 0.04813333
## Test Dataset Nature: 0.9549333 0.04506667
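The preparation code isn't shown, but the variable names that appear in the tree output below (TwoFactor1/2, Linear01–Linear15, Nonlinear1–3, Noise1–8) match what `caret::twoClassSim` generates. A minimal sketch under that assumption; the intercept, `noiseVars`, and seed are guesses chosen to give roughly a 95/5 split:

```r
library(caret)

set.seed(1234)  # hypothetical seed
# A strongly negative intercept makes Class2 rare; -17 is a guess
my_cart_trng_skew_data <- twoClassSim(15000, intercept = -17,
                                      linearVars = 15, noiseVars = 8)
my_cart_test_skew_data <- twoClassSim(15000, intercept = -17,
                                      linearVars = 15, noiseVars = 8)

table(my_cart_trng_skew_data$Class)              # raw class counts
prop.table(table(my_cart_trng_skew_data$Class))  # training proportions
prop.table(table(my_cart_test_skew_data$Class))  # test proportions
```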
A decision tree is now trained on the skewed data to show the effect of the imbalance. The following summary describes the fitted tree:
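A minimal sketch of the fitting step, mirroring the `Call` shown in the output below (only the object name `treeimb` is an assumption, inferred from the `pred.treeimb` used later):

```r
library(rpart)

# Fit a CART model on the skewed training data; the formula mirrors the
# printed Call (the idiomatic form would simply be Class ~ .)
treeimb <- rpart(my_cart_trng_skew_data$Class ~ .,
                 data = my_cart_trng_skew_data)
summary(treeimb)  # produces the output below
```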
## Call:
## rpart(formula = my_cart_trng_skew_data$Class ~ ., data = my_cart_trng_skew_data)
## n= 15000
##
## CP nsplit rel error xerror xstd
## 1 0.09279778 0 1.0000000 1.0000000 0.03630943
## 2 0.04293629 1 0.9072022 0.9141274 0.03479074
## 3 0.02908587 4 0.7770083 0.8379501 0.03337342
## 4 0.01000000 6 0.7188366 0.8130194 0.03289375
##
## Variable importance
## TwoFactor1 TwoFactor2 Linear02 Linear04 Linear11 Linear05
## 46 43 5 3 1 1
##
## Node number 1: 15000 observations, complexity param=0.09279778
## predicted class=Class1 expected loss=0.04813333 P(node) =1
## class counts: 14278 722
## probabilities: 0.952 0.048
## left son=2 (14705 obs) right son=3 (295 obs)
## Primary splits:
## TwoFactor1 < -2.894558 to the right, improve=192.410900, (0 missing)
## TwoFactor2 < 2.792813 to the left, improve=158.841000, (0 missing)
## Linear03 < 0.9462808 to the left, improve= 11.908630, (0 missing)
## Linear02 < -0.2482223 to the right, improve= 10.426050, (0 missing)
## Linear06 < -0.6187431 to the right, improve= 9.654109, (0 missing)
## Surrogate splits:
## TwoFactor2 < -4.423733 to the right, agree=0.981, adj=0.017, (0 split)
##
## Node number 2: 14705 observations, complexity param=0.04293629
## predicted class=Class1 expected loss=0.03679021 P(node) =0.9803333
## class counts: 14164 541
## probabilities: 0.963 0.037
## left son=4 (14350 obs) right son=5 (355 obs)
## Primary splits:
## TwoFactor2 < 2.792813 to the left, improve=166.726000, (0 missing)
## TwoFactor1 < 2.738351 to the left, improve= 49.024830, (0 missing)
## Linear03 < 0.9587968 to the left, improve= 12.173160, (0 missing)
## Linear02 < -0.1566939 to the right, improve= 8.210712, (0 missing)
## Linear06 < -0.6187431 to the right, improve= 6.205716, (0 missing)
## Surrogate splits:
## TwoFactor1 < 4.062251 to the left, agree=0.977, adj=0.031, (0 split)
##
## Node number 3: 295 observations, complexity param=0.02908587
## predicted class=Class2 expected loss=0.3864407 P(node) =0.01966667
## class counts: 114 181
## probabilities: 0.386 0.614
## left son=6 (157 obs) right son=7 (138 obs)
## Primary splits:
## TwoFactor1 < -3.395955 to the right, improve=21.853960, (0 missing)
## TwoFactor2 < -1.566998 to the right, improve=13.582610, (0 missing)
## Linear06 < 0.4925175 to the right, improve=12.009620, (0 missing)
## Linear04 < 0.09267313 to the right, improve=10.957430, (0 missing)
## Linear07 < 0.5664922 to the left, improve= 6.985292, (0 missing)
## Surrogate splits:
## TwoFactor2 < -2.588363 to the right, agree=0.617, adj=0.181, (0 split)
## Linear11 < 0.1288809 to the left, agree=0.580, adj=0.101, (0 split)
## Noise8 < -0.9216182 to the right, agree=0.580, adj=0.101, (0 split)
## Linear05 < -0.6381934 to the right, agree=0.573, adj=0.087, (0 split)
## Linear13 < 0.2060783 to the right, agree=0.569, adj=0.080, (0 split)
##
## Node number 4: 14350 observations
## predicted class=Class1 expected loss=0.02494774 P(node) =0.9566667
## class counts: 13992 358
## probabilities: 0.975 0.025
##
## Node number 5: 355 observations, complexity param=0.04293629
## predicted class=Class2 expected loss=0.484507 P(node) =0.02366667
## class counts: 172 183
## probabilities: 0.485 0.515
## left son=10 (277 obs) right son=11 (78 obs)
## Primary splits:
## TwoFactor2 < 3.590468 to the left, improve=29.16555, (0 missing)
## TwoFactor1 < 2.752161 to the left, improve=18.10634, (0 missing)
## Linear02 < 0.09007999 to the right, improve=16.65899, (0 missing)
## Linear07 < -0.3725718 to the left, improve=12.32738, (0 missing)
## Linear08 < 0.1297837 to the right, improve=11.49439, (0 missing)
## Surrogate splits:
## TwoFactor1 < 4.123151 to the left, agree=0.792, adj=0.051, (0 split)
## Linear03 < 2.307691 to the left, agree=0.792, adj=0.051, (0 split)
## Linear02 < -2.452591 to the right, agree=0.786, adj=0.026, (0 split)
## Linear05 < -2.078757 to the right, agree=0.786, adj=0.026, (0 split)
## Noise1 < -2.211655 to the right, agree=0.786, adj=0.026, (0 split)
##
## Node number 6: 157 observations, complexity param=0.02908587
## predicted class=Class1 expected loss=0.433121 P(node) =0.01046667
## class counts: 89 68
## probabilities: 0.567 0.433
## left son=12 (100 obs) right son=13 (57 obs)
## Primary splits:
## Linear04 < -0.3544217 to the right, improve=11.283960, (0 missing)
## Linear06 < 0.4925175 to the right, improve= 8.342101, (0 missing)
## Linear07 < -0.5636414 to the left, improve= 6.055980, (0 missing)
## Linear08 < -0.6013069 to the right, improve= 5.660513, (0 missing)
## Linear05 < 1.528027 to the left, improve= 5.417689, (0 missing)
## Surrogate splits:
## Noise5 < 1.29194 to the left, agree=0.682, adj=0.123, (0 split)
## Linear15 < 0.9260273 to the left, agree=0.675, adj=0.105, (0 split)
## TwoFactor2 < -0.1970952 to the left, agree=0.656, adj=0.053, (0 split)
## Linear02 < -1.36453 to the right, agree=0.656, adj=0.053, (0 split)
## Linear07 < -2.134443 to the right, agree=0.656, adj=0.053, (0 split)
##
## Node number 7: 138 observations
## predicted class=Class2 expected loss=0.1811594 P(node) =0.0092
## class counts: 25 113
## probabilities: 0.181 0.819
##
## Node number 10: 277 observations, complexity param=0.04293629
## predicted class=Class1 expected loss=0.4079422 P(node) =0.01846667
## class counts: 164 113
## probabilities: 0.592 0.408
## left son=20 (173 obs) right son=21 (104 obs)
## Primary splits:
## Linear02 < -0.2491237 to the right, improve=20.138540, (0 missing)
## Linear05 < 0.2395451 to the left, improve=11.137940, (0 missing)
## Linear08 < 0.1473945 to the right, improve=10.320440, (0 missing)
## TwoFactor1 < 2.709536 to the left, improve= 9.119340, (0 missing)
## Linear07 < -0.3715908 to the left, improve= 8.122135, (0 missing)
## Surrogate splits:
## Noise4 < -1.93987 to the right, agree=0.650, adj=0.067, (0 split)
## Linear07 < 1.867579 to the left, agree=0.646, adj=0.058, (0 split)
## Linear11 < 1.434563 to the left, agree=0.646, adj=0.058, (0 split)
## Linear04 < -1.87922 to the right, agree=0.639, adj=0.038, (0 split)
## Linear15 < 2.024757 to the left, agree=0.639, adj=0.038, (0 split)
##
## Node number 11: 78 observations
## predicted class=Class2 expected loss=0.1025641 P(node) =0.0052
## class counts: 8 70
## probabilities: 0.103 0.897
##
## Node number 12: 100 observations
## predicted class=Class1 expected loss=0.29 P(node) =0.006666667
## class counts: 71 29
## probabilities: 0.710 0.290
##
## Node number 13: 57 observations
## predicted class=Class2 expected loss=0.3157895 P(node) =0.0038
## class counts: 18 39
## probabilities: 0.316 0.684
##
## Node number 20: 173 observations
## predicted class=Class1 expected loss=0.2601156 P(node) =0.01153333
## class counts: 128 45
## probabilities: 0.740 0.260
##
## Node number 21: 104 observations
## predicted class=Class2 expected loss=0.3461538 P(node) =0.006933333
## class counts: 36 68
## probabilities: 0.346 0.654
The tree fitted above is now used to predict on the test data, and precision, recall, and the F-measure are calculated. These metrics give an interesting picture. With a threshold of 0.5, precision = 0.673 says there are relatively few false positives. Recall, however, is very low (0.391), which means a high number of false negatives: most Class2 examples are missed. The F-value is also very low, confirming the model's inaccuracy on the minority class.
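A sketch of the evaluation step, matching the `accuracy.meas` Call shown below; `predict.rpart` returns a two-column probability matrix for classification trees, and column 2 holds P(Class2):

```r
library(ROSE)

# Class-probability predictions on the held-out test set
pred.treeimb <- predict(treeimb, newdata = my_cart_test_skew_data)

# Precision, recall and F at the default 0.5 threshold
accuracy.meas(response = my_cart_test_skew_data$Class,
              predicted = pred.treeimb[, 2])
```

Note that `accuracy.meas` reports F as precision × recall / (precision + recall), i.e. half the conventional F1; the 0.247 below corresponds to a standard F1 of about 0.49.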
##
## Call:
## accuracy.meas(response = my_cart_test_skew_data$Class, predicted = pred.treeimb[,
## 2])
##
## Examples are labelled as positive when predicted is greater than 0.5
##
## precision: 0.673
## recall: 0.391
## F: 0.247
Next we calculate the area under the ROC curve. The ROC curve is created by plotting the true positive rate (TPR, sensitivity) against the false positive rate (FPR, fall-out) at various threshold settings. The area under the curve here is 0.759, not far above the poor range of 0.6–0.7.
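The AUC is presumably computed with `ROSE::roc.curve`, which plots the curve and prints the area:

```r
# ROC curve and AUC for the model trained on the imbalanced data
roc.curve(my_cart_test_skew_data$Class, pred.treeimb[, 2])
```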
## Area under the curve (AUC): 0.759
Now we will balance the data by undersampling. This is the most popular and simplest technique for handling imbalanced data; we will use the ovun.sample function from the ROSE package. The original training dataset has 14,278 Class1 observations and only 722 Class2 observations, so we resample until the two class counts are roughly equal. The data is now much more balanced:
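A sketch of the resampling call. The printed counts (7171 + 7079 = 14250, with the Class2 count far above its original 722) imply the minority class was oversampled as well, which `ovun.sample` does with `method = "both"`; the method, `N`, and seed here are therefore assumptions inferred from the output:

```r
# Balance the training data with ROSE::ovun.sample; $data extracts the
# resampled data frame from the returned object
my_cart_trng_skew_data_undersampled <- ovun.sample(
  Class ~ ., data = my_cart_trng_skew_data,
  method = "both", N = 14250, seed = 1)$data

table(my_cart_trng_skew_data_undersampled$Class)
```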
##
## Class1 Class2
## 7171 7079
With the balanced data in hand, let's rerun the classification and examine the results.
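A sketch of the refit-and-evaluate step; the model object name `treeimb_balanced` is hypothetical, while `pred.treeimb_undersample` and the rpart formula follow the Calls printed below:

```r
# Refit the tree on the balanced training data
treeimb_balanced <- rpart(my_cart_trng_skew_data_undersampled$Class ~ .,
                          data = my_cart_trng_skew_data_undersampled)

# Evaluate on the untouched, still-skewed test set
pred.treeimb_undersample <- predict(treeimb_balanced,
                                    newdata = my_cart_test_skew_data)
accuracy.meas(response = my_cart_test_skew_data$Class,
              predicted = pred.treeimb_undersample[, 2])
summary(treeimb_balanced)
roc.curve(my_cart_test_skew_data$Class, pred.treeimb_undersample[, 2])
```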
##
## Call:
## accuracy.meas(response = my_cart_test_skew_data$Class, predicted = pred.treeimb_undersample[,
## 2])
##
## Examples are labelled as positive when predicted is greater than 0.5
##
## precision: 0.198
## recall: 0.879
## F: 0.161
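Precision collapses to 0.198, so the balanced model now produces many more false positives, but recall jumps to 0.879 and far fewer Class2 cases are missed. The summary of the tree fitted on the balanced data follows: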
## Call:
## rpart(formula = my_cart_trng_skew_data_undersampled$Class ~ .,
## data = my_cart_trng_skew_data_undersampled)
## n= 14250
##
## CP nsplit rel error xerror xstd
## 1 0.34983755 0 1.0000000 1.0000000 0.008431337
## 2 0.01243113 2 0.3003249 0.3018788 0.006020725
## 3 0.01000000 5 0.2630315 0.2813957 0.005847566
##
## Variable importance
## TwoFactor2 TwoFactor1 Linear03 Linear02 Linear07 Linear04
## 49 41 4 2 1 1
##
## Node number 1: 14250 observations, complexity param=0.3498375
## predicted class=Class1 expected loss=0.4967719 P(node) =1
## class counts: 7171 7079
## probabilities: 0.503 0.497
## left son=2 (11130 obs) right son=3 (3120 obs)
## Primary splits:
## TwoFactor1 < -2.162347 to the right, improve=1266.1590, (0 missing)
## TwoFactor2 < 1.883636 to the left, improve=1179.5610, (0 missing)
## Linear02 < 0.478949 to the right, improve= 293.0633, (0 missing)
## Linear06 < -0.6187431 to the right, improve= 278.4019, (0 missing)
## Linear03 < -0.2658186 to the left, improve= 272.5172, (0 missing)
## Surrogate splits:
## TwoFactor2 < -1.822519 to the right, agree=0.887, adj=0.484, (0 split)
## Nonlinear1 < 0.9903925 to the left, agree=0.782, adj=0.005, (0 split)
## Nonlinear3 < 0.001196535 to the right, agree=0.782, adj=0.004, (0 split)
## Noise1 < 2.94556 to the left, agree=0.782, adj=0.004, (0 split)
## Linear11 < -3.837981 to the right, agree=0.782, adj=0.004, (0 split)
##
## Node number 2: 11130 observations, complexity param=0.3498375
## predicted class=Class1 expected loss=0.3851752 P(node) =0.7810526
## class counts: 6843 4287
## probabilities: 0.615 0.385
## left son=4 (7533 obs) right son=5 (3597 obs)
## Primary splits:
## TwoFactor2 < 1.855876 to the left, improve=2257.0270, (0 missing)
## TwoFactor1 < 1.648306 to the left, improve=1048.3750, (0 missing)
## Linear03 < 0.9587968 to the left, improve= 316.3265, (0 missing)
## Linear02 < 0.08677152 to the right, improve= 274.6478, (0 missing)
## Linear07 < 0.1161027 to the left, improve= 216.9542, (0 missing)
## Surrogate splits:
## TwoFactor1 < 1.648306 to the left, agree=0.838, adj=0.499, (0 split)
## Linear04 < -2.230791 to the right, agree=0.681, adj=0.013, (0 split)
## Linear01 < 2.207985 to the left, agree=0.680, adj=0.009, (0 split)
## Linear02 < -3.306216 to the right, agree=0.679, adj=0.007, (0 split)
## Linear05 < 2.878201 to the left, agree=0.679, adj=0.006, (0 split)
##
## Node number 3: 3120 observations
## predicted class=Class2 expected loss=0.1051282 P(node) =0.2189474
## class counts: 328 2792
## probabilities: 0.105 0.895
##
## Node number 4: 7533 observations, complexity param=0.01243113
## predicted class=Class1 expected loss=0.1651401 P(node) =0.5286316
## class counts: 6289 1244
## probabilities: 0.835 0.165
## left son=8 (5827 obs) right son=9 (1706 obs)
## Primary splits:
## Linear03 < 0.9399573 to the left, improve=245.2514, (0 missing)
## TwoFactor1 < -1.700381 to the right, improve=188.2020, (0 missing)
## Linear02 < -0.2958551 to the right, improve=144.7394, (0 missing)
## Linear04 < -0.9184932 to the right, improve=139.8880, (0 missing)
## Linear07 < 0.5650217 to the left, improve=133.3823, (0 missing)
## Surrogate splits:
## Linear04 < -3.608102 to the right, agree=0.776, adj=0.011, (0 split)
## Nonlinear3 < 0.9995222 to the left, agree=0.776, adj=0.009, (0 split)
## Linear08 < -2.749207 to the right, agree=0.776, adj=0.009, (0 split)
## Noise8 < 2.975199 to the left, agree=0.775, adj=0.008, (0 split)
## Linear02 < -2.660221 to the right, agree=0.774, adj=0.003, (0 split)
##
## Node number 5: 3597 observations
## predicted class=Class2 expected loss=0.1540172 P(node) =0.2524211
## class counts: 554 3043
## probabilities: 0.154 0.846
##
## Node number 8: 5827 observations
## predicted class=Class1 expected loss=0.09610434 P(node) =0.4089123
## class counts: 5267 560
## probabilities: 0.904 0.096
##
## Node number 9: 1706 observations, complexity param=0.01243113
## predicted class=Class1 expected loss=0.4009379 P(node) =0.1197193
## class counts: 1022 684
## probabilities: 0.599 0.401
## left son=18 (645 obs) right son=19 (1061 obs)
## Primary splits:
## Linear02 < 0.02311349 to the right, improve=128.60320, (0 missing)
## TwoFactor1 < -1.492157 to the right, improve=115.08770, (0 missing)
## Linear07 < 0.5621504 to the left, improve= 95.22832, (0 missing)
## Linear04 < -0.9165976 to the right, improve= 94.80444, (0 missing)
## Linear06 < 0.526447 to the right, improve= 94.62899, (0 missing)
## Surrogate splits:
## Nonlinear1 < -0.6436725 to the left, agree=0.643, adj=0.056, (0 split)
## Linear12 < 1.529133 to the right, agree=0.637, adj=0.040, (0 split)
## Linear08 < 0.9863967 to the right, agree=0.635, adj=0.034, (0 split)
## Linear06 < 0.9449652 to the right, agree=0.633, adj=0.029, (0 split)
## Linear09 < -1.383241 to the left, agree=0.632, adj=0.028, (0 split)
##
## Node number 18: 645 observations
## predicted class=Class1 expected loss=0.151938 P(node) =0.04526316
## class counts: 547 98
## probabilities: 0.848 0.152
##
## Node number 19: 1061 observations, complexity param=0.01243113
## predicted class=Class2 expected loss=0.4476909 P(node) =0.07445614
## class counts: 475 586
## probabilities: 0.448 0.552
## left son=38 (311 obs) right son=39 (750 obs)
## Primary splits:
## Linear07 < -0.09480701 to the left, improve=78.29273, (0 missing)
## Linear06 < 0.5389032 to the right, improve=64.83913, (0 missing)
## Linear08 < 0.6526716 to the right, improve=57.91726, (0 missing)
## Linear04 < -0.07703882 to the right, improve=51.51000, (0 missing)
## TwoFactor1 < -1.492157 to the right, improve=50.22675, (0 missing)
## Surrogate splits:
## Linear14 < -1.533601 to the left, agree=0.733, adj=0.090, (0 split)
## Linear03 < 2.457075 to the right, agree=0.730, adj=0.080, (0 split)
## Nonlinear2 < 0.01958602 to the left, agree=0.724, adj=0.058, (0 split)
## Noise4 < 1.680474 to the right, agree=0.723, adj=0.055, (0 split)
## TwoFactor1 < 2.430835 to the right, agree=0.722, adj=0.051, (0 split)
##
## Node number 38: 311 observations
## predicted class=Class1 expected loss=0.2540193 P(node) =0.02182456
## class counts: 232 79
## probabilities: 0.746 0.254
##
## Node number 39: 750 observations
## predicted class=Class2 expected loss=0.324 P(node) =0.05263158
## class counts: 243 507
## probabilities: 0.324 0.676
## Area under the curve (AUC): 0.880
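Balancing the training data raises the AUC from 0.759 to 0.880: by trading precision for recall, the resampled model separates the two classes far better than the tree trained on the raw skewed data.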