R Markdown

Prepare two data frames with a skewed class distribution, which will help us show the impact of class imbalance. Training and test datasets are prepared, and both of them are skewed.
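
A minimal sketch of how such a skewed dataset might have been simulated is shown below (the simulation chunk itself is not echoed). The variable names in the tree summaries further down (TwoFactor1, Linear01–Linear15, Nonlinear1–3, Noise1–Noise8, Class) match the output of caret::twoClassSim, so that function is assumed here; the seed, row count and intercept are assumptions as well.

```r
library(caret)   # assumed: twoClassSim() generates the TwoFactor*/Linear*/Noise* columns

set.seed(123)                                    # assumed seed
# A strongly negative intercept makes Class2 the rare class.
my_cart_trng_skew_data <- twoClassSim(15000, linearVars = 15,
                                      noiseVars = 8, intercept = -16)
my_cart_test_skew_data <- twoClassSim(15000, linearVars = 15,
                                      noiseVars = 8, intercept = -16)

table(my_cart_trng_skew_data$Class)              # raw class counts (training)
prop.table(table(my_cart_trng_skew_data$Class))  # training class proportions
prop.table(table(my_cart_test_skew_data$Class))  # test class proportions
```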

## 
## Class1 Class2 
##  14278    722
## Training Dataset Nature: 0.9518667 0.04813333
## Test Dataset Nature: 0.9549333 0.04506667

A decision tree will now be fitted; this will show us the impact of the skewed data. The tree is then plotted and summarised below.
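
The fitting chunk is not echoed; the sketch below reproduces the call shown in the summary that follows. The object name treeimb is inferred from the pred.treeimb object used later, and rpart.plot is an assumption used only to draw the tree.

```r
library(rpart)
library(rpart.plot)   # assumed, for plotting the fitted tree

# Fit a CART model on the skewed training data (same call as in the summary below).
treeimb <- rpart(my_cart_trng_skew_data$Class ~ ., data = my_cart_trng_skew_data)

rpart.plot(treeimb)   # plot the tree
summary(treeimb)      # detailed node-by-node summary shown below
```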

## Call:
## rpart(formula = my_cart_trng_skew_data$Class ~ ., data = my_cart_trng_skew_data)
##   n= 15000 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.09279778      0 1.0000000 1.0000000 0.03630943
## 2 0.04293629      1 0.9072022 0.9141274 0.03479074
## 3 0.02908587      4 0.7770083 0.8379501 0.03337342
## 4 0.01000000      6 0.7188366 0.8130194 0.03289375
## 
## Variable importance
## TwoFactor1 TwoFactor2   Linear02   Linear04   Linear11   Linear05 
##         46         43          5          3          1          1 
## 
## Node number 1: 15000 observations,    complexity param=0.09279778
##   predicted class=Class1  expected loss=0.04813333  P(node) =1
##     class counts: 14278   722
##    probabilities: 0.952 0.048 
##   left son=2 (14705 obs) right son=3 (295 obs)
##   Primary splits:
##       TwoFactor1 < -2.894558  to the right, improve=192.410900, (0 missing)
##       TwoFactor2 < 2.792813   to the left,  improve=158.841000, (0 missing)
##       Linear03   < 0.9462808  to the left,  improve= 11.908630, (0 missing)
##       Linear02   < -0.2482223 to the right, improve= 10.426050, (0 missing)
##       Linear06   < -0.6187431 to the right, improve=  9.654109, (0 missing)
##   Surrogate splits:
##       TwoFactor2 < -4.423733  to the right, agree=0.981, adj=0.017, (0 split)
## 
## Node number 2: 14705 observations,    complexity param=0.04293629
##   predicted class=Class1  expected loss=0.03679021  P(node) =0.9803333
##     class counts: 14164   541
##    probabilities: 0.963 0.037 
##   left son=4 (14350 obs) right son=5 (355 obs)
##   Primary splits:
##       TwoFactor2 < 2.792813   to the left,  improve=166.726000, (0 missing)
##       TwoFactor1 < 2.738351   to the left,  improve= 49.024830, (0 missing)
##       Linear03   < 0.9587968  to the left,  improve= 12.173160, (0 missing)
##       Linear02   < -0.1566939 to the right, improve=  8.210712, (0 missing)
##       Linear06   < -0.6187431 to the right, improve=  6.205716, (0 missing)
##   Surrogate splits:
##       TwoFactor1 < 4.062251   to the left,  agree=0.977, adj=0.031, (0 split)
## 
## Node number 3: 295 observations,    complexity param=0.02908587
##   predicted class=Class2  expected loss=0.3864407  P(node) =0.01966667
##     class counts:   114   181
##    probabilities: 0.386 0.614 
##   left son=6 (157 obs) right son=7 (138 obs)
##   Primary splits:
##       TwoFactor1 < -3.395955  to the right, improve=21.853960, (0 missing)
##       TwoFactor2 < -1.566998  to the right, improve=13.582610, (0 missing)
##       Linear06   < 0.4925175  to the right, improve=12.009620, (0 missing)
##       Linear04   < 0.09267313 to the right, improve=10.957430, (0 missing)
##       Linear07   < 0.5664922  to the left,  improve= 6.985292, (0 missing)
##   Surrogate splits:
##       TwoFactor2 < -2.588363  to the right, agree=0.617, adj=0.181, (0 split)
##       Linear11   < 0.1288809  to the left,  agree=0.580, adj=0.101, (0 split)
##       Noise8     < -0.9216182 to the right, agree=0.580, adj=0.101, (0 split)
##       Linear05   < -0.6381934 to the right, agree=0.573, adj=0.087, (0 split)
##       Linear13   < 0.2060783  to the right, agree=0.569, adj=0.080, (0 split)
## 
## Node number 4: 14350 observations
##   predicted class=Class1  expected loss=0.02494774  P(node) =0.9566667
##     class counts: 13992   358
##    probabilities: 0.975 0.025 
## 
## Node number 5: 355 observations,    complexity param=0.04293629
##   predicted class=Class2  expected loss=0.484507  P(node) =0.02366667
##     class counts:   172   183
##    probabilities: 0.485 0.515 
##   left son=10 (277 obs) right son=11 (78 obs)
##   Primary splits:
##       TwoFactor2 < 3.590468   to the left,  improve=29.16555, (0 missing)
##       TwoFactor1 < 2.752161   to the left,  improve=18.10634, (0 missing)
##       Linear02   < 0.09007999 to the right, improve=16.65899, (0 missing)
##       Linear07   < -0.3725718 to the left,  improve=12.32738, (0 missing)
##       Linear08   < 0.1297837  to the right, improve=11.49439, (0 missing)
##   Surrogate splits:
##       TwoFactor1 < 4.123151   to the left,  agree=0.792, adj=0.051, (0 split)
##       Linear03   < 2.307691   to the left,  agree=0.792, adj=0.051, (0 split)
##       Linear02   < -2.452591  to the right, agree=0.786, adj=0.026, (0 split)
##       Linear05   < -2.078757  to the right, agree=0.786, adj=0.026, (0 split)
##       Noise1     < -2.211655  to the right, agree=0.786, adj=0.026, (0 split)
## 
## Node number 6: 157 observations,    complexity param=0.02908587
##   predicted class=Class1  expected loss=0.433121  P(node) =0.01046667
##     class counts:    89    68
##    probabilities: 0.567 0.433 
##   left son=12 (100 obs) right son=13 (57 obs)
##   Primary splits:
##       Linear04 < -0.3544217 to the right, improve=11.283960, (0 missing)
##       Linear06 < 0.4925175  to the right, improve= 8.342101, (0 missing)
##       Linear07 < -0.5636414 to the left,  improve= 6.055980, (0 missing)
##       Linear08 < -0.6013069 to the right, improve= 5.660513, (0 missing)
##       Linear05 < 1.528027   to the left,  improve= 5.417689, (0 missing)
##   Surrogate splits:
##       Noise5     < 1.29194    to the left,  agree=0.682, adj=0.123, (0 split)
##       Linear15   < 0.9260273  to the left,  agree=0.675, adj=0.105, (0 split)
##       TwoFactor2 < -0.1970952 to the left,  agree=0.656, adj=0.053, (0 split)
##       Linear02   < -1.36453   to the right, agree=0.656, adj=0.053, (0 split)
##       Linear07   < -2.134443  to the right, agree=0.656, adj=0.053, (0 split)
## 
## Node number 7: 138 observations
##   predicted class=Class2  expected loss=0.1811594  P(node) =0.0092
##     class counts:    25   113
##    probabilities: 0.181 0.819 
## 
## Node number 10: 277 observations,    complexity param=0.04293629
##   predicted class=Class1  expected loss=0.4079422  P(node) =0.01846667
##     class counts:   164   113
##    probabilities: 0.592 0.408 
##   left son=20 (173 obs) right son=21 (104 obs)
##   Primary splits:
##       Linear02   < -0.2491237 to the right, improve=20.138540, (0 missing)
##       Linear05   < 0.2395451  to the left,  improve=11.137940, (0 missing)
##       Linear08   < 0.1473945  to the right, improve=10.320440, (0 missing)
##       TwoFactor1 < 2.709536   to the left,  improve= 9.119340, (0 missing)
##       Linear07   < -0.3715908 to the left,  improve= 8.122135, (0 missing)
##   Surrogate splits:
##       Noise4   < -1.93987   to the right, agree=0.650, adj=0.067, (0 split)
##       Linear07 < 1.867579   to the left,  agree=0.646, adj=0.058, (0 split)
##       Linear11 < 1.434563   to the left,  agree=0.646, adj=0.058, (0 split)
##       Linear04 < -1.87922   to the right, agree=0.639, adj=0.038, (0 split)
##       Linear15 < 2.024757   to the left,  agree=0.639, adj=0.038, (0 split)
## 
## Node number 11: 78 observations
##   predicted class=Class2  expected loss=0.1025641  P(node) =0.0052
##     class counts:     8    70
##    probabilities: 0.103 0.897 
## 
## Node number 12: 100 observations
##   predicted class=Class1  expected loss=0.29  P(node) =0.006666667
##     class counts:    71    29
##    probabilities: 0.710 0.290 
## 
## Node number 13: 57 observations
##   predicted class=Class2  expected loss=0.3157895  P(node) =0.0038
##     class counts:    18    39
##    probabilities: 0.316 0.684 
## 
## Node number 20: 173 observations
##   predicted class=Class1  expected loss=0.2601156  P(node) =0.01153333
##     class counts:   128    45
##    probabilities: 0.740 0.260 
## 
## Node number 21: 104 observations
##   predicted class=Class2  expected loss=0.3461538  P(node) =0.006933333
##     class counts:    36    68
##    probabilities: 0.346 0.654

The predictor obtained from the decision tree above is now used to predict on the test data, and precision, recall and the F-measure are calculated. These metrics provide an interesting interpretation. With a threshold of 0.5, precision = 0.673 says there are relatively few false positives. Recall, however, is very low, which means we have a high number of false negatives. The F-measure is also very low, suggesting the model is inaccurate on the minority class.
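
A sketch of this step: the accuracy.meas() call and the object names match the output below, while the predict() arguments are assumptions (for a classification tree, predict.rpart returns a matrix of class probabilities, of which column 2 is the probability of Class2).

```r
library(ROSE)   # provides accuracy.meas() and roc.curve()

# Class probabilities on the test data; column 2 corresponds to Class2.
pred.treeimb <- predict(treeimb, newdata = my_cart_test_skew_data)

# Precision, recall and F-measure at the default threshold of 0.5.
accuracy.meas(response  = my_cart_test_skew_data$Class,
              predicted = pred.treeimb[, 2])
```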

## 
## Call: 
## accuracy.meas(response = my_cart_test_skew_data$Class, predicted = pred.treeimb[, 
##     2])
## 
## Examples are labelled as positive when predicted is greater than 0.5 
## 
## precision: 0.673
## recall: 0.391
## F: 0.247

Now we calculate the area under the ROC curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings; it is thus the sensitivity as a function of the fall-out (false positive rate). The area under the ROC curve is 0.759, which is not far above the poor range of 0.6–0.7.
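
A sketch of the AUC computation, assuming ROSE's roc.curve():

```r
# ROC curve and AUC for the tree trained on the imbalanced data.
roc.curve(my_cart_test_skew_data$Class, pred.treeimb[, 2], plotit = TRUE)
```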

## Area under the curve (AUC): 0.759

Now we will do undersampling. This is the most popular and simplest technique for handling imbalanced data. We will use the ovun.sample function from the ROSE package for undersampling. The original training dataset has 14,278 observations of Class1 and 722 of Class2, so we want to reduce the Class1 count to something close to the Class2 count. I therefore set N, the total size of the undersampled data, to roughly twice the Class2 count (1260 here), so that all Class2 cases are kept along with a similar number of Class1 cases. The data is now more balanced.
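
A sketch of the undersampling call, assuming ovun.sample() with method = "under"; the N and seed here are assumptions (the balanced counts printed below are larger than the N discussed above, so the actual chunk may have used a different N or method = "both").

```r
# Undersample the majority class: keep all Class2 rows and sample Class1
# down until the result has N rows in total.
my_cart_trng_skew_data_undersampled <- ovun.sample(
  Class ~ ., data = my_cart_trng_skew_data,
  method = "under", N = 1260, seed = 1)$data   # N and seed assumed

table(my_cart_trng_skew_data_undersampled$Class)   # check the new class balance
```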

## 
## Class1 Class2 
##   7171   7079

Now that we have balanced data, let's run the classification again and look at the result.
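
A sketch of refitting the tree on the balanced data and evaluating it on the unchanged test set; the rpart call and the data/prediction names match the output below, while the tree object name and the predict() arguments are assumptions.

```r
# Refit the CART model on the undersampled (balanced) training data.
treeimb_undersample <- rpart(my_cart_trng_skew_data_undersampled$Class ~ .,
                             data = my_cart_trng_skew_data_undersampled)

# Predict class probabilities on the still-skewed test data.
pred.treeimb_undersample <- predict(treeimb_undersample,
                                    newdata = my_cart_test_skew_data)

# Precision / recall / F at threshold 0.5, followed by the tree summary.
accuracy.meas(response  = my_cart_test_skew_data$Class,
              predicted = pred.treeimb_undersample[, 2])
summary(treeimb_undersample)
```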

## 
## Call: 
## accuracy.meas(response = my_cart_test_skew_data$Class, predicted = pred.treeimb_undersample[, 
##     2])
## 
## Examples are labelled as positive when predicted is greater than 0.5 
## 
## precision: 0.198
## recall: 0.879
## F: 0.161

## Call:
## rpart(formula = my_cart_trng_skew_data_undersampled$Class ~ ., 
##     data = my_cart_trng_skew_data_undersampled)
##   n= 14250 
## 
##           CP nsplit rel error    xerror        xstd
## 1 0.34983755      0 1.0000000 1.0000000 0.008431337
## 2 0.01243113      2 0.3003249 0.3018788 0.006020725
## 3 0.01000000      5 0.2630315 0.2813957 0.005847566
## 
## Variable importance
## TwoFactor2 TwoFactor1   Linear03   Linear02   Linear07   Linear04 
##         49         41          4          2          1          1 
## 
## Node number 1: 14250 observations,    complexity param=0.3498375
##   predicted class=Class1  expected loss=0.4967719  P(node) =1
##     class counts:  7171  7079
##    probabilities: 0.503 0.497 
##   left son=2 (11130 obs) right son=3 (3120 obs)
##   Primary splits:
##       TwoFactor1 < -2.162347   to the right, improve=1266.1590, (0 missing)
##       TwoFactor2 < 1.883636    to the left,  improve=1179.5610, (0 missing)
##       Linear02   < 0.478949    to the right, improve= 293.0633, (0 missing)
##       Linear06   < -0.6187431  to the right, improve= 278.4019, (0 missing)
##       Linear03   < -0.2658186  to the left,  improve= 272.5172, (0 missing)
##   Surrogate splits:
##       TwoFactor2 < -1.822519   to the right, agree=0.887, adj=0.484, (0 split)
##       Nonlinear1 < 0.9903925   to the left,  agree=0.782, adj=0.005, (0 split)
##       Nonlinear3 < 0.001196535 to the right, agree=0.782, adj=0.004, (0 split)
##       Noise1     < 2.94556     to the left,  agree=0.782, adj=0.004, (0 split)
##       Linear11   < -3.837981   to the right, agree=0.782, adj=0.004, (0 split)
## 
## Node number 2: 11130 observations,    complexity param=0.3498375
##   predicted class=Class1  expected loss=0.3851752  P(node) =0.7810526
##     class counts:  6843  4287
##    probabilities: 0.615 0.385 
##   left son=4 (7533 obs) right son=5 (3597 obs)
##   Primary splits:
##       TwoFactor2 < 1.855876    to the left,  improve=2257.0270, (0 missing)
##       TwoFactor1 < 1.648306    to the left,  improve=1048.3750, (0 missing)
##       Linear03   < 0.9587968   to the left,  improve= 316.3265, (0 missing)
##       Linear02   < 0.08677152  to the right, improve= 274.6478, (0 missing)
##       Linear07   < 0.1161027   to the left,  improve= 216.9542, (0 missing)
##   Surrogate splits:
##       TwoFactor1 < 1.648306    to the left,  agree=0.838, adj=0.499, (0 split)
##       Linear04   < -2.230791   to the right, agree=0.681, adj=0.013, (0 split)
##       Linear01   < 2.207985    to the left,  agree=0.680, adj=0.009, (0 split)
##       Linear02   < -3.306216   to the right, agree=0.679, adj=0.007, (0 split)
##       Linear05   < 2.878201    to the left,  agree=0.679, adj=0.006, (0 split)
## 
## Node number 3: 3120 observations
##   predicted class=Class2  expected loss=0.1051282  P(node) =0.2189474
##     class counts:   328  2792
##    probabilities: 0.105 0.895 
## 
## Node number 4: 7533 observations,    complexity param=0.01243113
##   predicted class=Class1  expected loss=0.1651401  P(node) =0.5286316
##     class counts:  6289  1244
##    probabilities: 0.835 0.165 
##   left son=8 (5827 obs) right son=9 (1706 obs)
##   Primary splits:
##       Linear03   < 0.9399573   to the left,  improve=245.2514, (0 missing)
##       TwoFactor1 < -1.700381   to the right, improve=188.2020, (0 missing)
##       Linear02   < -0.2958551  to the right, improve=144.7394, (0 missing)
##       Linear04   < -0.9184932  to the right, improve=139.8880, (0 missing)
##       Linear07   < 0.5650217   to the left,  improve=133.3823, (0 missing)
##   Surrogate splits:
##       Linear04   < -3.608102   to the right, agree=0.776, adj=0.011, (0 split)
##       Nonlinear3 < 0.9995222   to the left,  agree=0.776, adj=0.009, (0 split)
##       Linear08   < -2.749207   to the right, agree=0.776, adj=0.009, (0 split)
##       Noise8     < 2.975199    to the left,  agree=0.775, adj=0.008, (0 split)
##       Linear02   < -2.660221   to the right, agree=0.774, adj=0.003, (0 split)
## 
## Node number 5: 3597 observations
##   predicted class=Class2  expected loss=0.1540172  P(node) =0.2524211
##     class counts:   554  3043
##    probabilities: 0.154 0.846 
## 
## Node number 8: 5827 observations
##   predicted class=Class1  expected loss=0.09610434  P(node) =0.4089123
##     class counts:  5267   560
##    probabilities: 0.904 0.096 
## 
## Node number 9: 1706 observations,    complexity param=0.01243113
##   predicted class=Class1  expected loss=0.4009379  P(node) =0.1197193
##     class counts:  1022   684
##    probabilities: 0.599 0.401 
##   left son=18 (645 obs) right son=19 (1061 obs)
##   Primary splits:
##       Linear02   < 0.02311349  to the right, improve=128.60320, (0 missing)
##       TwoFactor1 < -1.492157   to the right, improve=115.08770, (0 missing)
##       Linear07   < 0.5621504   to the left,  improve= 95.22832, (0 missing)
##       Linear04   < -0.9165976  to the right, improve= 94.80444, (0 missing)
##       Linear06   < 0.526447    to the right, improve= 94.62899, (0 missing)
##   Surrogate splits:
##       Nonlinear1 < -0.6436725  to the left,  agree=0.643, adj=0.056, (0 split)
##       Linear12   < 1.529133    to the right, agree=0.637, adj=0.040, (0 split)
##       Linear08   < 0.9863967   to the right, agree=0.635, adj=0.034, (0 split)
##       Linear06   < 0.9449652   to the right, agree=0.633, adj=0.029, (0 split)
##       Linear09   < -1.383241   to the left,  agree=0.632, adj=0.028, (0 split)
## 
## Node number 18: 645 observations
##   predicted class=Class1  expected loss=0.151938  P(node) =0.04526316
##     class counts:   547    98
##    probabilities: 0.848 0.152 
## 
## Node number 19: 1061 observations,    complexity param=0.01243113
##   predicted class=Class2  expected loss=0.4476909  P(node) =0.07445614
##     class counts:   475   586
##    probabilities: 0.448 0.552 
##   left son=38 (311 obs) right son=39 (750 obs)
##   Primary splits:
##       Linear07   < -0.09480701 to the left,  improve=78.29273, (0 missing)
##       Linear06   < 0.5389032   to the right, improve=64.83913, (0 missing)
##       Linear08   < 0.6526716   to the right, improve=57.91726, (0 missing)
##       Linear04   < -0.07703882 to the right, improve=51.51000, (0 missing)
##       TwoFactor1 < -1.492157   to the right, improve=50.22675, (0 missing)
##   Surrogate splits:
##       Linear14   < -1.533601   to the left,  agree=0.733, adj=0.090, (0 split)
##       Linear03   < 2.457075    to the right, agree=0.730, adj=0.080, (0 split)
##       Nonlinear2 < 0.01958602  to the left,  agree=0.724, adj=0.058, (0 split)
##       Noise4     < 1.680474    to the right, agree=0.723, adj=0.055, (0 split)
##       TwoFactor1 < 2.430835    to the right, agree=0.722, adj=0.051, (0 split)
## 
## Node number 38: 311 observations
##   predicted class=Class1  expected loss=0.2540193  P(node) =0.02182456
##     class counts:   232    79
##    probabilities: 0.746 0.254 
## 
## Node number 39: 750 observations
##   predicted class=Class2  expected loss=0.324  P(node) =0.05263158
##     class counts:   243   507
##    probabilities: 0.324 0.676
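
Finally, the AUC is computed again on the test data, this time for the tree trained on the balanced data. A minimal sketch of this step, again assuming ROSE's roc.curve():

```r
# AUC of the model trained on the undersampled data, evaluated on the test set.
roc.curve(my_cart_test_skew_data$Class, pred.treeimb_undersample[, 2])
```

The AUC improves from 0.759 with the original skewed training data to 0.880 after undersampling.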

## Area under the curve (AUC): 0.880