Loading all required libraries

R Markdown

str(MChurn)
## tibble [5,000 × 20] (S3: tbl_df/tbl/data.frame)
##  $ state                        : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
##  $ account_length               : int [1:5000] 128 107 137 84 75 118 121 147 117 141 ...
##  $ area_code                    : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
##  $ international_plan           : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
##  $ voice_mail_plan              : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
##  $ number_vmail_messages        : int [1:5000] 25 26 0 0 0 0 24 0 0 37 ...
##  $ total_day_minutes            : num [1:5000] 265 162 243 299 167 ...
##  $ total_day_calls              : int [1:5000] 110 123 114 71 113 98 88 79 97 84 ...
##  $ total_day_charge             : num [1:5000] 45.1 27.5 41.4 50.9 28.3 ...
##  $ total_eve_minutes            : num [1:5000] 197.4 195.5 121.2 61.9 148.3 ...
##  $ total_eve_calls              : int [1:5000] 99 103 110 88 122 101 108 94 80 111 ...
##  $ total_eve_charge             : num [1:5000] 16.78 16.62 10.3 5.26 12.61 ...
##  $ total_night_minutes          : num [1:5000] 245 254 163 197 187 ...
##  $ total_night_calls            : int [1:5000] 91 103 104 89 121 118 118 96 90 97 ...
##  $ total_night_charge           : num [1:5000] 11.01 11.45 7.32 8.86 8.41 ...
##  $ total_intl_minutes           : num [1:5000] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
##  $ total_intl_calls             : int [1:5000] 3 3 5 7 3 6 7 6 4 5 ...
##  $ total_intl_charge            : num [1:5000] 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
##  $ number_customer_service_calls: int [1:5000] 1 1 0 2 3 0 3 0 1 0 ...
##  $ churn                        : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
head(MChurn)
## # A tibble: 6 × 20
##   state accoun…¹ area_…² inter…³ voice…⁴ numbe…⁵ total…⁶ total…⁷ total…⁸ total…⁹
##   <fct>    <int> <fct>   <fct>   <fct>     <int>   <dbl>   <int>   <dbl>   <dbl>
## 1 KS         128 area_c… no      yes          25    265.     110    45.1   197. 
## 2 OH         107 area_c… no      yes          26    162.     123    27.5   196. 
## 3 NJ         137 area_c… no      no            0    243.     114    41.4   121. 
## 4 OH          84 area_c… yes     no            0    299.      71    50.9    61.9
## 5 OK          75 area_c… yes     no            0    167.     113    28.3   148. 
## 6 AL         118 area_c… yes     no            0    223.      98    38.0   221. 
## # … with 10 more variables: total_eve_calls <int>, total_eve_charge <dbl>,
## #   total_night_minutes <dbl>, total_night_calls <int>,
## #   total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## #   total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## #   and abbreviated variable names ¹​account_length, ²​area_code,
## #   ³​international_plan, ⁴​voice_mail_plan, ⁵​number_vmail_messages,
## #   ⁶​total_day_minutes, ⁷​total_day_calls, ⁸​total_day_charge, …
## # ℹ Use `colnames()` to see all variable names
dim(MChurn)
## [1] 5000   20
summary(MChurn)
##      state      account_length          area_code    international_plan
##  WV     : 158   Min.   :  1.0   area_code_408:1259   no :4527          
##  MN     : 125   1st Qu.: 73.0   area_code_415:2495   yes: 473          
##  AL     : 124   Median :100.0   area_code_510:1246                     
##  ID     : 119   Mean   :100.3                                          
##  VA     : 118   3rd Qu.:127.0                                          
##  OH     : 116   Max.   :243.0                                          
##  (Other):4240                                                          
##  voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
##  no :3677        Min.   : 0.000        Min.   :  0.0     Min.   :  0    
##  yes:1323        1st Qu.: 0.000        1st Qu.:143.7     1st Qu.: 87    
##                  Median : 0.000        Median :180.1     Median :100    
##                  Mean   : 7.755        Mean   :180.3     Mean   :100    
##                  3rd Qu.:17.000        3rd Qu.:216.2     3rd Qu.:113    
##                  Max.   :52.000        Max.   :351.5     Max.   :165    
##                                                                         
##  total_day_charge total_eve_minutes total_eve_calls total_eve_charge
##  Min.   : 0.00    Min.   :  0.0     Min.   :  0.0   Min.   : 0.00   
##  1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0   1st Qu.:14.14   
##  Median :30.62    Median :201.0     Median :100.0   Median :17.09   
##  Mean   :30.65    Mean   :200.6     Mean   :100.2   Mean   :17.05   
##  3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0   3rd Qu.:19.90   
##  Max.   :59.76    Max.   :363.7     Max.   :170.0   Max.   :30.91   
##                                                                     
##  total_night_minutes total_night_calls total_night_charge total_intl_minutes
##  Min.   :  0.0       Min.   :  0.00    Min.   : 0.000     Min.   : 0.00     
##  1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510     1st Qu.: 8.50     
##  Median :200.4       Median :100.00    Median : 9.020     Median :10.30     
##  Mean   :200.4       Mean   : 99.92    Mean   : 9.018     Mean   :10.26     
##  3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560     3rd Qu.:12.00     
##  Max.   :395.0       Max.   :175.00    Max.   :17.770     Max.   :20.00     
##                                                                             
##  total_intl_calls total_intl_charge number_customer_service_calls churn     
##  Min.   : 0.000   Min.   :0.000     Min.   :0.00                  yes: 707  
##  1st Qu.: 3.000   1st Qu.:2.300     1st Qu.:1.00                  no :4293  
##  Median : 4.000   Median :2.780     Median :1.00                            
##  Mean   : 4.435   Mean   :2.771     Mean   :1.57                            
##  3rd Qu.: 6.000   3rd Qu.:3.240     3rd Qu.:2.00                            
##  Max.   :20.000   Max.   :5.400     Max.   :9.00                            
## 
#Dividing the dataset in 70% training and 30% testing
set.seed(12345)
crows=nrow(MChurn)
T_index<-sample(crows, .7*crows)

churnTrain<-MChurn[T_index, ]
churnTest<-MChurn[-T_index, ]

prop.table(table(churnTrain$churn))
## 
##       yes        no 
## 0.1434286 0.8565714
prop.table(table(churnTest$churn))
## 
##       yes        no 
## 0.1366667 0.8633333
dt_model <-train(churnTrain, churnTrain$churn, metric = "Accuracy", method = "rpart")
## Warning: Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
## Setting row names on a tibble is deprecated.
typeof(dt_model)
## [1] "list"
names(dt_model)
##  [1] "method"       "modelInfo"    "modelType"    "results"      "pred"        
##  [6] "bestTune"     "call"         "dots"         "metric"       "control"     
## [11] "finalModel"   "preProcess"   "trainingData" "ptype"        "resample"    
## [16] "resampledCM"  "perfNames"    "maximize"     "yLimits"      "times"       
## [21] "levels"
print(dt_model)
## CART 
## 
## 3500 samples
##   20 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ... 
## Resampling results across tuning parameters:
## 
##   cp   Accuracy   Kappa
##   0.0  1.0000000  1    
##   0.5  1.0000000  1    
##   1.0  0.8548381  0    
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.5.

To do: Interpret the results in this output section

# The training dataset had 3500 rows. 20 predictors were used #The proportion of yes was 14% and that of no was 85%. This dataset is imbalanced. # The function train() was used to train regression trees on the attribute churn. It produces an model of type tibble with 6 rows and 20 columns ## To do: Explain cp accuracy and kappa (Look up kappa)

cp: The complexity parameter (cp) is used to control the size of the decision tree and to select the optimal tree size. If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue. We could also say that tree construction does not continue unless it would decrease the error.

kappa: The Cohen Kappa, in the context of a classification model, is used to compare the machine learning model predictions with the actual values. It is thought to give a better evaluation of a model accuracy when the dataset is imbalanced.

dt_model is obtained using rpart(). It is a list that consists of several parameters listed under names(df_model). print(dt_model) displays the tree that was obtained: 3500 samples were used and all 20 predictors to build the model. The tree that was obtained only contained two subtrees with terminal nodes. The split was done on the churn dependent variable. 2) churn=yes 502 0 yes (1.0000000 0.0000000) 3) churn=no 2998 0 no (0.0000000 1.0000000)

print(dt_model$finalModel)
## n= 3500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 3500 502 no (0.1434286 0.8565714)  
##   2) churn=yes 502   0 yes (1.0000000 0.0000000) *
##   3) churn=no 2998   0 no (0.0000000 1.0000000) *
rpart.plot(dt_model$finalModel)

## Model Predictions (1)


dt_predict <-predict(dt_model, newdata=churnTest, na.action=na.omit, type="prob")
head(dt_predict, 5)
##   yes no
## 1   0  1
## 2   0  1
## 3   0  1
## 4   1  0
## 5   0  1

To do: Explain the results in the output of this section

the print() statement shows the tree built with one root and two terminal nodes(leaves), so we can say that the tree building was not successful dt_predict <-predict(dt_model, newdata=churnTest, na.action=na.omit, type=“prob”) The statement above calls the function predict() using the model dt_model on the test dataset and provides the probability for each data point to be in yes or no category since type=“prob:

dt_predict2 <- predict(dt_model, newdata=churnTest, type="raw")
head(dt_predict2)
## [1] no  no  no  yes no  no 
## Levels: yes no

# To do: Explain the results in the output of this section det_predict2 provides predictions of the class (churn y/n) for the data points in the test dataset. Here the 3 first datapoints were predicted to be no, the 4th to be yes and the 5th to be no.

## Model Tuning (1)

dt_model_tune <- train(churn ~. , data=churnTrain, method="rpart", metric="Accuracy",tuneLength = 8)
print(dt_model_tune)
## CART 
## 
## 3500 samples
##   19 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ... 
## Resampling results across tuning parameters:
## 
##   cp           Accuracy   Kappa    
##   0.009960159  0.9352655  0.7179393
##   0.011952191  0.9344804  0.7129491
##   0.017928287  0.9319172  0.7007532
##   0.021912351  0.9295950  0.6854555
##   0.023240372  0.9287546  0.6803802
##   0.059760956  0.9034606  0.5136166
##   0.069721116  0.8803967  0.3455416
##   0.083665339  0.8620972  0.1740432
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009960159.
print(dt_model_tune$finalModel)
## n= 3500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 3500 502 no (0.14342857 0.85657143)  
##     2) total_day_minutes>=265.75 215  88 yes (0.59069767 0.40930233)  
##       4) voice_mail_planyes< 0.5 160  38 yes (0.76250000 0.23750000)  
##         8) total_eve_minutes>=163.95 117  11 yes (0.90598291 0.09401709) *
##         9) total_eve_minutes< 163.95 43  16 no (0.37209302 0.62790698)  
##          18) total_day_minutes>=303.8 11   1 yes (0.90909091 0.09090909) *
##          19) total_day_minutes< 303.8 32   6 no (0.18750000 0.81250000) *
##       5) voice_mail_planyes>=0.5 55   5 no (0.09090909 0.90909091) *
##     3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)  
##       6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)  
##        12) total_day_minutes< 162.7 106  18 yes (0.83018868 0.16981132) *
##        13) total_day_minutes>=162.7 157  37 no (0.23566879 0.76433121) *
##       7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)  
##        14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)  
##          28) total_intl_calls< 2.5 60   0 yes (1.00000000 0.00000000) *
##          29) total_intl_calls>=2.5 228  48 no (0.21052632 0.78947368)  
##            58) total_intl_minutes>=13.05 44   0 yes (1.00000000 0.00000000) *
##            59) total_intl_minutes< 13.05 184   4 no (0.02173913 0.97826087) *
##        15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)  
##          30) total_day_minutes>=224.15 417  75 no (0.17985612 0.82014388)  
##            60) total_eve_minutes>=242.35 91  38 yes (0.58241758 0.41758242)  
##             120) voice_mail_planyes< 0.5 71  18 yes (0.74647887 0.25352113)  
##               240) total_night_minutes>=183.7 49   4 yes (0.91836735 0.08163265) *
##               241) total_night_minutes< 183.7 22   8 no (0.36363636 0.63636364) *
##             121) voice_mail_planyes>=0.5 20   0 no (0.00000000 1.00000000) *
##            61) total_eve_minutes< 242.35 326  22 no (0.06748466 0.93251534) *
##          31) total_day_minutes< 224.15 2317  67 no (0.02891670 0.97108330) *

the new model obtained by setting the tuneLength parameter to 8. This parameter tells the algorithm the number of different values to try for each tuning parameter (Accuracy). With this setting, the tree splits at total_day_minutes column, then further splitting on voice_mail_planyes, etc. The leaf nodes have values of yes or no, so that following a path from the root, one would follow the conditions and get to the final decision of the tree.

The cp for this model is 0.009960159 and Accuracy is 0.9370867

##Model  tuning (2)
dt_model_tune2 <- train(churn ~. , data=churnTrain, method="rpart", tuneGrid = expand.grid(cp=seq(0, 0.1, 0.01)))
print(dt_model_tune2)
## CART 
## 
## 3500 samples
##   19 predictor
##    2 classes: 'yes', 'no' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ... 
## Resampling results across tuning parameters:
## 
##   cp    Accuracy   Kappa     
##   0.00  0.9326555  0.70525814
##   0.01  0.9370867  0.71554195
##   0.02  0.9330645  0.69330994
##   0.03  0.9276298  0.66044753
##   0.04  0.9268269  0.65725060
##   0.05  0.9230044  0.63251378
##   0.06  0.9002708  0.46890843
##   0.07  0.8817473  0.31229502
##   0.08  0.8694393  0.18443190
##   0.09  0.8634030  0.11052551
##   0.10  0.8601060  0.04344398
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01.
print(dt_model_tune2$finalModel)
## n= 3500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 3500 502 no (0.14342857 0.85657143)  
##     2) total_day_minutes>=265.75 215  88 yes (0.59069767 0.40930233)  
##       4) voice_mail_planyes< 0.5 160  38 yes (0.76250000 0.23750000)  
##         8) total_eve_minutes>=163.95 117  11 yes (0.90598291 0.09401709) *
##         9) total_eve_minutes< 163.95 43  16 no (0.37209302 0.62790698)  
##          18) total_day_minutes>=303.8 11   1 yes (0.90909091 0.09090909) *
##          19) total_day_minutes< 303.8 32   6 no (0.18750000 0.81250000) *
##       5) voice_mail_planyes>=0.5 55   5 no (0.09090909 0.90909091) *
##     3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)  
##       6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)  
##        12) total_day_minutes< 162.7 106  18 yes (0.83018868 0.16981132) *
##        13) total_day_minutes>=162.7 157  37 no (0.23566879 0.76433121) *
##       7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)  
##        14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)  
##          28) total_intl_calls< 2.5 60   0 yes (1.00000000 0.00000000) *
##          29) total_intl_calls>=2.5 228  48 no (0.21052632 0.78947368)  
##            58) total_intl_minutes>=13.05 44   0 yes (1.00000000 0.00000000) *
##            59) total_intl_minutes< 13.05 184   4 no (0.02173913 0.97826087) *
##        15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)  
##          30) total_day_minutes>=224.15 417  75 no (0.17985612 0.82014388)  
##            60) total_eve_minutes>=242.35 91  38 yes (0.58241758 0.41758242)  
##             120) voice_mail_planyes< 0.5 71  18 yes (0.74647887 0.25352113)  
##               240) total_night_minutes>=183.7 49   4 yes (0.91836735 0.08163265) *
##               241) total_night_minutes< 183.7 22   8 no (0.36363636 0.63636364) *
##             121) voice_mail_planyes>=0.5 20   0 no (0.00000000 1.00000000) *
##            61) total_eve_minutes< 242.35 326  22 no (0.06748466 0.93251534) *
##          31) total_day_minutes< 224.15 2317  67 no (0.02891670 0.97108330) *

# To do: What is the default cp= in the tuneGrid argument? The tuneGrid parameter lets us decide which values the main parameter will take. In the statement above, cp will take the values: 0, 01 and 0.01 and the model will be trained with each of these values. The model selected will be the one with the best performance. Accuracy was used to select the optimal model using the largest value. The final value used for the model was cp = 0.01. # To do: Notice the terminal nodes indicated by * in the output A branch in which a decision can be reached at a relatively shallow level (leaf ) is churn =no if total.day.minutes>=265 & voice.mail.plan<0.5 and total_evening.minutes>=163.

## Model Pre-Pruning

dt_model_preprune <- train(churn ~. , data=churnTrain, method="rpart", metric="Accuracy",  tuneLength = 8, control = rpart.control(minsplit=50, minbucket=20, maxdepth=5))
print(dt_model_preprune$finalModel)
## n= 3500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 3500 502 no (0.14342857 0.85657143)  
##    2) total_day_minutes>=265.75 215  88 yes (0.59069767 0.40930233)  
##      4) voice_mail_planyes< 0.5 160  38 yes (0.76250000 0.23750000)  
##        8) total_eve_minutes>=163.95 117  11 yes (0.90598291 0.09401709) *
##        9) total_eve_minutes< 163.95 43  16 no (0.37209302 0.62790698) *
##      5) voice_mail_planyes>=0.5 55   5 no (0.09090909 0.90909091) *
##    3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)  
##      6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)  
##       12) total_day_minutes< 162.7 106  18 yes (0.83018868 0.16981132) *
##       13) total_day_minutes>=162.7 157  37 no (0.23566879 0.76433121) *
##      7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)  
##       14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)  
##         28) total_intl_calls< 2.5 60   0 yes (1.00000000 0.00000000) *
##         29) total_intl_calls>=2.5 228  48 no (0.21052632 0.78947368)  
##           58) total_intl_minutes>=13.05 44   0 yes (1.00000000 0.00000000) *
##           59) total_intl_minutes< 13.05 184   4 no (0.02173913 0.97826087) *
##       15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)  
##         30) total_day_minutes>=224.15 417  75 no (0.17985612 0.82014388)  
##           60) total_eve_minutes>=242.35 91  38 yes (0.58241758 0.41758242) *
##           61) total_eve_minutes< 242.35 326  22 no (0.06748466 0.93251534) *
##         31) total_day_minutes< 224.15 2317  67 no (0.02891670 0.97108330) *

To do: Explain the control argument.

the control argument sets various parameters: minsplit=50 the minimum number of observations that must exist in a node in order for a split to be attempted.minbucket=20 is the minimum number of observations in any terminal node.maxdepth=5, Set the maximum depth of any node of the final tree, with the root node counted as depth 0.

## Model Post-Pruning  

dt_model_postprune <-prune(dt_model_tune$finalModel, cp = 0.2)

print(dt_model_postprune)
## n= 3500 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 3500 502 no (0.1434286 0.8565714) *

The post-pruning reduced the tree to only the root node, therefore the cp value is too high to yield an interesting result. In what follows I used the tree obtained by dt_model_tune <- train(churn ~. , data=churnTrain, method=“rpart”, metric=“Accuracy”,tuneLength = 8), since the original one only contained the root node.

## Check Decision Tree Classifier (1)

plot(dt_model_tune$finalModel)

text(dt_model_tune$finalModel)

##summary(dt_model_tune$finalModel)
## Check Decision Tree Classifier (2)


fancyRpartPlot(dt_model_tune$finalModel)

# To do:  Try to explain the decision tree in the output

The leaf nodes of the tree are yes/no which are the decision reached after following a path of conditions setup by the tree. One such path starts with testing if the total_day_minutes >266: The yes branch leads to voice_mail_plan and so on until a leaf node is reached and with it a classification. On each node shows: the predicted class(yes or no), the predicted probability of no and the % of observations in the node.

## Check Decision Tree Classifier (3) 


prp(dt_model_tune$finalModel)

rpart.plot(dt_model_tune$finalModel)