str(MChurn)
## tibble [5,000 × 20] (S3: tbl_df/tbl/data.frame)
## $ state : Factor w/ 51 levels "AK","AL","AR",..: 17 36 32 36 37 2 20 25 19 50 ...
## $ account_length : int [1:5000] 128 107 137 84 75 118 121 147 117 141 ...
## $ area_code : Factor w/ 3 levels "area_code_408",..: 2 2 2 1 2 3 3 2 1 2 ...
## $ international_plan : Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 1 2 ...
## $ voice_mail_plan : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 2 1 1 2 ...
## $ number_vmail_messages : int [1:5000] 25 26 0 0 0 0 24 0 0 37 ...
## $ total_day_minutes : num [1:5000] 265 162 243 299 167 ...
## $ total_day_calls : int [1:5000] 110 123 114 71 113 98 88 79 97 84 ...
## $ total_day_charge : num [1:5000] 45.1 27.5 41.4 50.9 28.3 ...
## $ total_eve_minutes : num [1:5000] 197.4 195.5 121.2 61.9 148.3 ...
## $ total_eve_calls : int [1:5000] 99 103 110 88 122 101 108 94 80 111 ...
## $ total_eve_charge : num [1:5000] 16.78 16.62 10.3 5.26 12.61 ...
## $ total_night_minutes : num [1:5000] 245 254 163 197 187 ...
## $ total_night_calls : int [1:5000] 91 103 104 89 121 118 118 96 90 97 ...
## $ total_night_charge : num [1:5000] 11.01 11.45 7.32 8.86 8.41 ...
## $ total_intl_minutes : num [1:5000] 10 13.7 12.2 6.6 10.1 6.3 7.5 7.1 8.7 11.2 ...
## $ total_intl_calls : int [1:5000] 3 3 5 7 3 6 7 6 4 5 ...
## $ total_intl_charge : num [1:5000] 2.7 3.7 3.29 1.78 2.73 1.7 2.03 1.92 2.35 3.02 ...
## $ number_customer_service_calls: int [1:5000] 1 1 0 2 3 0 3 0 1 0 ...
## $ churn : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
head(MChurn)
## # A tibble: 6 × 20
## state accoun…¹ area_…² inter…³ voice…⁴ numbe…⁵ total…⁶ total…⁷ total…⁸ total…⁹
## <fct> <int> <fct> <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 KS 128 area_c… no yes 25 265. 110 45.1 197.
## 2 OH 107 area_c… no yes 26 162. 123 27.5 196.
## 3 NJ 137 area_c… no no 0 243. 114 41.4 121.
## 4 OH 84 area_c… yes no 0 299. 71 50.9 61.9
## 5 OK 75 area_c… yes no 0 167. 113 28.3 148.
## 6 AL 118 area_c… yes no 0 223. 98 38.0 221.
## # … with 10 more variables: total_eve_calls <int>, total_eve_charge <dbl>,
## # total_night_minutes <dbl>, total_night_calls <int>,
## # total_night_charge <dbl>, total_intl_minutes <dbl>, total_intl_calls <int>,
## # total_intl_charge <dbl>, number_customer_service_calls <int>, churn <fct>,
## # and abbreviated variable names ¹account_length, ²area_code,
## # ³international_plan, ⁴voice_mail_plan, ⁵number_vmail_messages,
## # ⁶total_day_minutes, ⁷total_day_calls, ⁸total_day_charge, …
## # ℹ Use `colnames()` to see all variable names
dim(MChurn)
## [1] 5000 20
summary(MChurn)
## state account_length area_code international_plan
## WV : 158 Min. : 1.0 area_code_408:1259 no :4527
## MN : 125 1st Qu.: 73.0 area_code_415:2495 yes: 473
## AL : 124 Median :100.0 area_code_510:1246
## ID : 119 Mean :100.3
## VA : 118 3rd Qu.:127.0
## OH : 116 Max. :243.0
## (Other):4240
## voice_mail_plan number_vmail_messages total_day_minutes total_day_calls
## no :3677 Min. : 0.000 Min. : 0.0 Min. : 0
## yes:1323 1st Qu.: 0.000 1st Qu.:143.7 1st Qu.: 87
## Median : 0.000 Median :180.1 Median :100
## Mean : 7.755 Mean :180.3 Mean :100
## 3rd Qu.:17.000 3rd Qu.:216.2 3rd Qu.:113
## Max. :52.000 Max. :351.5 Max. :165
##
## total_day_charge total_eve_minutes total_eve_calls total_eve_charge
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0 1st Qu.:14.14
## Median :30.62 Median :201.0 Median :100.0 Median :17.09
## Mean :30.65 Mean :200.6 Mean :100.2 Mean :17.05
## 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0 3rd Qu.:19.90
## Max. :59.76 Max. :363.7 Max. :170.0 Max. :30.91
##
## total_night_minutes total_night_calls total_night_charge total_intl_minutes
## Min. : 0.0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510 1st Qu.: 8.50
## Median :200.4 Median :100.00 Median : 9.020 Median :10.30
## Mean :200.4 Mean : 99.92 Mean : 9.018 Mean :10.26
## 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560 3rd Qu.:12.00
## Max. :395.0 Max. :175.00 Max. :17.770 Max. :20.00
##
## total_intl_calls total_intl_charge number_customer_service_calls churn
## Min. : 0.000 Min. :0.000 Min. :0.00 yes: 707
## 1st Qu.: 3.000 1st Qu.:2.300 1st Qu.:1.00 no :4293
## Median : 4.000 Median :2.780 Median :1.00
## Mean : 4.435 Mean :2.771 Mean :1.57
## 3rd Qu.: 6.000 3rd Qu.:3.240 3rd Qu.:2.00
## Max. :20.000 Max. :5.400 Max. :9.00
##
# Divide the dataset into a 70% training / 30% testing split
set.seed(12345)
crows <- nrow(MChurn)
T_index <- sample(crows, 0.7 * crows)
churnTrain <- MChurn[T_index, ]
churnTest <- MChurn[-T_index, ]
prop.table(table(churnTrain$churn))
##
## yes no
## 0.1434286 0.8565714
prop.table(table(churnTest$churn))
##
## yes no
## 0.1366667 0.8633333
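Because churn is imbalanced, the simple random split above only approximately preserves the class proportions (14.3% vs. 13.7% yes). A stratified split keeps them essentially identical; a minimal sketch using caret's createDataPartition() (the *2 object names are illustrative, not from the original):
# Stratified alternative: sample within each churn level so both splits
# inherit the same yes/no proportions.
set.seed(12345)
T_index2 <- createDataPartition(MChurn$churn, p = 0.7, list = FALSE)
churnTrain2 <- MChurn[T_index2, ]
churnTest2 <- MChurn[-T_index2, ]
prop.table(table(churnTrain2$churn))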
# NOTE: this call passes the whole training tibble (churn included) as the
# predictor set, so the target leaks into the predictors; see the discussion
# of the degenerate tree below.
dt_model <- train(churnTrain, churnTrain$churn, metric = "Accuracy", method = "rpart")
## Warning: Setting row names on a tibble is deprecated.
## (this warning repeats once per bootstrap resample; duplicates elided)
typeof(dt_model)
## [1] "list"
names(dt_model)
## [1] "method" "modelInfo" "modelType" "results" "pred"
## [6] "bestTune" "call" "dots" "metric" "control"
## [11] "finalModel" "preProcess" "trainingData" "ptype" "resample"
## [16] "resampledCM" "perfNames" "maximize" "yLimits" "times"
## [21] "levels"
print(dt_model)
## CART
##
## 3500 samples
## 20 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0 1.0000000 1
## 0.5 1.0000000 1
## 1.0 0.8548381 0
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.5.
# The training dataset had 3500 rows and 20 predictors were used. The
# proportion of yes was about 14% and that of no about 86%, so the dataset is
# imbalanced. The function train() was used to fit a classification tree
# (CART) on the attribute churn; it returns an object of class "train", which
# is a list, not a tibble.
## To do: Explain cp, accuracy and kappa (look up kappa)
cp: The complexity parameter controls the size of the decision tree and is used to select the optimal tree size. A split is only attempted if it decreases the overall lack of fit by at least a factor of cp; if adding another split from the current node does not improve the fit by that much, tree building does not continue. Larger cp values therefore give smaller trees.
kappa: Cohen's kappa, in the context of a classification model, compares the model's predictions with the actual values while correcting for the agreement expected by chance. It is generally a more informative measure of accuracy than the raw accuracy when the dataset is imbalanced.
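As a concrete illustration, kappa can be computed by hand from a confusion matrix; a minimal sketch with made-up counts (not from this model):
# Cohen's kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement and
# p_e the agreement expected by chance from the marginal totals.
tab <- matrix(c(90, 10, 20, 80), nrow = 2,
              dimnames = list(pred = c("yes", "no"), actual = c("yes", "no")))
n <- sum(tab)
p_o <- sum(diag(tab)) / n                      # observed agreement: 0.85
p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement: 0.50
(p_o - p_e) / (1 - p_e)                        # kappa: 0.70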
dt_model is obtained with train() using method = "rpart". It is a list whose components are listed by names(dt_model). print(dt_model) summarizes the fit: 3500 samples and all 20 predictors were used to build the model. The resulting tree contains only two terminal nodes, and the single split was made on the dependent variable churn itself, a consequence of churn being included among the predictors, which is also why the resampled accuracy of 1.0 is meaningless:
2) churn=yes 502 0 yes (1.0000000 0.0000000)
3) churn=no 2998 0 no (0.0000000 1.0000000)
print(dt_model$finalModel)
## n= 3500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3500 502 no (0.1434286 0.8565714)
## 2) churn=yes 502 0 yes (1.0000000 0.0000000) *
## 3) churn=no 2998 0 no (0.0000000 1.0000000) *
rpart.plot(dt_model$finalModel)
## Model Predictions (1)
dt_predict <- predict(dt_model, newdata = churnTest, na.action = na.omit, type = "prob")
head(dt_predict, 5)
## yes no
## 1 0 1
## 2 0 1
## 3 0 1
## 4 1 0
## 5 0 1
The print() statement shows a tree with one root and two terminal nodes (leaves), so the tree building was not successful. The statement above calls predict() with the model dt_model on the test dataset; because type = "prob", it returns, for each data point, the probability of belonging to the yes and the no category.
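These probabilities can be turned into class labels with an explicit cutoff; a sketch (cutoff and pred_class are illustrative names; a 0.5 cutoff reproduces type = "raw", while a lower cutoff would favour the minority yes class):
cutoff <- 0.5
pred_class <- factor(ifelse(dt_predict$yes >= cutoff, "yes", "no"),
                     levels = c("yes", "no"))
head(pred_class)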
dt_predict2 <- predict(dt_model, newdata=churnTest, type="raw")
head(dt_predict2)
## [1] no no no yes no no
## Levels: yes no
# To do: Explain the results in the output of this section
# dt_predict2 provides class predictions (churn yes/no) for the data points in
# the test dataset: with type = "raw", the first three points were predicted
# no, the 4th yes, and the 5th and 6th no.
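A natural next step is to tabulate these predictions against the observed labels; caret's confusionMatrix() also reports accuracy and kappa. A sketch (not run in the original):
# Cross-tabulates predictions vs. truth; positive = "yes" makes sensitivity
# and related statistics refer to the churners.
confusionMatrix(dt_predict2, churnTest$churn, positive = "yes")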
## Model Tuning (1)
dt_model_tune <- train(churn ~ ., data = churnTrain, method = "rpart",
                       metric = "Accuracy", tuneLength = 8)
print(dt_model_tune)
## CART
##
## 3500 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.009960159 0.9352655 0.7179393
## 0.011952191 0.9344804 0.7129491
## 0.017928287 0.9319172 0.7007532
## 0.021912351 0.9295950 0.6854555
## 0.023240372 0.9287546 0.6803802
## 0.059760956 0.9034606 0.5136166
## 0.069721116 0.8803967 0.3455416
## 0.083665339 0.8620972 0.1740432
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.009960159.
print(dt_model_tune$finalModel)
## n= 3500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3500 502 no (0.14342857 0.85657143)
## 2) total_day_minutes>=265.75 215 88 yes (0.59069767 0.40930233)
## 4) voice_mail_planyes< 0.5 160 38 yes (0.76250000 0.23750000)
## 8) total_eve_minutes>=163.95 117 11 yes (0.90598291 0.09401709) *
## 9) total_eve_minutes< 163.95 43 16 no (0.37209302 0.62790698)
## 18) total_day_minutes>=303.8 11 1 yes (0.90909091 0.09090909) *
## 19) total_day_minutes< 303.8 32 6 no (0.18750000 0.81250000) *
## 5) voice_mail_planyes>=0.5 55 5 no (0.09090909 0.90909091) *
## 3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)
## 6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)
## 12) total_day_minutes< 162.7 106 18 yes (0.83018868 0.16981132) *
## 13) total_day_minutes>=162.7 157 37 no (0.23566879 0.76433121) *
## 7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)
## 14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)
## 28) total_intl_calls< 2.5 60 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 228 48 no (0.21052632 0.78947368)
## 58) total_intl_minutes>=13.05 44 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 184 4 no (0.02173913 0.97826087) *
## 15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)
## 30) total_day_minutes>=224.15 417 75 no (0.17985612 0.82014388)
## 60) total_eve_minutes>=242.35 91 38 yes (0.58241758 0.41758242)
## 120) voice_mail_planyes< 0.5 71 18 yes (0.74647887 0.25352113)
## 240) total_night_minutes>=183.7 49 4 yes (0.91836735 0.08163265) *
## 241) total_night_minutes< 183.7 22 8 no (0.36363636 0.63636364) *
## 121) voice_mail_planyes>=0.5 20 0 no (0.00000000 1.00000000) *
## 61) total_eve_minutes< 242.35 326 22 no (0.06748466 0.93251534) *
## 31) total_day_minutes< 224.15 2317 67 no (0.02891670 0.97108330) *
This new model is obtained by setting tuneLength = 8. This parameter tells caret how many candidate values of the tuning parameter (cp, not Accuracy) to try. With this setting the tree first splits on the total_day_minutes column, then further on voice_mail_planyes, and so on. The leaf nodes carry the values yes or no, so that following the conditions along a path from the root leads to the final decision of the tree.
The selected cp for this model is 0.009960159, with a resampled Accuracy of 0.9352655.
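The accuracy profile across the cp candidates can also be read off the train object directly; a sketch:
plot(dt_model_tune)      # resampled Accuracy as a function of cp
dt_model_tune$bestTune   # the cp value that was selected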
## Model Tuning (2)
dt_model_tune2 <- train(churn ~ ., data = churnTrain, method = "rpart",
                        tuneGrid = expand.grid(cp = seq(0, 0.1, 0.01)))
print(dt_model_tune2)
## CART
##
## 3500 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 3500, 3500, 3500, 3500, 3500, 3500, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00 0.9326555 0.70525814
## 0.01 0.9370867 0.71554195
## 0.02 0.9330645 0.69330994
## 0.03 0.9276298 0.66044753
## 0.04 0.9268269 0.65725060
## 0.05 0.9230044 0.63251378
## 0.06 0.9002708 0.46890843
## 0.07 0.8817473 0.31229502
## 0.08 0.8694393 0.18443190
## 0.09 0.8634030 0.11052551
## 0.10 0.8601060 0.04344398
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01.
print(dt_model_tune2$finalModel)
## n= 3500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3500 502 no (0.14342857 0.85657143)
## 2) total_day_minutes>=265.75 215 88 yes (0.59069767 0.40930233)
## 4) voice_mail_planyes< 0.5 160 38 yes (0.76250000 0.23750000)
## 8) total_eve_minutes>=163.95 117 11 yes (0.90598291 0.09401709) *
## 9) total_eve_minutes< 163.95 43 16 no (0.37209302 0.62790698)
## 18) total_day_minutes>=303.8 11 1 yes (0.90909091 0.09090909) *
## 19) total_day_minutes< 303.8 32 6 no (0.18750000 0.81250000) *
## 5) voice_mail_planyes>=0.5 55 5 no (0.09090909 0.90909091) *
## 3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)
## 6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)
## 12) total_day_minutes< 162.7 106 18 yes (0.83018868 0.16981132) *
## 13) total_day_minutes>=162.7 157 37 no (0.23566879 0.76433121) *
## 7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)
## 14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)
## 28) total_intl_calls< 2.5 60 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 228 48 no (0.21052632 0.78947368)
## 58) total_intl_minutes>=13.05 44 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 184 4 no (0.02173913 0.97826087) *
## 15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)
## 30) total_day_minutes>=224.15 417 75 no (0.17985612 0.82014388)
## 60) total_eve_minutes>=242.35 91 38 yes (0.58241758 0.41758242)
## 120) voice_mail_planyes< 0.5 71 18 yes (0.74647887 0.25352113)
## 240) total_night_minutes>=183.7 49 4 yes (0.91836735 0.08163265) *
## 241) total_night_minutes< 183.7 22 8 no (0.36363636 0.63636364) *
## 121) voice_mail_planyes>=0.5 20 0 no (0.00000000 1.00000000) *
## 61) total_eve_minutes< 242.35 326 22 no (0.06748466 0.93251534) *
## 31) total_day_minutes< 224.15 2317 67 no (0.02891670 0.97108330) *
# To do: What is the default cp= in the tuneGrid argument?
# The tuneGrid argument lets us choose exactly which values the tuning
# parameter will take. In the statement above, cp takes the values
# seq(0, 0.1, 0.01), i.e. 0, 0.01, 0.02, ..., 0.10, and the model is trained
# with each of them. (Without tuneGrid, rpart's own default is cp = 0.01.)
# The model selected is the one with the best performance: Accuracy was used
# to select the optimal model, taking the largest value, and the final value
# was cp = 0.01.
# To do: Notice the terminal nodes indicated by * in the output
# A branch in which a decision is reached at a relatively shallow level (leaf)
# is churn = yes when total_day_minutes >= 265.75, voice_mail_plan = no
# (voice_mail_planyes < 0.5), and total_eve_minutes >= 163.95.
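For reference, the grid handed to caret can be printed on its own; a sketch:
expand.grid(cp = seq(0, 0.1, 0.01))  # 11 rows: cp = 0.00, 0.01, ..., 0.10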
## Model Pre-Pruning
dt_model_preprune <- train(churn ~ ., data = churnTrain, method = "rpart",
                           metric = "Accuracy", tuneLength = 8,
                           control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 5))
print(dt_model_preprune$finalModel)
## n= 3500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3500 502 no (0.14342857 0.85657143)
## 2) total_day_minutes>=265.75 215 88 yes (0.59069767 0.40930233)
## 4) voice_mail_planyes< 0.5 160 38 yes (0.76250000 0.23750000)
## 8) total_eve_minutes>=163.95 117 11 yes (0.90598291 0.09401709) *
## 9) total_eve_minutes< 163.95 43 16 no (0.37209302 0.62790698) *
## 5) voice_mail_planyes>=0.5 55 5 no (0.09090909 0.90909091) *
## 3) total_day_minutes< 265.75 3285 375 no (0.11415525 0.88584475)
## 6) number_customer_service_calls>=3.5 263 125 no (0.47528517 0.52471483)
## 12) total_day_minutes< 162.7 106 18 yes (0.83018868 0.16981132) *
## 13) total_day_minutes>=162.7 157 37 no (0.23566879 0.76433121) *
## 7) number_customer_service_calls< 3.5 3022 250 no (0.08272667 0.91727333)
## 14) international_planyes>=0.5 288 108 no (0.37500000 0.62500000)
## 28) total_intl_calls< 2.5 60 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 228 48 no (0.21052632 0.78947368)
## 58) total_intl_minutes>=13.05 44 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 184 4 no (0.02173913 0.97826087) *
## 15) international_planyes< 0.5 2734 142 no (0.05193855 0.94806145)
## 30) total_day_minutes>=224.15 417 75 no (0.17985612 0.82014388)
## 60) total_eve_minutes>=242.35 91 38 yes (0.58241758 0.41758242) *
## 61) total_eve_minutes< 242.35 326 22 no (0.06748466 0.93251534) *
## 31) total_day_minutes< 224.15 2317 67 no (0.02891670 0.97108330) *
The control argument sets the pre-pruning parameters: minsplit = 50 is the minimum number of observations that must exist in a node for a split to be attempted; minbucket = 20 is the minimum number of observations allowed in any terminal node; and maxdepth = 5 caps the depth of the tree, with the root node counted as depth 0.
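The same pre-pruning constraints can be passed to rpart() directly, outside of caret; a minimal sketch (dt_direct is an illustrative name):
library(rpart)
# One tree, no resampling or cp tuning; the control limits growth up front.
dt_direct <- rpart(churn ~ ., data = churnTrain,
                   control = rpart.control(minsplit = 50, minbucket = 20,
                                           maxdepth = 5))
print(dt_direct)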
## Model Post-Pruning
dt_model_postprune <- prune(dt_model_tune$finalModel, cp = 0.2)
print(dt_model_postprune)
## n= 3500
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 3500 502 no (0.1434286 0.8565714) *
The post-pruning reduced the tree to only the root node, so cp = 0.2 is too high to yield an interesting result. In what follows I use the tree obtained by dt_model_tune <- train(churn ~ ., data = churnTrain, method = "rpart", metric = "Accuracy", tuneLength = 8), since the first model's tree was unusable (it split on churn itself).
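A milder cutoff keeps more of the tree; a sketch (dt_model_postprune2 is an illustrative name):
dt_model_postprune2 <- prune(dt_model_tune$finalModel, cp = 0.02)
printcp(dt_model_postprune2)  # complexity table of the surviving subtree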
## Check Decision Tree Classifier (1)
plot(dt_model_tune$finalModel)
text(dt_model_tune$finalModel)
##summary(dt_model_tune$finalModel)
## Check Decision Tree Classifier (2)
fancyRpartPlot(dt_model_tune$finalModel)
# To do: Try to explain the decision tree in the output
The leaf nodes of the tree are yes/no, the decisions reached after following a path of conditions set up by the tree. One such path starts by testing whether total_day_minutes >= 266 (265.75 before the plot rounds it): the yes branch leads to voice_mail_plan, and so on until a leaf node is reached and with it a classification. Each node shows the predicted class (yes or no), the predicted class probabilities, and the percentage of observations falling in the node.
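rpart.plot has options controlling what each node displays; a sketch (extra = 104 is one common choice for class trees):
# extra = 104: per-class probabilities plus the percentage of observations in
# each node; type = 2 draws the split labels below the node boxes.
rpart.plot(dt_model_tune$finalModel, extra = 104, type = 2)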
## Check Decision Tree Classifier (3)
prp(dt_model_tune$finalModel)
rpart.plot(dt_model_tune$finalModel)