library(C50)
library(modeldata)
library(rsample)
dataset <- mlc_churn
# Split the dataset into training (80%) and testing (20%) sets
split <- initial_split(dataset, prop = 0.8)
# Extract training and testing sets
train_data <- training(split)
test_data <- testing(split)
# Convert target variable (churn) to a factor (ensuring it's a classification problem)
train_data$churn <- as.factor(train_data$churn)
test_data$churn <- as.factor(test_data$churn)
train_data
## # A tibble: 4,000 × 20
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 LA 150 area_code_415 no no
## 2 HI 111 area_code_408 no yes
## 3 IA 46 area_code_510 no no
## 4 ID 92 area_code_415 no yes
## 5 IA 113 area_code_408 no no
## 6 TX 50 area_code_510 no no
## 7 KY 59 area_code_510 no no
## 8 IL 131 area_code_510 no no
## 9 WY 90 area_code_408 no yes
## 10 UT 89 area_code_415 no no
## # ℹ 3,990 more rows
## # ℹ 15 more variables: number_vmail_messages <int>, total_day_minutes <dbl>,
## # total_day_calls <int>, total_day_charge <dbl>, total_eve_minutes <dbl>,
## # total_eve_calls <int>, total_eve_charge <dbl>, total_night_minutes <dbl>,
## # total_night_calls <int>, total_night_charge <dbl>,
## # total_intl_minutes <dbl>, total_intl_calls <int>, total_intl_charge <dbl>,
## # number_customer_service_calls <int>, churn <fct>
test_data
## # A tibble: 1,000 × 20
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 FL 147 area_code_415 no no
## 2 SC 111 area_code_415 no no
## 3 AZ 12 area_code_408 no no
## 4 AK 36 area_code_408 no yes
## 5 NJ 149 area_code_408 no no
## 6 GA 98 area_code_408 no no
## 7 AR 34 area_code_510 no no
## 8 ID 119 area_code_415 no no
## 9 IA 52 area_code_408 no no
## 10 IN 81 area_code_408 no no
## # ℹ 990 more rows
## # ℹ 15 more variables: number_vmail_messages <int>, total_day_minutes <dbl>,
## # total_day_calls <int>, total_day_charge <dbl>, total_eve_minutes <dbl>,
## # total_eve_calls <int>, total_eve_charge <dbl>, total_night_minutes <dbl>,
## # total_night_calls <int>, total_night_charge <dbl>,
## # total_intl_minutes <dbl>, total_intl_calls <int>, total_intl_charge <dbl>,
## # number_customer_service_calls <int>, churn <fct>
In the output above it can be seen that the dataset was split into an 80% training set (4,000 rows) and a 20% testing set (1,000 rows).
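Note that initial_split() draws a random sample, so the rows shown above will differ from run to run. A minimal sketch of a reproducible split (the seed value is arbitrary):

set.seed(123)                               # any fixed seed makes the split reproducible
split <- initial_split(dataset, prop = 0.8)
nrow(training(split))                       # 4000 rows (80%)
nrow(testing(split))                        # 1000 rows (20%)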
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
dt_model <- train(churn ~ ., data = train_data, metric = "Accuracy", method = "rpart")
print(dt_model)
## CART
##
## 4000 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 4000, 4000, 4000, 4000, 4000, 4000, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.07478261 0.8833413 0.3718067
## 0.08869565 0.8611944 0.1557603
## 0.09391304 0.8600250 0.1309640
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07478261.
The output presents the results of training a CART decision tree (method = "rpart") on the churn training set of 4,000 samples and 19 predictors. The model was evaluated using bootstrapped resampling with 25 repetitions across three candidate values of the complexity parameter (cp). The results show that as cp increases, accuracy decreases slightly, while Kappa, a chance-corrected measure of agreement between predicted and actual classifications, drops sharply. The optimal model was selected on the largest accuracy, corresponding to cp = 0.07478261, which achieved an accuracy of about 88.3% and a Kappa of about 0.37. The wide gap between accuracy and Kappa reflects the class imbalance (roughly 14% churners): the model is right most of the time largely by predicting "no", and its ability to distinguish churners from non-churners could still be improved.
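The full resampling table behind this summary, and the selected tuning value, are stored on the train object; a minimal sketch:

dt_model$results    # data frame of cp, Accuracy, Kappa and their standard deviations
dt_model$bestTune   # the cp value caret selected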
print(dt_model$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.1437500 0.8562500)
## 2) total_day_minutes>=264.45 264 105 yes (0.6022727 0.3977273)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.7611940 0.2388060) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.0952381 0.9047619) *
## 3) total_day_minutes< 264.45 3736 416 no (0.1113490 0.8886510) *
dt_predict <- predict(dt_model, newdata = test_data, na.action = na.omit, type = "prob")
head(dt_predict, 5)
## yes no
## 1 0.111349 0.888651
## 2 0.111349 0.888651
## 3 0.111349 0.888651
## 4 0.111349 0.888651
## 5 0.111349 0.888651
The output shows the predicted class probabilities for the first five test instances. Each row holds two values: the probability of the “yes” class (churn) and of the “no” class (non-churn). All five rows are identical, with a high likelihood (≈ 88.87%) of belonging to the “no” class and a low probability (≈ 11.13%) for the “yes” class, because all five customers fall into the same terminal node of the shallow final tree (node 3: total_day_minutes < 264.45). This uniformity indicates the tree is too simple to differentiate among these instances, a symptom of insufficient complexity (underfitting) rather than overfitting, compounded by the class imbalance.
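Hard class labels can be derived from these probabilities with a custom cutoff, which is useful when the default 0.5 threshold is too conservative for the minority churn class; a sketch (the threshold value is illustrative):

threshold <- 0.5                          # lower this to flag more potential churners
manual_pred <- factor(ifelse(dt_predict$yes > threshold, "yes", "no"),
                      levels = levels(test_data$churn))
head(manual_pred, 5)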
dt_predict2 <- predict(dt_model, newdata = test_data, type = "raw")
head(dt_predict2, 5)
## [1] no no no no no
## Levels: yes no
The output shows the predicted class labels for the test dataset using the trained decision tree model. Unlike the previous prediction, which provided probability scores, this output directly assigns each instance to the most probable class. In this case, all five instances are classified as “no” (non-churn).
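These predicted labels can be scored against the true test labels with caret's confusionMatrix(); a minimal sketch:

confusionMatrix(dt_predict2, test_data$churn)  # accuracy, Kappa, sensitivity, specificity, etc.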
dt_model_tune <- train(churn ~ ., data = train_data, method = "rpart", metric = "Accuracy", tuneLength = 8)
print(dt_model_tune$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443)
## 30) total_day_minutes>=244.65 160 44 no (0.27500000 0.72500000)
## 60) total_eve_minutes>=205.6 72 31 yes (0.56944444 0.43055556)
## 120) voice_mail_planyes< 0.5 54 14 yes (0.74074074 0.25925926) *
## 121) voice_mail_planyes>=0.5 18 1 no (0.05555556 0.94444444) *
## 61) total_eve_minutes< 205.6 88 3 no (0.03409091 0.96590909) *
## 31) total_day_minutes< 244.65 2964 108 no (0.03643725 0.96356275)
## 62) total_eve_minutes>=267.2 290 40 no (0.13793103 0.86206897)
## 124) total_day_minutes>=221.75 37 10 yes (0.72972973 0.27027027) *
## 125) total_day_minutes< 221.75 253 13 no (0.05138340 0.94861660) *
## 63) total_eve_minutes< 267.2 2674 68 no (0.02543007 0.97456993) *
The argument method = "rpart" specifies that the model will use the rpart decision tree algorithm, while metric = "Accuracy" ensures that model selection is based on the highest classification accuracy. The new argument is tuneLength = 8, which tells caret to evaluate eight candidate cp values instead of the default three; the smaller cp values admitted by this wider search allow a much deeper tree than the first model.
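The resampling scheme can also be changed from the default 25-rep bootstrap through trainControl(); a sketch, assuming 10-fold cross-validation:

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV instead of bootstrapping
dt_model_cv <- train(churn ~ ., data = train_data, method = "rpart",
                     metric = "Accuracy", tuneLength = 8, trControl = ctrl)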
dt_model_tune2 <- train(churn ~ ., data = train_data, method = "rpart", tuneGrid = expand.grid(cp = seq(0, 0.1, 0.01)))
print(dt_model_tune2$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237)
## 16) total_night_minutes>=115.85 170 20 yes (0.88235294 0.11764706) *
## 17) total_night_minutes< 115.85 7 0 no (0.00000000 1.00000000) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608)
## 26) total_eve_minutes< 155.5 30 11 yes (0.63333333 0.36666667) *
## 27) total_eve_minutes>=155.5 151 25 no (0.16556291 0.83443709) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443)
## 30) total_day_minutes>=244.65 160 44 no (0.27500000 0.72500000)
## 60) total_eve_minutes>=205.6 72 31 yes (0.56944444 0.43055556)
## 120) voice_mail_planyes< 0.5 54 14 yes (0.74074074 0.25925926) *
## 121) voice_mail_planyes>=0.5 18 1 no (0.05555556 0.94444444) *
## 61) total_eve_minutes< 205.6 88 3 no (0.03409091 0.96590909) *
## 31) total_day_minutes< 244.65 2964 108 no (0.03643725 0.96356275)
## 62) total_eve_minutes>=267.2 290 40 no (0.13793103 0.86206897)
## 124) total_day_minutes>=221.75 37 10 yes (0.72972973 0.27027027)
## 248) voice_mail_planyes< 0.5 30 3 yes (0.90000000 0.10000000) *
## 249) voice_mail_planyes>=0.5 7 0 no (0.00000000 1.00000000) *
## 125) total_day_minutes< 221.75 253 13 no (0.05138340 0.94861660) *
## 63) total_eve_minutes< 267.2 2674 68 no (0.02543007 0.97456993) *
The tuneGrid argument allows manual specification of the hyperparameter values to try. In this case, expand.grid(cp = seq(0, 0.1, 0.01)) tests complexity parameter (cp) values from 0 to 0.1 in increments of 0.01. If tuneGrid is not provided, caret builds its own grid of candidate cp values (tuneLength of them). The terminal nodes in the output, marked by *, are the leaves where no further split occurs, meaning the decision tree has reached its stopping criteria there.
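The accuracy profile over the candidate cp grid can be plotted directly from the train object; a minimal sketch:

plot(dt_model_tune2)     # resampled accuracy (y-axis) versus cp (x-axis)
dt_model_tune2$bestTune  # the winning cp value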
dt_model_preprune <- train(churn ~ ., data = train_data, method = "rpart", metric = "Accuracy", tuneLength = 8, control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 5))
print(dt_model_preprune$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443) *
The control argument in train() sets constraints on tree growth using rpart.control(). Here, minsplit=50 requires at least 50 observations to split a node, minbucket=20 ensures terminal nodes have at least 20 observations, and maxdepth=5 limits the tree to 5 levels. These pre-pruning rules help prevent overfitting and improve generalization.
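A fitted tree can also be post-pruned after training; a sketch, assuming rpart's prune() function with a cp larger than the improvement of any split, which collapses the tree all the way back to its root:

library(rpart)
dt_postpruned <- prune(dt_model$finalModel, cp = 0.2)  # cp = 0.2 exceeds every split's gain
print(dt_postpruned)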
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.1437500 0.8562500) *
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(dt_model$finalModel)
The root node classifies everything as “no.” The first split is on total day minutes: instances with at least 264.45 day minutes (shown rounded in the plot) move to the “yes” branch, while the rest remain “no.” Within the “yes” branch a second split uses the voice mail plan dummy variable: customers without a voice mail plan (voice_mail_planyes < 0.5) stay “yes,” whereas those with one are classified “no.”
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.1
prp(dt_model$finalModel)        # compact display: split rules and node classes only
rpart.plot(dt_model$finalModel) # richer defaults: node class, class probability, and % of observations
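Both functions accept many display options; a sketch, assuming rpart.plot's type and extra arguments (the values here are illustrative):

rpart.plot(dt_model_tune$finalModel,
           type = 4,    # label all nodes, with split labels below them
           extra = 104) # show class, per-class probabilities, and % of observations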