library(C50)
library(modeldata)
library(rsample)
dataset <- mlc_churn
# Split the dataset into training (80%) and testing (20%) sets
split <- initial_split(dataset, prop = 0.8)
# Extract training and testing sets
train_data <- training(split)
test_data <- testing(split)
# Convert target variable (churn) to a factor (ensuring it's a classification problem)
train_data$churn <- as.factor(train_data$churn)
test_data$churn <- as.factor(test_data$churn)
train_data
## # A tibble: 4,000 × 20
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 LA 150 area_code_415 no no
## 2 HI 111 area_code_408 no yes
## 3 IA 46 area_code_510 no no
## 4 ID 92 area_code_415 no yes
## 5 IA 113 area_code_408 no no
## 6 TX 50 area_code_510 no no
## 7 KY 59 area_code_510 no no
## 8 IL 131 area_code_510 no no
## 9 WY 90 area_code_408 no yes
## 10 UT 89 area_code_415 no no
## # ℹ 3,990 more rows
## # ℹ 15 more variables: number_vmail_messages <int>, total_day_minutes <dbl>,
## # total_day_calls <int>, total_day_charge <dbl>, total_eve_minutes <dbl>,
## # total_eve_calls <int>, total_eve_charge <dbl>, total_night_minutes <dbl>,
## # total_night_calls <int>, total_night_charge <dbl>,
## # total_intl_minutes <dbl>, total_intl_calls <int>, total_intl_charge <dbl>,
## # number_customer_service_calls <int>, churn <fct>
test_data
## # A tibble: 1,000 × 20
## state account_length area_code international_plan voice_mail_plan
## <fct> <int> <fct> <fct> <fct>
## 1 FL 147 area_code_415 no no
## 2 SC 111 area_code_415 no no
## 3 AZ 12 area_code_408 no no
## 4 AK 36 area_code_408 no yes
## 5 NJ 149 area_code_408 no no
## 6 GA 98 area_code_408 no no
## 7 AR 34 area_code_510 no no
## 8 ID 119 area_code_415 no no
## 9 IA 52 area_code_408 no no
## 10 IN 81 area_code_408 no no
## # ℹ 990 more rows
## # ℹ 15 more variables: number_vmail_messages <int>, total_day_minutes <dbl>,
## # total_day_calls <int>, total_day_charge <dbl>, total_eve_minutes <dbl>,
## # total_eve_calls <int>, total_eve_charge <dbl>, total_night_minutes <dbl>,
## # total_night_calls <int>, total_night_charge <dbl>,
## # total_intl_minutes <dbl>, total_intl_calls <int>, total_intl_charge <dbl>,
## # number_customer_service_calls <int>, churn <fct>
In the output above it can be seen that the dataset was split into an 80% training set (4,000 rows) and a 20% testing set (1,000 rows).
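Note that initial_split() draws a random sample, so the rows shown above will differ from run to run. A minimal sketch of a reproducible split (the seed value is arbitrary):

set.seed(123)                               # any fixed seed makes the split reproducible
split <- initial_split(dataset, prop = 0.8)
nrow(training(split))                       # 4000 rows (80%)
nrow(testing(split))                        # 1000 rows (20%)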
library(caret)
## Warning: package 'caret' was built under R version 4.4.1
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
dt_model <- train(churn ~ ., data = train_data, metric = "Accuracy", method = "rpart")
print(dt_model)
## CART
##
## 4000 samples
## 19 predictor
## 2 classes: 'yes', 'no'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 4000, 4000, 4000, 4000, 4000, 4000, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.07478261 0.8833413 0.3718067
## 0.08869565 0.8611944 0.1557603
## 0.09391304 0.8600250 0.1309640
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.07478261.
The output presents the results of training a CART decision tree (method = "rpart") on the churn training set of 4,000 samples and 19 predictors. The model was evaluated using bootstrapped resampling with 25 repetitions across three candidate values of the complexity parameter (cp). The results show that as cp increases, accuracy decreases slightly, while Kappa, a chance-corrected measure of agreement between predicted and actual classifications, drops sharply. The optimal model was selected on the largest accuracy, corresponding to cp = 0.07478261, which achieved an accuracy of about 88.3% and a Kappa of about 0.37. The wide gap between accuracy and Kappa reflects the class imbalance (roughly 14% churners): the model is right most of the time largely by predicting "no", and its ability to distinguish churners from non-churners could still be improved.
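The full resampling table behind this summary, and the selected tuning value, are stored on the train object; a minimal sketch:

dt_model$results    # data frame of cp, Accuracy, Kappa and their standard deviations
dt_model$bestTune   # the cp value caret selected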
print(dt_model$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.1437500 0.8562500)
## 2) total_day_minutes>=264.45 264 105 yes (0.6022727 0.3977273)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.7611940 0.2388060) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.0952381 0.9047619) *
## 3) total_day_minutes< 264.45 3736 416 no (0.1113490 0.8886510) *
dt_predict <- predict(dt_model, newdata = test_data, na.action = na.omit, type = "prob")
head(dt_predict, 5)
## yes no
## 1 0.111349 0.888651
## 2 0.111349 0.888651
## 3 0.111349 0.888651
## 4 0.111349 0.888651
## 5 0.111349 0.888651
The output shows the predicted class probabilities for the first five test instances. Each row holds two values: the probability of the “yes” class (churn) and of the “no” class (non-churn). All five rows are identical, with a high likelihood (≈ 88.87%) of belonging to the “no” class and a low probability (≈ 11.13%) for the “yes” class, because all five customers fall into the same terminal node of the shallow final tree (node 3: total_day_minutes < 264.45). This uniformity indicates the tree is too simple to differentiate among these instances, a symptom of insufficient complexity (underfitting) rather than overfitting, compounded by the class imbalance.
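Hard class labels can be derived from these probabilities with a custom cutoff, which is useful when the default 0.5 threshold is too conservative for the minority churn class; a sketch (the threshold value is illustrative):

threshold <- 0.5                          # lower this to flag more potential churners
manual_pred <- factor(ifelse(dt_predict$yes > threshold, "yes", "no"),
                      levels = levels(test_data$churn))
head(manual_pred, 5)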
dt_predict2 <- predict(dt_model, newdata = test_data, type = "raw")
head(dt_predict2, 5)
## [1] no no no no no
## Levels: yes no
The output shows the predicted class labels for the test dataset using the trained decision tree model. Unlike the previous prediction, which provided probability scores, this output directly assigns each instance to the most probable class. In this case, all five instances are classified as “no” (non-churn).
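These predicted labels can be scored against the true test labels with caret's confusionMatrix(); a minimal sketch:

confusionMatrix(dt_predict2, test_data$churn)  # accuracy, Kappa, sensitivity, specificity, etc.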
dt_model_tune <- train(churn ~ ., data = train_data, method = "rpart", metric = "Accuracy", tuneLength = 8)
print(dt_model_tune$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443)
## 30) total_day_minutes>=244.65 160 44 no (0.27500000 0.72500000)
## 60) total_eve_minutes>=205.6 72 31 yes (0.56944444 0.43055556)
## 120) voice_mail_planyes< 0.5 54 14 yes (0.74074074 0.25925926) *
## 121) voice_mail_planyes>=0.5 18 1 no (0.05555556 0.94444444) *
## 61) total_eve_minutes< 205.6 88 3 no (0.03409091 0.96590909) *
## 31) total_day_minutes< 244.65 2964 108 no (0.03643725 0.96356275)
## 62) total_eve_minutes>=267.2 290 40 no (0.13793103 0.86206897)
## 124) total_day_minutes>=221.75 37 10 yes (0.72972973 0.27027027) *
## 125) total_day_minutes< 221.75 253 13 no (0.05138340 0.94861660) *
## 63) total_eve_minutes< 267.2 2674 68 no (0.02543007 0.97456993) *
The argument method = "rpart" specifies that the model will use the rpart decision tree algorithm, while metric = "Accuracy" ensures that model selection is based on the highest classification accuracy. The new argument is tuneLength = 8, which tells caret to evaluate eight candidate cp values instead of the default three; the smaller cp values admitted by this wider search allow a much deeper tree than the first model.
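The resampling scheme can also be changed from the default 25-rep bootstrap through trainControl(); a sketch, assuming 10-fold cross-validation:

ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV instead of bootstrapping
dt_model_cv <- train(churn ~ ., data = train_data, method = "rpart",
                     metric = "Accuracy", tuneLength = 8, trControl = ctrl)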
dt_model_tune2 <- train(churn ~ ., data = train_data, method = "rpart", tuneGrid = expand.grid(cp = seq(0, 0.1, 0.01)))
print(dt_model_tune2$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237)
## 16) total_night_minutes>=115.85 170 20 yes (0.88235294 0.11764706) *
## 17) total_night_minutes< 115.85 7 0 no (0.00000000 1.00000000) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608)
## 26) total_eve_minutes< 155.5 30 11 yes (0.63333333 0.36666667) *
## 27) total_eve_minutes>=155.5 151 25 no (0.16556291 0.83443709) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443)
## 30) total_day_minutes>=244.65 160 44 no (0.27500000 0.72500000)
## 60) total_eve_minutes>=205.6 72 31 yes (0.56944444 0.43055556)
## 120) voice_mail_planyes< 0.5 54 14 yes (0.74074074 0.25925926) *
## 121) voice_mail_planyes>=0.5 18 1 no (0.05555556 0.94444444) *
## 61) total_eve_minutes< 205.6 88 3 no (0.03409091 0.96590909) *
## 31) total_day_minutes< 244.65 2964 108 no (0.03643725 0.96356275)
## 62) total_eve_minutes>=267.2 290 40 no (0.13793103 0.86206897)
## 124) total_day_minutes>=221.75 37 10 yes (0.72972973 0.27027027)
## 248) voice_mail_planyes< 0.5 30 3 yes (0.90000000 0.10000000) *
## 249) voice_mail_planyes>=0.5 7 0 no (0.00000000 1.00000000) *
## 125) total_day_minutes< 221.75 253 13 no (0.05138340 0.94861660) *
## 63) total_eve_minutes< 267.2 2674 68 no (0.02543007 0.97456993) *
The tuneGrid argument allows manual specification of the hyperparameter values to try. In this case, expand.grid(cp = seq(0, 0.1, 0.01)) tests complexity parameter (cp) values from 0 to 0.1 in increments of 0.01. If tuneGrid is not provided, caret builds its own grid of candidate cp values (tuneLength of them). The terminal nodes in the output, marked by *, are the leaves where no further split occurs, meaning the decision tree has reached its stopping criteria there.
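The accuracy profile over the candidate cp grid can be plotted directly from the train object; a minimal sketch:

plot(dt_model_tune2)     # resampled accuracy (y-axis) versus cp (x-axis)
dt_model_tune2$bestTune  # the winning cp value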
dt_model_preprune <- train(churn ~ ., data = train_data, method = "rpart", metric = "Accuracy", tuneLength = 8, control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 5))
print(dt_model_preprune$finalModel)
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.14375000 0.85625000)
## 2) total_day_minutes>=264.45 264 105 yes (0.60227273 0.39772727)
## 4) voice_mail_planyes< 0.5 201 48 yes (0.76119403 0.23880597)
## 8) total_eve_minutes>=138.25 177 27 yes (0.84745763 0.15254237) *
## 9) total_eve_minutes< 138.25 24 3 no (0.12500000 0.87500000) *
## 5) voice_mail_planyes>=0.5 63 6 no (0.09523810 0.90476190) *
## 3) total_day_minutes< 264.45 3736 416 no (0.11134904 0.88865096)
## 6) number_customer_service_calls>=3.5 295 144 no (0.48813559 0.51186441)
## 12) total_day_minutes< 160.25 114 14 yes (0.87719298 0.12280702) *
## 13) total_day_minutes>=160.25 181 44 no (0.24309392 0.75690608) *
## 7) number_customer_service_calls< 3.5 3441 272 no (0.07904679 0.92095321)
## 14) international_planyes>=0.5 317 120 no (0.37854890 0.62145110)
## 28) total_intl_calls< 2.5 64 0 yes (1.00000000 0.00000000) *
## 29) total_intl_calls>=2.5 253 56 no (0.22134387 0.77865613)
## 58) total_intl_minutes>=13.05 49 0 yes (1.00000000 0.00000000) *
## 59) total_intl_minutes< 13.05 204 7 no (0.03431373 0.96568627) *
## 15) international_planyes< 0.5 3124 152 no (0.04865557 0.95134443) *
The control argument in train() sets constraints on tree growth using rpart.control(). Here, minsplit=50 requires at least 50 observations to split a node, minbucket=20 ensures terminal nodes have at least 20 observations, and maxdepth=5 limits the tree to 5 levels. These pre-pruning rules help prevent overfitting and improve generalization.
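A fitted tree can also be post-pruned after training; a sketch, assuming rpart's prune() function with a cp larger than the improvement of any split, which collapses the tree all the way back to its root:

library(rpart)
dt_postpruned <- prune(dt_model$finalModel, cp = 0.2)  # cp = 0.2 exceeds every split's gain
print(dt_postpruned)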
## n= 4000
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 4000 575 no (0.1437500 0.8562500) *
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(dt_model$finalModel)
The root node classifies everything as “no.” The first split is on total day minutes: instances with at least 264.45 day minutes (shown rounded in the plot) move to the “yes” branch, while the rest remain “no.” Within the “yes” branch a second split uses the voice mail plan dummy variable: customers without a voice mail plan (voice_mail_planyes < 0.5) stay “yes,” whereas those with one are classified “no.”
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.1
prp(dt_model$finalModel)        # compact display: split rules and node classes only
rpart.plot(dt_model$finalModel) # richer defaults: node class, class probability, and % of observations
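Both functions accept many display options; a sketch, assuming rpart.plot's type and extra arguments (the values here are illustrative):

rpart.plot(dt_model_tune$finalModel,
           type = 4,    # label all nodes, with split labels below them
           extra = 104) # show class, per-class probabilities, and % of observations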