Part I

1.1 Assignment Introduction

  1. Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.

  2. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.

  3. To complete this task:

  1. Describe the problem you are trying to solve.
  2. Describe your datases and what you did to prepare the data for analysis.
  3. Methodologies you used for analyzing the data
  4. What’s the purpose of the analysis performed
  5. Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

1.2 About Dataset

Telco Customer Churn:

  • The Telco customer churn data contains information about a fictional telco company that provided home phone and Internet services to 7043 customers in California in Q3. It indicates which customers have left, stayed, or signed up for their service. Multiple important demographics are included for each customer, as well as a Satisfaction Score, Churn Score, and Customer Lifetime Value (CLTV) index.

1.3 The data set includes information about:

  • CustomerID: A unique ID that identifies each customer.

  • Gender: The customer’s gender: Male, Female

  • Senior Citizen: Indicates if the customer is 65 or older: Yes, No

  • Partner: Indicates if the customer is married: Yes, No

  • Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.

  • Number of Dependents: Indicates the number of dependents that live with the customer.

  • Tenure: Indicates the total amount of months that the customer has been with the company by the end of the quarter specified above.

  • Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No

  • Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No

  • Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.

  • Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No

  • Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No

  • Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No

  • Premium Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No

  • Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes, No. The company does not charge an additional fee for this service.

  • Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No. The company does not charge an additional fee for this service.

  • Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.

  • Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No

  • Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check

  • Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.

  • Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.

  • Churn: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.

This dataset is from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113

1.4 Problem Description:

In subscription-based businesses such as telecommunications, customer churn represents a major threat to revenue and long-term growth. Acquiring a new customer is typically more expensive than retaining an existing one, so the ability to identify high-risk customers in advance can directly reduce marketing costs and increase profitability.

The business problem in this project is: How can a telecom provider predict which customers are likely to cancel their service, so that it can intervene proactively?

PART II: Summary of Data and Data Cleaning

# Load the libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(tidyverse)
## Warning: package 'readr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.4     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(kernlab)
## 
## Attaching package: 'kernlab'
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
# View the data
df <- read.csv('https://raw.githubusercontent.com/Jennyjjxxzz/HW4/refs/heads/main/WA_Fn-UseC_-Telco-Customer-Churn.csv')

head(df)
dim(df)
## [1] 7043   21
summary(df)
##   customerID           gender          SeniorCitizen      Partner         
##  Length:7043        Length:7043        Min.   :0.0000   Length:7043       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.1621                     
##                                        3rd Qu.:0.0000                     
##                                        Max.   :1.0000                     
##                                                                           
##   Dependents            tenure      PhoneService       MultipleLines     
##  Length:7043        Min.   : 0.00   Length:7043        Length:7043       
##  Class :character   1st Qu.: 9.00   Class :character   Class :character  
##  Mode  :character   Median :29.00   Mode  :character   Mode  :character  
##                     Mean   :32.37                                        
##                     3rd Qu.:55.00                                        
##                     Max.   :72.00                                        
##                                                                          
##  InternetService    OnlineSecurity     OnlineBackup       DeviceProtection  
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TechSupport        StreamingTV        StreamingMovies      Contract        
##  Length:7043        Length:7043        Length:7043        Length:7043       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  PaperlessBilling   PaymentMethod      MonthlyCharges    TotalCharges   
##  Length:7043        Length:7043        Min.   : 18.25   Min.   :  18.8  
##  Class :character   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
##  Mode  :character   Mode  :character   Median : 70.35   Median :1397.5  
##                                        Mean   : 64.76   Mean   :2283.3  
##                                        3rd Qu.: 89.85   3rd Qu.:3794.7  
##                                        Max.   :118.75   Max.   :8684.8  
##                                                         NA's   :11      
##     Churn          
##  Length:7043       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
str(df)
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
colSums(is.na(df))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0
# Convert TotalCharges imports as character
df$TotalCharges <- as.numeric(df$TotalCharges)

# Drop missing values
df <- df %>% drop_na()

PART III: Exploratory Data Analysis

3.1: Churn Disrribution

The churn rate is visibly imbalanced: the majority of customers did not churn, while a smaller portion (approximately 26–27%) did. This imbalance is typical in churn datasets. And accuracy alone is not a reliable metric (model may predict “No Churn” for everyone).Metrics like precision, recall, F1-score, and ROC-AUC will provide more meaningful evaluation.

df %>%
  count(Churn) %>%
  ggplot(aes(x = Churn, y = n, fill = Churn)) +
  geom_col() +
  labs(title = "Churn Distribution")

3.2: Tenure Distribution by Churn

  • Customers who churn tend to have much lower tenure, meaning they leave early in their relationship with the company.

  • New customers may be at higher risk of cancelling.

ggplot(df, aes(x = tenure, color = Churn)) +
  geom_density(size = 1.2) +
  labs(title = "Tenure Distribution by Churn",
       x = "Months with Company",
       y = "Density") +
  scale_color_manual(values = c("No" = "#1f77b4", "Yes" = "#d62728"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

3.3 Monthly Charges by Churn

  • Customers who churn tend to have higher monthly charges.

  • This suggests a potential price-sensitivity problem: Customers facing higher monthly costs may feel dissatisfied.

  • Discounts or flexible plans could reduce churn among this group.

ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Churn")

3.4 Churn Rate by Tenure Group

  • Month-to-Month customers have the highest churn rate

  • One-year and two-year contracts show significantly high churn

  • This suggests customers with short-term contracts feel less committed or more dissatisfied.

  • Business implication: Offering promotions or incentives to encourage monthly customers to upgrade to annual contracts may significantly reduce churn.

df <- df %>%
  mutate(tenure_group = case_when(
    tenure <= 12 ~ "0-1 year",
    tenure <= 24 ~ "1-2 years",
    tenure <= 36 ~ "2-3 years",
    tenure <= 48 ~ "3-4 years",
    tenure <= 60 ~ "4-5 years",
    TRUE ~ "5-6 years"
  ))

ggplot(df, aes(x = tenure_group, fill = Churn)) +
  geom_bar(position = "fill") +
  labs(title = "Churn Rate by Tenure Group",
       y = "Proportion",
       x = "Tenure Group") +
  scale_fill_manual(values=c("#1f77b4","#d62728")) +
  theme_minimal(base_size = 14)

### 3.5 MonthlyCharges vs Tenure

ggplot(df, aes(x = tenure, y = MonthlyCharges, color = Churn)) +
  geom_point(alpha = 0.4) +
  labs(title = "Monthly Charges vs Tenure",
       x = "Tenure (Months)",
       y = "Monthly Charges") +
  scale_color_manual(values=c("#1f77b4","#d62728")) +
  theme_minimal(base_size = 14)

3.6 Churn Proportions Across Customer Demographic

  • Dependents: Customers without dependents have a higher churn proportion, similar to the partner pattern.

  • Partner: shows that customers without partners churn more often, possibly reflecting differences in household stability or financial constraints.

  • Gender shows almost no difference in churn rate between male and female customers, suggesting that gender is not an important predictor.

  • Senior Citizens churn at a substantially higher rate than non-senior customers, indicating that older customers may be less satisfied with services or more price-sensitive.

# Choose categorical variables to include
cat_vars <- c("gender", 
              "SeniorCitizen", 
              "Partner",
              "Dependents")

# Convert SeniorCitizen to Yes/No for readability
df$SeniorCitizen <- ifelse(df$SeniorCitizen == 1, "Yes", "No")

# Reshape data to long format
df_long <- df %>%
  select(all_of(cat_vars), Churn) %>%
  pivot_longer(cols = all_of(cat_vars),
               names_to = "variable",
               values_to = "value")

# bar plot
ggplot(df_long, aes(x = value, fill = Churn)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "Churn Proportions Across Customer Demographics",
       x = "",
       y = "Proportion",
       fill = "Churn") +
  scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
  theme_minimal(base_size = 14) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))

### 3.7 Churn Proportion Across Add-on and Internet Services

  • DeviceProtection: Slightly higher churn among customers without protection plans.

  • OnlineBackup: customers without this service churn more.

  • StreamingTV: Similar churn rates for “No” and “Yes”. Streaming services do not strongly influence churn.

  • InternetService: Fiber optic customers show the highest churn proportion. DSL users churn less. Customers with no internet service churn the least. This reflecting dissatisfaction with Fiber.

  • OnlineSecurity: Customers without Online Security churn at much higher rates. Customers with security add-ons churn less.

  • StreamingTV: Similar churn rates for “No” and “Yes”.

  • Multiple Lines: Customers with multiple phone lines churn slightly more than customers with a single line or no phone service, possibly reflecting dissatisfaction with bundled service plans.

  • Phone Service: does not appear to influence churn significantly.

  • TechSupport: Customers with no tech support churn disproportionately more.

service_vars <- c("InternetService",
                  "OnlineSecurity",
                  "OnlineBackup",
                  "DeviceProtection",
                  "TechSupport",
                  "StreamingTV",
                  "StreamingMovies",
                  "MultipleLines",
                  "PhoneService")

df_services_long <- df %>%
  select(all_of(service_vars), Churn) %>%
  pivot_longer(cols = all_of(service_vars),
               names_to = "Service",
               values_to = "Status")

ggplot(df_services_long, aes(x = Status, fill = Churn)) +
  geom_bar(position = "fill") +
  facet_wrap(~ Service, scales = "free_x") +
  labs(title = "Churn Proportion Across Add-on and Internet Services",
       x = "",
       y = "Proportion") +
  scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
  theme_minimal(base_size = 9) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        strip.text = element_text(face = "bold"))

3.8 Churn Proportion Across Contract, and Billing

  • Contract: Month-to-month contracts have the highest churn rate.

  • PaperlessBilling: Customers with paperless billing churn much more.

  • PaymentMethod: Electronic Check customers churn the most. Automatic payments via credit card or bank transfer churn the least.

final_vars <- c("Contract",
                "PaperlessBilling",
                "PaymentMethod")

df_final_long <- df %>%
  select(all_of(final_vars), Churn) %>%
  pivot_longer(cols = all_of(final_vars),
               names_to = "variable",
               values_to = "value")

ggplot(df_final_long, aes(x = value, fill = Churn)) +
  geom_bar(position = "fill") +
  facet_wrap(~ variable, scales = "free_x") +
  labs(title = "Churn Proportion Across Contract, and Billing",
       x = "",
       y = "Proportion",
       fill = "Churn") +
  scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1),
        strip.text = element_text(face = "bold"))

3.9 Key Insights From the Heatmap

  • Tenure is strongly correlated with TotalCharges, because customers who stay longer accumulate higher total charges.

  • There are moderate positive correlations between tech-support–related services, such as OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport.

  • Churn shows small but notable correlations with several predictors, including Contract, MonthlyCharges, InternetService, and SeniorCitizen

df <- df %>% select(-tenure_group)

# Convert categorical to numeric temporarily
df_numeric <- df

for(col in names(df_numeric)){
  if(is.character(df_numeric[[col]]) | is.factor(df_numeric[[col]])){
    df_numeric[[col]] <- as.numeric(as.factor(df_numeric[[col]]))
  }
}

cor_mat <- cor(df_numeric)

corrplot(cor_mat,
         method = "color",
         type = "lower",
         tl.col = "black",
         tl.cex = 0.6,
         number.cex = 0.4,
         addCoef.col = "black",
         col = colorRampPalette(c("blue","white","red"))(200))

PART IV: Modeling

4.1 Setup & Data Prep

# Remove ID column

df_model <- df %>%
  select(-customerID)

# Make sure Churn is a factor with "Yes" as positive class
df_model$Churn <- factor(df_model$Churn,
                         levels = c("Yes", "No"))

str(df_model$Churn)
##  Factor w/ 2 levels "Yes","No": 2 2 1 2 1 1 2 2 1 2 ...
# Train / test split (70/30)
set.seed(123)
train_index <- createDataPartition(df_model$Churn, p = 0.7, list = FALSE)

train <- df_model[train_index, ]
test  <- df_model[-train_index, ]
# Count unique values for each column in train
unique_levels <- sapply(train, function(x) length(unique(x)))

# Identify columns that have only 1 unique value
single_level_vars <- names(unique_levels[unique_levels == 1])

# Drop them from both train and test
if (length(single_level_vars) > 0) {
  train <- train %>% dplyr::select(-dplyr::all_of(single_level_vars))
  test  <- test  %>% dplyr::select(-dplyr::all_of(single_level_vars))
}
# Set up cross-validation

ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  classProbs = TRUE,
  summaryFunction = twoClassSummary, 
  savePredictions = "final"
)
evaluate_model <- function(model, test){
  preds_prob <- predict(model, test, type = "prob")[, "Yes"]
  preds_class <- predict(model, test)

  cm <- confusionMatrix(preds_class, test$Churn, positive = "Yes")

  auc <- roc(test$Churn, preds_prob)$auc

  tibble(
    Accuracy  = cm$overall["Accuracy"],
    Precision = cm$byClass["Precision"],
    Recall    = cm$byClass["Recall"],
    F1_Score  = cm$byClass["F1"],
    AUC       = as.numeric(auc)
  )
}

4.2 Random Forest

# Random Forest

set.seed(123)

rf_grid <- expand.grid(
  mtry = c(3, 5, 7, 9)
)

fit_rf <- train(
  Churn ~ ., data = train,
  method = "rf",
  metric = "ROC",
  trControl = ctrl,
  tuneGrid = rf_grid,
  ntree = 500
)
rf_pred_class <- predict(fit_rf, newdata = test)
rf_pred_prob  <- predict(fit_rf, newdata = test, type = "prob")[, "Yes"]

confusionMatrix(rf_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  264  134
##        No   296 1414
##                                          
##                Accuracy : 0.796          
##                  95% CI : (0.7782, 0.813)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 2.663e-11      
##                                          
##                   Kappa : 0.424          
##                                          
##  Mcnemar's Test P-Value : 8.222e-15      
##                                          
##             Sensitivity : 0.4714         
##             Specificity : 0.9134         
##          Pos Pred Value : 0.6633         
##          Neg Pred Value : 0.8269         
##              Prevalence : 0.2657         
##          Detection Rate : 0.1252         
##    Detection Prevalence : 0.1888         
##       Balanced Accuracy : 0.6924         
##                                          
##        'Positive' Class : Yes            
## 
roc_rf <- roc(response = test$Churn,
              predictor = rf_pred_prob,
              levels   = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_rf)
## Area under the curve: 0.8228

4.3 XGBoost

# XGBoost

set.seed(123)

xgb_grid <- expand.grid(
  nrounds = 200,
  max_depth = c(3, 5, 7),
  eta = c(0.05, 0.1),
  gamma = 0,
  colsample_bytree = 0.8,
  min_child_weight = 1,
  subsample = 0.8
)

fit_xgb <- train(
  Churn ~ ., data = train,
  method = "xgbTree",
  metric = "ROC",
  trControl = ctrl,
  tuneGrid = xgb_grid
)

fit_xgb
## eXtreme Gradient Boosting 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ... 
## Resampling results across tuning parameters:
## 
##   eta   max_depth  ROC        Sens       Spec     
##   0.05  3          0.8528526  0.5416191  0.8986169
##   0.05  5          0.8466158  0.5271095  0.8947441
##   0.05  7          0.8395332  0.5183177  0.8890733
##   0.10  3          0.8474664  0.5335922  0.8923928
##   0.10  5          0.8377449  0.5217543  0.8849239
##   0.10  7          0.8291069  0.5049531  0.8822960
## 
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning
## 
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## 
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 200, max_depth = 3, eta
##  = 0.05, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and
##  subsample = 0.8.
plot(fit_xgb)

xgb_pred_class <- predict(fit_xgb, newdata = test)
xgb_pred_prob  <- predict(fit_xgb, newdata = test, type = "prob")[, "Yes"]

confusionMatrix(xgb_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  286  141
##        No   274 1407
##                                           
##                Accuracy : 0.8031          
##                  95% CI : (0.7855, 0.8199)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : 9.978e-14       
##                                           
##                   Kappa : 0.454           
##                                           
##  Mcnemar's Test P-Value : 9.195e-11       
##                                           
##             Sensitivity : 0.5107          
##             Specificity : 0.9089          
##          Pos Pred Value : 0.6698          
##          Neg Pred Value : 0.8370          
##              Prevalence : 0.2657          
##          Detection Rate : 0.1357          
##    Detection Prevalence : 0.2026          
##       Balanced Accuracy : 0.7098          
##                                           
##        'Positive' Class : Yes             
## 
roc_xgb <- roc(test$Churn, xgb_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_xgb)
## Area under the curve: 0.8354

4.4 SVM – Linear (Tuned)

# SVM – Linear (Tuned)

set.seed(123)

svm_linear_grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 5, 10))

fit_svm_linear <- train(
  Churn ~ ., data = train,
  method = "svmLinear",
  metric = "ROC",
  trControl = ctrl,
  tuneGrid = svm_linear_grid
)

print(fit_svm_linear)
## Support Vector Machines with Linear Kernel 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ... 
## Resampling results across tuning parameters:
## 
##   C      ROC        Sens       Spec     
##   1e-03  0.8404066  0.6470489  0.8478562
##   1e-02  0.8419028  0.5206121  0.9048409
##   1e-01  0.8401097  0.5160261  0.9048409
##   1e+00  0.8391409  0.5133500  0.9059474
##   5e+00  0.8389771  0.5125881  0.9055325
##   1e+01  0.8389854  0.5122050  0.9051176
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
svm_linear_pred_class <- predict(fit_svm_linear, newdata = test)
svm_linear_pred_prob  <- predict(fit_svm_linear, newdata = test, type = "prob")[, "Yes"]

confusionMatrix(svm_linear_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  276  157
##        No   284 1391
##                                          
##                Accuracy : 0.7908         
##                  95% CI : (0.7728, 0.808)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 1.062e-09      
##                                          
##                   Kappa : 0.422          
##                                          
##  Mcnemar's Test P-Value : 1.973e-09      
##                                          
##             Sensitivity : 0.4929         
##             Specificity : 0.8986         
##          Pos Pred Value : 0.6374         
##          Neg Pred Value : 0.8304         
##              Prevalence : 0.2657         
##          Detection Rate : 0.1309         
##    Detection Prevalence : 0.2054         
##       Balanced Accuracy : 0.6957         
##                                          
##        'Positive' Class : Yes            
## 
roc_linear_svm <- roc(test$Churn, svm_linear_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_linear_svm)
## Area under the curve: 0.8195

4.5 SVM – Radial (Tuned)

# SVM – Radial (Tuned)
set.seed(123)

sigma_est <- sigest(Churn ~ ., data = train) 
sigma_est
##        90%        50%        10% 
## 0.03524553 0.06305870 0.14437394
svm_radial_grid <- expand.grid(
  sigma = as.numeric(sigma_est),
  C     = c(0.1, 1, 10)
)

fit_svm_radial <- train(
  Churn ~ .,
  data      = train,
  method    = "svmRadial",
  preProcess = c("center", "scale"), 
  metric    = "ROC",
  trControl = ctrl,
  tuneGrid  = svm_radial_grid
)
## line search fails -1.024273 -0.5028463 1.050995e-05 6.535726e-06 -1.390835e-08 -6.045517e-09 -1.856879e-13
## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs
## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(fit_svm_radial)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 4924 samples
##   19 predictor
##    2 classes: 'Yes', 'No' 
## 
## Pre-processing: centered (30), scaled (30) 
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ... 
## Resampling results across tuning parameters:
## 
##   sigma       C     ROC        Sens       Spec     
##   0.03524553   0.1  0.8228235  0.4579524  0.9233748
##   0.03524553   1.0  0.8124042  0.4659808  0.9222683
##   0.03524553  10.0  0.7954825  0.4274159  0.9107115
##   0.06305870   0.1  0.8169394  0.4365666  0.9273859
##   0.06305870   1.0  0.8085704  0.4671302  0.9179806
##   0.06305870  10.0  0.7814190  0.3961013  0.9109267
##   0.14437394   0.1  0.8092819  0.3632345  0.9426003
##   0.14437394   1.0  0.7977023  0.4316019  0.9193638
##   0.14437394  10.0  0.7629016  0.3624623  0.9149378
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.03524553 and C = 0.1.
plot(fit_svm_radial)

# Predict 
svm_pred_class <- predict(fit_svm_radial, newdata = test)
svm_pred_prob  <- predict(fit_svm_radial, newdata = test, type = "prob")[, "Yes"]

confusionMatrix(svm_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes  236  131
##        No   324 1417
##                                          
##                Accuracy : 0.7842         
##                  95% CI : (0.766, 0.8016)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 7.09e-08       
##                                          
##                   Kappa : 0.3784         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4214         
##             Specificity : 0.9154         
##          Pos Pred Value : 0.6431         
##          Neg Pred Value : 0.8139         
##              Prevalence : 0.2657         
##          Detection Rate : 0.1120         
##    Detection Prevalence : 0.1741         
##       Balanced Accuracy : 0.6684         
##                                          
##        'Positive' Class : Yes            
## 
roc_svm <- roc(test$Churn, svm_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_svm)
## Area under the curve: 0.7896
# Generate results for all models
rf_results   <- evaluate_model(fit_rf, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
xgb_results  <- evaluate_model(fit_xgb, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
svmL_results <- evaluate_model(fit_svm_linear, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
svmR_results <- evaluate_model(fit_svm_radial, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
# Add the Model Name to result dataframe
rf_results$Model   <- "Random Forest (Tuned)"
xgb_results$Model  <- "XGBoost (Tuned)"
svmL_results$Model <- "SVM - Linear"
svmR_results$Model <- "SVM - Radial"
# Combine result
model_compare <- bind_rows(
  rf_results,
  xgb_results,
  svmL_results,
  svmR_results
)
# Reorder columns
model_compare <- model_compare %>%
  select(Model, Accuracy, Precision, Recall, F1_Score, AUC) %>%
  arrange(desc(AUC)) # Sort by best AUC

knitr::kable(model_compare, digits = 4, caption = "Model Comparison: RF, XGB, SVM")
Model Comparison: RF, XGB, SVM
Model Accuracy Precision Recall F1_Score AUC
XGBoost (Tuned) 0.8031 0.6698 0.5107 0.5795 0.8354
Random Forest (Tuned) 0.7970 0.6650 0.4750 0.5542 0.8228
SVM - Linear 0.7908 0.6374 0.4929 0.5559 0.8195
SVM - Radial 0.7842 0.6431 0.4214 0.5092 0.7896