Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
To complete this task:
Telco Customer Churn:
CustomerID: A unique ID that identifies each customer.
Gender: The customer’s gender: Male, Female
Senior Citizen: Indicates if the customer is 65 or older: Yes, No
Partner: Indicates if the customer is married: Yes, No
Dependents: Indicates if the customer lives with any dependents: Yes, No. Dependents could be children, parents, grandparents, etc.
Number of Dependents: Indicates the number of dependents that live with the customer.
Tenure: Indicates the total amount of months that the customer has been with the company by the end of the quarter specified above.
Phone Service: Indicates if the customer subscribes to home phone service with the company: Yes, No
Multiple Lines: Indicates if the customer subscribes to multiple telephone lines with the company: Yes, No
Internet Service: Indicates if the customer subscribes to Internet service with the company: No, DSL, Fiber Optic, Cable.
Online Security: Indicates if the customer subscribes to an additional online security service provided by the company: Yes, No
Online Backup: Indicates if the customer subscribes to an additional online backup service provided by the company: Yes, No
Device Protection Plan: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: Yes, No
Premium Tech Support: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: Yes, No
Streaming TV: Indicates if the customer uses their Internet service to stream television programing from a third party provider: Yes, No. The company does not charge an additional fee for this service.
Streaming Movies: Indicates if the customer uses their Internet service to stream movies from a third party provider: Yes, No. The company does not charge an additional fee for this service.
Contract: Indicates the customer’s current contract type: Month-to-Month, One Year, Two Year.
Paperless Billing: Indicates if the customer has chosen paperless billing: Yes, No
Payment Method: Indicates how the customer pays their bill: Bank Withdrawal, Credit Card, Mailed Check
Monthly Charge: Indicates the customer’s current total monthly charge for all their services from the company.
Total Charges: Indicates the customer’s total charges, calculated to the end of the quarter specified above.
Churn: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.
This dataset is from Kaggle: https://www.kaggle.com/datasets/blastchar/telco-customer-churn
New version from IBM: https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113
In subscription-based businesses such as telecommunications, customer churn represents a major threat to revenue and long-term growth. Acquiring a new customer is typically more expensive than retaining an existing one, so the ability to identify high-risk customers in advance can directly reduce marketing costs and increase profitability.
The business problem in this project is: How can a telecom provider predict which customers are likely to cancel their service, so that it can intervene proactively?
# Load the libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(tidyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(tidyverse)
## Warning: package 'readr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.4 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(kernlab)
##
## Attaching package: 'kernlab'
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## alpha
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
##
## The following object is masked from 'package:dplyr':
##
## combine
# View the data
df <- read.csv('https://raw.githubusercontent.com/Jennyjjxxzz/HW4/refs/heads/main/WA_Fn-UseC_-Telco-Customer-Churn.csv')
head(df)
dim(df)
## [1] 7043 21
summary(df)
## customerID gender SeniorCitizen Partner
## Length:7043 Length:7043 Min. :0.0000 Length:7043
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.1621
## 3rd Qu.:0.0000
## Max. :1.0000
##
## Dependents tenure PhoneService MultipleLines
## Length:7043 Min. : 0.00 Length:7043 Length:7043
## Class :character 1st Qu.: 9.00 Class :character Class :character
## Mode :character Median :29.00 Mode :character Mode :character
## Mean :32.37
## 3rd Qu.:55.00
## Max. :72.00
##
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TechSupport StreamingTV StreamingMovies Contract
## Length:7043 Length:7043 Length:7043 Length:7043
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## Length:7043 Length:7043 Min. : 18.25 Min. : 18.8
## Class :character Class :character 1st Qu.: 35.50 1st Qu.: 401.4
## Mode :character Mode :character Median : 70.35 Median :1397.5
## Mean : 64.76 Mean :2283.3
## 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :118.75 Max. :8684.8
## NA's :11
## Churn
## Length:7043
## Class :character
## Mode :character
##
##
##
##
str(df)
## 'data.frame': 7043 obs. of 21 variables:
## $ customerID : chr "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr "Yes" "No" "No" "No" ...
## $ Dependents : chr "No" "No" "No" "No" ...
## $ tenure : int 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr "No" "No" "No" "Yes" ...
## $ StreamingTV : chr "No" "No" "No" "No" ...
## $ StreamingMovies : chr "No" "No" "No" "No" ...
## $ Contract : chr "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr "No" "No" "Yes" "No" ...
colSums(is.na(df))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
# Convert TotalCharges imports as character
df$TotalCharges <- as.numeric(df$TotalCharges)
# Drop missing values
df <- df %>% drop_na()
The churn rate is visibly imbalanced: the majority of customers did not churn, while a smaller portion (approximately 26–27%) did. This imbalance is typical in churn datasets. And accuracy alone is not a reliable metric (model may predict “No Churn” for everyone).Metrics like precision, recall, F1-score, and ROC-AUC will provide more meaningful evaluation.
df %>%
count(Churn) %>%
ggplot(aes(x = Churn, y = n, fill = Churn)) +
geom_col() +
labs(title = "Churn Distribution")
Customers who churn tend to have much lower tenure, meaning they leave early in their relationship with the company.
New customers may be at higher risk of cancelling.
ggplot(df, aes(x = tenure, color = Churn)) +
geom_density(size = 1.2) +
labs(title = "Tenure Distribution by Churn",
x = "Months with Company",
y = "Density") +
scale_color_manual(values = c("No" = "#1f77b4", "Yes" = "#d62728"))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Customers who churn tend to have higher monthly charges.
This suggests a potential price-sensitivity problem: Customers facing higher monthly costs may feel dissatisfied.
Discounts or flexible plans could reduce churn among this group.
ggplot(df, aes(x = Churn, y = MonthlyCharges, fill = Churn)) +
geom_boxplot() +
labs(title = "Monthly Charges by Churn")
Month-to-Month customers have the highest churn rate
One-year and two-year contracts show significantly high churn
This suggests customers with short-term contracts feel less committed or more dissatisfied.
Business implication: Offering promotions or incentives to encourage monthly customers to upgrade to annual contracts may significantly reduce churn.
df <- df %>%
mutate(tenure_group = case_when(
tenure <= 12 ~ "0-1 year",
tenure <= 24 ~ "1-2 years",
tenure <= 36 ~ "2-3 years",
tenure <= 48 ~ "3-4 years",
tenure <= 60 ~ "4-5 years",
TRUE ~ "5-6 years"
))
ggplot(df, aes(x = tenure_group, fill = Churn)) +
geom_bar(position = "fill") +
labs(title = "Churn Rate by Tenure Group",
y = "Proportion",
x = "Tenure Group") +
scale_fill_manual(values=c("#1f77b4","#d62728")) +
theme_minimal(base_size = 14)
### 3.5 MonthlyCharges vs Tenure
ggplot(df, aes(x = tenure, y = MonthlyCharges, color = Churn)) +
geom_point(alpha = 0.4) +
labs(title = "Monthly Charges vs Tenure",
x = "Tenure (Months)",
y = "Monthly Charges") +
scale_color_manual(values=c("#1f77b4","#d62728")) +
theme_minimal(base_size = 14)
Dependents: Customers without dependents have a higher churn proportion, similar to the partner pattern.
Partner: shows that customers without partners churn more often, possibly reflecting differences in household stability or financial constraints.
Gender shows almost no difference in churn rate between male and female customers, suggesting that gender is not an important predictor.
Senior Citizens churn at a substantially higher rate than non-senior customers, indicating that older customers may be less satisfied with services or more price-sensitive.
# Choose categorical variables to include
cat_vars <- c("gender",
"SeniorCitizen",
"Partner",
"Dependents")
# Convert SeniorCitizen to Yes/No for readability
df$SeniorCitizen <- ifelse(df$SeniorCitizen == 1, "Yes", "No")
# Reshape data to long format
df_long <- df %>%
select(all_of(cat_vars), Churn) %>%
pivot_longer(cols = all_of(cat_vars),
names_to = "variable",
values_to = "value")
# bar plot
ggplot(df_long, aes(x = value, fill = Churn)) +
geom_bar(position = "fill") +
facet_wrap(~ variable, scales = "free_x") +
labs(title = "Churn Proportions Across Customer Demographics",
x = "",
y = "Proportion",
fill = "Churn") +
scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
theme_minimal(base_size = 14) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
### 3.7 Churn Proportion Across Add-on and Internet Services
DeviceProtection: Slightly higher churn among customers without protection plans.
OnlineBackup: customers without this service churn more.
StreamingTV: Similar churn rates for “No” and “Yes”. Streaming services do not strongly influence churn.
InternetService: Fiber optic customers show the highest churn proportion. DSL users churn less. Customers with no internet service churn the least. This reflecting dissatisfaction with Fiber.
OnlineSecurity: Customers without Online Security churn at much higher rates. Customers with security add-ons churn less.
StreamingTV: Similar churn rates for “No” and “Yes”.
Multiple Lines: Customers with multiple phone lines churn slightly more than customers with a single line or no phone service, possibly reflecting dissatisfaction with bundled service plans.
Phone Service: does not appear to influence churn significantly.
TechSupport: Customers with no tech support churn disproportionately more.
service_vars <- c("InternetService",
"OnlineSecurity",
"OnlineBackup",
"DeviceProtection",
"TechSupport",
"StreamingTV",
"StreamingMovies",
"MultipleLines",
"PhoneService")
df_services_long <- df %>%
select(all_of(service_vars), Churn) %>%
pivot_longer(cols = all_of(service_vars),
names_to = "Service",
values_to = "Status")
ggplot(df_services_long, aes(x = Status, fill = Churn)) +
geom_bar(position = "fill") +
facet_wrap(~ Service, scales = "free_x") +
labs(title = "Churn Proportion Across Add-on and Internet Services",
x = "",
y = "Proportion") +
scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
theme_minimal(base_size = 9) +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
strip.text = element_text(face = "bold"))
Contract: Month-to-month contracts have the highest churn rate.
PaperlessBilling: Customers with paperless billing churn much more.
PaymentMethod: Electronic Check customers churn the most. Automatic payments via credit card or bank transfer churn the least.
final_vars <- c("Contract",
"PaperlessBilling",
"PaymentMethod")
df_final_long <- df %>%
select(all_of(final_vars), Churn) %>%
pivot_longer(cols = all_of(final_vars),
names_to = "variable",
values_to = "value")
ggplot(df_final_long, aes(x = value, fill = Churn)) +
geom_bar(position = "fill") +
facet_wrap(~ variable, scales = "free_x") +
labs(title = "Churn Proportion Across Contract, and Billing",
x = "",
y = "Proportion",
fill = "Churn") +
scale_fill_manual(values = c("No" = "#FF6F69", "Yes" = "#40E0D0")) +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 25, hjust = 1),
strip.text = element_text(face = "bold"))
Tenure is strongly correlated with TotalCharges, because customers who stay longer accumulate higher total charges.
There are moderate positive correlations between tech-support–related services, such as OnlineSecurity, OnlineBackup, DeviceProtection, and TechSupport.
Churn shows small but notable correlations with several predictors, including Contract, MonthlyCharges, InternetService, and SeniorCitizen
df <- df %>% select(-tenure_group)
# Convert categorical to numeric temporarily
df_numeric <- df
for(col in names(df_numeric)){
if(is.character(df_numeric[[col]]) | is.factor(df_numeric[[col]])){
df_numeric[[col]] <- as.numeric(as.factor(df_numeric[[col]]))
}
}
cor_mat <- cor(df_numeric)
corrplot(cor_mat,
method = "color",
type = "lower",
tl.col = "black",
tl.cex = 0.6,
number.cex = 0.4,
addCoef.col = "black",
col = colorRampPalette(c("blue","white","red"))(200))
# Remove ID column
df_model <- df %>%
select(-customerID)
# Make sure Churn is a factor with "Yes" as positive class
df_model$Churn <- factor(df_model$Churn,
levels = c("Yes", "No"))
str(df_model$Churn)
## Factor w/ 2 levels "Yes","No": 2 2 1 2 1 1 2 2 1 2 ...
# Train / test split (70/30)
set.seed(123)
train_index <- createDataPartition(df_model$Churn, p = 0.7, list = FALSE)
train <- df_model[train_index, ]
test <- df_model[-train_index, ]
# Count unique values for each column in train
unique_levels <- sapply(train, function(x) length(unique(x)))
# Identify columns that have only 1 unique value
single_level_vars <- names(unique_levels[unique_levels == 1])
# Drop them from both train and test
if (length(single_level_vars) > 0) {
train <- train %>% dplyr::select(-dplyr::all_of(single_level_vars))
test <- test %>% dplyr::select(-dplyr::all_of(single_level_vars))
}
# Set up cross-validation
ctrl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 2,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "final"
)
evaluate_model <- function(model, test){
preds_prob <- predict(model, test, type = "prob")[, "Yes"]
preds_class <- predict(model, test)
cm <- confusionMatrix(preds_class, test$Churn, positive = "Yes")
auc <- roc(test$Churn, preds_prob)$auc
tibble(
Accuracy = cm$overall["Accuracy"],
Precision = cm$byClass["Precision"],
Recall = cm$byClass["Recall"],
F1_Score = cm$byClass["F1"],
AUC = as.numeric(auc)
)
}
# Random Forest
set.seed(123)
rf_grid <- expand.grid(
mtry = c(3, 5, 7, 9)
)
fit_rf <- train(
Churn ~ ., data = train,
method = "rf",
metric = "ROC",
trControl = ctrl,
tuneGrid = rf_grid,
ntree = 500
)
rf_pred_class <- predict(fit_rf, newdata = test)
rf_pred_prob <- predict(fit_rf, newdata = test, type = "prob")[, "Yes"]
confusionMatrix(rf_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 264 134
## No 296 1414
##
## Accuracy : 0.796
## 95% CI : (0.7782, 0.813)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 2.663e-11
##
## Kappa : 0.424
##
## Mcnemar's Test P-Value : 8.222e-15
##
## Sensitivity : 0.4714
## Specificity : 0.9134
## Pos Pred Value : 0.6633
## Neg Pred Value : 0.8269
## Prevalence : 0.2657
## Detection Rate : 0.1252
## Detection Prevalence : 0.1888
## Balanced Accuracy : 0.6924
##
## 'Positive' Class : Yes
##
roc_rf <- roc(response = test$Churn,
predictor = rf_pred_prob,
levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_rf)
## Area under the curve: 0.8228
# XGBoost
set.seed(123)
xgb_grid <- expand.grid(
nrounds = 200,
max_depth = c(3, 5, 7),
eta = c(0.05, 0.1),
gamma = 0,
colsample_bytree = 0.8,
min_child_weight = 1,
subsample = 0.8
)
fit_xgb <- train(
Churn ~ ., data = train,
method = "xgbTree",
metric = "ROC",
trControl = ctrl,
tuneGrid = xgb_grid
)
fit_xgb
## eXtreme Gradient Boosting
##
## 4924 samples
## 19 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ...
## Resampling results across tuning parameters:
##
## eta max_depth ROC Sens Spec
## 0.05 3 0.8528526 0.5416191 0.8986169
## 0.05 5 0.8466158 0.5271095 0.8947441
## 0.05 7 0.8395332 0.5183177 0.8890733
## 0.10 3 0.8474664 0.5335922 0.8923928
## 0.10 5 0.8377449 0.5217543 0.8849239
## 0.10 7 0.8291069 0.5049531 0.8822960
##
## Tuning parameter 'nrounds' was held constant at a value of 200
## Tuning
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
##
## Tuning parameter 'subsample' was held constant at a value of 0.8
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 200, max_depth = 3, eta
## = 0.05, gamma = 0, colsample_bytree = 0.8, min_child_weight = 1 and
## subsample = 0.8.
plot(fit_xgb)
xgb_pred_class <- predict(fit_xgb, newdata = test)
xgb_pred_prob <- predict(fit_xgb, newdata = test, type = "prob")[, "Yes"]
confusionMatrix(xgb_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 286 141
## No 274 1407
##
## Accuracy : 0.8031
## 95% CI : (0.7855, 0.8199)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 9.978e-14
##
## Kappa : 0.454
##
## Mcnemar's Test P-Value : 9.195e-11
##
## Sensitivity : 0.5107
## Specificity : 0.9089
## Pos Pred Value : 0.6698
## Neg Pred Value : 0.8370
## Prevalence : 0.2657
## Detection Rate : 0.1357
## Detection Prevalence : 0.2026
## Balanced Accuracy : 0.7098
##
## 'Positive' Class : Yes
##
roc_xgb <- roc(test$Churn, xgb_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_xgb)
## Area under the curve: 0.8354
# SVM – Linear (Tuned)
set.seed(123)
svm_linear_grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 5, 10))
fit_svm_linear <- train(
Churn ~ ., data = train,
method = "svmLinear",
metric = "ROC",
trControl = ctrl,
tuneGrid = svm_linear_grid
)
print(fit_svm_linear)
## Support Vector Machines with Linear Kernel
##
## 4924 samples
## 19 predictor
## 2 classes: 'Yes', 'No'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 1e-03 0.8404066 0.6470489 0.8478562
## 1e-02 0.8419028 0.5206121 0.9048409
## 1e-01 0.8401097 0.5160261 0.9048409
## 1e+00 0.8391409 0.5133500 0.9059474
## 5e+00 0.8389771 0.5125881 0.9055325
## 1e+01 0.8389854 0.5122050 0.9051176
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.01.
svm_linear_pred_class <- predict(fit_svm_linear, newdata = test)
svm_linear_pred_prob <- predict(fit_svm_linear, newdata = test, type = "prob")[, "Yes"]
confusionMatrix(svm_linear_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 276 157
## No 284 1391
##
## Accuracy : 0.7908
## 95% CI : (0.7728, 0.808)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 1.062e-09
##
## Kappa : 0.422
##
## Mcnemar's Test P-Value : 1.973e-09
##
## Sensitivity : 0.4929
## Specificity : 0.8986
## Pos Pred Value : 0.6374
## Neg Pred Value : 0.8304
## Prevalence : 0.2657
## Detection Rate : 0.1309
## Detection Prevalence : 0.2054
## Balanced Accuracy : 0.6957
##
## 'Positive' Class : Yes
##
roc_linear_svm <- roc(test$Churn, svm_linear_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_linear_svm)
## Area under the curve: 0.8195
# SVM – Radial (Tuned)
set.seed(123)
sigma_est <- sigest(Churn ~ ., data = train)
sigma_est
## 90% 50% 10%
## 0.03524553 0.06305870 0.14437394
svm_radial_grid <- expand.grid(
sigma = as.numeric(sigma_est),
C = c(0.1, 1, 10)
)
fit_svm_radial <- train(
Churn ~ .,
data = train,
method = "svmRadial",
preProcess = c("center", "scale"),
metric = "ROC",
trControl = ctrl,
tuneGrid = svm_radial_grid
)
## line search fails -1.024273 -0.5028463 1.050995e-05 6.535726e-06 -1.390835e-08 -6.045517e-09 -1.856879e-13
## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs
## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs
## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(fit_svm_radial)
## Support Vector Machines with Radial Basis Function Kernel
##
## 4924 samples
## 19 predictor
## 2 classes: 'Yes', 'No'
##
## Pre-processing: centered (30), scaled (30)
## Resampling: Cross-Validated (5 fold, repeated 2 times)
## Summary of sample sizes: 3939, 3940, 3939, 3939, 3939, 3939, ...
## Resampling results across tuning parameters:
##
## sigma C ROC Sens Spec
## 0.03524553 0.1 0.8228235 0.4579524 0.9233748
## 0.03524553 1.0 0.8124042 0.4659808 0.9222683
## 0.03524553 10.0 0.7954825 0.4274159 0.9107115
## 0.06305870 0.1 0.8169394 0.4365666 0.9273859
## 0.06305870 1.0 0.8085704 0.4671302 0.9179806
## 0.06305870 10.0 0.7814190 0.3961013 0.9109267
## 0.14437394 0.1 0.8092819 0.3632345 0.9426003
## 0.14437394 1.0 0.7977023 0.4316019 0.9193638
## 0.14437394 10.0 0.7629016 0.3624623 0.9149378
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.03524553 and C = 0.1.
plot(fit_svm_radial)
# Predict
svm_pred_class <- predict(fit_svm_radial, newdata = test)
svm_pred_prob <- predict(fit_svm_radial, newdata = test, type = "prob")[, "Yes"]
confusionMatrix(svm_pred_class, test$Churn, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 236 131
## No 324 1417
##
## Accuracy : 0.7842
## 95% CI : (0.766, 0.8016)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 7.09e-08
##
## Kappa : 0.3784
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4214
## Specificity : 0.9154
## Pos Pred Value : 0.6431
## Neg Pred Value : 0.8139
## Prevalence : 0.2657
## Detection Rate : 0.1120
## Detection Prevalence : 0.1741
## Balanced Accuracy : 0.6684
##
## 'Positive' Class : Yes
##
roc_svm <- roc(test$Churn, svm_pred_prob, levels = c("No", "Yes"))
## Setting direction: controls < cases
auc(roc_svm)
## Area under the curve: 0.7896
# Generate results for all models
rf_results <- evaluate_model(fit_rf, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
xgb_results <- evaluate_model(fit_xgb, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
svmL_results <- evaluate_model(fit_svm_linear, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
svmR_results <- evaluate_model(fit_svm_radial, test)
## Setting levels: control = Yes, case = No
## Setting direction: controls > cases
# Add the Model Name to result dataframe
rf_results$Model <- "Random Forest (Tuned)"
xgb_results$Model <- "XGBoost (Tuned)"
svmL_results$Model <- "SVM - Linear"
svmR_results$Model <- "SVM - Radial"
# Combine result
model_compare <- bind_rows(
rf_results,
xgb_results,
svmL_results,
svmR_results
)
# Reorder columns
model_compare <- model_compare %>%
select(Model, Accuracy, Precision, Recall, F1_Score, AUC) %>%
arrange(desc(AUC)) # Sort by best AUC
knitr::kable(model_compare, digits = 4, caption = "Model Comparison: RF, XGB, SVM")
| Model | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|
| XGBoost (Tuned) | 0.8031 | 0.6698 | 0.5107 | 0.5795 | 0.8354 |
| Random Forest (Tuned) | 0.7970 | 0.6650 | 0.4750 | 0.5542 | 0.8228 |
| SVM - Linear | 0.7908 | 0.6374 | 0.4929 | 0.5559 | 0.8195 |
| SVM - Radial | 0.7842 | 0.6431 | 0.4214 | 0.5092 | 0.7896 |