This document contains all code for Assignment 4. The business goal is to predict customer churn for a fictional telecommunications company using the “Telco Customer Churn” dataset sourced from Kaggle.
Two modeling approaches are used: random forests and feedforward neural networks, each in a baseline version and a second version intended to improve on it (a tuned random forest and a larger neural network). All models are evaluated on multiple metrics, which are stored in a matrix for comparison. Balanced performance is important, but recall is weighted somewhat more heavily: because a lost customer is costly, the aim is to capture as many at-risk customers as possible.
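Because recall is weighted above precision here, a recall-leaning summary such as the F-beta score with beta > 1 is a useful companion to F1 when comparing models later. A minimal sketch (the f_beta helper is hypothetical, not part of the assignment code):
# Hypothetical helper: F-beta score; beta = 2 weights recall twice as heavily as precision
f_beta <- function(precision, recall, beta = 2) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}
f_beta(0.70, 0.50) # illustrative values, not model output: ~0.53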
library(tidyverse)
library(fastDummies)
library(randomForest)
library(pROC)
library(caret)
library(nnet)
library(scales)
# Load data
df_raw <- read_csv("https://raw.githubusercontent.com/AmandaSFox/DATA622/refs/heads/main/Assignment_4/WA_Fn-UseC_-Telco-Customer-Churn.csv")
# View raw dataset
glimpse(df_raw)
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
This dataset is preprocessed and very clean: no duplicate records were found and there are only 11 NA values. All of them fall in the TotalCharges column and belong to brand-new accounts (tenure = 0) that have not yet been billed. These rows were removed from the dataset as not useful for churn analysis.
# Duplicate rows
sum(duplicated(df_raw))
## [1] 0
# Count NA values by column
colSums(is.na(df_raw))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
# View NA data
missing_chgs <- df_raw %>%
filter(is.na(TotalCharges)) # All have tenure = 0
missing_chgs
## # A tibble: 11 × 21
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 4472-LVYGI Female 0 Yes Yes 0 No
## 2 3115-CZMZD Male 0 No Yes 0 Yes
## 3 5709-LVOEQ Female 0 Yes Yes 0 Yes
## 4 4367-NUYAO Male 0 Yes Yes 0 Yes
## 5 1371-DWPAZ Female 0 Yes Yes 0 No
## 6 7644-OMVMY Male 0 Yes Yes 0 Yes
## 7 3213-VVOLG Male 0 Yes Yes 0 Yes
## 8 2520-SGTTA Female 0 Yes Yes 0 Yes
## 9 2923-ARZLG Male 0 Yes Yes 0 Yes
## 10 4075-WKNIU Female 0 Yes Yes 0 Yes
## 11 2775-SEFEE Male 0 No Yes 0 Yes
## # ℹ 14 more variables: MultipleLines <chr>, InternetService <chr>,
## # OnlineSecurity <chr>, OnlineBackup <chr>, DeviceProtection <chr>,
## # TechSupport <chr>, StreamingTV <chr>, StreamingMovies <chr>,
## # Contract <chr>, PaperlessBilling <chr>, PaymentMethod <chr>,
## # MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <chr>
# Remove NA rows
df_raw <- df_raw %>%
filter(!is.na(TotalCharges))
The dataset includes character flags, which must be converted to numeric inputs for the neural networks. One-hot encoding is applied to four categorical columns (InternetService, Contract, PaymentMethod, and gender; the first three take more than two values) to preserve information about service, contract, and payment types, and the remaining binary Yes/No columns are simply converted to 0/1. Finally, the target variable is converted to a factor so that the random forest models perform classification rather than regression.
#----------------------
# 1. Manual one-hot encoding: Internet Service
#----------------------
# List values
df_raw %>%
count(InternetService)
## # A tibble: 3 × 2
## InternetService n
## <chr> <int>
## 1 DSL 2416
## 2 Fiber optic 3096
## 3 No 1520
# Create new cols with binary flags
df_model <- df_raw %>%
mutate(Internet_Svc = if_else(InternetService == "No", 0, 1),
DSL = if_else(InternetService == "DSL", 1, 0),
Fiber = if_else(InternetService == "Fiber optic", 1, 0))
# Compare new cols to original
df_model %>%
count(InternetService, Internet_Svc, DSL,Fiber)
## # A tibble: 3 × 5
## InternetService Internet_Svc DSL Fiber n
## <chr> <dbl> <dbl> <dbl> <int>
## 1 DSL 1 1 0 2416
## 2 Fiber optic 1 0 1 3096
## 3 No 0 0 0 1520
# Remove original column
df_model <- df_model %>%
select(-InternetService)
#----------------------
# 2. Automated one-hot encoding: gender, contract, payment types
#----------------------
# List values for each column
df_model %>%
count(gender)
## # A tibble: 2 × 2
## gender n
## <chr> <int>
## 1 Female 3483
## 2 Male 3549
df_model %>%
count(Contract)
## # A tibble: 3 × 2
## Contract n
## <chr> <int>
## 1 Month-to-month 3875
## 2 One year 1472
## 3 Two year 1685
df_model %>%
count(PaymentMethod)
## # A tibble: 4 × 2
## PaymentMethod n
## <chr> <int>
## 1 Bank transfer (automatic) 1542
## 2 Credit card (automatic) 1521
## 3 Electronic check 2365
## 4 Mailed check 1604
# One-hot encoding, removing original columns
df_model <- df_model %>%
dummy_cols(
select_columns = c("PaymentMethod", "Contract", "gender"),
remove_selected_columns = TRUE
)
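Note that dummy_cols() keeps every level of each variable (hence both gender_Female and gender_Male below), which is harmless for tree models but creates redundant, perfectly collinear columns. fastDummies can instead drop one reference level per variable; a sketch of that alternative (not used in this analysis):
# Hypothetical alternative: drop one reference level per variable
df_alt <- df_raw %>%
  dummy_cols(
    select_columns = c("PaymentMethod", "Contract", "gender"),
    remove_first_dummy = TRUE,
    remove_selected_columns = TRUE
  )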
#----------------------
# 3. Convert remaining binary flags to 0/1 (treat "No service" as 0)
#----------------------
df_model <- df_model %>%
mutate(PhoneService = if_else(PhoneService == "Yes", 1, 0),
MultipleLines = if_else(MultipleLines == "Yes", 1, 0),
OnlineSecurity = if_else(OnlineSecurity == "Yes", 1, 0),
OnlineBackup = if_else(OnlineBackup == "Yes", 1, 0),
DeviceProtection = if_else(DeviceProtection == "Yes", 1, 0),
TechSupport = if_else(TechSupport == "Yes", 1, 0),
StreamingTV = if_else(StreamingTV == "Yes", 1, 0),
StreamingMovies = if_else(StreamingMovies == "Yes", 1, 0),
Partner = if_else(Partner == "Yes", 1, 0),
Dependents = if_else(Dependents == "Yes", 1, 0),
PaperlessBilling = if_else(PaperlessBilling == "Yes", 1, 0),
Churn = if_else(Churn == "Yes", 1, 0))
#----------------------
# Convert target variable to factor
#----------------------
df_model$Churn <- as.factor(df_model$Churn)
#----------------------
# Final dataframe
#----------------------
glimpse(df_model)
## Rows: 7,032
## Columns: 29
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", …
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Partner <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Dependents <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, …
## $ PhoneService <dbl> 0, 1, 1, 0, 1, 1, 1, 0, 1, 1…
## $ MultipleLines <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0…
## $ OnlineSecurity <dbl> 0, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ OnlineBackup <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 1…
## $ DeviceProtection <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 1, 0…
## $ TechSupport <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 1, 0…
## $ StreamingTV <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 0…
## $ StreamingMovies <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0…
## $ PaperlessBilling <dbl> 1, 0, 1, 0, 1, 1, 1, 0, 1, 0…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, …
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840…
## $ Churn <fct> 0, 0, 1, 0, 1, 1, 0, 0, 1, 0…
## $ Internet_Svc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ DSL <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 0, 1…
## $ Fiber <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 1, 0…
## $ `PaymentMethod_Bank transfer (automatic)` <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 1…
## $ `PaymentMethod_Credit card (automatic)` <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0…
## $ `PaymentMethod_Electronic check` <int> 1, 0, 0, 0, 1, 1, 0, 0, 1, 0…
## $ `PaymentMethod_Mailed check` <int> 0, 1, 1, 0, 0, 0, 0, 1, 0, 0…
## $ `Contract_Month-to-month` <int> 1, 0, 1, 0, 1, 1, 1, 1, 1, 0…
## $ `Contract_One year` <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1…
## $ `Contract_Two year` <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gender_Female <int> 1, 0, 0, 0, 1, 1, 0, 1, 1, 0…
## $ gender_Male <int> 0, 1, 1, 1, 0, 0, 1, 0, 0, 1…
#----------------------
# Summary stats for essay
#----------------------
summary_stats <- df_model %>%
summarize(
`Customers` = n(),
`Churn %` = mean(Churn == 1),
`Mean Tenure` = mean(tenure, na.rm = TRUE),
`Mean Monthly Charges` = mean(MonthlyCharges, na.rm = TRUE),
`% Month to Month` = mean(`Contract_Month-to-month` == 1),
`% Phone Services` = mean(PhoneService == 1),
`% Internet Services` = mean(Internet_Svc == 1)) %>%
pivot_longer(cols = everything(), names_to = "Metric", values_to = "Value")
summary_stats
## # A tibble: 7 × 2
## Metric Value
## <chr> <dbl>
## 1 Customers 7032
## 2 Churn % 0.266
## 3 Mean Tenure 32.4
## 4 Mean Monthly Charges 64.8
## 5 % Month to Month 0.551
## 6 % Phone Services 0.903
## 7 % Internet Services 0.784
A matrix is initialized to store metrics for each of the four models, and the dataset is split 70/30 into training and test sets using a fixed seed for reproducibility. Customer ID is removed, and the train and test sets are reconciled so that the same one-hot encoded columns appear in both (any column missing from one side is added and filled with 0). Finally, column names are sanitized with make.names() because the one-hot encoded names contain spaces and parentheses, which cause problems in the random forest formula interface.
# Initialize matrix to store metrics
model_results <- matrix(nrow = 4, ncol = 5,
dimnames = list(
c("RF_default", "RF_tuned", "NN_simple", "NN_deep"),
c("Accuracy", "Precision", "Recall", "F1", "AUC")
))
# Train/test split
set.seed(123)
train_index <- sample(nrow(df_model), 0.7 * nrow(df_model))
df_train <- df_model[train_index, ]
df_test <- df_model[-train_index, ]
# Drop customer ID
df_train <- df_train %>% select(-customerID)
df_test <- df_test %>% select(-customerID)
# Add any missing one-hot encoded cols to df_test
missing_cols <- setdiff(names(df_train), names(df_test))
df_test[missing_cols] <- 0
df_test <- df_test[, names(df_train)]
# Add any missing one-hot encoded cols to df_train
missing_cols_rev <- setdiff(names(df_test), names(df_train))
df_train[missing_cols_rev] <- 0
df_train <- df_train[, names(df_test)]
# Fix col names with special characters (i.e. one hot encoded cols)
names(df_train) <- make.names(names(df_train))
names(df_test) <- make.names(names(df_test))
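As an aside, the simple random split above does not guarantee that the training and test churn rates match exactly. caret's createDataPartition() performs a split stratified on the outcome; a sketch of that alternative (not used in this analysis):
# Alternative: stratified split that preserves the churn proportion in both sets
set.seed(123)
strat_index <- createDataPartition(df_model$Churn, p = 0.7, list = FALSE)[, 1]
df_train_strat <- df_model[strat_index, ]
df_test_strat <- df_model[-strat_index, ]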
The first model is a random forest with default parameters (ntree = 500, the package default, is passed explicitly), with performance metrics calculated and stored in the results matrix.
# Train Random Forest
rf_default_model <- randomForest(Churn ~ .,
data = df_train,
ntree = 500,
importance = TRUE)
# Get predictions and probabilities
rf_default_preds <- predict(rf_default_model, newdata = df_test)
rf_default_probs <- predict(rf_default_model, newdata = df_test, type = "prob")[, 2]
# Confusion matrix using predictions
rf_default_cm <- confusionMatrix(rf_default_preds, df_test$Churn, positive = "1")
rf_default_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1412 291
## 1 138 269
##
## Accuracy : 0.7967
## 95% CI : (0.7789, 0.8137)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 1.885e-11
##
## Kappa : 0.4287
##
## Mcnemar's Test P-Value : 2.158e-13
##
## Sensitivity : 0.4804
## Specificity : 0.9110
## Pos Pred Value : 0.6609
## Neg Pred Value : 0.8291
## Prevalence : 0.2654
## Detection Rate : 0.1275
## Detection Prevalence : 0.1929
## Balanced Accuracy : 0.6957
##
## 'Positive' Class : 1
##
# Calculate AUC using probabilities
rf_default_roc <- roc(df_test$Churn, rf_default_probs)
rf_default_auc <- auc(rf_default_roc)
# Store results from confusion matrix and AUC
model_results["RF_default", "Accuracy"] <- rf_default_cm$overall["Accuracy"]
model_results["RF_default", "Precision"] <- rf_default_cm$byClass["Precision"]
model_results["RF_default", "Recall"] <- rf_default_cm$byClass["Recall"]
model_results["RF_default", "F1"] <- rf_default_cm$byClass["F1"]
model_results["RF_default", "AUC"] <- rf_default_auc
# View results matrix
model_results
## Accuracy Precision Recall F1 AUC
## RF_default 0.7966825 0.6609337 0.4803571 0.5563599 0.8426123
## RF_tuned NA NA NA NA NA
## NN_simple NA NA NA NA NA
## NN_deep NA NA NA NA NA
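Because the forest was grown with importance = TRUE, the fitted object also carries variable-importance measures; a quick look (output not shown) is one way to sanity-check what the model relies on:
# Inspect variable importance from the default forest (output omitted)
importance(rf_default_model) # mean decrease in accuracy and in Gini per predictor
varImpPlot(rf_default_model, n.var = 10) # plot the ten most important predictors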
To improve upon the first RF model, five-fold cross-validation was used to tune mtry, the number of predictors randomly sampled at each split (tuneLength = 5 evaluates five candidate values). Results were again stored in the results matrix for comparison.
# Train with 5-fold cross-validation for mtry value
set.seed(42)
rf_tuned <- train(
Churn ~ .,
data = df_train,
method = "rf",
trControl = trainControl(method = "cv", number = 5),
tuneLength = 5
)
# Get predictions and probabilities
rf_tuned_preds <- predict(rf_tuned, newdata = df_test)
rf_tuned_probs <- predict(rf_tuned, newdata = df_test, type = "prob")[, 2]
# Confusion matrix using predictions
rf_tuned_cm <- confusionMatrix(rf_tuned_preds, df_test$Churn, positive = "1")
rf_tuned_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1460 294
## 1 90 266
##
## Accuracy : 0.818
## 95% CI : (0.8009, 0.8343)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4718
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4750
## Specificity : 0.9419
## Pos Pred Value : 0.7472
## Neg Pred Value : 0.8324
## Prevalence : 0.2654
## Detection Rate : 0.1261
## Detection Prevalence : 0.1687
## Balanced Accuracy : 0.7085
##
## 'Positive' Class : 1
##
# Calculate AUC using probabilities
rf_tuned_roc <- roc(df_test$Churn, rf_tuned_probs)
rf_tuned_auc <- auc(rf_tuned_roc)
# Store results from CM and AUC in matrix
model_results["RF_tuned", "Accuracy"] <- rf_tuned_cm$overall["Accuracy"]
model_results["RF_tuned", "Precision"] <- rf_tuned_cm$byClass["Precision"]
model_results["RF_tuned", "Recall"] <- rf_tuned_cm$byClass["Recall"]
model_results["RF_tuned", "F1"] <- rf_tuned_cm$byClass["F1"]
model_results["RF_tuned", "AUC"] <- rf_tuned_auc
# View updated matrix
model_results
## Accuracy Precision Recall F1 AUC
## RF_default 0.7966825 0.6609337 0.4803571 0.5563599 0.8426123
## RF_tuned 0.8180095 0.7471910 0.4750000 0.5807860 0.8523652
## NN_simple NA NA NA NA NA
## NN_deep NA NA NA NA NA
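Given the stated emphasis on recall, one lever not exercised above is the decision threshold: predict() labels a customer as churned only when the class-1 probability exceeds 0.5, and lowering that cutoff trades precision for recall. A sketch using the tuned model's probabilities (the 0.35 threshold is an arbitrary illustration):
# Re-label test cases with a lower, recall-friendly threshold (0.35 is illustrative)
thresh <- 0.35
rf_tuned_preds_lo <- factor(if_else(rf_tuned_probs >= thresh, 1, 0), levels = c(0, 1))
confusionMatrix(rf_tuned_preds_lo, df_test$Churn, positive = "1")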
Next, a simple feedforward neural network with five hidden units was fit using the nnet package. The same metrics were again evaluated and stored in the results matrix.
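One caveat before fitting: nnet is sensitive to predictor scale, and tenure, MonthlyCharges, and TotalCharges sit on much larger scales than the 0/1 flags. A sketch of an optional rescaling step using scales::rescale (the scales package is loaded above, but this step is not applied in the pipeline below):
# Optional: rescale continuous predictors to [0, 1] before fitting (not applied below;
# a more careful version would reuse the training min/max when scaling the test set)
df_train_scaled <- df_train %>%
  mutate(across(c(tenure, MonthlyCharges, TotalCharges), scales::rescale))
df_test_scaled <- df_test %>%
  mutate(across(c(tenure, MonthlyCharges, TotalCharges), scales::rescale))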
# Fit simple neural network with 5 hidden units and light regularization
nn_simple <- nnet(
Churn ~ .,
data = df_train,
size = 5,
decay = 0.01,
maxit = 200,
trace = FALSE
)
# Predict on test data without the "Churn" column
nn_simple_preds <- predict(nn_simple,
df_test %>% select(-Churn),
type = "class")
# Confusion matrix (class predictions only; AUC is addressed after the results matrix below)
nn_simple_cm <- confusionMatrix(
factor(nn_simple_preds, levels = c(0, 1)),
df_test$Churn,
positive = "1")
nn_simple_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1431 282
## 1 119 278
##
## Accuracy : 0.81
## 95% CI : (0.7925, 0.8265)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : 3.010e-16
##
## Kappa : 0.4627
##
## Mcnemar's Test P-Value : 5.972e-16
##
## Sensitivity : 0.4964
## Specificity : 0.9232
## Pos Pred Value : 0.7003
## Neg Pred Value : 0.8354
## Prevalence : 0.2654
## Detection Rate : 0.1318
## Detection Prevalence : 0.1882
## Balanced Accuracy : 0.7098
##
## 'Positive' Class : 1
##
# Store results from CM in matrix
model_results["NN_simple", "Accuracy"] <- nn_simple_cm$overall["Accuracy"]
model_results["NN_simple", "Precision"] <- nn_simple_cm$byClass["Precision"]
model_results["NN_simple", "Recall"] <- nn_simple_cm$byClass["Recall"]
model_results["NN_simple", "F1"] <- nn_simple_cm$byClass["F1"]
model_results["NN_simple", "AUC"] <- NA
# Display updated metrics
model_results
## Accuracy Precision Recall F1 AUC
## RF_default 0.7966825 0.6609337 0.4803571 0.5563599 0.8426123
## RF_tuned 0.8180095 0.7471910 0.4750000 0.5807860 0.8523652
## NN_simple 0.8099526 0.7002519 0.4964286 0.5809822 NA
## NN_deep NA NA NA NA NA
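The AUC cells are left as NA above, but nnet can return raw class-1 probabilities via type = "raw", so AUC is recoverable if desired. A sketch for the simple network (not stored in the matrix):
# Compute AUC for the simple network from its raw probabilities
nn_simple_probs <- predict(nn_simple, df_test %>% select(-Churn), type = "raw")
auc(roc(df_test$Churn, as.vector(nn_simple_probs)))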
Finally, a larger feedforward neural network with ten hidden units and stronger regularization is fit in an attempt to improve the still-modest recall. (nnet fits a single hidden layer, so the "NN_deep" model is wider rather than literally deeper.)
# Fit neural network with 10 hidden units and stronger regularization
nn_deep <- nnet(
Churn ~ .,
data = df_train,
size = 10,
decay = 0.05,
maxit = 300,
trace = FALSE
)
# Predict on test data without the "Churn" column
nn_deep_preds <- predict(nn_deep, df_test %>% select(-Churn), type = "class")
# Confusion matrix
nn_deep_cm <- confusionMatrix(
factor(nn_deep_preds, levels = c(0, 1)),
df_test$Churn,
positive = "1")
nn_deep_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1421 271
## 1 129 289
##
## Accuracy : 0.8104
## 95% CI : (0.793, 0.8269)
## No Information Rate : 0.7346
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.471
##
## Mcnemar's Test P-Value : 1.789e-12
##
## Sensitivity : 0.5161
## Specificity : 0.9168
## Pos Pred Value : 0.6914
## Neg Pred Value : 0.8398
## Prevalence : 0.2654
## Detection Rate : 0.1370
## Detection Prevalence : 0.1981
## Balanced Accuracy : 0.7164
##
## 'Positive' Class : 1
##
# Store metrics
model_results["NN_deep", "Accuracy"] <- nn_deep_cm$overall["Accuracy"]
model_results["NN_deep", "Precision"] <- nn_deep_cm$byClass["Precision"]
model_results["NN_deep", "Recall"] <- nn_deep_cm$byClass["Recall"]
model_results["NN_deep", "F1"] <- nn_deep_cm$byClass["F1"]
model_results["NN_deep", "AUC"] <- NA # no raw probs available
# Display updated metrics
model_results
## Accuracy Precision Recall F1 AUC
## RF_default 0.7966825 0.6609337 0.4803571 0.5563599 0.8426123
## RF_tuned 0.8180095 0.7471910 0.4750000 0.5807860 0.8523652
## NN_simple 0.8099526 0.7002519 0.4964286 0.5809822 NA
## NN_deep 0.8104265 0.6913876 0.5160714 0.5910020 NA
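To close the loop on the recall-weighted comparison described in the introduction, the hypothetical f_beta helper sketched there can be applied across the finished matrix:
# Rank the four models by recall-weighted F2 (uses the hypothetical f_beta helper above)
f2_scores <- f_beta(model_results[, "Precision"], model_results[, "Recall"])
sort(f2_scores, decreasing = TRUE)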