Assignment 4: Telco Customer Churn

Choose a dataset: You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or on something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, the Census Bureau, USGS, or other open data portals. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15, to apply to the selected dataset.

Dataset chosen: Telco Customer Churn Dataset

Located at https://www.kaggle.com/datasets/blastchar/telco-customer-churn

Attributes Include:

  1. Customer ID
  2. Demographics: gender, senior citizen, partner, dependents
  3. Services: phone service, internet service, online security, streaming TV, etc.
  4. Account information: tenure, monthly charges, total charges
  5. Target variable: Churn (Yes/No)
# Load required libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.3
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(keras)
## Warning: package 'keras' was built under R version 4.4.3
# Load dataset (read_csv() expects comma-delimited files; use read_csv2() for semicolon-delimited data)
churn_data <- read_csv("C:/Users/Dell/Downloads/Assignment4D622/WA_Fn-UseC_-Telco-Customer-Churn.csv")
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first 10 rows
head(churn_data, 10)
## # A tibble: 10 × 21
##    customerID gender SeniorCitizen Partner Dependents tenure PhoneService
##    <chr>      <chr>          <dbl> <chr>   <chr>       <dbl> <chr>       
##  1 7590-VHVEG Female             0 Yes     No              1 No          
##  2 5575-GNVDE Male               0 No      No             34 Yes         
##  3 3668-QPYBK Male               0 No      No              2 Yes         
##  4 7795-CFOCW Male               0 No      No             45 No          
##  5 9237-HQITU Female             0 No      No              2 Yes         
##  6 9305-CDSKC Female             0 No      No              8 Yes         
##  7 1452-KIOVK Male               0 No      Yes            22 Yes         
##  8 6713-OKOMC Female             0 No      No             10 No          
##  9 7892-POOKP Female             0 Yes     No             28 Yes         
## 10 6388-TABGU Male               0 No      Yes            62 Yes         
## # ℹ 14 more variables: MultipleLines <chr>, InternetService <chr>,
## #   OnlineSecurity <chr>, OnlineBackup <chr>, DeviceProtection <chr>,
## #   TechSupport <chr>, StreamingTV <chr>, StreamingMovies <chr>,
## #   Contract <chr>, PaperlessBilling <chr>, PaymentMethod <chr>,
## #   MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <chr>

Statistics

# Load required libraries
library(tidyverse)
library(skimr)
## Warning: package 'skimr' was built under R version 4.4.3
# 1. View dataset structure and dimensions
str(churn_data)
## spc_tbl_ [7,043 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ customerID      : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr [1:7043] "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr [1:7043] "Yes" "No" "No" "No" ...
##  $ Dependents      : chr [1:7043] "No" "No" "No" "No" ...
##  $ tenure          : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr [1:7043] "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr [1:7043] "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr [1:7043] "No" "No" "No" "No" ...
##  $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
##  $ Contract        : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr [1:7043] "No" "No" "Yes" "No" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   customerID = col_character(),
##   ..   gender = col_character(),
##   ..   SeniorCitizen = col_double(),
##   ..   Partner = col_character(),
##   ..   Dependents = col_character(),
##   ..   tenure = col_double(),
##   ..   PhoneService = col_character(),
##   ..   MultipleLines = col_character(),
##   ..   InternetService = col_character(),
##   ..   OnlineSecurity = col_character(),
##   ..   OnlineBackup = col_character(),
##   ..   DeviceProtection = col_character(),
##   ..   TechSupport = col_character(),
##   ..   StreamingTV = col_character(),
##   ..   StreamingMovies = col_character(),
##   ..   Contract = col_character(),
##   ..   PaperlessBilling = col_character(),
##   ..   PaymentMethod = col_character(),
##   ..   MonthlyCharges = col_double(),
##   ..   TotalCharges = col_double(),
##   ..   Churn = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
dim(churn_data)  # Rows and columns
## [1] 7043   21
glimpse(churn_data)  # Quick glance
## Rows: 7,043
## Columns: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
# 2. Check for missing values
colSums(is.na(churn_data))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0
# 3. Summarize numeric variables
numeric_vars <- churn_data %>% select(where(is.numeric))
summary(numeric_vars)
##  SeniorCitizen        tenure      MonthlyCharges    TotalCharges   
##  Min.   :0.0000   Min.   : 0.00   Min.   : 18.25   Min.   :  18.8  
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 35.50   1st Qu.: 401.4  
##  Median :0.0000   Median :29.00   Median : 70.35   Median :1397.5  
##  Mean   :0.1621   Mean   :32.37   Mean   : 64.76   Mean   :2283.3  
##  3rd Qu.:0.0000   3rd Qu.:55.00   3rd Qu.: 89.85   3rd Qu.:3794.7  
##  Max.   :1.0000   Max.   :72.00   Max.   :118.75   Max.   :8684.8  
##                                                    NA's   :11
# Alternatively, more detailed:
skim(churn_data)  # Includes missing, means, percentiles, etc.
Data summary

  Name                     churn_data
  Number of rows           7043
  Number of columns        21
  Column type frequency:   character 17, numeric 4
  Group variables          None

Variable type: character

  skim_variable      n_missing  complete_rate  min  max  empty  n_unique  whitespace
  customerID                 0              1   10   10      0      7043           0
  gender                     0              1    4    6      0         2           0
  Partner                    0              1    2    3      0         2           0
  Dependents                 0              1    2    3      0         2           0
  PhoneService               0              1    2    3      0         2           0
  MultipleLines              0              1    2   16      0         3           0
  InternetService            0              1    2   11      0         3           0
  OnlineSecurity             0              1    2   19      0         3           0
  OnlineBackup               0              1    2   19      0         3           0
  DeviceProtection           0              1    2   19      0         3           0
  TechSupport                0              1    2   19      0         3           0
  StreamingTV                0              1    2   19      0         3           0
  StreamingMovies            0              1    2   19      0         3           0
  Contract                   0              1    8   14      0         3           0
  PaperlessBilling           0              1    2    3      0         2           0
  PaymentMethod              0              1   12   25      0         4           0
  Churn                      0              1    2    3      0         2           0

Variable type: numeric

  skim_variable    n_missing  complete_rate     mean       sd     p0     p25      p50      p75     p100  hist
  SeniorCitizen            0              1     0.16     0.37   0.00    0.00     0.00     0.00     1.00  ▇▁▁▁▂
  tenure                   0              1    32.37    24.56   0.00    9.00    29.00    55.00    72.00  ▇▃▃▃▆
  MonthlyCharges           0              1    64.76    30.09  18.25   35.50    70.35    89.85   118.75  ▇▅▆▇▅
  TotalCharges            11              1  2283.30  2266.77  18.80  401.45  1397.47  3794.74  8684.80  ▇▂▂▂▁
# 4. Frequency of categorical variables
cat_vars <- churn_data %>% select(where(is.character))

cat_summary <- function(df) {
  lapply(df, function(x) {
    tbl <- table(x)
    prop <- prop.table(tbl)
    data.frame(Level = names(tbl), Count = as.vector(tbl), Proportion = round(100 * as.vector(prop), 2))
  })
}

categorical_summaries <- cat_summary(cat_vars)

# Print summaries (example for gender and InternetService)
print(categorical_summaries$gender)
##    Level Count Proportion
## 1 Female  3488      49.52
## 2   Male  3555      50.48
print(categorical_summaries$InternetService)
##         Level Count Proportion
## 1         DSL  2421      34.37
## 2 Fiber optic  3096      43.96
## 3          No  1526      21.67
# 5. Churn rate breakdown
table(churn_data$Churn)
## 
##   No  Yes 
## 5174 1869
prop.table(table(churn_data$Churn))
## 
##        No       Yes 
## 0.7346301 0.2653699

The data preparation process involved several key steps to ready the dataset for modeling. Missing values in the 'TotalCharges' column were addressed by converting the column to a numeric type and imputing the missing entries with the median (the GAM workflow later drops those rows instead). The non-predictive 'customerID' column was removed, and the categorical features, together with 'SeniorCitizen' and the target 'Churn', were converted to factors, a format the tree-based and GAM methods handle directly. One-hot encoding and standardization of 'tenure', 'MonthlyCharges', and 'TotalCharges' were considered but are not required here, since neither trees nor GAMs are sensitive to feature scaling; a sketch of how they could be added is shown below. Finally, the dataset was partitioned into training and testing subsets, 80/20 for the decision tree and 70/30 for the GAM and random forest comparison, to allow for model development and unbiased performance evaluation.
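As an illustration of the optional preprocessing mentioned above, the sketch below shows how one-hot encoding and standardization could be added with caret. It assumes the cleaning steps in the next section have already been applied (customerID removed, character columns converted to factors, missing TotalCharges imputed); the object names dummies, encoded_data, and pre_proc are illustrative and are not used elsewhere in this report.

# Optional preprocessing sketch (not used in the workflows below).
# Assumes customerID has been dropped, character columns converted to factors,
# and missing TotalCharges imputed, as in the cleaning section that follows.
library(caret)

# One-hot encode every factor predictor except the target
dummies      <- dummyVars(Churn ~ ., data = churn_data)
encoded_data <- data.frame(predict(dummies, newdata = churn_data))
encoded_data$Churn <- churn_data$Churn

# Standardize the continuous variables (center and scale)
num_cols <- c("tenure", "MonthlyCharges", "TotalCharges")
pre_proc <- preProcess(encoded_data[, num_cols], method = c("center", "scale"))
encoded_data[, num_cols] <- predict(pre_proc, encoded_data[, num_cols])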

Clean and tidy data

# Replace empty strings with NA
churn_data[churn_data == ""] <- NA

# View missing values
colSums(is.na(churn_data))
##       customerID           gender    SeniorCitizen          Partner 
##                0                0                0                0 
##       Dependents           tenure     PhoneService    MultipleLines 
##                0                0                0                0 
##  InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
##                0                0                0                0 
##      TechSupport      StreamingTV  StreamingMovies         Contract 
##                0                0                0                0 
## PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
##                0                0                0               11 
##            Churn 
##                0
# Convert 'TotalCharges' to numeric
churn_data$TotalCharges <- as.numeric(churn_data$TotalCharges)

# Impute missing TotalCharges with median
churn_data$TotalCharges[is.na(churn_data$TotalCharges)] <- median(churn_data$TotalCharges, na.rm = TRUE)

# Remove customerID (not a predictive feature)
churn_data <- churn_data %>% select(-customerID)

# Convert SeniorCitizen to factor
churn_data$SeniorCitizen <- as.factor(churn_data$SeniorCitizen)

# Convert 'Churn' to factor (target variable)
churn_data$Churn <- as.factor(churn_data$Churn)

# Convert other character variables to factors
churn_data <- churn_data %>%
  mutate_if(is.character, as.factor)

# Confirm structure
str(churn_data)
## tibble [7,043 × 20] (S3: tbl_df/tbl/data.frame)
##  $ gender          : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
##  $ SeniorCitizen   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Partner         : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
##  $ Dependents      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
##  $ tenure          : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
##  $ MultipleLines   : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
##  $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity  : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
##  $ OnlineBackup    : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
##  $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
##  $ TechSupport     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ StreamingTV     : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
##  $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
##  $ Contract        : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
##  $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
##  $ PaymentMethod   : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
##  $ MonthlyCharges  : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...

Split data

# Set seed for reproducibility
set.seed(123)

# Create training and testing partitions
split_index <- createDataPartition(churn_data$Churn, p = 0.8, list = FALSE)
train_data <- churn_data[split_index, ]
test_data  <- churn_data[-split_index, ]

Applying a Decision Tree

library(rpart)
## Warning: package 'rpart' was built under R version 4.4.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
# Train decision tree
tree_model <- rpart(Churn ~ ., data = train_data, method = "class")

# Plot the tree
rpart.plot(tree_model)

# Predict
tree_preds <- predict(tree_model, test_data, type = "class")

# Confusion matrix
confusionMatrix(tree_preds, test_data$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  968 233
##        Yes  66 140
##                                           
##                Accuracy : 0.7875          
##                  95% CI : (0.7652, 0.8086)
##     No Information Rate : 0.7349          
##     P-Value [Acc > NIR] : 2.83e-06        
##                                           
##                   Kappa : 0.3635          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9362          
##             Specificity : 0.3753          
##          Pos Pred Value : 0.8060          
##          Neg Pred Value : 0.6796          
##              Prevalence : 0.7349          
##          Detection Rate : 0.6880          
##    Detection Prevalence : 0.8536          
##       Balanced Accuracy : 0.6558          
##                                           
##        'Positive' Class : No              
## 

Decision tree analysis results

The decision tree model achieved an accuracy of 78.75%, significantly above the no-information rate of 73.49% (p ≈ 2.8e-06). Its Kappa statistic of 0.36 indicates fair agreement beyond chance. Because the positive class in the confusion matrix is "No", the high sensitivity of 93.62% reflects the model's strength at confirming non-churning customers, while the low specificity of 37.53% shows it struggles to pinpoint churners. The positive predictive value for non-churn predictions is 80.60%, whereas the negative predictive value for churn predictions is lower at 67.96%, indicating more uncertainty when the model flags churn. The business implication is confidence in identifying stable customers but a substantial underestimation of churn risk, which could affect how resources are allocated for retention.

This model shows promise, with overall accuracy above the baseline and strong performance in identifying non-churners. However, the low specificity means it misses more than 60% of the actual churners in the test set (233 of 373). From a business perspective, the cost of these missed churn predictions could be significant; one possible mitigation is sketched below.
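One way to trade some false alarms for fewer missed churners is to make the tree cost-sensitive through rpart's loss matrix. The sketch below assumes a 4:1 cost ratio for illustration only (it is not a value estimated from this analysis), and the objects loss_mat, tree_weighted, and weighted_preds are illustrative names.

# Cost-sensitive variant of the decision tree (illustrative; the 4:1 cost
# ratio is an assumption, not a value derived from this analysis).
# In rpart's loss matrix, rows are the true classes and columns the predicted
# classes, in factor-level order ("No", "Yes"), with zeros on the diagonal.
loss_mat <- matrix(c(0, 4,    # column 1: predicted "No"  (missing a churner costs 4)
                     1, 0),   # column 2: predicted "Yes" (a false alarm costs 1)
                   nrow = 2)

tree_weighted  <- rpart(Churn ~ ., data = train_data, method = "class",
                        parms = list(loss = loss_mat))
weighted_preds <- predict(tree_weighted, test_data, type = "class")
confusionMatrix(weighted_preds, test_data$Churn)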

GAM

# Load libraries
library(tidyverse)
library(caret)
library(mgcv)       # For GAM
## Warning: package 'mgcv' was built under R version 4.4.3
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 4.4.3
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
library(pROC)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Load dataset
file_path <- "C:/Users/Dell/Downloads/Assignment4D622/WA_Fn-UseC_-Telco-Customer-Churn.csv"
churn_data <- read_csv(file_path)
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl  (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Data Cleaning

# Remove rows with missing TotalCharges
churn_data <- churn_data %>%
  filter(!is.na(TotalCharges))

# Convert character variables to factors
churn_data <- churn_data %>%
  mutate(
    SeniorCitizen = factor(SeniorCitizen),
    Churn = factor(Churn),
    across(where(is.character), as.factor)
  )

# Data Split

set.seed(123)
train_index <- createDataPartition(churn_data$Churn, p = 0.7, list = FALSE)
train_data <- churn_data[train_index, ]
test_data <- churn_data[-train_index, ]

# Remove customerID
train_data <- train_data %>% select(-customerID)
test_data <- test_data %>% select(-customerID)

# GAM Model (Explainable)

# Fit GAM model
gam_model <- gam(Churn ~ s(tenure) + s(MonthlyCharges) + s(TotalCharges) + 
                   Contract + InternetService + PaymentMethod + gender + 
                   SeniorCitizen + Partner + Dependents + PhoneService + 
                   OnlineSecurity + OnlineBackup + TechSupport + StreamingTV +
                   StreamingMovies + MultipleLines + PaperlessBilling,
                 data = train_data, family = binomial)

# Predict probabilities
gam_probs <- predict(gam_model, newdata = test_data, type = "response")
gam_preds <- ifelse(gam_probs > 0.5, "Yes", "No") %>% factor(levels = c("No", "Yes"))

# Confusion matrix
confusionMatrix(gam_preds, test_data$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1395  245
##        Yes  153  315
##                                           
##                Accuracy : 0.8112          
##                  95% CI : (0.7938, 0.8277)
##     No Information Rate : 0.7343          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4893          
##                                           
##  Mcnemar's Test P-Value : 5.081e-06       
##                                           
##             Sensitivity : 0.9012          
##             Specificity : 0.5625          
##          Pos Pred Value : 0.8506          
##          Neg Pred Value : 0.6731          
##              Prevalence : 0.7343          
##          Detection Rate : 0.6618          
##    Detection Prevalence : 0.7780          
##       Balanced Accuracy : 0.7318          
##                                           
##        'Positive' Class : No              
## 
# ROC AUC
roc_gam <- roc(test_data$Churn, gam_probs)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc(roc_gam)
## Area under the curve: 0.8552
# Compare with a Random Forest benchmark

set.seed(123)
rf_model <- randomForest(Churn ~ ., data = train_data, importance = TRUE)
rf_preds <- predict(rf_model, newdata = test_data)
confusionMatrix(rf_preds, test_data$Churn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No  1386  252
##        Yes  162  308
##                                          
##                Accuracy : 0.8036         
##                  95% CI : (0.786, 0.8204)
##     No Information Rate : 0.7343         
##     P-Value [Acc > NIR] : 6.717e-14      
##                                          
##                   Kappa : 0.4694         
##                                          
##  Mcnemar's Test P-Value : 1.219e-05      
##                                          
##             Sensitivity : 0.8953         
##             Specificity : 0.5500         
##          Pos Pred Value : 0.8462         
##          Neg Pred Value : 0.6553         
##              Prevalence : 0.7343         
##          Detection Rate : 0.6575         
##    Detection Prevalence : 0.7770         
##       Balanced Accuracy : 0.7227         
##                                          
##        'Positive' Class : No             
## 

GAM analysis results

The Generalized Additive Model (GAM) achieved an accuracy of 81.12% and a Kappa of 0.4893, indicating good overall performance and moderate agreement beyond chance. Similar to the decision tree, the GAM excels at identifying non-churning customers (sensitivity of 90.12%) but has a lower ability to correctly identify churners (specificity of 56.25%). The ROC AUC of 0.8552 signifies good discriminative power between the two classes.

A key advantage of the GAM is its interpretability through smooth functions, allowing analysts to visualize how continuous variables like tenure and monthly charges influence churn risk, which can inform targeted retention efforts. While the GAM and the decision tree show broadly comparable predictive performance, the GAM's transparency in explaining predictions makes it potentially more valuable for strategic business decisions. Its higher accuracy, Kappa, and specificity relative to the decision tree (keeping in mind that the two models were evaluated on different splits, 70/30 versus 80/20) suggest a modest performance improvement, even though its sensitivity for non-churners is slightly lower. The ongoing challenge for both models remains improving the identification of churners; one option, tuning the classification threshold, is sketched below.
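Because the 0.5 cutoff used above favors the majority class, one option is to pick the probability threshold from the ROC curve instead. This is a minimal sketch using pROC's Youden-optimal threshold on the fitted GAM; whether the resulting sensitivity/specificity trade-off is acceptable is ultimately a business decision, and best_cut and gam_preds_tuned are illustrative names.

# Choose a cutoff that balances sensitivity and specificity (Youden's J)
# instead of the default 0.5, then re-score the test set with it.
best_cut <- coords(roc_gam, "best", best.method = "youden",
                   ret = c("threshold", "sensitivity", "specificity"))
best_cut

gam_preds_tuned <- factor(ifelse(gam_probs > best_cut$threshold, "Yes", "No"),
                          levels = c("No", "Yes"))
confusionMatrix(gam_preds_tuned, test_data$Churn)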

Visualize the GAM smooth terms

# Assumes `gam_model` is your fitted GAM model
plot(gam_model, se = TRUE, col = "blue", pages = 1)

Interpretation

This visualization of the smooth terms for tenure, MonthlyCharges, and TotalCharges in the GAM provides further insight into how these continuous variables influence churn probability, offering interpretability that complements the overall performance metrics.
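The same information can also be pulled out numerically for reporting. This is a sketch using mgcv's predict() with type = "terms" to extract each smooth's contribution to the log-odds of churn for the test customers; term_contrib is an illustrative object name.

# Each smooth term's contribution to the linear predictor (log-odds of churn);
# larger positive values push a customer toward churn.
term_contrib <- predict(gam_model, newdata = test_data, type = "terms")
head(term_contrib[, c("s(tenure)", "s(MonthlyCharges)", "s(TotalCharges)")])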

Generate predictions

# Predict probabilities for Random Forest
rf_preds <- predict(rf_model, newdata = test_data, type = "prob")[, "Yes"]

# Predict probabilities for GAM
gam_preds <- predict(gam_model, newdata = test_data, type = "response")
# Create and export predictions DataFrame
# Check for customerID column
if ("customerID" %in% names(test_data)) {
  export_df <- data.frame(
    customerID = test_data$customerID,
    Actual = test_data$Churn,
    RF_Pred_Prob = rf_preds,
    GAM_Pred_Prob = gam_preds
  )
} else {
  export_df <- data.frame(
    ID = 1:nrow(test_data),
    Actual = test_data$Churn,
    RF_Pred_Prob = rf_preds,
    GAM_Pred_Prob = gam_preds
  )
}

# Export to CSV for business use
write.csv(export_df, "Churn_Model_Predictions.csv", row.names = FALSE)

Plot ROC curves for comparison

library(pROC)

# ROC curve (assuming positive class is "Yes")
rf_roc <- roc(test_data$Churn, rf_preds, levels = c("No", "Yes"), direction = "<")
gam_roc <- roc(test_data$Churn, gam_preds, levels = c("No", "Yes"), direction = "<")

# Plot ROC curves
plot(rf_roc, col = "blue", main = "ROC Curve Comparison: Random Forest vs GAM", lwd = 2)
lines(gam_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Random Forest", "GAM"),
       col = c("blue", "red"), lwd = 2)

# Print AUCs
cat("Random Forest AUC:", round(auc(rf_roc), 4), "\n")
## Random Forest AUC: 0.8357
cat("GAM AUC:", round(auc(gam_roc), 4), "\n")
## GAM AUC: 0.8552

Interpretation

Both the random forest and the GAM demonstrate good ability to rank customers by churn risk, as indicated by AUC scores above 0.8. The GAM discriminates better between churn and non-churn customers, with a higher AUC (0.8552 versus 0.8357) and an ROC curve that sits above the random forest's over most thresholds. This suggests that the GAM is likely to provide more reliable probability estimates for churn prediction; a formal comparison of the two AUCs is sketched below.
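As a quick check that the AUC gap is not just noise, the two correlated ROC curves (they share the same test set) can be compared with pROC's DeLong test. This is a sketch of how such a test could be run, not a result reported above.

# Formal comparison of the two AUCs on the same test set (DeLong's test)
roc.test(gam_roc, rf_roc, method = "delong")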

Train & Save the GAM Model

library(mgcv)

# Train a simplified GAM for deployment, using only the three smooth terms
# (this replaces the full GAM fitted above)
gam_model <- gam(Churn ~ s(tenure) + s(MonthlyCharges) + s(TotalCharges), 
                 data = train_data, family = binomial)

# Save the model to a file for deployment
saveRDS(gam_model, file = "gam_model.rds")

Deploy the Model for Predictions

# Load model
gam_model <- readRDS("gam_model.rds")

# Make predictions
pred_probs <- predict(gam_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")
par(mfrow = c(1, 3))
plot(gam_model, se = TRUE, col = "blue", main = "Effect on Churn Probability")

library(ggplot2)
library(gratia)  # for tidy GAM visualization
## Warning: package 'gratia' was built under R version 4.4.3
## 
## Attaching package: 'gratia'
## The following object is masked from 'package:stringr':
## 
##     boundary
# visualize effect of tenure
draw(gam_model, select = "s(tenure)") + 
  labs(title = "Effect of Tenure on Churn Probability")

GAM partial-effect results

The GAM partial effect plots reveal key non-linear relationships influencing churn probability:

Tenure: New customers exhibit the highest churn risk. This risk significantly decreases after around 20 months and remains low through mid-tenure. Interestingly, churn risk shows a slight increase again for very long-term customers (50+ months).

Monthly Charges: Customers with low to moderate monthly charges have the lowest churn. As charges exceed $60-$80, churn risk sharply increases, suggesting price sensitivity. At the very highest charges, churn risk plateaus or slightly decreases, potentially indicating a stickier segment of high-value customers.

Total Charges: Low total charges (indicating newer customers) correlate with high churn. A stable mid-range of total charges corresponds to lower churn. However, higher total charges are associated with a renewed increase in churn risk, suggesting that even long-term, high-paying customers are not immune to leaving.
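To express the tenure pattern on the probability scale, the deployment GAM (which uses only the three smooth terms) can be evaluated over a grid of tenure values while holding the charge variables at their training medians. This is an illustrative sketch; the 0-72 month grid and the use of medians are assumptions made here for plotting, and tenure_grid is an illustrative object name.

# Predicted churn probability across tenure, holding charges at their medians
tenure_grid <- data.frame(
  tenure         = 0:72,
  MonthlyCharges = median(train_data$MonthlyCharges),
  TotalCharges   = median(train_data$TotalCharges)
)
tenure_grid$churn_prob <- predict(gam_model, newdata = tenure_grid, type = "response")

ggplot(tenure_grid, aes(x = tenure, y = churn_prob)) +
  geom_line(color = "blue") +
  labs(title = "Predicted churn probability by tenure (charges held at median)",
       x = "Tenure (months)", y = "Predicted churn probability")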

Conclusions

In evaluating the performance of the churn prediction models, the Generalized Additive Model (GAM) demonstrates a clear advantage over the tree-based approaches. It achieves a higher Area Under the Curve (AUC) of 0.8552 than the random forest benchmark's 0.8357, indicating a better ability to distinguish between customers who will churn and those who will not, and it also outperforms the single decision tree on accuracy (81.1% versus 78.8%) and Kappa (0.49 versus 0.36). This superiority is visible in the ROC comparison as well, where the GAM's curve occupies a more favorable position across a range of classification thresholds. While the models all achieve AUC scores above 0.8, signifying overall good predictive power, the GAM shows a consistent edge.

Examining the behavior of the models reveals further distinctions. The GAM offers a more consistent and reliable balance between sensitivity (correctly identifying non-churners) and specificity (correctly identifying churners) across different decision-making points. This suggests that the GAM is a more robust tool for making informed business decisions regarding customer retention. In contrast, while the Decision Tree provides a more straightforward and interpretable understanding of the prediction process, it suffers from slightly lower accuracy and a greater challenge in correctly identifying customers who are likely to churn, a critical aspect for proactive retention strategies.

Furthermore, the GAM’s ability to model non-linear relationships between key predictive features and churn risk provides valuable insights. Through its smooth plots for variables like tenure, monthly charges, and total charges, the GAM captures complex patterns that a simpler model like a decision tree might miss. This enhanced predictive accuracy, coupled with the interpretability offered by visualizing these non-linear effects, positions the GAM as a powerful tool for understanding the drivers of churn and informing targeted interventions.

Business impact

The superior performance of the Generalized Additive Model (GAM) in discriminating between potential churners and loyal customers has significant implications for business strategy. Its higher AUC and more reliable performance across various decision thresholds enable more precise targeting of at-risk customers for retention campaigns. This accuracy translates to tangible benefits, including a reduction in false positives, preventing the wasteful allocation of resources to customers unlikely to churn, and a decrease in false negatives, ensuring that actual churn risks are not overlooked. Ultimately, this leads to more efficient and cost-effective customer retention initiatives.

Beyond its predictive accuracy, the interpretability of the GAM, particularly through its smooth plots illustrating the non-linear effects of key factors like tenure and monthly charges, empowers data-driven decision-making within the organization. These insights provide stakeholders with a deeper understanding of the drivers of churn, which can directly inform strategic business decisions. For instance, the model can highlight price sensitivity in certain customer segments, suggesting targeted discount strategies, or identify critical tenure thresholds where proactive customer engagement initiatives, such as offering specialized support or loyalty incentives, could be most effective.

Considering the practical aspects of model deployment, the GAM emerges as the preferred candidate due to its superior predictive performance and consistent reliability. While the Decision Tree, with its simpler interpretability, may still hold value for providing quick explanations or as a component within a more complex ensemble modeling approach, the GAM’s enhanced ability to accurately identify at-risk customers makes it a more impactful tool for driving effective customer retention strategies and informed business decisions.