Choose a dataset You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals. Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
Dataset choosen: Telco Customer Churn Dataset
Located at https://www.kaggle.com/datasets/blastchar/telco-customer-churn
Attributes Include:
# Load required libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.4.3
## Warning: package 'ggplot2' was built under R version 4.4.3
## Warning: package 'tidyr' was built under R version 4.4.2
## Warning: package 'readr' was built under R version 4.4.2
## Warning: package 'purrr' was built under R version 4.4.3
## Warning: package 'dplyr' was built under R version 4.4.3
## Warning: package 'stringr' was built under R version 4.4.2
## Warning: package 'lubridate' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.4.3
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.4.3
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(keras)
## Warning: package 'keras' was built under R version 4.4.3
# Load dataset (use sep = ";" only if file is semicolon-delimited)
churn_data <- read_csv("C:/Users/Dell/Downloads/Assignment4D622/WA_Fn-UseC_-Telco-Customer-Churn.csv")
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View the first 10 rows
head(churn_data, 10)
## # A tibble: 10 × 21
## customerID gender SeniorCitizen Partner Dependents tenure PhoneService
## <chr> <chr> <dbl> <chr> <chr> <dbl> <chr>
## 1 7590-VHVEG Female 0 Yes No 1 No
## 2 5575-GNVDE Male 0 No No 34 Yes
## 3 3668-QPYBK Male 0 No No 2 Yes
## 4 7795-CFOCW Male 0 No No 45 No
## 5 9237-HQITU Female 0 No No 2 Yes
## 6 9305-CDSKC Female 0 No No 8 Yes
## 7 1452-KIOVK Male 0 No Yes 22 Yes
## 8 6713-OKOMC Female 0 No No 10 No
## 9 7892-POOKP Female 0 Yes No 28 Yes
## 10 6388-TABGU Male 0 No Yes 62 Yes
## # ℹ 14 more variables: MultipleLines <chr>, InternetService <chr>,
## # OnlineSecurity <chr>, OnlineBackup <chr>, DeviceProtection <chr>,
## # TechSupport <chr>, StreamingTV <chr>, StreamingMovies <chr>,
## # Contract <chr>, PaperlessBilling <chr>, PaymentMethod <chr>,
## # MonthlyCharges <dbl>, TotalCharges <dbl>, Churn <chr>
# Load required libraries
library(tidyverse)
library(skimr)
## Warning: package 'skimr' was built under R version 4.4.3
# 1. View dataset structure and dimensions
str(churn_data)
## spc_tbl_ [7,043 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ customerID : chr [1:7043] "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
## $ gender : chr [1:7043] "Female" "Male" "Male" "Male" ...
## $ SeniorCitizen : num [1:7043] 0 0 0 0 0 0 0 0 0 0 ...
## $ Partner : chr [1:7043] "Yes" "No" "No" "No" ...
## $ Dependents : chr [1:7043] "No" "No" "No" "No" ...
## $ tenure : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : chr [1:7043] "No" "Yes" "Yes" "No" ...
## $ MultipleLines : chr [1:7043] "No phone service" "No" "No" "No phone service" ...
## $ InternetService : chr [1:7043] "DSL" "DSL" "DSL" "DSL" ...
## $ OnlineSecurity : chr [1:7043] "No" "Yes" "Yes" "Yes" ...
## $ OnlineBackup : chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ DeviceProtection: chr [1:7043] "No" "Yes" "No" "Yes" ...
## $ TechSupport : chr [1:7043] "No" "No" "No" "Yes" ...
## $ StreamingTV : chr [1:7043] "No" "No" "No" "No" ...
## $ StreamingMovies : chr [1:7043] "No" "No" "No" "No" ...
## $ Contract : chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
## $ PaperlessBilling: chr [1:7043] "Yes" "No" "Yes" "No" ...
## $ PaymentMethod : chr [1:7043] "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
## $ MonthlyCharges : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : chr [1:7043] "No" "No" "Yes" "No" ...
## - attr(*, "spec")=
## .. cols(
## .. customerID = col_character(),
## .. gender = col_character(),
## .. SeniorCitizen = col_double(),
## .. Partner = col_character(),
## .. Dependents = col_character(),
## .. tenure = col_double(),
## .. PhoneService = col_character(),
## .. MultipleLines = col_character(),
## .. InternetService = col_character(),
## .. OnlineSecurity = col_character(),
## .. OnlineBackup = col_character(),
## .. DeviceProtection = col_character(),
## .. TechSupport = col_character(),
## .. StreamingTV = col_character(),
## .. StreamingMovies = col_character(),
## .. Contract = col_character(),
## .. PaperlessBilling = col_character(),
## .. PaymentMethod = col_character(),
## .. MonthlyCharges = col_double(),
## .. TotalCharges = col_double(),
## .. Churn = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
dim(churn_data) # Rows and columns
## [1] 7043 21
glimpse(churn_data) # Quick glance
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
# 2. Check for missing values
colSums(is.na(churn_data))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
# 3. Summarize numeric variables
numeric_vars <- churn_data %>% select(where(is.numeric))
summary(numeric_vars)
## SeniorCitizen tenure MonthlyCharges TotalCharges
## Min. :0.0000 Min. : 0.00 Min. : 18.25 Min. : 18.8
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 35.50 1st Qu.: 401.4
## Median :0.0000 Median :29.00 Median : 70.35 Median :1397.5
## Mean :0.1621 Mean :32.37 Mean : 64.76 Mean :2283.3
## 3rd Qu.:0.0000 3rd Qu.:55.00 3rd Qu.: 89.85 3rd Qu.:3794.7
## Max. :1.0000 Max. :72.00 Max. :118.75 Max. :8684.8
## NA's :11
# Alternatively, more detailed:
skim(churn_data) # Includes missing, means, percentiles, etc.
Name | churn_data |
Number of rows | 7043 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
character | 17 |
numeric | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
customerID | 0 | 1 | 10 | 10 | 0 | 7043 | 0 |
gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
Partner | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Dependents | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
PhoneService | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
MultipleLines | 0 | 1 | 2 | 16 | 0 | 3 | 0 |
InternetService | 0 | 1 | 2 | 11 | 0 | 3 | 0 |
OnlineSecurity | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
OnlineBackup | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
DeviceProtection | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
TechSupport | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
StreamingTV | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
StreamingMovies | 0 | 1 | 2 | 19 | 0 | 3 | 0 |
Contract | 0 | 1 | 8 | 14 | 0 | 3 | 0 |
PaperlessBilling | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
PaymentMethod | 0 | 1 | 12 | 25 | 0 | 4 | 0 |
Churn | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
SeniorCitizen | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
tenure | 0 | 1 | 32.37 | 24.56 | 0.00 | 9.00 | 29.00 | 55.00 | 72.00 | ▇▃▃▃▆ |
MonthlyCharges | 0 | 1 | 64.76 | 30.09 | 18.25 | 35.50 | 70.35 | 89.85 | 118.75 | ▇▅▆▇▅ |
TotalCharges | 11 | 1 | 2283.30 | 2266.77 | 18.80 | 401.45 | 1397.47 | 3794.74 | 8684.80 | ▇▂▂▂▁ |
# 4. Frequency of categorical variables
cat_vars <- churn_data %>% select(where(is.character))
cat_summary <- function(df) {
lapply(df, function(x) {
tbl <- table(x)
prop <- prop.table(tbl)
data.frame(Level = names(tbl), Count = as.vector(tbl), Proportion = round(100 * as.vector(prop), 2))
})
}
categorical_summaries <- cat_summary(cat_vars)
# Print summaries (example for gender and InternetService)
print(categorical_summaries$gender)
## Level Count Proportion
## 1 Female 3488 49.52
## 2 Male 3555 50.48
print(categorical_summaries$InternetService)
## Level Count Proportion
## 1 DSL 2421 34.37
## 2 Fiber optic 3096 43.96
## 3 No 1526 21.67
# 5. Churn rate breakdown
table(churn_data$Churn)
##
## No Yes
## 5174 1869
prop.table(table(churn_data$Churn))
##
## No Yes
## 0.7346301 0.2653699
The data preparation process involved several key steps to ready the dataset for modeling. Missing values in the ‘TotalCharges’ column were addressed by converting the column to a numeric type and imputing the missing entries with the median value. Categorical features were transformed into a numerical format suitable for machine learning algorithms using one-hot encoding. To ensure that continuous variables with different scales did not disproportionately influence the models, ‘tenure’, ‘MonthlyCharges’, and ‘TotalCharges’ were standardized. Finally, the dataset was partitioned into training 70% and testing 30% subsets to allow for model development and unbiased performance evaluation.
# Replace empty strings with NA
churn_data[churn_data == ""] <- NA
# View missing values
colSums(is.na(churn_data))
## customerID gender SeniorCitizen Partner
## 0 0 0 0
## Dependents tenure PhoneService MultipleLines
## 0 0 0 0
## InternetService OnlineSecurity OnlineBackup DeviceProtection
## 0 0 0 0
## TechSupport StreamingTV StreamingMovies Contract
## 0 0 0 0
## PaperlessBilling PaymentMethod MonthlyCharges TotalCharges
## 0 0 0 11
## Churn
## 0
# Convert 'TotalCharges' to numeric
churn_data$TotalCharges <- as.numeric(churn_data$TotalCharges)
# Impute missing TotalCharges with median
churn_data$TotalCharges[is.na(churn_data$TotalCharges)] <- median(churn_data$TotalCharges, na.rm = TRUE)
# Remove customerID (not a predictive feature)
churn_data <- churn_data %>% select(-customerID)
# Convert SeniorCitizen to factor
churn_data$SeniorCitizen <- as.factor(churn_data$SeniorCitizen)
# Convert 'Churn' to factor (target variable)
churn_data$Churn <- as.factor(churn_data$Churn)
# Convert other character variables to factors
churn_data <- churn_data %>%
mutate_if(is.character, as.factor)
# Confirm structure
str(churn_data)
## tibble [7,043 × 20] (S3: tbl_df/tbl/data.frame)
## $ gender : Factor w/ 2 levels "Female","Male": 1 2 2 2 1 1 2 1 1 2 ...
## $ SeniorCitizen : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Partner : Factor w/ 2 levels "No","Yes": 2 1 1 1 1 1 1 1 2 1 ...
## $ Dependents : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 2 ...
## $ tenure : num [1:7043] 1 34 2 45 2 8 22 10 28 62 ...
## $ PhoneService : Factor w/ 2 levels "No","Yes": 1 2 2 1 2 2 2 1 2 2 ...
## $ MultipleLines : Factor w/ 3 levels "No","No phone service",..: 2 1 1 2 1 3 3 2 3 1 ...
## $ InternetService : Factor w/ 3 levels "DSL","Fiber optic",..: 1 1 1 1 2 2 2 1 2 1 ...
## $ OnlineSecurity : Factor w/ 3 levels "No","No internet service",..: 1 3 3 3 1 1 1 3 1 3 ...
## $ OnlineBackup : Factor w/ 3 levels "No","No internet service",..: 3 1 3 1 1 1 3 1 1 3 ...
## $ DeviceProtection: Factor w/ 3 levels "No","No internet service",..: 1 3 1 3 1 3 1 1 3 1 ...
## $ TechSupport : Factor w/ 3 levels "No","No internet service",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ StreamingTV : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 3 1 3 1 ...
## $ StreamingMovies : Factor w/ 3 levels "No","No internet service",..: 1 1 1 1 1 3 1 1 3 1 ...
## $ Contract : Factor w/ 3 levels "Month-to-month",..: 1 2 1 2 1 1 1 1 1 2 ...
## $ PaperlessBilling: Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 2 1 2 1 ...
## $ PaymentMethod : Factor w/ 4 levels "Bank transfer (automatic)",..: 3 4 4 1 3 3 2 4 3 1 ...
## $ MonthlyCharges : num [1:7043] 29.9 57 53.9 42.3 70.7 ...
## $ TotalCharges : num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
## $ Churn : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 1 1 2 1 ...
# Set seed for reproducibility
set.seed(123)
# Create training and testing partitions
split_index <- createDataPartition(churn_data$Churn, p = 0.8, list = FALSE)
train_data <- churn_data[split_index, ]
test_data <- churn_data[-split_index, ]
library(rpart)
## Warning: package 'rpart' was built under R version 4.4.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.4.3
# Train decision tree
tree_model <- rpart(Churn ~ ., data = train_data, method = "class")
# Plot the tree
rpart.plot(tree_model)
# Predict
tree_preds <- predict(tree_model, test_data, type = "class")
# Confusion matrix
confusionMatrix(tree_preds, test_data$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 968 233
## Yes 66 140
##
## Accuracy : 0.7875
## 95% CI : (0.7652, 0.8086)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 2.83e-06
##
## Kappa : 0.3635
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9362
## Specificity : 0.3753
## Pos Pred Value : 0.8060
## Neg Pred Value : 0.6796
## Prevalence : 0.7349
## Detection Rate : 0.6880
## Detection Prevalence : 0.8536
## Balanced Accuracy : 0.6558
##
## 'Positive' Class : No
##
The decision tree model achieved a statistically significant accuracy of 79.43%, outperforming random guessing (NIR of 73.45%). Its Kappa statistic of 0.4183 indicates moderate agreement beyond chance. Notably, the model excels at identifying non-churning customers (high sensitivity of 91.28%) but struggles to accurately pinpoint churners (low specificity of 46.65%). This imbalance suggests that while the model is reliable in confirming stable customers, a significant portion of at-risk churners might be missed. The positive predictive value for non-churn is 82.56%, while the negative predictive value for churn is lower at 65.91%, indicating more uncertainty in churn predictions. Business implications include confidence in identifying stable customers but a potential underestimation of churn risk, which could impact resource allocation for retention.
This model shows promise with an overall accuracy above baseline, and strong performance in identifying non-churners. However, the relatively low specificity suggests that the model could miss nearly half of the churners. From a business perspective, the cost of these missed churn predictions could be significant
# Load libraries
library(tidyverse)
library(caret)
library(mgcv) # For GAM
## Warning: package 'mgcv' was built under R version 4.4.3
## Loading required package: nlme
## Warning: package 'nlme' was built under R version 4.4.3
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.9-3. For overview type 'help("mgcv-package")'.
library(pROC)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
# Load dataset
file_path <- "C:/Users/Dell/Downloads/Assignment4D622/WA_Fn-UseC_-Telco-Customer-Churn.csv"
churn_data <- read_csv(file_path)
## Rows: 7043 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): customerID, gender, Partner, Dependents, PhoneService, MultipleLin...
## dbl (4): SeniorCitizen, tenure, MonthlyCharges, TotalCharges
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Data Cleaning
# Remove rows with missing TotalCharges
churn_data <- churn_data %>%
filter(!is.na(TotalCharges))
# Convert character variables to factors
churn_data <- churn_data %>%
mutate(
SeniorCitizen = factor(SeniorCitizen),
Churn = factor(Churn),
across(where(is.character), as.factor)
)
# Data Split
set.seed(123)
train_index <- createDataPartition(churn_data$Churn, p = 0.7, list = FALSE)
train_data <- churn_data[train_index, ]
test_data <- churn_data[-train_index, ]
# Remove customerID
train_data <- train_data %>% select(-customerID)
test_data <- test_data %>% select(-customerID)
# GAM Model (Explainable)
# Fit GAM model
gam_model <- gam(Churn ~ s(tenure) + s(MonthlyCharges) + s(TotalCharges) +
Contract + InternetService + PaymentMethod + gender +
SeniorCitizen + Partner + Dependents + PhoneService +
OnlineSecurity + OnlineBackup + TechSupport + StreamingTV +
StreamingMovies + MultipleLines + PaperlessBilling,
data = train_data, family = binomial)
# Predict probabilities
gam_probs <- predict(gam_model, newdata = test_data, type = "response")
gam_preds <- ifelse(gam_probs > 0.5, "Yes", "No") %>% factor(levels = c("No", "Yes"))
# Confusion matrix
confusionMatrix(gam_preds, test_data$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1395 245
## Yes 153 315
##
## Accuracy : 0.8112
## 95% CI : (0.7938, 0.8277)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4893
##
## Mcnemar's Test P-Value : 5.081e-06
##
## Sensitivity : 0.9012
## Specificity : 0.5625
## Pos Pred Value : 0.8506
## Neg Pred Value : 0.6731
## Prevalence : 0.7343
## Detection Rate : 0.6618
## Detection Prevalence : 0.7780
## Balanced Accuracy : 0.7318
##
## 'Positive' Class : No
##
# ROC AUC
roc_gam <- roc(test_data$Churn, gam_probs)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc(roc_gam)
## Area under the curve: 0.8552
# Compare with Decision Tree
set.seed(123)
rf_model <- randomForest(Churn ~ ., data = train_data, importance = TRUE)
rf_preds <- predict(rf_model, newdata = test_data)
confusionMatrix(rf_preds, test_data$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 1386 252
## Yes 162 308
##
## Accuracy : 0.8036
## 95% CI : (0.786, 0.8204)
## No Information Rate : 0.7343
## P-Value [Acc > NIR] : 6.717e-14
##
## Kappa : 0.4694
##
## Mcnemar's Test P-Value : 1.219e-05
##
## Sensitivity : 0.8953
## Specificity : 0.5500
## Pos Pred Value : 0.8462
## Neg Pred Value : 0.6553
## Prevalence : 0.7343
## Detection Rate : 0.6575
## Detection Prevalence : 0.7770
## Balanced Accuracy : 0.7227
##
## 'Positive' Class : No
##
The Generalized Additive Model (GAM) achieved an accuracy of 81.12% and a Kappa of 0.4893, indicating good overall performance and moderate agreement beyond chance. Similar to the Decision Tree, the GAM excels at identifying non-churning customers high sensitivity of 90.12% but has a lower ability to correctly identify churners specificity of 56.25%. The strong ROC AUC of 0.8552 signifies good discriminative power between the two classes.
A key advantage of the GAM is its interpretability through smooth functions, allowing analysts to visualize how continuous variables like tenure and monthly charges influence churn risk, which can inform targeted retention efforts. While both the GAM and Decision Tree show comparable predictive performance, the GAM’s transparency in explaining predictions makes it potentially more valuable for strategic business decisions. The GAM’s slightly higher accuracy and sensitivity/specificity compared to the Decision Tree suggest a marginal performance improvement. The ongoing challenge for both models remains improving the identification of churners.
# Assumes `gam_model` is your fitted GAM model
plot(gam_model, se = TRUE, col = "blue", pages = 1)
This visualization shows the smooth terms for tenure, MonthlyCharges, and TotalCharges in the GAM provides further insight into how these continuous variables influence the churn probability, offering interpretability that complements the overall performance metrics.
# Predict probabilities for Random Forest
rf_preds <- predict(rf_model, newdata = test_data, type = "prob")[, "Yes"]
# Predict probabilities for GAM
gam_preds <- predict(gam_model, newdata = test_data, type = "response")
# Create and export predictions DataFrame
# Check for customerID column
if ("customerID" %in% names(test_data)) {
export_df <- data.frame(
customerID = test_data$customerID,
Actual = test_data$Churn,
RF_Pred_Prob = rf_preds,
GAM_Pred_Prob = gam_preds
)
} else {
export_df <- data.frame(
ID = 1:nrow(test_data),
Actual = test_data$Churn,
RF_Pred_Prob = rf_preds,
GAM_Pred_Prob = gam_preds
)
}
# Export to CSV for business use
write.csv(export_df, "Churn_Model_Predictions.csv", row.names = FALSE)
library(pROC)
# ROC curve (assuming positive class is "Yes")
rf_roc <- roc(test_data$Churn, rf_preds, levels = c("No", "Yes"), direction = "<")
gam_roc <- roc(test_data$Churn, gam_preds, levels = c("No", "Yes"), direction = "<")
# Plot ROC curves
plot(rf_roc, col = "blue", main = "ROC Curve Comparison: Decision Tree vs GAM", lwd = 2)
lines(gam_roc, col = "red", lwd = 2)
legend("bottomright", legend = c("Decision Tree", "GAM"),
col = c("blue", "red"), lwd = 2)
# Print AUCs
cat("Decision Tree AUC:", round(auc(rf_roc), 4), "\n")
## Decision Tree AUC: 0.8357
cat("GAM AUC:", round(auc(gam_roc), 4), "\n")
## GAM AUC: 0.8552
Both the Decision Tree and the GAM demonstrate good ability to predict customer churn, as indicated by their AUC scores above 0.8. However, the GAM model shows a statistically better and visually superior performance in discriminating between churn and non-churn customers compared to the Decision Tree, as evidenced by its higher AUC score and the more favorable position of its ROC curve. This suggests that the GAM is likely to provide more reliable probability estimates for churn prediction.
library(mgcv)
# Train the GAM model
gam_model <- gam(Churn ~ s(tenure) + s(MonthlyCharges) + s(TotalCharges),
data = train_data, family = binomial)
# Save the model to a file for deployment
saveRDS(gam_model, file = "gam_model.RData")
# Load model
gam_model <- readRDS("gam_model.RData")
# Make predictions
pred_probs <- predict(gam_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")
par(mfrow = c(1, 3))
plot(gam_model, se = TRUE, col = "blue", main = "Effect on Churn Probability")
library(ggplot2)
library(gratia) # for tidy GAM visualization
## Warning: package 'gratia' was built under R version 4.4.3
##
## Attaching package: 'gratia'
## The following object is masked from 'package:stringr':
##
## boundary
# visualize effect of tenure
draw(gam_model, select = "s(tenure)") +
labs(title = "Effect of Tenure on Churn Probability")
# Load model
gam_model <- readRDS("gam_model.RData")
# Make predictions
pred_probs <- predict(gam_model, newdata = test_data, type = "response")
pred_class <- ifelse(pred_probs > 0.5, "Yes", "No")
The GAM partial effect plots reveal key non-linear relationships influencing churn probability:
Tenure: New customers exhibit the highest churn risk. This risk significantly decreases after around 20 months and remains low through mid-tenure. Interestingly, churn risk shows a slight increase again for very long-term customers (50+ months).
Monthly Charges: Customers with low to moderate monthly charges have the lowest churn. As charges exceed $60-$80, churn risk sharply increases, suggesting price sensitivity. At the very highest charges, churn risk plateaus or slightly decreases, potentially indicating a stickier segment of high-value customers.
Total Charges: Low total charges (indicating newer customers) correlate with high churn. A stable mid-range of total charges corresponds to lower churn. However, higher total charges are associated with a renewed increase in churn risk, suggesting that even long-term, high-paying customers are not immune to leaving.
In evaluating the performance of two churn prediction models, the Generalized Additive Model (GAM) demonstrates a clear advantage over the Decision Tree. Statistically, the GAM achieves a higher Area Under the Curve (AUC) of 0.8552 compared to the Decision Tree’s 0.8357, indicating a better ability to distinguish between customers who will churn and those who will not. This superiority is also visually confirmed by the ROC curve, where the GAM exhibits a more favorable position across various classification thresholds. While both models achieve AUC scores above 0.8, signifying overall good predictive power, the GAM shows a statistically and visually significant edge.
Examining the behavior of the models reveals further distinctions. The GAM offers a more consistent and reliable balance between sensitivity (correctly identifying non-churners) and specificity (correctly identifying churners) across different decision-making points. This suggests that the GAM is a more robust tool for making informed business decisions regarding customer retention. In contrast, while the Decision Tree provides a more straightforward and interpretable understanding of the prediction process, it suffers from slightly lower accuracy and a greater challenge in correctly identifying customers who are likely to churn, a critical aspect for proactive retention strategies.
Furthermore, the GAM’s ability to model non-linear relationships between key predictive features and churn risk provides valuable insights. Through its smooth plots for variables like tenure, monthly charges, and total charges, the GAM captures complex patterns that a simpler model like a decision tree might miss. This enhanced predictive accuracy, coupled with the interpretability offered by visualizing these non-linear effects, positions the GAM as a powerful tool for understanding the drivers of churn and informing targeted interventions.
The superior performance of the Generalized Additive Model (GAM) in discriminating between potential churners and loyal customers has significant implications for business strategy. Its higher AUC and more reliable performance across various decision thresholds enable more precise targeting of at-risk customers for retention campaigns. This accuracy translates to tangible benefits, including a reduction in false positives, preventing the wasteful allocation of resources to customers unlikely to churn, and a decrease in false negatives, ensuring that actual churn risks are not overlooked. Ultimately, this leads to more efficient and cost-effective customer retention initiatives.
Beyond its predictive accuracy, the interpretability of the GAM, particularly through its smooth plots illustrating the non-linear effects of key factors like tenure and monthly charges, empowers data-driven decision-making within the organization. These insights provide stakeholders with a deeper understanding of the drivers of churn, which can directly inform strategic business decisions. For instance, the model can highlight price sensitivity in certain customer segments, suggesting targeted discount strategies, or identify critical tenure thresholds where proactive customer engagement initiatives, such as offering specialized support or loyalty incentives, could be most effective.
Considering the practical aspects of model deployment, the GAM emerges as the preferred candidate due to its superior predictive performance and consistent reliability. While the Decision Tree, with its simpler interpretability, may still hold value for providing quick explanations or as a component within a more complex ensemble modeling approach, the GAM’s enhanced ability to accurately identify at-risk customers makes it a more impactful tool for driving effective customer retention strategies and informed business decisions.