Group X Data Mining Project

Andrew Campo, Zachary Fowler, Joel Grunhut, Pratulya Santharam, Juliet Schreiber

Variable Dictionary

RowNumber—corresponds to the record (row) number and has no effect on the output.

CustomerId—contains random values and has no effect on customer leaving the bank.

Surname—the surname of a customer has no impact on their decision to leave the bank.

CreditScore—can have an effect on customer churn, since a customer with a higher credit score is less likely to leave the bank.

Geography—a customer’s location can affect their decision to leave the bank.

Gender—it’s interesting to explore whether gender plays a role in a customer leaving the bank.

Age—this is certainly relevant, since older customers are less likely to leave their bank than younger ones.

Tenure—refers to the number of years that the customer has been a client of the bank. Normally, older clients are more loyal and less likely to leave a bank.

Balance—also a very good indicator of customer churn, as people with a higher balance in their accounts are less likely to leave the bank compared to those with lower balances.

NumOfProducts—refers to the number of products that a customer has purchased through the bank.

HasCrCard—denotes whether or not a customer has a credit card. This column is also relevant, since people with a credit card are less likely to leave the bank.

IsActiveMember—active customers are less likely to leave the bank.

EstimatedSalary—as with balance, people with lower salaries are more likely to leave the bank compared to those with higher salaries.

Exited—whether or not the customer left the bank.

Complain—customer has complaint or not.

Satisfaction Score—Score provided by the customer for their complaint resolution. Card Type—type of card hold by the customer. Points Earned—the points earned by the customer for using credit card.

Acknowledgements

As we know, it is much more expensive to sign a new client than keeping an existing one.

It is advantageous for banks to know what leads a client towards the decision to leave the company.

Churn prevention allows companies to develop loyalty programs and retention campaigns to keep as many customers as possible, saving from increased customer acquisition costs and decreased overall customer lifetime value.

Pre-processing

library(tidyverse)    # For data manipulation and ggplot2 visualizations
library(caret)        # For machine learning and model comparison
library(randomForest) # For RF
library(e1071)        # For SVM
library(class)        # For KNN
library(pROC)         # For ROC and AUC
library(corrplot)     # For correlation matrix
library(gridExtra)    # For arranging multiple plots
library(broom)        # For stepwise regression
library(kableExtra)   # For tables / charts
library(scales)       # For tables / graphs

churn_data <- read.csv("Customer-Churn-Records.csv", header = TRUE)

# Data Cleaning and Preprocessing
churn_clean <- churn_data %>%
  # Drop identifiers and non-predictive columns
  select(-RowNumber, -CustomerId, -Surname) %>%
  # Convert categorical variables to factors
  mutate(
    Geography = as.factor(Geography),
    Gender = as.factor(Gender),
    HasCrCard = as.factor(HasCrCard),
    IsActiveMember = as.factor(IsActiveMember),
    Complain = as.factor(Complain),
    Card.Type = as.factor(Card.Type),
    
    # Converting Number of Products to a categorical variable
    NumOfProducts = as.factor(NumOfProducts), 
    
    Exited = factor(Exited, levels = c(0, 1), labels = c("Retained", "Exited"))
  )

churn_clean <- churn_clean %>%
  # Create the Zero Balance variable
  mutate(IsZeroBalance = ifelse(Balance == 0, 1, 0)) %>%
  
  # Create Age Groups 
  mutate(AgeGroup = case_when(
    Age < 30 ~ "<30",
    Age >= 30 & Age < 45 ~ "30-45",
    Age >= 45 & Age < 60 ~ "45-60",
    Age >= 60 ~ ">60"
  )) %>%

  mutate(IsZeroBalance = as.factor(IsZeroBalance),
         AgeGroup = as.factor(AgeGroup))

Exploratory Data Analysis

Categorical Variables

# Exited Distribution
p_target <- ggplot(churn_clean, aes(x = Exited, fill = Exited)) +
  geom_bar() +
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Customer Churn Distribution", x = "Status", y = "Count")

p_target

We need to be careful here, since our dependent variable isn’t equally distributed, our models may be biased to mark everything as “Retained” as it will get a decent accuracy.

# Gender vs Churn
p_gender <- ggplot(churn_clean, aes(x = Gender, fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn Proportion by Gender", x = "Gender", y = "Proportion")

p_gender

Females might be more likely to churn.

# Active Member Status vs Churn
p_active <- ggplot(churn_clean, aes(x = as.factor(IsActiveMember), fill = Exited)) + 
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn by Active Status", x = "Is Active Member (1=Yes)", y = "Proportion")

p_active

Expected, non active members appear to be more likely to churn.

# Number of Products vs Churn 
p_products <- ggplot(churn_clean, aes(x = as.factor(NumOfProducts), fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn by Number of Products", x = "Number of Products", y = "Proportion")

p_products

Whoa, NumOfProducts = 3 or 4 almost perfectly predict churn! This could be a case of people abusing sign up benefits…

# Relationship between tenure and number of products?
ggplot(churn_clean, aes(x = as.factor(NumOfProducts), y = Tenure, fill = as.factor(NumOfProducts))) +
  geom_boxplot(alpha = 0.7, color = "black", show.legend = FALSE) +

  scale_fill_brewer(palette = "Blues") +
  theme_minimal() +
  theme(axis.text.x = element_text(face = "bold", size = 11),
        plot.title = element_text(face = "bold", size = 14)) +
  
  labs(title = "Customer Tenure vs. Number of Products",
       subtitle = "Analyzing the spread of tenure across product tiers",
       x = "Number of Products",
       y = "Tenure (Years)")

# Customer Densitty between Products and Tenure
ggplot(churn_clean, aes(x = as.factor(NumOfProducts), y = as.factor(Tenure))) +
  geom_count(color = "#0072B2", alpha = 0.8) +
  scale_size_area(max_size = 15, name = "Customer\nCount") +
  theme_minimal() +
  theme(axis.text = element_text(face = "bold", size = 10),
        plot.title = element_text(face = "bold", size = 14),
        panel.grid.minor = element_blank()) +
  
  labs(title = "Customer Density: Products vs. Tenure",
       subtitle = "Larger bubbles indicate a higher volume of customers",
       x = "Number of Products",
       y = "Tenure (Years)")

Less customers with 3 and 4 products, but they account for a large portion of the churn…

# Card Type vs Churn
p_card <- ggplot(churn_clean, aes(x = Card.Type, fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn by Card Type", x = "Card Type", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Angle text for readability

p_card

Doesn’t seem to be too much difference between groups.

# Churn by Geography
p_geo <- ggplot(churn_clean, aes(x = Geography, fill = Exited)) +
  geom_bar(position = "fill") + # 'fill' shows proportions
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Proportion of Churn by Geography", x = "Country", y = "Proportion")

p_geo

Something is going on in Germany!

# Age Group vs Churn
p_agegroup <- ggplot(churn_clean, aes(x = AgeGroup, fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn by AgeGroup", x = "AgeGroup", y = "Proportion")

p_agegroup

What are these boomers doing??

# Age Group vs Churn
p_0bal <- ggplot(churn_clean, aes(x = IsZeroBalance, fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn by Balance = 0", x = "Balance = 0?", y = "Proportion")

p_0bal

Numeric Variables

# Age vs Churn
p_age <- ggplot(churn_clean, aes(x = Exited, y = Age, fill = Exited)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Age vs. Churn", x = "Status", y = "Age")

p_age

# Balance vs Churn
p_balance <- ggplot(churn_clean, aes(x = Exited, y = Balance, fill = Exited)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Balance vs. Churn", x = "Status", y = "Balance")

p_balance

# Credit Score vs Churn
p_credit <- ggplot(churn_clean, aes(x = Exited, y = CreditScore, fill = Exited)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Credit Score vs Churn", x = "Status", y = "Credit Score")

p_credit

Doesn’t seem to be too much difference between groups.

# Estimated Salary vs Churn
p_salary <- ggplot(churn_clean, aes(x = Exited, y = EstimatedSalary, fill = Exited)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Estimated Salary vs Churn", x = "Status", y = "Estimated Salary")

p_salary

Doesn’t seem to be too much difference between groups.

# Churn Proportion by Tenure
p_tenure <- ggplot(churn_clean, aes(x = as.factor(Tenure), fill = Exited)) +
  geom_bar(position = "fill") + 
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  theme_minimal() +
  labs(title = "Churn Proportion by Tenure (Years)", x = "Tenure", y = "Proportion")

p_tenure

Doesn’t seem to be too much difference between groups.

# Correlation Matrix of Numeric Variables
numeric_vars <- churn_clean %>% select_if(is.numeric)
corr_matrix <- cor(numeric_vars)
corrplot(corr_matrix, method = "color", type = "upper", 
         addCoef.col = "black", tl.col = "black", tl.srt = 45,
         title = "Correlation Matrix of Numeric Variables",
         mar=c(0,0,2,0))

Not really any correlations between our numerics…

# Create a custom function to calculate Cramér's V
cramer_v <- function(x, y) {
  # Create a contingency table
  tbl <- table(x, y)
  
  # Calculate Chi-square statistic
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  
  # Calculate Cramér's V
  n <- sum(tbl)
  phi2 <- chi2 / n
  r <- nrow(tbl)
  k <- ncol(tbl)
  
  v <- sqrt(phi2 / min(r - 1, k - 1))
  return(as.numeric(v))
}

# Isolate all the categorical variables
cat_vars <- churn_clean %>% 
  select(where(is.factor))

# Create an empty matrix to hold our results
n_vars <- ncol(cat_vars)
cramer_matrix <- matrix(NA, nrow = n_vars, ncol = n_vars)
rownames(cramer_matrix) <- colnames(cat_vars)
colnames(cramer_matrix) <- colnames(cat_vars)

# Loop through every pair of categorical variables and calculate Cramér's V
for (i in 1:n_vars) {
  for (j in 1:n_vars) {
    cramer_matrix[i, j] <- cramer_v(cat_vars[[i]], cat_vars[[j]])
  }
}

# Visualize the Cramér's V matrix
corrplot(cramer_matrix, 
         method = "color", 
         type = "upper",
         addCoef.col = "black", 
         tl.col = "black", 
         tl.srt = 45, 
         is.corr = FALSE, 
         col.lim = c(0, 1), 
         title = "Categorical Associations (Cramér's V)",
         mar=c(0,0,2,0))

As seen, Complain is perfectly correlated with Exited. Glad we removed it. We need to keep an eye on NumOfProducts, Age, Geography and IsZeroBalance.

# 1. Define the factor
target_factor <- "Exited"

# 2. Isolate just the numeric columns to test against
numeric_cols <- names(churn_clean)[sapply(churn_clean, is.numeric)]

# 3. Create an empty data frame to store the results
anova_results <- data.frame(Numeric_Variable = character(),
                            P_Value = numeric(),
                            Significant = character(),
                            stringsAsFactors = FALSE)

# 4. Loop through each numeric variable
for (var in numeric_cols) {
  
  # Build the formula dynamically
  formula_str <- as.formula(paste(var, "~", target_factor))
  
  # Run the ANOVA
  test_result <- aov(formula_str, data = churn_clean)
  
  # Extract the p-value
  p_val <- summary(test_result)[[1]][["Pr(>F)"]][1]
  
  # Determine if it is statistically significant
  is_sig <- ifelse(p_val < 0.05, "Yes (Strong Relationship)", "No")
  
  # Add the result to our table
  anova_results <- rbind(anova_results, 
                         data.frame(Numeric_Variable = var, 
                                    P_Value = p_val, 
                                    Significant = is_sig))
}

# 5. Sort the table
anova_results <- anova_results[order(anova_results$P_Value), ]

# Print the final table
print(anova_results)

##     Numeric_Variable       P_Value               Significant
## 2                Age 1.346716e-186 Yes (Strong Relationship)
## 4            Balance  1.209208e-32 Yes (Strong Relationship)
## 1        CreditScore  7.422037e-03 Yes (Strong Relationship)
## 3             Tenure  1.721045e-01                        No
## 5    EstimatedSalary  2.117146e-01                        No
## 6 Satisfaction.Score  5.586474e-01                        No
## 7       Point.Earned  6.435350e-01                        No

Our ANOVAs showing that there is a difference in means between exited and retained customers in Age, Balance and CreditScore. This only analyzes our numeric variables though, but tells us these will likely be important classifying.

set.seed(2026)

# Train a single Random Forest for variable importance
rf_importance <- randomForest(Exited ~ .-Complain-AgeGroup-IsZeroBalance, 
                              data = churn_clean, 
                              importance = TRUE,
                              ntree = 500)

# Extract the importance metrics
importance_metrics <- varImp(rf_importance)

importance_metrics

##                      Retained     Exited
## CreditScore         3.7914295  3.7914295
## Geography          27.2075602 27.2075602
## Gender              6.5426676  6.5426676
## Age                98.1073075 98.1073075
## Tenure              1.8668390  1.8668390
## Balance            32.0819878 32.0819878
## NumOfProducts      97.5514820 97.5514820
## HasCrCard           1.4779697  1.4779697
## IsActiveMember     43.1694124 43.1694124
## EstimatedSalary     1.8395698  1.8395698
## Satisfaction.Score  0.3251555  0.3251555
## Card.Type           1.3572961  1.3572961
## Point.Earned        2.5980552  2.5980552

We excluded complain due to its perfect correlations with exited. We removed age group as it is for explanatory purposes only, we will keep Age in our regression. We also removed IsZeroBalance because if this is significant, our rf can just make a cut at is Balance > 0.

# 1. Extract the scores into a dataframe
imp_df <- as.data.frame(importance_metrics)

# Check if 'Overall' exists
if (!"Overall" %in% colnames(imp_df)) {
  imp_df$Overall <- apply(imp_df, 1, max)
}

# 2. Add feature names and sort
imp_df$Feature <- rownames(imp_df)
imp_df <- imp_df[order(-imp_df$Overall), ]
imp_df$Rank <- 1:nrow(imp_df)

# 3. Create the elbow Plot
ggplot(imp_df, aes(x = Rank, y = Overall)) +
  geom_line(color = "#0072B2", size = 1.2) +
  geom_point(color = "#D55E00", size = 3) +
  scale_x_continuous(breaks = imp_df$Rank, labels = imp_df$Feature) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", size = 10),
        plot.title = element_text(face = "bold", size = 14)) +
  labs(title = "Variable Importance Elbow Plot",
       subtitle = "Look for the 'elbow' where the slope dramatically flattens out",
       x = "Variables",
       y = "Importance Score (from rf)")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

According to RF we should keep: Age, NumOfProducts, IsActiveMember, Geography, Balance (but it also looks like we could just keep Age and NumOfProducts…)

# -------------------------------------------------------------------------
# Step 1: Build the "Full" Logistic Model
# -------------------------------------------------------------------------
full_log_model <- glm(Exited ~ .-Complain-AgeGroup, 
                      data = churn_clean, 
                      family = "binomial")

# -------------------------------------------------------------------------
# Step 2: Run the Stepwise Algorithm
# -------------------------------------------------------------------------
stepwise_model <- step(full_log_model, direction = "both", trace = 0)

# Print the final, optimized mathematical summary
summary(stepwise_model)

## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember + IsZeroBalance, 
##     family = "binomial", data = churn_clean)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.156e+00  2.775e-01 -11.376   <2e-16 ***
## CreditScore      -6.707e-04  3.033e-04  -2.212   0.0270 *  
## GeographyGermany  9.880e-01  7.471e-02  13.225   <2e-16 ***
## GeographySpain    5.863e-02  7.603e-02   0.771   0.4406    
## GenderMale       -5.226e-01  5.898e-02  -8.861   <2e-16 ***
## Age               7.120e-02  2.760e-03  25.798   <2e-16 ***
## Tenure           -1.932e-02  1.010e-02  -1.913   0.0557 .  
## Balance           1.877e-06  1.156e-06   1.624   0.1044    
## NumOfProducts2   -1.571e+00  7.218e-02 -21.769   <2e-16 ***
## NumOfProducts3    2.543e+00  1.796e-01  14.162   <2e-16 ***
## NumOfProducts4    1.633e+01  1.752e+02   0.093   0.9258    
## IsActiveMember1  -1.105e+00  6.235e-02 -17.719   <2e-16 ***
## IsZeroBalance1    4.042e-01  1.601e-01   2.525   0.0116 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10112.5  on 9999  degrees of freedom
## Residual deviance:  7429.4  on 9987  degrees of freedom
## AIC: 7455.4
## 
## Number of Fisher Scoring iterations: 14

# -------------------------------------------------------------------------
# Step 3: Visualizing the Surviving Variables
# -------------------------------------------------------------------------
model_results <- tidy(stepwise_model, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>% # Remove the intercept for a cleaner chart
  arrange(estimate)                 # Sort by impact size

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Create the visual Forest Plot
ggplot(model_results, aes(x = estimate, y = reorder(term, estimate))) +
  
  # Draw a vertical line at 1.0 (The "No Effect" baseline)
  geom_vline(xintercept = 1, linetype = "dashed", color = "red", size = 1) +
  
  # Plot the Odds Ratios and their Confidence Intervals
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high), 
                  color = "#0072B2", size = 0.8, linewidth = 1.2) +
  
  theme_minimal() +
  labs(title = "Stepwise Logistic Regression: Final Variables",
       subtitle = "Odds Ratios > 1 increase churn risk; < 1 decrease churn risk",
       x = "Odds Ratio (Impact on Exiting)",
       y = "Predictor Variables") +
  
  theme(axis.text.y = element_text(face = "bold", size = 11),
        plot.title = element_text(face = "bold", size = 14))

churn_clean_removeNOP4 <- churn_clean[churn_clean$NumOfProducts != 4, ]

# -------------------------------------------------------------------------
# Step 1: Do it again!
# -------------------------------------------------------------------------
full_log_model <- glm(Exited ~ .-Complain-AgeGroup, 
                      data = churn_clean_removeNOP4, 
                      family = "binomial")

# -------------------------------------------------------------------------
# Step 2: Run the Stepwise Algorithm
# -------------------------------------------------------------------------
stepwise_model <- step(full_log_model, direction = "both", trace = 0)

# Print the final, optimized mathematical summary
summary(stepwise_model)

## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember + IsZeroBalance, 
##     family = "binomial", data = churn_clean_removeNOP4)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.156e+00  2.775e-01 -11.376   <2e-16 ***
## CreditScore      -6.707e-04  3.033e-04  -2.212   0.0270 *  
## GeographyGermany  9.880e-01  7.471e-02  13.225   <2e-16 ***
## GeographySpain    5.863e-02  7.602e-02   0.771   0.4406    
## GenderMale       -5.226e-01  5.898e-02  -8.861   <2e-16 ***
## Age               7.120e-02  2.760e-03  25.799   <2e-16 ***
## Tenure           -1.932e-02  1.010e-02  -1.913   0.0557 .  
## Balance           1.877e-06  1.156e-06   1.624   0.1044    
## NumOfProducts2   -1.571e+00  7.218e-02 -21.770   <2e-16 ***
## NumOfProducts3    2.543e+00  1.796e-01  14.162   <2e-16 ***
## IsActiveMember1  -1.105e+00  6.235e-02 -17.719   <2e-16 ***
## IsZeroBalance1    4.042e-01  1.601e-01   2.525   0.0116 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9920.2  on 9939  degrees of freedom
## Residual deviance: 7429.4  on 9928  degrees of freedom
## AIC: 7453.4
## 
## Number of Fisher Scoring iterations: 5

# -------------------------------------------------------------------------
# Step 3: Visualizing the Surviving Variables (Forest Plot)
# -------------------------------------------------------------------------
model_results <- tidy(stepwise_model, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>% # Remove the intercept for a cleaner chart
  arrange(estimate)                 # Sort by impact size

# Create the visual Forest Plot
ggplot(model_results, aes(x = estimate, y = reorder(term, estimate))) +
  
  # Draw a vertical line at 1.0 (The "No Effect" baseline)
  geom_vline(xintercept = 1, linetype = "dashed", color = "red", size = 1) +
  
  # Plot the Odds Ratios and their Confidence Intervals
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high), 
                  color = "#0072B2", size = 0.8, linewidth = 1.2) +
  
  theme_minimal() +
  labs(title = "Stepwise Logistic Regression w/o NOP=4: Final Variables",
       subtitle = "Odds Ratios > 1 increase churn risk; < 1 decrease churn risk",
       x = "Odds Ratio (Impact on Exiting)",
       y = "Predictor Variables") +
  
  theme(axis.text.y = element_text(face = "bold", size = 11),
        plot.title = element_text(face = "bold", size = 14))

churn_clean_removeNOP3n4 <- churn_clean_removeNOP4[churn_clean_removeNOP4$NumOfProducts != 3, ]

# -------------------------------------------------------------------------
# Step 1: Last stepwise regression hopefully...
# -------------------------------------------------------------------------
full_log_model <- glm(Exited ~ .-Complain-AgeGroup, 
                      data = churn_clean_removeNOP3n4, 
                      family = "binomial")

# -------------------------------------------------------------------------
# Step 2: Run the Stepwise Algorithm
# -------------------------------------------------------------------------
stepwise_model <- step(full_log_model, direction = "both", trace = 0)

# Print the final, optimized mathematical summary
summary(stepwise_model)

## 
## Call:
## glm(formula = Exited ~ CreditScore + Geography + Gender + Age + 
##     Tenure + Balance + NumOfProducts + IsActiveMember + IsZeroBalance, 
##     family = "binomial", data = churn_clean_removeNOP3n4)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.206e+00  2.814e-01 -11.390  < 2e-16 ***
## CreditScore      -6.004e-04  3.073e-04  -1.954  0.05075 .  
## GeographyGermany  1.021e+00  7.588e-02  13.454  < 2e-16 ***
## GeographySpain    5.368e-02  7.740e-02   0.694  0.48796    
## GenderMale       -5.203e-01  5.986e-02  -8.692  < 2e-16 ***
## Age               7.101e-02  2.792e-03  25.433  < 2e-16 ***
## Tenure           -2.036e-02  1.025e-02  -1.987  0.04695 *  
## Balance           1.801e-06  1.179e-06   1.527  0.12668    
## NumOfProducts2   -1.598e+00  7.245e-02 -22.062  < 2e-16 ***
## IsActiveMember1  -1.109e+00  6.342e-02 -17.480  < 2e-16 ***
## IsZeroBalance1    4.829e-01  1.632e-01   2.959  0.00309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9170.9  on 9673  degrees of freedom
## Residual deviance: 7213.7  on 9663  degrees of freedom
## AIC: 7235.7
## 
## Number of Fisher Scoring iterations: 5

# -------------------------------------------------------------------------
# Step 3: Visualizing the Surviving Variables (Forest Plot)
# -------------------------------------------------------------------------
model_results <- tidy(stepwise_model, exponentiate = TRUE, conf.int = TRUE) %>%
  filter(term != "(Intercept)") %>% # Remove the intercept for a cleaner chart
  arrange(estimate)                 # Sort by impact size

# Create the visual Forest Plot
ggplot(model_results, aes(x = estimate, y = reorder(term, estimate))) +
  
  # Draw a vertical line at 1.0 (The "No Effect" baseline)
  geom_vline(xintercept = 1, linetype = "dashed", color = "red", size = 1) +
  
  # Plot the Odds Ratios and their Confidence Intervals
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high), 
                  color = "#0072B2", size = 0.8, linewidth = 1.2) +
  
  theme_minimal() +
  labs(title = "Stepwise Logistic Regression w/o NOP=3,4: Final Variables",
       subtitle = "Odds Ratios > 1 increase churn risk; < 1 decrease churn risk",
       x = "Odds Ratio (Impact on Exiting)",
       y = "Predictor Variables") +
  
  theme(axis.text.y = element_text(face = "bold", size = 11),
        plot.title = element_text(face = "bold", size = 14))

Stepwise logistic regression w/o NOP 3,4 decided to keep same 9 variables: NumOfProducts, Geography, IsZeroBalance, Age, Balance, CreditScore, Tenure, Gender, IsActiveMember

set.seed(2026)

# Train a single Random Forest directly to measure importance
rf_importance2 <- randomForest(Exited ~ .-Complain-AgeGroup-IsZeroBalance, 
                              data = churn_clean_removeNOP3n4, 
                              importance = TRUE,
                              ntree = 500)

# Extract the importance metrics
importance_metrics2 <- varImp(rf_importance2)

importance_metrics2

##                      Retained     Exited
## CreditScore         5.0929070  5.0929070
## Geography          28.6621145 28.6621145
## Gender              6.6405346  6.6405346
## Age                96.0334438 96.0334438
## Tenure              0.2451055  0.2451055
## Balance            33.7971162 33.7971162
## NumOfProducts      74.1121878 74.1121878
## HasCrCard           0.6255965  0.6255965
## IsActiveMember     43.5562211 43.5562211
## EstimatedSalary     0.9106253  0.9106253
## Satisfaction.Score  0.8823618  0.8823618
## Card.Type           0.3869062  0.3869062
## Point.Earned        2.7213442  2.7213442

# 1. Extract the scores into a dataframe
imp_df <- as.data.frame(importance_metrics2)

# Check if 'Overall' exists
if (!"Overall" %in% colnames(imp_df)) {
  imp_df$Overall <- apply(imp_df, 1, max)
}

# 2. Add feature names and sort
imp_df$Feature <- rownames(imp_df)
imp_df <- imp_df[order(-imp_df$Overall), ]
imp_df$Rank <- 1:nrow(imp_df)

# 3. Create the elbow plot
ggplot(imp_df, aes(x = Rank, y = Overall)) +
  geom_line(color = "#0072B2", size = 1.2) +
  geom_point(color = "#D55E00", size = 3) +
  scale_x_continuous(breaks = imp_df$Rank, labels = imp_df$Feature) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold", size = 10),
        plot.title = element_text(face = "bold", size = 14)) +
  labs(title = "Variable Importance Elbow Plot w/o NOP=3,4",
       subtitle = "Look for the 'elbow' where the slope dramatically flattens out",
       x = "Variables",
       y = "Importance Score (from RF)")

Same 5 variables are the most important in the RF.

Data Splitting and Preparation

# Select the variables we determined in the above tests
churn_clean_initial <- churn_clean %>%
  select(Exited, NumOfProducts, Geography, IsZeroBalance, Age, Balance, CreditScore, Tenure, Gender, IsActiveMember)

set.seed(2026)

# Create an 70/30 train/test split
trainIndex <- createDataPartition(churn_clean_initial$Exited, p = .7, list = FALSE, times = 1)
data_train <- churn_clean_initial[ trainIndex,]
data_test  <- churn_clean_initial[-trainIndex,]

# Define cross-validation control for Caret
# We know we didn't go over cross validation in depth, but we included it as a best practice for training our models. We refrained from down sampling since we didn't go over that
fitControl <- trainControl(
  method = "cv",
  number = 10,
  classProbs = TRUE, 
  summaryFunction = twoClassSummary, 
  savePredictions = "final"
)

Model Training

# 1. Train the model ONCE with tuneLength set to your max number of variables (5)
set.seed(2026)
rf_model_comparison <- train(Exited ~ Age + NumOfProducts + IsActiveMember + Geography + Balance, 
                             data = data_train, 
                             method = "rf", 
                             trControl = fitControl,
                             tuneLength = 5)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

# 2. Extract the data table comparing the AUC (labeled as 'ROC' in caret) for each step
print(rf_model_comparison$results[, c("mtry", "ROC")])

##   mtry       ROC
## 1    2 0.8371362
## 2    3 0.8350677
## 3    5 0.8209321
## 4    6 0.8161117
## 5    8 0.8115212

# 3. Plot it visually to see the curve!
plot(rf_model_comparison, metric = "ROC", 
     main = "Random Forest: AUC by Number of Variables (mtry)")

# 1. Define the exact 'k' values you want to test (numbers from 1 to 50)
k_grid <- expand.grid(k = seq(1,50))

# 2. Train the model using a custom grid instead of tuneLength
set.seed(2026)
knn_model_expanded <- train(Exited ~ Age + NumOfProducts + IsActiveMember + Geography + Balance, 
                            data = data_train, 
                            method = "knn", 
                            trControl = fitControl, 
                            preProcess = c("center", "scale"), 
                            tuneGrid = k_grid)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

# 3. Plot it to find the peak
plot(knn_model_expanded, metric = "ROC", main = "KNN: Finding the Peak k")

knn_model_expanded$bestTune

##     k
## 49 49

# 1. Train the SVM model 
set.seed(2026)
svm_model_comparison <- train(Exited ~ Age + NumOfProducts + IsActiveMember + Geography + Balance, 
                              data = data_train, 
                              method = "svmRadial", 
                              trControl = fitControl,
                              preProcess = c("center", "scale"), 
                              tuneLength = 10)

## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.

## line search fails -2.218004 -0.3310784 1.690448e-05 1.362631e-05 -1.603092e-07 -1.481361e-07 -4.728493e-12

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## line search fails -2.067391 -0.1680523 1.294252e-05 9.420677e-06 -9.785983e-08 -8.944148e-08 -2.109152e-12

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## line search fails -2.230202 -0.3354493 1.186907e-05 8.821782e-06 -1.076494e-07 -9.86187e-08 -2.147691e-12

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## line search fails -2.173099 -0.2848492 1.201433e-05 9.112076e-06 -1.02294e-07 -9.415291e-08 -2.086922e-12

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## line search fails -1.622565 0.27362 1.170546e-05 -1.08564e-06 -3.591657e-08 -2.495117e-08 -3.933319e-13

## Warning in method$predict(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class prediction calculations failed; returning NAs

## Warning in method$prob(modelFit = modelFit, newdata = newdata, submodels =
## param): kernlab class probability calculations failed; returning NAs

## Warning in data.frame(..., check.names = FALSE): row names were found from a
## short variable and have been discarded

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

# 2. Extract the data table comparing the AUC (ROC) for each Cost value
print(svm_model_comparison$results[, c("C", "ROC")])

##         C       ROC
## 1    0.25 0.7878251
## 2    0.50 0.7919771
## 3    1.00 0.7924277
## 4    2.00 0.7930324
## 5    4.00 0.7888797
## 6    8.00 0.7868201
## 7   16.00 0.7821982
## 8   32.00 0.7763509
## 9   64.00 0.7709974
## 10 128.00 0.7663646

# 3. Let's see how Cost impacts performance
plot(svm_model_comparison, metric = "ROC", 
     main = "SVM: AUC by Cost Parameter (C)")

set.seed(2026)
# Logistic Regression
model_log <- train(Exited ~ ., 
                   data = data_train, 
                   method = "glm", 
                   family = "binomial",
                   trControl = fitControl)

# Random Forest
model_rf <- train(Exited ~ Age+NumOfProducts+IsActiveMember+Geography+Balance, 
                  data = data_train, 
                  method = "rf", 
                  trControl = fitControl,
                  tuneLength = 2) 

# K-Nearest Neighbors (KNN)
model_knn <- train(Exited ~ Age+NumOfProducts+IsActiveMember+Geography+Balance, 
                   data = data_train, 
                   method = "knn", 
                   preProcess = c("center", "scale"),
                   trControl = fitControl,
                   tuneLength = 48)

# Support Vector Machine (SVM - Radial Basis Function)
model_svm <- train(Exited ~ Age+NumOfProducts+IsActiveMember+Geography+Balance, 
                   data = data_train, 
                   method = "svmRadial", 
                   preProcess = c("center", "scale"),
                   trControl = fitControl,
                   tuneLength = 10)

## line search fails -2.247287 -0.3442948 1.507793e-05 1.101928e-05 -1.356353e-07 -1.236615e-07 -3.40776e-12

## line search fails -2.0948 -0.1919599 1.91742e-05 1.357116e-05 -1.451383e-07 -1.323768e-07 -4.579418e-12

## line search fails -1.646316 0.2632708 1.865162e-05 3.14052e-06 -7.134653e-08 -5.717287e-08 -1.510281e-12

## line search fails -2.261451 -0.3820507 1.453754e-05 1.204477e-05 -1.501107e-07 -1.385735e-07 -3.851325e-12

## line search fails -2.111311 -0.2296402 1.53794e-05 1.214218e-05 -1.348815e-07 -1.241844e-07 -3.582266e-12

## line search fails -2.241063 -0.332637 1.193089e-05 7.913591e-06 -1.023489e-07 -9.277717e-08 -1.955314e-12

## line search fails -2.252176 -0.3699626 1.143544e-05 8.503992e-06 -1.088124e-07 -9.988772e-08 -2.093762e-12

summary(model_log)

## 
## Call:
## NULL
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -3.054e+00  3.248e-01  -9.401  < 2e-16 ***
## NumOfProducts2   -1.470e+00  8.492e-02 -17.315  < 2e-16 ***
## NumOfProducts3    2.464e+00  2.067e-01  11.917  < 2e-16 ***
## NumOfProducts4    1.617e+01  2.458e+02   0.066    0.948    
## GeographyGermany  9.249e-01  8.783e-02  10.531  < 2e-16 ***
## GeographySpain    9.744e-02  8.899e-02   1.095    0.274    
## IsZeroBalance1    3.990e-01  1.903e-01   2.097    0.036 *  
## Age               6.799e-02  3.234e-03  21.023  < 2e-16 ***
## Balance           2.194e-06  1.383e-06   1.586    0.113    
## CreditScore      -7.037e-04  3.560e-04  -1.977    0.048 *  
## Tenure           -1.685e-02  1.198e-02  -1.407    0.160    
## GenderMale       -5.526e-01  6.951e-02  -7.950 1.87e-15 ***
## IsActiveMember1  -1.065e+00  7.345e-02 -14.505  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7080.3  on 7000  degrees of freedom
## Residual deviance: 5334.4  on 6988  degrees of freedom
## AIC: 5360.4
## 
## Number of Fisher Scoring iterations: 14

summary(model_rf)

##                 Length Class      Mode     
## call                4  -none-     call     
## type                1  -none-     character
## predicted        7001  factor     numeric  
## err.rate         1500  -none-     numeric  
## confusion           6  -none-     numeric  
## votes           14002  matrix     numeric  
## oob.times        7001  -none-     numeric  
## classes             2  -none-     character
## importance          8  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y                7001  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames              8  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           2  -none-     character
## param               0  -none-     list

summary(model_knn)

##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      8      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list

summary(model_svm)

## Length  Class   Mode 
##      1   ksvm     S4

Model Evaluation

# Compare Model Results visually
results <- resamples(list(Logistic = model_log, 
                          RandomForest = model_rf, 
                          KNN = model_knn, 
                          SVM = model_svm))

# Visual Plot of Model Comparisons (AUC/ROC) - removed for poor interpretability
# bwplot(results, main = "Model Comparison: ROC (AUC), Sensitivity, and Specificity")

# Generating Predictions on the Test Set
pred_log_prob <- predict(model_log, data_test, type = "prob")
pred_rf_prob <- predict(model_rf, data_test, type = "prob")
pred_knn_prob <- predict(model_knn, data_test, type = "prob")
pred_svm_prob <- predict(model_svm, data_test, type = "prob")

# Calculate ROC curves for visual comparison
roc_log <- roc(data_test$Exited, pred_log_prob$Exited)
roc_rf  <- roc(data_test$Exited, pred_rf_prob$Exited)
roc_knn <- roc(data_test$Exited, pred_knn_prob$Exited)
roc_svm <- roc(data_test$Exited, pred_svm_prob$Exited)

# Plotting the ROC Curves together
ggroc(list("Logistic Reg" = roc_log, 
           "Random Forest" = roc_rf, 
           "KNN" = roc_knn, 
           "SVM" = roc_svm), legacy.axes = TRUE) +
  theme_minimal() +
  labs(title = "ROC Curve Comparison across Models",
       x = "False Positive Rate (1 - Specificity)",
       y = "True Positive Rate (Sensitivity)") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey")

# Print AUC values to the console
cat("AUC - Logistic Regression:", auc(roc_log), "\n")

## AUC - Logistic Regression: 0.8572047

cat("AUC - Random Forest:", auc(roc_rf), "\n")

## AUC - Random Forest: 0.8703696

cat("AUC - KNN:", auc(roc_knn), "\n")

## AUC - KNN: 0.8801115

cat("AUC - SVM:", auc(roc_svm), "\n")

## AUC - SVM: 0.8393985

# STEP 1: Find the probability of Exited on test data
prob_log <- predict(model_log, newdata = data_test, type = "prob")[,"Exited"]
prob_rf  <- predict(model_rf, newdata = data_test, type = "prob")[,"Exited"]
prob_knn <- predict(model_knn, newdata = data_test, type = "prob")[,"Exited"]
prob_svm <- predict(model_svm, newdata = data_test, type = "prob")[,"Exited"]

# STEP 2: Combine actuals and predictions into a single Data Frame
lift_data <- data.frame(
  Actual_Class = data_test$Exited,
  Logistic_Reg = prob_log,
  Random_Forest = prob_rf,
  KNN = prob_knn,
  SVM = prob_svm
)

lift_data$Actual_Num <- ifelse(lift_data$Actual_Class == "Exited", 1, 0)

# 2. Sort the data by each model's probability and calculate the cumulative successes
plot_df <- data.frame(
  Cases = 1:nrow(lift_data),
  Baseline = seq(0, sum(lift_data$Actual_Num), length.out = nrow(lift_data)),
  Logistic_Reg = cumsum(lift_data$Actual_Num[order(-lift_data$Logistic_Reg)]),
  Random_Forest = cumsum(lift_data$Actual_Num[order(-lift_data$Random_Forest)]),
  SVM = cumsum(lift_data$Actual_Num[order(-lift_data$SVM)]),
  KNN = cumsum(lift_data$Actual_Num[order(-lift_data$KNN)])
)

# 3. Reshape the data so ggplot2 can easily read it
plot_df_long <- pivot_longer(plot_df, 
                             cols = c(Baseline, Logistic_Reg, Random_Forest, SVM, KNN),
                             names_to = "Model", 
                             values_to = "Cumulative_Success")

# 4. Plot the lift chart
ggplot(plot_df_long, aes(x = Cases, y = Cumulative_Success, color = Model, linetype = Model)) +
  geom_line(linewidth = 1) +
  scale_linetype_manual(values = c("Baseline" = "dashed", "Logistic_Reg" = "solid", 
                                   "Random_Forest" = "solid", "SVM" = "solid", "KNN" = "solid")) +
  labs(title = "Lift Chart: Cumulative Success by Model",
       x = "Number of Cases",
       y = "Cumulative Success") +
  theme_minimal()

pred_log <- predict(model_log, newdata = data_test)
pred_rf  <- predict(model_rf,  newdata = data_test)
pred_knn <- predict(model_knn, newdata = data_test)
pred_svm <- predict(model_svm, newdata = data_test)
plot_confusion_matrix <- function(prediction, truth, model_name) {
  
  # Calculate the confusion matrix
  cm <- confusionMatrix(prediction, truth)
  
  # Convert the table output into a dataframe
  cm_data <- as.data.frame(cm$table)
  
  # Build the visual
  ggplot(data = cm_data, aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile(color = "white", size = 1) +
    
    # Overlay the exact count numbers inside the colored boxes
    geom_text(aes(label = Freq), vjust = 0.5, fontface = "bold", size = 6, color = "white") +
    
    # Use a blue gradient (very fancy, we know)
    scale_fill_gradient(low = "#9ecae1", high = "#08519c", name = "Count") +
    
    theme_minimal() +
    labs(title = paste("Confusion Matrix:", model_name),
         x = "Actual",
         y = "Predicted") +
    
    theme(plot.title = element_text(face = "bold", size = 14),
          axis.title = element_text(face = "bold"),
          axis.text = element_text(size = 12, face = "bold"))
}

# Generate the Individual Graphs
plot_log <- plot_confusion_matrix(pred_log, data_test$Exited, "Logistic Regression")
plot_rf  <- plot_confusion_matrix(pred_rf,  data_test$Exited, "Random Forest")
plot_knn <- plot_confusion_matrix(pred_knn, data_test$Exited, "K-Nearest Neighbors")
plot_svm <- plot_confusion_matrix(pred_svm, data_test$Exited, "Support Vector Machine")

# View the CMs
plot_log

plot_rf

plot_knn

plot_svm

cm_log <- confusionMatrix(pred_log, data_test$Exited, positive = "Exited")
cm_rf  <- confusionMatrix(pred_rf,  data_test$Exited, positive = "Exited")
cm_knn <- confusionMatrix(pred_knn, data_test$Exited, positive = "Exited")
cm_svm <- confusionMatrix(pred_svm, data_test$Exited, positive = "Exited")

# Build the Master Table
metrics_table <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "KNN", "SVM"),
  AUC = c(auc(roc_log), auc(roc_rf), auc(roc_knn), auc(roc_svm)),
  Accuracy = c(cm_log$overall["Accuracy"], cm_rf$overall["Accuracy"], 
               cm_knn$overall["Accuracy"], cm_svm$overall["Accuracy"]),
  Sensitivity = c(cm_log$byClass["Sensitivity"], cm_rf$byClass["Sensitivity"], 
                  cm_knn$byClass["Sensitivity"], cm_svm$byClass["Sensitivity"]),
  Specificity = c(cm_log$byClass["Specificity"], cm_rf$byClass["Specificity"], 
                  cm_knn$byClass["Specificity"], cm_svm$byClass["Specificity"])
)

# Clean up and print
# Remove row names and round all numbers to 3 decimal places
rownames(metrics_table) <- NULL
metrics_table[,-1] <- round(metrics_table[,-1], 3)

# Print the final table
print(metrics_table)

##                 Model   AUC Accuracy Sensitivity Specificity
## 1 Logistic Regression 0.857    0.848       0.390       0.965
## 2       Random Forest 0.870    0.863       0.406       0.979
## 3                 KNN 0.880    0.868       0.470       0.970
## 4                 SVM 0.839    0.865       0.408       0.982

Revising our Best Model

# Testing if we change our predictions from the default 0.5 to a lower threshold like 0.3
pred_knn_03 <- ifelse(prob_knn > 0.3, "Exited", "Retained")
pred_knn_03 <- factor(pred_knn_03, levels = c("Retained", "Exited"))

# Re-run the confusion matrix
confusionMatrix(pred_knn_03, data_test$Exited, positive = "Exited")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Retained Exited
##   Retained     2145    199
##   Exited        243    412
##                                           
##                Accuracy : 0.8526          
##                  95% CI : (0.8394, 0.8651)
##     No Information Rate : 0.7963          
##     P-Value [Acc > NIR] : 1.038e-15       
##                                           
##                   Kappa : 0.5576          
##                                           
##  Mcnemar's Test P-Value : 0.04083         
##                                           
##             Sensitivity : 0.6743          
##             Specificity : 0.8982          
##          Pos Pred Value : 0.6290          
##          Neg Pred Value : 0.9151          
##              Prevalence : 0.2037          
##          Detection Rate : 0.1374          
##    Detection Prevalence : 0.2184          
##       Balanced Accuracy : 0.7863          
##                                           
##        'Positive' Class : Exited          
##

Moving from a 0.5 to a 0.3 threshold is how we actually make the model useful for retention. At the standard 0.5 cutoff, the model is playing it safe, it only flags a customer as “Exited” if it’s absolutely certain, which is why our sensitivity was stuck in the 40% range. By lowering the bar to 0.3, we’re telling the model to be more aggressive; if there’s even a 30% chance someone is leaving, we want them on our radar. While this does lead to a few more false alarms, at face value, it’s a much better business trade-off because catching a majority of the people about to walk out the door is worth the small cost of over-communicating with a few loyal ones.

Scenario	Prediction	Actual	Result	Cost Level
False Alarm	Churn	Stay	You send a discount to a loyal customer.	Low Cost (Small marketing spend)
Missed Churner	Stay	Churn	A customer leaves and takes their lifetime revenue with them.	High Cost (Loss of LTV + Acquisition cost)

# 1. Set our Business Assumptions
save_value <- 500*mean(churn_clean$Tenure)  # Value of saving a customer ~2500
offer_cost <- 100 # Cost of the marketing offer

# 2. Get probabilities from our KNN
probs <- predict(model_knn, newdata = data_test, type = "prob")[, "Exited"]
actuals <- data_test$Exited

# 3. Create a sequence of thresholds to test
thresholds <- seq(0.05, 0.95, by = 0.01)

# 4. Loop through and calculate profit for each
profit_results <- sapply(thresholds, function(t) {
  preds <- ifelse(probs > t, "Exited", "Retained")
  
  # Breakdown the Confusion Matrix results
  true_pos  <- sum(preds == "Exited" & actuals == "Exited") # Caught Churners
  false_pos <- sum(preds == "Exited" & actuals == "Retained")  # False Alarms
  
  # Profit = (Savings from caught churners) - (Cost of all offers sent)
  total_profit <- (true_pos * save_value) - ((true_pos + false_pos) * offer_cost)
  return(total_profit)
})

# 5. Combine into a table
profit_table <- data.frame(Threshold = thresholds, Profit = profit_results)

# 6. Find the winner
best_row <- profit_table[which.max(profit_table$Profit), ]
print(paste("The most profitable threshold is:", best_row$Threshold))

## [1] "The most profitable threshold is: 0.06"

Really low! Running a CM in the console showed us a crazy increase in error rate.

# More business assumptions!
save_value   <- 500*mean(churn_clean$Tenure)   # Revenue from a saved customer per year * average customer tenure
offer_cost   <- 100   # Cost of sending the offer including the risk annoyance causes churn
success_rate <- 0.2   # Let's assume only 20% of people who get the offer stay

profit_results <- sapply(thresholds, function(t) {
  preds <- ifelse(probs > t, "Exited", "Retained")
  
  true_pos  <- sum(preds == "Exited" & actuals == "Exited")
  false_pos <- sum(preds == "Exited" & actuals == "Retained")
  
  # Gain = (True Positives * Success Rate * Save Value)
  # Loss = (Total Offers Sent * Offer Cost)
  total_profit <- (true_pos * success_rate * save_value) - ((true_pos + false_pos) * offer_cost)
  return(total_profit)
})

# Create the table
optimized_profit_table <- data.frame(Threshold = thresholds, Profit = profit_results)

# Find the winner 
best_threshold <- thresholds[which.max(profit_results)]
print(paste("The optimized threshold is:", best_threshold))

## [1] "The optimized threshold is: 0.22"

ggplot(optimized_profit_table, aes(x = Threshold, y = Profit)) +
  geom_line(color = "#08519c", size = 1.2) +
  geom_vline(xintercept = best_threshold, linetype = "dashed", color = "red") +
  annotate("text", x = best_threshold + 0.02, 
           y = max(optimized_profit_table$Profit) * 1.05, 
           label = paste("Peak Profit at", best_threshold), 
           color = "red", fontface = "bold", vjust = 0) +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_comma()) +
  
  labs(title = "Profit Optimization: Finding the Threshold Sweet Spot",
       subtitle = "Optimal threshold identified for maximum ROI",
       x = "Probability Threshold",
       y = "Total Profit ($)")

# Change our  predictions from the default 0.5 to the calculate optimal threshold of 0.22
pred_knn_04<- ifelse(prob_knn > 0.22, "Exited", "Retained")
pred_knn_04 <- factor(pred_knn_04, levels = c("Retained", "Exited"))

# Re-run the confusion matrix with our new prediction
confusionMatrix(pred_knn_04, data_test$Exited, positive = "Exited")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Retained Exited
##   Retained     1971    134
##   Exited        417    477
##                                         
##                Accuracy : 0.8163        
##                  95% CI : (0.8019, 0.83)
##     No Information Rate : 0.7963        
##     P-Value [Acc > NIR] : 0.003194      
##                                         
##                   Kappa : 0.517         
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.7807        
##             Specificity : 0.8254        
##          Pos Pred Value : 0.5336        
##          Neg Pred Value : 0.9363        
##              Prevalence : 0.2037        
##          Detection Rate : 0.1591        
##    Detection Prevalence : 0.2981        
##       Balanced Accuracy : 0.8030        
##                                         
##        'Positive' Class : Exited        
##

Action	Sensitivity	Specificity (Identifying Leavers)	Profit Impact
Lower Cutoff	Goes Up \(\uparrow\)	Goes Down \(\downarrow\)	We catch more “stayers,” but we waste too much money on incentives for people who leave anyway (False Positives).
Higher Cutoff	Goes Down \(\downarrow\)	Goes Up \(\uparrow\)	We save money on incentives, but we “miss” too many customers who would have stayed if we’d reached out.

# 1. Calculate metrics across all thresholds
tradeoff_data <- data.frame(Threshold = seq(0, 1, by = 0.01))

metrics <- lapply(tradeoff_data$Threshold, function(t) {
  preds <- ifelse(probs >= t, "Exited", "Retained")
  
  sensitivity <- sum(preds == "Exited" & actuals == "Exited") / sum(actuals == "Exited")
  error_rate  <- sum(preds != actuals) / length(actuals)
  
  return(c(Sensitivity = sensitivity, ErrorRate = error_rate))
})

# 2. Combine into a  table
tradeoff_data <- cbind(tradeoff_data, do.call(rbind, metrics))

# 3. Create the Dual-Axis Style Plot
ggplot(tradeoff_data, aes(x = Threshold)) +
  geom_line(aes(y = Sensitivity, color = "Sensitivity"), size = 1.5) +
  geom_line(aes(y = ErrorRate, color = "Error Rate"), size = 1.5, linetype = "dashed") +
  scale_color_manual(values = c("Sensitivity" = "#08519c", "Error Rate" = "#cb181d")) +
  geom_vline(xintercept = 0.22, color = "black", linetype = "dotted") +
  annotate("text", x = 0.35, y = 0.9, label = "0.22 Target") +
  labs(title = "Finding the Balance: Sensitivity vs. Error Rate",
       subtitle = "The intersection represents the mathematically 'balanced' threshold",
       y = "Rate (0.0 to 1.0)",
       x = "Probability Threshold",
       color = "Metric") +
  theme_minimal()

# 1. Define the specific thresholds we want to see in the table
target_thresholds <- c(0.1, 0.2, 0.3, 0.4, 0.5)

# 2. Calculate Sensitivity, Specificity, and Error Rate
comparison_table <- lapply(target_thresholds, function(t) {
  preds <- ifelse(probs >= t, "Exited", "Retained")
  
  # Sensitivity (True Positive Rate): Catching those who leave
  sens <- sum(preds == "Exited" & actuals == "Exited") / sum(actuals == "Exited")
  
  # Specificity (True Negative Rate): Correct identification of those who stay
  spec <- sum(preds == "Retained" & actuals == "Retained") / sum(actuals == "Retained")
  
  # Error Rate: Overall misclassification
  err  <- sum(preds != actuals) / length(actuals)
  
  strategy <- case_when(
    t == 0.2  ~ "Optimal Target",
    t < 0.2   ~ "Aggressive",
    TRUE      ~ "Conservative"
  )
  
  data.frame(
    Threshold = t,
    Strategy = strategy,
    Sensitivity = sens,
    Specificity = spec,
    Error_Rate = err
  )
}) %>% bind_rows()

# 3. Create the Styled Table (fancy!)
comparison_table %>%
  mutate(
    Sensitivity = percent(Sensitivity, accuracy = 0.1),
    Specificity = percent(Specificity, accuracy = 0.1),
    Error_Rate = percent(Error_Rate, accuracy = 0.1)
  ) %>%
  kbl(caption = "Threshold Trade-off Analysis: Catching Churn vs. Operational Efficiency", 
      align = "clccc",
      booktabs = TRUE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered"), 
                full_width = F, 
                position = "center",
                html_font = "Arial") %>% 
  # Highlight the 0.2 row in Blue
  row_spec(2, bold = T, color = "white", background = "#08519c") %>%
  # Adding borders for a cleaner look
  column_spec(1:5, border_left = T, border_right = T) %>%
  # Updated header to reflect the new column
  add_header_above(c(" " = 2, "Model Performance Metrics" = 3))

Threshold Trade-off Analysis: Catching Churn vs. Operational Efficiency
		Model Performance Metrics
Threshold	Strategy	Sensitivity	Specificity	Error_Rate
0.1	Aggressive	90.7%	59.5%	34.1%
0.2	Optimal Target	80.4%	80.0%	19.9%
0.3	Conservative	67.4%	89.8%	14.7%
0.4	Conservative	56.8%	95.3%	12.5%
0.5	Conservative	47.3%	97.0%	13.1%

# 1. Create a clean dataframe with the KNN probabilities and actual outcomes
density_data <- data.frame(
  Probability = prob_knn,
  Actual_Outcome = data_test$Exited
)

# 2. Build the Density Plot
ggplot(density_data, aes(x = Probability, fill = Actual_Outcome)) +
  geom_density(alpha = 0.6) +
  
  # Add the bold line for your new optimized threshold
  geom_vline(xintercept = 0.22, linetype = "dashed", color = "black", linewidth = 1.2) +
  
  # Using the color palette already established in your EDA
  scale_fill_manual(values = c("Retained" = "#2ecc71", "Exited" = "#e74c3c")) +
  
  # Annotate the line so the audience knows exactly what it represents
  annotate("text", x = 0.24, y = max(density(density_data$Probability)$y) * 0.8, 
           label = "0.22 Action Threshold", angle = 90, fontface = "bold") +
  
  theme_minimal() +
  labs(title = "KNN Probability Distribution vs. Actual Churn",
       subtitle = "Shifting the threshold left captures the vast majority of 'Exited' customers",
       x = "Predicted Probability of Exiting",
       y = "Density of Customers",
       fill = "Actual Status") +
  theme(plot.title = element_text(face = "bold", size = 14),
        axis.title = element_text(face = "bold"))

library(gridExtra)

# 1. Generate the factor predictions for both the default and optimized thresholds
pred_05 <- factor(ifelse(prob_knn > 0.50, "Exited", "Retained"), levels = c("Retained", "Exited"))
pred_22 <- factor(ifelse(prob_knn > 0.22, "Exited", "Retained"), levels = c("Retained", "Exited"))

# 2. Run your custom function for both scenarios
matrix_05 <- plot_confusion_matrix(pred_05, data_test$Exited, "Safe Cutoff (0.50)")
matrix_22 <- plot_confusion_matrix(pred_22, data_test$Exited, "Proactive Cutoff (0.22)")

# 3. Plot them side-by-side
grid.arrange(matrix_05, matrix_22, ncol = 2)

# 1. Calculate the raw business outcomes based on the 0.22 predictions
tp <- sum(pred_22 == "Exited" & data_test$Exited == "Exited")
fp <- sum(pred_22 == "Exited" & data_test$Exited == "Retained")

# 2. Apply your financial assumptions
gross_saved <- tp * success_rate * save_value
total_cost  <- (tp + fp) * offer_cost
net_profit  <- gross_saved - total_cost

# 3. Structure the data for a clean financial impact chart
fin_data <- data.frame(
  Category = factor(c("1. Gross Value Saved", "2. Total Campaign Cost", "3. Net Profit Generated"), 
                    levels = c("1. Gross Value Saved", "2. Total Campaign Cost", "3. Net Profit Generated")),
  Amount = c(gross_saved, -total_cost, net_profit),
  Status = c("Positive", "Negative", "Net")
)

# 4. Build the Chart
ggplot(fin_data, aes(x = Category, y = Amount, fill = Status)) +
  geom_col(width = 0.6, color = "black", alpha = 0.9) +
  
  # Add the exact dollar amounts above (or below) the bars
  geom_text(aes(label = dollar(Amount)), 
            vjust = ifelse(fin_data$Amount > 0, -0.5, 1.5), 
            fontface = "bold", size = 5) +
  
  scale_fill_manual(values = c("Positive" = "#2ecc71", "Negative" = "#e74c3c", "Net" = "#08519c")) +
  scale_y_continuous(labels = dollar_format()) +
  
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(face = "bold", size = 11),
        plot.title = element_text(face = "bold", size = 14)) +
  
  labs(title = "Financial Impact of the Optimized KNN Model",
       subtitle = paste("Threshold: 0.22 | Assumed Success Rate:", percent(success_rate)),
       x = "",
       y = "Financial Impact ($)")

DM Project

Andrew Campo

2026-03-05

Group X Data Mining Project

Variable Dictionary

Acknowledgements

Pre-processing

Exploratory Data Analysis

Categorical Variables

Numeric Variables

Data Splitting and Preparation

Model Training

Model Evaluation

Revising our Best Model