–Please click above to navigate directly to machine learning prediction if interested.

–Else step-by-step, we firstly focus on general EDA and we will continue with ML.

pacman::p_load(pacman,tidyverse, lubridate,forcats, stringr, ggthemes, tidyr, readxl, skimr, 
               psych, Amelia, dlookr, caret, fastDummies,
               pROC, e1071, PRROC, randomForest, kernlab, Rtsne, vip, fastshap)

–Data Loading

filepath <- "C:/Users/KAINTHA/Documents/Customer_Churn_Data_Large.xlsx"

# We have 5 sheets in the excel, where EDA is to be performed for first 4 sheets.
df_Customer_Demographics <- read_xlsx(filepath, sheet = "Customer_Demographics")

glimpse(df_Customer_Demographics)
## Rows: 1,000
## Columns: 5
## $ CustomerID    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Age           <dbl> 62, 65, 18, 21, 21, 57, 27, 37, 39, 68, 54, 41, 24, 42, …
## $ Gender        <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "M", "M", "…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Widowed", "Divorced", "D…
## $ IncomeLevel   <chr> "Low", "Low", "Low", "Low", "Medium", "Medium", "High", …
describe(df_Customer_Demographics)
## # A tibble: 2 × 26
##   described_variables     n    na  mean    sd se_mean   IQR skewness kurtosis
##   <chr>               <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl>
## 1 CustomerID           1000     0 500.  289.    9.13   500.   0         -1.2 
## 2 Age                  1000     0  43.3  15.2   0.482   26    0.0132    -1.21
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## #   p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## #   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
Amelia::missmap(df_Customer_Demographics)  #for missing values

diagnose_outlier(df_Customer_Demographics) #for outliers
## # A tibble: 2 × 6
##   variables  outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>             <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 CustomerID            0              0           NaN     500.         500. 
## 2 Age                   0              0           NaN      43.3         43.3

–We have no missing values and outliers here

–Let’s dive deeper into the data using plots

ggplot(df_Customer_Demographics, aes(x = Gender)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized() +
  labs(title = "Gender Distribution", x = "Gender", y = "Count")

ggplot(df_Customer_Demographics, aes(x = MaritalStatus)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Marital Status Distribution", x = "Marital Status", y = "Count")

ggplot(df_Customer_Demographics, aes(x = IncomeLevel)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Income Level Distribution", x = "Income Level", y = "Count")

ggplot(df_Customer_Demographics, aes(x = Age)) +
  geom_histogram(fill = "#AA336A", bins = 30) +
  theme_solarized_2() +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")

–“Female” gender show higher count.

–“Widowed” Status display higher count while “Single” lowest.

–“High” income level show higest distribution.

–Now that we have the insights for the Customer Demographics, we will focus on the Transaction History Of the customer

df_Transaction_History <- read_xlsx(filepath, sheet = "Transaction_History")

df_Transaction_History <- df_Transaction_History %>%
  mutate(TransactionDate = as.POSIXct(TransactionDate))

glimpse(df_Transaction_History)
## Rows: 5,054
## Columns: 5
## $ CustomerID      <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, …
## $ TransactionID   <dbl> 7194, 7250, 9660, 2998, 1228, 8903, 3527, 9279, 9839, …
## $ TransactionDate <dttm> 2022-03-27, 2022-08-08, 2022-07-25, 2022-01-25, 2022-…
## $ AmountSpent     <dbl> 416.50, 54.96, 197.50, 101.31, 397.37, 285.21, 311.34,…
## $ ProductCategory <chr> "Electronics", "Clothing", "Electronics", "Furniture",…
summary(df_Transaction_History)
##    CustomerID     TransactionID  TransactionDate                AmountSpent    
##  Min.   :   1.0   Min.   :1000   Min.   :2022-01-01 00:00:00   Min.   :  5.18  
##  1st Qu.: 251.0   1st Qu.:3242   1st Qu.:2022-04-03 00:00:00   1st Qu.:127.11  
##  Median : 506.0   Median :5530   Median :2022-07-01 00:00:00   Median :250.53  
##  Mean   : 501.4   Mean   :5511   Mean   :2022-07-01 19:25:37   Mean   :250.71  
##  3rd Qu.: 749.0   3rd Qu.:7681   3rd Qu.:2022-09-29 00:00:00   3rd Qu.:373.41  
##  Max.   :1000.0   Max.   :9997   Max.   :2022-12-31 00:00:00   Max.   :499.86  
##  ProductCategory   
##  Length:5054       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
colSums(is.na((df_Transaction_History)))
##      CustomerID   TransactionID TransactionDate     AmountSpent ProductCategory 
##               0               0               0               0               0

–We don’t have any missing values

diagnose_outlier(df_Transaction_History)
## # A tibble: 3 × 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 CustomerID               0              0           NaN      501.         501.
## 2 TransactionID            0              0           NaN     5511.        5511.
## 3 AmountSpent              0              0           NaN      251.         251.

–And neither we have any outliers.

–So we will focus on EDA similarly.

#Distribution of Amount Spent
ggplot(df_Transaction_History, aes(x = AmountSpent)) +
  geom_histogram(bins = 100, fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Distribution of Amount Spent", x = "Amount Spent", y = "Count")

#Average Amount Spent Per Month
monthly_avg <- df_Transaction_History %>%
  mutate(Month = as.Date(format(as.Date(TransactionDate), "%Y-%m-01"))) %>%
  group_by(Month) %>%
  summarise(AvgAmount = mean(AmountSpent, na.rm = TRUE), .groups = "drop")


full_months <- data.frame(
  Month = seq(min(monthly_avg$Month), max(monthly_avg$Month), by = "month")
)


monthly_avg_full <- full_months %>%
  left_join(monthly_avg, by = "Month")


ggplot(monthly_avg_full, aes(x = Month, y = AvgAmount)) +
  geom_line(color = "#AA336A") +
  theme_solarized_2() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
  labs(title = "Average Amount Spent Per Month",
       x = "Month", y = "Average Amount Spent") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Transaction Count by Product Category
ggplot(df_Transaction_History, aes(x = ProductCategory)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Transaction Count by Product Category", x = "Product Category", y = "Count")

–February and December show lowest average amount spent, while March and August peak at amount spent.

–Repeat Customers vs. One-Time Buyers

df_Transaction_History %>%
  group_by(CustomerID) %>%
  summarise(TransactionCount = n()) %>%
  mutate(CustomerType = ifelse(TransactionCount > 1, "Repeat", "One-time")) %>%
  count(CustomerType)
## # A tibble: 2 × 2
##   CustomerType     n
##   <chr>        <int>
## 1 One-time       117
## 2 Repeat         883

–Repeat customers are 7.5 fold compared to One-Time customers

–Now we will focus on the EDA for Customer Service sheet

df_Customer_Service <- read_xlsx(filepath, sheet = "Customer_Service")

glimpse(df_Customer_Service)
## Rows: 1,002
## Columns: 5
## $ CustomerID       <dbl> 1, 2, 3, 4, 4, 6, 8, 8, 9, 11, 11, 12, 12, 13, 14, 14…
## $ InteractionID    <dbl> 6363, 3329, 9976, 7354, 5393, 2358, 4191, 8937, 7813,…
## $ InteractionDate  <dttm> 2022-03-31, 2022-03-17, 2022-08-24, 2022-11-18, 2022…
## $ InteractionType  <chr> "Inquiry", "Inquiry", "Inquiry", "Inquiry", "Inquiry"…
## $ ResolutionStatus <chr> "Resolved", "Resolved", "Resolved", "Resolved", "Unre…
describe(df_Customer_Service)
## # A tibble: 2 × 26
##   described_variables     n    na  mean    sd se_mean   IQR skewness kurtosis
##   <chr>               <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl>
## 1 CustomerID           1002     0  485.  287.    9.07  498.  0.0526     -1.20
## 2 InteractionID        1002     0 5953. 2306.   72.8  3917.  0.00479    -1.19
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## #   p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## #   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
diagnose_outlier(df_Customer_Service)
## # A tibble: 2 × 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 CustomerID               0              0           NaN      485.         485.
## 2 InteractionID            0              0           NaN     5953.        5953.
colSums(is.na((df_Customer_Service)))
##       CustomerID    InteractionID  InteractionDate  InteractionType 
##                0                0                0                0 
## ResolutionStatus 
##                0

–We have no outliers or missing values in the sheet,

–So we will explore the data using the plots,

#Monthly interactions
df_Customer_Service %>%
  mutate(Month = as.Date(format(as.Date(InteractionDate), "%Y-%m-01"))) %>%
  group_by(Month) %>%
  summarise(InteractionCount = n()) %>%
  ggplot(aes(x = Month, y = InteractionCount)) +
  geom_line(color = "#AA336A") +
  geom_point() +
  theme_solarized_2() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
  labs(title = "Monthly Interactions", x = "Month", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Distribution of Interaction Types
ggplot(df_Customer_Service, aes(x = InteractionType)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Distribution of Interaction Types", x = "Type", y = "Count")

#Resolution Status Overview
ggplot(df_Customer_Service, aes(x = ResolutionStatus)) +
  geom_bar(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Resolution Status Overview", x = "Status", y = "Count")

#Interaction Type vs Resolution Status
ggplot(df_Customer_Service, aes(x = InteractionType, fill = ResolutionStatus)) +
  geom_bar(position = "dodge") +
  theme_solarized_2() +
  labs(title = "Interaction Type vs Resolution Status", x = "Interaction Type", y = "Count")

#Resolution Rate Over Time
df_Customer_Service %>%
  mutate(Month = format(as.Date(InteractionDate), "%Y-%m")) %>%
  group_by(Month, ResolutionStatus) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(Month) %>%
  mutate(Rate = Count / sum(Count)) %>%
  filter(ResolutionStatus == "Resolved") %>%
  ggplot(aes(x = as.Date(paste0(Month, "-01")), y = Rate)) +
  geom_line(color = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Resolution Rate Over Time", x = "Month", y = "Resolved %") +
  scale_y_continuous(labels = scales::percent) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Interaction Volume by Weekday
df_Customer_Service %>%
  mutate(
    Weekday = weekdays(as.Date(InteractionDate)),
    Weekday = factor(Weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
  ) %>%
  count(Weekday) %>%
  ggplot(aes(x = Weekday, y = n)) +
  geom_col(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Interaction Volume by Weekday", x = "Day", y = "Count")

#Groups customers into behavioral segments
df_Customer_Service %>%
  group_by(CustomerID) %>%
  summarise(
    TotalInteractions = n(),      # how many times this customer contacted
    Unresolved = sum(ResolutionStatus == "Unresolved"), # how many times the issue was unresolved
    ChannelsUsed = n_distinct(InteractionType) # how many different channels (chat, call, etc.)
  ) %>%
  mutate(Segment = case_when(
    TotalInteractions >= 10 & Unresolved > 0 ~ "High Touch - Unresolved",  
    TotalInteractions >= 10 ~ "High Touch",                                 
    ChannelsUsed > 1 ~ "Multi-channel",                                    
    TRUE ~ "Low Touch"                                                      
  )) %>%
  count(Segment)
## # A tibble: 2 × 2
##   Segment           n
##   <chr>         <int>
## 1 Low Touch       453
## 2 Multi-channel   215

–July, followed by March show highest interactions while February, January and Decemeber show lowest. –Feedbacks were predominant in Interaction Types –52.2% of the cases were resolved. –November and December show the highest resolve rate while January show the lowest. –Mid-Week had the highest interaction volume.

df_Online_Activity <- read_xlsx(filepath, sheet = "Online_Activity")

glimpse(df_Online_Activity)
## Rows: 1,000
## Columns: 4
## $ CustomerID     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ LastLoginDate  <dttm> 2023-10-21, 2023-12-05, 2023-11-15, 2023-08-25, 2023-1…
## $ LoginFrequency <dbl> 34, 5, 3, 2, 41, 2, 32, 17, 24, 29, 30, 43, 10, 26, 44,…
## $ ServiceUsage   <chr> "Mobile App", "Website", "Website", "Website", "Website…
describe(df_Online_Activity)
## # A tibble: 2 × 26
##   described_variables     n    na  mean    sd se_mean   IQR skewness kurtosis
##   <chr>               <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl>
## 1 CustomerID           1000     0 500.  289.    9.13  500.     0        -1.2 
## 2 LoginFrequency       1000     0  25.9  14.1   0.444  24.2   -0.128    -1.18
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## #   p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## #   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
diagnose_outlier(df_Online_Activity)
## # A tibble: 2 × 6
##   variables     outliers_cnt outliers_ratio outliers_mean with_mean without_mean
##   <chr>                <int>          <dbl>         <dbl>     <dbl>        <dbl>
## 1 CustomerID               0              0           NaN     500.         500. 
## 2 LoginFrequen…            0              0           NaN      25.9         25.9
colSums(is.na((df_Online_Activity)))
##     CustomerID  LastLoginDate LoginFrequency   ServiceUsage 
##              0              0              0              0

–We have no outliers or missing values in this sheet.

–Now we again explore the data with the plots.

df_online <- df_Online_Activity %>%
  mutate(LastLoginMonth = floor_date(as.Date(LastLoginDate), unit = "month"))

df_online %>%
  group_by(LastLoginMonth) %>%
  summarise(AvgLogin = mean(LoginFrequency, na.rm = TRUE)) %>%
  ggplot(aes(x = LastLoginMonth, y = AvgLogin)) +
  geom_line(color = "#AA336A") +
  geom_point() +
  theme_solarized_2() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
  labs(title = "Average Login Frequency Over Time", x = "Month", y = "Average Login Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

df_Online_Activity %>%
  mutate(Weekday = weekdays(as.Date(LastLoginDate)),
         Weekday = factor(Weekday, 
                          levels = c("Monday", "Tuesday", "Wednesday", 
                                     "Thursday", "Friday", "Saturday", "Sunday"))) %>%
  count(Weekday, ServiceUsage) %>%
  ggplot(aes(x = Weekday, y = n, fill = ServiceUsage)) +
  geom_col(position = "dodge") +
  theme_solarized_2() +
  labs(title = "Service Usage by Weekday", x = "Day", y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

df_Online_Activity %>%
  mutate(Weekday = weekdays(as.Date(LastLoginDate)),
         Weekday = factor(Weekday, 
                          levels = c("Monday", "Tuesday", "Wednesday", 
                                     "Thursday", "Friday", "Saturday", "Sunday"))) %>%
  group_by(Weekday) %>%
  summarise(AvgLogin = mean(LoginFrequency, na.rm = TRUE)) %>%
  ggplot(aes(x = Weekday, y = AvgLogin)) +
  geom_col(fill = "#AA336A") +
  theme_solarized_2() +
  labs(title = "Average Login Frequency by Weekday", x = "Day", y = "Avg Login Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

–November had the highest login frequency while May had the lowest

–“Online Banking” marginally shows higher distribution of service usage levels

–“Online Banking” service usage is predominant in January, February, April and November while, “Website” shows higher in December, and “Mobile App” in June and July and September

–“Online Banking” is highly used on Mondays and Sundays for Service Usage

–We also do have one more sheet which includes the ChurnStatus of each Customer ID, which is described below in the file.

df_churn_status <- read_xlsx(filepath, sheet = "Churn_Status")

describe(df_churn_status)
## # A tibble: 2 × 26
##   described_variables     n    na    mean      sd se_mean   IQR skewness
##   <chr>               <int> <int>   <dbl>   <dbl>   <dbl> <dbl>    <dbl>
## 1 CustomerID           1000     0 500.    289.     9.13    500.     0   
## 2 ChurnStatus          1000     0   0.204   0.403  0.0127    0      1.47
## # ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## #   p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
## #   p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
## #   p99 <dbl>, p100 <dbl>

1 Machine Learning Model

–Now that we have the EDA of all the sheets in the plot, we will normalise numerical features to ensure consistent scales across variables. This step is crucial for preparing the data for machine learning algorithms.

–Encode categorical variables using techniques like one-hot encoding to transform them into a numerical form appropriate for analysis.

–Merge all sheets using CustomerID to create one customer-level modelling table (one row per customer).

–Treat missing values sensibly (e.g., 0 for no activity; large recency for no recent events) so “no activity” becomes a usable signal.

–Split the data into stratified train/test sets to preserve churn proportion and evaluate fairly.

–Handle churn imbalance using SMOTE or upsampling so churn cases are learned properly.

–Train models (Random Forest, SVM) with cross-validation and ROC-AUC optimisation.

–Evaluate using ROC-AUC, PR-AUC, and confusion matrix to assess churn detection quality.

–Explain the best model using feature importance and SHAP to identify key churn drivers for retention action.

set.seed(123)

#A safe date parser (Excel dates are a mess).
# A helper to replace missing numeric values with 0 where appropriate.
to_date <- function(x) {
  if (inherits(x, "Date")) return(x)
  if (inherits(x, c("POSIXct","POSIXt"))) return(as.Date(x))
  if (is.numeric(x)) return(as.Date(x, origin = "1899-12-30"))
  as.Date(x)
}

na0 <- function(x) replace(x, is.na(x), 0)

We read all 5 sheets. The churn sheet contains the target label (what we want to predict).

customer_demographics <- read_xlsx(filepath, sheet = "Customer_Demographics")
transaction_history   <- read_xlsx(filepath, sheet = "Transaction_History")
customer_service      <- read_xlsx(filepath, sheet = "Customer_Service")
online_activity       <- read_xlsx(filepath, sheet = "Online_Activity")
churn_status          <- read_xlsx(filepath, sheet = "Churn_Status")

#Clean column types
#This is where we ensure sure IDs are numeric, categories are factors, and dates are real dates.  
#This avoids silent errors when we compute time-based features like "days since last transaction.

customer_demographics <- customer_demographics %>%
  mutate(
    CustomerID = as.numeric(CustomerID),
    Age = as.numeric(Age),
    Gender = as.factor(Gender),
    MaritalStatus = as.factor(MaritalStatus),
    IncomeLevel = as.factor(IncomeLevel)
  )

transaction_history <- transaction_history %>%
  mutate(
    CustomerID = as.numeric(CustomerID),
    TransactionID = as.numeric(TransactionID),
    TransactionDate = to_date(TransactionDate),
    AmountSpent = as.numeric(AmountSpent),
    ProductCategory = as.factor(ProductCategory),
    txn_month = floor_date(TransactionDate, "month")
  )

customer_service <- customer_service %>%
  mutate(
    CustomerID = as.numeric(CustomerID),
    InteractionID = as.numeric(InteractionID),
    InteractionDate = to_date(InteractionDate),
    InteractionType = as.factor(InteractionType),
    ResolutionStatus = as.factor(ResolutionStatus),
    svc_month = floor_date(InteractionDate, "month")
  )

online_activity <- online_activity %>%
  mutate(
    CustomerID = as.numeric(CustomerID),
    LastLoginDate = to_date(LastLoginDate),
    LoginFrequency = as.numeric(LoginFrequency),
    ServiceUsage = as.factor(ServiceUsage)
  )

churn_labels <- churn_status %>%
  mutate(
    CustomerID = as.numeric(CustomerID),
    ChurnStatus = as.numeric(ChurnStatus),
    Churn = factor(ifelse(ChurnStatus == 1, "Churn", "NoChurn"), levels = c("Churn","NoChurn"))
  ) %>%
  select(CustomerID, Churn)

churn_labels %>% count(Churn) %>% mutate(pct = n/sum(n))
## # A tibble: 2 × 3
##   Churn       n   pct
##   <fct>   <int> <dbl>
## 1 Churn     204 0.204
## 2 NoChurn   796 0.796

Aggregate transactions to customer-level

Transactions have many rows per customer, but machine learning needs 1 row per customer.
So we convert raw transactions into customer-level behaviour features (spend level, frequency, recency, category mix).

max_txn_date <- max(transaction_history$TransactionDate, na.rm = TRUE)

txn_summary <- transaction_history %>%
  group_by(CustomerID) %>%
  summarise(
    txn_count = n(),
    spend_total = sum(AmountSpent, na.rm = TRUE),
    spend_mean = mean(AmountSpent, na.rm = TRUE),
    spend_median = median(AmountSpent, na.rm = TRUE),
    spend_sd = sd(AmountSpent, na.rm = TRUE),
    spend_cv = ifelse(spend_mean > 0, spend_sd / spend_mean, 0),
    categories_distinct = n_distinct(ProductCategory),
    active_months_txn = n_distinct(txn_month),
    txns_per_active_month = txn_count / pmax(active_months_txn, 1),
    days_since_last_txn = as.numeric(max_txn_date - max(TransactionDate, na.rm = TRUE)),
    .groups = "drop"
  )

txn_category_mix <- transaction_history %>%
  group_by(CustomerID, ProductCategory) %>%
  summarise(
    txn_cat_count = n(),
    txn_cat_spend = sum(AmountSpent, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(CustomerID) %>%
  mutate(
    txn_cat_count_share = txn_cat_count / sum(txn_cat_count),
    txn_cat_spend_share = txn_cat_spend / sum(txn_cat_spend)
  ) %>%
  ungroup() %>%
  pivot_wider(
    names_from = ProductCategory,
    values_from = c(txn_cat_count, txn_cat_count_share, txn_cat_spend_share),
    values_fill = 0,
    names_sep = "__"
  )

txn_summary %>% slice_head(n = 5)
## # A tibble: 5 × 11
##   CustomerID txn_count spend_total spend_mean spend_median spend_sd spend_cv
##        <dbl>     <int>       <dbl>      <dbl>        <dbl>    <dbl>    <dbl>
## 1          1         1        416.       416.         416.      NA    NA    
## 2          2         7       1547.       221.         200.     120.    0.542
## 3          3         6       1703.       284.         294.     133.    0.470
## 4          4         5        917.       183.         126.     132.    0.717
## 5          5         8       2001.       250.         219.     130.    0.519
## # ℹ 4 more variables: categories_distinct <int>, active_months_txn <int>,
## #   txns_per_active_month <dbl>, days_since_last_txn <dbl>

Aggregate customer service to customer-level

Service interactions also have many rows per customer.
We summarise intensity (how often they contact) and quality (resolution rate, unresolved volume), which are common churn drivers.

max_svc_date <- max(customer_service$InteractionDate, na.rm = TRUE)

svc_summary <- customer_service %>%
  mutate(
    is_resolved = as.integer(ResolutionStatus == "Resolved"),
    is_unresolved = as.integer(ResolutionStatus == "Unresolved")
  ) %>%
  group_by(CustomerID) %>%
  summarise(
    svc_count = n(),
    svc_resolved = sum(is_resolved, na.rm = TRUE),
    svc_unresolved = sum(is_unresolved, na.rm = TRUE),
    svc_resolution_rate = ifelse(svc_count > 0, svc_resolved / svc_count, NA_real_),
    svc_types_distinct = n_distinct(InteractionType),
    active_months_svc = n_distinct(svc_month),
    svcs_per_active_month = svc_count / pmax(active_months_svc, 1),
    days_since_last_svc = as.numeric(max_svc_date - max(InteractionDate, na.rm = TRUE)),
    .groups = "drop"
  )

svc_type_counts <- customer_service %>%
  count(CustomerID, InteractionType, name = "svc_type_n") %>%
  pivot_wider(
    names_from = InteractionType,
    values_from = svc_type_n,
    values_fill = 0,
    names_prefix = "svc_type__"
  )

svc_resolution_counts <- customer_service %>%
  count(CustomerID, ResolutionStatus, name = "svc_res_n") %>%
  pivot_wider(
    names_from = ResolutionStatus,
    values_from = svc_res_n,
    values_fill = 0,
    names_prefix = "svc_res__"
  )

svc_summary %>% slice_head(n = 5)
## # A tibble: 5 × 9
##   CustomerID svc_count svc_resolved svc_unresolved svc_resolution_rate
##        <dbl>     <int>        <int>          <int>               <dbl>
## 1          1         1            1              0                 1  
## 2          2         1            1              0                 1  
## 3          3         1            1              0                 1  
## 4          4         2            1              1                 0.5
## 5          6         1            1              0                 1  
## # ℹ 4 more variables: svc_types_distinct <int>, active_months_svc <int>,
## #   svcs_per_active_month <dbl>, days_since_last_svc <dbl>

Engineer online activity features

Online activity is already one row per customer, so we add recency and simple time features.
These help detect customers “fading away” (e.g., not logging in for long).

max_login_date <- max(online_activity$LastLoginDate, na.rm = TRUE)

online_features <- online_activity %>%
  mutate(
    login_month = month(LastLoginDate),
    login_weekday = wday(LastLoginDate, label = TRUE, abbr = FALSE),
    days_since_last_login = as.numeric(max_login_date - LastLoginDate)
  ) %>%
  select(-LastLoginDate)

online_features %>% slice_head(n = 5)
## # A tibble: 5 × 6
##   CustomerID LoginFrequency ServiceUsage login_month login_weekday
##        <dbl>          <dbl> <fct>              <dbl> <ord>        
## 1          1             34 Mobile App            10 Saturday     
## 2          2              5 Website               12 Tuesday      
## 3          3              3 Website               11 Wednesday    
## 4          4              2 Website                8 Friday       
## 5          5             41 Website               10 Friday       
## # ℹ 1 more variable: days_since_last_login <dbl>

Build the final modeling table (one row per customer)

We merge everything by CustomerID.
Then we fill missing behaviour features sensibly: - Missing transactions => counts become 0 and recency becomes “very large” (no recent activity). - Missing service => counts become 0, resolution rate stays NA (no service usage).

customer_data <- churn_labels %>%
  left_join(customer_demographics, by = "CustomerID") %>%
  left_join(txn_summary, by = "CustomerID") %>%
  left_join(txn_category_mix, by = "CustomerID") %>%
  left_join(svc_summary, by = "CustomerID") %>%
  left_join(svc_type_counts, by = "CustomerID") %>%
  left_join(svc_resolution_counts, by = "CustomerID") %>%
  left_join(online_features, by = "CustomerID")

max_possible_txn_gap <- as.numeric(max_txn_date - min(transaction_history$TransactionDate, na.rm = TRUE)) + 1
max_possible_svc_gap <- as.numeric(max_svc_date - min(customer_service$InteractionDate, na.rm = TRUE)) + 1

customer_data <- customer_data %>%
  mutate(
    had_transactions = ifelse(is.na(txn_count), 0, 1),
    had_service = ifelse(is.na(svc_count), 0, 1),

    txn_count = na0(txn_count),
    spend_total = na0(spend_total),
    spend_mean = na0(spend_mean),
    spend_median = na0(spend_median),
    spend_sd = na0(spend_sd),
    spend_cv = na0(spend_cv),
    categories_distinct = na0(categories_distinct),
    active_months_txn = na0(active_months_txn),
    txns_per_active_month = na0(txns_per_active_month),
    days_since_last_txn = ifelse(is.na(days_since_last_txn), max_possible_txn_gap, days_since_last_txn),

    svc_count = na0(svc_count),
    svc_resolved = na0(svc_resolved),
    svc_unresolved = na0(svc_unresolved),
    svc_resolution_rate = ifelse(had_service == 0, NA_real_, svc_resolution_rate),
    svc_types_distinct = na0(svc_types_distinct),
    active_months_svc = na0(active_months_svc),
    svcs_per_active_month = na0(svcs_per_active_month),
    days_since_last_svc = ifelse(is.na(days_since_last_svc), max_possible_svc_gap, days_since_last_svc)
  )

customer_data %>% summarise(rows = n(), churn_missing = sum(is.na(Churn)))
## # A tibble: 1 × 2
##    rows churn_missing
##   <int>         <int>
## 1  3097             0

Split into train and test sets

We keep the churn ratio similar in both sets using stratified splitting.
This lets us test performance on unseen customers.

set.seed(123)
idx <- createDataPartition(customer_data$Churn, p = 0.80, list = FALSE)

train_data_raw <- customer_data[idx, ] %>% select(-CustomerID)
test_data_raw  <- customer_data[-idx, ] %>% select(-CustomerID)

bind_rows(
  train_data_raw %>% count(Churn) %>% mutate(Set = "Train"),
  test_data_raw  %>% count(Churn) %>% mutate(Set = "Test")
)
## # A tibble: 4 × 3
##   Churn       n Set  
##   <fct>   <int> <chr>
## 1 Churn     509 Train
## 2 NoChurn  1969 Train
## 3 Churn     127 Test 
## 4 NoChurn   492 Test

Convert to model-ready features (dummies + impute + remove constants + scale)

Models need numeric inputs: - We one-hot encode categories (dummy variables) - We impute missing values (median) - We remove near-constant columns BEFORE scaling (fixes your warning) - We scale numeric features for fair learning, especially for SVM

dummy_encoder <- dummyVars(Churn ~ ., data = train_data_raw, fullRank = TRUE)

x_train <- predict(dummy_encoder, newdata = train_data_raw) %>% as.data.frame()
x_test  <- predict(dummy_encoder, newdata = test_data_raw) %>% as.data.frame()

y_train <- train_data_raw$Churn
y_test  <- test_data_raw$Churn

# Impute first
imputer <- preProcess(x_train, method = "medianImpute")
x_train_i <- predict(imputer, x_train)
x_test_i  <- predict(imputer, x_test)

# Remove near-zero variance columns (prevents "constant variable" scaling warnings)
nzv_idx <- nearZeroVar(x_train_i)
if (length(nzv_idx) > 0) {
  x_train_i <- x_train_i[, -nzv_idx, drop = FALSE]
  x_test_i  <- x_test_i[, colnames(x_train_i), drop = FALSE]
}

# Scale after removing constants
scaler <- preProcess(x_train_i, method = c("center","scale"))
x_train_p <- predict(scaler, x_train_i)
x_test_p  <- predict(scaler, x_test_i)

tibble(train_features = ncol(x_train_p), test_features = ncol(x_test_p))
## # A tibble: 1 × 2
##   train_features test_features
##            <int>         <int>
## 1             53            53

Handle class imbalance

Churn is minority, so we balance the training set to avoid the model learning “NoChurn” too well.

use_smote <- requireNamespace("smotefamily", quietly = TRUE)

if (use_smote) {
  library(smotefamily)

  y_train_num <- ifelse(y_train == "Churn", 1, 0)
  sm <- SMOTE(X = x_train_p, target = y_train_num, K = 5, dup_size = 0)

  sm_df <- sm$data
  x_train_bal <- sm_df %>% select(-class)
  y_train_bal <- factor(ifelse(sm_df$class == 1, "Churn", "NoChurn"), levels = c("Churn","NoChurn"))
} else {
  up <- upSample(x = x_train_p, y = y_train, yname = "Churn")
  x_train_bal <- up %>% select(-Churn)
  y_train_bal <- up$Churn
}

tibble(Churn = y_train_bal) %>% count(Churn) %>% mutate(pct = n/sum(n))
## # A tibble: 2 × 3
##   Churn       n   pct
##   <fct>   <int> <dbl>
## 1 Churn    1969   0.5
## 2 NoChurn  1969   0.5

t-SNE (visual check)

t-SNE compresses high-dimensional customer features into 2D for a plot.
This is a diagnostic view to see if churners cluster differently.

set.seed(123)
perp <- min(30, floor((nrow(x_train_p) - 1) / 3))

tsne_fit <- Rtsne(
  as.matrix(x_train_p),
  dims = 2,
  perplexity = perp,
  pca = TRUE,
  check_duplicates = FALSE,
  verbose = FALSE
)

tsne_df <- tibble(
  tsne1 = tsne_fit$Y[, 1],
  tsne2 = tsne_fit$Y[, 2],
  Churn = y_train
)

ggplot(tsne_df, aes(tsne1, tsne2, color = Churn)) +
  geom_point(alpha = 0.75, size = 2) +
  labs(
    title = "t-SNE Projection (Train Customers)",
    x = "t-SNE 1", y = "t-SNE 2"
  ) +
  theme_minimal(base_size = 13)

The t-SNE plot shows churn and non-churn customers are largely mixed with no clear separation, suggesting churn is driven by overlapping behavioural patterns rather than distinct customer clusters.

–Train models (Random Forest + SVM)

We train: - Random Forest (best performer and fairly interpretable via importance/SHAP) - SVM Radial (strong benchmark, but less interpretable and can be slower)

We keep cross-validation reasonable for runtime.

ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  savePredictions = "final"
)

set.seed(123)
rf_grid <- expand.grid(mtry = pmax(2, floor(sqrt(ncol(x_train_bal))) + (-1:1)))

rf_model <- train(
  x = x_train_bal, y = y_train_bal,
  method = "rf",
  metric = "ROC",
  trControl = ctrl,
  tuneGrid = rf_grid,
  ntree = 500,
  importance = TRUE
)

rf_model
## Random Forest 
## 
## 3938 samples
##   53 predictor
##    2 classes: 'Churn', 'NoChurn' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3150, 3151, 3150, 3151, 3150 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   6     0.9998491  0.9974606  0.9989848
##   7     0.9998504  0.9974606  0.9989848
##   8     0.9998582  0.9974606  0.9994924
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
set.seed(123)
svm_model <- train(
  x = x_train_bal, y = y_train_bal,
  method = "svmRadial",
  metric = "ROC",
  trControl = ctrl,
  tuneLength = 8   # keep it lighter so it doesn't take forever
)

svm_model
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3938 samples
##   53 predictor
##    2 classes: 'Churn', 'NoChurn' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 3150, 3151, 3150, 3151, 3150 
## Resampling results across tuning parameters:
## 
##   C      ROC        Sens       Spec     
##    0.25  0.7924572  0.7384586  0.7226993
##    0.50  0.8560225  0.7963615  0.7704486
##    1.00  0.9146381  0.8547694  0.8308908
##    2.00  0.9572964  0.9106237  0.8699991
##    4.00  0.9821262  0.9573488  0.9228194
##    8.00  0.9933106  0.9786750  0.9568386
##   16.00  0.9947555  0.9883210  0.9685176
##   32.00  0.9942313  0.9893362  0.9730861
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01099798
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01099798 and C = 16.

Evaluate models (ROC-AUC + PR-AUC)

Accuracy can lie with imbalanced data.
We evaluate using ROC-AUC and PR-AUC and show a simple comparison table.

best_threshold_f1 <- function(probs, truth, positive = "Churn") {
  truth_bin <- ifelse(truth == positive, 1, 0)
  thresholds <- seq(0.05, 0.95, by = 0.01)

  f1s <- sapply(thresholds, function(t) {
    pred_bin <- ifelse(probs >= t, 1, 0)
    tp <- sum(pred_bin == 1 & truth_bin == 1)
    fp <- sum(pred_bin == 1 & truth_bin == 0)
    fn <- sum(pred_bin == 0 & truth_bin == 1)

    prec <- ifelse(tp + fp == 0, 0, tp / (tp + fp))
    rec  <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
    ifelse(prec + rec == 0, 0, 2 * prec * rec / (prec + rec))
  })

  thresholds[which.max(f1s)]
}

evaluate_model <- function(fit, x_test, y_test) {
  probs <- predict(fit, newdata = x_test, type = "prob")[, "Churn"]

  roc_obj <- pROC::roc(response = y_test, predictor = probs, levels = rev(levels(y_test)))
  auc_val <- as.numeric(pROC::auc(roc_obj))

  truth_bin <- ifelse(y_test == "Churn", 1, 0)
  pr <- PRROC::pr.curve(scores.class0 = probs[truth_bin == 1],
                        scores.class1 = probs[truth_bin == 0],
                        curve = FALSE)
  pr_auc <- pr$auc.integral

  thr <- best_threshold_f1(probs, y_test)
  preds <- factor(ifelse(probs >= thr, "Churn", "NoChurn"), levels = levels(y_test))
  cm <- confusionMatrix(preds, y_test, positive = "Churn")

  list(probs = probs, auc = auc_val, pr_auc = pr_auc, threshold = thr, cm = cm)
}

rf_res  <- evaluate_model(rf_model,  x_test_p, y_test)
svm_res <- evaluate_model(svm_model, x_test_p, y_test)

tibble(
  Model = c("RandomForest","SVMRadial"),
  ROC_AUC = c(rf_res$auc, svm_res$auc),
  PR_AUC  = c(rf_res$pr_auc, svm_res$pr_auc)
) %>% arrange(desc(ROC_AUC))
## # A tibble: 2 × 3
##   Model        ROC_AUC PR_AUC
##   <chr>          <dbl>  <dbl>
## 1 RandomForest   0.996  0.988
## 2 SVMRadial      0.931  0.899

Confusion matrix for the best model

This shows what type of errors the model makes (false positives vs false negatives).
That matters for retention strategy decisions.

rf_res$cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Churn NoChurn
##    Churn     118       1
##    NoChurn     9     491
##                                           
##                Accuracy : 0.9838          
##                  95% CI : (0.9705, 0.9922)
##     No Information Rate : 0.7948          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9493          
##                                           
##  Mcnemar's Test P-Value : 0.02686         
##                                           
##             Sensitivity : 0.9291          
##             Specificity : 0.9980          
##          Pos Pred Value : 0.9916          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.2052          
##          Detection Rate : 0.1906          
##    Detection Prevalence : 0.1922          
##       Balanced Accuracy : 0.9636          
##                                           
##        'Positive' Class : Churn           
## 

Global “why churn” signals (Random Forest importance)

This tells us which features matter most overall.
It does not tell direction, but it’s a strong starting point for stakeholders.

imp_raw <- varImp(rf_model, scale = TRUE)

imp_df <- imp_raw$importance %>%
  as.data.frame() %>%
  tibble::rownames_to_column("feature")

num_cols <- names(imp_df)[sapply(imp_df, is.numeric)]

rf_importance <- imp_df %>%
  mutate(importance = if (length(num_cols) == 1) .data[[num_cols[1]]] else rowMeans(across(all_of(num_cols)))) %>%
  arrange(desc(importance)) %>%
  slice_head(n = 20) %>%
  select(feature, importance)

ggplot(rf_importance, aes(x = reorder(feature, importance), y = importance)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Drivers of Churn (Random Forest Importance)",
    x = "Feature", y = "Importance"
  ) +
  theme_minimal(base_size = 13)

Bank’s churn model is basically telling us something very human: customers leave when they stop showing up and stop behaving like active customers. The strongest signals are how long it has been since a customer last logged in and their login frequency, followed by how recently they last made a transaction. After that, churn is heavily linked to spending behaviour (how much they spend, how consistent their spending is, and whether their spending is becoming volatile), plus how recently they contacted customer service. Demographics like age, income level, and marital status matter, but they’re clearly secondary compared to engagement and activity.

So the practical retention plan is simple: flag customers who haven’t logged in or transacted recently, especially if their spending is dropping or becoming inconsistent, and intervene early with targeted nudges (personalised offers, reminders, friction-free reactivation prompts) and service recovery where needed. In other words, monitor engagement + recency as the early-warning system, then use spending and service signals to decide what kind of retention action to take.

–Now we also focus on SHAP (SHapley Additive exPlanations), explaining, which feature actually pushed predictions up/down by this much.

SHAP explains contributions: - Positive SHAP pushes prediction toward churn. - Negative SHAP pushes away from churn.

We compute SHAP on a sample of test customers to keep runtime sensible.

pred_churn_prob <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "Churn"]
}

set.seed(123)
sample_n <- min(300, nrow(x_test_p))
sample_idx <- sample(seq_len(nrow(x_test_p)), sample_n)

X_explain <- x_test_p[sample_idx, , drop = FALSE]

shap_vals <- fastshap::explain(
  object = rf_model,
  X = X_explain,
  pred_wrapper = pred_churn_prob,
  nsim = 80,
  adjust = TRUE
)

shap_long <- shap_vals %>%
  as.data.frame() %>%
  mutate(row_id = row_number()) %>%
  pivot_longer(-row_id, names_to = "feature", values_to = "shap")

shap_global <- shap_long %>%
  group_by(feature) %>%
  summarise(mean_abs_shap = mean(abs(shap)), .groups = "drop") %>%
  arrange(desc(mean_abs_shap)) %>%
  slice_head(n = 20)

ggplot(shap_global, aes(x = reorder(feature, mean_abs_shap), y = mean_abs_shap)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Top 20 Drivers of Churn (SHAP Mean |Contribution|)",
    x = "Feature", y = "Mean |SHAP|"
  ) +
  theme_minimal(base_size = 13)

This SHAP chart is basically the “no excuses” version of your story: the biggest reason customers churn is that they stop engaging. The top driver is LoginFrequency (customers who log in less are far more likely to leave), and right after that is days_since_last_svc, meaning recent or frequent service interactions are strongly tied to churn risk (often a sign of friction, complaints, or unresolved issues). Then the next block is all about spending behaviour: median/mean/total spend and how stable it is (spend_cv and spend_sd). In plain terms, churners tend to either spend less or their spending becomes inconsistent and volatile, which is a classic “customer drifting away” signal. Recency still matters too (days_since_last_txn and days_since_last_login), plus some minor timing/channel patterns (weekday, website usage), which are more “behavioural fingerprints” than core causes. So the solution is: use login drop + recent service friction + changing spend patterns as your early-warning triggers, then act with targeted retention: reactivation nudges for low logins, service recovery for recent service users, and personalised offers for customers whose spending is declining or becoming erratic.

SHAP for one high-risk customer (clear indicators)

We pick the customer in the sample with the highest churn probability and list the top reasons.

probs_sample <- pred_churn_prob(rf_model, X_explain)
top_case <- which.max(probs_sample)

case_df <- tibble(
  feature = colnames(shap_vals),
  shap = as.numeric(shap_vals[top_case, ])
) %>%
  arrange(desc(abs(shap))) %>%
  slice_head(n = 12)

case_df
## # A tibble: 12 × 2
##    feature                 shap
##    <chr>                  <dbl>
##  1 spend_sd              0.0777
##  2 LoginFrequency        0.0650
##  3 spend_mean            0.0575
##  4 spend_total           0.0548
##  5 spend_cv              0.0542
##  6 Age                   0.0431
##  7 days_since_last_svc   0.0411
##  8 spend_median          0.0333
##  9 login_weekday^4       0.0303
## 10 days_since_last_login 0.0268
## 11 days_since_last_txn   0.0241
## 12 login_weekday^6       0.0133

So above is the “why THIS specific customer is predicted to churn” list. For the highest-risk customer, the model indicates churn is mainly driven by unstable spending behaviour (high spending variability and inconsistency), combined with reduced engagement (lower login frequency and greater time since last login). Additional contributors include the customer’s overall spending level (mean/median/total spend patterns), recent customer service behaviour, and age, but the dominant signal is that the customer’s activity has become irregular and is drifting. This suggests a targeted retention action should focus on re-engagement (bringing them back to regular logins) and stabilising usage/spend, alongside proactive service follow-up if friction is present.