–Please click above to navigate directly to machine learning prediction if interested.
–Else step-by-step, we firstly focus on general EDA and we will continue with ML.
pacman::p_load(pacman,tidyverse, lubridate,forcats, stringr, ggthemes, tidyr, readxl, skimr,
psych, Amelia, dlookr, caret, fastDummies,
pROC, e1071, PRROC, randomForest, kernlab, Rtsne, vip, fastshap)–Data Loading
filepath <- "C:/Users/KAINTHA/Documents/Customer_Churn_Data_Large.xlsx"
# We have 5 sheets in the excel, where EDA is to be performed for first 4 sheets.
df_Customer_Demographics <- read_xlsx(filepath, sheet = "Customer_Demographics")
glimpse(df_Customer_Demographics)## Rows: 1,000
## Columns: 5
## $ CustomerID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ Age <dbl> 62, 65, 18, 21, 21, 57, 27, 37, 39, 68, 54, 41, 24, 42, …
## $ Gender <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "M", "M", "…
## $ MaritalStatus <chr> "Single", "Married", "Single", "Widowed", "Divorced", "D…
## $ IncomeLevel <chr> "Low", "Low", "Low", "Low", "Medium", "Medium", "High", …
## # A tibble: 2 × 26
## described_variables n na mean sd se_mean IQR skewness kurtosis
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 1000 0 500. 289. 9.13 500. 0 -1.2
## 2 Age 1000 0 43.3 15.2 0.482 26 0.0132 -1.21
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## # p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## # p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
## # A tibble: 2 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 0 0 NaN 500. 500.
## 2 Age 0 0 NaN 43.3 43.3
–We have no missing values and outliers here
–Let’s dive deeper into the data using plots
ggplot(df_Customer_Demographics, aes(x = Gender)) +
geom_bar(fill = "#AA336A") +
theme_solarized() +
labs(title = "Gender Distribution", x = "Gender", y = "Count")ggplot(df_Customer_Demographics, aes(x = MaritalStatus)) +
geom_bar(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Marital Status Distribution", x = "Marital Status", y = "Count")ggplot(df_Customer_Demographics, aes(x = IncomeLevel)) +
geom_bar(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Income Level Distribution", x = "Income Level", y = "Count")ggplot(df_Customer_Demographics, aes(x = Age)) +
geom_histogram(fill = "#AA336A", bins = 30) +
theme_solarized_2() +
labs(title = "Age Distribution", x = "Age", y = "Frequency")–“Female” gender show higher count.
–“Widowed” Status display higher count while “Single” lowest.
–“High” income level show higest distribution.
–Now that we have the insights for the Customer Demographics, we will focus on the Transaction History Of the customer
df_Transaction_History <- read_xlsx(filepath, sheet = "Transaction_History")
df_Transaction_History <- df_Transaction_History %>%
mutate(TransactionDate = as.POSIXct(TransactionDate))
glimpse(df_Transaction_History)## Rows: 5,054
## Columns: 5
## $ CustomerID <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, …
## $ TransactionID <dbl> 7194, 7250, 9660, 2998, 1228, 8903, 3527, 9279, 9839, …
## $ TransactionDate <dttm> 2022-03-27, 2022-08-08, 2022-07-25, 2022-01-25, 2022-…
## $ AmountSpent <dbl> 416.50, 54.96, 197.50, 101.31, 397.37, 285.21, 311.34,…
## $ ProductCategory <chr> "Electronics", "Clothing", "Electronics", "Furniture",…
## CustomerID TransactionID TransactionDate AmountSpent
## Min. : 1.0 Min. :1000 Min. :2022-01-01 00:00:00 Min. : 5.18
## 1st Qu.: 251.0 1st Qu.:3242 1st Qu.:2022-04-03 00:00:00 1st Qu.:127.11
## Median : 506.0 Median :5530 Median :2022-07-01 00:00:00 Median :250.53
## Mean : 501.4 Mean :5511 Mean :2022-07-01 19:25:37 Mean :250.71
## 3rd Qu.: 749.0 3rd Qu.:7681 3rd Qu.:2022-09-29 00:00:00 3rd Qu.:373.41
## Max. :1000.0 Max. :9997 Max. :2022-12-31 00:00:00 Max. :499.86
## ProductCategory
## Length:5054
## Class :character
## Mode :character
##
##
##
## CustomerID TransactionID TransactionDate AmountSpent ProductCategory
## 0 0 0 0 0
–We don’t have any missing values
## # A tibble: 3 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 0 0 NaN 501. 501.
## 2 TransactionID 0 0 NaN 5511. 5511.
## 3 AmountSpent 0 0 NaN 251. 251.
–And neither we have any outliers.
–So we will focus on EDA similarly.
#Distribution of Amount Spent
ggplot(df_Transaction_History, aes(x = AmountSpent)) +
geom_histogram(bins = 100, fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Distribution of Amount Spent", x = "Amount Spent", y = "Count")#Average Amount Spent Per Month
monthly_avg <- df_Transaction_History %>%
mutate(Month = as.Date(format(as.Date(TransactionDate), "%Y-%m-01"))) %>%
group_by(Month) %>%
summarise(AvgAmount = mean(AmountSpent, na.rm = TRUE), .groups = "drop")
full_months <- data.frame(
Month = seq(min(monthly_avg$Month), max(monthly_avg$Month), by = "month")
)
monthly_avg_full <- full_months %>%
left_join(monthly_avg, by = "Month")
ggplot(monthly_avg_full, aes(x = Month, y = AvgAmount)) +
geom_line(color = "#AA336A") +
theme_solarized_2() +
scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
labs(title = "Average Amount Spent Per Month",
x = "Month", y = "Average Amount Spent") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))#Transaction Count by Product Category
ggplot(df_Transaction_History, aes(x = ProductCategory)) +
geom_bar(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Transaction Count by Product Category", x = "Product Category", y = "Count")–February and December show lowest average amount spent, while March and August peak at amount spent.
–Repeat Customers vs. One-Time Buyers
df_Transaction_History %>%
group_by(CustomerID) %>%
summarise(TransactionCount = n()) %>%
mutate(CustomerType = ifelse(TransactionCount > 1, "Repeat", "One-time")) %>%
count(CustomerType)## # A tibble: 2 × 2
## CustomerType n
## <chr> <int>
## 1 One-time 117
## 2 Repeat 883
–Repeat customers are 7.5 fold compared to One-Time customers
–Now we will focus on the EDA for Customer Service sheet
## Rows: 1,002
## Columns: 5
## $ CustomerID <dbl> 1, 2, 3, 4, 4, 6, 8, 8, 9, 11, 11, 12, 12, 13, 14, 14…
## $ InteractionID <dbl> 6363, 3329, 9976, 7354, 5393, 2358, 4191, 8937, 7813,…
## $ InteractionDate <dttm> 2022-03-31, 2022-03-17, 2022-08-24, 2022-11-18, 2022…
## $ InteractionType <chr> "Inquiry", "Inquiry", "Inquiry", "Inquiry", "Inquiry"…
## $ ResolutionStatus <chr> "Resolved", "Resolved", "Resolved", "Resolved", "Unre…
## # A tibble: 2 × 26
## described_variables n na mean sd se_mean IQR skewness kurtosis
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 1002 0 485. 287. 9.07 498. 0.0526 -1.20
## 2 InteractionID 1002 0 5953. 2306. 72.8 3917. 0.00479 -1.19
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## # p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## # p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
## # A tibble: 2 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 0 0 NaN 485. 485.
## 2 InteractionID 0 0 NaN 5953. 5953.
## CustomerID InteractionID InteractionDate InteractionType
## 0 0 0 0
## ResolutionStatus
## 0
–We have no outliers or missing values in the sheet,
–So we will explore the data using the plots,
#Monthly interactions
df_Customer_Service %>%
mutate(Month = as.Date(format(as.Date(InteractionDate), "%Y-%m-01"))) %>%
group_by(Month) %>%
summarise(InteractionCount = n()) %>%
ggplot(aes(x = Month, y = InteractionCount)) +
geom_line(color = "#AA336A") +
geom_point() +
theme_solarized_2() +
scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
labs(title = "Monthly Interactions", x = "Month", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))#Distribution of Interaction Types
ggplot(df_Customer_Service, aes(x = InteractionType)) +
geom_bar(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Distribution of Interaction Types", x = "Type", y = "Count")#Resolution Status Overview
ggplot(df_Customer_Service, aes(x = ResolutionStatus)) +
geom_bar(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Resolution Status Overview", x = "Status", y = "Count")#Interaction Type vs Resolution Status
ggplot(df_Customer_Service, aes(x = InteractionType, fill = ResolutionStatus)) +
geom_bar(position = "dodge") +
theme_solarized_2() +
labs(title = "Interaction Type vs Resolution Status", x = "Interaction Type", y = "Count")#Resolution Rate Over Time
df_Customer_Service %>%
mutate(Month = format(as.Date(InteractionDate), "%Y-%m")) %>%
group_by(Month, ResolutionStatus) %>%
summarise(Count = n(), .groups = "drop") %>%
group_by(Month) %>%
mutate(Rate = Count / sum(Count)) %>%
filter(ResolutionStatus == "Resolved") %>%
ggplot(aes(x = as.Date(paste0(Month, "-01")), y = Rate)) +
geom_line(color = "#AA336A") +
theme_solarized_2() +
labs(title = "Resolution Rate Over Time", x = "Month", y = "Resolved %") +
scale_y_continuous(labels = scales::percent) +
scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))#Interaction Volume by Weekday
df_Customer_Service %>%
mutate(
Weekday = weekdays(as.Date(InteractionDate)),
Weekday = factor(Weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
) %>%
count(Weekday) %>%
ggplot(aes(x = Weekday, y = n)) +
geom_col(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Interaction Volume by Weekday", x = "Day", y = "Count")#Groups customers into behavioral segments
df_Customer_Service %>%
group_by(CustomerID) %>%
summarise(
TotalInteractions = n(), # how many times this customer contacted
Unresolved = sum(ResolutionStatus == "Unresolved"), # how many times the issue was unresolved
ChannelsUsed = n_distinct(InteractionType) # how many different channels (chat, call, etc.)
) %>%
mutate(Segment = case_when(
TotalInteractions >= 10 & Unresolved > 0 ~ "High Touch - Unresolved",
TotalInteractions >= 10 ~ "High Touch",
ChannelsUsed > 1 ~ "Multi-channel",
TRUE ~ "Low Touch"
)) %>%
count(Segment)## # A tibble: 2 × 2
## Segment n
## <chr> <int>
## 1 Low Touch 453
## 2 Multi-channel 215
–July, followed by March show highest interactions while February, January and Decemeber show lowest. –Feedbacks were predominant in Interaction Types –52.2% of the cases were resolved. –November and December show the highest resolve rate while January show the lowest. –Mid-Week had the highest interaction volume.
## Rows: 1,000
## Columns: 4
## $ CustomerID <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ LastLoginDate <dttm> 2023-10-21, 2023-12-05, 2023-11-15, 2023-08-25, 2023-1…
## $ LoginFrequency <dbl> 34, 5, 3, 2, 41, 2, 32, 17, 24, 29, 30, 43, 10, 26, 44,…
## $ ServiceUsage <chr> "Mobile App", "Website", "Website", "Website", "Website…
## # A tibble: 2 × 26
## described_variables n na mean sd se_mean IQR skewness kurtosis
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 1000 0 500. 289. 9.13 500. 0 -1.2
## 2 LoginFrequency 1000 0 25.9 14.1 0.444 24.2 -0.128 -1.18
## # ℹ 17 more variables: p00 <dbl>, p01 <dbl>, p05 <dbl>, p10 <dbl>, p20 <dbl>,
## # p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
## # p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
## # A tibble: 2 × 6
## variables outliers_cnt outliers_ratio outliers_mean with_mean without_mean
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 0 0 NaN 500. 500.
## 2 LoginFrequen… 0 0 NaN 25.9 25.9
## CustomerID LastLoginDate LoginFrequency ServiceUsage
## 0 0 0 0
–We have no outliers or missing values in this sheet.
–Now we again explore the data with the plots.
df_online <- df_Online_Activity %>%
mutate(LastLoginMonth = floor_date(as.Date(LastLoginDate), unit = "month"))
df_online %>%
group_by(LastLoginMonth) %>%
summarise(AvgLogin = mean(LoginFrequency, na.rm = TRUE)) %>%
ggplot(aes(x = LastLoginMonth, y = AvgLogin)) +
geom_line(color = "#AA336A") +
geom_point() +
theme_solarized_2() +
scale_x_date(date_breaks = "1 month", date_labels = "%b\n%Y") +
labs(title = "Average Login Frequency Over Time", x = "Month", y = "Average Login Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))df_Online_Activity %>%
mutate(Weekday = weekdays(as.Date(LastLoginDate)),
Weekday = factor(Weekday,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))) %>%
count(Weekday, ServiceUsage) %>%
ggplot(aes(x = Weekday, y = n, fill = ServiceUsage)) +
geom_col(position = "dodge") +
theme_solarized_2() +
labs(title = "Service Usage by Weekday", x = "Day", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))df_Online_Activity %>%
mutate(Weekday = weekdays(as.Date(LastLoginDate)),
Weekday = factor(Weekday,
levels = c("Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))) %>%
group_by(Weekday) %>%
summarise(AvgLogin = mean(LoginFrequency, na.rm = TRUE)) %>%
ggplot(aes(x = Weekday, y = AvgLogin)) +
geom_col(fill = "#AA336A") +
theme_solarized_2() +
labs(title = "Average Login Frequency by Weekday", x = "Day", y = "Avg Login Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))–November had the highest login frequency while May had the lowest
–“Online Banking” marginally shows higher distribution of service usage levels
–“Online Banking” service usage is predominant in January, February, April and November while, “Website” shows higher in December, and “Mobile App” in June and July and September
–“Online Banking” is highly used on Mondays and Sundays for Service Usage
–We also do have one more sheet which includes the ChurnStatus of each Customer ID, which is described below in the file.
## # A tibble: 2 × 26
## described_variables n na mean sd se_mean IQR skewness
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CustomerID 1000 0 500. 289. 9.13 500. 0
## 2 ChurnStatus 1000 0 0.204 0.403 0.0127 0 1.47
## # ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## # p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
## # p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
## # p99 <dbl>, p100 <dbl>
–Now that we have the EDA of all the sheets in the plot, we will normalise numerical features to ensure consistent scales across variables. This step is crucial for preparing the data for machine learning algorithms.
–Encode categorical variables using techniques like one-hot encoding to transform them into a numerical form appropriate for analysis.
–Merge all sheets using CustomerID to create one customer-level modelling table (one row per customer).
–Treat missing values sensibly (e.g., 0 for no activity; large recency for no recent events) so “no activity” becomes a usable signal.
–Split the data into stratified train/test sets to preserve churn proportion and evaluate fairly.
–Handle churn imbalance using SMOTE or upsampling so churn cases are learned properly.
–Train models (Random Forest, SVM) with cross-validation and ROC-AUC optimisation.
–Evaluate using ROC-AUC, PR-AUC, and confusion matrix to assess churn detection quality.
–Explain the best model using feature importance and SHAP to identify key churn drivers for retention action.
set.seed(123)
#A safe date parser (Excel dates are a mess).
# A helper to replace missing numeric values with 0 where appropriate.
to_date <- function(x) {
if (inherits(x, "Date")) return(x)
if (inherits(x, c("POSIXct","POSIXt"))) return(as.Date(x))
if (is.numeric(x)) return(as.Date(x, origin = "1899-12-30"))
as.Date(x)
}
na0 <- function(x) replace(x, is.na(x), 0)We read all 5 sheets. The churn sheet contains the target label (what we want to predict).
customer_demographics <- read_xlsx(filepath, sheet = "Customer_Demographics")
transaction_history <- read_xlsx(filepath, sheet = "Transaction_History")
customer_service <- read_xlsx(filepath, sheet = "Customer_Service")
online_activity <- read_xlsx(filepath, sheet = "Online_Activity")
churn_status <- read_xlsx(filepath, sheet = "Churn_Status")
#Clean column types
#This is where we ensure sure IDs are numeric, categories are factors, and dates are real dates.
#This avoids silent errors when we compute time-based features like "days since last transaction.
customer_demographics <- customer_demographics %>%
mutate(
CustomerID = as.numeric(CustomerID),
Age = as.numeric(Age),
Gender = as.factor(Gender),
MaritalStatus = as.factor(MaritalStatus),
IncomeLevel = as.factor(IncomeLevel)
)
transaction_history <- transaction_history %>%
mutate(
CustomerID = as.numeric(CustomerID),
TransactionID = as.numeric(TransactionID),
TransactionDate = to_date(TransactionDate),
AmountSpent = as.numeric(AmountSpent),
ProductCategory = as.factor(ProductCategory),
txn_month = floor_date(TransactionDate, "month")
)
customer_service <- customer_service %>%
mutate(
CustomerID = as.numeric(CustomerID),
InteractionID = as.numeric(InteractionID),
InteractionDate = to_date(InteractionDate),
InteractionType = as.factor(InteractionType),
ResolutionStatus = as.factor(ResolutionStatus),
svc_month = floor_date(InteractionDate, "month")
)
online_activity <- online_activity %>%
mutate(
CustomerID = as.numeric(CustomerID),
LastLoginDate = to_date(LastLoginDate),
LoginFrequency = as.numeric(LoginFrequency),
ServiceUsage = as.factor(ServiceUsage)
)
churn_labels <- churn_status %>%
mutate(
CustomerID = as.numeric(CustomerID),
ChurnStatus = as.numeric(ChurnStatus),
Churn = factor(ifelse(ChurnStatus == 1, "Churn", "NoChurn"), levels = c("Churn","NoChurn"))
) %>%
select(CustomerID, Churn)
churn_labels %>% count(Churn) %>% mutate(pct = n/sum(n))## # A tibble: 2 × 3
## Churn n pct
## <fct> <int> <dbl>
## 1 Churn 204 0.204
## 2 NoChurn 796 0.796
Aggregate transactions to customer-level
Transactions have many rows per customer, but machine learning needs
1 row per customer.
So we convert raw transactions into customer-level behaviour features
(spend level, frequency, recency, category mix).
max_txn_date <- max(transaction_history$TransactionDate, na.rm = TRUE)
txn_summary <- transaction_history %>%
group_by(CustomerID) %>%
summarise(
txn_count = n(),
spend_total = sum(AmountSpent, na.rm = TRUE),
spend_mean = mean(AmountSpent, na.rm = TRUE),
spend_median = median(AmountSpent, na.rm = TRUE),
spend_sd = sd(AmountSpent, na.rm = TRUE),
spend_cv = ifelse(spend_mean > 0, spend_sd / spend_mean, 0),
categories_distinct = n_distinct(ProductCategory),
active_months_txn = n_distinct(txn_month),
txns_per_active_month = txn_count / pmax(active_months_txn, 1),
days_since_last_txn = as.numeric(max_txn_date - max(TransactionDate, na.rm = TRUE)),
.groups = "drop"
)
txn_category_mix <- transaction_history %>%
group_by(CustomerID, ProductCategory) %>%
summarise(
txn_cat_count = n(),
txn_cat_spend = sum(AmountSpent, na.rm = TRUE),
.groups = "drop"
) %>%
group_by(CustomerID) %>%
mutate(
txn_cat_count_share = txn_cat_count / sum(txn_cat_count),
txn_cat_spend_share = txn_cat_spend / sum(txn_cat_spend)
) %>%
ungroup() %>%
pivot_wider(
names_from = ProductCategory,
values_from = c(txn_cat_count, txn_cat_count_share, txn_cat_spend_share),
values_fill = 0,
names_sep = "__"
)
txn_summary %>% slice_head(n = 5)## # A tibble: 5 × 11
## CustomerID txn_count spend_total spend_mean spend_median spend_sd spend_cv
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 416. 416. 416. NA NA
## 2 2 7 1547. 221. 200. 120. 0.542
## 3 3 6 1703. 284. 294. 133. 0.470
## 4 4 5 917. 183. 126. 132. 0.717
## 5 5 8 2001. 250. 219. 130. 0.519
## # ℹ 4 more variables: categories_distinct <int>, active_months_txn <int>,
## # txns_per_active_month <dbl>, days_since_last_txn <dbl>
Aggregate customer service to customer-level
Service interactions also have many rows per customer.
We summarise intensity (how often they contact) and quality (resolution
rate, unresolved volume), which are common churn drivers.
max_svc_date <- max(customer_service$InteractionDate, na.rm = TRUE)
svc_summary <- customer_service %>%
mutate(
is_resolved = as.integer(ResolutionStatus == "Resolved"),
is_unresolved = as.integer(ResolutionStatus == "Unresolved")
) %>%
group_by(CustomerID) %>%
summarise(
svc_count = n(),
svc_resolved = sum(is_resolved, na.rm = TRUE),
svc_unresolved = sum(is_unresolved, na.rm = TRUE),
svc_resolution_rate = ifelse(svc_count > 0, svc_resolved / svc_count, NA_real_),
svc_types_distinct = n_distinct(InteractionType),
active_months_svc = n_distinct(svc_month),
svcs_per_active_month = svc_count / pmax(active_months_svc, 1),
days_since_last_svc = as.numeric(max_svc_date - max(InteractionDate, na.rm = TRUE)),
.groups = "drop"
)
svc_type_counts <- customer_service %>%
count(CustomerID, InteractionType, name = "svc_type_n") %>%
pivot_wider(
names_from = InteractionType,
values_from = svc_type_n,
values_fill = 0,
names_prefix = "svc_type__"
)
svc_resolution_counts <- customer_service %>%
count(CustomerID, ResolutionStatus, name = "svc_res_n") %>%
pivot_wider(
names_from = ResolutionStatus,
values_from = svc_res_n,
values_fill = 0,
names_prefix = "svc_res__"
)
svc_summary %>% slice_head(n = 5)## # A tibble: 5 × 9
## CustomerID svc_count svc_resolved svc_unresolved svc_resolution_rate
## <dbl> <int> <int> <int> <dbl>
## 1 1 1 1 0 1
## 2 2 1 1 0 1
## 3 3 1 1 0 1
## 4 4 2 1 1 0.5
## 5 6 1 1 0 1
## # ℹ 4 more variables: svc_types_distinct <int>, active_months_svc <int>,
## # svcs_per_active_month <dbl>, days_since_last_svc <dbl>
Engineer online activity features
Online activity is already one row per customer, so we add recency
and simple time features.
These help detect customers “fading away” (e.g., not logging in for
long).
max_login_date <- max(online_activity$LastLoginDate, na.rm = TRUE)
online_features <- online_activity %>%
mutate(
login_month = month(LastLoginDate),
login_weekday = wday(LastLoginDate, label = TRUE, abbr = FALSE),
days_since_last_login = as.numeric(max_login_date - LastLoginDate)
) %>%
select(-LastLoginDate)
online_features %>% slice_head(n = 5)## # A tibble: 5 × 6
## CustomerID LoginFrequency ServiceUsage login_month login_weekday
## <dbl> <dbl> <fct> <dbl> <ord>
## 1 1 34 Mobile App 10 Saturday
## 2 2 5 Website 12 Tuesday
## 3 3 3 Website 11 Wednesday
## 4 4 2 Website 8 Friday
## 5 5 41 Website 10 Friday
## # ℹ 1 more variable: days_since_last_login <dbl>
Build the final modeling table (one row per customer)
We merge everything by CustomerID.
Then we fill missing behaviour features sensibly: - Missing transactions
=> counts become 0 and recency becomes “very large” (no recent
activity). - Missing service => counts become 0, resolution rate
stays NA (no service usage).
customer_data <- churn_labels %>%
left_join(customer_demographics, by = "CustomerID") %>%
left_join(txn_summary, by = "CustomerID") %>%
left_join(txn_category_mix, by = "CustomerID") %>%
left_join(svc_summary, by = "CustomerID") %>%
left_join(svc_type_counts, by = "CustomerID") %>%
left_join(svc_resolution_counts, by = "CustomerID") %>%
left_join(online_features, by = "CustomerID")
max_possible_txn_gap <- as.numeric(max_txn_date - min(transaction_history$TransactionDate, na.rm = TRUE)) + 1
max_possible_svc_gap <- as.numeric(max_svc_date - min(customer_service$InteractionDate, na.rm = TRUE)) + 1
customer_data <- customer_data %>%
mutate(
had_transactions = ifelse(is.na(txn_count), 0, 1),
had_service = ifelse(is.na(svc_count), 0, 1),
txn_count = na0(txn_count),
spend_total = na0(spend_total),
spend_mean = na0(spend_mean),
spend_median = na0(spend_median),
spend_sd = na0(spend_sd),
spend_cv = na0(spend_cv),
categories_distinct = na0(categories_distinct),
active_months_txn = na0(active_months_txn),
txns_per_active_month = na0(txns_per_active_month),
days_since_last_txn = ifelse(is.na(days_since_last_txn), max_possible_txn_gap, days_since_last_txn),
svc_count = na0(svc_count),
svc_resolved = na0(svc_resolved),
svc_unresolved = na0(svc_unresolved),
svc_resolution_rate = ifelse(had_service == 0, NA_real_, svc_resolution_rate),
svc_types_distinct = na0(svc_types_distinct),
active_months_svc = na0(active_months_svc),
svcs_per_active_month = na0(svcs_per_active_month),
days_since_last_svc = ifelse(is.na(days_since_last_svc), max_possible_svc_gap, days_since_last_svc)
)
customer_data %>% summarise(rows = n(), churn_missing = sum(is.na(Churn)))## # A tibble: 1 × 2
## rows churn_missing
## <int> <int>
## 1 3097 0
Split into train and test sets
We keep the churn ratio similar in both sets using stratified
splitting.
This lets us test performance on unseen customers.
set.seed(123)
idx <- createDataPartition(customer_data$Churn, p = 0.80, list = FALSE)
train_data_raw <- customer_data[idx, ] %>% select(-CustomerID)
test_data_raw <- customer_data[-idx, ] %>% select(-CustomerID)
bind_rows(
train_data_raw %>% count(Churn) %>% mutate(Set = "Train"),
test_data_raw %>% count(Churn) %>% mutate(Set = "Test")
)## # A tibble: 4 × 3
## Churn n Set
## <fct> <int> <chr>
## 1 Churn 509 Train
## 2 NoChurn 1969 Train
## 3 Churn 127 Test
## 4 NoChurn 492 Test
Convert to model-ready features (dummies + impute + remove constants + scale)
Models need numeric inputs: - We one-hot encode categories (dummy variables) - We impute missing values (median) - We remove near-constant columns BEFORE scaling (fixes your warning) - We scale numeric features for fair learning, especially for SVM
dummy_encoder <- dummyVars(Churn ~ ., data = train_data_raw, fullRank = TRUE)
x_train <- predict(dummy_encoder, newdata = train_data_raw) %>% as.data.frame()
x_test <- predict(dummy_encoder, newdata = test_data_raw) %>% as.data.frame()
y_train <- train_data_raw$Churn
y_test <- test_data_raw$Churn
# Impute first
imputer <- preProcess(x_train, method = "medianImpute")
x_train_i <- predict(imputer, x_train)
x_test_i <- predict(imputer, x_test)
# Remove near-zero variance columns (prevents "constant variable" scaling warnings)
nzv_idx <- nearZeroVar(x_train_i)
if (length(nzv_idx) > 0) {
x_train_i <- x_train_i[, -nzv_idx, drop = FALSE]
x_test_i <- x_test_i[, colnames(x_train_i), drop = FALSE]
}
# Scale after removing constants
scaler <- preProcess(x_train_i, method = c("center","scale"))
x_train_p <- predict(scaler, x_train_i)
x_test_p <- predict(scaler, x_test_i)
tibble(train_features = ncol(x_train_p), test_features = ncol(x_test_p))## # A tibble: 1 × 2
## train_features test_features
## <int> <int>
## 1 53 53
Handle class imbalance
Churn is minority, so we balance the training set to avoid the model learning “NoChurn” too well.
use_smote <- requireNamespace("smotefamily", quietly = TRUE)
if (use_smote) {
library(smotefamily)
y_train_num <- ifelse(y_train == "Churn", 1, 0)
sm <- SMOTE(X = x_train_p, target = y_train_num, K = 5, dup_size = 0)
sm_df <- sm$data
x_train_bal <- sm_df %>% select(-class)
y_train_bal <- factor(ifelse(sm_df$class == 1, "Churn", "NoChurn"), levels = c("Churn","NoChurn"))
} else {
up <- upSample(x = x_train_p, y = y_train, yname = "Churn")
x_train_bal <- up %>% select(-Churn)
y_train_bal <- up$Churn
}
tibble(Churn = y_train_bal) %>% count(Churn) %>% mutate(pct = n/sum(n))## # A tibble: 2 × 3
## Churn n pct
## <fct> <int> <dbl>
## 1 Churn 1969 0.5
## 2 NoChurn 1969 0.5
t-SNE (visual check)
t-SNE compresses high-dimensional customer features into 2D for a
plot.
This is a diagnostic view to see if churners cluster differently.
set.seed(123)
perp <- min(30, floor((nrow(x_train_p) - 1) / 3))
tsne_fit <- Rtsne(
as.matrix(x_train_p),
dims = 2,
perplexity = perp,
pca = TRUE,
check_duplicates = FALSE,
verbose = FALSE
)
tsne_df <- tibble(
tsne1 = tsne_fit$Y[, 1],
tsne2 = tsne_fit$Y[, 2],
Churn = y_train
)
ggplot(tsne_df, aes(tsne1, tsne2, color = Churn)) +
geom_point(alpha = 0.75, size = 2) +
labs(
title = "t-SNE Projection (Train Customers)",
x = "t-SNE 1", y = "t-SNE 2"
) +
theme_minimal(base_size = 13)The t-SNE plot shows churn and non-churn customers are largely mixed with no clear separation, suggesting churn is driven by overlapping behavioural patterns rather than distinct customer clusters.
–Train models (Random Forest + SVM)
We train: - Random Forest (best performer and fairly interpretable via importance/SHAP) - SVM Radial (strong benchmark, but less interpretable and can be slower)
We keep cross-validation reasonable for runtime.
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "final"
)
set.seed(123)
rf_grid <- expand.grid(mtry = pmax(2, floor(sqrt(ncol(x_train_bal))) + (-1:1)))
rf_model <- train(
x = x_train_bal, y = y_train_bal,
method = "rf",
metric = "ROC",
trControl = ctrl,
tuneGrid = rf_grid,
ntree = 500,
importance = TRUE
)
rf_model## Random Forest
##
## 3938 samples
## 53 predictor
## 2 classes: 'Churn', 'NoChurn'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3150, 3151, 3150, 3151, 3150
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 6 0.9998491 0.9974606 0.9989848
## 7 0.9998504 0.9974606 0.9989848
## 8 0.9998582 0.9974606 0.9994924
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
set.seed(123)
svm_model <- train(
x = x_train_bal, y = y_train_bal,
method = "svmRadial",
metric = "ROC",
trControl = ctrl,
tuneLength = 8 # keep it lighter so it doesn't take forever
)
svm_model## Support Vector Machines with Radial Basis Function Kernel
##
## 3938 samples
## 53 predictor
## 2 classes: 'Churn', 'NoChurn'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 3150, 3151, 3150, 3151, 3150
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.7924572 0.7384586 0.7226993
## 0.50 0.8560225 0.7963615 0.7704486
## 1.00 0.9146381 0.8547694 0.8308908
## 2.00 0.9572964 0.9106237 0.8699991
## 4.00 0.9821262 0.9573488 0.9228194
## 8.00 0.9933106 0.9786750 0.9568386
## 16.00 0.9947555 0.9883210 0.9685176
## 32.00 0.9942313 0.9893362 0.9730861
##
## Tuning parameter 'sigma' was held constant at a value of 0.01099798
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01099798 and C = 16.
Evaluate models (ROC-AUC + PR-AUC)
Accuracy can lie with imbalanced data.
We evaluate using ROC-AUC and PR-AUC and show a simple comparison
table.
best_threshold_f1 <- function(probs, truth, positive = "Churn") {
truth_bin <- ifelse(truth == positive, 1, 0)
thresholds <- seq(0.05, 0.95, by = 0.01)
f1s <- sapply(thresholds, function(t) {
pred_bin <- ifelse(probs >= t, 1, 0)
tp <- sum(pred_bin == 1 & truth_bin == 1)
fp <- sum(pred_bin == 1 & truth_bin == 0)
fn <- sum(pred_bin == 0 & truth_bin == 1)
prec <- ifelse(tp + fp == 0, 0, tp / (tp + fp))
rec <- ifelse(tp + fn == 0, 0, tp / (tp + fn))
ifelse(prec + rec == 0, 0, 2 * prec * rec / (prec + rec))
})
thresholds[which.max(f1s)]
}
evaluate_model <- function(fit, x_test, y_test) {
probs <- predict(fit, newdata = x_test, type = "prob")[, "Churn"]
roc_obj <- pROC::roc(response = y_test, predictor = probs, levels = rev(levels(y_test)))
auc_val <- as.numeric(pROC::auc(roc_obj))
truth_bin <- ifelse(y_test == "Churn", 1, 0)
pr <- PRROC::pr.curve(scores.class0 = probs[truth_bin == 1],
scores.class1 = probs[truth_bin == 0],
curve = FALSE)
pr_auc <- pr$auc.integral
thr <- best_threshold_f1(probs, y_test)
preds <- factor(ifelse(probs >= thr, "Churn", "NoChurn"), levels = levels(y_test))
cm <- confusionMatrix(preds, y_test, positive = "Churn")
list(probs = probs, auc = auc_val, pr_auc = pr_auc, threshold = thr, cm = cm)
}
rf_res <- evaluate_model(rf_model, x_test_p, y_test)
svm_res <- evaluate_model(svm_model, x_test_p, y_test)
tibble(
Model = c("RandomForest","SVMRadial"),
ROC_AUC = c(rf_res$auc, svm_res$auc),
PR_AUC = c(rf_res$pr_auc, svm_res$pr_auc)
) %>% arrange(desc(ROC_AUC))## # A tibble: 2 × 3
## Model ROC_AUC PR_AUC
## <chr> <dbl> <dbl>
## 1 RandomForest 0.996 0.988
## 2 SVMRadial 0.931 0.899
Confusion matrix for the best model
This shows what type of errors the model makes (false positives vs
false negatives).
That matters for retention strategy decisions.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Churn NoChurn
## Churn 118 1
## NoChurn 9 491
##
## Accuracy : 0.9838
## 95% CI : (0.9705, 0.9922)
## No Information Rate : 0.7948
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9493
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.9291
## Specificity : 0.9980
## Pos Pred Value : 0.9916
## Neg Pred Value : 0.9820
## Prevalence : 0.2052
## Detection Rate : 0.1906
## Detection Prevalence : 0.1922
## Balanced Accuracy : 0.9636
##
## 'Positive' Class : Churn
##
Global “why churn” signals (Random Forest importance)
This tells us which features matter most overall.
It does not tell direction, but it’s a strong starting point for
stakeholders.
imp_raw <- varImp(rf_model, scale = TRUE)
imp_df <- imp_raw$importance %>%
as.data.frame() %>%
tibble::rownames_to_column("feature")
num_cols <- names(imp_df)[sapply(imp_df, is.numeric)]
rf_importance <- imp_df %>%
mutate(importance = if (length(num_cols) == 1) .data[[num_cols[1]]] else rowMeans(across(all_of(num_cols)))) %>%
arrange(desc(importance)) %>%
slice_head(n = 20) %>%
select(feature, importance)
ggplot(rf_importance, aes(x = reorder(feature, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = "Top 20 Drivers of Churn (Random Forest Importance)",
x = "Feature", y = "Importance"
) +
theme_minimal(base_size = 13)
Bank’s churn model is basically telling us something very human:
customers leave when they stop showing up and stop behaving like active
customers. The strongest signals are how long it has been since a
customer last logged in and their login frequency, followed by how
recently they last made a transaction. After that, churn is heavily
linked to spending behaviour (how much they spend, how consistent their
spending is, and whether their spending is becoming volatile), plus how
recently they contacted customer service. Demographics like age, income
level, and marital status matter, but they’re clearly secondary compared
to engagement and activity.
So the practical retention plan is simple: flag customers who haven’t logged in or transacted recently, especially if their spending is dropping or becoming inconsistent, and intervene early with targeted nudges (personalised offers, reminders, friction-free reactivation prompts) and service recovery where needed. In other words, monitor engagement + recency as the early-warning system, then use spending and service signals to decide what kind of retention action to take.
–Now we also focus on SHAP (SHapley Additive exPlanations), explaining, which feature actually pushed predictions up/down by this much.
SHAP explains contributions: - Positive SHAP pushes prediction toward churn. - Negative SHAP pushes away from churn.
We compute SHAP on a sample of test customers to keep runtime sensible.
pred_churn_prob <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "Churn"]
}
set.seed(123)
sample_n <- min(300, nrow(x_test_p))
sample_idx <- sample(seq_len(nrow(x_test_p)), sample_n)
X_explain <- x_test_p[sample_idx, , drop = FALSE]
shap_vals <- fastshap::explain(
object = rf_model,
X = X_explain,
pred_wrapper = pred_churn_prob,
nsim = 80,
adjust = TRUE
)
shap_long <- shap_vals %>%
as.data.frame() %>%
mutate(row_id = row_number()) %>%
pivot_longer(-row_id, names_to = "feature", values_to = "shap")
shap_global <- shap_long %>%
group_by(feature) %>%
summarise(mean_abs_shap = mean(abs(shap)), .groups = "drop") %>%
arrange(desc(mean_abs_shap)) %>%
slice_head(n = 20)
ggplot(shap_global, aes(x = reorder(feature, mean_abs_shap), y = mean_abs_shap)) +
geom_col() +
coord_flip() +
labs(
title = "Top 20 Drivers of Churn (SHAP Mean |Contribution|)",
x = "Feature", y = "Mean |SHAP|"
) +
theme_minimal(base_size = 13)This SHAP chart is basically the “no excuses” version of your story: the biggest reason customers churn is that they stop engaging. The top driver is LoginFrequency (customers who log in less are far more likely to leave), and right after that is days_since_last_svc, meaning recent or frequent service interactions are strongly tied to churn risk (often a sign of friction, complaints, or unresolved issues). Then the next block is all about spending behaviour: median/mean/total spend and how stable it is (spend_cv and spend_sd). In plain terms, churners tend to either spend less or their spending becomes inconsistent and volatile, which is a classic “customer drifting away” signal. Recency still matters too (days_since_last_txn and days_since_last_login), plus some minor timing/channel patterns (weekday, website usage), which are more “behavioural fingerprints” than core causes. So the solution is: use login drop + recent service friction + changing spend patterns as your early-warning triggers, then act with targeted retention: reactivation nudges for low logins, service recovery for recent service users, and personalised offers for customers whose spending is declining or becoming erratic.
SHAP for one high-risk customer (clear indicators)
We pick the customer in the sample with the highest churn probability and list the top reasons.
probs_sample <- pred_churn_prob(rf_model, X_explain)
top_case <- which.max(probs_sample)
case_df <- tibble(
feature = colnames(shap_vals),
shap = as.numeric(shap_vals[top_case, ])
) %>%
arrange(desc(abs(shap))) %>%
slice_head(n = 12)
case_df## # A tibble: 12 × 2
## feature shap
## <chr> <dbl>
## 1 spend_sd 0.0777
## 2 LoginFrequency 0.0650
## 3 spend_mean 0.0575
## 4 spend_total 0.0548
## 5 spend_cv 0.0542
## 6 Age 0.0431
## 7 days_since_last_svc 0.0411
## 8 spend_median 0.0333
## 9 login_weekday^4 0.0303
## 10 days_since_last_login 0.0268
## 11 days_since_last_txn 0.0241
## 12 login_weekday^6 0.0133
So above is the “why THIS specific customer is predicted to churn” list. For the highest-risk customer, the model indicates churn is mainly driven by unstable spending behaviour (high spending variability and inconsistency), combined with reduced engagement (lower login frequency and greater time since last login). Additional contributors include the customer’s overall spending level (mean/median/total spend patterns), recent customer service behaviour, and age, but the dominant signal is that the customer’s activity has become irregular and is drifting. This suggests a targeted retention action should focus on re-engagement (bringing them back to regular logins) and stabilising usage/spend, alongside proactive service follow-up if friction is present.