In this customer churn assessment, a dataset with 19 features, an attrition flag and over 10,000 records was used to create a model to predict if a particular client would cancel their credit card membership. R and R Studio were used to perform both exploratory data analysis (EDA) and machine learning classification modeling.
The EDA process revealed some interesting insights to assist our modeling process. I looked at various metrics by certain dimensions to assess what features had the biggest impact on attrition rate.
The metrics included:
Average Attrition Rate (AvgAR)
Average Income (AvgInc)
Average Number of Dependents (AvgDep)
Average Credit Limit (AvgCL)
Average Total Revolving Balance (AvgTRB)
Average Total Transaction Amount (AvgTTA)
Average Total Transaction Count (AvgTTC)
Average Utility Ratio (AvgUR)
Average (Months Inactive (AvgMI)
Summaries were displayed by the following dimensions:
Gender
Age
Education Level
Income
Card Category
Marital Status
Months Inactive
Although not directly germane to the analysis, it was interesting that despite having almost a double utilization ratio of 34% versus 20% for males, females have less than half the credit limit of males 5,023 versus 12,685. This may be due to the average income level of females is nearly a third of males. Yet, both male and female customers have roughly 2.3 dependents on the card.
Additionally, low income members rely heavily on credit use while higher income members were less reliant on credit use.
Initial indications identified Gender, Months Inactive and Education level as having significant influence on attrition rate. Female attrition rate was 17.3% versus male at 14.6%. As members become less active (greater number of months inactive), attrition rate increased peaking at 29.9% at 4 months. Finally, members with a doctorate education level had an attrition rate of 21.1%.
For feature selection, I used the roc curve area method to rank the
significance of the features. The results coincided with the initial
EDA. Here are the top 5 features by ROC curve area:
Feature Score
Months_Inactive_12_mon 7.02
Gender 3.75
Total_Ct_Chng_Q4_Q1 2.88
Education_Level 2.82
Contacts_Count_12_mon 2.71
Using the above features, seven classification models were
deployed:
Logistic Regression
K-Nearest Neighbor
Support Vector Machine
Kernel SVM
Naive Bayes
Decision Tree
Random Forest
Using a combination of accuracy and recall (sensitivity) performance metrics, the Random Forest model with an accuracy of 86.9% and a recall 35.6%, both top among all models, performed best among all. Given that the class of “Attrited Customer” was very low, meaning a model could simply predict all customers to be “Existing” and still produce an accuracy of 83.9% (as did the Support Vector Machine model), the recall measure, which measures how well “Attrited Customer” was predicted, was used as well.
Once a model is created, there are two ways to integrate into the
business.
1. Run the model against all existing customers on a periodic basis to
identify which customers were at highest risk of attrition. Then target
these customers with special rates or offers or simply contact them and
provide better servicing.
2. Use the most important features to set triggers to reach out to and
provide offers to customers that were approaching a high risk level. An
example would be to set triggers to alert when customers became inactive
for a certain period as “Months Inactive” has a high influence on
attrition rate.
#rm(list=ls())
#gc()
## Using data table library for quicker and easier analysis
library(data.table)
library(ggplot2)
library(scales)
library(knitr)
directory <- 'C:/Users/ThomasJardine/OneDrive - 0to1.one/NY Life Assessment/'
file_name <- 'BankChurners.csv'
file <- paste0(directory, file_name)
## Import dataset
BankChurners <- fread(file)
BankChurners <- BankChurners[,1:(length(BankChurners)-2)]
Despite having almost a double utilization ratio, females have less than half the credit limit of males. This may be due to the average income level of females is nearly a third of males. Yet, both male and female customers have roughly 2.3 dependents on the card.
Females have a higher attrition rate to males, 17.4% vs 14.6%.
The age group (45,55] has the highest average salary ($60,662) and a commensurate highest credit limit but also has the highest attrition rate of 16.6% (excluding the 10 members in the (65-75] age range).
Interesting that while there seems to be no variation in the measures (income, credit limit, utilization, etc.) by education level, the doctorate attrition rate is very high at 21,1%.
No surprise that average credit limit rises as income rises but utilization rate is highest among the low income and decreases as income increases with the highest income cohort having the lowest utilization rate of 12.5%.
There seems to be no differentiation of metrics by marital status.
Following a similar, and not surprising, pattern as income, Blue card members, with the lowest average income and credit limit, have the highest utility ratio. Platinum card holders have the highest attrition rate (albeit only 20 members in the sample).
There is a clear and somewhat intuitive relationship between number of months inactive and attrition rate notwithstanding the 0, 5 and 6 months which given the low count are not significant.
Low income card members are heavily reliant on credit use.
## Summarize BankChurners
summary(BankChurners)
## CLIENTNUM Attrition_Flag Customer_Age Gender
## Min. :708082083 Length:10127 Min. :26.00 Length:10127
## 1st Qu.:713036770 Class :character 1st Qu.:41.00 Class :character
## Median :717926358 Mode :character Median :46.00 Mode :character
## Mean :739177606 Mean :46.33
## 3rd Qu.:773143533 3rd Qu.:52.00
## Max. :828343083 Max. :73.00
## Dependent_count Education_Level Marital_Status Income_Category
## Min. :0.000 Length:10127 Length:10127 Length:10127
## 1st Qu.:1.000 Class :character Class :character Class :character
## Median :2.000 Mode :character Mode :character Mode :character
## Mean :2.346
## 3rd Qu.:3.000
## Max. :5.000
## Card_Category Months_on_book Total_Relationship_Count
## Length:10127 Min. :13.00 Min. :1.000
## Class :character 1st Qu.:31.00 1st Qu.:3.000
## Mode :character Median :36.00 Median :4.000
## Mean :35.93 Mean :3.813
## 3rd Qu.:40.00 3rd Qu.:5.000
## Max. :56.00 Max. :6.000
## Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit
## Min. :0.000 Min. :0.000 Min. : 1438
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 2555
## Median :2.000 Median :2.000 Median : 4549
## Mean :2.341 Mean :2.455 Mean : 8632
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:11068
## Max. :6.000 Max. :6.000 Max. :34516
## Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt
## Min. : 0 Min. : 3 Min. :0.0000 Min. : 510
## 1st Qu.: 359 1st Qu.: 1324 1st Qu.:0.6310 1st Qu.: 2156
## Median :1276 Median : 3474 Median :0.7360 Median : 3899
## Mean :1163 Mean : 7469 Mean :0.7599 Mean : 4404
## 3rd Qu.:1784 3rd Qu.: 9859 3rd Qu.:0.8590 3rd Qu.: 4741
## Max. :2517 Max. :34516 Max. :3.3970 Max. :18484
## Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## Min. : 10.00 Min. :0.0000 Min. :0.0000
## 1st Qu.: 45.00 1st Qu.:0.5820 1st Qu.:0.0230
## Median : 67.00 Median :0.7020 Median :0.1760
## Mean : 64.86 Mean :0.7122 Mean :0.2749
## 3rd Qu.: 81.00 3rd Qu.:0.8180 3rd Qu.:0.5030
## Max. :139.00 Max. :3.7140 Max. :0.9990
## Column Description of BankChurners
str(BankChurners)
## Classes 'data.table' and 'data.frame': 10127 obs. of 21 variables:
## $ CLIENTNUM : int 768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
## $ Attrition_Flag : chr "Existing Customer" "Existing Customer" "Existing Customer" "Existing Customer" ...
## $ Customer_Age : int 45 49 51 40 40 44 51 32 37 48 ...
## $ Gender : chr "M" "F" "M" "F" ...
## $ Dependent_count : int 3 5 3 4 3 2 4 0 3 2 ...
## $ Education_Level : chr "High School" "Graduate" "Graduate" "High School" ...
## $ Marital_Status : chr "Married" "Single" "Married" "Unknown" ...
## $ Income_Category : chr "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
## $ Card_Category : chr "Blue" "Blue" "Blue" "Blue" ...
## $ Months_on_book : int 39 44 36 34 21 36 46 27 36 36 ...
## $ Total_Relationship_Count: int 5 6 4 3 5 3 6 2 5 6 ...
## $ Months_Inactive_12_mon : int 1 1 1 4 1 1 1 2 2 3 ...
## $ Contacts_Count_12_mon : int 3 2 0 1 0 2 3 2 0 3 ...
## $ Credit_Limit : num 12691 8256 3418 3313 4716 ...
## $ Total_Revolving_Bal : int 777 864 0 2517 0 1247 2264 1396 2517 1677 ...
## $ Avg_Open_To_Buy : num 11914 7392 3418 796 4716 ...
## $ Total_Amt_Chng_Q4_Q1 : num 1.33 1.54 2.59 1.4 2.17 ...
## $ Total_Trans_Amt : int 1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
## $ Total_Trans_Ct : int 42 33 20 20 28 24 31 36 24 32 ...
## $ Total_Ct_Chng_Q4_Q1 : num 1.62 3.71 2.33 2.33 2.5 ...
## $ Avg_Utilization_Ratio : num 0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
## - attr(*, ".internal.selfref")=<externalptr>
## Additional columns
## Attrition flag as num
BankChurners$Attrition_Flag_Num <- as.numeric(factor(BankChurners$Attrition_Flag))-1
## Age Group Bucket
BankChurners$Customer_Age_Bucket <- cut(x = BankChurners$Customer_Age, breaks = c(25, 35, 45, 55, 65, 75))
## Gender as num
BankChurners$Gender_Num <- as.numeric(factor(BankChurners$Gender))-1
## Education level as num
levs <- c("Uneducated", "Unknown", "High School", "College", "Graduate", "Post-Graduate", "Doctorate")
BankChurners$Education_Level_Num <- as.numeric(factor(BankChurners$Education_Level, levels = levs ))
## Marital status as num
BankChurners$Marital_Status_Num <- as.numeric(factor(BankChurners$Marital_Status))
## Income_Category as num
BankChurners$Income <- ifelse(BankChurners$Income_Category == '$60K - $80K', 70000,
ifelse(BankChurners$Income_Category == '$40K - $60K', 50000,
ifelse(BankChurners$Income_Category == '$120K +', 150000,
ifelse(BankChurners$Income_Category == 'Less than $40K', 20000,
ifelse(BankChurners$Income_Category == '$80K - $120K', 100000, 50000)))))
## Card_Category as num
levs <- c("Blue", "Silver", "Gold", "Platinum")
BankChurners$Card_Category_Num <- as.numeric(factor(BankChurners$Card_Category, levels = levs ))
Reviewing the relationships among the numerical features including the transformed numeric features.
The “Average Open to Buy” variable is highly correlated to Credit Limit, Card Category, Income and Gender. Gender is correlated to Income. Average Utilization Ratio, while highly correlated to Total Balance, is negatively correlated to Average Open to Buy and Credit Limit. Attrition Flag Num is correlated to Total Trans Ct and Total Ct Chg Q4 and negatively correlated to Contacts Count and Months Inactive (recall “Existing Customer” = 1 and “Attrited Customer” = 0.
## Correlation Matrix Plot
res <- cor(BankChurners[,c(3,5,10:22,24:28)])
#install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust", tl.cex = 0.6, diag = FALSE,
tl.col = "black", tl.srt = 45)
## Summary by Gender
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Gender')]
)
| Gender | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| M | 4769 | 14.6% | $85,474.94 | 2.4 | $12,685.67 | $1,188.43 | $4,493.71 | 63.2 | 20.0% | 2.3 |
| F | 5358 | 17.4% | $31,612.54 | 2.3 | $5,023.85 | $1,140.01 | $4,324.32 | 66.4 | 34.2% | 2.4 |
Despite having almost a double utilization ratio, females have less than half the credit limit of males. This may be due to the average income level of females is nearly a third of males. Yet, both male and female customers have roughly 2.3 dependents on the card.
#BankChurners[,.N,by=c('Gender', 'Attrition_Flag')][,perc:=100*N/sum(N),by='Gender']
#[Attrition_Flag=='Attrited #Customer',.SD]
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Gender'][order(Gender)],
aes(x=Gender, y=perc, width=tot/10000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Gender') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
# geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
Females have a higher attrition rate to males, 17.4% vs 14.6%.
## Summary by Age Bucket
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Customer_Age_Bucket')][order(Customer_Age_Bucket)]
)
| Customer_Age_Bucket | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| (25,35] | 919 | 13.3% | $49,118.61 | 1.4 | $7,220.70 | $1,159.10 | $4,505.69 | 64.9 | 30.1% | 2.2 |
| (35,45] | 3742 | 16.2% | $56,378.94 | 2.8 | $8,859.21 | $1,161.93 | $4,483.30 | 65.9 | 26.8% | 2.3 |
| (45,55] | 4135 | 16.6% | $60,662.64 | 2.5 | $9,166.34 | $1,152.80 | $4,471.54 | 65.7 | 26.7% | 2.4 |
| (55,65] | 1321 | 15.8% | $52,747.92 | 1.3 | $7,319.78 | $1,196.07 | $3,919.87 | 59.4 | 29.9% | 2.4 |
| (65,75] | 10 | 20.0% | $38,000 | 0.3 | $5,656.93 | $1,583.30 | $1,498.30 | 31.5 | 40.5% | 2.9 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Customer_Age_Bucket'][order(Customer_Age_Bucket)],
aes(x=Customer_Age_Bucket, y=perc, width=tot/5000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Age') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
# geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
The age group (45,55] has the highest average salary ($60,662) and a commensurate highest credit limit but also has the highest attrition rate of 16.6% (excluding the 2 members in the (65-75] age range).
## Summary by Education Level
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Education_Level')]
)
| Education_Level | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| High School | 2013 | 15.2% | $58,002.98 | 2.3 | $8,605.82 | $1,194.35 | $4,403.74 | 64.7 | 27.8% | 2.3 |
| Graduate | 3128 | 15.6% | $55,936.70 | 2.3 | $8,566.10 | $1,162.38 | $4,418.49 | 64.9 | 27.8% | 2.4 |
| Uneducated | 1487 | 15.9% | $57,390.72 | 2.4 | $8,899.51 | $1,154.08 | $4,484.34 | 65.3 | 27.2% | 2.4 |
| Unknown | 1519 | 16.9% | $56,892.69 | 2.4 | $8,491.80 | $1,160.45 | $4,420.05 | 64.7 | 27.6% | 2.3 |
| College | 1013 | 15.2% | $57,936.82 | 2.4 | $8,684.54 | $1,135.78 | $4,255.81 | 65 | 26.5% | 2.4 |
| Post-Graduate | 516 | 17.8% | $56,763.57 | 2.4 | $8,862.56 | $1,187.67 | $4,515.41 | 65 | 28.1% | 2.3 |
| Doctorate | 451 | 21.1% | $56,629.71 | 2.3 | $8,413.26 | $1,094.08 | $4,193.02 | 63.8 | 26.1% | 2.4 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Education_Level'][order(Education_Level)],
aes(x=Education_Level, y=perc, width=tot/3000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Education') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
# geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
Interesting that while there seems to be no variation in the measures (income, credit limit, utilization, etc.) by education level, the doctorate attrition rate is very high at 21%.
## Summary by Income
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Income_Category')]
)
| Income_Category | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| $60K - $80K | 1402 | 13.5% | $70,000 | 2.4 | $10,758.77 | $1,154.99 | $4,450.71 | 63.2 | 20.6% | 2.3 |
| Less than $40K | 3561 | 17.2% | $20,000 | 2.3 | $3,754.40 | $1,145.14 | $4,363.42 | 66.2 | 37.7% | 2.4 |
| $80K - $120K | 1535 | 15.8% | $100,000 | 2.5 | $15,809.89 | $1,206.73 | $4,482.96 | 62.7 | 16.2% | 2.3 |
| $40K - $60K | 1790 | 15.1% | $50,000 | 2.3 | $5,462.26 | $1,171.99 | $4,405.62 | 65 | 31.7% | 2.3 |
| $120K + | 727 | 17.3% | $150,000 | 2.5 | $19,717.32 | $1,222.20 | $4,529.47 | 63.7 | 12.5% | 2.3 |
| Unknown | 1112 | 16.8% | $50,000 | 2.3 | $9,516.58 | $1,115.06 | $4,282.21 | 66.3 | 22.1% | 2.4 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Income_Category'][order(Income_Category)],
aes(x=Income_Category, y=perc, width=tot/3000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Income') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
# geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
No surprise that average credit limit rises as income rises but utilization rate is highest among the low income and decreases as income increases with the highest income cohort having the lowest utilization rate of 12.5%.
## Summary by Marital Status
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Marital_Status')]
)
| Marital_Status | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| Married | 4687 | 15.1% | $57,793.90 | 2.4 | $8,076.66 | $1,197.16 | $4,173.42 | 61.7 | 29.0% | 2.3 |
| Single | 3943 | 16.9% | $55,883.84 | 2.3 | $8,999.68 | $1,124.24 | $4,594.55 | 67.8 | 26.4% | 2.4 |
| Unknown | 749 | 17.2% | $58,064.09 | 2.5 | $9,445.28 | $1,157.75 | $4,720.01 | 67.1 | 25.6% | 2.3 |
| Divorced | 748 | 16.2% | $56,537.43 | 2.4 | $9,358.57 | $1,155.98 | $4,529.09 | 66.9 | 25.7% | 2.3 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Marital_Status'][order(Marital_Status)],
aes(x=Marital_Status, y=perc, width=tot/5000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Marital Status') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
# geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
There seems to be no differentiation of metrics by marital status.
## Summary by Card Category
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Card_Category')]
)
| Card_Category | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| Blue | 9436 | 16.1% | $56,106.40 | 2.3 | $7,363.78 | $1,157.81 | $4,225.41 | 64 | 29.1% | 2.3 |
| Gold | 116 | 18.1% | $73,362.07 | 2.7 | $28,416.37 | $1,344.32 | $7,685.61 | 81.5 | 5.7% | 2.3 |
| Silver | 555 | 14.8% | $67,783.78 | 2.4 | $25,277.84 | $1,206.13 | $6,590.48 | 74.7 | 5.7% | 2.3 |
| Platinum | 20 | 25.0% | $73,000 | 2.4 | $30,283.45 | $1,267.95 | $8,999.75 | 87 | 4.4% | 2.2 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Card_Category'][order(Card_Category)],
aes(x=Card_Category, y=perc, width=tot/10000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Card Category') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
Following a similar, and not surprising, pattern as income, Blue card members, with the lowest average income nd credit limit, have the highest utility ratio. Platinum card holders have the highest attrition rate (albeit only 20 members in the sample).
## Summary by Months Inactive
kable(
(BankChurners)[, .(.N
, AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
, AvgInc=dollar(mean(Income))
, AvgDep=format(mean(Dependent_count), digits=2)
, AvgCL=dollar(mean(Credit_Limit))
, AvgTRB=dollar(mean(Total_Revolving_Bal))
, AvgTTA=dollar(mean(Total_Trans_Amt))
, AvgTTC=format(mean(Total_Trans_Ct), digits=3)
, AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
, AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
, by = c('Months_Inactive_12_mon')][order(Months_Inactive_12_mon)]
)
| Months_Inactive_12_mon | N | AvgAR | AvgInc | AvgDep | AvgCL | AvgTRB | AvgTTA | AvgTTC | AvgUR | AvgMI |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29 | 51.7% | $44,137.93 | 2.1 | $7,014.27 | $714.90 | $3,156.34 | 54.4 | 15.4% | 0 |
| 1 | 2233 | 4.5% | $56,892.07 | 2.4 | $8,628.31 | $1,230.62 | $4,672.82 | 67.3 | 28.8% | 1 |
| 2 | 3282 | 15.4% | $57,922.00 | 2.4 | $8,925.32 | $1,167.45 | $4,401.07 | 65 | 27.1% | 2 |
| 3 | 3846 | 21.5% | $56,627.67 | 2.3 | $8,561.65 | $1,138.31 | $4,321.20 | 63.7 | 27.1% | 3 |
| 4 | 435 | 29.9% | $57,724.14 | 2.3 | $7,916.64 | $1,048.37 | $3,769.31 | 60.7 | 26.8% | 4 |
| 5 | 178 | 18.0% | $51,629.21 | 2.4 | $7,515.52 | $1,131.32 | $4,688.28 | 67.8 | 31.1% | 5 |
| 6 | 124 | 15.3% | $52,419.35 | 2.2 | $7,603.88 | $1,130.56 | $4,326.02 | 66.9 | 28.1% | 6 |
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
by='Months_Inactive_12_mon'][order(Months_Inactive_12_mon)],
aes(x=Months_Inactive_12_mon, y=perc, width=tot/4000, fill=AvgUR)) +
geom_bar(stat='identity') +
ggtitle('Attrition Rate by Months Inactive') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
ylab('Attrition Rate/(Count)') +
labs(caption = "width of bars is count size")
There is a clear and somewhat intuitive relationship between number of months inactive and attrition rate notwithstanding the 0, 5 and 6 months which given the low count are not significant.
ggplot(data=
BankChurners[,.(perc=mean(Avg_Utilization_Ratio)*100,tot=.N),
by='Income'],
aes(x=Income, y=perc, width=tot*4)) +
geom_bar(stat='identity') +
ggtitle('Utility Ratio By Income') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
ylab('Utility Ratio/(Count)') +
labs(caption = "width of bars is count size")
ggplot(data=
BankChurners[,.(perc=mean(Credit_Limit),tot=.N),
by='Income'],
aes(x=Income, y=perc, width=tot*6)) +
geom_bar(stat='identity') +
ggtitle('Credit Limit By Income') +
theme(plot.title = element_text(hjust = 0.5)) +
geom_text(aes(label = paste(dollar(perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
ylab('Utility Ratio/(Count)') +
labs(caption = "width of bars is count size")
The feature selection results found here are inline with our
conclusions from the EDA exercise. For feature selection, I used the ROC
curve area method to rank the significance of the features.
Feature Score
Months_Inactive_12_mon 7.02
Gender 3.75
Total_Ct_Chng_Q4_Q1 2.88
Education_Level 2.82
Contacts_Count_12_mon 2.71
The following transformations were performed in the above “Additional Columns” code chunk. The categorical features, Gender and Education Level, were transformed into numerical representations. Because Education Level has a hierarchy associated to it, thee numerical association was customized to rank as follows: “Uneducated”, “Unknown”, “High School”, “College”, “Graduate”, “Post-Graduate”, “Doctorate”.
Additionally, each feature, except Gender because it is a binary feature, was scaled to standardize the column with a mean of 0 and standard deviation of 1. Not all models require scaling but using scaling does not impact the results.
Finally, the dependent variable, Attrition_Flag_Num, is transformed to a factor in R. Again not required by all models but is okay for all models.
## I use the ROC Curve area as a score to rank feature relevance in the dataset
library(caret)
#use roc_curve area as score
## takes 20+ mins
roc_imp <- filterVarImp(x = BankChurners[,3:21], y = BankChurners$Attrition_Flag_Num)
#sort the score in decreasing order
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
## variable score
## 10 Months_Inactive_12_mon 7.020371e+00
## 2 Gender 3.753000e+00
## 18 Total_Ct_Chng_Q4_Q1 2.886986e+00
## 4 Education_Level 2.820714e+00
## 11 Contacts_Count_12_mon 2.713653e+00
## 13 Total_Revolving_Bal 1.846161e+00
## 19 Avg_Utilization_Ratio 1.824920e+00
## 17 Total_Trans_Ct 1.792174e+00
## 9 Total_Relationship_Count 1.412392e+00
## 6 Income_Category 1.357638e+00
## 12 Credit_Limit 9.761108e-01
## 5 Marital_Status 7.259596e-01
## 7 Card_Category 5.845796e-01
## 8 Months_on_book 3.687921e-01
## 1 Customer_Age 2.184954e-01
## 3 Dependent_count 1.999022e-01
## 16 Total_Trans_Amt 2.975181e-11
## 14 Avg_Open_To_Buy 8.230264e-12
## 15 Total_Amt_Chng_Q4_Q1 2.815500e-13
## model dataset construction
dataset <- BankChurners[,c(24,25,12,13,20,22)]
dataset$Attrition_Flag_Num <- factor(dataset$Attrition_Flag_Num, levels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split <- sample.split(dataset$Attrition_Flag_Num, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# Feature Scaling
training_set[,2] <- scale(training_set[,2])
training_set[,3] <- scale(training_set[,3])
training_set[,4] <- scale(training_set[,4])
training_set[,5] <- scale(training_set[,5])
test_set[,2] <- scale(test_set[,2])
test_set[,3] <- scale(test_set[,3])
test_set[,4] <- scale(test_set[,4])
test_set[,5] <- scale(test_set[,5])
I will employ a series of well known classification models
including:
Logistic Regression
K-Nearest Neighbor
Support Vector Machine
Kernel SVM
Naive Bayes
Decision Tree
Random Forest
See below for the training of each model.
# Fitting Logistic Regression to the Training set
classifier = glm(formula = Attrition_Flag_Num ~ .,
family = binomial,
data = training_set)
# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[,-6])
y_pred = ifelse(prob_pred > 0.5, 1, 0)
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
#paste("Accuracy = ",sprintf("%0.1f%%", (1-(cm[1,2]+cm[2,1])/dim(test_set)[1])*100))
lr_acc <- sum(diag(cm)) / sum(cm)
lr_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
lr_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
lr_F1 <- (2*lr_precision*lr_recall)/(lr_precision+lr_recall)
# Fitting K Nearest Neighbor to the Training set
library(class)
y_pred = knn(train = training_set[, -6],
test = test_set[, -6],
cl = training_set[[6]],
k = 5,
prob = TRUE)
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
knn_acc <- sum(diag(cm)) / sum(cm)
knn_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
knn_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
knn_F1 <- (2*knn_precision*knn_recall)/(knn_precision+knn_recall)
# Fitting Support Vector Machine to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Attrition_Flag_Num ~ .,
data = training_set,
type = 'C-classification',
kernel = 'linear')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
svm_acc <- sum(diag(cm)) / sum(cm)
svm_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
svm_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
svm_F1 <- (2*svm_precision*svm_recall)/(svm_precision+svm_recall)
# Fitting Kernel SVM to the Training set
library(e1071)
classifier = svm(formula = Attrition_Flag_Num ~ .,
data = training_set,
type = 'C-classification',
kernel = 'radial')
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
ksvm_acc <- sum(diag(cm)) / sum(cm)
ksvm_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
ksvm_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
ksvm_F1 <- (2*ksvm_precision*ksvm_recall)/(ksvm_precision+ksvm_recall)
# Fitting Naive Bayes to the Training set
library(e1071)
classifier = naiveBayes(x = training_set[,1:5],
y = training_set$Attrition_Flag_Num)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
nb_acc <- sum(diag(cm)) / sum(cm)
nb_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
nb_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
nb_F1 <- (2*nb_precision*nb_recall)/(nb_precision+nb_recall)
# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
training_set$Attrition_Flag_Num = factor(training_set$Attrition_Flag_Num, levels = c(0, 1))
test_set$Attrition_Flag_Num = factor(test_set$Attrition_Flag_Num, levels = c(0, 1))
classifier = rpart(formula = Attrition_Flag_Num ~ .,
data = training_set)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,-6], type = 'class')
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
dt_acc <- sum(diag(cm)) / sum(cm)
dt_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
dt_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
dt_F1 <- (2*dt_precision*dt_recall)/(dt_precision+dt_recall)
# Plotting the tree
## recreate training and test dataset
set.seed(123)
split = sample.split(dataset$Attrition_Flag_Num, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
training_set$Attrition_Flag_Num = factor(training_set$Attrition_Flag_Num, levels = c(0, 1))
test_set$Attrition_Flag_Num = factor(test_set$Attrition_Flag_Num, levels = c(0, 1))
classifier = rpart(formula = Attrition_Flag_Num ~ .,
data = training_set)
library(rpart.plot)
prp(classifier)
# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[,1:5],
y = training_set$Attrition_Flag_Num,
ntree = 500)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,-6])
# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix
rf_acc <- sum(diag(cm)) / sum(cm)
rf_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
rf_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
rf_F1 <- (2*rf_precision*rf_recall)/(rf_precision+rf_recall)
In machine learning, models have two types of parameters. Those “learned” by training the model (ex. coefficients in a regressions model) and those parameters the user can adjust or hyperparameters. Adjusting these hyperparameters is known as tuning a model or optimizing the training.
In R the library “caret” has an automatic hyperparameter selection function which I use below. In the interest of simplicity, I tune only the Random Forest model as this is the model with the highest accuracy and recall.
# Grid Search to find the best parameters
# install.packages('caret')
# this code chunk takes several minutes to run
library(caret)
classifier <- train(form = Attrition_Flag_Num ~ ., data = training_set, method = 'rf')
classifier
## Random Forest
##
## 7595 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 7595, 7595, 7595, 7595, 7595, 7595, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8615113 0.3796700
## 3 0.8471441 0.3546715
## 5 0.8336159 0.3275144
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
The optimal Random Forest model chosen uses mtry = 2 where:
mtry - Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
Notice that the accuracy using the optimal model (mtry = 2) is 86.1% or less than the accuracy I obtained above when training the Random Forest model, 86.9%. This is because I only calculated the accuracy using one training set and one test set while the function above takes the average of accuracies from 25 repetitions of sets.
Accuracy, the percentage of correct predictions a model makes, is an excellent metric to evaluate a model’s performance. By creating a confusion matrix from the vector of predictions and vector of actual results, the accuracy is calculated by totaling the confusion matrix diagnol and dividing by the number of records. The accuracies of each model are below.
Additionally, given that the class of “Attrited Customer” was very low, meaning a model could simply predict all customers to be “Existing” and still produce an accuracy of 83.9% (as did the Support Vector Machine model), the recall measure which measures how well “Attrited Customer” was used. Recall is calculated using the confusion matrix: TruePositive / (True Positive + FalseNegative).
NB. This recall calculation rewards predicting an attrited customer because simply predicting existing customers (and missing potential attrited customers) doesn’t help. Plus, the cost of identifying a falsely identified potential attrited customer (with special offers etc.)is minimal while missing a potential attrited customer and not providing them special offers is high.
Finally, these accuracy measures are the result of just one set of training and test selections. A more robust method of calculating accuracies (and other performance metrics) is to employ a k-fold cross validation. This involves training a model on k different selections of a training and test set from the dataset and averaging the resulting metrics. Below I apply a 10-fold cross validation on the Random Forest model.
Model Accuracy and Recall metrics:
Logistic Regression Accuracy: 86.4% Recall: 23.8%
K-Nearest Neighbor Accuracy: 85.4% Recall: 30.7%
Support Vector Machine Accuracy: 83.9% Recall: 0.0%
Kernel SVM Accuracy: 86.4% Recall: 22.9%
Naive Bayes Accuracy: 86.1% Recall: 18.2%
Decision Tree Accuracy: 86.8% Recall: 33.4%
Random Forest Accuracy: 86.9% Recall: 35.6%
Using a combination of accuracy and recall (sensitivity) performance metrics, the Random Forest model with an accuracy of 86.9% and a recall 35.6%, both top among all models, performed best among all.
# K-Fold Cross Validation
library(caret)
folds <- createFolds(training_set$Attrition_Flag_Num, k = 10)
cv <- lapply(folds, function(x) {
training_fold = training_set[-x, ]
test_fold = training_set[x, ]
classifier = randomForest(x = training_fold[,1:5],
y = training_fold$Attrition_Flag_Num,
ntree = 500)
# Predicting the Test set results
y_pred <- predict(classifier, newdata = test_fold[,-6])
cm <- as.matrix(table(Actual = test_fold[[6]], Predicted = y_pred)) # create the confusion matrix
accuracy <- (cm[1,1] + cm[2,2]) / sum(cm)
return(accuracy)
})
rf_kfold_accuracy <- mean(as.numeric(cv))
Random Forest 10-Fold Average Accuracy: 86.7%
Once a model is created, there are two ways to integrate into the business. 1. Run the model against all existing customers on a periodic basis to identify which customers were at highest risk of attrition. Then target these customers with special rates or offers or simply contact them and provide better servicing. 2. Use the most important features to set triggers to reach out to and provide offers to customers that were approaching a high risk level. An example would be to set triggers to alert when customers became inactive for a certain period as “Months Inactive” has a high influence on attrition rate.
Using this model to target customers at high risk of attrition can improve the client servicing model and reduce the number of exiting clients.