Document

Executive Summary

In this customer churn assessment, a dataset with 19 features, an attrition flag, and over 10,000 records was used to build a model that predicts whether a particular client will cancel their credit card membership. R and RStudio were used to perform both exploratory data analysis (EDA) and machine learning classification modeling.

The EDA process revealed some interesting insights to assist our modeling process. I looked at various metrics by certain dimensions to assess what features had the biggest impact on attrition rate.

The metrics included:
Average Attrition Rate (AvgAR)
Average Income (AvgInc)
Average Number of Dependents (AvgDep)
Average Credit Limit (AvgCL)
Average Total Revolving Balance (AvgTRB)
Average Total Transaction Amount (AvgTTA)
Average Total Transaction Count (AvgTTC)
Average Utility Ratio (AvgUR)
Average Months Inactive (AvgMI)

Summaries were displayed by the following dimensions:
Gender
Age
Education Level
Income
Card Category
Marital Status
Months Inactive

Although not directly germane to the analysis, it was interesting that despite having nearly double the utilization ratio of males (34% versus 20%), females have less than half the credit limit of males ($5,023 versus $12,685). This may be because the average income level of females is roughly a third that of males. Yet both male and female customers have roughly 2.3 dependents on the card.

Additionally, low-income members rely heavily on credit, while higher-income members are far less reliant on it.

Initial indications identified Gender, Months Inactive, and Education Level as having significant influence on attrition rate. The female attrition rate was 17.4% versus 14.6% for males. As members become less active (a greater number of months inactive), the attrition rate increased, peaking at 29.9% at 4 months. Finally, members with a doctorate education level had an attrition rate of 21.1%.

For feature selection, I used the ROC curve area method to rank the significance of the features. The results coincided with the initial EDA. Here are the top 5 features by ROC curve area:
Feature Score
Months_Inactive_12_mon 7.02
Gender 3.75
Total_Ct_Chng_Q4_Q1 2.88
Education_Level 2.82
Contacts_Count_12_mon 2.71

Using the above features, seven classification models were deployed:
Logistic Regression
K-Nearest Neighbor
Support Vector Machine
Kernel SVM
Naive Bayes
Decision Tree
Random Forest

Using a combination of accuracy and recall (sensitivity) performance metrics, the Random Forest model, with an accuracy of 86.9% and a recall of 35.6% (both the highest among all models), performed best. Because the proportion of “Attrited Customer” records was low, a model could simply predict all customers to be “Existing” and still produce an accuracy of 83.9% (as the Support Vector Machine model did), so the recall measure, which captures how well “Attrited Customer” cases were identified, was used as well.

Once a model is created, there are two ways to integrate it into the business.
1. Run the model against all existing customers on a periodic basis to identify which customers are at highest risk of attrition, then target those customers with special rates or offers, or simply contact them and provide better servicing.
2. Use the most important features to set triggers to reach out to, and provide offers to, customers who are approaching a high-risk level. For example, set a trigger to alert when a customer has been inactive for a certain period, since “Months Inactive” has a high influence on attrition rate.

Data load

#rm(list=ls())
#gc()

## Using data table library for quicker and easier analysis
library(data.table)
library(ggplot2)
library(scales)
library(knitr)
directory <- 'C:/Users/ThomasJardine/OneDrive - 0to1.one/NY Life Assessment/'
file_name <- 'BankChurners.csv'
file <- paste0(directory, file_name)
## Import dataset
BankChurners <- fread(file)
BankChurners <- BankChurners[,1:(length(BankChurners)-2)]

Q1

Question 1: Please perform EDA on the dataset and provide 3-4 key insights.

Summary

Basic

  • Despite having almost double the utilization ratio, females have less than half the credit limit of males. This may be because the average income level of females is roughly a third that of males. Yet both male and female customers have roughly 2.3 dependents on the card.

  • Females have a higher attrition rate than males, 17.4% vs 14.6%.

  • The age group (45,55] has the highest average salary ($60,662) and a commensurately high credit limit, but it also has the highest attrition rate at 16.6% (excluding the 10 members in the (65,75] age range).

  • It is interesting that while there seems to be little variation in the measures (income, credit limit, utilization, etc.) by education level, the doctorate attrition rate is very high at 21.1%.

  • No surprise that average credit limit rises as income rises, but the utilization rate is highest among low-income members and decreases as income increases, with the highest-income cohort having the lowest utilization rate of 12.5%.

  • There seems to be no differentiation of metrics by marital status.

  • Following a similar, and not surprising, pattern as income, Blue card members, with the lowest average income and credit limit, have the highest utilization ratio. Platinum card holders have the highest attrition rate (albeit with only 20 members in the sample).

  • There is a clear and somewhat intuitive relationship between the number of months inactive and attrition rate, notwithstanding the 0-, 5-, and 6-month groups, which, given their low counts, are not significant.

  • Low income card members are heavily reliant on credit use.

## Summarize BankChurners
summary(BankChurners)
##    CLIENTNUM         Attrition_Flag      Customer_Age      Gender         
##  Min.   :708082083   Length:10127       Min.   :26.00   Length:10127      
##  1st Qu.:713036770   Class :character   1st Qu.:41.00   Class :character  
##  Median :717926358   Mode  :character   Median :46.00   Mode  :character  
##  Mean   :739177606                      Mean   :46.33                     
##  3rd Qu.:773143533                      3rd Qu.:52.00                     
##  Max.   :828343083                      Max.   :73.00                     
##  Dependent_count Education_Level    Marital_Status     Income_Category   
##  Min.   :0.000   Length:10127       Length:10127       Length:10127      
##  1st Qu.:1.000   Class :character   Class :character   Class :character  
##  Median :2.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2.346                                                           
##  3rd Qu.:3.000                                                           
##  Max.   :5.000                                                           
##  Card_Category      Months_on_book  Total_Relationship_Count
##  Length:10127       Min.   :13.00   Min.   :1.000           
##  Class :character   1st Qu.:31.00   1st Qu.:3.000           
##  Mode  :character   Median :36.00   Median :4.000           
##                     Mean   :35.93   Mean   :3.813           
##                     3rd Qu.:40.00   3rd Qu.:5.000           
##                     Max.   :56.00   Max.   :6.000           
##  Months_Inactive_12_mon Contacts_Count_12_mon  Credit_Limit  
##  Min.   :0.000          Min.   :0.000         Min.   : 1438  
##  1st Qu.:2.000          1st Qu.:2.000         1st Qu.: 2555  
##  Median :2.000          Median :2.000         Median : 4549  
##  Mean   :2.341          Mean   :2.455         Mean   : 8632  
##  3rd Qu.:3.000          3rd Qu.:3.000         3rd Qu.:11068  
##  Max.   :6.000          Max.   :6.000         Max.   :34516  
##  Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt
##  Min.   :   0        Min.   :    3   Min.   :0.0000       Min.   :  510  
##  1st Qu.: 359        1st Qu.: 1324   1st Qu.:0.6310       1st Qu.: 2156  
##  Median :1276        Median : 3474   Median :0.7360       Median : 3899  
##  Mean   :1163        Mean   : 7469   Mean   :0.7599       Mean   : 4404  
##  3rd Qu.:1784        3rd Qu.: 9859   3rd Qu.:0.8590       3rd Qu.: 4741  
##  Max.   :2517        Max.   :34516   Max.   :3.3970       Max.   :18484  
##  Total_Trans_Ct   Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
##  Min.   : 10.00   Min.   :0.0000      Min.   :0.0000       
##  1st Qu.: 45.00   1st Qu.:0.5820      1st Qu.:0.0230       
##  Median : 67.00   Median :0.7020      Median :0.1760       
##  Mean   : 64.86   Mean   :0.7122      Mean   :0.2749       
##  3rd Qu.: 81.00   3rd Qu.:0.8180      3rd Qu.:0.5030       
##  Max.   :139.00   Max.   :3.7140      Max.   :0.9990
## Column Description of BankChurners
str(BankChurners)
## Classes 'data.table' and 'data.frame':   10127 obs. of  21 variables:
##  $ CLIENTNUM               : int  768805383 818770008 713982108 769911858 709106358 713061558 810347208 818906208 710930508 719661558 ...
##  $ Attrition_Flag          : chr  "Existing Customer" "Existing Customer" "Existing Customer" "Existing Customer" ...
##  $ Customer_Age            : int  45 49 51 40 40 44 51 32 37 48 ...
##  $ Gender                  : chr  "M" "F" "M" "F" ...
##  $ Dependent_count         : int  3 5 3 4 3 2 4 0 3 2 ...
##  $ Education_Level         : chr  "High School" "Graduate" "Graduate" "High School" ...
##  $ Marital_Status          : chr  "Married" "Single" "Married" "Unknown" ...
##  $ Income_Category         : chr  "$60K - $80K" "Less than $40K" "$80K - $120K" "Less than $40K" ...
##  $ Card_Category           : chr  "Blue" "Blue" "Blue" "Blue" ...
##  $ Months_on_book          : int  39 44 36 34 21 36 46 27 36 36 ...
##  $ Total_Relationship_Count: int  5 6 4 3 5 3 6 2 5 6 ...
##  $ Months_Inactive_12_mon  : int  1 1 1 4 1 1 1 2 2 3 ...
##  $ Contacts_Count_12_mon   : int  3 2 0 1 0 2 3 2 0 3 ...
##  $ Credit_Limit            : num  12691 8256 3418 3313 4716 ...
##  $ Total_Revolving_Bal     : int  777 864 0 2517 0 1247 2264 1396 2517 1677 ...
##  $ Avg_Open_To_Buy         : num  11914 7392 3418 796 4716 ...
##  $ Total_Amt_Chng_Q4_Q1    : num  1.33 1.54 2.59 1.4 2.17 ...
##  $ Total_Trans_Amt         : int  1144 1291 1887 1171 816 1088 1330 1538 1350 1441 ...
##  $ Total_Trans_Ct          : int  42 33 20 20 28 24 31 36 24 32 ...
##  $ Total_Ct_Chng_Q4_Q1     : num  1.62 3.71 2.33 2.33 2.5 ...
##  $ Avg_Utilization_Ratio   : num  0.061 0.105 0 0.76 0 0.311 0.066 0.048 0.113 0.144 ...
##  - attr(*, ".internal.selfref")=<externalptr>
## Additional columns

## Attrition flag as num
BankChurners$Attrition_Flag_Num <- as.numeric(factor(BankChurners$Attrition_Flag))-1

## Age Group Bucket
BankChurners$Customer_Age_Bucket <- cut(x = BankChurners$Customer_Age, breaks = c(25, 35, 45, 55, 65, 75))

## Gender as num
BankChurners$Gender_Num <- as.numeric(factor(BankChurners$Gender))-1

## Education level as num
levs <- c("Uneducated", "Unknown", "High School", "College",  "Graduate", "Post-Graduate", "Doctorate")
BankChurners$Education_Level_Num <- as.numeric(factor(BankChurners$Education_Level, levels = levs ))

## Marital status as num
BankChurners$Marital_Status_Num <- as.numeric(factor(BankChurners$Marital_Status))

## Income_Category as num
BankChurners$Income <- ifelse(BankChurners$Income_Category == '$60K - $80K', 70000,
               ifelse(BankChurners$Income_Category == '$40K - $60K', 50000,
                ifelse(BankChurners$Income_Category == '$120K +', 150000,
                 ifelse(BankChurners$Income_Category == 'Less than $40K', 20000,
                  ifelse(BankChurners$Income_Category == '$80K - $120K', 100000, 50000)))))

## Card_Category as num
levs <- c("Blue", "Silver", "Gold", "Platinum")
BankChurners$Card_Category_Num <- as.numeric(factor(BankChurners$Card_Category, levels = levs ))

Reviewing the relationships among the numerical features including the transformed numeric features.

The “Average Open to Buy” variable is highly correlated with Credit Limit, Card Category, Income, and Gender. Gender is correlated with Income. Average Utilization Ratio, while highly correlated with Total Revolving Balance, is negatively correlated with Average Open to Buy and Credit Limit. Attrition_Flag_Num is correlated with Total_Trans_Ct and Total_Ct_Chng_Q4_Q1 and negatively correlated with Contacts_Count_12_mon and Months_Inactive_12_mon (recall “Existing Customer” = 1 and “Attrited Customer” = 0).

## Correlation Matrix Plot
res <- cor(BankChurners[,c(3,5,10:22,24:28)])
#install.packages("corrplot")
library(corrplot)
corrplot(res, type = "upper", order = "hclust", tl.cex = 0.6, diag = FALSE,
         tl.col = "black", tl.srt = 45)

Summaries

Gender
## Summary by Gender
kable(
(BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Gender')]
)
Gender N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
M 4769 14.6% $85,474.94 2.4 $12,685.67 $1,188.43 $4,493.71 63.2 20.0% 2.3
F 5358 17.4% $31,612.54 2.3 $5,023.85 $1,140.01 $4,324.32 66.4 34.2% 2.4

Despite having almost double the utilization ratio, females have less than half the credit limit of males. This may be because the average income level of females is roughly a third that of males. Yet both male and female customers have roughly 2.3 dependents on the card.

#BankChurners[,.N,by=c('Gender', 'Attrition_Flag')][,perc:=100*N/sum(N),by='Gender'][Attrition_Flag=='Attrited Customer',.SD]
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Gender'][order(Gender)],
aes(x=Gender, y=perc, width=tot/10000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Gender') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
#  geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

Females have a higher attrition rate than males, 17.4% vs 14.6%.

Age
## Summary by Age Bucket
kable(
(BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Customer_Age_Bucket')][order(Customer_Age_Bucket)]
)
Customer_Age_Bucket N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
(25,35] 919 13.3% $49,118.61 1.4 $7,220.70 $1,159.10 $4,505.69 64.9 30.1% 2.2
(35,45] 3742 16.2% $56,378.94 2.8 $8,859.21 $1,161.93 $4,483.30 65.9 26.8% 2.3
(45,55] 4135 16.6% $60,662.64 2.5 $9,166.34 $1,152.80 $4,471.54 65.7 26.7% 2.4
(55,65] 1321 15.8% $52,747.92 1.3 $7,319.78 $1,196.07 $3,919.87 59.4 29.9% 2.4
(65,75] 10 20.0% $38,000 0.3 $5,656.93 $1,583.30 $1,498.30 31.5 40.5% 2.9
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Customer_Age_Bucket'][order(Customer_Age_Bucket)],
aes(x=Customer_Age_Bucket, y=perc, width=tot/5000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Age') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
#  geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

The age group (45,55] has the highest average salary ($60,662) and a commensurately high credit limit, but it also has the highest attrition rate at 16.6% (excluding the 10 members in the (65,75] age range).

Education
## Summary by Education Level
kable(
  (BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Education_Level')]
)
Education_Level N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
High School 2013 15.2% $58,002.98 2.3 $8,605.82 $1,194.35 $4,403.74 64.7 27.8% 2.3
Graduate 3128 15.6% $55,936.70 2.3 $8,566.10 $1,162.38 $4,418.49 64.9 27.8% 2.4
Uneducated 1487 15.9% $57,390.72 2.4 $8,899.51 $1,154.08 $4,484.34 65.3 27.2% 2.4
Unknown 1519 16.9% $56,892.69 2.4 $8,491.80 $1,160.45 $4,420.05 64.7 27.6% 2.3
College 1013 15.2% $57,936.82 2.4 $8,684.54 $1,135.78 $4,255.81 65 26.5% 2.4
Post-Graduate 516 17.8% $56,763.57 2.4 $8,862.56 $1,187.67 $4,515.41 65 28.1% 2.3
Doctorate 451 21.1% $56,629.71 2.3 $8,413.26 $1,094.08 $4,193.02 63.8 26.1% 2.4
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Education_Level'][order(Education_Level)],
aes(x=Education_Level, y=perc, width=tot/3000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Education') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
#  geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

It is interesting that while there seems to be little variation in the measures (income, credit limit, utilization, etc.) by education level, the doctorate attrition rate is very high at 21.1%.

Income
## Summary by Income
kable(
  (BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Income_Category')]
)
Income_Category N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
$60K - $80K 1402 13.5% $70,000 2.4 $10,758.77 $1,154.99 $4,450.71 63.2 20.6% 2.3
Less than $40K 3561 17.2% $20,000 2.3 $3,754.40 $1,145.14 $4,363.42 66.2 37.7% 2.4
$80K - $120K 1535 15.8% $100,000 2.5 $15,809.89 $1,206.73 $4,482.96 62.7 16.2% 2.3
$40K - $60K 1790 15.1% $50,000 2.3 $5,462.26 $1,171.99 $4,405.62 65 31.7% 2.3
$120K + 727 17.3% $150,000 2.5 $19,717.32 $1,222.20 $4,529.47 63.7 12.5% 2.3
Unknown 1112 16.8% $50,000 2.3 $9,516.58 $1,115.06 $4,282.21 66.3 22.1% 2.4
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Income_Category'][order(Income_Category)],
aes(x=Income_Category, y=perc, width=tot/3000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Income') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
#  geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

No surprise that average credit limit rises as income rises, but the utilization rate is highest among low-income members and decreases as income increases, with the highest-income cohort having the lowest utilization rate of 12.5%.

Marital
## Summary by Marital Status
kable(
  (BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Marital_Status')]
)
Marital_Status N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
Married 4687 15.1% $57,793.90 2.4 $8,076.66 $1,197.16 $4,173.42 61.7 29.0% 2.3
Single 3943 16.9% $55,883.84 2.3 $8,999.68 $1,124.24 $4,594.55 67.8 26.4% 2.4
Unknown 749 17.2% $58,064.09 2.5 $9,445.28 $1,157.75 $4,720.01 67.1 25.6% 2.3
Divorced 748 16.2% $56,537.43 2.4 $9,358.57 $1,155.98 $4,529.09 66.9 25.7% 2.3
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Marital_Status'][order(Marital_Status)],
aes(x=Marital_Status, y=perc, width=tot/5000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Marital Status') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
#  geom_text(aes(label = paste('(',tot,')')), vjust = -1.5, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

There seems to be no differentiation of metrics by marital status.

Card
## Summary by Card Category
kable(
  (BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Card_Category')]
)
Card_Category N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
Blue 9436 16.1% $56,106.40 2.3 $7,363.78 $1,157.81 $4,225.41 64 29.1% 2.3
Gold 116 18.1% $73,362.07 2.7 $28,416.37 $1,344.32 $7,685.61 81.5 5.7% 2.3
Silver 555 14.8% $67,783.78 2.4 $25,277.84 $1,206.13 $6,590.48 74.7 5.7% 2.3
Platinum 20 25.0% $73,000 2.4 $30,283.45 $1,267.95 $8,999.75 87 4.4% 2.2
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Card_Category'][order(Card_Category)],
aes(x=Card_Category, y=perc, width=tot/10000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Card Category') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

Following a similar, and not surprising, pattern as income, Blue card members, with the lowest average income and credit limit, have the highest utilization ratio. Platinum card holders have the highest attrition rate (albeit with only 20 members in the sample).

Mths/In
## Summary by Months Inactive
kable(
  (BankChurners)[, .(.N
                 , AvgAR=sprintf("%0.1f%%", (1-sum(Attrition_Flag_Num)/.N)*100)
                 , AvgInc=dollar(mean(Income))
                 , AvgDep=format(mean(Dependent_count), digits=2)
                 , AvgCL=dollar(mean(Credit_Limit))
                 , AvgTRB=dollar(mean(Total_Revolving_Bal))
                 , AvgTTA=dollar(mean(Total_Trans_Amt))
                 , AvgTTC=format(mean(Total_Trans_Ct), digits=3)
                 , AvgUR=sprintf("%0.1f%%", mean(Avg_Utilization_Ratio)*100)
                 , AvgMI=format(mean(Months_Inactive_12_mon), digits=2))
                 , by = c('Months_Inactive_12_mon')][order(Months_Inactive_12_mon)]
)
Months_Inactive_12_mon N AvgAR AvgInc AvgDep AvgCL AvgTRB AvgTTA AvgTTC AvgUR AvgMI
0 29 51.7% $44,137.93 2.1 $7,014.27 $714.90 $3,156.34 54.4 15.4% 0
1 2233 4.5% $56,892.07 2.4 $8,628.31 $1,230.62 $4,672.82 67.3 28.8% 1
2 3282 15.4% $57,922.00 2.4 $8,925.32 $1,167.45 $4,401.07 65 27.1% 2
3 3846 21.5% $56,627.67 2.3 $8,561.65 $1,138.31 $4,321.20 63.7 27.1% 3
4 435 29.9% $57,724.14 2.3 $7,916.64 $1,048.37 $3,769.31 60.7 26.8% 4
5 178 18.0% $51,629.21 2.4 $7,515.52 $1,131.32 $4,688.28 67.8 31.1% 5
6 124 15.3% $52,419.35 2.2 $7,603.88 $1,130.56 $4,326.02 66.9 28.1% 6
ggplot(data=
BankChurners[,.(perc=100*(1-sum(Attrition_Flag_Num)/.N),tot=.N, AvgUR=mean(Avg_Utilization_Ratio)),
             by='Months_Inactive_12_mon'][order(Months_Inactive_12_mon)],
aes(x=Months_Inactive_12_mon, y=perc, width=tot/4000, fill=AvgUR)) +
  geom_bar(stat='identity') +
  ggtitle('Attrition Rate by Months Inactive') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
  ylab('Attrition Rate/(Count)') + 
  labs(caption = "width of bars is count size")

There is a clear and somewhat intuitive relationship between the number of months inactive and attrition rate, notwithstanding the 0-, 5-, and 6-month groups, which, given their low counts, are not significant.

Utility
ggplot(data=
BankChurners[,.(perc=mean(Avg_Utilization_Ratio)*100,tot=.N),
             by='Income'],
aes(x=Income, y=perc, width=tot*4)) +
  geom_bar(stat='identity') +
  ggtitle('Utility Ratio By Income') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(sprintf("%0.1f%%", perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
  ylab('Utility Ratio/(Count)') + 
  labs(caption = "width of bars is count size")

Credit
ggplot(data=
BankChurners[,.(perc=mean(Credit_Limit),tot=.N),
             by='Income'],
aes(x=Income, y=perc, width=tot*6)) +
  geom_bar(stat='identity') +
  ggtitle('Credit Limit By Income') +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = paste(dollar(perc),'(',tot,')')), vjust = -0.2, colour = "black", size = 3) +
  ylab('Credit Limit/(Count)') + 
  labs(caption = "width of bars is count size")

Q2

Question 2: Please engineer 5 features for the model and explain how you would transform the feature for the model.

Summary

The feature selection results found here are in line with the conclusions from the EDA exercise. For feature selection, I used the ROC curve area method to rank the significance of the features.
Feature Score
Months_Inactive_12_mon 7.02
Gender 3.75
Total_Ct_Chng_Q4_Q1 2.88
Education_Level 2.82
Contacts_Count_12_mon 2.71

The following transformations were performed in the “Additional Columns” code chunk above. The categorical features Gender and Education Level were transformed into numerical representations. Because Education Level has a natural hierarchy, the numerical encoding was customized to rank as follows: “Uneducated”, “Unknown”, “High School”, “College”, “Graduate”, “Post-Graduate”, “Doctorate”.

Additionally, each feature, except Gender (a binary feature), was scaled to standardize the column to a mean of 0 and a standard deviation of 1. Not all models require scaling, but applying it does not adversely affect the results.

Finally, the dependent variable, Attrition_Flag_Num, is transformed into a factor in R. Again, this is not required by every model but is compatible with all of them.

## I use the ROC Curve area as a score to rank feature relevance in the dataset
library(caret)
#use roc_curve area as score
## takes 20+ mins
roc_imp <- filterVarImp(x = BankChurners[,3:21], y = BankChurners$Attrition_Flag_Num)

#sort the score in decreasing order
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
##                    variable        score
## 10   Months_Inactive_12_mon 7.020371e+00
## 2                    Gender 3.753000e+00
## 18      Total_Ct_Chng_Q4_Q1 2.886986e+00
## 4           Education_Level 2.820714e+00
## 11    Contacts_Count_12_mon 2.713653e+00
## 13      Total_Revolving_Bal 1.846161e+00
## 19    Avg_Utilization_Ratio 1.824920e+00
## 17           Total_Trans_Ct 1.792174e+00
## 9  Total_Relationship_Count 1.412392e+00
## 6           Income_Category 1.357638e+00
## 12             Credit_Limit 9.761108e-01
## 5            Marital_Status 7.259596e-01
## 7             Card_Category 5.845796e-01
## 8            Months_on_book 3.687921e-01
## 1              Customer_Age 2.184954e-01
## 3           Dependent_count 1.999022e-01
## 16          Total_Trans_Amt 2.975181e-11
## 14          Avg_Open_To_Buy 8.230264e-12
## 15     Total_Amt_Chng_Q4_Q1 2.815500e-13
## model dataset construction
dataset <- BankChurners[,c(24,25,12,13,20,22)]

dataset$Attrition_Flag_Num <- factor(dataset$Attrition_Flag_Num, levels = c(0, 1))

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split <- sample.split(dataset$Attrition_Flag_Num, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

# Feature Scaling
training_set[,2] <- scale(training_set[,2])
training_set[,3] <- scale(training_set[,3])
training_set[,4] <- scale(training_set[,4])
training_set[,5] <- scale(training_set[,5])
test_set[,2] <- scale(test_set[,2])
test_set[,3] <- scale(test_set[,3])
test_set[,4] <- scale(test_set[,4])
test_set[,5] <- scale(test_set[,5])

Q3

Question 3: What kind of model would you use to predict customer churn?

Summary

I will employ a series of well known classification models including:
Logistic Regression
K-Nearest Neighbor
Support Vector Machine
Kernel SVM
Naive Bayes
Decision Tree
Random Forest

See below for the training of each model.

Models

Log Reg

# Fitting Logistic Regression to the Training set
classifier = glm(formula = Attrition_Flag_Num ~ .,
                 family = binomial,
                 data = training_set)

# Predicting the Test set results
prob_pred = predict(classifier, type = 'response', newdata = test_set[,-6])
y_pred = ifelse(prob_pred > 0.5, 1, 0)

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

#paste("Accuracy = ",sprintf("%0.1f%%", (1-(cm[1,2]+cm[2,1])/dim(test_set)[1])*100))
lr_acc <- sum(diag(cm)) / sum(cm) 
lr_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
lr_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
lr_F1 <- (2*lr_precision*lr_recall)/(lr_precision+lr_recall)
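
Since the same four metrics are recomputed for every model below, they could be wrapped in a small helper. This is a sketch only (the chunks that follow keep the explicit calculations); the positive class is “0”, i.e. “Attrited Customer”.

cm_metrics <- function(cm) {
  ## cm: 2x2 confusion matrix with Actual in rows and Predicted in columns,
  ## class "0" (Attrited Customer) in the first row/column
  acc       <- sum(diag(cm)) / sum(cm)
  precision <- cm[1,1] / (cm[1,1] + cm[2,1])
  recall    <- cm[1,1] / (cm[1,1] + cm[1,2]) ## AKA sensitivity
  f1        <- (2 * precision * recall) / (precision + recall)
  c(accuracy = acc, precision = precision, recall = recall, F1 = f1)
}
## Example: cm_metrics(cm) after building cm for any of the models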

KNN

# Fitting K Nearest Neighbor to the Training set
library(class)
y_pred = knn(train = training_set[, -6],
             test = test_set[, -6],
             cl = training_set[[6]],
             k = 5,
             prob = TRUE)

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

knn_acc <- sum(diag(cm)) / sum(cm) 
knn_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
knn_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
knn_F1 <- (2*knn_precision*knn_recall)/(knn_precision+knn_recall)

SVM

# Fitting Support Vector Machine to the Training set
# install.packages('e1071')
library(e1071)
classifier = svm(formula = Attrition_Flag_Num ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

svm_acc <- sum(diag(cm)) / sum(cm) 
svm_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
svm_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
svm_F1 <- (2*svm_precision*svm_recall)/(svm_precision+svm_recall)

KSVM

# Fitting Kernel SVM to the Training set
library(e1071)
classifier = svm(formula = Attrition_Flag_Num ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'radial')

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

ksvm_acc <- sum(diag(cm)) / sum(cm) 
ksvm_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
ksvm_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
ksvm_F1 <- (2*ksvm_precision*ksvm_recall)/(ksvm_precision+ksvm_recall)

NB

# Fitting Naive Bayes to the Training set
library(e1071)
classifier = naiveBayes(x = training_set[,1:5],
                        y = training_set$Attrition_Flag_Num)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,1:5])

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

nb_acc <- sum(diag(cm)) / sum(cm) 
nb_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
nb_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
nb_F1 <- (2*nb_precision*nb_recall)/(nb_precision+nb_recall)

DecTree

# Fitting Decision Tree Classification to the Training set
# install.packages('rpart')
library(rpart)
training_set$Attrition_Flag_Num = factor(training_set$Attrition_Flag_Num, levels = c(0, 1))
test_set$Attrition_Flag_Num = factor(test_set$Attrition_Flag_Num, levels = c(0, 1))
classifier = rpart(formula = Attrition_Flag_Num ~ .,
                   data = training_set)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,-6], type = 'class')

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

dt_acc <- sum(diag(cm)) / sum(cm) 
dt_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
dt_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
dt_F1 <- (2*dt_precision*dt_recall)/(dt_precision+dt_recall)
# Plotting the tree

## recreate training and test dataset
set.seed(123)
split = sample.split(dataset$Attrition_Flag_Num, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

training_set$Attrition_Flag_Num = factor(training_set$Attrition_Flag_Num, levels = c(0, 1))
test_set$Attrition_Flag_Num = factor(test_set$Attrition_Flag_Num, levels = c(0, 1))
classifier = rpart(formula = Attrition_Flag_Num ~ .,
                   data = training_set)
library(rpart.plot)
prp(classifier)

RF

# Fitting Random Forest Classification to the Training set
# install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[,1:5],
                          y = training_set$Attrition_Flag_Num,
                          ntree = 500)
# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[,-6])

# Making the Confusion Matrix
cm = as.matrix(table(Actual = test_set[[6]], Predicted = y_pred)) # create the confusion matrix

rf_acc <- sum(diag(cm)) / sum(cm) 
rf_precision <- cm[1,1]/(cm[1,1]+cm[2,1])
rf_recall <- cm[1,1]/(cm[1,1]+cm[1,2]) ## AKA sensitivity
rf_F1 <- (2*rf_precision*rf_recall)/(rf_precision+rf_recall)

Q4

Question 4: Please explain any tuning you would do to the model and why.

Summary

In machine learning, models have two types of parameters: those “learned” by training the model (e.g., the coefficients in a regression model) and those the user can adjust, known as hyperparameters. Adjusting these hyperparameters is known as tuning a model, or optimizing the training.

In R, the “caret” library has an automatic hyperparameter selection function, which I use below. In the interest of simplicity, I tune only the Random Forest model, as this is the model with the highest accuracy and recall.
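
Caret’s train function can also search an explicit grid with cross-validated resampling rather than its default bootstrap search. The short sketch below is illustrative only; the candidate mtry values and the fold count are assumptions, not the settings used in the run that follows.

library(caret)
rf_grid <- expand.grid(mtry = c(2, 3, 5))           ## candidate values to evaluate
rf_ctrl <- trainControl(method = 'cv', number = 5)  ## 5-fold cross-validation
classifier_tuned <- train(Attrition_Flag_Num ~ ., data = training_set,
                          method = 'rf', tuneGrid = rf_grid, trControl = rf_ctrl)
classifier_tuned$bestTune                           ## best mtry found by the search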

# Grid Search to find the best parameters
# install.packages('caret')
# this code chunk takes several minutes to run
library(caret)
classifier <- train(form = Attrition_Flag_Num ~ ., data = training_set, method = 'rf')
classifier
## Random Forest 
## 
## 7595 samples
##    5 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 7595, 7595, 7595, 7595, 7595, 7595, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8615113  0.3796700
##   3     0.8471441  0.3546715
##   5     0.8336159  0.3275144
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

The optimal Random Forest model chosen uses mtry = 2, where:

mtry — the number of variables randomly sampled as candidates at each split. Note that the default values differ for classification (sqrt(p), where p is the number of variables in x) and regression (p/3); with the 5 predictors used here, the classification default is floor(sqrt(5)) = 2, which matches the value selected by the grid search.

Notice that the accuracy of the optimal model (mtry = 2) is 86.1%, slightly less than the 86.9% accuracy obtained above when training the Random Forest model. This is because I calculated the earlier accuracy using a single training/test split, while the function above averages the accuracies from 25 bootstrap resamples.

Q5

Question 5: How would you evaluate the model’s performance?

Summary

Accuracy, the percentage of correct predictions a model makes, is a natural first metric for evaluating a model’s performance. By creating a confusion matrix from the vector of predictions and the vector of actual results, accuracy is calculated by summing the confusion matrix diagonal and dividing by the number of records. The accuracies of each model are below.

Additionally, given that the proportion of “Attrited Customer” records was low, a model could simply predict all customers to be “Existing” and still produce an accuracy of 83.9% (as the Support Vector Machine model did), so the recall measure, which captures how well “Attrited Customer” cases were identified, was used as well. Recall is calculated from the confusion matrix as TruePositive / (TruePositive + FalseNegative).

NB. This recall calculation rewards predicting attrited customers, because simply predicting existing customers (and missing potential attrited customers) doesn’t help. Moreover, the cost of incorrectly flagging a customer as at risk (and extending special offers) is minimal, while the cost of missing a potentially attrited customer and not providing them special offers is high.
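
As a quick sanity check on the 83.9% baseline, the base rate of the majority class can be computed directly. This is a sketch on the full dataset; the figure quoted above is on the test split, but the two should be very close.

## Accuracy of a naive model that always predicts the majority class ("Existing Customer")
mean(BankChurners$Attrition_Flag == "Existing Customer")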

Finally, these accuracy measures are the result of just one training and test split. A more robust method of calculating accuracies (and other performance metrics) is to employ k-fold cross-validation: train the model on k different training/test selections from the dataset and average the resulting metrics. Below I apply 10-fold cross-validation to the Random Forest model.

Model Accuracy and Recall metrics:
Logistic Regression Accuracy: 86.4% Recall: 23.8%
K-Nearest Neighbor Accuracy: 85.4% Recall: 30.7%
Support Vector Machine Accuracy: 83.9% Recall: 0.0%
Kernel SVM Accuracy: 86.4% Recall: 22.9%
Naive Bayes Accuracy: 86.1% Recall: 18.2%
Decision Tree Accuracy: 86.8% Recall: 33.4%
Random Forest Accuracy: 86.9% Recall: 35.6%

Using a combination of accuracy and recall (sensitivity) performance metrics, the Random Forest model, with an accuracy of 86.9% and a recall of 35.6% (both the highest among all models), performed best.

# K-Fold Cross Validation
library(caret)
folds <- createFolds(training_set$Attrition_Flag_Num, k = 10)
cv <- lapply(folds, function(x) {
  training_fold = training_set[-x, ]
  test_fold = training_set[x, ]
  classifier = randomForest(x = training_fold[,1:5],
                          y = training_fold$Attrition_Flag_Num,
                          ntree = 500)
# Predicting the Test set results
  y_pred <- predict(classifier, newdata = test_fold[,-6])
  cm <- as.matrix(table(Actual = test_fold[[6]], Predicted = y_pred)) # create the confusion matrix
  accuracy <- (cm[1,1] + cm[2,2]) / sum(cm)
  return(accuracy)
})
rf_kfold_accuracy <- mean(as.numeric(cv))

Random Forest 10-Fold Average Accuracy: 86.7%

Q6

Question 6: How would you integrate the model into the business?

Once a model is created, there are two ways to integrate it into the business.
1. Run the model against all existing customers on a periodic basis to identify which customers are at highest risk of attrition, then target those customers with special rates or offers, or simply contact them and provide better servicing (a sketch of this approach follows below).
2. Use the most important features to set triggers to reach out to, and provide offers to, customers who are approaching a high-risk level. For example, set a trigger to alert when a customer has been inactive for a certain period, since “Months Inactive” has a high influence on attrition rate.

Using this model to target customers at high risk of attrition can improve the client servicing model and reduce the number of exiting clients.
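
A minimal sketch of the first approach, assuming classifier is the Random Forest fit from Question 3 and current_customers is a hypothetical data.table holding the current book of business with the same five (scaled) features used for training; the 0.30 probability cutoff is purely illustrative and would need to be set against business costs.

library(randomForest)
library(data.table)

## Probability of class "0" ("Attrited Customer") for every current customer
risk_scores <- predict(classifier, newdata = current_customers[, 1:5], type = "prob")[, "0"]

## Flag members above the (illustrative) risk threshold and prioritize the least active
at_risk <- current_customers[risk_scores > 0.30]
setorder(at_risk, -Months_Inactive_12_mon)
## at_risk can now be routed to the retention team for targeted offers or outreach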

Dashboard