This data set show that 10,000 customer mentioning their age, salary, marital_status, credit card limit, credit card category, etc.
We will done exploratory analysis on this data to understand customer demography, customer credit card usage and loyal customer from this bank dataset. This dataset is download from Kaggle
data <- read.csv("C:/Users/ANDREW/OneDrive/Desktop/UM/Course/WQD 7004 - Programming for Data Science/2. Individual Assignment 04.22/archive/BankChurners.csv")
# To show the first 10 row of the dataset
head(data, n=10)
## CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count
## 1 768805383 Existing Customer 45 M 3
## 2 818770008 Existing Customer 49 F 5
## 3 713982108 Existing Customer 51 M 3
## 4 769911858 Existing Customer 40 F 4
## 5 709106358 Existing Customer 40 M 3
## 6 713061558 Existing Customer 44 M 2
## 7 810347208 Existing Customer 51 M 4
## 8 818906208 Existing Customer 32 M 0
## 9 710930508 Existing Customer 37 M 3
## 10 719661558 Existing Customer 48 M 2
## Education_Level Marital_Status Income_Category Card_Category Months_on_book
## 1 High School Married $60K - $80K Blue 39
## 2 Graduate Single Less than $40K Blue 44
## 3 Graduate Married $80K - $120K Blue 36
## 4 High School Unknown Less than $40K Blue 34
## 5 Uneducated Married $60K - $80K Blue 21
## 6 Graduate Married $40K - $60K Blue 36
## 7 Unknown Married $120K + Gold 46
## 8 High School Unknown $60K - $80K Silver 27
## 9 Uneducated Single $60K - $80K Blue 36
## 10 Graduate Single $80K - $120K Blue 36
## Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon
## 1 5 1 3
## 2 6 1 2
## 3 4 1 0
## 4 3 4 1
## 5 5 1 0
## 6 3 1 2
## 7 6 1 3
## 8 2 2 2
## 9 5 2 0
## 10 6 3 3
## Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1
## 1 12691 777 11914 1.335
## 2 8256 864 7392 1.541
## 3 3418 0 3418 2.594
## 4 3313 2517 796 1.405
## 5 4716 0 4716 2.175
## 6 4010 1247 2763 1.376
## 7 34516 2264 32252 1.975
## 8 29081 1396 27685 2.204
## 9 22352 2517 19835 3.355
## 10 11656 1677 9979 1.524
## Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
## 1 1144 42 1.625 0.061
## 2 1291 33 3.714 0.105
## 3 1887 20 2.333 0.000
## 4 1171 20 2.333 0.760
## 5 816 28 2.500 0.000
## 6 1088 24 0.846 0.311
## 7 1330 31 0.722 0.066
## 8 1538 36 0.714 0.048
## 9 1350 24 1.182 0.113
## 10 1441 32 0.882 0.144
library(knitr)
library(dplyr)
library(ggplot2) - used to plot data for data visual
library(data.table) - used to categorize age into groups defined by interval
library(viridis) - used for color scale to make plot
library(janitor) - used of tabyl( function to count combination of one, two or three variables and show the percentage.
CLIENTNUM - Client number. Unique identifier for the customer holding the account
Attrition_Flag - Internal event (customer activity) variable - if the account is closed then 1 else 0
Customer_Age - Demographic variable - Customer’s Age in Years
Gender - Demographic variable - M=Male, F=Female
Dependent_count - Demographic variable - Number of dependents
Education_Level - Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)
Marital_Status - Demographic variable - Married, Single, Divorced, Unknown
Income_Category - Demographic variable - Annual Income Category of the account holder (< $40K, $40K - 60K, $60K - $80K, $80K-$120K, >$120K)
Card_Category - Product Variable - Type of Card (Blue, Silver, Gold, Platinum)
Months_on_book - Period of relationship with bank
Total_Relationship_Count - Total no. of products held by the customer
Months_Inactive_12_mon - No. of months inactive in the last 12 months
Contacts_Count_12_mon - No. of Contacts in the last 12 months
Credit_Limit - Credit Limit on the Credit Card
Total_Revolving_Bal - Total Revolving Balance on the Credit Card
Avg_Open_To_Buy - Open to Buy Credit Line (Average of last 12 months)
Total_Amt_Chng_Q4_Q1 - Change in Transaction Amount (Q4 over Q1)
Total_Trans_Amt - Total Transaction Amount (Last 12 months)
Total_Trans_Ct - Total Transaction Count (Last 12 months)
Total_Ct_Chng_Q4_Q1 - Change in Transaction Count (Q4 over Q1)
Avg_Utilization_Ratio - Average Card Utilization Ratio
# Set "df" as data frame from original data
df <- data
# Change all variable name to lowercase
names(df) <- tolower(names(df))
# Rename 'customer_age' column to 'age'
names(df)[3] <- "age"
# Add new column agegroup by categorize age
setDT(df)[age >25 & age <= 35, agegroup := "25-35 Years Old"]
df[age >35 & age <= 45, agegroup := "35-45 Years Old"]
df[age >45 & age <= 55, agegroup := "45-55 Years Old"]
df[age >55 & age <= 65, agegroup := "55-65 Years Old"]
df[age >65 & age <= 75, agegroup := "65-75 Years Old"]
# Add new column total_trans_cat (total transaction category) by categorize total transaction amount
setDT(df)[total_trans_amt <= 2000, total_trans_cat := "$0 - $2000"]
df[total_trans_amt >= 2001 & total_trans_amt <= 4000, total_trans_cat := "$2001 - $4000"]
df[total_trans_amt >= 4001 & total_trans_amt <= 6000, total_trans_cat := "$4001 - $6000"]
df[total_trans_amt >= 6001 & total_trans_amt <= 8000, total_trans_cat := "$6001 - $8000"]
df[total_trans_amt >= 8001 & total_trans_amt <= 10000, total_trans_cat := "$8001 - $10000"]
df[total_trans_amt >= 10001 & total_trans_amt <= 12000, total_trans_cat := "$10001 - $12000"]
df[total_trans_amt >= 12001 & total_trans_amt <= 14000, total_trans_cat := "$12001 - $14000"]
df[total_trans_amt >= 14001 & total_trans_amt <= 16000, total_trans_cat := "$14001 - $16000"]
# Add new column years_on_book from months_on_book
df$years_on_book <- as.numeric(as.character(df$months_on_book)) / 12
# Remove unknown in income_category
df1 <- df[!(df$income_category=="Unknown")]
ggplot(data = df1,
mapping = aes(x = age, fill = gender)) +
geom_histogram(alpha = 5, bins = 50) +
labs(fill = "Gender", x = "Customer Age", y = "Count",
title = "Customer Demography") +
theme_bw() +
scale_x_continuous(breaks = round(seq(min(df1$age), max(df1$age),
by = 2),1))
# Show the Male and Female in total and percentage
tabyl(df1, gender)
## gender n percent
## F 4298 0.476761
## M 4717 0.523239
# Calculate the Majority percentage of credit card customer from age 36 - 58
major <- sum(df1$age[between(df1$age,36,58)]) / sum(df1$age) *100
print(major)
## [1] 85.28081
# Calculate the least percentage of credit card customer from age 26 - 35 and age 59 - 73
minor <- 100 - major
print(minor)
## [1] 14.71919
Summary from customer demography table
Customer demography table show the distribution of customer group from dataset. In total there are 4,298 Female customers (47.7%) and 4,717 Male customers (52.3%).
Majority of customer is between 36 years old and 58 years old which is 85.3%.
Customer above 65 years old and below 28 years old is least group size which is 14.7%.
ggplot(data = df1,
mapping = aes(x = age, fill = card_category)) +
geom_histogram(alpha = 5, bins = 50) +
labs(fill = "Card Category" ,x = "Credit Card Customer By Category",
y = "Count", title = "Customer Demography by Card Category") +
facet_grid(. ~ gender) +
theme_bw() +
scale_x_continuous(breaks = round(seq(min(df1$age), max(df1$age), by = 5),1))
# Show the customer demography by card category
tabyl(df1, card_category)
## card_category n percent
## Blue 8391 0.930782030
## Gold 107 0.011869107
## Platinum 15 0.001663894
## Silver 502 0.055684969
# Show the customer demography by card category by gender
tabyl(df1, card_category, gender)
## card_category F M
## Blue 4106 4285
## Gold 29 78
## Platinum 4 11
## Silver 159 343
Summary from customer demography by card category
Table above show that there was 8,391(93.1%) customers use Blue card in which 4,106 Female customers and 4,285 Male customers.
502 (5.6%) customers use Silver card in which 159 Female customers and 343 Male customers.
107 (1.2%) customers use Gold card in which 29 Female customers and 78 Male customes.
15 (0.1%) customers use Platinum card in which 4 Female customers and 11 Male customers.
ggplot(data = df1, mapping = aes(x = total_trans_ct, y = total_trans_amt)) +
geom_point(aes(color = gender)) +
facet_wrap(~ agegroup, nrow=2) +
theme_bw() +
ggtitle("Customer Credit Card Usage") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("Total Transaction Count") +
ylab("Total Transaction Amount") +
labs(color = "Gender") +
scale_color_manual(labels = c("Female", "Male"), values = c("pink", "blue")) +
scale_x_continuous(breaks = round(seq(min(df1$total_trans_ct), max(df1$total_trans_ct), by = 20),1)) +
scale_y_continuous(breaks = round(seq(min(df1$total_trans_amt), max(df1$total_trans_amt), by = 1500),1))
# Show the customer spending behavior by total transaction amount category
tabyl(df1, total_trans_cat)
## total_trans_cat n percent valid_percent
## $0 - $2000 2034 0.225623960 0.227669577
## $10001 - $12000 10 0.001109262 0.001119319
## $12001 - $14000 173 0.019190238 0.019364227
## $14001 - $16000 433 0.048031059 0.048466532
## $2001 - $4000 2735 0.303383250 0.306133871
## $4001 - $6000 2908 0.322573489 0.325498097
## $6001 - $8000 334 0.037049362 0.037385270
## $8001 - $10000 307 0.034054354 0.034363107
## <NA> 81 0.008985025 NA
# Show the customer spending behavior by total transaction amount category by age group
tabyl(df1, total_trans_cat, agegroup) %>%
adorn_totals("row") %>%
adorn_percentages("all") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns %>%
adorn_title
## agegroup
## total_trans_cat 25-35 Years Old 35-45 Years Old 45-55 Years Old
## $0 - $2000 1.5% (138) 7.1% (644) 9.1% (818)
## $10001 - $12000 0.0% (2) 0.0% (4) 0.0% (3)
## $12001 - $14000 0.4% (34) 0.7% (64) 0.6% (56)
## $14001 - $16000 0.6% (52) 1.8% (161) 1.9% (168)
## $2001 - $4000 4.2% (380) 11.9% (1069) 11.3% (1023)
## $4001 - $6000 1.5% (133) 12.6% (1140) 14.4% (1297)
## $6001 - $8000 0.1% (13) 1.5% (138) 1.7% (154)
## $8001 - $10000 0.3% (24) 1.4% (128) 1.4% (129)
## <NA> 0.1% (12) 0.3% (24) 0.4% (36)
## Total 8.7% (788) 37.4% (3372) 40.9% (3684)
##
## 55-65 Years Old 65-75 Years Old
## 4.7% (428) 0.1% (6)
## 0.0% (1) 0.0% (0)
## 0.2% (19) 0.0% (0)
## 0.6% (52) 0.0% (0)
## 2.9% (262) 0.0% (1)
## 3.7% (338) 0.0% (0)
## 0.3% (29) 0.0% (0)
## 0.3% (26) 0.0% (0)
## 0.1% (9) 0.0% (0)
## 12.9% (1164) 0.1% (7)
Summary from Customer spending behavior table
According to the table above, the highest category of customer spending on credit card is $4001 - $6000 (32%), follow by $2001 - $4000 (30%) and $0 - $2000 (22%).
The least category of customer spending on credit card is $10001 - $12000 (0.1%), $12001 - $14000 (2%) and $8001 - $10000 (3%)
The highest category of customer spending on credit card based on age group is customer age group 45-55 years old who spend $4001 - $6000 (14.4%), customer age group 35-45 years old who spend $4001 - $6000 (12.6%) and customer age group 35-45 years old who spend $2001 - $4000 (11.9%)
ggplot(data = df1,
mapping = aes(x = months_on_book, fill = attrition_flag)) +
geom_histogram(alpha = 0.5 , bins = 20) +
labs(fill = "Attrition Customer", x = "Customer Months On Book", y = "Count",
title = "Credit Card Loyal Customer") +
theme_bw() +
scale_x_continuous(breaks = round(seq(min(df1$months_on_book), max(df1$months_on_book), by = 2),1))
# To show loyal customer based on months_on_book in bank
tabyl(df1, months_on_book, attrition_flag) %>%
adorn_totals("row") %>%
adorn_percentages("all") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns %>%
adorn_title
## attrition_flag
## months_on_book Attrited Customer Existing Customer
## 13 0.1% (6) 0.5% (44)
## 14 0.0% (1) 0.2% (14)
## 15 0.1% (7) 0.2% (18)
## 16 0.0% (3) 0.2% (22)
## 17 0.0% (4) 0.4% (33)
## 18 0.1% (11) 0.4% (39)
## 19 0.1% (6) 0.6% (54)
## 20 0.1% (12) 0.6% (58)
## 21 0.1% (9) 0.7% (61)
## 22 0.2% (16) 0.9% (77)
## 23 0.1% (12) 1.0% (91)
## 24 0.2% (21) 1.3% (118)
## 25 0.3% (26) 1.3% (120)
## 26 0.2% (20) 1.7% (149)
## 27 0.2% (21) 1.7% (156)
## 28 0.5% (41) 2.3% (208)
## 29 0.4% (32) 2.1% (186)
## 30 0.6% (53) 2.4% (220)
## 31 0.3% (30) 2.8% (252)
## 32 0.4% (40) 2.5% (222)
## 33 0.5% (44) 2.6% (236)
## 34 0.6% (52) 2.9% (262)
## 35 0.4% (40) 2.7% (247)
## 36 4.2% (383) 20.0% (1804)
## 37 0.6% (55) 2.9% (265)
## 38 0.6% (52) 2.9% (258)
## 39 0.6% (55) 2.8% (248)
## 40 0.4% (39) 2.7% (247)
## 41 0.5% (44) 2.4% (217)
## 42 0.3% (28) 2.3% (208)
## 43 0.5% (41) 2.3% (211)
## 44 0.4% (36) 1.9% (171)
## 45 0.3% (30) 1.9% (174)
## 46 0.4% (34) 1.6% (146)
## 47 0.2% (22) 1.5% (133)
## 48 0.3% (25) 1.4% (123)
## 49 0.2% (16) 1.2% (109)
## 50 0.2% (21) 0.7% (63)
## 51 0.1% (13) 0.6% (54)
## 52 0.1% (11) 0.5% (43)
## 53 0.1% (7) 0.7% (66)
## 54 0.1% (5) 0.5% (43)
## 55 0.0% (2) 0.3% (30)
## 56 0.2% (14) 0.8% (75)
## Total 16.0% (1440) 84.0% (7575)
# To show loyal customer group by agegroup and gender
df1 %>%
tabyl(attrition_flag, agegroup, gender, show_missing_levels = FALSE) %>%
adorn_totals("row") %>%
adorn_percentages("all") %>%
adorn_pct_formatting(digits = 1) %>%
adorn_ns %>%
adorn_title
## $F
## agegroup
## attrition_flag 25-35 Years Old 35-45 Years Old 45-55 Years Old
## Attrited Customer 1.2% (52) 6.8% (294) 7.4% (316)
## Existing Customer 7.1% (306) 30.7% (1320) 33.4% (1435)
## Total 8.3% (358) 37.6% (1614) 40.7% (1751)
##
## 55-65 Years Old 65-75 Years Old
## 2.0% (86) 0.0% (0)
## 11.3% (487) 0.0% (2)
## 13.3% (573) 0.0% (2)
##
## $M
## agegroup
## attrition_flag 25-35 Years Old 35-45 Years Old 45-55 Years Old
## Attrited Customer 1.2% (56) 5.4% (255) 6.0% (282)
## Existing Customer 7.9% (374) 31.9% (1503) 35.0% (1651)
## Total 9.1% (430) 37.3% (1758) 41.0% (1933)
##
## 55-65 Years Old 65-75 Years Old
## 2.1% (98) 0.0% (1)
## 10.5% (493) 0.1% (4)
## 12.5% (591) 0.1% (5)
Summary from Credit Card Loyal Customer table
Data above show that there was 7,575 existing customer (84%) and 1,440 attrited_customer (16%) and majority of customer already stay at this bank for 36 months/ 3 years which is 1,804 customer (20%).
Majority of Female existing customer was age group 45-55 years old 1,435 customers (33.4%) and attrited customer - 316 customers (7.4%).
Majority of Male existing customer was age group 45-55 years old - 1,651 customers (35%) and attrited customer - 282 customers (6%).