Data analysis is an art. This sample aims to illustrate how to use R to perform exploratory analysis the given problems. Only brief discussion will be included as an illustration.
First, we have to load the libraries.
library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)
dat <- read.csv("C:\\Users\\user\\downloads\\credit_card_usage.csv")
head(dat,10)
Cust_ID Balance Balance_freq Purchase Max_oneoff Installment
1 C0001 6493.033342 1.000000 447.50 447.50 0.00
2 C0002 1722.475015 1.000000 115.91 45.65 70.26
3 C0003 78.818407 0.500000 0.00 0.00 0.00
4 C0004 5.644904 0.454545 47.50 0.00 47.50
5 C0005 1466.919250 1.000000 0.00 0.00 0.00
6 C0006 8.844399 0.181818 116.68 0.00 116.68
7 C0007 1205.024103 1.000000 1247.43 0.00 1247.43
8 C0008 2468.506177 1.000000 0.00 0.00 0.00
9 C0009 1160.206649 0.900000 1600.00 1400.00 200.00
10 C0010 6002.618198 1.000000 0.00 0.00 0.00
Cash_advance Purchase_freq Oneoff_freq Installment_freq Cash_advance_freq
1 1664.47374 0.416667 0.416667 0.000000 0.583333
2 146.60786 0.500000 0.083333 0.500000 0.250000
3 1113.18608 0.000000 0.000000 0.000000 0.166667
4 0.00000 0.250000 0.000000 0.166667 0.000000
5 37.69126 0.000000 0.000000 0.000000 0.166667
6 0.00000 0.166667 0.000000 0.166667 0.000000
7 0.00000 1.000000 0.000000 1.000000 0.000000
8 648.80863 0.000000 0.000000 0.000000 0.250000
9 0.00000 0.300000 0.100000 0.100000 0.000000
10 1163.46222 0.000000 0.000000 0.000000 0.500000
Cash_advance_tran Purchase_tran Credit_limit Payment Min_payment
1 8 6 8000 1907.59974 1874.45719
2 4 7 1900 442.94462 813.97454
3 7 0 1200 1397.77013 21.82119
4 0 3 3000 70.55406 42.43471
5 2 0 1500 323.64218 429.53837
6 0 2 3000 677.98290 123.10715
7 0 41 1000 1314.35815 2812.82049
8 4 0 2500 601.74019 1085.79506
9 0 5 2000 876.75728 474.63298
10 10 0 6500 1498.94815 1837.99886
Full_payment_perc Tenure
1 0.000000 12
2 0.000000 12
3 0.333333 6
4 0.100000 12
5 0.000000 12
6 0.666667 12
7 0.000000 12
8 0.000000 12
9 0.000000 10
10 0.000000 12
The following program produces basic descriptive statistics and boxplots of the raw data.
describe(dat[,2:18])
vars n mean sd median trimmed mad min
Balance 1 3000 1520.10 2032.05 845.61 1089.89 1155.17 0.00
Balance_freq 2 3000 0.87 0.25 1.00 0.93 0.00 0.00
Purchase 3 3000 988.76 2152.37 360.05 579.32 533.81 0.00
Max_oneoff 4 3000 589.94 1684.35 29.30 264.21 43.44 0.00
Installment 5 3000 399.32 832.63 86.80 222.13 128.69 0.00
Cash_advance 6 3000 921.62 1939.51 0.00 463.63 0.00 0.00
Purchase_freq 7 3000 0.49 0.40 0.43 0.48 0.64 0.00
Oneoff_freq 8 3000 0.20 0.29 0.08 0.13 0.12 0.00
Installment_freq 9 3000 0.36 0.40 0.17 0.33 0.25 0.00
Cash_advance_freq 10 3000 0.13 0.20 0.00 0.09 0.00 0.00
Cash_advance_tran 11 3000 3.08 6.75 0.00 1.60 0.00 0.00
Purchase_tran 12 3000 14.55 24.03 7.00 9.49 10.38 0.00
Credit_limit 13 2999 4444.50 3672.98 3000.00 3843.97 2668.68 150.00
Payment 14 3000 1721.28 2914.19 863.78 1144.99 870.55 0.00
Min_payment 15 2886 810.09 1776.23 308.93 477.43 279.46 0.12
Full_payment_perc 16 3000 0.15 0.29 0.00 0.08 0.00 0.00
Tenure 17 3000 11.49 1.38 12.00 11.90 0.00 6.00
max range skew kurtosis se
Balance 15532.34 15532.34 2.32 6.76 37.10
Balance_freq 1.00 1.00 -1.93 2.69 0.00
Purchase 49039.57 49039.57 9.29 146.84 39.30
Max_oneoff 40761.25 40761.25 10.70 187.28 30.75
Installment 13184.43 13184.43 5.74 52.80 15.20
Cash_advance 21943.85 21943.85 4.06 23.88 35.41
Purchase_freq 1.00 1.00 0.08 -1.64 0.01
Oneoff_freq 1.00 1.00 1.57 1.28 0.01
Installment_freq 1.00 1.00 0.52 -1.40 0.01
Cash_advance_freq 1.17 1.17 1.86 3.45 0.00
Cash_advance_tran 123.00 123.00 6.65 80.97 0.12
Purchase_tran 248.00 248.00 4.14 24.79 0.44
Credit_limit 28000.00 27850.00 1.57 2.78 67.07
Payment 46930.60 46930.60 6.40 64.51 53.21
Min_payment 30528.43 30528.32 8.30 99.23 33.06
Full_payment_perc 1.00 1.00 1.96 2.54 0.01
Tenure 12.00 6.00 -2.85 7.07 0.03
# re-order the columns to facilitate the correlation analysis
dat <- dat %>% select(Cust_ID, Tenure, Credit_limit, Balance, Balance_freq,
Purchase,Purchase_freq, Purchase_tran,
Max_oneoff, Oneoff_freq,
Payment, Min_payment, Full_payment_perc,
Installment, Installment_freq,
Cash_advance, Cash_advance_freq, Cash_advance_tran)
## boxplot by group
dat %>% select(Cust_ID, Credit_limit, Balance, Purchase, Max_oneoff,
Payment, Min_payment) %>%
reshape2::melt(id="Cust_ID") %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Characterisitcs", y="Value",
title="Balance/Purchase/Max_Oneoff/Payment") +
theme(plot.title=element_text(hjust=0.5))
dat %>% select(Cust_ID, Balance_freq, Purchase_freq, Oneoff_freq, Cash_advance_freq,
Full_payment_perc) %>%
reshape2::melt(id="Cust_ID") %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Characterisitcs", y="Value",
title="Balance/Purchase/Oneoff/Cash advance/Full payment - Frequency") +
theme(plot.title=element_text(hjust=0.5))
dat %>% select(Cust_ID, Installment, Cash_advance) %>%
reshape2::melt(id="Cust_ID") %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Characterisitcs", y="Value",
title="Installment/Cash advance") +
theme(plot.title=element_text(hjust=0.5))
dat %>% select(Cust_ID, Cash_advance_tran, Tenure) %>%
reshape2::melt(id="Cust_ID") %>%
ggplot(aes(x=variable, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Characterisitcs", y="Value",
title="Cash advance transaction/Tenure") +
theme(plot.title=element_text(hjust=0.5))
From the descriptive statistics and boxplots, the distributions of Purchase, Purchase_tran, Max_oneoff, Installment, Cash_advance, Cash_advance_tran, Credit_limit, Payment, and Min_payment are highly positive skew. From the boxplot, all variables except purchase_freq and tenure have substantial numbers of outliers.
The following program analyzes the missing data in the dataset.
cat("Missing by count\n")
Missing by count
sapply(dat[,2:18], function(x) round(sum(is.na(x)),2))
Tenure Credit_limit Balance Balance_freq
0 1 0 0
Purchase Purchase_freq Purchase_tran Max_oneoff
0 0 0 0
Oneoff_freq Payment Min_payment Full_payment_perc
0 0 114 0
Installment Installment_freq Cash_advance Cash_advance_freq
0 0 0 0
Cash_advance_tran
0
cat("Missing by percentage\n")
Missing by percentage
sapply(dat[,2:18], function(x) round(sum(is.na(x))/nrow(dat),2))
Tenure Credit_limit Balance Balance_freq
0.00 0.00 0.00 0.00
Purchase Purchase_freq Purchase_tran Max_oneoff
0.00 0.00 0.00 0.00
Oneoff_freq Payment Min_payment Full_payment_perc
0.00 0.00 0.04 0.00
Installment Installment_freq Cash_advance Cash_advance_freq
0.00 0.00 0.00 0.00
Cash_advance_tran
0.00
aggr(dat, prop=FALSE, numbers=TRUE)
aggr(dat, prop=TRUE, numbers=FALSE)
matrixplot(dat, interactive=FALSE) # without sorting
The data has a total of 115 missing value with 114 found in Min_payment, which is about 4%. The missing value seems to be randomly distributed in the dataset. Since Min_payment is one of the key variables related to customer’s credit usage, the variable cannot be excluded in the analysis. So, the missing data may better be handled by listwise deletion.
dat_comp <- dat[complete.cases(dat),]
Correlation analysis can be performed to identify highly related variables that can be used to segment the customers for promoting the installment and cash advance services.
dat_comp %>%
select(Credit_limit:Full_payment_perc) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Balance/Purchase/Oneoff/Payment",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.4)
dat_comp %>%
select(Credit_limit:Balance_freq, Installment:Cash_advance_tran) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Credit_limit/Balance/Installment/Cash Advance",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.5)
dat_comp %>%
select(Purchase:Purchase_tran, Installment:Cash_advance_tran) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Purchase/Installment/Cash Advance",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.5)
dat_comp %>%
select(Max_oneoff:Oneoff_freq, Installment:Cash_advance_tran) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Oneoff/Installment/Cash Advance",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.5)
dat_comp %>%
select(Payment:Full_payment_perc, Installment:Cash_advance_tran) %>%
cor() %>%
round(3) %>%
corrplot(method = "color", addCoef.col="white", type = "upper",
title="Correlation - Payment/Installment/Cash Advance",
mar=c(0,0,2,0),
tl.cex=0.5, number.cex = 0.5)
The following rule of thumb is applied to evaluate the correlations between variables: +==================================================+ |Correlation coefficient | Interpretation | +———————– | ————————+ |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correation | |0.40 - 0.69 | Moderate correation | |0.10 - 0.39 | Weak correation | |0.00 - 0.10 | Negligible correation | +==================================================+
From the correlation plots, Purchase and Max_oneoff have very strong correlation. For installment/cash advance attributes, moderate to strong correlations include:
| Relationship | Variables | |
|---|---|
| Strong correlation | Purchase/Installment (0.7); Purchase_freq/Installment_freq (0.86); |
| Moderate correlation | Balance/Cash_advance (0.49); Balance/Cash_advance_freq (0.45); Purchase_freq/Installment (0.47); Purchase_tran/Installment (0.63); Purchase_tran/Installment_freq (0.54); Max_oneoff/Installment (0.4); Payment/Installment (0.43); |
To visualize the associations between variables, correlation coefficients and scatter plots of variables in the above table are derived below.
x<-dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran,
Max_oneoff, Payment,
Installment, Installment_freq) %>%
cor()
round(x,3)
Purchase Purchase_freq Purchase_tran Max_oneoff Payment
Purchase 1.000 0.386 0.654 0.935 0.616
Purchase_freq 0.386 1.000 0.580 0.260 0.107
Purchase_tran 0.654 0.580 1.000 0.523 0.345
Max_oneoff 0.935 0.260 0.523 1.000 0.573
Payment 0.616 0.107 0.345 0.573 1.000
Installment 0.695 0.472 0.634 0.395 0.433
Installment_freq 0.310 0.863 0.542 0.126 0.102
Installment Installment_freq
Purchase 0.695 0.310
Purchase_freq 0.472 0.863
Purchase_tran 0.634 0.542
Max_oneoff 0.395 0.126
Payment 0.433 0.102
Installment 1.000 0.546
Installment_freq 0.546 1.000
dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran,
Installment, Installment_freq) %>%
scatterplotMatrix()
dat_comp %>% select(Max_oneoff, Payment,
Installment, Installment_freq) %>%
scatterplotMatrix()
From the above correlation coefficients and scatter plots, Purchase is a reasonable predictor for Installment, and Purchase_freq can be a predictor for Installment_freq.
x <-dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
cor()
round(x,3)
Balance Cash_advance Cash_advance_freq
Balance 1.000 0.486 0.448
Cash_advance 0.486 1.000 0.662
Cash_advance_freq 0.448 0.662 1.000
dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
scatterplotMatrix()
From the above correlation coefficient and scatter plots, Balance can be a weak predictor of Cash_advance and Cash_advance_freq.
From the correlation analysis, we can formulate the following hypotheses:
To verify the hypothesis, we classify customers by Purchase, Purchase_freq, and Balance into three groups: High (>75 percentile), Medium (25 - 75 percentiles), and Low (<25 percentile).
The quantiles of variables are:
dat_comp %>% select(Purchase, Purchase_freq, Balance, Installment, Cash_advance) %>% sapply(quantile)
Purchase Purchase_freq Balance Installment Cash_advance
0% 0.00 0.000000 0.0000 0.00 0.000
25% 43.08 0.083333 134.5811 0.00 0.000
50% 368.47 0.500000 887.0908 90.00 0.000
75% 1122.30 0.916667 2069.4513 479.52 1105.396
100% 49039.57 1.000000 15532.3397 13184.43 21943.849
The figures show that less than 50% of customers making use of installment and cash advance services.
dat_new <- dat_comp %>% mutate(p_group=ifelse(Purchase>1122, "High",
ifelse(Purchase<=1122 & Purchase>43, "Medium", "Low")),
pf_group=ifelse(Purchase_freq>0.917, "High",
ifelse(Purchase_freq<=0.917 & Purchase_freq>0.083,
"Medium", "Low")),
b_group=ifelse(Balance>2069, "High",
ifelse(Balance<=2069 & Balance>135, "Medium", "Low")))
# reorder the levels
dat_new$p_group <- factor(dat_new$p_group, levels=c("High", "Medium", "Low"))
dat_new$pf_group <- factor(dat_new$pf_group, levels=c("High", "Medium", "Low"))
dat_new$b_group <- factor(dat_new$b_group, levels=c("High", "Medium", "Low"))
# count the # for each group
cat("Purchase:")
Purchase:
table(dat_new$p_group)
High Medium Low
723 1441 721
cat("Purchase freq:")
Purchase freq:
table(dat_new$pf_group)
High Medium Low
695 1544 646
cat("Balance:")
Balance:
table(dat_new$b_group)
High Medium Low
722 1441 722
The following program evaluates the effects of using purchase/purchase_tran to classify observations on installment and installment_freq.
x<-dat_new %>% select(Installment, Installment_freq, p_group) %>%
group_by(p_group) %>%
summarise(Av_install=round(mean(Installment),3),
SD_install=round(sd(Installment),3),
Median_install=round(median(Installment),3),
Av_install_freq=round(mean(Installment_freq),3),
SD_install_freq=round(sd(Installment_freq),3),
Median_install_freq=round(median(Installment_freq),3))
# use kable because x is a tibble
kable(x)
| p_group | Av_install | SD_install | Median_install | Av_install_freq | SD_install_freq | Median_install_freq |
|---|---|---|---|---|---|---|
| High | 1155.117 | 1392.330 | 873.49 | 0.630 | 0.375 | 0.750 |
| Medium | 237.403 | 262.050 | 155.32 | 0.413 | 0.380 | 0.333 |
| Low | 1.158 | 6.193 | 0.00 | 0.008 | 0.050 | 0.000 |
x<-dat_new %>% select(Installment, Installment_freq, pf_group) %>%
group_by(pf_group) %>%
summarise(Av_install=round(mean(Installment),3),
SD_install=round(sd(Installment),3),
Median_install=round(median(Installment),3),
Av_install_freq=round(mean(Installment_freq),3),
SD_install_freq=round(sd(Installment_freq),3),
Median_install_freq=round(median(Installment_freq),3))
kable(x)
| pf_group | Av_install | SD_install | Median_install | Av_install_freq | SD_install_freq | Median_install_freq |
|---|---|---|---|---|---|---|
| High | 1009.496 | 1204.707 | 643.960 | 0.853 | 0.286 | 1.00 |
| Medium | 308.559 | 664.826 | 103.295 | 0.300 | 0.304 | 0.25 |
| Low | 0.104 | 2.634 | 0.000 | 0.000 | 0.003 | 0.00 |
g1 <- dat_new %>% select(Installment, p_group) %>%
reshape2::melt(id="p_group") %>%
ggplot(aes(x=p_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase Group", y="Installment",
title="Installment vs Purchase Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
g2 <- dat_new %>% select(Installment, pf_group) %>%
reshape2::melt(id="pf_group") %>%
ggplot(aes(x=pf_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase Frequency Group", y="Installment",
title="Installment vs Purchase Frequency Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
g3 <- dat_new %>% select(Installment_freq, p_group) %>%
reshape2::melt(id="p_group") %>%
ggplot(aes(x=p_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase Group", y="Installment Frequency",
title="Installment Frequency vs Purchase Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
g4 <- dat_new %>% select(Installment_freq, pf_group) %>%
reshape2::melt(id="pf_group") %>%
ggplot(aes(x=pf_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase Frequency Group", y="Installment Frequency",
title="Installment Frequency vs Purchase Frequency Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
grid.arrange(arrangeGrob(g1, g2, nrow=2))
grid.arrange(arrangeGrob(g3, g4, nrow=2))
The descriptive statistics and boxplots suggest that high purchase group has significantly higher installment (Av=1155.12; Median=873.49) than medium purchase group (Av=237.40; Median=155.32) and high purchase frequency group has significantly higher installment frequency (Av=0.85; Median=1) than medium purchase frequency group (Av=0.30; Median=0.25). Low purchase and low purchase frequency group seldom use the installment service.
The following program further explores the effect of grouping the observations by purchase and purchase_freq.
x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>%
group_by(p_group, pf_group) %>%
summarise(Av_install=round(mean(Installment),3),
SD_install=round(sd(Installment),3),
Median_install=round(median(Installment),3),
Av_install_freq=round(mean(Installment_freq),3),
SD_install_freq=round(sd(Installment_freq),3),
Median_install_freq=round(median(Installment_freq),3))
kable(x[,1:5])
| p_group | pf_group | Av_install | SD_install | Median_install |
|---|---|---|---|---|
| High | High | 1547.622 | 1466.499 | 1205.00 |
| High | Medium | 770.140 | 1198.253 | 449.24 |
| Medium | High | 441.734 | 274.408 | 419.88 |
| Medium | Medium | 175.753 | 224.329 | 94.12 |
| Low | High | 4.000 | 6.928 | 0.00 |
| Low | Medium | 10.495 | 15.023 | 0.00 |
| Low | Low | 0.104 | 2.634 | 0.00 |
kable(x[,c(1:2,6:8)])
| p_group | pf_group | Av_install_freq | SD_install_freq | Median_install_freq |
|---|---|---|---|---|
| High | High | 0.857 | 0.268 | 1.000 |
| High | Medium | 0.408 | 0.330 | 0.417 |
| Medium | High | 0.853 | 0.298 | 1.000 |
| Medium | Medium | 0.280 | 0.291 | 0.167 |
| Low | High | 0.306 | 0.529 | 0.000 |
| Low | Medium | 0.064 | 0.098 | 0.000 |
| Low | Low | 0.000 | 0.003 | 0.000 |
CrossTable(dat_new$p_group, dat_new$pf_group,
prop.chisq=FALSE, dnn=c("Purchase Group", "Purchase Frequency Group"))
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 2885
| Purchase Frequency Group
Purchase Group | High | Medium | Low | Row Total |
---------------|-----------|-----------|-----------|-----------|
High | 358 | 365 | 0 | 723 |
| 0.495 | 0.505 | 0.000 | 0.251 |
| 0.515 | 0.236 | 0.000 | |
| 0.124 | 0.127 | 0.000 | |
---------------|-----------|-----------|-----------|-----------|
Medium | 334 | 1107 | 0 | 1441 |
| 0.232 | 0.768 | 0.000 | 0.499 |
| 0.481 | 0.717 | 0.000 | |
| 0.116 | 0.384 | 0.000 | |
---------------|-----------|-----------|-----------|-----------|
Low | 3 | 72 | 646 | 721 |
| 0.004 | 0.100 | 0.896 | 0.250 |
| 0.004 | 0.047 | 1.000 | |
| 0.001 | 0.025 | 0.224 | |
---------------|-----------|-----------|-----------|-----------|
Column Total | 695 | 1544 | 646 | 2885 |
| 0.241 | 0.535 | 0.224 | |
---------------|-----------|-----------|-----------|-----------|
# create combine group for boxplot
x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>%
reshape2::melt(id=c("p_group", "pf_group")) %>%
mutate(c_group=ifelse(p_group=="High",
ifelse(pf_group=="High", "High-High",
ifelse(pf_group=="Medium", "High-Medium", "High-Low")),
ifelse(p_group=="Medium",
ifelse(pf_group=="High", "Medium-High",
ifelse(pf_group=="Medium", "Medium-Medium", "Medium-Low")),
ifelse(pf_group=="High", "Low-High",
ifelse(pf_group=="Medium", "Low-Medium", "Low-Low")))))
x$c_group <- factor(x$c_group, levels=c("High-High", "High-Medium", "High-Low",
"Medium-High", "Medium-Medium","Medium-Low",
"Low-High", "Low-Medium", "Low-Low"))
x %>% filter(variable=="Installment") %>%
ggplot(aes(x=c_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase + Purchase Frequency", y="Installment",
title="Installment vs Purchase + Purchase Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title=element_text(hjust=0.5), legend.position="None")
x %>% filter(variable=="Installment_freq") %>%
ggplot(aes(x=c_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Purchase + Purchase Frequency", y="Installment Frequency",
title="Installment vs Purchase + Purchase Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title=element_text(hjust=0.5), legend.position="None")
y <- x %>% mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$c_group, y$i_group,
prop.chisq=FALSE, dnn=c("Purchase-Purchase Freq", "Installment"))
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 5770
| Installment
Purchase-Purchase Freq | Yes | No | Row Total |
-----------------------|-----------|-----------|-----------|
High-High | 686 | 30 | 716 |
| 0.958 | 0.042 | 0.124 |
| 0.210 | 0.012 | |
| 0.119 | 0.005 | |
-----------------------|-----------|-----------|-----------|
High-Medium | 532 | 198 | 730 |
| 0.729 | 0.271 | 0.127 |
| 0.163 | 0.079 | |
| 0.092 | 0.034 | |
-----------------------|-----------|-----------|-----------|
Medium-High | 614 | 54 | 668 |
| 0.919 | 0.081 | 0.116 |
| 0.188 | 0.022 | |
| 0.106 | 0.009 | |
-----------------------|-----------|-----------|-----------|
Medium-Medium | 1380 | 834 | 2214 |
| 0.623 | 0.377 | 0.384 |
| 0.422 | 0.334 | |
| 0.239 | 0.145 | |
-----------------------|-----------|-----------|-----------|
Low-High | 2 | 4 | 6 |
| 0.333 | 0.667 | 0.001 |
| 0.001 | 0.002 | |
| 0.000 | 0.001 | |
-----------------------|-----------|-----------|-----------|
Low-Medium | 55 | 89 | 144 |
| 0.382 | 0.618 | 0.025 |
| 0.017 | 0.036 | |
| 0.010 | 0.015 | |
-----------------------|-----------|-----------|-----------|
Low-Low | 2 | 1290 | 1292 |
| 0.002 | 0.998 | 0.224 |
| 0.001 | 0.516 | |
| 0.000 | 0.224 | |
-----------------------|-----------|-----------|-----------|
Column Total | 3271 | 2499 | 5770 |
| 0.567 | 0.433 | |
-----------------------|-----------|-----------|-----------|
From the descriptive statistics and boxplots, it clearly indicates that applying both the purchase and purchase_freq to grouping the observations can refine the granularity of customer groups that have demands of installment services. In general, high purchase and purchase frequency group is the most likely to use the installment services and has the highest installment spending. In the High-Medium and Medium-Medium groups, the percentage of non-installment customers are 27.1% and 37.7%, respectively. So, target promotion can be devised to stimulate the use of installment services for customers who haven’t made use of the installment services within these two groups of customers.
The following program evaluates the effects of using balance to classify observations on cash_advance and cash_advance_freq.
x<- dat_new %>% select(Cash_advance, Cash_advance_freq, b_group) %>%
group_by(b_group) %>%
summarise(Av_cash=round(mean(Cash_advance),3),
SD_cash=round(sd(Cash_advance),3),
Median_cash=round(median(Cash_advance),3),
Av_cash_freq=round(mean(Cash_advance_freq),3),
SD_cash_freq=round(sd(Cash_advance_freq),3),
Median_cash_freq=round(median(Cash_advance_freq),3))
kable(x)
| b_group | Av_cash | SD_cash | Median_cash | Av_cash_freq | SD_cash_freq | Median_cash_freq |
|---|---|---|---|---|---|---|
| High | 2378.959 | 3058.038 | 1441.826 | 0.271 | 0.246 | 0.250 |
| Medium | 661.091 | 1231.056 | 36.943 | 0.123 | 0.173 | 0.083 |
| Low | 66.712 | 281.853 | 0.000 | 0.017 | 0.055 | 0.000 |
g1 <-dat_new %>% select(Cash_advance, b_group) %>%
reshape2::melt(id="b_group") %>%
ggplot(aes(x=b_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Balance Group", y="Cash advance",
title="Cash Advance vs Balance Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
g2 <- dat_new %>% select(Cash_advance_freq, b_group) %>%
reshape2::melt(id="b_group") %>%
ggplot(aes(x=b_group, y=value, fill=variable)) +
geom_boxplot() +
labs(x="Balance Group", y="Cash advance frequency",
title="Cash Advance Frequency vs Balance Group") +
theme(plot.title=element_text(hjust=0.5), legend.position="None")
grid.arrange(arrangeGrob(g1, g2, nrow=2))
y <- dat_new %>% select(Cash_advance, b_group) %>%
reshape2::melt(id="b_group") %>%
mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$b_group, y$i_group,
prop.chisq=FALSE, dnn=c("Balance", "Cash advance"))
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 2885
| Cash advance
Balance | Yes | No | Row Total |
-------------|-----------|-----------|-----------|
High | 567 | 155 | 722 |
| 0.785 | 0.215 | 0.250 |
| 0.407 | 0.104 | |
| 0.197 | 0.054 | |
-------------|-----------|-----------|-----------|
Medium | 733 | 708 | 1441 |
| 0.509 | 0.491 | 0.499 |
| 0.527 | 0.474 | |
| 0.254 | 0.245 | |
-------------|-----------|-----------|-----------|
Low | 92 | 630 | 722 |
| 0.127 | 0.873 | 0.250 |
| 0.066 | 0.422 | |
| 0.032 | 0.218 | |
-------------|-----------|-----------|-----------|
Column Total | 1392 | 1493 | 2885 |
| 0.482 | 0.518 | |
-------------|-----------|-----------|-----------|
The results confirm that high balance group tends to make more cash advance than medium or low balance group. Although this phenomenon seems a bit unnatural, 21.5% of high balance and 49.1% of medium balance haven’t made cash advance in the past six months. So, target marketing plan may design for these groups of customers. Overall, the analysis results support the two hypothese.