Introduction

Data analysis is an art. This sample aims to illustrate how to use R to perform exploratory analysis the given problems. Only brief discussion will be included as an illustration.

Descriptive statistics

First, we have to load the libraries.

library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)
dat <- read.csv("C:\\Users\\user\\downloads\\credit_card_usage.csv")
head(dat,10)
   Cust_ID     Balance Balance_freq Purchase Max_oneoff Installment
1    C0001 6493.033342     1.000000   447.50     447.50        0.00
2    C0002 1722.475015     1.000000   115.91      45.65       70.26
3    C0003   78.818407     0.500000     0.00       0.00        0.00
4    C0004    5.644904     0.454545    47.50       0.00       47.50
5    C0005 1466.919250     1.000000     0.00       0.00        0.00
6    C0006    8.844399     0.181818   116.68       0.00      116.68
7    C0007 1205.024103     1.000000  1247.43       0.00     1247.43
8    C0008 2468.506177     1.000000     0.00       0.00        0.00
9    C0009 1160.206649     0.900000  1600.00    1400.00      200.00
10   C0010 6002.618198     1.000000     0.00       0.00        0.00
   Cash_advance Purchase_freq Oneoff_freq Installment_freq Cash_advance_freq
1    1664.47374      0.416667    0.416667         0.000000          0.583333
2     146.60786      0.500000    0.083333         0.500000          0.250000
3    1113.18608      0.000000    0.000000         0.000000          0.166667
4       0.00000      0.250000    0.000000         0.166667          0.000000
5      37.69126      0.000000    0.000000         0.000000          0.166667
6       0.00000      0.166667    0.000000         0.166667          0.000000
7       0.00000      1.000000    0.000000         1.000000          0.000000
8     648.80863      0.000000    0.000000         0.000000          0.250000
9       0.00000      0.300000    0.100000         0.100000          0.000000
10   1163.46222      0.000000    0.000000         0.000000          0.500000
   Cash_advance_tran Purchase_tran Credit_limit    Payment Min_payment
1                  8             6         8000 1907.59974  1874.45719
2                  4             7         1900  442.94462   813.97454
3                  7             0         1200 1397.77013    21.82119
4                  0             3         3000   70.55406    42.43471
5                  2             0         1500  323.64218   429.53837
6                  0             2         3000  677.98290   123.10715
7                  0            41         1000 1314.35815  2812.82049
8                  4             0         2500  601.74019  1085.79506
9                  0             5         2000  876.75728   474.63298
10                10             0         6500 1498.94815  1837.99886
   Full_payment_perc Tenure
1           0.000000     12
2           0.000000     12
3           0.333333      6
4           0.100000     12
5           0.000000     12
6           0.666667     12
7           0.000000     12
8           0.000000     12
9           0.000000     10
10          0.000000     12

The following program produces basic descriptive statistics and boxplots of the raw data.

describe(dat[,2:18])
                  vars    n    mean      sd  median trimmed     mad    min
Balance              1 3000 1520.10 2032.05  845.61 1089.89 1155.17   0.00
Balance_freq         2 3000    0.87    0.25    1.00    0.93    0.00   0.00
Purchase             3 3000  988.76 2152.37  360.05  579.32  533.81   0.00
Max_oneoff           4 3000  589.94 1684.35   29.30  264.21   43.44   0.00
Installment          5 3000  399.32  832.63   86.80  222.13  128.69   0.00
Cash_advance         6 3000  921.62 1939.51    0.00  463.63    0.00   0.00
Purchase_freq        7 3000    0.49    0.40    0.43    0.48    0.64   0.00
Oneoff_freq          8 3000    0.20    0.29    0.08    0.13    0.12   0.00
Installment_freq     9 3000    0.36    0.40    0.17    0.33    0.25   0.00
Cash_advance_freq   10 3000    0.13    0.20    0.00    0.09    0.00   0.00
Cash_advance_tran   11 3000    3.08    6.75    0.00    1.60    0.00   0.00
Purchase_tran       12 3000   14.55   24.03    7.00    9.49   10.38   0.00
Credit_limit        13 2999 4444.50 3672.98 3000.00 3843.97 2668.68 150.00
Payment             14 3000 1721.28 2914.19  863.78 1144.99  870.55   0.00
Min_payment         15 2886  810.09 1776.23  308.93  477.43  279.46   0.12
Full_payment_perc   16 3000    0.15    0.29    0.00    0.08    0.00   0.00
Tenure              17 3000   11.49    1.38   12.00   11.90    0.00   6.00
                       max    range  skew kurtosis    se
Balance           15532.34 15532.34  2.32     6.76 37.10
Balance_freq          1.00     1.00 -1.93     2.69  0.00
Purchase          49039.57 49039.57  9.29   146.84 39.30
Max_oneoff        40761.25 40761.25 10.70   187.28 30.75
Installment       13184.43 13184.43  5.74    52.80 15.20
Cash_advance      21943.85 21943.85  4.06    23.88 35.41
Purchase_freq         1.00     1.00  0.08    -1.64  0.01
Oneoff_freq           1.00     1.00  1.57     1.28  0.01
Installment_freq      1.00     1.00  0.52    -1.40  0.01
Cash_advance_freq     1.17     1.17  1.86     3.45  0.00
Cash_advance_tran   123.00   123.00  6.65    80.97  0.12
Purchase_tran       248.00   248.00  4.14    24.79  0.44
Credit_limit      28000.00 27850.00  1.57     2.78 67.07
Payment           46930.60 46930.60  6.40    64.51 53.21
Min_payment       30528.43 30528.32  8.30    99.23 33.06
Full_payment_perc     1.00     1.00  1.96     2.54  0.01
Tenure               12.00     6.00 -2.85     7.07  0.03
# re-order the columns to facilitate the correlation analysis
dat <- dat %>% select(Cust_ID, Tenure, Credit_limit, Balance, Balance_freq, 
                                Purchase,Purchase_freq, Purchase_tran,
                                Max_oneoff, Oneoff_freq,
                                Payment, Min_payment, Full_payment_perc,
                                Installment, Installment_freq,
                                Cash_advance, Cash_advance_freq, Cash_advance_tran)

## boxplot by group
dat %>% select(Cust_ID, Credit_limit, Balance, Purchase, Max_oneoff, 
               Payment, Min_payment) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Balance/Purchase/Max_Oneoff/Payment") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Balance_freq, Purchase_freq, Oneoff_freq, Cash_advance_freq,
                    Full_payment_perc) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Balance/Purchase/Oneoff/Cash advance/Full payment - Frequency") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Installment, Cash_advance) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Installment/Cash advance") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Cash_advance_tran, Tenure) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Cash advance transaction/Tenure") +
  theme(plot.title=element_text(hjust=0.5))

From the descriptive statistics and boxplots, the distributions of Purchase, Purchase_tran, Max_oneoff, Installment, Cash_advance, Cash_advance_tran, Credit_limit, Payment, and Min_payment are highly positive skew. From the boxplot, all variables except purchase_freq and tenure have substantial numbers of outliers.

Missing data analysis

The following program analyzes the missing data in the dataset.

cat("Missing by count\n")
Missing by count
sapply(dat[,2:18], function(x) round(sum(is.na(x)),2))
           Tenure      Credit_limit           Balance      Balance_freq 
                0                 1                 0                 0 
         Purchase     Purchase_freq     Purchase_tran        Max_oneoff 
                0                 0                 0                 0 
      Oneoff_freq           Payment       Min_payment Full_payment_perc 
                0                 0               114                 0 
      Installment  Installment_freq      Cash_advance Cash_advance_freq 
                0                 0                 0                 0 
Cash_advance_tran 
                0 
cat("Missing by percentage\n")
Missing by percentage
sapply(dat[,2:18], function(x) round(sum(is.na(x))/nrow(dat),2))
           Tenure      Credit_limit           Balance      Balance_freq 
             0.00              0.00              0.00              0.00 
         Purchase     Purchase_freq     Purchase_tran        Max_oneoff 
             0.00              0.00              0.00              0.00 
      Oneoff_freq           Payment       Min_payment Full_payment_perc 
             0.00              0.00              0.04              0.00 
      Installment  Installment_freq      Cash_advance Cash_advance_freq 
             0.00              0.00              0.00              0.00 
Cash_advance_tran 
             0.00 
aggr(dat, prop=FALSE, numbers=TRUE)

aggr(dat, prop=TRUE, numbers=FALSE)

matrixplot(dat, interactive=FALSE) # without sorting

The data has a total of 115 missing value with 114 found in Min_payment, which is about 4%. The missing value seems to be randomly distributed in the dataset. Since Min_payment is one of the key variables related to customer’s credit usage, the variable cannot be excluded in the analysis. So, the missing data may better be handled by listwise deletion.

dat_comp <- dat[complete.cases(dat),]

Correlation analysis

Correlation analysis can be performed to identify highly related variables that can be used to segment the customers for promoting the installment and cash advance services.

dat_comp %>% 
  select(Credit_limit:Full_payment_perc) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Balance/Purchase/Oneoff/Payment",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.4)

dat_comp %>% 
  select(Credit_limit:Balance_freq, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Credit_limit/Balance/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Purchase:Purchase_tran, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Purchase/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Max_oneoff:Oneoff_freq, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Oneoff/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Payment:Full_payment_perc, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Payment/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

The following rule of thumb is applied to evaluate the correlations between variables: +==================================================+ |Correlation coefficient | Interpretation | +———————– | ————————+ |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correation | |0.40 - 0.69 | Moderate correation | |0.10 - 0.39 | Weak correation | |0.00 - 0.10 | Negligible correation | +==================================================+

From the correlation plots, Purchase and Max_oneoff have very strong correlation. For installment/cash advance attributes, moderate to strong correlations include:

Relationship | Variables
Strong correlation Purchase/Installment (0.7); Purchase_freq/Installment_freq (0.86);
Moderate correlation Balance/Cash_advance (0.49); Balance/Cash_advance_freq (0.45); Purchase_freq/Installment (0.47); Purchase_tran/Installment (0.63); Purchase_tran/Installment_freq (0.54); Max_oneoff/Installment (0.4); Payment/Installment (0.43);

To visualize the associations between variables, correlation coefficients and scatter plots of variables in the above table are derived below.

  1. Scatter plot between Purchase related variables, Balance, Max_oneoff and Installment/Installment_freq
x<-dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran, 
                  Max_oneoff, Payment,
                  Installment, Installment_freq) %>%
  cor()
round(x,3)
                 Purchase Purchase_freq Purchase_tran Max_oneoff Payment
Purchase            1.000         0.386         0.654      0.935   0.616
Purchase_freq       0.386         1.000         0.580      0.260   0.107
Purchase_tran       0.654         0.580         1.000      0.523   0.345
Max_oneoff          0.935         0.260         0.523      1.000   0.573
Payment             0.616         0.107         0.345      0.573   1.000
Installment         0.695         0.472         0.634      0.395   0.433
Installment_freq    0.310         0.863         0.542      0.126   0.102
                 Installment Installment_freq
Purchase               0.695            0.310
Purchase_freq          0.472            0.863
Purchase_tran          0.634            0.542
Max_oneoff             0.395            0.126
Payment                0.433            0.102
Installment            1.000            0.546
Installment_freq       0.546            1.000
dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran, 
               Installment, Installment_freq) %>%
  scatterplotMatrix()

dat_comp %>% select(Max_oneoff, Payment,               
               Installment, Installment_freq) %>%
  scatterplotMatrix()

From the above correlation coefficients and scatter plots, Purchase is a reasonable predictor for Installment, and Purchase_freq can be a predictor for Installment_freq.

  1. Scatter plots between Balance/Cash_advance and Cash_advance_freq
x <-dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
  cor()
round(x,3)
                  Balance Cash_advance Cash_advance_freq
Balance             1.000        0.486             0.448
Cash_advance        0.486        1.000             0.662
Cash_advance_freq   0.448        0.662             1.000
dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
  scatterplotMatrix()

From the above correlation coefficient and scatter plots, Balance can be a weak predictor of Cash_advance and Cash_advance_freq.

Hypothesis and evaluation

From the correlation analysis, we can formulate the following hypotheses:

  1. Hypothesis 1: Installment increases with the purchase amount and purchase frequency.
  2. Hypothesis 2: Cash advance increases with balance of account.

To verify the hypothesis, we classify customers by Purchase, Purchase_freq, and Balance into three groups: High (>75 percentile), Medium (25 - 75 percentiles), and Low (<25 percentile).

The quantiles of variables are:

dat_comp %>% select(Purchase, Purchase_freq, Balance, Installment, Cash_advance) %>% sapply(quantile)
     Purchase Purchase_freq    Balance Installment Cash_advance
0%       0.00      0.000000     0.0000        0.00        0.000
25%     43.08      0.083333   134.5811        0.00        0.000
50%    368.47      0.500000   887.0908       90.00        0.000
75%   1122.30      0.916667  2069.4513      479.52     1105.396
100% 49039.57      1.000000 15532.3397    13184.43    21943.849

The figures show that less than 50% of customers making use of installment and cash advance services.

dat_new <- dat_comp %>% mutate(p_group=ifelse(Purchase>1122, "High", 
                                  ifelse(Purchase<=1122 & Purchase>43, "Medium", "Low")),
                    pf_group=ifelse(Purchase_freq>0.917, "High", 
                                  ifelse(Purchase_freq<=0.917 & Purchase_freq>0.083, 
                                         "Medium", "Low")),
                    b_group=ifelse(Balance>2069, "High", 
                                  ifelse(Balance<=2069 & Balance>135, "Medium", "Low")))
# reorder the levels
dat_new$p_group <- factor(dat_new$p_group, levels=c("High", "Medium", "Low"))
dat_new$pf_group <- factor(dat_new$pf_group, levels=c("High", "Medium", "Low"))
dat_new$b_group <- factor(dat_new$b_group, levels=c("High", "Medium", "Low"))
# count the # for each group
cat("Purchase:")
Purchase:
table(dat_new$p_group)

  High Medium    Low 
   723   1441    721 
cat("Purchase freq:")
Purchase freq:
table(dat_new$pf_group)

  High Medium    Low 
   695   1544    646 
cat("Balance:")
Balance:
table(dat_new$b_group)

  High Medium    Low 
   722   1441    722 

H1 - installment and installment frequency

The following program evaluates the effects of using purchase/purchase_tran to classify observations on installment and installment_freq.

x<-dat_new %>% select(Installment, Installment_freq, p_group) %>%
  group_by(p_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3),
            Median_install=round(median(Installment),3),
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))
# use kable because x is a tibble 
kable(x)
p_group Av_install SD_install Median_install Av_install_freq SD_install_freq Median_install_freq
High 1155.117 1392.330 873.49 0.630 0.375 0.750
Medium 237.403 262.050 155.32 0.413 0.380 0.333
Low 1.158 6.193 0.00 0.008 0.050 0.000
x<-dat_new %>% select(Installment, Installment_freq, pf_group) %>%
  group_by(pf_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3),
            Median_install=round(median(Installment),3),
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))
kable(x)
pf_group Av_install SD_install Median_install Av_install_freq SD_install_freq Median_install_freq
High 1009.496 1204.707 643.960 0.853 0.286 1.00
Medium 308.559 664.826 103.295 0.300 0.304 0.25
Low 0.104 2.634 0.000 0.000 0.003 0.00
g1 <- dat_new %>% select(Installment, p_group) %>% 
  reshape2::melt(id="p_group") %>%
  ggplot(aes(x=p_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Group", y="Installment", 
       title="Installment vs Purchase Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g2 <- dat_new %>% select(Installment, pf_group) %>% 
  reshape2::melt(id="pf_group") %>%
  ggplot(aes(x=pf_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Frequency Group", y="Installment", 
       title="Installment vs Purchase Frequency Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g3 <- dat_new %>% select(Installment_freq, p_group) %>% 
  reshape2::melt(id="p_group") %>%
  ggplot(aes(x=p_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Group", y="Installment Frequency", 
       title="Installment Frequency vs Purchase Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g4 <- dat_new %>% select(Installment_freq, pf_group) %>% 
  reshape2::melt(id="pf_group") %>%
  ggplot(aes(x=pf_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Frequency Group", y="Installment Frequency", 
       title="Installment Frequency vs Purchase Frequency Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

grid.arrange(arrangeGrob(g1, g2, nrow=2))

grid.arrange(arrangeGrob(g3, g4, nrow=2))

The descriptive statistics and boxplots suggest that high purchase group has significantly higher installment (Av=1155.12; Median=873.49) than medium purchase group (Av=237.40; Median=155.32) and high purchase frequency group has significantly higher installment frequency (Av=0.85; Median=1) than medium purchase frequency group (Av=0.30; Median=0.25). Low purchase and low purchase frequency group seldom use the installment service.

The following program further explores the effect of grouping the observations by purchase and purchase_freq.

x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>%
  group_by(p_group, pf_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3), 
            Median_install=round(median(Installment),3), 
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))

kable(x[,1:5])
p_group pf_group Av_install SD_install Median_install
High High 1547.622 1466.499 1205.00
High Medium 770.140 1198.253 449.24
Medium High 441.734 274.408 419.88
Medium Medium 175.753 224.329 94.12
Low High 4.000 6.928 0.00
Low Medium 10.495 15.023 0.00
Low Low 0.104 2.634 0.00
kable(x[,c(1:2,6:8)])
p_group pf_group Av_install_freq SD_install_freq Median_install_freq
High High 0.857 0.268 1.000
High Medium 0.408 0.330 0.417
Medium High 0.853 0.298 1.000
Medium Medium 0.280 0.291 0.167
Low High 0.306 0.529 0.000
Low Medium 0.064 0.098 0.000
Low Low 0.000 0.003 0.000
CrossTable(dat_new$p_group, dat_new$pf_group,
           prop.chisq=FALSE, dnn=c("Purchase Group", "Purchase Frequency Group"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2885 

 
               | Purchase Frequency Group 
Purchase Group |      High |    Medium |       Low | Row Total | 
---------------|-----------|-----------|-----------|-----------|
          High |       358 |       365 |         0 |       723 | 
               |     0.495 |     0.505 |     0.000 |     0.251 | 
               |     0.515 |     0.236 |     0.000 |           | 
               |     0.124 |     0.127 |     0.000 |           | 
---------------|-----------|-----------|-----------|-----------|
        Medium |       334 |      1107 |         0 |      1441 | 
               |     0.232 |     0.768 |     0.000 |     0.499 | 
               |     0.481 |     0.717 |     0.000 |           | 
               |     0.116 |     0.384 |     0.000 |           | 
---------------|-----------|-----------|-----------|-----------|
           Low |         3 |        72 |       646 |       721 | 
               |     0.004 |     0.100 |     0.896 |     0.250 | 
               |     0.004 |     0.047 |     1.000 |           | 
               |     0.001 |     0.025 |     0.224 |           | 
---------------|-----------|-----------|-----------|-----------|
  Column Total |       695 |      1544 |       646 |      2885 | 
               |     0.241 |     0.535 |     0.224 |           | 
---------------|-----------|-----------|-----------|-----------|

 
# create combine group for boxplot
x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>% 
  reshape2::melt(id=c("p_group", "pf_group")) %>%
  mutate(c_group=ifelse(p_group=="High", 
                        ifelse(pf_group=="High", "High-High", 
                               ifelse(pf_group=="Medium", "High-Medium", "High-Low")),
                  ifelse(p_group=="Medium", 
                         ifelse(pf_group=="High", "Medium-High", 
                               ifelse(pf_group=="Medium", "Medium-Medium", "Medium-Low")),
                         ifelse(pf_group=="High", "Low-High", 
                               ifelse(pf_group=="Medium", "Low-Medium", "Low-Low"))))) 
x$c_group <- factor(x$c_group, levels=c("High-High", "High-Medium", "High-Low",
                                        "Medium-High", "Medium-Medium","Medium-Low",
                                        "Low-High", "Low-Medium", "Low-Low"))

x %>% filter(variable=="Installment") %>%
  ggplot(aes(x=c_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase + Purchase Frequency", y="Installment", 
       title="Installment vs Purchase + Purchase Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title=element_text(hjust=0.5), legend.position="None") 

x %>% filter(variable=="Installment_freq") %>% 
  ggplot(aes(x=c_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase + Purchase Frequency", y="Installment Frequency", 
       title="Installment vs Purchase + Purchase Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title=element_text(hjust=0.5), legend.position="None") 

y <- x %>% mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$c_group, y$i_group, 
           prop.chisq=FALSE, dnn=c("Purchase-Purchase Freq", "Installment"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5770 

 
                       | Installment 
Purchase-Purchase Freq |       Yes |        No | Row Total | 
-----------------------|-----------|-----------|-----------|
             High-High |       686 |        30 |       716 | 
                       |     0.958 |     0.042 |     0.124 | 
                       |     0.210 |     0.012 |           | 
                       |     0.119 |     0.005 |           | 
-----------------------|-----------|-----------|-----------|
           High-Medium |       532 |       198 |       730 | 
                       |     0.729 |     0.271 |     0.127 | 
                       |     0.163 |     0.079 |           | 
                       |     0.092 |     0.034 |           | 
-----------------------|-----------|-----------|-----------|
           Medium-High |       614 |        54 |       668 | 
                       |     0.919 |     0.081 |     0.116 | 
                       |     0.188 |     0.022 |           | 
                       |     0.106 |     0.009 |           | 
-----------------------|-----------|-----------|-----------|
         Medium-Medium |      1380 |       834 |      2214 | 
                       |     0.623 |     0.377 |     0.384 | 
                       |     0.422 |     0.334 |           | 
                       |     0.239 |     0.145 |           | 
-----------------------|-----------|-----------|-----------|
              Low-High |         2 |         4 |         6 | 
                       |     0.333 |     0.667 |     0.001 | 
                       |     0.001 |     0.002 |           | 
                       |     0.000 |     0.001 |           | 
-----------------------|-----------|-----------|-----------|
            Low-Medium |        55 |        89 |       144 | 
                       |     0.382 |     0.618 |     0.025 | 
                       |     0.017 |     0.036 |           | 
                       |     0.010 |     0.015 |           | 
-----------------------|-----------|-----------|-----------|
               Low-Low |         2 |      1290 |      1292 | 
                       |     0.002 |     0.998 |     0.224 | 
                       |     0.001 |     0.516 |           | 
                       |     0.000 |     0.224 |           | 
-----------------------|-----------|-----------|-----------|
          Column Total |      3271 |      2499 |      5770 | 
                       |     0.567 |     0.433 |           | 
-----------------------|-----------|-----------|-----------|

 

From the descriptive statistics and boxplots, it clearly indicates that applying both the purchase and purchase_freq to grouping the observations can refine the granularity of customer groups that have demands of installment services. In general, high purchase and purchase frequency group is the most likely to use the installment services and has the highest installment spending. In the High-Medium and Medium-Medium groups, the percentage of non-installment customers are 27.1% and 37.7%, respectively. So, target promotion can be devised to stimulate the use of installment services for customers who haven’t made use of the installment services within these two groups of customers.

H2 - cash advance and cash advance frequency

The following program evaluates the effects of using balance to classify observations on cash_advance and cash_advance_freq.

x<- dat_new %>% select(Cash_advance, Cash_advance_freq, b_group) %>%
  group_by(b_group) %>%
  summarise(Av_cash=round(mean(Cash_advance),3), 
            SD_cash=round(sd(Cash_advance),3), 
            Median_cash=round(median(Cash_advance),3), 
            Av_cash_freq=round(mean(Cash_advance_freq),3),
            SD_cash_freq=round(sd(Cash_advance_freq),3),
            Median_cash_freq=round(median(Cash_advance_freq),3))
kable(x)
b_group Av_cash SD_cash Median_cash Av_cash_freq SD_cash_freq Median_cash_freq
High 2378.959 3058.038 1441.826 0.271 0.246 0.250
Medium 661.091 1231.056 36.943 0.123 0.173 0.083
Low 66.712 281.853 0.000 0.017 0.055 0.000
g1 <-dat_new %>% select(Cash_advance, b_group) %>% 
  reshape2::melt(id="b_group") %>%
  ggplot(aes(x=b_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Balance Group", y="Cash advance", 
       title="Cash Advance vs Balance Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g2 <- dat_new %>% select(Cash_advance_freq, b_group) %>% 
  reshape2::melt(id="b_group") %>%
  ggplot(aes(x=b_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Balance Group", y="Cash advance frequency", 
       title="Cash Advance Frequency vs Balance Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

grid.arrange(arrangeGrob(g1, g2, nrow=2))

y <- dat_new %>% select(Cash_advance, b_group) %>%
  reshape2::melt(id="b_group") %>%
  mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$b_group, y$i_group, 
           prop.chisq=FALSE, dnn=c("Balance", "Cash advance"))

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2885 

 
             | Cash advance 
     Balance |       Yes |        No | Row Total | 
-------------|-----------|-----------|-----------|
        High |       567 |       155 |       722 | 
             |     0.785 |     0.215 |     0.250 | 
             |     0.407 |     0.104 |           | 
             |     0.197 |     0.054 |           | 
-------------|-----------|-----------|-----------|
      Medium |       733 |       708 |      1441 | 
             |     0.509 |     0.491 |     0.499 | 
             |     0.527 |     0.474 |           | 
             |     0.254 |     0.245 |           | 
-------------|-----------|-----------|-----------|
         Low |        92 |       630 |       722 | 
             |     0.127 |     0.873 |     0.250 | 
             |     0.066 |     0.422 |           | 
             |     0.032 |     0.218 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1392 |      1493 |      2885 | 
             |     0.482 |     0.518 |           | 
-------------|-----------|-----------|-----------|

 

The results confirm that high balance group tends to make more cash advance than medium or low balance group. Although this phenomenon seems a bit unnatural, 21.5% of high balance and 49.1% of medium balance haven’t made cash advance in the past six months. So, target marketing plan may design for these groups of customers. Overall, the analysis results support the two hypothese.