Introduction

Data analysis is an art. This sample aims to illustrate how to use R to perform exploratory analysis the given problems. Only brief discussion will be included as an illustration.

Descriptive statistics

First, we have to load the libraries.

library(dplyr)
library(ggplot2)
library(psych)
library(corrplot)
library(VIM)
library(gridExtra)
library(car)
library(knitr)
library(gmodels)

dat <- read.csv("C:\\Users\\user\\downloads\\credit_card_usage.csv")

head(dat,10)

   Cust_ID     Balance Balance_freq Purchase Max_oneoff Installment
1    C0001 6493.033342     1.000000   447.50     447.50        0.00
2    C0002 1722.475015     1.000000   115.91      45.65       70.26
3    C0003   78.818407     0.500000     0.00       0.00        0.00
4    C0004    5.644904     0.454545    47.50       0.00       47.50
5    C0005 1466.919250     1.000000     0.00       0.00        0.00
6    C0006    8.844399     0.181818   116.68       0.00      116.68
7    C0007 1205.024103     1.000000  1247.43       0.00     1247.43
8    C0008 2468.506177     1.000000     0.00       0.00        0.00
9    C0009 1160.206649     0.900000  1600.00    1400.00      200.00
10   C0010 6002.618198     1.000000     0.00       0.00        0.00
   Cash_advance Purchase_freq Oneoff_freq Installment_freq Cash_advance_freq
1    1664.47374      0.416667    0.416667         0.000000          0.583333
2     146.60786      0.500000    0.083333         0.500000          0.250000
3    1113.18608      0.000000    0.000000         0.000000          0.166667
4       0.00000      0.250000    0.000000         0.166667          0.000000
5      37.69126      0.000000    0.000000         0.000000          0.166667
6       0.00000      0.166667    0.000000         0.166667          0.000000
7       0.00000      1.000000    0.000000         1.000000          0.000000
8     648.80863      0.000000    0.000000         0.000000          0.250000
9       0.00000      0.300000    0.100000         0.100000          0.000000
10   1163.46222      0.000000    0.000000         0.000000          0.500000
   Cash_advance_tran Purchase_tran Credit_limit    Payment Min_payment
1                  8             6         8000 1907.59974  1874.45719
2                  4             7         1900  442.94462   813.97454
3                  7             0         1200 1397.77013    21.82119
4                  0             3         3000   70.55406    42.43471
5                  2             0         1500  323.64218   429.53837
6                  0             2         3000  677.98290   123.10715
7                  0            41         1000 1314.35815  2812.82049
8                  4             0         2500  601.74019  1085.79506
9                  0             5         2000  876.75728   474.63298
10                10             0         6500 1498.94815  1837.99886
   Full_payment_perc Tenure
1           0.000000     12
2           0.000000     12
3           0.333333      6
4           0.100000     12
5           0.000000     12
6           0.666667     12
7           0.000000     12
8           0.000000     12
9           0.000000     10
10          0.000000     12

The following program produces basic descriptive statistics and boxplots of the raw data.

describe(dat[,2:18])

                  vars    n    mean      sd  median trimmed     mad    min
Balance              1 3000 1520.10 2032.05  845.61 1089.89 1155.17   0.00
Balance_freq         2 3000    0.87    0.25    1.00    0.93    0.00   0.00
Purchase             3 3000  988.76 2152.37  360.05  579.32  533.81   0.00
Max_oneoff           4 3000  589.94 1684.35   29.30  264.21   43.44   0.00
Installment          5 3000  399.32  832.63   86.80  222.13  128.69   0.00
Cash_advance         6 3000  921.62 1939.51    0.00  463.63    0.00   0.00
Purchase_freq        7 3000    0.49    0.40    0.43    0.48    0.64   0.00
Oneoff_freq          8 3000    0.20    0.29    0.08    0.13    0.12   0.00
Installment_freq     9 3000    0.36    0.40    0.17    0.33    0.25   0.00
Cash_advance_freq   10 3000    0.13    0.20    0.00    0.09    0.00   0.00
Cash_advance_tran   11 3000    3.08    6.75    0.00    1.60    0.00   0.00
Purchase_tran       12 3000   14.55   24.03    7.00    9.49   10.38   0.00
Credit_limit        13 2999 4444.50 3672.98 3000.00 3843.97 2668.68 150.00
Payment             14 3000 1721.28 2914.19  863.78 1144.99  870.55   0.00
Min_payment         15 2886  810.09 1776.23  308.93  477.43  279.46   0.12
Full_payment_perc   16 3000    0.15    0.29    0.00    0.08    0.00   0.00
Tenure              17 3000   11.49    1.38   12.00   11.90    0.00   6.00
                       max    range  skew kurtosis    se
Balance           15532.34 15532.34  2.32     6.76 37.10
Balance_freq          1.00     1.00 -1.93     2.69  0.00
Purchase          49039.57 49039.57  9.29   146.84 39.30
Max_oneoff        40761.25 40761.25 10.70   187.28 30.75
Installment       13184.43 13184.43  5.74    52.80 15.20
Cash_advance      21943.85 21943.85  4.06    23.88 35.41
Purchase_freq         1.00     1.00  0.08    -1.64  0.01
Oneoff_freq           1.00     1.00  1.57     1.28  0.01
Installment_freq      1.00     1.00  0.52    -1.40  0.01
Cash_advance_freq     1.17     1.17  1.86     3.45  0.00
Cash_advance_tran   123.00   123.00  6.65    80.97  0.12
Purchase_tran       248.00   248.00  4.14    24.79  0.44
Credit_limit      28000.00 27850.00  1.57     2.78 67.07
Payment           46930.60 46930.60  6.40    64.51 53.21
Min_payment       30528.43 30528.32  8.30    99.23 33.06
Full_payment_perc     1.00     1.00  1.96     2.54  0.01
Tenure               12.00     6.00 -2.85     7.07  0.03

# re-order the columns to facilitate the correlation analysis
dat <- dat %>% select(Cust_ID, Tenure, Credit_limit, Balance, Balance_freq, 
                                Purchase,Purchase_freq, Purchase_tran,
                                Max_oneoff, Oneoff_freq,
                                Payment, Min_payment, Full_payment_perc,
                                Installment, Installment_freq,
                                Cash_advance, Cash_advance_freq, Cash_advance_tran)

## boxplot by group
dat %>% select(Cust_ID, Credit_limit, Balance, Purchase, Max_oneoff, 
               Payment, Min_payment) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Balance/Purchase/Max_Oneoff/Payment") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Balance_freq, Purchase_freq, Oneoff_freq, Cash_advance_freq,
                    Full_payment_perc) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Balance/Purchase/Oneoff/Cash advance/Full payment - Frequency") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Installment, Cash_advance) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Installment/Cash advance") +
  theme(plot.title=element_text(hjust=0.5))

dat %>% select(Cust_ID, Cash_advance_tran, Tenure) %>%
  reshape2::melt(id="Cust_ID") %>%
  ggplot(aes(x=variable, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Characterisitcs", y="Value", 
       title="Cash advance transaction/Tenure") +
  theme(plot.title=element_text(hjust=0.5))

From the descriptive statistics and boxplots, the distributions of Purchase, Purchase_tran, Max_oneoff, Installment, Cash_advance, Cash_advance_tran, Credit_limit, Payment, and Min_payment are highly positive skew. From the boxplot, all variables except purchase_freq and tenure have substantial numbers of outliers.

Missing data analysis

The following program analyzes the missing data in the dataset.

cat("Missing by count\n")

Missing by count

sapply(dat[,2:18], function(x) round(sum(is.na(x)),2))

           Tenure      Credit_limit           Balance      Balance_freq 
                0                 1                 0                 0 
         Purchase     Purchase_freq     Purchase_tran        Max_oneoff 
                0                 0                 0                 0 
      Oneoff_freq           Payment       Min_payment Full_payment_perc 
                0                 0               114                 0 
      Installment  Installment_freq      Cash_advance Cash_advance_freq 
                0                 0                 0                 0 
Cash_advance_tran 
                0

cat("Missing by percentage\n")

Missing by percentage

sapply(dat[,2:18], function(x) round(sum(is.na(x))/nrow(dat),2))

           Tenure      Credit_limit           Balance      Balance_freq 
             0.00              0.00              0.00              0.00 
         Purchase     Purchase_freq     Purchase_tran        Max_oneoff 
             0.00              0.00              0.00              0.00 
      Oneoff_freq           Payment       Min_payment Full_payment_perc 
             0.00              0.00              0.04              0.00 
      Installment  Installment_freq      Cash_advance Cash_advance_freq 
             0.00              0.00              0.00              0.00 
Cash_advance_tran 
             0.00

aggr(dat, prop=FALSE, numbers=TRUE)

aggr(dat, prop=TRUE, numbers=FALSE)

matrixplot(dat, interactive=FALSE) # without sorting

The data has a total of 115 missing value with 114 found in Min_payment, which is about 4%. The missing value seems to be randomly distributed in the dataset. Since Min_payment is one of the key variables related to customer’s credit usage, the variable cannot be excluded in the analysis. So, the missing data may better be handled by listwise deletion.

dat_comp <- dat[complete.cases(dat),]

Correlation analysis

Correlation analysis can be performed to identify highly related variables that can be used to segment the customers for promoting the installment and cash advance services.

dat_comp %>% 
  select(Credit_limit:Full_payment_perc) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Balance/Purchase/Oneoff/Payment",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.4)

dat_comp %>% 
  select(Credit_limit:Balance_freq, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Credit_limit/Balance/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Purchase:Purchase_tran, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Purchase/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Max_oneoff:Oneoff_freq, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Oneoff/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

dat_comp %>% 
  select(Payment:Full_payment_perc, Installment:Cash_advance_tran) %>%
  cor() %>%
  round(3) %>%
  corrplot(method = "color", addCoef.col="white", type = "upper", 
           title="Correlation - Payment/Installment/Cash Advance",
           mar=c(0,0,2,0),
           tl.cex=0.5, number.cex = 0.5)

The following rule of thumb is applied to evaluate the correlations between variables: +==================================================+ |Correlation coefficient | Interpretation | +———————– | ————————+ |0.90 - 1.00 | Very strong correlation | |0.70 - 0.89 | Strong correation | |0.40 - 0.69 | Moderate correation | |0.10 - 0.39 | Weak correation | |0.00 - 0.10 | Negligible correation | +==================================================+

From the correlation plots, Purchase and Max_oneoff have very strong correlation. For installment/cash advance attributes, moderate to strong correlations include:

Relationship \| Variables
Strong correlation	Purchase/Installment (0.7); Purchase_freq/Installment_freq (0.86);
Moderate correlation	Balance/Cash_advance (0.49); Balance/Cash_advance_freq (0.45); Purchase_freq/Installment (0.47); Purchase_tran/Installment (0.63); Purchase_tran/Installment_freq (0.54); Max_oneoff/Installment (0.4); Payment/Installment (0.43);

To visualize the associations between variables, correlation coefficients and scatter plots of variables in the above table are derived below.

Scatter plot between Purchase related variables, Balance, Max_oneoff and Installment/Installment_freq

x<-dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran, 
                  Max_oneoff, Payment,
                  Installment, Installment_freq) %>%
  cor()
round(x,3)

                 Purchase Purchase_freq Purchase_tran Max_oneoff Payment
Purchase            1.000         0.386         0.654      0.935   0.616
Purchase_freq       0.386         1.000         0.580      0.260   0.107
Purchase_tran       0.654         0.580         1.000      0.523   0.345
Max_oneoff          0.935         0.260         0.523      1.000   0.573
Payment             0.616         0.107         0.345      0.573   1.000
Installment         0.695         0.472         0.634      0.395   0.433
Installment_freq    0.310         0.863         0.542      0.126   0.102
                 Installment Installment_freq
Purchase               0.695            0.310
Purchase_freq          0.472            0.863
Purchase_tran          0.634            0.542
Max_oneoff             0.395            0.126
Payment                0.433            0.102
Installment            1.000            0.546
Installment_freq       0.546            1.000

dat_comp %>% select(Purchase, Purchase_freq, Purchase_tran, 
               Installment, Installment_freq) %>%
  scatterplotMatrix()

dat_comp %>% select(Max_oneoff, Payment,               
               Installment, Installment_freq) %>%
  scatterplotMatrix()

From the above correlation coefficients and scatter plots, Purchase is a reasonable predictor for Installment, and Purchase_freq can be a predictor for Installment_freq.

Scatter plots between Balance/Cash_advance and Cash_advance_freq

x <-dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
  cor()
round(x,3)

                  Balance Cash_advance Cash_advance_freq
Balance             1.000        0.486             0.448
Cash_advance        0.486        1.000             0.662
Cash_advance_freq   0.448        0.662             1.000

dat_comp %>% select(Balance, Cash_advance, Cash_advance_freq) %>%
  scatterplotMatrix()

From the above correlation coefficient and scatter plots, Balance can be a weak predictor of Cash_advance and Cash_advance_freq.

Hypothesis and evaluation

From the correlation analysis, we can formulate the following hypotheses:

Hypothesis 1: Installment increases with the purchase amount and purchase frequency.
Hypothesis 2: Cash advance increases with balance of account.

To verify the hypothesis, we classify customers by Purchase, Purchase_freq, and Balance into three groups: High (>75 percentile), Medium (25 - 75 percentiles), and Low (<25 percentile).

The quantiles of variables are:

dat_comp %>% select(Purchase, Purchase_freq, Balance, Installment, Cash_advance) %>% sapply(quantile)

     Purchase Purchase_freq    Balance Installment Cash_advance
0%       0.00      0.000000     0.0000        0.00        0.000
25%     43.08      0.083333   134.5811        0.00        0.000
50%    368.47      0.500000   887.0908       90.00        0.000
75%   1122.30      0.916667  2069.4513      479.52     1105.396
100% 49039.57      1.000000 15532.3397    13184.43    21943.849

The figures show that less than 50% of customers making use of installment and cash advance services.

dat_new <- dat_comp %>% mutate(p_group=ifelse(Purchase>1122, "High", 
                                  ifelse(Purchase<=1122 & Purchase>43, "Medium", "Low")),
                    pf_group=ifelse(Purchase_freq>0.917, "High", 
                                  ifelse(Purchase_freq<=0.917 & Purchase_freq>0.083, 
                                         "Medium", "Low")),
                    b_group=ifelse(Balance>2069, "High", 
                                  ifelse(Balance<=2069 & Balance>135, "Medium", "Low")))
# reorder the levels
dat_new$p_group <- factor(dat_new$p_group, levels=c("High", "Medium", "Low"))
dat_new$pf_group <- factor(dat_new$pf_group, levels=c("High", "Medium", "Low"))
dat_new$b_group <- factor(dat_new$b_group, levels=c("High", "Medium", "Low"))
# count the # for each group
cat("Purchase:")

Purchase:

table(dat_new$p_group)


  High Medium    Low 
   723   1441    721

cat("Purchase freq:")

Purchase freq:

table(dat_new$pf_group)


  High Medium    Low 
   695   1544    646

cat("Balance:")

Balance:

table(dat_new$b_group)


  High Medium    Low 
   722   1441    722

H1 - installment and installment frequency

The following program evaluates the effects of using purchase/purchase_tran to classify observations on installment and installment_freq.

x<-dat_new %>% select(Installment, Installment_freq, p_group) %>%
  group_by(p_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3),
            Median_install=round(median(Installment),3),
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))
# use kable because x is a tibble 
kable(x)

p_group	Av_install	SD_install	Median_install	Av_install_freq	SD_install_freq	Median_install_freq
High	1155.117	1392.330	873.49	0.630	0.375	0.750
Medium	237.403	262.050	155.32	0.413	0.380	0.333
Low	1.158	6.193	0.00	0.008	0.050	0.000

x<-dat_new %>% select(Installment, Installment_freq, pf_group) %>%
  group_by(pf_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3),
            Median_install=round(median(Installment),3),
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))
kable(x)

pf_group	Av_install	SD_install	Median_install	Av_install_freq	SD_install_freq	Median_install_freq
High	1009.496	1204.707	643.960	0.853	0.286	1.00
Medium	308.559	664.826	103.295	0.300	0.304	0.25
Low	0.104	2.634	0.000	0.000	0.003	0.00

g1 <- dat_new %>% select(Installment, p_group) %>% 
  reshape2::melt(id="p_group") %>%
  ggplot(aes(x=p_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Group", y="Installment", 
       title="Installment vs Purchase Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g2 <- dat_new %>% select(Installment, pf_group) %>% 
  reshape2::melt(id="pf_group") %>%
  ggplot(aes(x=pf_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Frequency Group", y="Installment", 
       title="Installment vs Purchase Frequency Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g3 <- dat_new %>% select(Installment_freq, p_group) %>% 
  reshape2::melt(id="p_group") %>%
  ggplot(aes(x=p_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Group", y="Installment Frequency", 
       title="Installment Frequency vs Purchase Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g4 <- dat_new %>% select(Installment_freq, pf_group) %>% 
  reshape2::melt(id="pf_group") %>%
  ggplot(aes(x=pf_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase Frequency Group", y="Installment Frequency", 
       title="Installment Frequency vs Purchase Frequency Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

grid.arrange(arrangeGrob(g1, g2, nrow=2))

grid.arrange(arrangeGrob(g3, g4, nrow=2))

The descriptive statistics and boxplots suggest that high purchase group has significantly higher installment (Av=1155.12; Median=873.49) than medium purchase group (Av=237.40; Median=155.32) and high purchase frequency group has significantly higher installment frequency (Av=0.85; Median=1) than medium purchase frequency group (Av=0.30; Median=0.25). Low purchase and low purchase frequency group seldom use the installment service.

The following program further explores the effect of grouping the observations by purchase and purchase_freq.

x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>%
  group_by(p_group, pf_group) %>%
  summarise(Av_install=round(mean(Installment),3), 
            SD_install=round(sd(Installment),3), 
            Median_install=round(median(Installment),3), 
            Av_install_freq=round(mean(Installment_freq),3),
            SD_install_freq=round(sd(Installment_freq),3),
            Median_install_freq=round(median(Installment_freq),3))

kable(x[,1:5])

p_group	pf_group	Av_install	SD_install	Median_install
High	High	1547.622	1466.499	1205.00
High	Medium	770.140	1198.253	449.24
Medium	High	441.734	274.408	419.88
Medium	Medium	175.753	224.329	94.12
Low	High	4.000	6.928	0.00
Low	Medium	10.495	15.023	0.00
Low	Low	0.104	2.634	0.00

kable(x[,c(1:2,6:8)])

p_group	pf_group	Av_install_freq	SD_install_freq	Median_install_freq
High	High	0.857	0.268	1.000
High	Medium	0.408	0.330	0.417
Medium	High	0.853	0.298	1.000
Medium	Medium	0.280	0.291	0.167
Low	High	0.306	0.529	0.000
Low	Medium	0.064	0.098	0.000
Low	Low	0.000	0.003	0.000

CrossTable(dat_new$p_group, dat_new$pf_group,
           prop.chisq=FALSE, dnn=c("Purchase Group", "Purchase Frequency Group"))


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2885 

 
               | Purchase Frequency Group 
Purchase Group |      High |    Medium |       Low | Row Total | 
---------------|-----------|-----------|-----------|-----------|
          High |       358 |       365 |         0 |       723 | 
               |     0.495 |     0.505 |     0.000 |     0.251 | 
               |     0.515 |     0.236 |     0.000 |           | 
               |     0.124 |     0.127 |     0.000 |           | 
---------------|-----------|-----------|-----------|-----------|
        Medium |       334 |      1107 |         0 |      1441 | 
               |     0.232 |     0.768 |     0.000 |     0.499 | 
               |     0.481 |     0.717 |     0.000 |           | 
               |     0.116 |     0.384 |     0.000 |           | 
---------------|-----------|-----------|-----------|-----------|
           Low |         3 |        72 |       646 |       721 | 
               |     0.004 |     0.100 |     0.896 |     0.250 | 
               |     0.004 |     0.047 |     1.000 |           | 
               |     0.001 |     0.025 |     0.224 |           | 
---------------|-----------|-----------|-----------|-----------|
  Column Total |       695 |      1544 |       646 |      2885 | 
               |     0.241 |     0.535 |     0.224 |           | 
---------------|-----------|-----------|-----------|-----------|

# create combine group for boxplot
x<- dat_new %>% select(Installment, Installment_freq, p_group, pf_group) %>% 
  reshape2::melt(id=c("p_group", "pf_group")) %>%
  mutate(c_group=ifelse(p_group=="High", 
                        ifelse(pf_group=="High", "High-High", 
                               ifelse(pf_group=="Medium", "High-Medium", "High-Low")),
                  ifelse(p_group=="Medium", 
                         ifelse(pf_group=="High", "Medium-High", 
                               ifelse(pf_group=="Medium", "Medium-Medium", "Medium-Low")),
                         ifelse(pf_group=="High", "Low-High", 
                               ifelse(pf_group=="Medium", "Low-Medium", "Low-Low"))))) 
x$c_group <- factor(x$c_group, levels=c("High-High", "High-Medium", "High-Low",
                                        "Medium-High", "Medium-Medium","Medium-Low",
                                        "Low-High", "Low-Medium", "Low-Low"))

x %>% filter(variable=="Installment") %>%
  ggplot(aes(x=c_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase + Purchase Frequency", y="Installment", 
       title="Installment vs Purchase + Purchase Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title=element_text(hjust=0.5), legend.position="None")

x %>% filter(variable=="Installment_freq") %>% 
  ggplot(aes(x=c_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Purchase + Purchase Frequency", y="Installment Frequency", 
       title="Installment vs Purchase + Purchase Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title=element_text(hjust=0.5), legend.position="None")

y <- x %>% mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$c_group, y$i_group, 
           prop.chisq=FALSE, dnn=c("Purchase-Purchase Freq", "Installment"))


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  5770 

 
                       | Installment 
Purchase-Purchase Freq |       Yes |        No | Row Total | 
-----------------------|-----------|-----------|-----------|
             High-High |       686 |        30 |       716 | 
                       |     0.958 |     0.042 |     0.124 | 
                       |     0.210 |     0.012 |           | 
                       |     0.119 |     0.005 |           | 
-----------------------|-----------|-----------|-----------|
           High-Medium |       532 |       198 |       730 | 
                       |     0.729 |     0.271 |     0.127 | 
                       |     0.163 |     0.079 |           | 
                       |     0.092 |     0.034 |           | 
-----------------------|-----------|-----------|-----------|
           Medium-High |       614 |        54 |       668 | 
                       |     0.919 |     0.081 |     0.116 | 
                       |     0.188 |     0.022 |           | 
                       |     0.106 |     0.009 |           | 
-----------------------|-----------|-----------|-----------|
         Medium-Medium |      1380 |       834 |      2214 | 
                       |     0.623 |     0.377 |     0.384 | 
                       |     0.422 |     0.334 |           | 
                       |     0.239 |     0.145 |           | 
-----------------------|-----------|-----------|-----------|
              Low-High |         2 |         4 |         6 | 
                       |     0.333 |     0.667 |     0.001 | 
                       |     0.001 |     0.002 |           | 
                       |     0.000 |     0.001 |           | 
-----------------------|-----------|-----------|-----------|
            Low-Medium |        55 |        89 |       144 | 
                       |     0.382 |     0.618 |     0.025 | 
                       |     0.017 |     0.036 |           | 
                       |     0.010 |     0.015 |           | 
-----------------------|-----------|-----------|-----------|
               Low-Low |         2 |      1290 |      1292 | 
                       |     0.002 |     0.998 |     0.224 | 
                       |     0.001 |     0.516 |           | 
                       |     0.000 |     0.224 |           | 
-----------------------|-----------|-----------|-----------|
          Column Total |      3271 |      2499 |      5770 | 
                       |     0.567 |     0.433 |           | 
-----------------------|-----------|-----------|-----------|

From the descriptive statistics and boxplots, it clearly indicates that applying both the purchase and purchase_freq to grouping the observations can refine the granularity of customer groups that have demands of installment services. In general, high purchase and purchase frequency group is the most likely to use the installment services and has the highest installment spending. In the High-Medium and Medium-Medium groups, the percentage of non-installment customers are 27.1% and 37.7%, respectively. So, target promotion can be devised to stimulate the use of installment services for customers who haven’t made use of the installment services within these two groups of customers.

H2 - cash advance and cash advance frequency

The following program evaluates the effects of using balance to classify observations on cash_advance and cash_advance_freq.

x<- dat_new %>% select(Cash_advance, Cash_advance_freq, b_group) %>%
  group_by(b_group) %>%
  summarise(Av_cash=round(mean(Cash_advance),3), 
            SD_cash=round(sd(Cash_advance),3), 
            Median_cash=round(median(Cash_advance),3), 
            Av_cash_freq=round(mean(Cash_advance_freq),3),
            SD_cash_freq=round(sd(Cash_advance_freq),3),
            Median_cash_freq=round(median(Cash_advance_freq),3))
kable(x)

b_group	Av_cash	SD_cash	Median_cash	Av_cash_freq	SD_cash_freq	Median_cash_freq
High	2378.959	3058.038	1441.826	0.271	0.246	0.250
Medium	661.091	1231.056	36.943	0.123	0.173	0.083
Low	66.712	281.853	0.000	0.017	0.055	0.000

g1 <-dat_new %>% select(Cash_advance, b_group) %>% 
  reshape2::melt(id="b_group") %>%
  ggplot(aes(x=b_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Balance Group", y="Cash advance", 
       title="Cash Advance vs Balance Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

g2 <- dat_new %>% select(Cash_advance_freq, b_group) %>% 
  reshape2::melt(id="b_group") %>%
  ggplot(aes(x=b_group, y=value, fill=variable))  + 
  geom_boxplot() +
  labs(x="Balance Group", y="Cash advance frequency", 
       title="Cash Advance Frequency vs Balance Group") +
  theme(plot.title=element_text(hjust=0.5), legend.position="None") 

grid.arrange(arrangeGrob(g1, g2, nrow=2))

y <- dat_new %>% select(Cash_advance, b_group) %>%
  reshape2::melt(id="b_group") %>%
  mutate(i_group=ifelse(value>0, "Yes", "No"))
y$i_group <- factor(y$i_group, levels=c("Yes", "No"))
CrossTable(y$b_group, y$i_group, 
           prop.chisq=FALSE, dnn=c("Balance", "Cash advance"))


 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  2885 

 
             | Cash advance 
     Balance |       Yes |        No | Row Total | 
-------------|-----------|-----------|-----------|
        High |       567 |       155 |       722 | 
             |     0.785 |     0.215 |     0.250 | 
             |     0.407 |     0.104 |           | 
             |     0.197 |     0.054 |           | 
-------------|-----------|-----------|-----------|
      Medium |       733 |       708 |      1441 | 
             |     0.509 |     0.491 |     0.499 | 
             |     0.527 |     0.474 |           | 
             |     0.254 |     0.245 |           | 
-------------|-----------|-----------|-----------|
         Low |        92 |       630 |       722 | 
             |     0.127 |     0.873 |     0.250 | 
             |     0.066 |     0.422 |           | 
             |     0.032 |     0.218 |           | 
-------------|-----------|-----------|-----------|
Column Total |      1392 |      1493 |      2885 | 
             |     0.482 |     0.518 |           | 
-------------|-----------|-----------|-----------|

The results confirm that high balance group tends to make more cash advance than medium or low balance group. Although this phenomenon seems a bit unnatural, 21.5% of high balance and 49.1% of medium balance haven’t made cash advance in the past six months. So, target marketing plan may design for these groups of customers. Overall, the analysis results support the two hypothese.

Exploratory Data Analysis

Lumumba V. Wandera