A copy of this document can also be found at: https://rpubs.com/khalilc002/PilgrimBank

This document will discuss Case I: Pilgrim Data, specifically the relationship, if any, between a customer using the banks online services and the profit made from the customer.

The data was loaded and the required libraries attached. Libraries used in this analysis include stats, plyr , dplyr, base, stats, and ggplot2.

The columns and rows can be observed using the head and summary functions. Looking at this we can see many values are missing, specifically from the 2000 data and 2009’s age and income.

  ID X9Profit X9Online X9Age X9Inc X9Tenure X9District X0Profit X0Online
1  1       21        0    NA    NA     6.33       1200       NA       NA
2  2       -6        0     6     3    29.50       1200      -32        0
3  3      -49        1     5     5    26.41       1100      -22        1
4  4       -4        0    NA    NA     2.25       1200       NA       NA
5  5      -61        0     2     9     9.91       1200       -4        0
6  6      -38        0    NA     3     2.33       1300       14        0
  X9Billpay X0Billpay
1         0        NA
2         0         0
3         0         0
4         0        NA
5         0         0
6         0         0
       ID           X9Profit         X9Online          X9Age      
 Min.   :    1   Min.   :-221.0   Min.   :0.0000   Min.   :1.000  
 1st Qu.: 7909   1st Qu.: -34.0   1st Qu.:0.0000   1st Qu.:3.000  
 Median :15818   Median :   9.0   Median :0.0000   Median :4.000  
 Mean   :15818   Mean   : 111.5   Mean   :0.1218   Mean   :4.046  
 3rd Qu.:23726   3rd Qu.: 164.0   3rd Qu.:0.0000   3rd Qu.:5.000  
 Max.   :31634   Max.   :2071.0   Max.   :1.0000   Max.   :7.000  
                                                   NA's   :8289   
     X9Inc          X9Tenure       X9District      X0Profit      
 Min.   :1.000   Min.   : 0.16   Min.   :1100   Min.   :-5643.0  
 1st Qu.:4.000   1st Qu.: 3.75   1st Qu.:1200   1st Qu.:  -30.0  
 Median :6.000   Median : 7.41   Median :1200   Median :   23.0  
 Mean   :5.459   Mean   :10.16   Mean   :1203   Mean   :  144.8  
 3rd Qu.:7.000   3rd Qu.:14.75   3rd Qu.:1200   3rd Qu.:  206.0  
 Max.   :9.000   Max.   :41.16   Max.   :1300   Max.   :27086.0  
 NA's   :8261                                   NA's   :5238     
    X0Online       X9Billpay         X0Billpay   
 Min.   :0.000   Min.   :0.00000   Min.   :0.00  
 1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00  
 Median :0.000   Median :0.00000   Median :0.00  
 Mean   :0.199   Mean   :0.01669   Mean   :0.03  
 3rd Qu.:0.000   3rd Qu.:0.00000   3rd Qu.:0.00  
 Max.   :1.000   Max.   :1.00000   Max.   :1.00  
 NA's   :5219                      NA's   :5219  

Before completing an analysis it is imparative to have our columns assigned the right variable type. This avoids categorical data being assessed as numerical and vice versa.

        ID   X9Profit   X9Online      X9Age      X9Inc   X9Tenure X9District 
 "integer"  "integer"  "integer"  "integer"  "integer"  "numeric"  "integer" 
  X0Profit   X0Online  X9Billpay  X0Billpay 
 "integer"  "integer"  "integer"  "integer" 

The raw data has both numerical and character data. Columns starting with X9 represent 2019 data while columns starting with X0 represent 2000 data.

Column Original Data Type Required Data Type Information
1. ID Numbers Integer Character Categorical information represented with numbers.
2. Profit Integer Integer Left as an integer as no decimals are observed in profit amounts.
3. Online Character Character Categorical and binary data representing if the customer uses online services.
4. Age Integer Charactor Categorical, representing numerical intervals of the age of the user.
5. Income Integer Character Categorical, representing numerical intervals of the income of the user.
6. Tenure Numeric Numeric Continuous variables respresenting time with bank.
7. District Integer Character Categorical, representing district customer is in.
8. Billpay Integer Character Categorical and binary data

Age and income are currently represented as an integer, this is incorrect as each number represents a category in a scale. The age and income levels have been represented with intervals. The character types are corrected.

In order to avoid skewed data, the rows with missing information are deleted. There are originally 10551 rows with missing data, primarily from the Age, Income columns from 2019 and many rows in 2020 are missing data. Since we will not be looking at data from 2020, these columns are removed before the rows with missing cases. This results in a removal of 8822 rows. Once this is completed, we can see there is no missing data.

sum(!complete.cases(PG))
[1] 10551
PDC2 <-select(PG, -c(8, 9, 11))
sum(!complete.cases(PDC2))
[1] 8822
PDC2 <- na.omit(PDC2)
sum(!complete.cases(PDC2))
[1] 0

The final mannipulation to the data before completing a t-test will be to change online usage variables 1 to Y and 0 to N.

PDC2[PDC2$X9Online == 1, "X9Online"] <- "Y"
PDC2[PDC2$X9Online == 0, "X9Online"] <- "N"

Customer profit is plotted on a boxplot, based on their online platform status:

BP <- boxplot((PDC2$X9Profit ~ PDC2$X9Online), main="Customer Profitability and Online Platform Usage", xlab = "Online Platform Usege", ylab="Profitability ($)")

A t-test can be completed to determine the significance of the relationship (if any) between users of the online platform and those who who do not use the banks online platform.

t-test set up:
Group 1: Users of online platform
Group 2: Nonusers of online platform
Null hypothesis : mean of group 1= mean of group 2
Alternatvie hypothesis : the means of both groups are not equal
Confidence interval: 95%
Alpha : 0.05

var(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] 84312.07
var(PDC2$X9Profit[PDC2$X9Online=="N"])
[1] 79368.14
t.test(PDC2$X9Profit ~ PDC2$X9Online, mu=0, alt="two.sided", conf=0.95, var.eq=F, paired=F)

    Welch Two Sample t-test

data:  PDC2$X9Profit by PDC2$X9Online
t = -0.87703, df = 3826.8, p-value = 0.3805
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -16.186479   6.180837
sample estimates:
mean in group N mean in group Y 
       126.5216        131.5244 

The p-value returned is 0.3805. p-value > a. Therefore we fail to reject the null hypothesis. We cannot conclude there is a significant difference between the profits of online and non-online users based on the profit and online status of a customer.

As the dataset shows extreme cases, the above t-test was completed without outliers from both sets of users. Outliers were identified based on being 1.5 IQR above or below the mean.

quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.25 )
25% 
-40 
quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.75 )
75% 
214 
quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.50 )
50% 
 27 
IQR(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] 254
fivenum(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] -220  -40   27  214 1979
OutlierRange <-IQR(PDC2$X9Profit[PDC2$X9Online=="Y"]) * 1.5
print(OutlierRange)
[1] 381
OutlierLower <- quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.25 ) - OutlierRange
print(OutlierLower)
 25% 
-421 
OutlierUpper <- quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.75 ) + OutlierRange
print(OutlierUpper)
75% 
595 
#remove lines with Y and profit less than -433
OutlierLower<-as.numeric(OutlierLower)
d<-PDC2[!(PDC2$X9Online=="Y" & PDC2$X9Profit < OutlierLower),]
d<-d[!(d$X9Online=="Y" & d$X9Profit > OutlierUpper),]
#do the same for X9Online = N
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.25 )
## 25% 
## -33
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.75 )
## 75% 
## 194
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.50 )
##  50% 
## 20.5
IQR(PDC2$X9Profit[PDC2$X9Online=="N"])
## [1] 227
fivenum(PDC2$X9Profit[PDC2$X9Online=="N"])
## [1] -221.0  -33.0   20.5  194.0 2071.0
OutlierRange2 <-IQR(PDC2$X9Profit[PDC2$X9Online=="N"]) * 1.5
print(OutlierRange2)
## [1] 340.5
OutlierLower2 <- quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.25 ) - OutlierRange2
print(OutlierLower2)
##    25% 
## -373.5
OutlierUpper2 <- quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.75 ) + OutlierRange2
print(OutlierUpper2)
##   75% 
## 534.5
OutlierLower2<-as.numeric(OutlierLower2)
d<-d[!(d$X9Online=="N" & d$X9Profit < OutlierLower2),]
d<-d[!(d$X9Online=="N" & d$X9Profit > OutlierUpper2),]

A new boxplot shows the data without the outliers.

BP <- boxplot((d$X9Profit ~ d$X9Online), main="Customer Profitability and Online Platform Usage", xlab = "Online Platform Usege", ylab="Profitability($)", sub="Data Shown Excluding Outliers")

t-test set up:
Data without outliers
Group 1: Users of online platform
Group 2: Nonusers of online platform
Null hypothesis : mean of group 1= mean of group 2
Alternatvie hypothesis : the means of both groups are not equal
Confidence interval: 95%
Alpha : 0.05

var(d$X9Profit[d$X9Online=="Y"])
[1] 29693.79
var(d$X9Profit[d$X9Online=="N"])
[1] 23181.19
t.test(d$X9Profit ~ d$X9Online, mu=0, alt="two.sided", conf=0.95, var.eq=F, paired=F)

    Welch Two Sample t-test

data:  d$X9Profit by d$X9Online
t = -1.9313, df = 3405.4, p-value = 0.05353
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.5470313   0.1023054
sample estimates:
mean in group N mean in group Y 
       61.69634        68.41871 

The p-value 0.05353 > a. without the outliers therefore we again , fail to reject the null hypothesis. However, as the as the outliers compose 1768/22812 customers (7.75%) of the customers, they should be included in the results to avoid a sample bias. The outliers have a significant effect on the data, and to understand this effect, other factors should be observed in the next iterations.

Linear regression model for the original data set PDC2 was completed.The regression equation returned is y = 5.003(x) + 126.522 where y represents the profit and x represents online status. This regression model cannot be accepted with a p-value of 0.37 for the “OnlineY” parameter and very small variance. Ultimately, additional parameters will need to be looked at to better estimate a customer’s profitability.

linearPDC2<- lm(PDC2$X9Profit ~ PDC2$X9Online, PDC2)
summary(linearPDC2)

Call:
lm(formula = PDC2$X9Profit ~ PDC2$X9Online, data = PDC2)

Residuals:
    Min      1Q  Median      3Q     Max 
-351.52 -160.52 -105.52   70.48 1944.48 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     126.522      2.007  63.033   <2e-16 ***
PDC2$X9OnlineY    5.003      5.578   0.897     0.37    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 282.9 on 22810 degrees of freedom
Multiple R-squared:  3.526e-05, Adjusted R-squared:  -8.574e-06 
F-statistic: 0.8044 on 1 and 22810 DF,  p-value: 0.3698

To further test the effect online useage has on profit, a train and test model was made. The data was divided into two groups: a training group used to complete a linear regression between online users and profit and a testing group used to test the relationship determined from the regression. The original data from the test group and the predicted data were then plotted and the correlation observed.

set.seed(46)
ind <-sample(2,nrow(PDC2), replace=TRUE,prob=c(0.67,0.33))
PDC2online.training <-PDC2[ind==1,1:8 ]
PDC2online.test <- PDC2[ind==2,1:8] 

dim(PDC2online.test)
[1] 7457    8
dim(PDC2online.training)
[1] 15355     8
linearonlinetrain<-lm(X9Profit ~X9Online, PDC2online.training)
summary(linearonlinetrain)

Call:
lm(formula = X9Profit ~ X9Online, data = PDC2online.training)

Residuals:
    Min      1Q  Median      3Q     Max 
-351.13 -163.05 -107.13   70.87 1941.95 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  129.047      2.481  52.012   <2e-16 ***
X9OnlineY      3.081      6.871   0.448    0.654    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 286.7 on 15353 degrees of freedom
Multiple R-squared:  1.309e-05, Adjusted R-squared:  -5.204e-05 
F-statistic: 0.201 on 1 and 15353 DF,  p-value: 0.6539
predonline <-predict(linearonlinetrain, newdata=PDC2online.test)
rmseonline <-sqrt(sum(((predonline)-PDC2online.test$X9Profit)^2)/length(PDC2online.test$X9Profit))
c(RMSE = rmseonline, R2 = summary(linearonlinetrain)$r.squared)                
        RMSE           R2 
2.747972e+02 1.309479e-05 
par(mfrow=c(1,1))
plot(predonline, PDC2online.test$X9Profit, main="Predicted Data Vs Test Data of Online Platform Usage and  Profit", xlab="Predicted Profit", ylab= "Test Data Profit")

cor(PDC2online.test$X9Profit,predonline) 
[1] 0.01083267

Looking at the above model, a correlation of approximately 1% is observed. This model is not mathematically significant and cannot be used to predict profitabilty based on online user status.

Checking category rankings with all categories:

linear3 <-lm(PDC2$X9Profit ~PDC2$X9Online +PDC2$X9Age +PDC2$X9Inc +PDC2$X9Tenure +PDC2$X9District +PDC2$X9Billpay, PDC2)
summary(linear3)

Call:
lm(formula = PDC2$X9Profit ~ PDC2$X9Online + PDC2$X9Age + PDC2$X9Inc + 
    PDC2$X9Tenure + PDC2$X9District + PDC2$X9Billpay, data = PDC2)

Residuals:
    Min      1Q  Median      3Q     Max 
-541.12 -155.28  -70.49   67.72 1961.78 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -61.78538   47.66714  -1.296   0.1949    
PDC2$X9OnlineY    6.27296    5.86594   1.069   0.2849    
PDC2$X9Age2      29.53241   11.92736   2.476   0.0133 *  
PDC2$X9Age3      68.38359   11.70682   5.841 5.25e-09 ***
PDC2$X9Age4      73.12425   11.78949   6.202 5.65e-10 ***
PDC2$X9Age5      78.17996   12.24602   6.384 1.76e-10 ***
PDC2$X9Age6      98.38648   12.70063   7.747 9.83e-15 ***
PDC2$X9Age7     134.04574   12.52390  10.703  < 2e-16 ***
PDC2$X9Inc2       3.08323   11.63825   0.265   0.7911    
PDC2$X9Inc3      13.97772    8.33949   1.676   0.0937 .  
PDC2$X9Inc4      11.95508    8.54656   1.399   0.1619    
PDC2$X9Inc5      17.52958    8.52069   2.057   0.0397 *  
PDC2$X9Inc6      42.57730    7.40187   5.752 8.92e-09 ***
PDC2$X9Inc7      64.57707    8.07577   7.996 1.34e-15 ***
PDC2$X9Inc8      82.78589    9.22157   8.977  < 2e-16 ***
PDC2$X9Inc9     151.46959    8.20840  18.453  < 2e-16 ***
PDC2$X9Tenure     4.08032    0.23532  17.339  < 2e-16 ***
PDC2$X9District   0.01600    0.03837   0.417   0.6767    
PDC2$X9Billpay1  79.47084   14.38130   5.526 3.31e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 273.5 on 22793 degrees of freedom
Multiple R-squared:  0.06567,   Adjusted R-squared:  0.06494 
F-statistic: 89.01 on 18 and 22793 DF,  p-value: < 2.2e-16

Looking at this data, we can see that age, income, and tenure, and billpay have the largest affect on the profit. Complete another itteration with these factors.

linear4 <-lm(PDC2$X9Profit ~ +PDC2$X9Age +PDC2$X9Inc +PDC2$X9Tenure +PDC2$X9Billpay, PDC2)
summary(linear4)

Call:
lm(formula = PDC2$X9Profit ~ +PDC2$X9Age + PDC2$X9Inc + PDC2$X9Tenure + 
    PDC2$X9Billpay, data = PDC2)

Residuals:
    Min      1Q  Median      3Q     Max 
-541.25 -155.28  -70.43   68.00 1960.88 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -41.5745    12.1498  -3.422 0.000623 ***
PDC2$X9Age2      29.5039    11.9268   2.474 0.013378 *  
PDC2$X9Age3      67.9248    11.6998   5.806 6.50e-09 ***
PDC2$X9Age4      72.5468    11.7786   6.159 7.43e-10 ***
PDC2$X9Age5      77.3931    12.2266   6.330 2.50e-10 ***
PDC2$X9Age6      97.3714    12.6694   7.686 1.58e-14 ***
PDC2$X9Age7     133.0048    12.4904  10.649  < 2e-16 ***
PDC2$X9Inc2       3.4158    11.6277   0.294 0.768938    
PDC2$X9Inc3      14.2588     8.3335   1.711 0.087090 .  
PDC2$X9Inc4      12.2420     8.5422   1.433 0.151838    
PDC2$X9Inc5      17.8914     8.5145   2.101 0.035627 *  
PDC2$X9Inc6      42.9429     7.3921   5.809 6.36e-09 ***
PDC2$X9Inc7      65.0555     8.0628   8.069 7.46e-16 ***
PDC2$X9Inc8      83.3212     9.2080   9.049  < 2e-16 ***
PDC2$X9Inc9     152.0595     8.1919  18.562  < 2e-16 ***
PDC2$X9Tenure     4.0765     0.2353  17.326  < 2e-16 ***
PDC2$X9Billpay1  84.8963    13.4641   6.305 2.93e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 273.5 on 22795 degrees of freedom
Multiple R-squared:  0.06562,   Adjusted R-squared:  0.06496 
F-statistic: 100.1 on 16 and 22795 DF,  p-value: < 2.2e-16

The above information will be tested, by dividing the data into two sections (1 & 2) with 67% in on and 33% in the other. The first data set will be called PDC2.training and the second PDC2.test.

set.seed(46)
ind <-sample(2,nrow(PDC2), replace=TRUE,prob=c(0.67,0.33))
PDC2.training <-PDC2[ind==1,1:8 ]
head(PDC2.training)
  ID X9Profit X9Online X9Age X9Inc X9Tenure X9District X9Billpay
2  2       -6        N     6     3    29.50       1200         0
3  3      -49        Y     5     5    26.41       1100         0
5  5      -61        N     2     9     9.91       1200         0
7  7      -19        N     3     1     8.41       1300         0
8  8       59        N     5     8     7.33       1200         0
9  9      493        N     4     9    15.33       1200         0
PDC2.test <- PDC2[ind==2,1:8] 

dim(PDC2.test)
[1] 7457    8
dim(PDC2.training)
[1] 15355     8
lineartrain<-lm(X9Profit ~ +X9Age +X9Inc +X9Tenure +X9Billpay, PDC2.training)
summary(lineartrain)

Call:
lm(formula = X9Profit ~ +X9Age + X9Inc + X9Tenure + X9Billpay, 
    data = PDC2.training)

Residuals:
    Min      1Q  Median      3Q     Max 
-538.56 -157.00  -70.90   67.68 1955.99 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -48.4299    14.7767  -3.277  0.00105 ** 
X9Age2       31.9350    14.5442   2.196  0.02813 *  
X9Age3       79.4090    14.2602   5.569 2.61e-08 ***
X9Age4       81.3920    14.3603   5.668 1.47e-08 ***
X9Age5       83.1874    14.9120   5.579 2.47e-08 ***
X9Age6      101.3086    15.4614   6.552 5.84e-11 ***
X9Age7      150.0779    15.2622   9.833  < 2e-16 ***
X9Inc2       11.3073    14.3949   0.786  0.43217    
X9Inc3       11.7621    10.2777   1.144  0.25246    
X9Inc4       11.4860    10.5518   1.089  0.27637    
X9Inc5       21.0718    10.5356   2.000  0.04551 *  
X9Inc6       48.7885     9.1193   5.350 8.92e-08 ***
X9Inc7       68.3633     9.9310   6.884 6.05e-12 ***
X9Inc8       85.2350    11.4181   7.465 8.78e-14 ***
X9Inc9      165.7543    10.0890  16.429  < 2e-16 ***
X9Tenure      3.7514     0.2909  12.897  < 2e-16 ***
X9Billpay1  108.4914    16.1156   6.732 1.73e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 276.7 on 15338 degrees of freedom
Multiple R-squared:  0.06949,   Adjusted R-squared:  0.06852 
F-statistic: 71.59 on 16 and 15338 DF,  p-value: < 2.2e-16
pred <-predict(lineartrain, newdata=PDC2.test)
rmse <-sqrt(sum(((pred)-PDC2.test$X9Profit)^2)/length(PDC2.test$X9Profit))
c(RMSE = rmse, R2 = summary(lineartrain)$r.squared)                
        RMSE           R2 
267.20075232   0.06949432 
par(mfrow=c(1,1))
plot(PDC2.test$X9Profit,pred)                

cor(PDC2.test$X9Profit,pred) 
[1] 0.2370427

By analyzing other variables, the correlation can be increased. This is still a weak model for prediction and further analysis will need to be completed to strength the predicability of this model.

Further analysis to explore:
1. grouping the tenure data into intervals.
2. separating the outliers and seeing if there are any dominate demographics within that group.
3. comparing the profit of 1999 and 2000 from users who were not online users in 1999 but became online users in 2000.