A copy of this document can also be found at: https://rpubs.com/khalilc002/PilgrimBank
This document will discuss Case I: Pilgrim Data, specifically the relationship, if any, between a customer using the banks online services and the profit made from the customer.
The data was loaded and the required libraries attached. Libraries used in this analysis include stats, plyr , dplyr, base, stats, and ggplot2.
The columns and rows can be observed using the head and summary functions. Looking at this we can see many values are missing, specifically from the 2000 data and 2009’s age and income.
ID X9Profit X9Online X9Age X9Inc X9Tenure X9District X0Profit X0Online
1 1 21 0 NA NA 6.33 1200 NA NA
2 2 -6 0 6 3 29.50 1200 -32 0
3 3 -49 1 5 5 26.41 1100 -22 1
4 4 -4 0 NA NA 2.25 1200 NA NA
5 5 -61 0 2 9 9.91 1200 -4 0
6 6 -38 0 NA 3 2.33 1300 14 0
X9Billpay X0Billpay
1 0 NA
2 0 0
3 0 0
4 0 NA
5 0 0
6 0 0
ID X9Profit X9Online X9Age
Min. : 1 Min. :-221.0 Min. :0.0000 Min. :1.000
1st Qu.: 7909 1st Qu.: -34.0 1st Qu.:0.0000 1st Qu.:3.000
Median :15818 Median : 9.0 Median :0.0000 Median :4.000
Mean :15818 Mean : 111.5 Mean :0.1218 Mean :4.046
3rd Qu.:23726 3rd Qu.: 164.0 3rd Qu.:0.0000 3rd Qu.:5.000
Max. :31634 Max. :2071.0 Max. :1.0000 Max. :7.000
NA's :8289
X9Inc X9Tenure X9District X0Profit
Min. :1.000 Min. : 0.16 Min. :1100 Min. :-5643.0
1st Qu.:4.000 1st Qu.: 3.75 1st Qu.:1200 1st Qu.: -30.0
Median :6.000 Median : 7.41 Median :1200 Median : 23.0
Mean :5.459 Mean :10.16 Mean :1203 Mean : 144.8
3rd Qu.:7.000 3rd Qu.:14.75 3rd Qu.:1200 3rd Qu.: 206.0
Max. :9.000 Max. :41.16 Max. :1300 Max. :27086.0
NA's :8261 NA's :5238
X0Online X9Billpay X0Billpay
Min. :0.000 Min. :0.00000 Min. :0.00
1st Qu.:0.000 1st Qu.:0.00000 1st Qu.:0.00
Median :0.000 Median :0.00000 Median :0.00
Mean :0.199 Mean :0.01669 Mean :0.03
3rd Qu.:0.000 3rd Qu.:0.00000 3rd Qu.:0.00
Max. :1.000 Max. :1.00000 Max. :1.00
NA's :5219 NA's :5219
Before completing an analysis it is imparative to have our columns assigned the right variable type. This avoids categorical data being assessed as numerical and vice versa.
ID X9Profit X9Online X9Age X9Inc X9Tenure X9District
"integer" "integer" "integer" "integer" "integer" "numeric" "integer"
X0Profit X0Online X9Billpay X0Billpay
"integer" "integer" "integer" "integer"
The raw data has both numerical and character data. Columns starting with X9 represent 2019 data while columns starting with X0 represent 2000 data.
| Column | Original Data Type | Required Data Type | Information |
|---|---|---|---|
| 1. ID Numbers | Integer | Character | Categorical information represented with numbers. |
| 2. Profit | Integer | Integer | Left as an integer as no decimals are observed in profit amounts. |
| 3. Online | Character | Character | Categorical and binary data representing if the customer uses online services. |
| 4. Age | Integer | Charactor | Categorical, representing numerical intervals of the age of the user. |
| 5. Income | Integer | Character | Categorical, representing numerical intervals of the income of the user. |
| 6. Tenure | Numeric | Numeric | Continuous variables respresenting time with bank. |
| 7. District | Integer | Character | Categorical, representing district customer is in. |
| 8. Billpay | Integer | Character | Categorical and binary data |
Age and income are currently represented as an integer, this is incorrect as each number represents a category in a scale. The age and income levels have been represented with intervals. The character types are corrected.
In order to avoid skewed data, the rows with missing information are deleted. There are originally 10551 rows with missing data, primarily from the Age, Income columns from 2019 and many rows in 2020 are missing data. Since we will not be looking at data from 2020, these columns are removed before the rows with missing cases. This results in a removal of 8822 rows. Once this is completed, we can see there is no missing data.
sum(!complete.cases(PG))
[1] 10551
PDC2 <-select(PG, -c(8, 9, 11))
sum(!complete.cases(PDC2))
[1] 8822
PDC2 <- na.omit(PDC2)
sum(!complete.cases(PDC2))
[1] 0
The final mannipulation to the data before completing a t-test will be to change online usage variables 1 to Y and 0 to N.
PDC2[PDC2$X9Online == 1, "X9Online"] <- "Y"
PDC2[PDC2$X9Online == 0, "X9Online"] <- "N"
Customer profit is plotted on a boxplot, based on their online platform status:
BP <- boxplot((PDC2$X9Profit ~ PDC2$X9Online), main="Customer Profitability and Online Platform Usage", xlab = "Online Platform Usege", ylab="Profitability ($)")
A t-test can be completed to determine the significance of the relationship (if any) between users of the online platform and those who who do not use the banks online platform.
t-test set up:
Group 1: Users of online platform
Group 2: Nonusers of online platform
Null hypothesis : mean of group 1= mean of group 2
Alternatvie hypothesis : the means of both groups are not equal
Confidence interval: 95%
Alpha : 0.05
var(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] 84312.07
var(PDC2$X9Profit[PDC2$X9Online=="N"])
[1] 79368.14
t.test(PDC2$X9Profit ~ PDC2$X9Online, mu=0, alt="two.sided", conf=0.95, var.eq=F, paired=F)
Welch Two Sample t-test
data: PDC2$X9Profit by PDC2$X9Online
t = -0.87703, df = 3826.8, p-value = 0.3805
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-16.186479 6.180837
sample estimates:
mean in group N mean in group Y
126.5216 131.5244
The p-value returned is 0.3805. p-value > a. Therefore we fail to reject the null hypothesis. We cannot conclude there is a significant difference between the profits of online and non-online users based on the profit and online status of a customer.
As the dataset shows extreme cases, the above t-test was completed without outliers from both sets of users. Outliers were identified based on being 1.5 IQR above or below the mean.
quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.25 )
25%
-40
quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.75 )
75%
214
quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.50 )
50%
27
IQR(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] 254
fivenum(PDC2$X9Profit[PDC2$X9Online=="Y"])
[1] -220 -40 27 214 1979
OutlierRange <-IQR(PDC2$X9Profit[PDC2$X9Online=="Y"]) * 1.5
print(OutlierRange)
[1] 381
OutlierLower <- quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.25 ) - OutlierRange
print(OutlierLower)
25%
-421
OutlierUpper <- quantile(PDC2$X9Profit[PDC2$X9Online=="Y"],0.75 ) + OutlierRange
print(OutlierUpper)
75%
595
#remove lines with Y and profit less than -433
OutlierLower<-as.numeric(OutlierLower)
d<-PDC2[!(PDC2$X9Online=="Y" & PDC2$X9Profit < OutlierLower),]
d<-d[!(d$X9Online=="Y" & d$X9Profit > OutlierUpper),]
#do the same for X9Online = N
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.25 )
## 25%
## -33
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.75 )
## 75%
## 194
quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.50 )
## 50%
## 20.5
IQR(PDC2$X9Profit[PDC2$X9Online=="N"])
## [1] 227
fivenum(PDC2$X9Profit[PDC2$X9Online=="N"])
## [1] -221.0 -33.0 20.5 194.0 2071.0
OutlierRange2 <-IQR(PDC2$X9Profit[PDC2$X9Online=="N"]) * 1.5
print(OutlierRange2)
## [1] 340.5
OutlierLower2 <- quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.25 ) - OutlierRange2
print(OutlierLower2)
## 25%
## -373.5
OutlierUpper2 <- quantile(PDC2$X9Profit[PDC2$X9Online=="N"],0.75 ) + OutlierRange2
print(OutlierUpper2)
## 75%
## 534.5
OutlierLower2<-as.numeric(OutlierLower2)
d<-d[!(d$X9Online=="N" & d$X9Profit < OutlierLower2),]
d<-d[!(d$X9Online=="N" & d$X9Profit > OutlierUpper2),]
A new boxplot shows the data without the outliers.
BP <- boxplot((d$X9Profit ~ d$X9Online), main="Customer Profitability and Online Platform Usage", xlab = "Online Platform Usege", ylab="Profitability($)", sub="Data Shown Excluding Outliers")
t-test set up:
Data without outliers
Group 1: Users of online platform
Group 2: Nonusers of online platform
Null hypothesis : mean of group 1= mean of group 2
Alternatvie hypothesis : the means of both groups are not equal
Confidence interval: 95%
Alpha : 0.05
var(d$X9Profit[d$X9Online=="Y"])
[1] 29693.79
var(d$X9Profit[d$X9Online=="N"])
[1] 23181.19
t.test(d$X9Profit ~ d$X9Online, mu=0, alt="two.sided", conf=0.95, var.eq=F, paired=F)
Welch Two Sample t-test
data: d$X9Profit by d$X9Online
t = -1.9313, df = 3405.4, p-value = 0.05353
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.5470313 0.1023054
sample estimates:
mean in group N mean in group Y
61.69634 68.41871
The p-value 0.05353 > a. without the outliers therefore we again , fail to reject the null hypothesis. However, as the as the outliers compose 1768/22812 customers (7.75%) of the customers, they should be included in the results to avoid a sample bias. The outliers have a significant effect on the data, and to understand this effect, other factors should be observed in the next iterations.
Linear regression model for the original data set PDC2 was completed.The regression equation returned is y = 5.003(x) + 126.522 where y represents the profit and x represents online status. This regression model cannot be accepted with a p-value of 0.37 for the “OnlineY” parameter and very small variance. Ultimately, additional parameters will need to be looked at to better estimate a customer’s profitability.
linearPDC2<- lm(PDC2$X9Profit ~ PDC2$X9Online, PDC2)
summary(linearPDC2)
Call:
lm(formula = PDC2$X9Profit ~ PDC2$X9Online, data = PDC2)
Residuals:
Min 1Q Median 3Q Max
-351.52 -160.52 -105.52 70.48 1944.48
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 126.522 2.007 63.033 <2e-16 ***
PDC2$X9OnlineY 5.003 5.578 0.897 0.37
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 282.9 on 22810 degrees of freedom
Multiple R-squared: 3.526e-05, Adjusted R-squared: -8.574e-06
F-statistic: 0.8044 on 1 and 22810 DF, p-value: 0.3698
To further test the effect online useage has on profit, a train and test model was made. The data was divided into two groups: a training group used to complete a linear regression between online users and profit and a testing group used to test the relationship determined from the regression. The original data from the test group and the predicted data were then plotted and the correlation observed.
set.seed(46)
ind <-sample(2,nrow(PDC2), replace=TRUE,prob=c(0.67,0.33))
PDC2online.training <-PDC2[ind==1,1:8 ]
PDC2online.test <- PDC2[ind==2,1:8]
dim(PDC2online.test)
[1] 7457 8
dim(PDC2online.training)
[1] 15355 8
linearonlinetrain<-lm(X9Profit ~X9Online, PDC2online.training)
summary(linearonlinetrain)
Call:
lm(formula = X9Profit ~ X9Online, data = PDC2online.training)
Residuals:
Min 1Q Median 3Q Max
-351.13 -163.05 -107.13 70.87 1941.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 129.047 2.481 52.012 <2e-16 ***
X9OnlineY 3.081 6.871 0.448 0.654
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 286.7 on 15353 degrees of freedom
Multiple R-squared: 1.309e-05, Adjusted R-squared: -5.204e-05
F-statistic: 0.201 on 1 and 15353 DF, p-value: 0.6539
predonline <-predict(linearonlinetrain, newdata=PDC2online.test)
rmseonline <-sqrt(sum(((predonline)-PDC2online.test$X9Profit)^2)/length(PDC2online.test$X9Profit))
c(RMSE = rmseonline, R2 = summary(linearonlinetrain)$r.squared)
RMSE R2
2.747972e+02 1.309479e-05
par(mfrow=c(1,1))
plot(predonline, PDC2online.test$X9Profit, main="Predicted Data Vs Test Data of Online Platform Usage and Profit", xlab="Predicted Profit", ylab= "Test Data Profit")
cor(PDC2online.test$X9Profit,predonline)
[1] 0.01083267
Looking at the above model, a correlation of approximately 1% is observed. This model is not mathematically significant and cannot be used to predict profitabilty based on online user status.
Checking category rankings with all categories:
linear3 <-lm(PDC2$X9Profit ~PDC2$X9Online +PDC2$X9Age +PDC2$X9Inc +PDC2$X9Tenure +PDC2$X9District +PDC2$X9Billpay, PDC2)
summary(linear3)
Call:
lm(formula = PDC2$X9Profit ~ PDC2$X9Online + PDC2$X9Age + PDC2$X9Inc +
PDC2$X9Tenure + PDC2$X9District + PDC2$X9Billpay, data = PDC2)
Residuals:
Min 1Q Median 3Q Max
-541.12 -155.28 -70.49 67.72 1961.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -61.78538 47.66714 -1.296 0.1949
PDC2$X9OnlineY 6.27296 5.86594 1.069 0.2849
PDC2$X9Age2 29.53241 11.92736 2.476 0.0133 *
PDC2$X9Age3 68.38359 11.70682 5.841 5.25e-09 ***
PDC2$X9Age4 73.12425 11.78949 6.202 5.65e-10 ***
PDC2$X9Age5 78.17996 12.24602 6.384 1.76e-10 ***
PDC2$X9Age6 98.38648 12.70063 7.747 9.83e-15 ***
PDC2$X9Age7 134.04574 12.52390 10.703 < 2e-16 ***
PDC2$X9Inc2 3.08323 11.63825 0.265 0.7911
PDC2$X9Inc3 13.97772 8.33949 1.676 0.0937 .
PDC2$X9Inc4 11.95508 8.54656 1.399 0.1619
PDC2$X9Inc5 17.52958 8.52069 2.057 0.0397 *
PDC2$X9Inc6 42.57730 7.40187 5.752 8.92e-09 ***
PDC2$X9Inc7 64.57707 8.07577 7.996 1.34e-15 ***
PDC2$X9Inc8 82.78589 9.22157 8.977 < 2e-16 ***
PDC2$X9Inc9 151.46959 8.20840 18.453 < 2e-16 ***
PDC2$X9Tenure 4.08032 0.23532 17.339 < 2e-16 ***
PDC2$X9District 0.01600 0.03837 0.417 0.6767
PDC2$X9Billpay1 79.47084 14.38130 5.526 3.31e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 273.5 on 22793 degrees of freedom
Multiple R-squared: 0.06567, Adjusted R-squared: 0.06494
F-statistic: 89.01 on 18 and 22793 DF, p-value: < 2.2e-16
Looking at this data, we can see that age, income, and tenure, and billpay have the largest affect on the profit. Complete another itteration with these factors.
linear4 <-lm(PDC2$X9Profit ~ +PDC2$X9Age +PDC2$X9Inc +PDC2$X9Tenure +PDC2$X9Billpay, PDC2)
summary(linear4)
Call:
lm(formula = PDC2$X9Profit ~ +PDC2$X9Age + PDC2$X9Inc + PDC2$X9Tenure +
PDC2$X9Billpay, data = PDC2)
Residuals:
Min 1Q Median 3Q Max
-541.25 -155.28 -70.43 68.00 1960.88
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -41.5745 12.1498 -3.422 0.000623 ***
PDC2$X9Age2 29.5039 11.9268 2.474 0.013378 *
PDC2$X9Age3 67.9248 11.6998 5.806 6.50e-09 ***
PDC2$X9Age4 72.5468 11.7786 6.159 7.43e-10 ***
PDC2$X9Age5 77.3931 12.2266 6.330 2.50e-10 ***
PDC2$X9Age6 97.3714 12.6694 7.686 1.58e-14 ***
PDC2$X9Age7 133.0048 12.4904 10.649 < 2e-16 ***
PDC2$X9Inc2 3.4158 11.6277 0.294 0.768938
PDC2$X9Inc3 14.2588 8.3335 1.711 0.087090 .
PDC2$X9Inc4 12.2420 8.5422 1.433 0.151838
PDC2$X9Inc5 17.8914 8.5145 2.101 0.035627 *
PDC2$X9Inc6 42.9429 7.3921 5.809 6.36e-09 ***
PDC2$X9Inc7 65.0555 8.0628 8.069 7.46e-16 ***
PDC2$X9Inc8 83.3212 9.2080 9.049 < 2e-16 ***
PDC2$X9Inc9 152.0595 8.1919 18.562 < 2e-16 ***
PDC2$X9Tenure 4.0765 0.2353 17.326 < 2e-16 ***
PDC2$X9Billpay1 84.8963 13.4641 6.305 2.93e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 273.5 on 22795 degrees of freedom
Multiple R-squared: 0.06562, Adjusted R-squared: 0.06496
F-statistic: 100.1 on 16 and 22795 DF, p-value: < 2.2e-16
The above information will be tested, by dividing the data into two sections (1 & 2) with 67% in on and 33% in the other. The first data set will be called PDC2.training and the second PDC2.test.
set.seed(46)
ind <-sample(2,nrow(PDC2), replace=TRUE,prob=c(0.67,0.33))
PDC2.training <-PDC2[ind==1,1:8 ]
head(PDC2.training)
ID X9Profit X9Online X9Age X9Inc X9Tenure X9District X9Billpay
2 2 -6 N 6 3 29.50 1200 0
3 3 -49 Y 5 5 26.41 1100 0
5 5 -61 N 2 9 9.91 1200 0
7 7 -19 N 3 1 8.41 1300 0
8 8 59 N 5 8 7.33 1200 0
9 9 493 N 4 9 15.33 1200 0
PDC2.test <- PDC2[ind==2,1:8]
dim(PDC2.test)
[1] 7457 8
dim(PDC2.training)
[1] 15355 8
lineartrain<-lm(X9Profit ~ +X9Age +X9Inc +X9Tenure +X9Billpay, PDC2.training)
summary(lineartrain)
Call:
lm(formula = X9Profit ~ +X9Age + X9Inc + X9Tenure + X9Billpay,
data = PDC2.training)
Residuals:
Min 1Q Median 3Q Max
-538.56 -157.00 -70.90 67.68 1955.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -48.4299 14.7767 -3.277 0.00105 **
X9Age2 31.9350 14.5442 2.196 0.02813 *
X9Age3 79.4090 14.2602 5.569 2.61e-08 ***
X9Age4 81.3920 14.3603 5.668 1.47e-08 ***
X9Age5 83.1874 14.9120 5.579 2.47e-08 ***
X9Age6 101.3086 15.4614 6.552 5.84e-11 ***
X9Age7 150.0779 15.2622 9.833 < 2e-16 ***
X9Inc2 11.3073 14.3949 0.786 0.43217
X9Inc3 11.7621 10.2777 1.144 0.25246
X9Inc4 11.4860 10.5518 1.089 0.27637
X9Inc5 21.0718 10.5356 2.000 0.04551 *
X9Inc6 48.7885 9.1193 5.350 8.92e-08 ***
X9Inc7 68.3633 9.9310 6.884 6.05e-12 ***
X9Inc8 85.2350 11.4181 7.465 8.78e-14 ***
X9Inc9 165.7543 10.0890 16.429 < 2e-16 ***
X9Tenure 3.7514 0.2909 12.897 < 2e-16 ***
X9Billpay1 108.4914 16.1156 6.732 1.73e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 276.7 on 15338 degrees of freedom
Multiple R-squared: 0.06949, Adjusted R-squared: 0.06852
F-statistic: 71.59 on 16 and 15338 DF, p-value: < 2.2e-16
pred <-predict(lineartrain, newdata=PDC2.test)
rmse <-sqrt(sum(((pred)-PDC2.test$X9Profit)^2)/length(PDC2.test$X9Profit))
c(RMSE = rmse, R2 = summary(lineartrain)$r.squared)
RMSE R2
267.20075232 0.06949432
par(mfrow=c(1,1))
plot(PDC2.test$X9Profit,pred)
cor(PDC2.test$X9Profit,pred)
[1] 0.2370427
By analyzing other variables, the correlation can be increased. This is still a weak model for prediction and further analysis will need to be completed to strength the predicability of this model.
Further analysis to explore:
1. grouping the tenure data into intervals.
2. separating the outliers and seeing if there are any dominate demographics within that group.
3. comparing the profit of 1999 and 2000 from users who were not online users in 1999 but became online users in 2000.