Purpose Universal Bank seeks to find out which customers are likely to accept an offer to purchase a financial product (a personal loan). The assignment focuses on gaining insight into the data providing prior to starting the modeling process.
Data Set The data file included with this assignment is a list of current customers of Universal Bank. In the file is a binary (0/1) variable, “Acpt_Offer”, which indicates those customers that accepted the an offer for a financial product (a personal loan) in the last solicitation (a “1” indicates that customer accepted loan offer and “0” if not). A set of demographic and customer-product variables are included in the file. The description of these variables in the second tab of the Excel data file. There is also a .csv file (comma separated value) with the same data which should be used for the actual analysis in R.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
# a. Read in the data and assign it to a data-frame called “x.dat”. How many observations in the data set? What are the assigned data types of the variables?
setwd("/Users/Fisher/Desktop/BANA 288 Predictive Analytics/HW1 Descriptive Analytics in R")
x.dat <- read.csv("hw1_universal_bank.csv")
dim(x.dat)
## [1] 2500 15
str(x.dat)
## 'data.frame': 2500 obs. of 15 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 25 45 39 35 35 37 53 50 35 34 ...
## $ Work_Exp : int 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : int 49 34 11 100 45 29 72 22 81 180 ...
## $ Fam_Size : int 4 3 1 1 4 4 2 1 3 1 ...
## $ Crdt_Crd_Avg: num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : int 1 1 1 2 2 2 2 3 2 3 ...
## $ EdLevel.1 : int 1 1 1 0 0 0 0 0 0 0 ...
## $ EdLevel.2 : int 0 0 0 1 1 1 1 0 1 0 ...
## $ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
## $ Sec_Account : int 1 1 0 0 0 0 0 0 0 0 ...
## $ CD_Account : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Online_Bank : int 0 0 0 0 0 1 1 0 1 0 ...
## $ Credit_Card : int 0 0 0 0 1 0 0 1 0 0 ...
## $ Acpt_Offer : int 0 0 0 0 0 0 0 0 0 1 ...
There are 2500 observations in the data set. Except the fact that the variable “Crdt_Crd_Avg” is numeric data type, all the others are integer data types.
# b. Create a pie chart with a different color for each slice of the pie for the Family Size variable. Interpret the chart.
# fam <- factor(x.dat$Fam_Size)
length(unique(x.dat$Fam_Size)) # there are 4 unique values in family size, we will need 4 colors
## [1] 4
fam <- table(x.dat$Fam_Size)
fam
##
## 1 2 3 4
## 743 630 491 636
str(fam)
## 'table' int [1:4(1d)] 743 630 491 636
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:4] "1" "2" "3" "4"
colors = c("red", "yellow", "green", "blue")
pie(fam,
labels = as.character(fam),
main = "Family Size Distribution",
col = colors, # col=c("red","yellow","blue","green"), col = rainbow(length(x))
clockwise = TRUE,
border = c('pink'))
legend("topleft", c("1","2","3","4"), cex = 0.8, fill = colors)
The family sizes of the list of current customers of Universal Bank is almost evenly distributed. In most cases, those customers are single or married with two children or without child.
# c. Create a horizontal bar chart with the Education variable that has color and labels. What does this say about the Education levels of the customers?
edu <- table(x.dat$Education)
edu
##
## 1 2 3
## 1065 706 729
bp <- barplot(edu, main="Education levels of the customers", xlab="Level", ylab="Frequency", ylim = c(0,1200), col = rainbow(length(edu)))
text(bp, edu, labels = edu, pos=3)
There are 3 Education levels of the customers. Most of the customers don’t have college degrees, however, there is just a slight difference between the amount of customers having college degrees or graduate/advanced degrees, and the latter is a little higher.
##
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 3 15 32 29 48 63 65 72 62 63 53 65 72 58 42 58 67 57 74 62 69 63 61 63 59 55
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
## 53 73 67 79 51 72 69 67 70 63 60 64 58 69 56 42 43 11 3
The graph is a left-skewed histogram. The customers are the elderly crowds from 40 to 80.
# e. Create a horizontal boxplot of the Income variable with labels. What does this boxplot show? Is the data skewed? Are there any outliers for this variable? If so, how many?
boxplot(x.dat$Income, main="Estimated annual income of the customer ($1,000)")
outliers = boxplot(x.dat$Income, plot=FALSE)$out
str(outliers)
## num [1:41] 193 194 195 191 200 205 204 195 192 194 ...
summary(x.dat$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 39.00 64.00 74.45 99.25 205.00
The median annual income of the customers will be \(64000\), and most of the income lie between \(39000\) and \(99250\). The distribution on income data is right skewed. However, there are 41 outliers.
# f. Create a boxplot of the Average Credit Card Debt conditional on the customer’s acceptance of the bank’s offer (Accept Offer). What information does this chart convey? Hint: Use the “~” command after the variable for which the boxplot is made in order to link the variable upon which you are conditioning.
boxplot(x.dat$Crdt_Crd_Avg ~ x.dat$Acpt_Offer, main="Average Credit Card Debt of Customers(in $1,000s)", xlab = "Offer Accepted", ylab = "Debt")
The customers who have accepted the personal loan offered in the last solicitation tend to have higher average spending on bank issued credit cards per month.
# g. Create a scatter plot with Income on the x-axis and Average Credit Card Debt on the y-axis. Add the least squares regression line to the plot. What information does this chart convey? What is the “abline”?
plot(x.dat$Crdt_Crd_Avg ~ x.dat$Income, xlab = "Income", ylab = "Debt")
abline(lm(x.dat$Crdt_Crd_Avg ~ x.dat$Income))
The scatter plot is more dense with lower income and lower debt. There tends to be a linear relationship between income and average spending on bank issued credit cards per month. The “abline” is the predicted linear model on Average Credit Card Debt by independent variable “income”.
# h. Create a second scatter plot with Average Credit Card Debt on the x-axis and Accept Offer on the y-axis. Add the least squares regression line to the plot. What is the take-away from this chart?
plot(x.dat$Acpt_Offer ~ x.dat$Crdt_Crd_Avg, xlab = "Debt", ylab = "Offer Accepted")
abline(lm(x.dat$Acpt_Offer ~ x.dat$Crdt_Crd_Avg))
There is no linear relationship between the average spending on bank issued credit cards per month and whether the customers are going to accept the offer or not. We might consider using logistic regression in stead of linear regression to predict yes/no dependent variable.
# i. Download and install the package “lattice”. What are two interesting commands in this package? What do they do?
#install.packages("lattice")
#library(lattice)
histogram(~Income | Acpt_Offer, data = x.dat)
densityplot(~Income | Acpt_Offer, data = x.dat)
xyplot(Acpt_Offer ~ Income | Education, data = x.dat)
bwplot(~Income | Fam_Size, data = x.dat)
splom(x.dat)
The command densityplot in lattice can create density plot for us to gain interesting insight by looking at the plot. And the xyplot command which I created is to show the offer accpeted regressed against income conditioned on the education level.
# j. Using lattice, create a density plot of the Average Credit Card Debt. What does this chart show?
densityplot(~ Crdt_Crd_Avg, data = x.dat)
This chart shows the distribution of the average spending on bank issued credit cards per month. Most of the customers have $0-4000 monthly debt.
# k. Using lattice, create a histogram of customer Income conditional on if the customer accepted the offer. What is the story in this diagram? Hint: Use the “pipe” to condition on the customer response variable as follows: histogram(~Income | Acpt_Offer, data= x2.dat)
histogram(~Income | Acpt_Offer, data = x.dat)
Customers accpeted the loan offers are those one who have higher income. The two histogram graphs are likely mirroring each other, which means the company should focusing on the high income customers which are more likely to accept an offer to purchase a financial product (a personal loan)
# l. Using lattice, create one additional useful chart. Explain why the chart is useful toward the goal in this assignment.
bwplot(~Income | Fam_Size, data = x.dat)
The boxplots above is the income distribution conditioned on family size. We can tell that larger family size with one or two kids are having less scattered income. And single customers and dink families have wider range income. The independent variable “income” will be a good indicater for the bank to find their target customers, but we still need to figure out the relationship between family size and income. Will teh combination of those to be a better indicater?
# m. In consideration of the goal of understanding which customer characteristics are related to the decision on the bank’s offer, what has been learned so far based on the graphics created?
Income will be a huge indicater on accepting the offer or not, however, there will be more factors we will need to do further research on, working experience, family size, average monthly dept, for instance.
# n. Create a new data frame, x2.dat, from x.dat removing the Education and the ID variables from the data set. Why would we do this?
x2.dat <- subset(x.dat, select = -c(Education, ID))
For the education variable, we have already had the dummy variables for education levels and the ID variable is also redundant for modeling.
# o. Compute descriptive statistics, e.g., means, standard deviations, etc., for all variables in x2.dat.
#options(digits=2) # reserve two decimal fractions
options(scipen = 100)
stat.desc(x2.dat)
## Age Work_Exp Income Fam_Size
## nbr.val 2500.0000000 2500.0000000 2500.0000000 2500.00000000
## nbr.null 0.0000000 49.0000000 0.0000000 0.00000000
## nbr.na 0.0000000 0.0000000 0.0000000 0.00000000
## min 23.0000000 0.0000000 8.0000000 1.00000000
## max 67.0000000 42.0000000 205.0000000 4.00000000
## range 44.0000000 42.0000000 197.0000000 3.00000000
## sum 113365.0000000 50331.0000000 186118.0000000 6020.00000000
## median 45.0000000 20.0000000 64.0000000 2.00000000
## mean 45.3460000 20.1324000 74.4472000 2.40800000
## SE.mean 0.2303904 0.2300565 0.9334480 0.02319719
## CI.mean.0.95 0.4517757 0.4511210 1.8304110 0.04548769
## var 132.6993637 132.3149962 2178.3129373 1.34527411
## std.dev 11.5195210 11.5028256 46.6724002 1.15985952
## coef.var 0.2540361 0.5713589 0.6269195 0.48166924
## Crdt_Crd_Avg EdLevel.1 EdLevel.2 Mortgage
## nbr.val 2500.00000000 2500.000000000 2500.000000000 2500.000000
## nbr.null 54.00000000 1435.000000000 1794.000000000 1708.000000
## nbr.na 0.00000000 0.000000000 0.000000000 0.000000
## min 0.00000000 0.000000000 0.000000000 0.000000
## max 10.00000000 1.000000000 1.000000000 617.000000
## range 10.00000000 1.000000000 1.000000000 617.000000
## sum 4878.21000000 1065.000000000 706.000000000 143471.000000
## median 1.50000000 0.000000000 0.000000000 0.000000
## mean 1.95128400 0.426000000 0.282400000 57.388400
## SE.mean 0.03590898 0.009891852 0.009005139 2.016328
## CI.mean.0.95 0.07041442 0.019397069 0.017658301 3.953845
## var 3.22363785 0.244621849 0.202731333 10163.947124
## std.dev 1.79544920 0.494592609 0.450256963 100.816403
## coef.var 0.92013731 1.161015514 1.594394344 1.756738
## Sec_Account CD_Account Online_Bank Credit_Card
## nbr.val 2500.00000000 2500.000000000 2500.000000000 2500.000000000
## nbr.null 2226.00000000 2344.000000000 1006.000000000 1774.000000000
## nbr.na 0.00000000 0.000000000 0.000000000 0.000000000
## min 0.00000000 0.000000000 0.000000000 0.000000000
## max 1.00000000 1.000000000 1.000000000 1.000000000
## range 1.00000000 1.000000000 1.000000000 1.000000000
## sum 274.00000000 156.000000000 1494.000000000 726.000000000
## median 0.00000000 0.000000000 1.000000000 0.000000000
## mean 0.10960000 0.062400000 0.597600000 0.290400000
## SE.mean 0.00624906 0.004838580 0.009809597 0.009080755
## CI.mean.0.95 0.01225387 0.009488039 0.019235773 0.017806578
## var 0.09762689 0.058529652 0.240570468 0.206150300
## std.dev 0.31245302 0.241929022 0.490479835 0.454037774
## coef.var 2.85084874 3.877067666 0.820749389 1.563490957
## Acpt_Offer
## nbr.val 2500.000000000
## nbr.null 2244.000000000
## nbr.na 0.000000000
## min 0.000000000
## max 1.000000000
## range 1.000000000
## sum 256.000000000
## median 0.000000000
## mean 0.102400000
## SE.mean 0.006064685
## CI.mean.0.95 0.011892325
## var 0.091951020
## std.dev 0.303234267
## coef.var 2.961272134
# p. Compute correlations statistics for all variables in x2.dat. Given the goal of finding out which customers will accept the bank’s offer, which variables seem like the best predictors?
cor(x2.dat)[13, ]
## Age Work_Exp Income Fam_Size Crdt_Crd_Avg
## -0.0029299059 -0.0030854094 0.5172405150 0.0575159659 0.3769046179
## EdLevel.1 EdLevel.2 Mortgage Sec_Account CD_Account
## -0.1522331265 0.0841320863 0.1245543957 0.0124271760 0.3383284125
## Online_Bank Credit_Card Acpt_Offer
## 0.0134912840 -0.0009951699 1.0000000000
Income, Crdt_Crd_Avg, EdLevel.1, MortgageCD_Account seem like the best predictors.
# q. Assuming that Universal Bank seeks to build a model to predict which customers will accept the bank’s offer of the personal loan product, what types of methods or techniques make the most sense in this case?
In my opinion, logstic regression will be the best mothods to predict accepting offer or not, since the dependent variable is bianry, yes or no.
# r. (Addendum 1) Run a simple linear regression that predicts average credit card debt as a function of income. Hint: Recall, to run a regression we use the “lm” (linear model) command. How well does this model fit? Interpret in a sentence the value of the slope coefficient in the regression.
reg1 <- lm(Crdt_Crd_Avg ~ Income, data = x2.dat)
summary(reg1)
##
## Call:
## lm(formula = Crdt_Crd_Avg ~ Income, data = x2.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8581 -0.6885 -0.0500 0.6644 4.9536
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1103670 0.0518049 2.13 0.0332 *
## Income 0.0247278 0.0005896 41.94 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.376 on 2498 degrees of freedom
## Multiple R-squared: 0.4132, Adjusted R-squared: 0.413
## F-statistic: 1759 on 1 and 2498 DF, p-value: < 0.00000000000000022
41.3% of the variation in average credit card debt is explained by the income variable in the model above. Variable interpretations: Each \(1000\) income increases average credit card debt by \(24.73\) (no factor need to be held constant)
# s. (Addendum 2) Is the model in part r above significant? Explain.
The model above is significant. The p-value is less than 0.05, so we are able to reject NULL hypothesis.
# t. (Addendum 3) Using the model from part r, provide a 99% prediction interval estimate for month credit card debt for a person making $75,000 in annual income. Interpret the result.
newdata = data.frame(Income = 75)
predict(reg1, newdata, interval="predict", level = 0.99)
## fit lwr upr
## 1 1.964954 -1.581923 5.51183
The equation will be y = 0.11037 + 0.02473 * Income(in $1000s) So the 99% prediction interval estimate for month credit card debt for a person making \(75,000\) in annual income is between 0 and \(5500\).