Homework 1: Descriptive Analytics in R

Purpose Universal Bank seeks to find out which customers are likely to accept an offer to purchase a financial product (a personal loan). The assignment focuses on gaining insight into the data providing prior to starting the modeling process.

Data Set The data file included with this assignment is a list of current customers of Universal Bank. In the file is a binary (0/1) variable, “Acpt_Offer”, which indicates those customers that accepted the an offer for a financial product (a personal loan) in the last solicitation (a “1” indicates that customer accepted loan offer and “0” if not). A set of demographic and customer-product variables are included in the file. The description of these variables in the second tab of the Excel data file. There is also a .csv file (comma separated value) with the same data which should be used for the actual analysis in R.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# a.    Read in the data and assign it to a data-frame called “x.dat”.  How many observations in the data set?  What are the assigned data types of the variables?
setwd("/Users/Fisher/Desktop/BANA 288 Predictive Analytics/HW1 Descriptive Analytics in R")
x.dat <- read.csv("hw1_universal_bank.csv")
dim(x.dat)

## [1] 2500   15

str(x.dat)

## 'data.frame':    2500 obs. of  15 variables:
##  $ ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age         : int  25 45 39 35 35 37 53 50 35 34 ...
##  $ Work_Exp    : int  1 19 15 9 8 13 27 24 10 9 ...
##  $ Income      : int  49 34 11 100 45 29 72 22 81 180 ...
##  $ Fam_Size    : int  4 3 1 1 4 4 2 1 3 1 ...
##  $ Crdt_Crd_Avg: num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
##  $ Education   : int  1 1 1 2 2 2 2 3 2 3 ...
##  $ EdLevel.1   : int  1 1 1 0 0 0 0 0 0 0 ...
##  $ EdLevel.2   : int  0 0 0 1 1 1 1 0 1 0 ...
##  $ Mortgage    : int  0 0 0 0 0 155 0 0 104 0 ...
##  $ Sec_Account : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ CD_Account  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Online_Bank : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ Credit_Card : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Acpt_Offer  : int  0 0 0 0 0 0 0 0 0 1 ...

There are 2500 observations in the data set. Except the fact that the variable “Crdt_Crd_Avg” is numeric data type, all the others are integer data types.

# b.    Create a pie chart with a different color for each slice of the pie for the Family Size variable.  Interpret the chart.
# fam <- factor(x.dat$Fam_Size)
length(unique(x.dat$Fam_Size))  # there are 4 unique values in family size, we will need 4 colors

## [1] 4

fam <- table(x.dat$Fam_Size)
fam

## 
##   1   2   3   4 
## 743 630 491 636

str(fam)

##  'table' int [1:4(1d)] 743 630 491 636
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:4] "1" "2" "3" "4"

colors = c("red", "yellow", "green", "blue")  
pie(fam, 
    labels = as.character(fam), 
    main = "Family Size Distribution", 
    col = colors, # col=c("red","yellow","blue","green"), col = rainbow(length(x))
    clockwise = TRUE, 
    border = c('pink'))
legend("topleft", c("1","2","3","4"), cex = 0.8, fill = colors)

The family sizes of the list of current customers of Universal Bank is almost evenly distributed. In most cases, those customers are single or married with two children or without child.

# c.    Create a horizontal bar chart with the Education variable that has color and labels. What does this say about the Education levels of the customers?    
edu <- table(x.dat$Education)
edu

## 
##    1    2    3 
## 1065  706  729

bp <- barplot(edu, main="Education levels of the customers", xlab="Level", ylab="Frequency", ylim = c(0,1200), col = rainbow(length(edu)))
text(bp, edu, labels = edu, pos=3)

There are 3 Education levels of the customers. Most of the customers don’t have college degrees, however, there is just a slight difference between the amount of customers having college degrees or graduate/advanced degrees, and the latter is a little higher.

## 
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 
##  3 15 32 29 48 63 65 72 62 63 53 65 72 58 42 58 67 57 74 62 69 63 61 63 59 55 
## 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 
## 53 73 67 79 51 72 69 67 70 63 60 64 58 69 56 42 43 11  3

The graph is a left-skewed histogram. The customers are the elderly crowds from 40 to 80.

# e.    Create a horizontal boxplot of the Income variable with labels.  What does this boxplot show?  Is the data skewed?  Are there any outliers for this variable?  If so, how many?  

boxplot(x.dat$Income, main="Estimated annual income of the customer ($1,000)")

outliers = boxplot(x.dat$Income, plot=FALSE)$out
str(outliers)

##  num [1:41] 193 194 195 191 200 205 204 195 192 194 ...

summary(x.dat$Income)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00   39.00   64.00   74.45   99.25  205.00

The median annual income of the customers will be $64000$, and most of the income lie between $39000$ and $99250$. The distribution on income data is right skewed. However, there are 41 outliers.

# f.    Create a boxplot of the Average Credit Card Debt conditional on the customer’s acceptance of the bank’s offer (Accept Offer).  What information does this chart convey?  Hint:  Use the “~” command after the variable for which the boxplot is made in order to link the variable upon which you are conditioning.

boxplot(x.dat$Crdt_Crd_Avg ~ x.dat$Acpt_Offer, main="Average Credit Card Debt of Customers(in $1,000s)", xlab = "Offer Accepted", ylab = "Debt")

The customers who have accepted the personal loan offered in the last solicitation tend to have higher average spending on bank issued credit cards per month.

# g.    Create a scatter plot with Income on the x-axis and Average Credit Card Debt on the y-axis.  Add the least squares regression line to the plot.  What information does this chart convey?  What is the “abline”?
plot(x.dat$Crdt_Crd_Avg ~ x.dat$Income, xlab = "Income", ylab = "Debt")
abline(lm(x.dat$Crdt_Crd_Avg ~ x.dat$Income))

The scatter plot is more dense with lower income and lower debt. There tends to be a linear relationship between income and average spending on bank issued credit cards per month. The “abline” is the predicted linear model on Average Credit Card Debt by independent variable “income”.

# h.    Create a second scatter plot with Average Credit Card Debt on the x-axis and Accept Offer on the y-axis.  Add the least squares regression line to the plot.  What is the take-away from this chart?
plot(x.dat$Acpt_Offer ~ x.dat$Crdt_Crd_Avg, xlab = "Debt", ylab = "Offer Accepted")
abline(lm(x.dat$Acpt_Offer ~ x.dat$Crdt_Crd_Avg))

There is no linear relationship between the average spending on bank issued credit cards per month and whether the customers are going to accept the offer or not. We might consider using logistic regression in stead of linear regression to predict yes/no dependent variable.

# i.    Download and install the package “lattice”.  What are two interesting commands in this package?  What do they do?  
#install.packages("lattice")
#library(lattice)
histogram(~Income | Acpt_Offer, data = x.dat)

densityplot(~Income | Acpt_Offer, data = x.dat)

xyplot(Acpt_Offer ~ Income | Education, data = x.dat)

bwplot(~Income | Fam_Size, data = x.dat)

splom(x.dat)

The command densityplot in lattice can create density plot for us to gain interesting insight by looking at the plot. And the xyplot command which I created is to show the offer accpeted regressed against income conditioned on the education level.

# j.    Using lattice, create a density plot of the Average Credit Card Debt.  What does this chart show?
densityplot(~ Crdt_Crd_Avg, data = x.dat)

This chart shows the distribution of the average spending on bank issued credit cards per month. Most of the customers have $0-4000 monthly debt.

# k.    Using lattice, create a histogram of customer Income conditional on if the customer accepted the offer.  What is the story in this diagram?   Hint:  Use the “pipe” to condition on the customer response variable as follows:  histogram(~Income | Acpt_Offer, data= x2.dat)
histogram(~Income | Acpt_Offer, data = x.dat)

Customers accpeted the loan offers are those one who have higher income. The two histogram graphs are likely mirroring each other, which means the company should focusing on the high income customers which are more likely to accept an offer to purchase a financial product (a personal loan)

# l.    Using lattice, create one additional useful chart.  Explain why the chart is useful toward the goal in this assignment.
bwplot(~Income | Fam_Size, data = x.dat)

The boxplots above is the income distribution conditioned on family size. We can tell that larger family size with one or two kids are having less scattered income. And single customers and dink families have wider range income. The independent variable “income” will be a good indicater for the bank to find their target customers, but we still need to figure out the relationship between family size and income. Will teh combination of those to be a better indicater?

# m.    In consideration of the goal of understanding which customer characteristics are related to the decision on the bank’s offer, what has been learned so far based on the graphics created?

Income will be a huge indicater on accepting the offer or not, however, there will be more factors we will need to do further research on, working experience, family size, average monthly dept, for instance.

# n.    Create a new data frame, x2.dat, from x.dat removing the Education and the ID variables from the data set.  Why would we do this?
x2.dat <- subset(x.dat, select = -c(Education, ID))

For the education variable, we have already had the dummy variables for education levels and the ID variable is also redundant for modeling.

# o.    Compute descriptive statistics, e.g., means, standard deviations, etc., for all variables in x2.dat. 
#options(digits=2)  # reserve two decimal fractions
options(scipen = 100)
stat.desc(x2.dat)

##                         Age      Work_Exp         Income      Fam_Size
## nbr.val        2500.0000000  2500.0000000   2500.0000000 2500.00000000
## nbr.null          0.0000000    49.0000000      0.0000000    0.00000000
## nbr.na            0.0000000     0.0000000      0.0000000    0.00000000
## min              23.0000000     0.0000000      8.0000000    1.00000000
## max              67.0000000    42.0000000    205.0000000    4.00000000
## range            44.0000000    42.0000000    197.0000000    3.00000000
## sum          113365.0000000 50331.0000000 186118.0000000 6020.00000000
## median           45.0000000    20.0000000     64.0000000    2.00000000
## mean             45.3460000    20.1324000     74.4472000    2.40800000
## SE.mean           0.2303904     0.2300565      0.9334480    0.02319719
## CI.mean.0.95      0.4517757     0.4511210      1.8304110    0.04548769
## var             132.6993637   132.3149962   2178.3129373    1.34527411
## std.dev          11.5195210    11.5028256     46.6724002    1.15985952
## coef.var          0.2540361     0.5713589      0.6269195    0.48166924
##               Crdt_Crd_Avg      EdLevel.1      EdLevel.2      Mortgage
## nbr.val      2500.00000000 2500.000000000 2500.000000000   2500.000000
## nbr.null       54.00000000 1435.000000000 1794.000000000   1708.000000
## nbr.na          0.00000000    0.000000000    0.000000000      0.000000
## min             0.00000000    0.000000000    0.000000000      0.000000
## max            10.00000000    1.000000000    1.000000000    617.000000
## range          10.00000000    1.000000000    1.000000000    617.000000
## sum          4878.21000000 1065.000000000  706.000000000 143471.000000
## median          1.50000000    0.000000000    0.000000000      0.000000
## mean            1.95128400    0.426000000    0.282400000     57.388400
## SE.mean         0.03590898    0.009891852    0.009005139      2.016328
## CI.mean.0.95    0.07041442    0.019397069    0.017658301      3.953845
## var             3.22363785    0.244621849    0.202731333  10163.947124
## std.dev         1.79544920    0.494592609    0.450256963    100.816403
## coef.var        0.92013731    1.161015514    1.594394344      1.756738
##                Sec_Account     CD_Account    Online_Bank    Credit_Card
## nbr.val      2500.00000000 2500.000000000 2500.000000000 2500.000000000
## nbr.null     2226.00000000 2344.000000000 1006.000000000 1774.000000000
## nbr.na          0.00000000    0.000000000    0.000000000    0.000000000
## min             0.00000000    0.000000000    0.000000000    0.000000000
## max             1.00000000    1.000000000    1.000000000    1.000000000
## range           1.00000000    1.000000000    1.000000000    1.000000000
## sum           274.00000000  156.000000000 1494.000000000  726.000000000
## median          0.00000000    0.000000000    1.000000000    0.000000000
## mean            0.10960000    0.062400000    0.597600000    0.290400000
## SE.mean         0.00624906    0.004838580    0.009809597    0.009080755
## CI.mean.0.95    0.01225387    0.009488039    0.019235773    0.017806578
## var             0.09762689    0.058529652    0.240570468    0.206150300
## std.dev         0.31245302    0.241929022    0.490479835    0.454037774
## coef.var        2.85084874    3.877067666    0.820749389    1.563490957
##                  Acpt_Offer
## nbr.val      2500.000000000
## nbr.null     2244.000000000
## nbr.na          0.000000000
## min             0.000000000
## max             1.000000000
## range           1.000000000
## sum           256.000000000
## median          0.000000000
## mean            0.102400000
## SE.mean         0.006064685
## CI.mean.0.95    0.011892325
## var             0.091951020
## std.dev         0.303234267
## coef.var        2.961272134

# p.    Compute correlations statistics for all variables in x2.dat.  Given the goal of finding out which customers will accept the bank’s offer, which variables seem like the best predictors?
cor(x2.dat)[13, ]

##           Age      Work_Exp        Income      Fam_Size  Crdt_Crd_Avg 
## -0.0029299059 -0.0030854094  0.5172405150  0.0575159659  0.3769046179 
##     EdLevel.1     EdLevel.2      Mortgage   Sec_Account    CD_Account 
## -0.1522331265  0.0841320863  0.1245543957  0.0124271760  0.3383284125 
##   Online_Bank   Credit_Card    Acpt_Offer 
##  0.0134912840 -0.0009951699  1.0000000000

Income, Crdt_Crd_Avg, EdLevel.1, MortgageCD_Account seem like the best predictors.

# q.    Assuming that Universal Bank seeks to build a model to predict which customers will accept the bank’s offer of the personal loan product, what types of methods or techniques make the most sense in this case?

In my opinion, logstic regression will be the best mothods to predict accepting offer or not, since the dependent variable is bianry, yes or no.

# r.    (Addendum 1)   Run a simple linear regression that predicts average credit card debt as a function of income.  Hint:  Recall, to run a regression we use the “lm” (linear model) command.  How well does this model fit?  Interpret in a sentence the value of the slope coefficient in the regression.
reg1 <- lm(Crdt_Crd_Avg ~ Income, data = x2.dat)
summary(reg1)

## 
## Call:
## lm(formula = Crdt_Crd_Avg ~ Income, data = x2.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8581 -0.6885 -0.0500  0.6644  4.9536 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 0.1103670  0.0518049    2.13              0.0332 *  
## Income      0.0247278  0.0005896   41.94 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.376 on 2498 degrees of freedom
## Multiple R-squared:  0.4132, Adjusted R-squared:  0.413 
## F-statistic:  1759 on 1 and 2498 DF,  p-value: < 0.00000000000000022

41.3% of the variation in average credit card debt is explained by the income variable in the model above. Variable interpretations: Each $1000$ income increases average credit card debt by $24.73$ (no factor need to be held constant)

# s.    (Addendum 2)  Is the model in part r above significant?  Explain.

The model above is significant. The p-value is less than 0.05, so we are able to reject NULL hypothesis.

# t.    (Addendum 3)  Using the model from part r, provide a 99% prediction interval estimate for month credit card debt for a person making $75,000 in annual income.  Interpret the result. 
newdata = data.frame(Income = 75)
predict(reg1, newdata, interval="predict", level = 0.99)

##        fit       lwr     upr
## 1 1.964954 -1.581923 5.51183

The equation will be y = 0.11037 + 0.02473 * Income(in $1000s) So the 99% prediction interval estimate for month credit card debt for a person making $75,000$ in annual income is between 0 and $5500$.

Homework 1: Descriptive Analytics in R

Yu Zhang

4/13/2020

R Markdown