CAP 5703 Assignment#2 Danilo Martinez
You will use classification methods to predict whether a given customer accepts his/her personal loan offer based on the Universal Bank dataset. There are a total of 5,000 customers in the data set and 14 variables. A brief description of the 14 variables is given below:
Name                Description
ID                  Customer ID
Age                 Customer's age in completed years
Experience          Number of years of professional experience
Income              Annual income of the customer (in $1,000s)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Average monthly credit card spending (in $1,000s)
Education           Education level: 1, undergraduate; 2, graduate; 3, advanced/professional
Mortgage            Value of house mortgage, if any (in $1,000s)
Personal Loan       Did this customer accept the personal loan offered in the last campaign? 1, yes; 0, no
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?
Use the first 4,000 records as your training set and the rest as your test set. Explore four different classification methods, including logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and KNN using only two predictor variables (Income and CCAvg), to predict whether a given customer accepts his/her personal loan offer. Compare the test error of these four methods. For the KNN method, try all values of K = {1, 2, ..., 20} and choose the optimal K when comparing the KNN method with the other classification methods. Describe your findings.
Setting working directory and importing data file.
#Setting working directory to import files
setwd("C:/Users/Me/OneDriveLatestData/OneDrive - University of Central Florida - UCF/Data Mining I")
#Reading in data file
info<-read.csv("UniversalBank.csv", header=TRUE)
head(info,10)
##    ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1   1  25          1     49    91107      4   1.6         1        0
## 2   2  45         19     34    90089      3   1.5         1        0
## 3   3  39         15     11    94720      1   1.0         1        0
## 4   4  35          9    100    94112      1   2.7         2        0
## 5   5  35          8     45    91330      4   1.0         2        0
## 6   6  37         13     29    92121      4   0.4         2      155
## 7   7  53         27     72    91711      2   1.5         2        0
## 8   8  50         24     22    93943      1   0.3         3        0
## 9   9  35         10     81    90089      3   0.6         2      104
## 10 10  34          9    180    93023      1   8.9         3        0
##    Personal.Loan Securities.Account CD.Account Online CreditCard
## 1              0                  1          0      0          0
## 2              0                  1          0      0          0
## 3              0                  0          0      0          0
## 4              0                  0          0      0          0
## 5              0                  0          0      0          1
## 6              0                  0          0      1          0
## 7              0                  0          0      1          0
## 8              0                  0          0      0          1
## 9              0                  0          0      1          0
## 10             1                  0          0      0          0
Dividing the data into training and test sets.
#Setting up training set
train=head(info,4000)
#Setting up test set
test=tail(info,1000)
#Removing unused data
rm(info)
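As an optional sanity check (not part of the original assignment steps), one could verify that both splits contain customers who accepted the loan offer; a minimal sketch (output not shown):
#Optional check: class balance of the response in each split
table(train$Personal.Loan)
table(test$Personal.Loan)
prop.table(table(train$Personal.Loan))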
Logistic regression is a special type of regression in which a binary response variable is related to a set of explanatory variables, which can be discrete and/or continuous. The important point to note is that in linear regression, the expected value of the response variable is modeled as a combination of values taken by the predictors. In logistic regression, the probability, or odds, of the response taking a particular value is modeled as a combination of values taken by the predictors. As in regression, and unlike log-linear models, we make an explicit distinction between a response variable and one or more predictor (explanatory) variables.
Creating the model for logistic regression using the training set.
#Including library necessary for predict function
library(caret)
#Creating training model using logistic regression
model=glm(Personal.Loan~.,data=train,family=binomial(link ="logit"))
#Printing out the logistic model
summary(model)
## 
## Call:
## glm(formula = Personal.Loan ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1518  -0.2087  -0.0852  -0.0341   3.8598  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.118e+01  4.500e+00  -2.485 0.012954 *  
## ID                 -8.350e-06  7.037e-05  -0.119 0.905551    
## Age                -6.458e-02  6.745e-02  -0.957 0.338350    
## Experience          7.546e-02  6.716e-02   1.124 0.261193    
## Income              5.254e-02  2.795e-03  18.797  < 2e-16 ***
## ZIP.Code           -4.018e-06  4.446e-05  -0.090 0.927987    
## Family              6.812e-01  8.064e-02   8.448  < 2e-16 ***
## CCAvg               1.452e-01  4.331e-02   3.352 0.000802 ***
## Education           1.672e+00  1.243e-01  13.446  < 2e-16 ***
## Mortgage            4.646e-04  6.059e-04   0.767 0.443190    
## Securities.Account -9.590e-01  3.093e-01  -3.101 0.001929 ** 
## CD.Account          3.772e+00  3.564e-01  10.583  < 2e-16 ***
## Online             -5.996e-01  1.718e-01  -3.490 0.000483 ***
## CreditCard         -1.102e+00  2.234e-01  -4.933 8.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2587.5  on 3999  degrees of freedom
## Residual deviance: 1060.8  on 3986  degrees of freedom
## AIC: 1088.8
## 
## Number of Fisher Scoring iterations: 7
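As an aside, a minimal sketch of how the fitted coefficients map to a predicted probability through the logistic (inverse-logit) function; the row chosen below is arbitrary and used only for illustration:
#Illustration: the linear predictor is on the log-odds scale; plogis() converts it to a probability
new.cust <- test[1, ]
eta <- predict(model, newdata = new.cust)   #log-odds (default type = "link")
plogis(eta)                                 #equivalent to predict(..., type = "response")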
Fitting model to the test set and checking accuracy.
#Fitting training model on test set
pred = predict(model,newdata=test,type="response")
#Including library for misclassification error function
library(InformationValue)
#Calculating accuracy
1-misClassError(test$Personal.Loan,pred)
## [1] 0.958
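As a cross-check of the accuracy reported by misClassError, here is a minimal manual computation, assuming the same 0.5 probability cutoff that misClassError uses by default:
#Manual cross-check: threshold the predicted probabilities at 0.5 and compare to the true labels
pred.class <- ifelse(pred > 0.5, 1, 0)
cm <- table(Actual = test$Personal.Loan, Predicted = pred.class)   #confusion matrix
mean(pred.class == test$Personal.Loan)                             #accuracy; test error = 1 - accuracy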
Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function for the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y, given the predictor(s) X. We now consider an alternative and less direct approach to estimating these probabilities.
Linear discriminant analysis. In this alternative approach, we model the distribution of the predictors X separately in each of the response classes (i.e., given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression. Why do we need another method when we have logistic regression? There are several reasons:
- When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
- If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
- Linear discriminant analysis is popular when we have more than two response classes.
Creating the model for linear discriminant analysis using the training set.
#Including library necessary for predict function
library(MASS)
#Creating training model using linear discriminant analysis
model=lda(Personal.Loan~.,data=train,family=binomial(link ="logit"))
#Printing out the LDA model
model
## Call:
## lda(Personal.Loan ~ ., data = train, family = binomial(link = "logit"))
## 
## Prior probabilities of groups:
##       0       1 
## 0.90075 0.09925 
## 
## Group means:
##         ID      Age Experience    Income ZIP.Code   Family    CCAvg
## 0 2004.797 45.36331   20.11990  66.40216 93152.64 2.374410 1.732248
## 1 1961.506 45.13350   19.90932 144.70025 93126.20 2.624685 3.923401
##   Education Mortgage Securities.Account CD.Account    Online CreditCard
## 0  1.836803 53.36914          0.1038024 0.03469331 0.5939495  0.2903136
## 1  2.226700 99.74811          0.1309824 0.28967254 0.6045340  0.2921914
## 
## Coefficients of linear discriminants:
##                              LD1
## ID                 -9.878476e-06
## Age                -4.198180e-02
## Experience          4.514831e-02
## Income              2.094895e-02
## ZIP.Code            3.998308e-06
## Family              2.375879e-01
## CCAvg               8.744485e-02
## Education           5.616817e-01
## Mortgage            2.895433e-04
## Securities.Account -3.937272e-01
## CD.Account          2.278671e+00
## Online             -1.673432e-01
## CreditCard         -3.203705e-01
Fitting model to the test set and checking accuracy.
#Fitting training model on test set
pred = predict(model,newdata=test,type="response")
# Assess the accuracy of the prediction
# percent correct for each category
ct <- table(test$Personal.Loan, pred$class)
diag(prop.table(ct, 1))
##         0         1 
## 0.9803708 0.6024096
# total percent correct
sum(diag(prop.table(ct)))
## [1] 0.949
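The per-class accuracies above show that class 1 (loan acceptors) is harder to detect than class 0. As an optional sketch, predict() for an LDA fit also returns posterior probabilities, so a lower, hypothetical cutoff (0.3 here) could be used instead of the default assignment to the most probable class:
#Optional sketch: re-classify using the posterior probability of class 1 with a hypothetical 0.3 cutoff
post1 <- pred$posterior[, "1"]                 #posterior probability that the customer accepts the loan
pred.class.30 <- ifelse(post1 > 0.3, 1, 0)
table(Actual = test$Personal.Loan, Predicted = pred.class.30)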
LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
Quadratic discriminant analysis (QDA) provides an alternative. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution, and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. The QDA classifier assigns an observation X = x to the class for which the resulting discriminant function is largest; x enters this function as a quadratic, which is where QDA gets its name.
Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. When there are p predictors, estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters. With 50 predictors this is some multiple of 1,275, which is a lot of parameters. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are Kp linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations and so reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
Creating the model for quadratic discriminant analysis using the training set.
#Creating training model using quadratic discriminant analysis
model=qda(Personal.Loan~.,data=train,family=binomial(link ="logit"))
#Printing out the QDA model
model
## Call:
## qda(Personal.Loan ~ ., data = train, family = binomial(link = "logit"))
## 
## Prior probabilities of groups:
##       0       1 
## 0.90075 0.09925 
## 
## Group means:
##         ID      Age Experience    Income ZIP.Code   Family    CCAvg
## 0 2004.797 45.36331   20.11990  66.40216 93152.64 2.374410 1.732248
## 1 1961.506 45.13350   19.90932 144.70025 93126.20 2.624685 3.923401
##   Education Mortgage Securities.Account CD.Account    Online CreditCard
## 0  1.836803 53.36914          0.1038024 0.03469331 0.5939495  0.2903136
## 1  2.226700 99.74811          0.1309824 0.28967254 0.6045340  0.2921914
Fitting model to the test set and checking accuracy.
#Fitting training model on test set
pred = predict(model,newdata=test,type="response")
# Assess the accuracy of the prediction
# percent correct for each category
ct <- table(test$Personal.Loan, pred$class)
diag(prop.table(ct, 1))
##         0         1 
## 0.9563795 0.6626506
# total percent correct
sum(diag(prop.table(ct)))
## [1] 0.932
The K-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. The distance between the stored data and the new instance is calculated by means of a similarity measure, typically a distance measure such as the Euclidean distance, cosine similarity, or the Manhattan distance. In other words, the similarity to the data already in the system is calculated for any new data point entered into the system. Then this similarity value is used to perform predictive modeling: either classification, assigning a label or class to the new instance, or regression, assigning a value to the new instance. Basically: tell me who your neighbors are, and I will tell you who you are. Choosing K is an important step in KNN. In theory, if an infinite number of samples were available, the larger K is, the better the classification. The one thing we must be aware of is that all K neighbors have to be close, which is possible when an infinite number of samples is available but impossible in practice since the number of samples is finite. KNN is referred to as a lazy learning algorithm because the function is only approximated locally and all computation is deferred until classification.
Creating the model for K-nearest neighbors using the training set, with K = 1 through 20, and showing the accuracy output for each K.
#Creating model using K nearest neighbors
for (i in 1:20)
{
  cat("Accuracy for KNN using K = ", i, "\n", sep = "")
  model <- train(Personal.Loan~Income+CCAvg,data =train,
                 method ='kknn',algorithm=c("kd_tree"),tuneLength=3,number=1,k=i)
  #Fitting training model on test set
  pred = predict(model,newdata=test)
  # Assess the accuracy of the prediction
  # percent correct for each category
  ct <- table(test$Personal.Loan, pred)
  diag(prop.table(ct, 1))
  # total percent correct
  print(sum(diag(prop.table(ct))))
}
## Accuracy for KNN using K = 1
## [1] 0.925
## Accuracy for KNN using K = 2
## [1] 0.859
## Accuracy for KNN using K = 3
## [1] 0.827
## Accuracy for KNN using K = 4
## [1] 0.805
## Accuracy for KNN using K = 5
## [1] 0.787
## Accuracy for KNN using K = 6
## [1] 0.781
## Accuracy for KNN using K = 7
## [1] 0.776
## Accuracy for KNN using K = 8
## [1] 0.769
## Accuracy for KNN using K = 9
## [1] 0.766
## Accuracy for KNN using K = 10
## [1] 0.764
## Accuracy for KNN using K = 11
## [1] 0.759
## Accuracy for KNN using K = 12
## [1] 0.758
## Accuracy for KNN using K = 13
## [1] 0.757
## Accuracy for KNN using K = 14
## [1] 0.756
## Accuracy for KNN using K = 15
## [1] 0.755
## Accuracy for KNN using K = 16
## [1] 0.755
## Accuracy for KNN using K = 17
## [1] 0.755
## Accuracy for KNN using K = 18
## [1] 0.753
## Accuracy for KNN using K = 19
## [1] 0.751
## Accuracy for KNN using K = 20
## [1] 0.75
The best accuracy for K-nearest neighbors is achieved with K = 1, at 0.925.
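As an alternative cross-check on the KNN results, here is a minimal sketch using class::knn directly on standardized Income and CCAvg; this is not the caret/kknn setup used above, so the exact accuracies may differ:
#Alternative KNN sketch with class::knn on standardized predictors
library(class)
set.seed(1)                                    #ties among neighbors are broken at random
train.X <- scale(train[, c("Income", "CCAvg")])
test.X  <- scale(test[, c("Income", "CCAvg")],
                 center = attr(train.X, "scaled:center"),
                 scale  = attr(train.X, "scaled:scale"))
acc <- sapply(1:20, function(k) {
  pred.k <- knn(train.X, test.X, cl = factor(train$Personal.Loan), k = k)
  mean(pred.k == test$Personal.Loan)           #test-set accuracy for this K
})
which.max(acc)                                 #K with the highest test accuracy
max(acc)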
Though their motivations differ, the logistic regression
and LDA methods are closely connected. Both logistic regression and LDA produce
linear decision boundaries. The only difference between the two approaches lies
in the fact that β0 and β1 (in logistic regression) are estimated using
maximum likelihood, whereas c0 and c1 (in LDA) are computed using the estimated
mean and variance from a normal distribution. This same connection between LDA
and logistic regression also holds for multidimensional data with p > 1. Since
logistic regression and LDA differ only in their fitting procedures, one might
expect the two approaches to give similar results. This is often, but not
always, the case. LDA assumes that the observations are drawn from a Gaussian
distribution with a common covariance matrix in each class, and so can provide
some improvements over logistic regression when this assumption approximately
holds. Conversely, logistic regression can outperform LDA if these Gaussian
assumptions are not met.
KNN takes a completely different approach from the logistic regression, LDA,
and QDA classifiers. In order to make a prediction for an observation X = x,
the K training observations that are closest to x are identified. Then X is
assigned to the class to which the plurality of these observations belong.
Hence KNN is a completely non-parametric approach: no assumptions are made
about the shape of the decision boundary. Therefore, we can expect this
approach to dominate LDA and logistic regression when the decision boundary is highly
non-linear. On the other hand, KNN does not tell us which predictors are
important. QDA serves as a compromise between the non-parametric KNN method and
the linear LDA and logistic regression approaches. Since QDA assumes a
quadratic decision boundary, it can accurately model a wider range of problems
than can the linear methods. Though not as flexible as KNN, QDA can perform
better in the presence of a limited number of training observations because it
does make some assumptions about the form of the decision boundary. From the
results above, the Universal Bank data do not appear to follow a Gaussian
distribution, and the decision boundary is approximately linear. The best
approach, using test-set accuracy as the measure, is logistic regression, with
an accuracy of 95.8%. Linear discriminant analysis is not far below at 94.9%,
followed by quadratic discriminant analysis at 93.2%, and finally KNN at 92.5%.
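For reference, a small sketch collecting the test accuracies reported above into one table (values copied from the outputs in this report):
#Summary of reported test accuracies and the corresponding test errors
results <- data.frame(
  Method   = c("Logistic regression", "LDA", "QDA", "KNN (K = 1)"),
  Accuracy = c(0.958, 0.949, 0.932, 0.925)
)
results$Test.Error <- 1 - results$Accuracy
results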
*Method descriptions adapted from the ISLR text.