CAP 5703 Assignment#2  Danilo Martinez

You will use classification methods to predict whether a given customer accepts his/her personal loan offer based on the Universal Bank dataset. There are a total of 5,000 customers in the data set and 14 variables. A brief description of the 14 variables is given below:

ID: Customer ID
Age: Customer's age in completed years
Experience: Number of years of professional experience
Income: Annual income of the customer (in $1,000s)
ZIP Code: Home address ZIP code
Family: Family size of the customer
CCAvg: Average monthly credit card spending (in $1,000s)
Education: Education level: 1 = undergraduate; 2 = graduate; 3 = advanced/professional
Mortgage: Value of house mortgage, if any (in $1,000s)
Personal Loan: Did this customer accept the personal loan offered in the last campaign? 1 = yes; 0 = no
Securities Account: Does the customer have a securities account with the bank?
CD Account: Does the customer have a certificate of deposit (CD) account with the bank?
Online: Does the customer use internet banking facilities?
CreditCard: Does the customer use a credit card issued by the bank?

Use the first 4,000 records as your training set and the rest as your test set. Explore four different classification methods: logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and KNN using only two predictor variables (Income and CCAvg) to predict whether a given customer accepts his/her personal loan offer. Compare the test error of these four methods. For the KNN method, try all values of K = {1, 2, ..., 20} and choose the optimal K when comparing the KNN method with the other classification methods. Describe your findings.

Setting working directory and importing data file.

#Setting working directory to import files
setwd("C:/Users/Me/OneDriveLatestData/OneDrive - University of Central Florida - UCF/Data Mining I")
#Reading in the data file
info<-read.csv("UniversalBank.csv", header=TRUE)
head(info,10)

##    ID Age Experience Income ZIP.Code Family CCAvg Education Mortgage
## 1   1  25          1     49    91107      4   1.6         1        0
## 2   2  45         19     34    90089      3   1.5         1        0
## 3   3  39         15     11    94720      1   1.0         1        0
## 4   4  35          9    100    94112      1   2.7         2        0
## 5   5  35          8     45    91330      4   1.0         2        0
## 6   6  37         13     29    92121      4   0.4         2      155
## 7   7  53         27     72    91711      2   1.5         2        0
## 8   8  50         24     22    93943      1   0.3         3        0
## 9   9  35         10     81    90089      3   0.6         2      104
## 10 10  34          9    180    93023      1   8.9         3        0
##    Personal.Loan Securities.Account CD.Account Online CreditCard
## 1              0                  1          0      0          0
## 2              0                  1          0      0          0
## 3              0                  0          0      0          0
## 4              0                  0          0      0          0
## 5              0                  0          0      0          1
## 6              0                  0          0      1          0
## 7              0                  0          0      1          0
## 8              0                  0          0      0          1
## 9              0                  0          0      1          0
## 10             1                  0          0      0          0

Dividing data into training and test set.

#Setting up training set
train=head(info,4000)
#Setting up test set
test=tail(info,1000)
#Removing unused data
rm(info)

Logistic regression is a special type of regression in which a binary response variable is related to a set of explanatory variables, which can be discrete and/or continuous. The important point to note is that in linear regression, the expected value of the response variable is modeled as a combination of the values taken by the predictors. In logistic regression, the probability, or odds, of the response taking a particular value is modeled as a combination of the values taken by the predictors. Like regression, and unlike log-linear models, we make an explicit distinction between a response variable and one or more predictor (explanatory) variables.

Creating the model for logistic regression using the training set:

#Loading caret (used later for the KNN models via train())
library(caret)
#Creating training model using logistic regression
model=glm(Personal.Loan~.,data=train,family=binomial(link ="logit"))
#Printing out the logistic model
summary(model)

##
## Call:
## glm(formula = Personal.Loan ~ ., family = binomial(link = "logit"),
##     data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max 
## -3.1518  -0.2087  -0.0852  -0.0341   3.8598 
##
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)        -1.118e+01  4.500e+00  -2.485 0.012954 * 
## ID                 -8.350e-06  7.037e-05  -0.119 0.905551   
## Age                -6.458e-02  6.745e-02  -0.957 0.338350   
## Experience          7.546e-02  6.716e-02   1.124 0.261193   
## Income              5.254e-02  2.795e-03  18.797  < 2e-16 ***
## ZIP.Code           -4.018e-06  4.446e-05  -0.090 0.927987   
## Family              6.812e-01  8.064e-02   8.448  < 2e-16 ***
## CCAvg               1.452e-01  4.331e-02   3.352 0.000802 ***
## Education           1.672e+00  1.243e-01  13.446  < 2e-16 ***
## Mortgage            4.646e-04  6.059e-04   0.767 0.443190   
## Securities.Account -9.590e-01  3.093e-01  -3.101 0.001929 **
## CD.Account          3.772e+00  3.564e-01  10.583  < 2e-16 ***
## Online             -5.996e-01  1.718e-01  -3.490 0.000483 ***
## CreditCard         -1.102e+00  2.234e-01  -4.933 8.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 2587.5  on 3999  degrees of freedom
## Residual deviance: 1060.8  on 3986  degrees of freedom
## AIC: 1088.8
##
## Number of Fisher Scoring iterations: 7
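
For reference, the coefficients in the summary above are on the log-odds scale: the fitted model has the form shown below, and exponentiating a coefficient gives a multiplicative effect on the odds. The worked illustration uses the Income estimate from the summary; it is an interpretation of the output above, not an additional result.

\[
\log\frac{\Pr(\text{Personal.Loan}=1\mid X)}{1-\Pr(\text{Personal.Loan}=1\mid X)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad e^{\hat{\beta}_{\text{Income}}} = e^{0.0525} \approx 1.054
\]

That is, each additional $1,000 of annual income is estimated to multiply the odds of accepting the loan by about 1.054, holding the other predictors fixed.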

Fitting model to the test set and checking accuracy.

#Fitting training model on test set
pred = predict(model,newdata=test,type="response")
#Including library for misclassification error function
library(InformationValue)
#Calculating Accuracy
1-misClassError(test$Personal.Loan,pred)

## [1] 0.958
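
For comparison with the per-class breakdowns reported for LDA and QDA below, the same breakdown can be computed for the logistic model. This is a minimal sketch assuming the 0.5 probability cutoff that misClassError uses by default; it is not part of the original output above.

#Converting predicted probabilities to classes at a 0.5 cutoff
pred.class <- ifelse(pred > 0.5, 1, 0)
#Confusion matrix of actual vs. predicted classes
ct <- table(test$Personal.Loan, pred.class)
# percent correct for each category
diag(prop.table(ct, 1))
# total percent correct (should agree with the accuracy above)
sum(diag(prop.table(ct)))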

Logistic regression involves directly modeling Pr(Y = k|X = x) using the logistic function for the case of two response classes. In statistical jargon, we model the conditional distribution of the response Y, given the predictor(s) X. We now consider an alternative and less direct approach to estimating these probabilities.

Linear discriminant analysis (LDA) takes this alternative approach: we model the distribution of the predictors X separately in each of the response classes (i.e., given Y), and then use Bayes' theorem to flip these around into estimates for Pr(Y = k|X = x). When these distributions are assumed to be normal, it turns out that the model is very similar in form to logistic regression. Why do we need another method when we have logistic regression? There are several reasons. First, when the classes are well separated, the parameter estimates for the logistic regression model are surprisingly unstable; linear discriminant analysis does not suffer from this problem. Second, if n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model. Third, linear discriminant analysis is popular when we have more than two response classes.

Creating the model for linear discriminant analysis using the training set:

#Loading MASS, which provides the lda() and qda() functions
library(MASS)
#Creating training model using linear discriminant analysis
model=lda(Personal.Loan~.,data=train)
#Printing out the LDA model
model

## Call:
## lda(Personal.Loan ~ ., data = train)
##
## Prior probabilities of groups:
##       0       1
## 0.90075 0.09925
##
## Group means:
##         ID      Age Experience    Income ZIP.Code   Family    CCAvg
## 0 2004.797 45.36331   20.11990  66.40216 93152.64 2.374410 1.732248
## 1 1961.506 45.13350   19.90932 144.70025 93126.20 2.624685 3.923401
##   Education Mortgage Securities.Account CD.Account    Online CreditCard
## 0  1.836803 53.36914          0.1038024 0.03469331 0.5939495  0.2903136
## 1  2.226700 99.74811          0.1309824 0.28967254 0.6045340  0.2921914
##
## Coefficients of linear discriminants:
##                              LD1
## ID                 -9.878476e-06
## Age                -4.198180e-02
## Experience          4.514831e-02
## Income              2.094895e-02
## ZIP.Code            3.998308e-06
## Family              2.375879e-01
## CCAvg               8.744485e-02
## Education           5.616817e-01
## Mortgage            2.895433e-04
## Securities.Account -3.937272e-01
## CD.Account          2.278671e+00
## Online             -1.673432e-01
## CreditCard         -3.203705e-01

Fitting model to the test set and checking accuracy.

#Fitting training model on test set
pred = predict(model,newdata=test)
# Assess the accuracy of the prediction
# percent correct for each category
ct <- table(test$Personal.Loan, pred$class)
diag(prop.table(ct, 1))

##         0         1
## 0.9803708 0.6024096

# total percent correct
sum(diag(prop.table(ct)))

## [1] 0.949

LDA assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
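
For reference (standard notation from the ISLR text, not output from this analysis), under these assumptions LDA assigns an observation x to the class k with the largest discriminant score; QDA, described next, replaces the shared covariance matrix \(\Sigma\) with a class-specific \(\Sigma_k\):

\[
\delta_k^{\mathrm{LDA}}(x) = x^{T}\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \log\pi_k,
\qquad
\delta_k^{\mathrm{QDA}}(x) = -\tfrac{1}{2}(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) - \tfrac{1}{2}\log\lvert\Sigma_k\rvert + \log\pi_k
\]

With the p = 13 predictors used in these models, a single covariance matrix has p(p+1)/2 = 91 free parameters, so LDA estimates 91 covariance parameters while QDA estimates one matrix per class, 2 × 91 = 182.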

Quadratic discriminant analysis (QDA) provides an alternative. Like LDA, the QDA classifier results from assuming that the observations from each class are drawn from a Gaussian distribution and plugging estimates for the parameters into Bayes' theorem in order to perform prediction. However, unlike LDA, QDA assumes that each class has its own covariance matrix. The QDA classifier assigns an observation X = x to the class for which the discriminant quantity is largest; x appears in this quantity as a quadratic function, which is where QDA gets its name.

Why does it matter whether or not we assume that the K classes share a common covariance matrix? In other words, why would one prefer LDA to QDA, or vice versa? The answer lies in the bias-variance trade-off. When there are p predictors, estimating a covariance matrix requires estimating p(p+1)/2 parameters. QDA estimates a separate covariance matrix for each class, for a total of Kp(p+1)/2 parameters. With 50 predictors this is some multiple of 1,275, which is a lot of parameters. By instead assuming that the K classes share a common covariance matrix, the LDA model becomes linear in x, which means there are Kp linear coefficients to estimate. Consequently, LDA is a much less flexible classifier than QDA, and so has substantially lower variance. This can potentially lead to improved prediction performance. But there is a trade-off: if LDA's assumption that the K classes share a common covariance matrix is badly off, then LDA can suffer from high bias. Roughly speaking, LDA tends to be a better bet than QDA if there are relatively few training observations, so that reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.

Creating the model for quadratic discriminant analysis using the training set:

#Creating training model using quadratic discriminant analysis
model=qda(Personal.Loan~.,data=train)
#Printing out the QDA model
model

## Call:
## qda(Personal.Loan ~ ., data = train)
##
## Prior probabilities of groups:
##       0       1
## 0.90075 0.09925
##
## Group means:
##         ID      Age Experience    Income ZIP.Code   Family    CCAvg
## 0 2004.797 45.36331   20.11990  66.40216 93152.64 2.374410 1.732248
## 1 1961.506 45.13350   19.90932 144.70025 93126.20 2.624685 3.923401
##   Education Mortgage Securities.Account CD.Account    Online CreditCard
## 0  1.836803 53.36914          0.1038024 0.03469331 0.5939495  0.2903136
## 1  2.226700 99.74811          0.1309824 0.28967254 0.6045340  0.2921914

Fitting model to the test set and checking accuracy.

#Fitting training model on test set
pred = predict(model,newdata=test)
# Assess the accuracy of the prediction
# percent correct for each category
ct <- table(test$Personal.Loan, pred$class)
diag(prop.table(ct, 1))

##         0         1
## 0.9563795 0.6626506

# total percent correct
sum(diag(prop.table(ct)))

## [1] 0.932

The K-nearest neighbors (KNN) algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. The distance between the stored data and the new instance is calculated by means of some kind of similarity measure, typically a distance measure such as Euclidean distance, cosine similarity, or Manhattan distance. In other words, for any new data point you input into the system, its similarity to the data already in the system is calculated, and that similarity value is then used for predictive modeling: either classification, assigning a label or class to the new instance, or regression, assigning a value to the new instance. Basically, tell me who your neighbors are, and I will tell you who you are. Choosing K is an important step in KNN. In theory, if an infinite number of samples were available, the larger K is, the better the classification, provided that all K neighbors remain close to the query point; this is possible with infinite samples but impossible in practice, since the number of samples is finite. KNN is referred to as a lazy learning algorithm because the function is only approximated locally and all computation is deferred until classification.

Creating models for K-nearest neighbors using the training set, with K = 1 through 20, and showing the accuracy for each:

#Creating model using K Nearest neighbor
for (i in 1:20)
{
  cat("Accuracy for KNN using K = ", i, "\n", sep = "")
  model <- train(Personal.Loan~Income+CCAvg,data =train,
  method ='kknn',algorithm=c("kd_tree"),tuneLength=3,number=1,k=i)
  #Fitting training model on test set
  pred = predict(model,newdata=test)
  # Assess the accuracy of the prediction
  # percent correct for each category
  ct <- table(test$Personal.Loan, pred)
  diag(prop.table(ct, 1))
  # total percent correct
  print(sum(diag(prop.table(ct))))
}

## Accuracy for KNN using K = 1
## [1] 0.925
## Accuracy for KNN using K = 2
## [1] 0.859
## Accuracy for KNN using K = 3
## [1] 0.827
## Accuracy for KNN using K = 4
## [1] 0.805
## Accuracy for KNN using K = 5
## [1] 0.787
## Accuracy for KNN using K = 6
## [1] 0.781
## Accuracy for KNN using K = 7
## [1] 0.776
## Accuracy for KNN using K = 8
## [1] 0.769
## Accuracy for KNN using K = 9
## [1] 0.766
## Accuracy for KNN using K = 10
## [1] 0.764
## Accuracy for KNN using K = 11
## [1] 0.759
## Accuracy for KNN using K = 12
## [1] 0.758
## Accuracy for KNN using K = 13
## [1] 0.757
## Accuracy for KNN using K = 14
## [1] 0.756
## Accuracy for KNN using K = 15
## [1] 0.755
## Accuracy for KNN using K = 16
## [1] 0.755
## Accuracy for KNN using K = 17
## [1] 0.755
## Accuracy for KNN using K = 18
## [1] 0.753
## Accuracy for KNN using K = 19
## [1] 0.751
## Accuracy for KNN using K = 20
## [1] 0.75

The best accuracy for K-nearest neighbors is achieved with K = 1, at 0.925.
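
An alternative, more direct way to run the same experiment is sketched below. It uses class::knn on the two predictors after standardizing them with the training means and standard deviations; this is a cross-check sketch only, not the fit reported above, and it may give slightly different accuracies.

#Alternative KNN sketch using class::knn directly on Income and CCAvg
library(class)
vars <- c("Income", "CCAvg")
#Standardizing both sets with the training means and standard deviations
mu <- colMeans(train[, vars])
sdv <- apply(train[, vars], 2, sd)
train.x <- scale(train[, vars], center = mu, scale = sdv)
test.x <- scale(test[, vars], center = mu, scale = sdv)
#Test accuracy for each K from 1 to 20
acc <- sapply(1:20, function(k) {
  pred.k <- knn(train.x, test.x, factor(train$Personal.Loan), k = k)
  mean(pred.k == test$Personal.Loan)
})
which.max(acc)  #K with the highest test accuracy
max(acc)        #corresponding accuracy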

Though their motivations differ, the logistic regression and LDA methods are closely connected. Both logistic regression and LDA produce linear decision boundaries. The only difference between the two approaches lies in the fact that β0 and β1 (in logistic regression) are estimated using maximum likelihood, whereas c0 and c1 (in LDA) are computed using the estimated mean and variance from a normal distribution. This same connection between LDA and logistic regression also holds for multidimensional data with p > 1. Since logistic regression and LDA differ only in their fitting procedures, one might expect the two approaches to give similar results. This is often, but not always, the case. LDA assumes that the observations are drawn from a Gaussian distribution with a common covariance matrix in each class, and so can provide some improvements over logistic regression when this assumption approximately holds. Conversely, logistic regression can outperform LDA if these Gaussian assumptions are not met.
KNN takes a completely different approach from the logistic regression, LDA, and QDA classifiers. In order to make a prediction for an observation X = x, the K training observations that are closest to x are identified, and X is then assigned to the class to which the plurality of these observations belong. Hence KNN is a completely non-parametric approach: no assumptions are made about the shape of the decision boundary. Therefore, we can expect this approach to dominate LDA and logistic regression when the decision boundary is highly non-linear. On the other hand, KNN does not tell us which predictors are important. QDA serves as a compromise between the non-parametric KNN method and the linear LDA and logistic regression approaches. Since QDA assumes a quadratic decision boundary, it can accurately model a wider range of problems than the linear methods can. Though not as flexible as KNN, QDA can perform better in the presence of a limited number of training observations because it does make some assumptions about the form of the decision boundary. From the results above, the Universal Bank data do not appear to follow the Gaussian assumptions closely, and the decision boundary appears to be approximately linear. The best approach, using test accuracy as the measure, is logistic regression, with an accuracy of 95.8% (a test error of 4.2%). Linear discriminant analysis is not far below at 94.9% (5.1% error), followed by quadratic discriminant analysis at 93.2% (6.8% error), and finally KNN with the optimal K = 1 at 92.5% (7.5% error).

*Reference: An Introduction to Statistical Learning (ISLR) text.