Part 1 - Logistic Regression Model

Data Source

This project uses a relatively small sample, with fewer than 600 observations in the training set, to solve a binary classification problem. In this setting, a logistic regression model offers quick implementation with good accuracy. The classification rule of the logistic model will be tuned manually based on the cross-validated prediction accuracy on the training set and then checked on the testing set. In addition, the random forest model is well known for its robustness and its built-in Out-of-Bag (OOB) error estimates, so for comparison we will also build and tune a random forest model on the same training and testing data.

The original data was downloaded from https://archive.ics.uci.edu/ml/datasets/Credit+Approval and has been split into a training set and a test set. Details of the variables in this dataset can be found at the link above. Classification and prediction are the primary focus of this project.

Data Cleaning

All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. This dataset is interesting because it contains a good mix of attributes: continuous, nominal with small numbers of values, and nominal with larger numbers of values. There are also a few missing values. We will remove the observations with missing values so that only complete cases are used for modeling.

The dataset is divided into two sets, one for model training and one for model testing. The training dataset contains 16 variables with 547 valid observations; the testing dataset contains 16 variables with 100 valid observations. Among the 15 predictor variables, 9 are categorical and 6 are continuous. The response variable is renamed “Y”, with “0” indicating a “-” (decline) result and “1” indicating a “+” (approve) result.

library(car)
library(glmnet)
library(randomForest)
library(pROC)
#training data
data.raw=read.table('project.1.data.2.train.txt',sep=',',na.strings = '?')
head(data.raw)
##   V1    V2    V3 V4 V5 V6 V7   V8 V9 V10 V11 V12 V13 V14 V15 V16
## 1  b 30.83 0.000  u  g  w  v 1.25  t   t   1   f   g 202   0   +
## 2  a 58.67 4.460  u  g  q  h 3.04  t   t   6   f   g  43 560   +
## 3  a 24.50 0.500  u  g  q  h 1.50  t   f   0   f   g 280 824   +
## 4  b 27.83 1.540  u  g  w  v 3.75  t   t   5   t   g 100   3   +
## 5  b 20.17 5.625  u  g  w  v 1.71  t   f   0   f   s 120   0   +
## 6  b 32.08 4.000  u  g  m  v 2.50  t   f   0   t   g 360   0   +
str(data.raw)
## 'data.frame':    583 obs. of  16 variables:
##  $ V1 : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
##  $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ V3 : num  0 4.46 0.5 1.54 5.62 ...
##  $ V4 : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
##  $ V5 : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
##  $ V6 : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
##  $ V7 : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
##  $ V8 : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ V9 : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V10: Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
##  $ V11: int  1 6 0 5 0 0 0 0 0 0 ...
##  $ V12: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
##  $ V13: Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
##  $ V14: int  202 43 280 100 120 360 164 80 180 52 ...
##  $ V15: int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ V16: Factor w/ 2 levels "-","+": 2 2 2 2 2 2 2 2 2 2 ...
#count missing values per column, then keep complete cases only
apply(is.na(data.raw),2,sum)
id_no_miss=(apply(is.na(data.raw),1,sum)==0)
data.train=data.raw[id_no_miss,]
sum(is.na(data.train))  #confirm no missing values remain
#recode the response: '+' (approve) -> 1, '-' (decline) -> 0
data.train$V16=as.character(data.train$V16)
data.train$V16[data.train$V16=='+']=1
data.train$V16[data.train$V16=='-']=0
data.train$V16=as.factor(data.train$V16)
colnames(data.train)[16]="Y"
str(data.train)
## 'data.frame':    547 obs. of  16 variables:
##  $ V1 : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
##  $ V2 : num  30.8 58.7 24.5 27.8 20.2 ...
##  $ V3 : num  0 4.46 0.5 1.54 5.62 ...
##  $ V4 : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
##  $ V5 : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...
##  $ V6 : Factor w/ 14 levels "aa","c","cc",..: 13 11 11 13 13 10 12 3 9 13 ...
##  $ V7 : Factor w/ 9 levels "bb","dd","ff",..: 8 4 4 8 8 8 4 8 4 8 ...
##  $ V8 : num  1.25 3.04 1.5 3.75 1.71 ...
##  $ V9 : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V10: Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 1 ...
##  $ V11: int  1 6 0 5 0 0 0 0 0 0 ...
##  $ V12: Factor w/ 2 levels "f","t": 1 1 1 2 1 2 2 1 1 2 ...
##  $ V13: Factor w/ 3 levels "g","p","s": 1 1 1 1 3 1 1 1 1 1 ...
##  $ V14: int  202 43 280 100 120 360 164 80 180 52 ...
##  $ V15: int  0 560 824 3 0 0 31285 1349 314 1442 ...
##  $ Y  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
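As an aside, the step-by-step recoding above can be written more compactly; a sketch, equivalent in effect:

#equivalent one-line recode of the response
data.train$Y=factor(ifelse(data.raw[id_no_miss,'V16']=='+',1,0))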
#testing data
data.raw1=read.table('project.1.data.2.test.txt',sep=',',na.strings = '?')
head(data.raw1)
str(data.raw1)
apply(is.na(data.raw1),2,sum)  #check missing values per column
data.test=data.raw1
#recode the response the same way as in the training set
data.test$V16=as.character(data.test$V16)
data.test$V16[data.test$V16=='+']=1
data.test$V16[data.test$V16=='-']=0
data.test$V16=as.factor(data.test$V16)
colnames(data.test)[16]="Y"
str(data.test)
## 'data.frame':    100 obs. of  16 variables:
##  $ V1 : Factor w/ 2 levels "a","b": 1 1 2 2 2 2 1 2 2 2 ...
##  $ V2 : num  22.7 25.2 17.9 31.8 38.7 ...
##  $ V3 : num  0.75 13.5 0.205 0.04 0.21 ...
##  $ V4 : Factor w/ 3 levels "l","u","y": 2 3 2 3 2 2 2 2 2 2 ...
##  $ V5 : Factor w/ 3 levels "g","gg","p": 1 3 1 3 1 1 1 1 1 1 ...
##  $ V6 : Factor w/ 13 levels "aa","c","cc",..: 2 6 1 10 9 2 1 12 4 12 ...
##  $ V7 : Factor w/ 7 levels "bb","ff","h",..: 6 2 6 6 6 1 6 6 6 6 ...
##  $ V8 : num  2 2 0.04 0.04 0.085 0.25 0.75 1.75 3.75 0.085 ...
##  $ V9 : Factor w/ 2 levels "f","t": 1 1 1 1 2 2 2 2 1 1 ...
##  $ V10: Factor w/ 2 levels "f","t": 2 2 1 1 1 1 2 2 1 1 ...
##  $ V11: int  2 1 0 0 0 0 4 5 0 0 ...
##  $ V12: Factor w/ 2 levels "f","t": 2 2 1 1 2 1 1 2 2 1 ...
##  $ V13: Factor w/ 2 levels "g","s": 1 1 1 1 1 1 1 1 1 1 ...
##  $ V14: int  200 200 280 0 280 349 0 60 0 320 ...
##  $ V15: int  394 1 750 0 0 23 1583 58 350 3552 ...
##  $ Y  : Factor w/ 2 levels "0","1": 1 1 1 1 2 2 2 2 1 1 ...

Data Exploration

For the training dataset, the data is processed and visualized to show the distribution of each variable. There are 6 continuous variables and 10 categorical variables (including the response “Y”). Figure 1 shows that all 6 continuous variables have skewed distributions. Figure 2 shows that a few levels of the categorical variables have very low counts; for example, levels “l”, “gg”, “r”, “o” and “p” in their respective categories each appear in fewer than 10 of the 547 observations. These levels could act as outliers and contribute to extreme results in the model. The correlation matrix of the six continuous variables below shows only weak to moderate pairwise correlations, the largest being about 0.40 between V2 and V8.
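The correlation matrix was presumably produced with code along these lines (a sketch; the original chunk is not shown):

#pairwise correlations among the six continuous predictors
cor(data.frame(data.train$V2,data.train$V3,data.train$V8,
               data.train$V11,data.train$V14,data.train$V15))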

##                data.train.V2 data.train.V3 data.train.V8 data.train.V11
## data.train.V2     1.00000000     0.2180186    0.40183561     0.15114282
## data.train.V3     0.21801862     1.0000000    0.24671375     0.21419701
## data.train.V8     0.40183561     0.2467138    1.00000000     0.22832784
## data.train.V11    0.15114282     0.2141970    0.22832784     1.00000000
## data.train.V14   -0.07251010    -0.2089433   -0.05211074    -0.12218274
## data.train.V15    0.03125634     0.1297286    0.05964029     0.05382323
##                data.train.V14 data.train.V15
## data.train.V2     -0.07251010     0.03125634
## data.train.V3     -0.20894332     0.12972858
## data.train.V8     -0.05211074     0.05964029
## data.train.V11    -0.12218274     0.05382323
## data.train.V14     1.00000000     0.07449885
## data.train.V15     0.07449885     1.00000000

Figure 1: histograms of the continuous variables

Figure 2: bar charts of the categorical variables

Modeling and Analysis

Method and Result Selection

The training data is first fitted without interaction terms using the glm method, and forward selection (step with direction="forward") is used to find a candidate model with the lowest AIC. All pairwise interaction terms among the variables in the candidate model are then added to the selection scope, and forward selection is run again to choose the final model with the lowest AIC. The procedure produced the final model shown below:

#forward selection over main effects only
full=glm(Y~.,data=data.train,family="binomial")
null=glm(Y~1,data=data.train,family = "binomial")
select=step(null,scope=list(lower=null,upper=full),direction="forward")
#build the scope for the second pass: selected main effects plus all pairwise interactions
data1=(select$model)
pred=colnames(data1)[-1]
m=length(pred)
pred.names=pred
for(i in 1:(m-1)){
  for(j in (i+1):m){
    pred.names=c(pred.names, paste(pred[i], ":", pred[j]))
  }
}
Formula <- formula(paste("Y ~ ",
                         paste(pred.names, collapse=" + ")))
Formula
#forward selection again, now allowing two-way interactions
fit.full.1=glm(Formula,data=data1,family=binomial)
fit.null.1=glm(Y~1,data=data1,family=binomial)
select.1=step(fit.null.1,scope=list(lower=fit.null.1,upper=fit.full.1),direction='forward')
fit.1=glm(Y ~ V9 + V15 + V11 + V6 + V4 + V14 + V8 + V9:V15 + V9:V11 + V4:V14 + 
            V4:V8 + V14:V8,data=data1,family=binomial)
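#note (not in the original code): the interaction scope built above is
#equivalent to the compact formula Y ~ .^2 on data1, e.g.
#fit.full.1=glm(Y~.^2,data=data1,family=binomial)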
summary(fit.1)
## 
## Call:
## glm(formula = Y ~ V9 + V15 + V11 + V6 + V4 + V14 + V8 + V9:V15 + 
##     V9:V11 + V4:V14 + V4:V8 + V14:V8, family = binomial, data = data1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8318  -0.3638  -0.0857   0.2782   3.2335  
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.744e+01  8.838e+02  -0.031 0.975229    
## V9t          2.842e+00  4.012e-01   7.084  1.4e-12 ***
## V15          4.123e-04  4.271e-04   0.965 0.334305    
## V11         -1.720e+00  1.295e+00  -1.328 0.184234    
## V6c          6.350e-01  5.717e-01   1.111 0.266701    
## V6cc         1.665e+00  7.791e-01   2.137 0.032620 *  
## V6d          1.089e+00  8.143e-01   1.337 0.181216    
## V6e          6.854e-01  1.004e+00   0.683 0.494856    
## V6ff        -5.938e-01  9.283e-01  -0.640 0.522427    
## V6i         -2.024e-01  7.186e-01  -0.282 0.778208    
## V6j          2.314e-01  1.232e+00   0.188 0.851026    
## V6k         -2.481e-01  7.244e-01  -0.342 0.731990    
## V6m          6.298e-01  7.146e-01   0.881 0.378072    
## V6q          7.061e-01  6.170e-01   1.144 0.252463    
## V6r          4.395e-01  4.035e+00   0.109 0.913260    
## V6w          1.464e+00  6.695e-01   2.186 0.028825 *  
## V6x          4.166e+00  1.145e+00   3.637 0.000276 ***
## V4u          2.447e+01  8.838e+02   0.028 0.977915    
## V4y          2.328e+01  8.838e+02   0.026 0.978981    
## V14          3.045e-03  1.287e-03   2.367 0.017953 *  
## V8          -3.590e-02  1.552e-01  -0.231 0.817051    
## V9t:V15      1.298e-03  7.567e-04   1.715 0.086330 .  
## V9t:V11      1.867e+00  1.297e+00   1.439 0.150029    
## V4u:V14     -5.206e-03  1.780e-03  -2.924 0.003456 ** 
## V4y:V14             NA         NA      NA       NA    
## V4u:V8       3.215e-01  1.637e-01   1.964 0.049556 *  
## V4y:V8              NA         NA      NA       NA    
## V14:V8      -5.176e-04  3.804e-04  -1.361 0.173661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 753.54  on 546  degrees of freedom
## Residual deviance: 301.79  on 521  degrees of freedom
## AIC: 353.79
## 
## Number of Fisher Scoring iterations: 13

The AIC of the final model is lower than that of the originally selected candidate model without interactions. The model with interaction terms is therefore considered the best balance between model simplicity and predictive accuracy.

The Type II analysis-of-deviance test (Anova from the car package) also shows that most terms in the final model are significant, except V8 and the interaction V14:V8, which are significant only at the 10% level. The final model with interaction terms will therefore be used for further analysis.

Anova(fit.1)
## Analysis of Deviance Table (Type II tests)
## 
## Response: Y
##        LR Chisq Df Pr(>Chisq)    
## V9      133.855  1  < 2.2e-16 ***
## V15      18.664  1  1.559e-05 ***
## V11       5.015  1   0.025128 *  
## V6       34.110 13   0.001158 ** 
## V4        6.457  2   0.039616 *  
## V14       4.680  1   0.030514 *  
## V8        2.876  1   0.089883 .  
## V9:V15    3.889  1   0.048608 *  
## V9:V11    6.364  1   0.011649 *  
## V4:V14    7.513  1   0.006127 ** 
## V4:V8     4.135  1   0.042013 *  
## V14:V8    2.910  1   0.088047 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the method discussed above, the fitted final model is a logistic regression with multiple interaction terms; it is the base model used in the validation and regularization steps. On the log-odds scale (coefficients taken from the summary above; V4y:V14 and V4y:V8 were dropped because of singularities):

logit(π) = -27.44 + 2.842*V9t + 0.000412*V15 - 1.720*V11 + 0.635*V6c + 1.665*V6cc + 1.089*V6d + 0.685*V6e - 0.594*V6ff - 0.202*V6i + 0.231*V6j - 0.248*V6k + 0.630*V6m + 0.706*V6q + 0.440*V6r + 1.464*V6w + 4.166*V6x + 24.47*V4u + 23.28*V4y + 0.00305*V14 - 0.0359*V8 + 0.00130*V9t:V15 + 1.867*V9t:V11 - 0.00521*V4u:V14 + 0.322*V4u:V8 - 0.000518*V14:V8
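
To convert the log-odds into an approval probability, apply the logistic link. For example, a quick sketch scoring the first training observation:

#predicted log-odds and approval probability for the first training row
eta=predict(fit.1, newdata=data1[1,], type='link')
plogis(eta)  #equivalently exp(eta)/(1+exp(eta))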

Model Validation with Leave-One-Out Method

The training data is first evaluated using the leave-one-out (LOO) method to compute the ROC curve and AUC. The optimal cut-off point π0 (pi0), the corresponding classification rule, and the model prediction accuracies are then determined.

ROC Curve and AUC

Given the training data, the ROC curve plots the model’s sensitivity against 1-specificity over the grid of cut-off values. The curve rises steeply, with sensitivity already high once 1-specificity exceeds about 0.1, and the AUC is 95.3%. Consequently, the model should have good predictive power.
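Since the pROC package is already loaded, the hand-computed AUC below can be sanity-checked against it (a sketch; this uses in-sample fitted probabilities rather than leave-one-out probabilities, so the value may differ slightly):

#cross-check: AUC from in-sample fitted probabilities
roc.check=roc(data.train$Y, fitted(fit.1))
auc(roc.check)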

##calculate ROC and AUC using the leave-one-out (LOO) method
pi0=seq(1,0,length.out = 500)
nsample=nrow(data.train)
#LOO predicted probabilities: refit the model once per held-out observation
pred.prob.loo=numeric(nsample)
for (i in 1:nsample){
  training=data.train[-i,]
  test=data.train[i,]
  fit.1.training=glm(formula(fit.1),data=training,family = binomial)
  pred.prob.loo[i]=predict(fit.1.training, test, type='response')
}
#sweep the cut-off grid to trace the ROC curve
roc.1=NULL
for (k in 1:length(pi0)){
  Y.pred.1=as.numeric(pred.prob.loo>=pi0[k])
  sen1=sum(data.train$Y==1 & Y.pred.1==1)/sum(data.train$Y==1)
  spe1=sum(data.train$Y==0 & Y.pred.1==0)/sum(data.train$Y==0)
  roc.1=rbind(roc.1,c(1-spe1,sen1))
}
plot(roc.1,type='s',xlim=c(0,1),ylim = c(0,1),col='red',lwd=3,main='ROC curve',xlab='1-Specificity',ylab='Sensitivity')

str(roc.1)
#AUC by the rectangle rule over the ROC grid
auc.1=sum(roc.1[-500,2]*(roc.1[-1,1]-roc.1[-500,1]))
auc.1
## [1] 0.9531907

Classification Rule

Based on the ROC curve and AUC results, the optimal cut-off point is π0 = 0.4849699, the grid value that maximizes the overall prediction accuracy. This produces the following classification rule:

    Optimal π0 = 0.4849699
    * if π ≥ 0.4849699, Y=1
    * if π < 0.4849699, Y=0
    

Given a new observation, if the predicted probability is greater than or equal to 0.4849699, the model classifies it as Y = 1, that is, an “approve” result. Otherwise, it classifies it as Y = 0, that is, a “decline” result.

#overall accuracy at each cut-off: sensitivity and specificity weighted by the class frequencies
pred.accuracy.1=roc.1[,2]*sum(data.train$Y==1)/nsample+(1-roc.1[,1])*sum(data.train$Y==0)/nsample
#cut-off point pi0 that maximizes the accuracy
which.max(pred.accuracy.1)
pi0.1=pi0[(which.max(pred.accuracy.1))]
pi0.1
## [1] 0.4849699
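
For convenience, the rule can be wrapped in a small helper function (hypothetical, not part of the original code):

#apply the classification rule at the chosen cut-off
classify=function(model, newdata, cutoff=pi0.1){
  as.numeric(predict(model, newdata, type='response')>=cutoff)
}
classify(fit.1, data.test[1:5,])  #example: first five test observations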

Prediction Accuracy

Applying the established classification rule with the leave-one-out method for model validation, the current model produced

Training Accuracy= 0.8939671
Prediction Accuracy/Testing Accuracy= 0.86
Difference in accuracy between training and testing = 3.4%
#prediction accuracy/training accuracy (value at the optimal cut-off found above)
pred.train=max(pred.accuracy.1)
pred.train
## [1] 0.8939671
#prediction accuracy/test accuracy: apply the classification rule to each test observation
nsample=nrow(data.test)
n_correct=0
for (i in 1:nsample){
  test=data.test[i,]
  if(predict(fit.1, test, type='response')>=pi0.1){
    Y.pred=1
  }else{
    Y.pred=0
  }
  if(test$Y==Y.pred){
    n_correct=n_correct+1
  }
}
pred.test=n_correct/nsample
pred.test
## [1] 0.86
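
Beyond overall accuracy, a confusion matrix shows how the 86% breaks down between the two classes (a sketch, not in the original analysis):

#confusion matrix on the test set at the chosen cut-off
Y.pred.test=as.numeric(predict(fit.1, data.test, type='response')>=pi0.1)
table(actual=data.test$Y, predicted=Y.pred.test)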

Model Evaluation Applying Regularization

In addition to the leave-one-out evaluation, the model is then evaluated with regularization. Specifically, a ridge penalty is used: a term λ times the sum of the squared coefficients is added to the negative log-likelihood, shrinking the coefficient estimates and mitigating potential problems with collinearity.

Classification Rule

Based on the training results after applying the ridge penalty, the optimal cut-off point is π0 = 0.5454545. This produces the following classification rule:

  Tuning Parameter λ = 0.07454568
  Optimal π0 = 0.5454545
  * if π ≥ 0.5454545, Y=1
  * if π < 0.5454545, Y=0

Given a new observation, if the predicted probability is greater than or equal to 0.5454545, the model classifies it as Y = 1, that is, an “approve” result. Otherwise, it classifies it as Y = 0, that is, a “decline” result.

#model regularization - ridge penalty
#note: cbind coerces the factor predictors to their integer level codes;
#glmnet requires a numeric matrix (model.matrix would give standard dummy coding instead)
x=cbind(data.train$V9,data.train$V15,data.train$V11,data.train$V6,data.train$V4,data.train$V14,data.train$V8)
y=data.train$Y
fit.0=glmnet(x=x, y=y, family='binomial',alpha=0)
lambda.seq=fit.0$lambda  #lambda grid chosen by glmnet
pi0=seq(0,1, length.out = 100)
correct.num=matrix(0, length(lambda.seq), length(pi0))
nsample=nrow(data.train)
#LOO: refit the ridge path without observation i, then score it over the (lambda, pi0) grid
for(i in 1:nsample){
  print(c("i=", i))  #progress indicator
  x.train=x[-i,]
  x.test=matrix(x[i,],1,ncol(x))
  y.train=y[-i]
  y.test=y[i]
  fit.0.training=glmnet(x.train, y.train, family="binomial", alpha=0)
  for(j in 1:length(lambda.seq)){
    pred.prob=predict(fit.0.training, newx=x.test, s=lambda.seq[j], type="response")
    for(k in 1:length(pi0)){
      if(pred.prob>=pi0[k]){
        Y.pred.1=1
      }else{Y.pred.1=0}
      if((y.test==1)&(Y.pred.1==1)){
        correct.num[j,k]=correct.num[j,k]+1
      }
      if((y.test==0)&(Y.pred.1==0)){
        correct.num[j,k]=correct.num[j,k]+1
      }
    }
  }
}
#output: LOO accuracy over the (lambda, pi0) grid
accuracy=correct.num/nsample
max(accuracy)
which(accuracy==max(accuracy), arr.ind=TRUE)
#indices of the maximum accuracy, taken from the which() output above
j=92
k=55
lambda=lambda.seq[j]
pi0.2=pi0[k]
lambda
## [1] 0.07454568
pi0.2
## [1] 0.5454545
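
As an aside, glmnet also provides built-in k-fold cross-validation for tuning λ, which is much faster than the leave-one-out grid search above (a sketch; it may select a somewhat different λ):

#alternative: tune lambda by 10-fold CV on misclassification error
cv.fit=cv.glmnet(x, y, family='binomial', alpha=0, type.measure='class')
cv.fit$lambda.min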

Prediction Accuracy

Based on the classification rule after applying regularization, with the leave-one-out method used for model evaluation, the model produced

Training Accuracy= 0.8628885
Prediction Accuracy/Testing Accuracy= 0.85
Difference in accuracy between training and testing = 1.3%
#prediction accuracy after regularization/test accuracy
#build the test design matrix the same way as the training matrix x
#(caveat: the integer codes of the factor levels must match between train and test)
x.test=cbind(data.test$V9,data.test$V15,data.test$V11,data.test$V6,data.test$V4,data.test$V14,data.test$V8)
nsample=nrow(data.test)
n_correct=0
for (i in 1:nsample){
  x.new=matrix(x.test[i,],1,ncol(x.test))
  if(predict(fit.0, newx=x.new, s=lambda, type='response')>=pi0.2){
    Y.pred=1
  }else{
    Y.pred=0
  }
  if(data.test$Y[i]==Y.pred){
    n_correct=n_correct+1
  }
}
pred.test1=n_correct/nsample
pred.train1=max(accuracy)
pred.train1
## [1] 0.8628885
pred.test1
## [1] 0.85

Logistic Regression Model Conclusion

The validation and regularization results imply that applying regularization does not significantly improve training or testing accuracy. However, the gap between training and testing accuracy decreased by 2.1 percentage points, which indicates that the regularized model is likely to produce more reliable predictions. The changes in model accuracy after regularization are summarized below:

* The training accuracy with the ridge penalty is about 3% lower than the initial validation result. 
* The testing accuracy with the ridge penalty is 1% lower than the initial validation result. 
* The gap between training and testing accuracy decreased by 2.1 percentage points.  

Therefore, the model tuned with the ridge penalty should be used for future predictions because of its more reliable generalization. The tuning parameter, optimal cut-off point, classification rule and prediction accuracies are summarized below.

      Tuning Parameter λ = 0.07454568
      Optimal π0 = 0.5454545
       * if π ≥ 0.5454545, Y=1
       * if π < 0.5454545, Y=0

Training Accuracy= 0.8628885
Prediction Accuracy/Testing Accuracy= 0.85
Difference in accuracy between training and testing = 1.3%
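
Finally, future predictions with the tuned ridge model can be made along these lines (a sketch; predict_approval is a hypothetical helper, and x.new must be encoded exactly like the training matrix x):

#classify a new applicant with the tuned ridge model
predict_approval=function(x.new, cutoff=pi0.2){
  p=predict(fit.0, newx=x.new, s=lambda, type='response')
  ifelse(p>=cutoff, 'approve (Y=1)', 'decline (Y=0)')
}
predict_approval(matrix(x[1,],1,ncol(x)))  #example: first training row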