Prediction methods analysis with the German Credit Data set

Tao Lin (Richie)
12/29/2015

Dataset introduction

This dataset has been widely used for machine learning practice. Here is a brief introduction:

  • There are 1000 observations in the dataset.
  • There are 20 independent variables; the dependent variable is the evaluation of the client's current credit status.
  • Values in the dataset have been replaced with codes for privacy reasons.

Prediction methods introduction

For this dataset, I am going to use four commonly used methods to build prediction models:

  • Logistic Regression
  • LASSO (least absolute shrinkage and selection operator)
  • Classification Tree
  • Random Forest

Dataset decoding and cleaning

The dataset is coded rather than encrypted, so we need to decode it with the codebook.

Before

summary(df1)[,c(1,3,4)]
 Check_Acc   Credit_history    Purpose     
 "A11:274  " "A30: 40  "    "A43    :280  "
 "A12:269  " "A31: 49  "    "A40    :234  "
 "A13: 63  " "A32:530  "    "A42    :181  "
 "A14:394  " "A33: 88  "    "A41    :103  "
 NA          "A34:293  "    "A49    : 97  "
 NA          NA             "A46    : 50  "
 NA          NA             "(Other): 55  "

After

summary(df1)[,c(1,3,4)]
                  Check_Acc    
 "      ... <    0 DM   :274  "
 " 0 <= ... <  200 DM   :269  "
 "      ... >= 200 DM / : 63  "
 " no checking account  :394  "
 NA                            
 NA                            
 NA                            
                                   Credit_history 
 " no credits taken/                       : 40  "
 " all credits at this bank paid back duly : 49  "
 " existing credits paid back duly till now:530  "
 " delay in paying off in the past         : 88  "
 " critical account/                       :293  "
 NA                                               
 NA                                               
                  Purpose     
 " domestic appliances :280  "
 " car (new)           :234  "
 " radio/television    :181  "
 " car (used)          :103  "
 " others              : 97  "
 " retraining          : 50  "
 "(Other)              : 55  "
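The recoding step itself is not shown on the slides. A minimal sketch of how one coded column could be relabelled, assuming a lookup vector typed in by hand from the codebook. The names codebook and x are hypothetical, the labels are abbreviated from the summary above, and the toy vector stands in for df1$Check_Acc:

```r
# Hypothetical lookup table: code -> label (labels abbreviated from the summary)
codebook <- c(
  A11 = "... < 0 DM",
  A12 = "0 <= ... < 200 DM",
  A13 = "... >= 200 DM",
  A14 = "no checking account"
)

# Toy stand-in for a coded factor column such as df1$Check_Acc
x <- factor(c("A11", "A14", "A12", "A14"))

# Replace each level (a code) with its label; unname() drops the lookup names
levels(x) <- unname(codebook[levels(x)])

decoded <- as.character(x)
print(table(x))
```

Relabelling the levels rather than the individual values keeps the column a factor, which is what summary() and the modelling functions used later expect.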

Dataset preparation for model building

We split the dataset 70:30 into training and test sets. For the logistic regression, we also need to convert the data frame of factors into a numeric matrix with binary (0/1) dummy columns.

# Expand every factor into 0/1 dummy columns
mat1 <- model.matrix(Credit ~ ., data = df1)
n <- nrow(df1)

# Reproducible 70/30 split by row index
set.seed(1234)
train <- sample(1:n, 0.7 * n)
xtrain <- mat1[train, ]
xtest <- mat1[-train, ]

ytrain <- df1$Credit[train]
ytest <- df1$Credit[-train]
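To see what model.matrix does to factor columns, here is a small self-contained sketch on a toy data frame. The toy data frame and its column values are made up for illustration and are not part of the credit data:

```r
# Toy data frame with one factor and one numeric predictor (illustrative only)
toy <- data.frame(
  Credit   = factor(c(1, 2, 1, 2, 1, 1)),
  Purpose  = factor(c("car", "radio", "car", "other", "radio", "other")),
  Duration = c(6, 48, 12, 42, 24, 36)
)

# Every factor level except the first becomes its own 0/1 column
toy_mat <- model.matrix(Credit ~ ., data = toy)
print(colnames(toy_mat))

# Reproducible 70/30 split by row index
set.seed(1234)
n_toy   <- nrow(toy)
idx     <- sample(seq_len(n_toy), floor(0.7 * n_toy))
toy_trn <- toy_mat[idx, , drop = FALSE]
toy_tst <- toy_mat[-idx, , drop = FALSE]
```

The first level of Purpose ("car") has no column of its own; it is the baseline absorbed by the intercept column, which model.matrix adds automatically.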

Logistic Regression

Build the logistic regression model.

m1 <- glm(Credit ~ . , family = binomial, data= data.frame(Credit= ytrain, xtrain))

Key variables of the regression model (coefficients with a p-value below 0.01):

# Column 4 holds the p-values; [-1, ] drops the intercept row
sig.var <- summary(m1)$coeff[-1, 4] < 0.01
names(sig.var)[sig.var]
[1] "Check_Acc.no.checking.account"        
[2] "Duration"                             
[3] "Credit_history.critical.account.."    
[4] "Purpose.car..used.."                  
[5] "Purpose.radio.television."            
[6] "Purpose.domestic.appliances."         
[7] "Savings................1000.DM"       
[8] "Savings...unknown..no.savings.account"
[9] "Rate_of_income"                       

Predict the outcome with the logistic regression model, then use the test dataset to evaluate the model.

pred1 <- predict(m1, newdata = data.frame(ytest, xtest), type = "response")
# floor(p + 1.5) gives class 1 when p < 0.5 and class 2 otherwise
result1 <- table(ytest, floor(pred1 + 1.5))
result1

ytest   1   2
    1 176  25
    2  51  48
error1<- sum(result1[1,2], result1[2,1])/sum(result1)
error1
[1] 0.2533333
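The threshold and the error-rate arithmetic can be checked in isolation. A short sketch, with the confusion matrix printed above re-entered by hand; the probs vector is made up for illustration:

```r
# Made-up predicted probabilities of class 2
probs <- c(0.10, 0.49, 0.50, 0.93)

# Same mapping as floor(pred1 + 1.5): below 0.5 -> 1, otherwise -> 2
classes <- floor(probs + 1.5)
print(classes)

# Confusion matrix from the output above (rows: actual, columns: predicted)
conf <- matrix(c(176, 51, 25, 48), nrow = 2,
               dimnames = list(ytest = c("1", "2"), predicted = c("1", "2")))

# Misclassification rate: off-diagonal counts over the total
err <- (conf[1, 2] + conf[2, 1]) / sum(conf)

# Equivalent form: one minus the accuracy
err_alt <- 1 - sum(diag(conf)) / sum(conf)
print(err)
```

Both forms give 76 / 300, which matches the error1 value reported above.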

LASSO

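LASSO is a useful method for fitting a regression model that keeps only selected key variables. Here lars is applied to the class label converted to a number, so it fits a least-squares LASSO rather than a logistic one.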

# Note: fitted on all rows of mat1, not only the training rows
lasso <- lars(x = mat1, y = as.numeric(df1$Credit), trace = TRUE)
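For reference, the least-squares LASSO can be written in its standard constrained form. This is the textbook formulation, not something taken from the slides: y is the response (here the class label coded as a number), X is the design matrix mat1, and t is a bound on the total size of the coefficients.

```latex
\hat{\beta}(t) = \arg\min_{\beta} \; \lVert y - X\beta \rVert_2^2
\quad \text{subject to} \quad
\lVert \beta \rVert_1 = \sum_{j} \lvert \beta_j \rvert \le t
```

As t shrinks toward zero, more coefficients are set exactly to zero, which is what makes the method a variable selector.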

Plot the LASSO variable selection path.

[Figure: LASSO variable selection path]

Predict the outcome with the model generated by LASSO, then use the test dataset to evaluate the model.

result2

ytest   1   2
    1 162  39
    2  34  65
error2
[1] 0.2433333

Classification Tree

Build the classification tree with the rpart function.

m3.tree<- rpart(Credit ~ . , data = df1[train,], method= "class")
printcp(m3.tree)

Classification tree:
rpart(formula = Credit ~ ., data = df1[train, ], method = "class")

Variables actually used in tree construction:
[1] Check_Acc      Credit_history Duration       Employment    
[5] Loan_amount    Purpose        Savings       

Root node error: 201/700 = 0.28714

n= 700 

        CP nsplit rel error  xerror     xstd
1 0.057214      0   1.00000 1.00000 0.059553
2 0.034826      2   0.88557 1.00498 0.059641
3 0.029851      5   0.78109 1.01990 0.059901
4 0.022388      6   0.75124 0.96020 0.058822
5 0.014925      8   0.70647 0.96517 0.058916
6 0.010000     11   0.66169 1.02985 0.060071
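The xerror column is the cross-validated error relative to the root node, so it is the usual guide for choosing a complexity parameter. A sketch of reading the table, with the values above re-entered by hand. The cp_table name is only for illustration; on a fitted model the same numbers are stored in m3.tree$cptable:

```r
# CP table copied from the printcp output above
cp_table <- data.frame(
  CP        = c(0.057214, 0.034826, 0.029851, 0.022388, 0.014925, 0.010000),
  nsplit    = c(0, 2, 5, 6, 8, 11),
  rel_error = c(1.00000, 0.88557, 0.78109, 0.75124, 0.70647, 0.66169),
  xerror    = c(1.00000, 1.00498, 1.01990, 0.96020, 0.96517, 1.02985)
)
root_error <- 201 / 700

# Row with the smallest cross-validated error
best    <- which.min(cp_table$xerror)
best_cp <- cp_table$CP[best]
print(cp_table[best, ])

# Relative errors become absolute rates when multiplied by the root node error
train_error <- root_error * cp_table$rel_error[nrow(cp_table)]
print(round(train_error, 4))
```

With these numbers the smallest cross-validated error is in row 4 (CP = 0.022388, 6 splits). Passing such a value to prune(m3.tree, cp = ...) would be one way to simplify the tree; the slides do not show whether the tree was pruned.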

Plot the classification tree.

[Figure: classification tree]

Use the tree model to predict and evaluate.

result3

ytest   0   1
    1 185  16
    2  58  41
error3
[1] 0.2466667

Random Forest

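Random forest is an ensemble learning method for classification: it grows many decision trees on bootstrap samples of the training data and combines their votes.

Call the random forest classification function. The method = "class" argument belongs to rpart; randomForest ignores it and fits a classification forest because Credit is a factor.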

m4.rf<- randomForest(Credit ~ . , data = df1[train,], method= "class", ntree = 500, mtry = 10)

Result:


Call:
 randomForest(formula = Credit ~ ., data = df1[train, ], method = "class",      ntree = 500, mtry = 10) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 10

        OOB estimate of  error rate: 23%
Confusion matrix:
    1  2 class.error
1 457 42  0.08416834
2 119 82  0.59203980
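The out-of-bag (OOB) figures can be checked directly from the confusion matrix. A sketch with the matrix above re-entered by hand. The oob name is illustrative; on the fitted object the same counts are stored in m4.rf$confusion:

```r
# OOB confusion matrix from the output above (rows: actual, columns: predicted)
oob <- matrix(c(457, 119, 42, 82), nrow = 2,
              dimnames = list(actual = c("1", "2"), predicted = c("1", "2")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)
print(class_error)

# Overall OOB error: all misclassified cases over all training cases
oob_error <- 1 - sum(diag(oob)) / sum(oob)
print(oob_error)
```

Class 2 is misclassified far more often than class 1 (about 59% versus 8%), which the overall 23% rate hides.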

Conclusion

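Error rate of each method:

  • Logistic Regression: 25.33% (test set)
  • LASSO: 24.33% (test set)
  • Classification Tree: 24.67% (test set)
  • Random Forest: 23.00% (out-of-bag estimate on the training set)

The random forest figure is the out-of-bag estimate from the output above rather than a test-set error, so it is not measured on the same data as the other three.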