Prediction methods analysis with the German Credit Data set

Dataset introduction

This dataset has been widely used for machine learning practice. Here is a brief introduction to it:

  • There are 1000 observations in this dataset.

  • The dataset has 20 independent variables; the dependent variable (Creditability) is the evaluation of the client’s current credit status.

  • Values in this dataset have been replaced with codes for privacy reasons.

The data will be split 80% for model calibration (train data) and 20% for model validation (test data).

We use the confusion matrix and the Receiver Operating Characteristic (ROC) curve to measure the accuracy of the models.
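As a minimal sketch of these two measures in base R: accuracy is the share of cases on the diagonal of the confusion matrix, and the area under the ROC curve (AUC) can be computed from ranks as the probability that a positive case is scored above a negative one. The helper names `accuracy` and `auc` and the toy labels and scores below are illustrative assumptions, not part of the original analysis.

```r
# Accuracy from a confusion matrix: correct predictions (diagonal) over all cases.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# AUC via the rank (Mann-Whitney) formulation.
auc <- function(scores, is_positive) {
  r <- rank(scores)                 # average ranks are used for ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy illustration with made-up labels and scores (not the credit data).
actual <- factor(c("YES", "YES", "NO", "YES", "NO", "NO"), levels = c("NO", "YES"))
score  <- c(0.9, 0.8, 0.7, 0.6, 0.3, 0.2)   # hypothetical P(YES)
predicted <- factor(ifelse(score >= 0.5, "YES", "NO"), levels = c("NO", "YES"))

cm_toy <- table(predicted, actual)   # rows: predicted, columns: actual
accuracy(cm_toy)                     # 5 of 6 cases are classified correctly
auc(score, actual == "YES")          # 8 of 9 positive/negative pairs are ordered correctly
```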

Step 1: Retrieve Data & Split into Calibration and Test Sets

library(stringr)
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
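# Load the German Credit data (the path is specific to the author's machine)
# stringsAsFactors = TRUE keeps Creditability a factor on R >= 4.0
data <- read.csv("/home/peopleanalytics/Desktop/german_credit.csv", header = TRUE, sep = ",",
                 stringsAsFactors = TRUE)
# Sample row indexes for the 20% test set
set.seed(12345)
indexes <- sample(1:nrow(data), size = 0.2 * nrow(data))
# Split data: sampled rows form the test set, the rest the train set
test <- data[indexes, ]
train <- data[-indexes, ]
# Response vector, number of columns, and an empty list
# (presumably used to collect each model's confusion matrix)
y <- train$Creditability
nv <- ncol(data)
cm <- list()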

Step 1 (continued): Data Exploration

Top of Data

head(data)
##   Creditability Account.Balance Duration.of.Credit..month.
## 1           YES               1                         18
## 2           YES               1                          9
## 3           YES               2                         12
## 4           YES               1                         12
## 5           YES               1                         12
## 6           YES               1                         10
##   Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1                                 4       2          1049
## 2                                 4       0          2799
## 3                                 2       9           841
## 4                                 4       0          2122
## 5                                 4       0          2171
## 6                                 4       0          2241
##   Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1                    1                            2                   4
## 2                    1                            3                   2
## 3                    2                            4                   2
## 4                    1                            3                   3
## 5                    1                            3                   4
## 6                    1                            2                   1
##   Sex...Marital.Status Guarantors Duration.in.Current.address
## 1                    2          1                           4
## 2                    3          1                           2
## 3                    2          1                           4
## 4                    3          1                           2
## 5                    3          1                           4
## 6                    3          1                           3
##   Most.valuable.available.asset Age..years. Concurrent.Credits
## 1                             2          21                  3
## 2                             1          36                  3
## 3                             1          23                  3
## 4                             1          39                  3
## 5                             2          38                  1
## 6                             1          48                  3
##   Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1                 1                          1          3                1
## 2                 1                          2          3                2
## 3                 1                          1          2                1
## 4                 1                          2          2                2
## 5                 2                          2          2                1
## 6                 1                          2          2                2
##   Telephone Foreign.Worker
## 1         1              1
## 2         1              1
## 3         1              1
## 4         1              2
## 5         1              2
## 6         1              2

Data Structure

str(data)
## 'data.frame':    1000 obs. of  21 variables:
##  $ Creditability                    : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

Summary

summary(data)
##  Creditability Account.Balance Duration.of.Credit..month.
##  NO :300       Min.   :1.000   Min.   : 4.0              
##  YES:700       1st Qu.:1.000   1st Qu.:12.0              
##                Median :2.000   Median :18.0              
##                Mean   :2.577   Mean   :20.9              
##                3rd Qu.:4.000   3rd Qu.:24.0              
##                Max.   :4.000   Max.   :72.0              
##  Payment.Status.of.Previous.Credit    Purpose       Credit.Amount  
##  Min.   :0.000                     Min.   : 0.000   Min.   :  250  
##  1st Qu.:2.000                     1st Qu.: 1.000   1st Qu.: 1366  
##  Median :2.000                     Median : 2.000   Median : 2320  
##  Mean   :2.545                     Mean   : 2.828   Mean   : 3271  
##  3rd Qu.:4.000                     3rd Qu.: 3.000   3rd Qu.: 3972  
##  Max.   :4.000                     Max.   :10.000   Max.   :18424  
##  Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
##  Min.   :1.000        Min.   :1.000                Min.   :1.000      
##  1st Qu.:1.000        1st Qu.:3.000                1st Qu.:2.000      
##  Median :1.000        Median :3.000                Median :3.000      
##  Mean   :2.105        Mean   :3.384                Mean   :2.973      
##  3rd Qu.:3.000        3rd Qu.:5.000                3rd Qu.:4.000      
##  Max.   :5.000        Max.   :5.000                Max.   :4.000      
##  Sex...Marital.Status   Guarantors    Duration.in.Current.address
##  Min.   :1.000        Min.   :1.000   Min.   :1.000              
##  1st Qu.:2.000        1st Qu.:1.000   1st Qu.:2.000              
##  Median :3.000        Median :1.000   Median :3.000              
##  Mean   :2.682        Mean   :1.145   Mean   :2.845              
##  3rd Qu.:3.000        3rd Qu.:1.000   3rd Qu.:4.000              
##  Max.   :4.000        Max.   :3.000   Max.   :4.000              
##  Most.valuable.available.asset  Age..years.    Concurrent.Credits
##  Min.   :1.000                 Min.   :19.00   Min.   :1.000     
##  1st Qu.:1.000                 1st Qu.:27.00   1st Qu.:3.000     
##  Median :2.000                 Median :33.00   Median :3.000     
##  Mean   :2.358                 Mean   :35.54   Mean   :2.675     
##  3rd Qu.:3.000                 3rd Qu.:42.00   3rd Qu.:3.000     
##  Max.   :4.000                 Max.   :75.00   Max.   :3.000     
##  Type.of.apartment No.of.Credits.at.this.Bank   Occupation   
##  Min.   :1.000     Min.   :1.000              Min.   :1.000  
##  1st Qu.:2.000     1st Qu.:1.000              1st Qu.:3.000  
##  Median :2.000     Median :1.000              Median :3.000  
##  Mean   :1.928     Mean   :1.407              Mean   :2.904  
##  3rd Qu.:2.000     3rd Qu.:2.000              3rd Qu.:3.000  
##  Max.   :3.000     Max.   :4.000              Max.   :4.000  
##  No.of.dependents   Telephone     Foreign.Worker 
##  Min.   :1.000    Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000    1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000    Median :1.000   Median :1.000  
##  Mean   :1.155    Mean   :1.404   Mean   :1.037  
##  3rd Qu.:1.000    3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :2.000    Max.   :2.000   Max.   :2.000

Step 2: Model Identification

2.1 Multivariate Adaptive Regression Splines (MARS)

Reference for Multivariate Adaptive Regression Splines please visit :

2.1.1 Confusion Matrix and Accuracy of MARS Model

## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
##      
##        NO YES
##   NO   32  24
##   YES  28 116
## Accuracy 
##     0.74
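The code behind this output is not shown in the report. The package-loading messages (plotmo, plotrix, TeachingDemos) are consistent with the `earth` package, so a minimal sketch under that assumption could look like the following. The helper name `fit_mars` is hypothetical, and the `train` and `test` data frames are assumed to be the ones created in Step 1.

```r
# Hypothetical helper (not from the original report): fit a MARS classifier
# with the earth package and score it on the held-out test set.
fit_mars <- function(train, test) {
  stopifnot(requireNamespace("earth", quietly = TRUE))
  # glm = list(family = binomial) fits a logistic model on the MARS basis functions
  fit <- earth::earth(Creditability ~ ., data = train,
                      glm = list(family = binomial))
  # type = "class" returns predicted class labels
  pred <- predict(fit, newdata = test, type = "class")
  pred <- factor(as.vector(pred), levels = levels(test$Creditability))
  cm <- table(pred, test$Creditability)   # rows: predicted, columns: actual
  list(fit = fit, confusion = cm, accuracy = sum(diag(cm)) / sum(cm))
}
```

Calling `fit_mars(train, test)$confusion` would then give a table laid out like the one above; the exact counts depend on the seed and on the package version.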

2.1.2 ROC (Receiver Operating Characteristic) of MARS

## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:plotrix':
## 
##     plotCI
## The following object is masked from 'package:stats':
## 
##     lowess
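The ROC plot itself did not survive in this text version. The gplots loading message is consistent with the `ROCR` package, so the curve was possibly produced along the lines of the sketch below. This is an assumption; `roc_with_rocr` and its arguments are hypothetical names, and `prob_yes` stands for a model's predicted probability of the YES class on the test set.

```r
# Hypothetical helper: ROC curve and AUC with ROCR for a vector of predicted
# probabilities of the positive class and the matching actual labels.
roc_with_rocr <- function(prob_yes, actual, positive = "YES") {
  stopifnot(requireNamespace("ROCR", quietly = TRUE))
  # Labels are recoded to 0/1 so that the positive class is unambiguous
  pred <- ROCR::prediction(prob_yes, as.integer(actual == positive))
  roc  <- ROCR::performance(pred, measure = "tpr", x.measure = "fpr")
  auc  <- ROCR::performance(pred, measure = "auc")@y.values[[1]]
  list(roc = roc, auc = auc)
}
```

After `library(ROCR)`, `plot(roc_with_rocr(prob_yes, test$Creditability)$roc)` would draw the true-positive rate against the false-positive rate.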

2.2 Algorithm C4.5 (Tree Diagram)

Reference for C4.5 - Tree Diagram please visit :

C4.5 - Tree Diagram

2.2.1 Confusion Matrix and Accuracy

##      
##        NO YES
##   NO   31  31
##   YES  29 109
## Accuracy 
##      0.7

2.2.2 ROC (Receiver Operating Characteristic) of C4.5

2.3 Algorithm PART

Reference for PART please visit :

PART

2.3.1 Confusion Matrix and Accuracy

##      
##        NO YES
##   NO   42  35
##   YES  18 105
## Accuracy 
##    0.735

2.3.2 ROC (Receiver Operating Characteristic) of PART

2.4 Bagging CART

Reference for Bagging CART please visit :

Bagging

2.4.1 Confusion Matrix and Accuracy

##      
##        NO YES
##   NO   42  35
##   YES  18 105
## Accuracy 
##    0.735

2.4.2 ROC (Receiver Operating Characteristic) of Bagging CART

2.5 Random Forest

Reference for Random Forest Algorithm please visit :

Random Forest

2.5.1 Confusion Matrix and Accuracy

## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
##      
##        NO YES
##   NO   27  18
##   YES  33 122
## Accuracy 
##    0.745
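The loading message confirms that the `randomForest` package was used, but the call itself is not shown. A minimal sketch under assumed settings (500 trees, default `mtry`) is given below; `fit_rf` is a hypothetical helper and `train` and `test` are assumed to come from Step 1.

```r
# Hypothetical helper: random forest classifier scored on the test set.
# ntree = 500 and the seed are assumptions, not the report's settings.
fit_rf <- function(train, test, ntree = 500, seed = 12345) {
  stopifnot(requireNamespace("randomForest", quietly = TRUE))
  set.seed(seed)   # the forest is built from random bootstrap samples
  fit <- randomForest::randomForest(Creditability ~ ., data = train, ntree = ntree)
  pred <- predict(fit, newdata = test)                        # class labels
  prob_yes <- predict(fit, newdata = test, type = "prob")[, "YES"]
  cm <- table(pred, test$Creditability)                       # rows: predicted
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm), prob_yes = prob_yes)
}
```

The `prob_yes` element is the share of trees voting YES for each test case, which is the kind of score an ROC curve needs.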

2.5.2 ROC (Receiver Operating Characteristic) of Random Forest

2.6 Gradient Boosted Machine

Reference for Gradient Boosted Machine Algorithm please visit : Gradient Boosted Machine

2.6.1 Confusion Matrix and Accuracy

## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
##      
##        NO YES
##   NO   26  17
##   YES  34 123
## Accuracy 
##    0.745
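The output shows that `gbm` 2.1.3 was loaded; the model call is not shown. One detail worth illustrating is that `gbm` with `distribution = "bernoulli"` expects a numeric 0/1 response rather than a factor. The sketch below makes that recoding explicit. The helper name `fit_gbm`, the tuning values, and the 0.5 cut-off are all assumptions.

```r
# Hypothetical helper: gradient boosted classifier with the gbm package.
fit_gbm <- function(train, test, n_trees = 200) {
  stopifnot(requireNamespace("gbm", quietly = TRUE))
  # Replace the factor response by a 0/1 indicator for the bernoulli loss
  d <- train[, setdiff(names(train), "Creditability"), drop = FALSE]
  d$y01 <- as.integer(train$Creditability == "YES")
  fit <- gbm::gbm(y01 ~ ., data = d, distribution = "bernoulli",
                  n.trees = n_trees, interaction.depth = 2, shrinkage = 0.05)
  # type = "response" returns the predicted probability of YES
  prob_yes <- predict(fit, newdata = test, n.trees = n_trees, type = "response")
  pred <- factor(ifelse(prob_yes >= 0.5, "YES", "NO"), levels = c("NO", "YES"))
  cm <- table(pred, test$Creditability)   # rows: predicted, columns: actual
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm), prob_yes = prob_yes)
}
```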

2.6.2 ROC (Receiver Operating Characteristic) of GBM

2.7 Naive Bayes

Reference for Naive Bayes Algorithm please visit :

Naive Bayes

2.7.1 Confusion Matrix and Accuracy

##      
##        NO YES
##   NO   41  30
##   YES  19 110
## Accuracy 
##    0.755
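The report does not say which naive Bayes implementation produced this table. As a sketch only, assuming the `e1071` package: `naiveBayes` models numeric predictors as Gaussian, so the integer-coded categorical columns would be treated as numbers unless they are converted to factors first. The helper name `fit_nb` and the optional `categorical` argument are hypothetical.

```r
# Hypothetical helper: naive Bayes with e1071 (an assumed package choice).
# `categorical` optionally names coded columns to convert to factors, so that
# they are modelled with frequency tables instead of Gaussian densities.
fit_nb <- function(train, test, categorical = character(0)) {
  stopifnot(requireNamespace("e1071", quietly = TRUE))
  for (v in categorical) {
    lv <- sort(unique(c(train[[v]], test[[v]])))   # shared levels for both sets
    train[[v]] <- factor(train[[v]], levels = lv)
    test[[v]]  <- factor(test[[v]],  levels = lv)
  }
  fit <- e1071::naiveBayes(Creditability ~ ., data = train)
  pred <- predict(fit, newdata = test)                          # class labels
  prob_yes <- predict(fit, newdata = test, type = "raw")[, "YES"]
  cm <- table(pred, test$Creditability)                         # rows: predicted
  list(confusion = cm, accuracy = sum(diag(cm)) / sum(cm), prob_yes = prob_yes)
}
```

For example, `fit_nb(train, test, categorical = c("Purpose", "Account.Balance"))` would treat those two coded columns as categories; whether the original analysis did so is not known.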

2.7.2 ROC (Receiver Operating Characteristic) of Naive Bayes

## [1] 0.805119

3. Performance Comparison

## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## [1] 7

We are now ready to chart and will again compare on faceted pie charts.

## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:randomForest':
## 
##     combine

We conclude the analysis of this dataset by tabulating the results obtained.

Model      Accuracy   Speed       Overall
BAYES      0.755      0.1949105   0.1471575
RF         0.745      0.2840448   0.2116133
GBM        0.745      0.2726653   0.2031356
MARS       0.740      1.0000000   0.7400000
PART       0.735      0.8298587   0.6099461
BAG-CART   0.735      0.2855187   0.2098562
C45        0.700      0.2731981   0.1912387

4. Data Analysis

Comparing accuracy on the German Credit dataset, we observe that Naive Bayes tops the list at 75.5% accuracy, while C4.5 is lowest at 70.0%. The other five models fall between 73.5% and 74.5%.
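The accuracies can be recomputed directly from the seven confusion matrices printed in Step 2. The sketch below copies those counts and assumes that rows are predictions and columns are actual classes, which is consistent with every matrix having column totals of 60 (NO) and 140 (YES). The object names are illustrative.

```r
# Counts copied from sections 2.1.1 to 2.7.1, read row by row:
# (pred NO, actual NO), (pred NO, actual YES), (pred YES, actual NO), (pred YES, actual YES)
counts <- matrix(
  c(32, 24, 28, 116,    # MARS
    31, 31, 29, 109,    # C45
    42, 35, 18, 105,    # PART
    42, 35, 18, 105,    # BAG-CART
    27, 18, 33, 122,    # RF
    26, 17, 34, 123,    # GBM
    41, 30, 19, 110),   # BAYES
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("MARS", "C45", "PART", "BAG-CART", "RF", "GBM", "BAYES"),
    c("NO_NO", "NO_YES", "YES_NO", "YES_YES")
  )
)

# Accuracy: correctly classified cases over the 200 test cases
acc <- (counts[, "NO_NO"] + counts[, "YES_YES"]) / rowSums(counts)

# Share of each actual class that is predicted correctly
recall_no  <- counts[, "NO_NO"]   / (counts[, "NO_NO"]  + counts[, "YES_NO"])
recall_yes <- counts[, "YES_YES"] / (counts[, "NO_YES"] + counts[, "YES_YES"])

round(cbind(accuracy = acc, recall_no, recall_yes), 3)
```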

From these observations, Naive Bayes should definitely be included as a prime model for tackling the German Credit data.

Speed can also be decisive when selecting a model. Microbenchmark data let us compare the average time taken by each of the 7 models over 5 runs. Times are normalized by dividing them by the minimum time recorded, and Speed is the reciprocal of the normalized time, so the fastest model has a Speed of 1.
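The normalization can be sketched in base R as follows. The report used the microbenchmark package; here `system.time` stands in for it, and the helper `mean_elapsed` and the three timings are made-up illustrations rather than measured values.

```r
# Stand-in for a 5-run benchmark: mean elapsed seconds of calling f() `runs` times.
mean_elapsed <- function(f, runs = 5) {
  mean(replicate(runs, system.time(f())[["elapsed"]]))
}

# Hypothetical mean run times in seconds for three models (illustrative only).
mean_time <- c(A = 0.2, B = 0.5, C = 1.0)

normalized_time <- mean_time / min(mean_time)   # fastest model gets 1
speed <- 1 / normalized_time                    # reciprocal: fastest stays at 1

speed
```

With these values the speeds are 1, 0.4, and 0.2, so a model that takes five times as long as the fastest one receives a Speed of 0.2.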

The overall ranking is obtained here simply as the product Accuracy x Speed. The result depends overwhelmingly on Speed: MARS (0.740) and, to a lesser extent, PART (0.610) are the only real contenders for the German Credit dataset, while every other model scores about 0.21 or lower.
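The Overall column can be reproduced from the Accuracy and Speed values in the results table. The sketch below retypes those values into a data frame (the name `results` is illustrative) and sorts by the product.

```r
# Accuracy and Speed copied from the results table in section 3.
results <- data.frame(
  Model    = c("BAYES", "RF", "GBM", "MARS", "PART", "BAG-CART", "C45"),
  Accuracy = c(0.755, 0.745, 0.745, 0.740, 0.735, 0.735, 0.700),
  Speed    = c(0.1949105, 0.2840448, 0.2726653, 1.0000000,
               0.8298587, 0.2855187, 0.2731981)
)

# Overall score: plain product of the two columns
results$Overall <- results$Accuracy * results$Speed

# Rank from best to worst overall score
ranked <- results[order(-results$Overall), ]
ranked
```

Because Speed spans a factor of about five while Accuracy varies by only a few percentage points, the product is driven almost entirely by Speed; a weighted combination would be one way to rebalance the two, though the report does not use one.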

5. Conclusions

We have compared 7 machine learning models and benchmarked their accuracy and speed on the German Credit dataset. Although accuracy was fairly consistent across these models, execution speed, which is often also a factor, varied strongly, so the combined ranking remains strongly dataset dependent.

For data scientists, this means that benchmarking should stay on the front burner rather than the back burner of the activity list, along with continuous monitoring of new, more efficient, distributed and/or parallelized algorithms and their effects on different hardware platforms.

We have evaluated accuracy on this dataset; however, the Speed ranking could reduce our options. We will continue to monitor and benchmark new machine learning tools by applying them to broader datasets.