This dataset has been widely used for machine learning practice. Here is a brief introduction to it:
There are 1000 observations in this dataset.
There are 20 independent variables in the dataset; the dependent variable (Creditability) is the evaluation of the client’s current credit status.
Values in this dataset have been replaced with codes for privacy reasons.
The data is split 80% for model calibration (train data) and 20% for model validation (test data).
We use the confusion matrix and the Receiver Operating Characteristic (ROC) to measure the accuracy of the models or algorithms.
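As a minimal sketch of the two evaluation measures named above, the following base R snippet builds a confusion matrix, computes accuracy, and computes the area under the ROC curve. The labels, scores, the 0.5 cut-off, and the helper name `auc_rank` are illustrative assumptions and are not taken from the German credit data.

```r
# Toy example (hypothetical values, not taken from the German credit data)
actual <- factor(c("YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO"),
                 levels = c("NO", "YES"))
score  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.55, 0.2)  # assumed predicted P(YES)

# Confusion matrix at an assumed 0.5 cut-off: rows = predicted, columns = actual
predicted <- factor(ifelse(score >= 0.5, "YES", "NO"), levels = c("NO", "YES"))
cm_toy <- table(Predicted = predicted, Actual = actual)
accuracy <- sum(diag(cm_toy)) / sum(cm_toy)

# Area under the ROC curve via the rank (Mann-Whitney) formulation
auc_rank <- function(score, actual, positive = "YES") {
  r <- rank(score)                      # average ranks handle ties
  n_pos <- sum(actual == positive)
  n_neg <- length(actual) - n_pos
  (sum(r[actual == positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_toy <- auc_rank(score, actual)

print(cm_toy)
print(c(accuracy = accuracy, auc = auc_toy))
```

Accuracy depends on the chosen cut-off, whereas the area under the ROC curve summarizes performance over all cut-offs, which is why the two measures are usually reported together.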
library(stringr)
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
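# Load the German credit data
data <- read.csv("/home/peopleanalytics/Desktop/german_credit.csv", header = TRUE, sep = ",",
                 stringsAsFactors = TRUE)  # keeps Creditability a factor on R >= 4.0
# Sample the row indexes of the 20% test set
set.seed(12345)
indexes <- sample(1:nrow(data), size = 0.2 * nrow(data))
# Split data
test <- data[indexes, ]
train <- data[-indexes, ]
y <- train$Creditability  # response on the training set
nv <- ncol(data)          # number of columns (response plus predictors)
cm <- list()              # list to collect results (presumably the confusion matrices)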
Top of Data
head(data)
## Creditability Account.Balance Duration.of.Credit..month.
## 1 YES 1 18
## 2 YES 1 9
## 3 YES 2 12
## 4 YES 1 12
## 5 YES 1 12
## 6 YES 1 10
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1 4 2 1049
## 2 4 0 2799
## 3 2 9 841
## 4 4 0 2122
## 5 4 0 2171
## 6 4 0 2241
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1 1 2 4
## 2 1 3 2
## 3 2 4 2
## 4 1 3 3
## 5 1 3 4
## 6 1 2 1
## Sex...Marital.Status Guarantors Duration.in.Current.address
## 1 2 1 4
## 2 3 1 2
## 3 2 1 4
## 4 3 1 2
## 5 3 1 4
## 6 3 1 3
## Most.valuable.available.asset Age..years. Concurrent.Credits
## 1 2 21 3
## 2 1 36 3
## 3 1 23 3
## 4 1 39 3
## 5 2 38 1
## 6 1 48 3
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1 1 1 3 1
## 2 1 2 3 2
## 3 1 1 2 1
## 4 1 2 2 2
## 5 2 2 2 1
## 6 1 2 2 2
## Telephone Foreign.Worker
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 2
## 5 1 2
## 6 1 2
Data Structure
str(data)
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : Factor w/ 2 levels "NO","YES": 2 2 2 2 2 2 2 2 2 2 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
Summary
summary(data)
## Creditability Account.Balance Duration.of.Credit..month.
## NO :300 Min. :1.000 Min. : 4.0
## YES:700 1st Qu.:1.000 1st Qu.:12.0
## Median :2.000 Median :18.0
## Mean :2.577 Mean :20.9
## 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :4.000 Max. :72.0
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex...Marital.Status Guarantors Duration.in.Current.address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most.valuable.available.asset Age..years. Concurrent.Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type.of.apartment No.of.Credits.at.this.Bank Occupation
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :1.000 Median :3.000
## Mean :1.928 Mean :1.407 Mean :2.904
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000 Max. :4.000
## No.of.dependents Telephone Foreign.Worker
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.155 Mean :1.404 Mean :1.037
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000
For a reference on Multivariate Adaptive Regression Splines (MARS), please visit:
## Loading required package: plotmo
## Loading required package: plotrix
## Loading required package: TeachingDemos
##
## NO YES
## NO 32 24
## YES 28 116
## Accuracy
## 0.74
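The reported accuracy can be checked directly from the confusion matrix above. The sketch below re-enters the four counts by hand and assumes that rows are predictions and columns are actual classes (the column totals, 60 and 140, are consistent with a 200-row test set); the object names are illustrative.

```r
# Counts copied from the MARS confusion matrix above
# (assumption: rows = predicted class, columns = actual class)
cm_mars <- matrix(c(32, 28, 24, 116), nrow = 2,
                  dimnames = list(Predicted = c("NO", "YES"),
                                  Actual    = c("NO", "YES")))

# Accuracy = correctly classified cases / all cases
accuracy_mars <- sum(diag(cm_mars)) / sum(cm_mars)

# Per-class rates, treating "YES" as the positive class
sensitivity_mars <- cm_mars["YES", "YES"] / sum(cm_mars[, "YES"])
specificity_mars <- cm_mars["NO", "NO"] / sum(cm_mars[, "NO"])

print(round(c(accuracy = accuracy_mars,
              sensitivity = sensitivity_mars,
              specificity = specificity_mars), 3))
```

Under that layout assumption, the per-class rates show that the model recognizes good credit ("YES") much more reliably than bad credit ("NO"), which overall accuracy alone does not reveal.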
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:plotrix':
##
## plotCI
## The following object is masked from 'package:stats':
##
## lowess
For a reference on the C4.5 tree diagram, please visit:
##
## NO YES
## NO 31 31
## YES 29 109
## Accuracy
## 0.7
##
## NO YES
## NO 42 35
## YES 18 105
## Accuracy
## 0.735
##
## NO YES
## NO 42 35
## YES 18 105
## Accuracy
## 0.735
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
##
## NO YES
## NO 27 18
## YES 33 122
## Accuracy
## 0.745
For a reference on the Gradient Boosted Machine algorithm, please visit: Gradient Boosted Machine
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
##
## NO YES
## NO 26 17
## YES 34 123
## Accuracy
## 0.745
##
## NO YES
## NO 41 30
## YES 19 110
## Accuracy
## 0.755
## [1] 0.805119
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## [1] 7
We are now ready to chart the results and will again compare the models on faceted pie charts.
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:randomForest':
##
## combine
We conclude this second dataset analysis by tabulating the results obtained.
Model | Accuracy | Speed | Overall |
---|---|---|---|
BAYES | 0.755 | 0.1949105 | 0.1471575 |
RF | 0.745 | 0.2840448 | 0.2116133 |
GBM | 0.745 | 0.2726653 | 0.2031356 |
MARS | 0.740 | 1.0000000 | 0.7400000 |
PART | 0.735 | 0.8298587 | 0.6099461 |
BAG-CART | 0.735 | 0.2855187 | 0.2098562 |
C45 | 0.700 | 0.2731981 | 0.1912387 |
Comparing accuracy on the German Credit dataset, we observe that Naive Bayes tops the list at 75.5% accuracy, while every other model except C4.5 (70%) exceeded 72.5% accuracy.
From these observations, we should definitely include Naive Bayes as a prime model to tackle the German credit data.
Speed can also be a determining factor when selecting a model. Microbenchmark data allow us to compare the time used by the 7 models, averaged over 5 runs. Normalized times are obtained by dividing all times by the minimum time recorded. Time is then transformed into Speed by taking the reciprocal of the normalized values, so the fastest model has a Speed of 1.
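The normalization described above can be sketched in base R as follows. The timings are made-up placeholders, since the raw microbenchmark output is not shown here, and the name `mean_time` is illustrative.

```r
# Hypothetical mean run times in milliseconds (placeholders, not measured values)
mean_time <- c(MARS = 120, PART = 150, RF = 420)

# Divide by the minimum time recorded, so the fastest model gets 1
normalized_time <- mean_time / min(mean_time)

# Take the reciprocal, so that a higher value means a faster model
speed <- 1 / normalized_time

print(round(speed, 4))
```

With this scaling, Speed always lies in the interval (0, 1], which puts it on a scale comparable to Accuracy.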
The overall ranking is obtained here simply by forming the product Accuracy x Speed. We observe an overwhelming dependency on Speed, with only MARS and, to a lesser extent, PART as contenders for the German credit dataset.
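As a check on the Overall column, the sketch below recomputes the product from the Accuracy and Speed values in the results table and sorts the models by it. The names `results` and `ranked` are illustrative.

```r
# Accuracy and Speed values copied from the results table
results <- data.frame(
  Model    = c("BAYES", "RF", "GBM", "MARS", "PART", "BAG-CART", "C45"),
  Accuracy = c(0.755, 0.745, 0.745, 0.740, 0.735, 0.735, 0.700),
  Speed    = c(0.1949105, 0.2840448, 0.2726653, 1.0000000,
               0.8298587, 0.2855187, 0.2731981)
)

# Overall score = Accuracy x Speed
results$Overall <- results$Accuracy * results$Speed

# Rank the models from best to worst overall score
ranked <- results[order(-results$Overall), ]
print(ranked, row.names = FALSE)
```

Because Speed spans a much wider range than Accuracy (roughly 0.19 to 1 versus 0.70 to 0.755), the product is dominated by Speed, which explains why the most accurate model ends up last in the overall ranking.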
We have compared 7 Machine Learning models and benchmarked their accuracy and speed on the German Credit dataset. Although the accuracy ranking seemed consistent across these models, execution speed, which is often also a factor, showed strong dependencies, so that the combined ranking remains strongly dataset dependent.
What this means for data scientists is that benchmarking should remain on the front burner, not the back burner, of their list of activities, as should continuous monitoring of new, more efficient, distributed and/or parallelized algorithms and their effects on different hardware platforms.
We have evaluated accuracy on the dataset; however, the Speed ranking could reduce our options. We will continue to monitor and benchmark new Machine Learning tools by applying them to broader datasets.