German Credit Scoring data

1.0 Objective

The German credit score data are downloadable from http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

Goal: Testing out several different supervised learning algorithms over credit data of subjects to find the one that accurately * * * predicts if an individual will be good candidate to loan or will tend to default

Metric: Loss : Asymmetric cost metric is used to calculate Loss.Ever false negative weighs 5 times to a False positive Missclassification Rate: Incorrect Classifications/Total Classifications Area under the curve(ROC): Area covered under a ROC Curve.

More priority of metrics is : Loss> MCR> AUC

Methodology: Random sample a training data set that contains 70% of original data points. Perform exploratory data analysis. Find a best model for Credit Scoring data using logistic regression with AIC and BIC. Draw ROC curve, report the AUC, and present the misclassification rate table of your final model.

Test the out-of-sample performance.

2.0 Initial Set Up

2.1 Package Loading

Required Packages

Tidyverse (dplyr, ggplot2..) - Data Read, Manipulation and visualisation
Plotly - Interactive Visualization
KableExtra - Styling for table (Styling Data Tables within Markdown)
gridExtra- Graphical arrangement
forecast- Time series and forecasting
ggplot2- Graphical representation
stringr- string manipulations
corrplot-making correlogram
knitrr- Dynamic report generation

2.2 Data Loading

We have used a seed of : 13383645

set.seed(13383645)
#Reading Data
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
#Assigning variable names
colnames(german_credit)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")
#Response is in 1,2 - we need to change it to 0,1
german_credit$response = german_credit$response - 1

2.3 Sampling

Objective: In here we are random sampling the data in 2 parts 70% training and 30% testing

#Dividing into training and testing dataset
index <- sample(nrow(german_credit),size = nrow(german_credit)*0.70)
german_credit_train <- german_credit[index,]
german_credit_test <- german_credit[-index,]

2.4 Data Snapshot

kable(head(german_credit_train,10) )  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "300px")

	chk_acct	duration	credit_his	purpose	amount	saving_acct	present_emp	installment_rate	sex	other_debtor	present_resid	property	age	other_install	housing	n_credits	job	n_people	telephone	foreign	response
39	A13	10	A32	A44	1225	A61	A73	2	A93	A101	2	A123	37	A143	A152	1	A173	1	A192	A201	0
799	A14	24	A33	A40	717	A65	A75	4	A94	A101	4	A123	54	A143	A152	2	A173	1	A192	A201	0
579	A12	36	A34	A40	2820	A61	A72	4	A91	A101	4	A123	27	A143	A152	2	A173	1	A191	A201	1
686	A14	60	A32	A40	6527	A65	A73	4	A93	A101	4	A124	34	A143	A153	1	A173	2	A192	A201	0
344	A12	18	A32	A49	4439	A61	A75	1	A93	A102	1	A121	33	A141	A152	1	A174	1	A192	A201	0
360	A11	30	A32	A42	2406	A61	A74	4	A92	A101	4	A121	23	A143	A151	1	A173	1	A191	A201	1
395	A14	9	A34	A42	2406	A61	A71	2	A93	A101	3	A123	31	A143	A152	1	A174	1	A191	A201	0
721	A13	9	A30	A43	1337	A61	A72	4	A93	A101	2	A123	34	A143	A152	2	A174	1	A192	A201	1
689	A14	9	A32	A43	2753	A62	A75	3	A93	A102	4	A123	35	A143	A152	1	A173	1	A192	A201	0
144	A11	18	A32	A42	2462	A61	A73	2	A93	A101	2	A123	22	A143	A152	1	A173	1	A191	A201	1

#head(german_credit,10)

This German Credit Score classifies people as good or bad borrowers based on their attributes. The response variable is 1 for good borrower or loan and 2 refers to bad borrower or loan. For the ease of working on dataset we have changes the response to binary 0,1.

3.0 Exploratory Data Analysis

3.1 Initial Analysis

Dimension

dimCr<-dim(german_credit)

The dimension of data: 1000, 21

Structure

str(german_credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose         : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response        : num  0 1 0 0 1 0 0 0 0 1 ...

Observation : Data is a mixture of Factors and numeric columns

Description

chk_acct: Status of existing checking account
duration: Duration in month
credit_his: Credit history
purpose: Purpose (car,furniture,education)
amount: Credit amount
saving_acct: Savings account/bonds
present_emp: Present employment since
installment_rate: Installment rate in percentage of disposable income
sex: Personal status and sex
other_debtor: Other debtors / guarantors
present_resid:Present residence since
property:Property(real estate,life insurance)
age: Age in years
other_install: Other installment plans(bank,stores,none)
housing: housing(rent,own,free)
n_credit: Number of existing credits at this bank
job: Job
n_people: Number of people being liable to provide maintenance for
telephone: Telephone
foreign: foreign worker
response: yes/no

In order to see how the response variable is affected by other factors, we will regress it on other variables. As it has only 0 and 1 values, we will use binomial regression.

Data Quality Check

Data quality is another very important step in EDA, it’s imperative to have good data quality for a optimum analysis.

metadata<-t(introduce(german_credit))
colnames(metadata)<-"Values"
metadata

##                      Values
## rows                   1000
## columns                  21
## discrete_columns         13
## continuous_columns        8
## all_missing_columns       0
## total_missing_values      0
## complete_rows          1000
## total_observations    21000
## memory_usage         101632

plot_intro(german_credit)

The good news is that this dataset has no missing/ Abnormal values in dataset. The structure looks coherent

3.2 Deep Dive

Analysis based on Visualization

Analysis of Continious Variables

par(mfrow=c(2,4))
boxplot(german_credit$duration, col = "lightblue", xlab="Duration")
boxplot(german_credit$amount, col = "lightblue", xlab="Amount")
boxplot(german_credit$installment_rate, col = "lightblue", xlab="Installment Rate")
boxplot(german_credit$present_resid, col = "lightblue", xlab="Present resid")
boxplot(german_credit$age, col = "lightblue", xlab="Age")
boxplot(german_credit$n_credits, col = "lightblue", xlab="Num Credits")
boxplot(german_credit$n_people, col = "lightblue", xlab="Num People")

#duration
g1<- ggplot(german_credit, aes(x = as.factor(response), y = duration, fill = as.factor(response))) + geom_boxplot() + theme(legend.position = "none")

#amount
g2<-ggplot(german_credit, aes(x = as.factor(response), y = amount, fill = as.factor(response))) + geom_boxplot() +
  theme(legend.position = "none")

#AGE
g4<-ggplot(german_credit, aes(x = as.factor(response), y = age, fill = as.factor(response))) + 
geom_boxplot() + theme(legend.position = "none")




ggarrange(g1, g2,g4,
                    labels = c("duration", "amount","age"),
                    ncol = 3, nrow = 1)

Observation

From the age variable, we see that the median value for bad records is lesser than that of good records, here we can safely assume young people tends to be riskier.
The median value and the range of the duration variable appears to be on the higher side of bad records as compared to good records
For the amount variable, we observe that the median values are pretty close for bad and good borrowers

#Installment Rates

g3<-ggplot(german_credit, aes(factor(installment_rate), ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g4<-ggplot(german_credit, aes(chk_acct, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g5<-ggplot(german_credit, aes(credit_his, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g6<-ggplot(german_credit, aes(purpose, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge")

g7<-ggplot(german_credit, aes(saving_acct, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g8<-ggplot(german_credit, aes(other_debtor, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge")

g9<-ggplot(german_credit, aes(sex, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge")

g10<-ggplot(german_credit, aes(other_install, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g11<-ggplot(german_credit, aes(foreign, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 

g12<-ggplot(german_credit, aes(present_emp, ..count..)) + 
  geom_bar(aes(fill = as.factor(response)), position = "dodge") 


ggarrange(g3, g4,
                    labels = c("Installment Rates", "chk_acct"),
                    ncol = 1, nrow = 2)

ggarrange(g5, g6,
                    labels = c("credit_his", "purpose"),
                    ncol = 1, nrow = 2)

ggarrange(g7, g8,
                    labels = c("saving_acct", "other_debtor"),
                    ncol = 1, nrow = 2)

ggarrange(g9, g10,
                    labels = c("sex", "other_install"),
                    ncol = 1, nrow = 2)

ggarrange(g11,g12,
                    labels = c("foreign","Present_emp"),
                    ncol = 1, nrow = 2)

The installment_rate variable has a significant difference between the good and bad records, we see that bad records have almost the double median value than good ones.
For chk_acct A11 & A12 do not have much difference between good loans & bad loans buT A13 & A14 have considerably large difference.
For credit_his, for categories A32 - A34 we see the number of good credit are greater and for categories A30-A31.
We observe similar trends in other variables: sex, other_debtor, saving_acct, other_install and foreign.

4.0 Model Selection

4.1 Logistic Regression

Lets make a Full Model to start with

Objective: As part of best model we try to find the most suitable model by logistic regression using AIC,BIC criterians.

#full model
g.credit.glm0<- glm(response~., family=binomial, data=german_credit_train)
summary(g.credit.glm0)

## 
## Call:
## glm(formula = response ~ ., family = binomial, data = german_credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2331  -0.7054  -0.3760   0.7091   2.5299  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -5.728e-01  1.297e+00  -0.442 0.658826    
## chk_acctA12       -3.849e-01  2.668e-01  -1.443 0.149100    
## chk_acctA13       -8.735e-01  4.445e-01  -1.965 0.049391 *  
## chk_acctA14       -1.711e+00  2.784e-01  -6.146 7.96e-10 ***
## duration           3.154e-02  1.115e-02   2.829 0.004672 ** 
## credit_hisA31      9.205e-01  7.632e-01   1.206 0.227791    
## credit_hisA32      5.860e-02  5.745e-01   0.102 0.918752    
## credit_hisA33     -3.645e-01  6.095e-01  -0.598 0.549841    
## credit_hisA34     -9.561e-01  5.810e-01  -1.646 0.099832 .  
## purposeA41        -1.407e+00  4.247e-01  -3.313 0.000924 ***
## purposeA410       -2.378e+00  1.216e+00  -1.955 0.050604 .  
## purposeA42        -9.080e-01  3.227e-01  -2.814 0.004893 ** 
## purposeA43        -8.173e-01  2.947e-01  -2.774 0.005541 ** 
## purposeA44        -8.093e-01  8.741e-01  -0.926 0.354502    
## purposeA45        -3.644e-01  6.146e-01  -0.593 0.553207    
## purposeA46        -1.252e-02  5.116e-01  -0.024 0.980477    
## purposeA48        -1.622e+01  6.593e+02  -0.025 0.980378    
## purposeA49        -5.032e-01  3.993e-01  -1.260 0.207586    
## amount             9.281e-05  5.677e-05   1.635 0.102094    
## saving_acctA62    -2.180e-01  3.522e-01  -0.619 0.536042    
## saving_acctA63    -8.445e-02  4.551e-01  -0.186 0.852773    
## saving_acctA64    -6.878e-01  6.248e-01  -1.101 0.270984    
## saving_acctA65    -1.082e+00  3.311e-01  -3.269 0.001081 ** 
## present_empA72    -4.795e-01  5.057e-01  -0.948 0.343017    
## present_empA73    -6.431e-01  4.872e-01  -1.320 0.186844    
## present_empA74    -1.227e+00  5.303e-01  -2.314 0.020668 *  
## present_empA75    -6.104e-01  4.932e-01  -1.238 0.215837    
## installment_rate   3.112e-01  1.072e-01   2.904 0.003689 ** 
## sexA92             2.397e-01  4.854e-01   0.494 0.621382    
## sexA93            -3.204e-01  4.765e-01  -0.672 0.501305    
## sexA94            -2.237e-02  5.504e-01  -0.041 0.967578    
## other_debtorA102   7.311e-02  5.047e-01   0.145 0.884828    
## other_debtorA103  -7.387e-01  4.743e-01  -1.557 0.119376    
## present_resid      4.087e-02  1.074e-01   0.381 0.703474    
## propertyA122       3.985e-01  3.031e-01   1.315 0.188640    
## propertyA123       1.029e-01  2.824e-01   0.365 0.715464    
## propertyA124       4.752e-01  5.106e-01   0.931 0.352045    
## age               -1.630e-02  1.097e-02  -1.486 0.137153    
## other_installA142  6.909e-02  5.409e-01   0.128 0.898367    
## other_installA143 -9.548e-01  2.951e-01  -3.235 0.001215 ** 
## housingA152       -4.889e-01  2.847e-01  -1.717 0.085970 .  
## housingA153       -4.466e-01  5.612e-01  -0.796 0.426152    
## n_credits          5.451e-01  2.358e-01   2.312 0.020774 *  
## jobA172            7.983e-01  7.412e-01   1.077 0.281409    
## jobA173            8.302e-01  7.084e-01   1.172 0.241185    
## jobA174            7.056e-01  7.220e-01   0.977 0.328469    
## n_people           2.768e-01  3.010e-01   0.920 0.357681    
## telephoneA192     -2.917e-01  2.446e-01  -1.192 0.233093    
## foreignA202       -1.553e+00  8.607e-01  -1.804 0.071257 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 856.90  on 699  degrees of freedom
## Residual deviance: 628.28  on 651  degrees of freedom
## AIC: 726.28
## 
## Number of Fisher Scoring iterations: 14

Observation: We observe that amongst the 20 predictors only a few are significant. In order to choose the significant variables we decide to use AIC/BIC approach

4.1.2 Backward AIC

#backward selection
#AIC
g.credit.glm.back <- step(g.credit.glm0) # backward selection (if you don't specify anything)

## Start:  AIC=726.28
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + job + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - job               3   629.70 721.70
## - property          3   630.59 722.59
## - present_resid     1   628.42 724.42
## - other_debtor      2   630.93 724.93
## - n_people          1   629.12 725.12
## - sex               3   633.23 725.23
## - housing           2   631.26 725.26
## - present_emp       4   635.50 725.50
## - telephone         1   629.71 725.71
## <none>                  628.28 726.28
## - age               1   630.52 726.52
## - amount            1   630.94 726.94
## - foreign           1   632.43 728.43
## - n_credits         1   633.66 729.66
## - saving_acct       4   640.92 730.92
## - duration          1   636.40 732.40
## - installment_rate  1   636.97 732.97
## - credit_his        4   643.45 733.45
## - purpose           9   654.76 734.76
## - other_install     2   641.84 735.84
## - chk_acct          3   674.39 766.39
## 
## Step:  AIC=721.7
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - property          3   632.04 718.04
## - present_emp       4   635.74 719.74
## - present_resid     1   629.96 719.96
## - other_debtor      2   632.14 720.14
## - housing           2   632.40 720.40
## - n_people          1   630.54 720.54
## - sex               3   634.59 720.59
## - telephone         1   631.31 721.31
## <none>                  629.70 721.70
## - age               1   632.17 722.17
## - amount            1   632.39 722.39
## - foreign           1   633.90 723.90
## - n_credits         1   634.75 724.75
## - saving_acct       4   643.12 727.12
## - credit_his        4   644.25 728.25
## - duration          1   638.34 728.34
## - installment_rate  1   639.09 729.09
## - purpose           9   655.91 729.91
## - other_install     2   644.00 732.00
## - chk_acct          3   675.18 761.18
## 
## Step:  AIC=718.04
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + age + other_install + housing + n_credits + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - present_resid     1   632.30 716.30
## - other_debtor      2   634.54 716.54
## - present_emp       4   638.55 716.55
## - n_people          1   632.83 716.83
## - housing           2   634.87 716.87
## - sex               3   637.02 717.02
## - telephone         1   633.37 717.37
## <none>                  632.04 718.04
## - age               1   634.36 718.36
## - amount            1   634.93 718.93
## - foreign           1   636.03 720.03
## - n_credits         1   636.75 720.75
## - saving_acct       4   645.15 723.15
## - credit_his        4   646.22 724.22
## - duration          1   641.04 725.04
## - installment_rate  1   641.68 725.68
## - purpose           9   658.65 726.65
## - other_install     2   646.82 728.82
## - chk_acct          3   678.62 758.62
## 
## Step:  AIC=716.3
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     age + other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - present_emp       4   638.66 714.66
## - other_debtor      2   634.90 714.90
## - n_people          1   633.15 715.15
## - sex               3   637.34 715.34
## - telephone         1   633.60 715.60
## - housing           2   636.04 716.04
## <none>                  632.30 716.30
## - age               1   634.44 716.44
## - amount            1   635.12 717.12
## - foreign           1   636.32 718.32
## - n_credits         1   637.07 719.07
## - saving_acct       4   645.37 721.37
## - credit_his        4   646.43 722.43
## - duration          1   641.55 723.55
## - installment_rate  1   641.95 723.95
## - purpose           9   658.88 724.88
## - other_install     2   646.87 726.87
## - chk_acct          3   679.70 757.70
## 
## Step:  AIC=714.66
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - n_people          1   639.50 713.50
## - telephone         1   639.94 713.94
## - other_debtor      2   642.10 714.10
## - age               1   640.48 714.48
## <none>                  638.66 714.66
## - housing           2   642.69 714.69
## - sex               3   644.80 714.80
## - amount            1   641.60 715.60
## - foreign           1   642.58 716.58
## - n_credits         1   642.99 716.99
## - credit_his        4   652.54 720.54
## - duration          1   646.91 720.91
## - saving_acct       4   652.99 720.99
## - purpose           9   664.03 722.03
## - installment_rate  1   648.87 722.87
## - other_install     2   654.12 726.12
## - chk_acct          3   688.29 758.29
## 
## Step:  AIC=713.5
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + housing + n_credits + telephone + foreign
## 
##                    Df Deviance    AIC
## - other_debtor      2   642.76 712.76
## - telephone         1   640.79 712.79
## - sex               3   644.94 712.94
## - age               1   641.19 713.19
## <none>                  639.50 713.50
## - housing           2   643.67 713.67
## - amount            1   642.33 714.33
## - foreign           1   643.50 715.50
## - n_credits         1   644.17 716.17
## - saving_acct       4   653.39 719.39
## - duration          1   647.65 719.65
## - credit_his        4   653.75 719.75
## - installment_rate  1   649.38 721.38
## - purpose           9   665.95 721.95
## - other_install     2   654.88 724.88
## - chk_acct          3   689.31 757.31
## 
## Step:  AIC=712.76
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + age + other_install + 
##     housing + n_credits + telephone + foreign
## 
##                    Df Deviance    AIC
## - telephone         1   643.97 711.97
## - sex               3   648.46 712.46
## - age               1   644.63 712.63
## <none>                  642.76 712.76
## - housing           2   647.12 713.12
## - amount            1   645.91 713.91
## - n_credits         1   647.56 715.56
## - foreign           1   647.97 715.97
## - saving_acct       4   656.05 718.05
## - duration          1   650.74 718.74
## - credit_his        4   656.75 718.75
## - installment_rate  1   652.94 720.94
## - purpose           9   670.63 722.63
## - other_install     2   657.77 723.77
## - chk_acct          3   690.81 754.81
## 
## Step:  AIC=711.97
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + age + other_install + 
##     housing + n_credits + foreign
## 
##                    Df Deviance    AIC
## - sex               3   649.61 711.61
## <none>                  643.97 711.97
## - housing           2   648.29 712.29
## - age               1   646.29 712.29
## - amount            1   646.37 712.37
## - n_credits         1   648.49 714.49
## - foreign           1   648.82 714.82
## - saving_acct       4   657.36 717.36
## - credit_his        4   658.09 718.09
## - duration          1   652.65 718.65
## - installment_rate  1   653.42 719.42
## - purpose           9   672.01 722.01
## - other_install     2   658.90 722.90
## - chk_acct          3   693.22 755.22
## 
## Step:  AIC=711.61
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + age + other_install + housing + 
##     n_credits + foreign
## 
##                    Df Deviance    AIC
## - amount            1   651.40 711.40
## <none>                  649.61 711.61
## - age               1   652.53 712.53
## - housing           2   655.51 713.51
## - n_credits         1   653.81 713.81
## - foreign           1   655.25 715.25
## - installment_rate  1   657.13 717.13
## - saving_acct       4   663.51 717.51
## - duration          1   658.50 718.50
## - credit_his        4   664.93 718.93
## - other_install     2   663.65 721.65
## - purpose           9   679.00 723.00
## - chk_acct          3   698.56 754.56
## 
## Step:  AIC=711.4
## response ~ chk_acct + duration + credit_his + purpose + saving_acct + 
##     installment_rate + age + other_install + housing + n_credits + 
##     foreign
## 
##                    Df Deviance    AIC
## <none>                  651.40 711.40
## - age               1   654.04 712.04
## - housing           2   657.43 713.43
## - n_credits         1   655.82 713.82
## - foreign           1   657.16 715.16
## - installment_rate  1   657.22 715.22
## - saving_acct       4   664.64 716.64
## - credit_his        4   667.45 719.45
## - other_install     2   665.47 721.47
## - purpose           9   679.78 721.78
## - duration          1   673.27 731.27
## - chk_acct          3   699.45 753.45

Selected model:

Step: AIC=666.84 response ~ chk_acct + duration + credit_his + purpose + saving_acct + installment_rate + other_debtor + other_install + housing + telephone + foreign

summary(g.credit.glm.back)

## 
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
##     saving_acct + installment_rate + age + other_install + housing + 
##     n_credits + foreign, family = binomial, data = german_credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0395  -0.7362  -0.4081   0.7500   2.6596  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.605744   0.874879   0.692 0.488702    
## chk_acctA12        -0.372430   0.253976  -1.466 0.142539    
## chk_acctA13        -0.903270   0.431385  -2.094 0.036270 *  
## chk_acctA14        -1.673746   0.267017  -6.268 3.65e-10 ***
## duration            0.039346   0.008555   4.599 4.25e-06 ***
## credit_hisA31       0.671842   0.728624   0.922 0.356492    
## credit_hisA32      -0.116590   0.550797  -0.212 0.832360    
## credit_hisA33      -0.462586   0.584424  -0.792 0.428638    
## credit_hisA34      -1.123011   0.552559  -2.032 0.042115 *  
## purposeA41         -1.312801   0.393048  -3.340 0.000838 ***
## purposeA410        -2.469875   1.161103  -2.127 0.033405 *  
## purposeA42         -0.785402   0.306218  -2.565 0.010322 *  
## purposeA43         -0.977602   0.282374  -3.462 0.000536 ***
## purposeA44         -0.676233   0.812261  -0.833 0.405109    
## purposeA45         -0.453100   0.595793  -0.760 0.446956    
## purposeA46          0.020329   0.491951   0.041 0.967038    
## purposeA48        -15.994174 640.729808  -0.025 0.980085    
## purposeA49         -0.610618   0.379777  -1.608 0.107872    
## saving_acctA62     -0.242863   0.334731  -0.726 0.468117    
## saving_acctA63     -0.159479   0.431181  -0.370 0.711483    
## saving_acctA64     -0.604616   0.583488  -1.036 0.300104    
## saving_acctA65     -1.074838   0.317708  -3.383 0.000717 ***
## installment_rate    0.220548   0.092396   2.387 0.016987 *  
## age                -0.015437   0.009564  -1.614 0.106508    
## other_installA142   0.175658   0.508142   0.346 0.729578    
## other_installA143  -0.900106   0.283061  -3.180 0.001473 ** 
## housingA152        -0.583248   0.257559  -2.265 0.023542 *  
## housingA153        -0.136733   0.393178  -0.348 0.728019    
## n_credits           0.474657   0.225390   2.106 0.035210 *  
## foreignA202        -1.729716   0.838006  -2.064 0.039010 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 856.9  on 699  degrees of freedom
## Residual deviance: 651.4  on 670  degrees of freedom
## AIC: 711.4
## 
## Number of Fisher Scoring iterations: 14

#deviance
#AIC
#BIC
(aicmodel<-glance(g.credit.glm.back)[,c(4,5)])

## # A tibble: 1 x 2
##     AIC   BIC
##   <dbl> <dbl>
## 1  711.  848.

4.1.3 Backward BIC

g.credit.glm.back.BIC <- step(g.credit.glm0, k=log(nrow(german_credit_train)))

## Start:  AIC=949.28
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + job + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - purpose           9   654.76 916.81
## - present_emp       4   635.50 930.30
## - job               3   629.70 931.05
## - property          3   630.59 931.94
## - sex               3   633.23 934.58
## - saving_acct       4   640.92 935.72
## - credit_his        4   643.45 938.25
## - other_debtor      2   630.93 938.83
## - housing           2   631.26 939.16
## - present_resid     1   628.42 942.87
## - n_people          1   629.12 943.57
## - telephone         1   629.71 944.16
## - age               1   630.52 944.97
## - amount            1   630.94 945.39
## - foreign           1   632.43 946.89
## - n_credits         1   633.66 948.11
## <none>                  628.28 949.28
## - other_install     2   641.84 949.74
## - duration          1   636.40 950.85
## - installment_rate  1   636.97 951.42
## - chk_acct          3   674.39 975.74
## 
## Step:  AIC=916.81
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     present_emp + installment_rate + sex + other_debtor + present_resid + 
##     property + age + other_install + housing + n_credits + job + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - present_emp       4   660.47 896.31
## - job               3   655.91 898.30
## - property          3   657.55 899.94
## - sex               3   660.76 903.15
## - saving_acct       4   667.49 903.33
## - credit_his        4   667.78 903.62
## - housing           2   657.87 906.81
## - other_debtor      2   658.67 907.61
## - present_resid     1   654.90 910.39
## - age               1   655.89 911.38
## - amount            1   656.21 911.70
## - telephone         1   656.27 911.76
## - n_people          1   656.63 912.12
## - foreign           1   658.09 913.58
## - n_credits         1   660.04 915.53
## <none>                  654.76 916.81
## - other_install     2   668.30 917.24
## - installment_rate  1   662.72 918.22
## - duration          1   663.07 918.56
## - chk_acct          3   703.39 945.78
## 
## Step:  AIC=896.31
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     installment_rate + sex + other_debtor + present_resid + property + 
##     age + other_install + housing + n_credits + job + n_people + 
##     telephone + foreign
## 
##                    Df Deviance    AIC
## - job               3   660.69 876.87
## - property          3   663.74 879.93
## - credit_his        4   673.14 882.78
## - sex               3   667.85 884.04
## - saving_acct       4   674.72 884.35
## - housing           2   663.36 886.10
## - other_debtor      2   665.10 887.84
## - present_resid     1   660.53 889.82
## - age               1   661.58 890.86
## - amount            1   661.91 891.19
## - telephone         1   662.07 891.35
## - n_people          1   662.37 891.66
## - foreign           1   663.63 892.92
## - n_credits         1   665.23 894.52
## <none>                  660.47 896.31
## - duration          1   668.17 897.46
## - other_install     2   674.85 897.59
## - installment_rate  1   668.79 898.08
## - chk_acct          3   711.50 927.68
## 
## Step:  AIC=876.87
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     installment_rate + sex + other_debtor + present_resid + property + 
##     age + other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - property          3   663.96 860.50
## - credit_his        4   673.20 863.18
## - sex               3   668.10 864.63
## - saving_acct       4   675.17 865.15
## - housing           2   663.48 866.56
## - other_debtor      2   665.19 868.28
## - present_resid     1   660.77 870.41
## - age               1   661.94 871.58
## - amount            1   662.19 871.83
## - telephone         1   662.51 872.15
## - n_people          1   662.59 872.22
## - foreign           1   663.87 873.50
## - n_credits         1   665.35 874.98
## <none>                  660.69 876.87
## - other_install     2   675.22 878.30
## - duration          1   668.77 878.40
## - installment_rate  1   669.36 878.99
## - chk_acct          3   711.58 908.11
## 
## Step:  AIC=860.5
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     installment_rate + sex + other_debtor + present_resid + age + 
##     other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - credit_his        4   675.84 846.17
## - saving_acct       4   678.02 848.35
## - sex               3   671.80 848.68
## - housing           2   667.43 850.86
## - other_debtor      2   668.91 852.34
## - present_resid     1   664.03 854.01
## - age               1   664.99 854.97
## - telephone         1   665.45 855.43
## - amount            1   665.61 855.59
## - n_people          1   665.87 855.85
## - foreign           1   666.93 856.91
## - n_credits         1   668.07 858.05
## <none>                  663.96 860.50
## - duration          1   672.16 862.14
## - other_install     2   679.05 862.48
## - installment_rate  1   672.87 862.85
## - chk_acct          3   717.11 893.99
## 
## Step:  AIC=846.17
## response ~ chk_acct + duration + amount + saving_acct + installment_rate + 
##     sex + other_debtor + present_resid + age + other_install + 
##     housing + n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - saving_acct       4   689.55 833.67
## - sex               3   685.30 835.98
## - other_debtor      2   680.34 837.56
## - housing           2   680.50 837.72
## - present_resid     1   675.85 839.63
## - n_credits         1   676.24 840.02
## - age               1   677.19 840.97
## - telephone         1   677.77 841.54
## - n_people          1   678.24 842.02
## - amount            1   678.31 842.09
## - foreign           1   679.45 843.23
## <none>                  675.84 846.17
## - duration          1   683.86 847.64
## - installment_rate  1   685.02 848.79
## - other_install     2   697.98 855.20
## - chk_acct          3   738.13 888.81
## 
## Step:  AIC=833.67
## response ~ chk_acct + duration + amount + installment_rate + 
##     sex + other_debtor + present_resid + age + other_install + 
##     housing + n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - sex               3   699.11 823.58
## - other_debtor      2   693.19 824.21
## - housing           2   694.22 825.24
## - present_resid     1   689.55 827.12
## - n_credits         1   690.19 827.77
## - n_people          1   691.05 828.63
## - age               1   691.06 828.63
## - amount            1   691.42 828.99
## - telephone         1   691.58 829.15
## - foreign           1   694.52 832.09
## <none>                  689.55 833.67
## - duration          1   696.91 834.49
## - installment_rate  1   697.74 835.32
## - other_install     2   710.29 841.32
## - chk_acct          3   768.99 893.46
## 
## Step:  AIC=823.58
## response ~ chk_acct + duration + amount + installment_rate + 
##     other_debtor + present_resid + age + other_install + housing + 
##     n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - other_debtor      2   702.93 814.30
## - present_resid     1   699.13 817.05
## - n_credits         1   699.37 817.29
## - n_people          1   699.55 817.47
## - housing           2   706.33 817.70
## - amount            1   700.23 818.15
## - age               1   701.08 819.00
## - telephone         1   701.20 819.12
## - installment_rate  1   704.63 822.55
## - foreign           1   704.67 822.58
## <none>                  699.11 823.58
## - duration          1   706.26 824.18
## - other_install     2   718.73 830.10
## - chk_acct          3   778.39 883.21
## 
## Step:  AIC=814.3
## response ~ chk_acct + duration + amount + installment_rate + 
##     present_resid + age + other_install + housing + n_credits + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - present_resid     1   702.94 807.76
## - n_credits         1   703.24 808.05
## - n_people          1   703.28 808.09
## - housing           2   710.26 808.53
## - amount            1   704.27 809.09
## - telephone         1   704.93 809.74
## - age               1   705.14 809.96
## - installment_rate  1   708.50 813.32
## <none>                  702.93 814.30
## - duration          1   709.82 814.64
## - foreign           1   709.91 814.72
## - other_install     2   721.67 819.93
## - chk_acct          3   779.65 871.36
## 
## Step:  AIC=807.76
## response ~ chk_acct + duration + amount + installment_rate + 
##     age + other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - n_credits         1   703.24 801.50
## - n_people          1   703.28 801.54
## - amount            1   704.28 802.55
## - housing           2   711.03 802.74
## - telephone         1   704.93 803.20
## - age               1   705.33 803.60
## - installment_rate  1   708.50 806.77
## <none>                  702.94 807.76
## - duration          1   709.83 808.09
## - foreign           1   709.91 808.17
## - other_install     2   721.77 813.48
## - chk_acct          3   779.73 864.90
## 
## Step:  AIC=801.5
## response ~ chk_acct + duration + amount + installment_rate + 
##     age + other_install + housing + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - n_people          1   703.63 795.35
## - amount            1   704.63 796.34
## - housing           2   711.21 796.38
## - telephone         1   705.14 796.85
## - age               1   705.49 797.21
## - installment_rate  1   708.81 800.52
## <none>                  703.24 801.50
## - duration          1   710.07 801.78
## - foreign           1   710.14 801.86
## - other_install     2   722.04 807.21
## - chk_acct          3   779.80 858.41
## 
## Step:  AIC=795.35
## response ~ chk_acct + duration + amount + installment_rate + 
##     age + other_install + housing + telephone + foreign
## 
##                    Df Deviance    AIC
## - amount            1   705.03 790.19
## - housing           2   711.66 790.27
## - telephone         1   705.53 790.70
## - age               1   705.74 790.90
## - installment_rate  1   709.14 794.31
## <none>                  703.63 795.35
## - duration          1   710.41 795.58
## - foreign           1   710.54 795.70
## - other_install     2   722.70 801.31
## - chk_acct          3   780.15 852.21
## 
## Step:  AIC=790.19
## response ~ chk_acct + duration + installment_rate + age + other_install + 
##     housing + telephone + foreign
## 
##                    Df Deviance    AIC
## - telephone         1   706.25 784.86
## - age               1   706.93 785.55
## - housing           2   713.65 785.71
## - installment_rate  1   709.18 787.79
## <none>                  705.03 790.19
## - foreign           1   711.84 790.45
## - other_install     2   724.39 796.45
## - duration          1   723.41 802.02
## - chk_acct          3   780.70 846.21
## 
## Step:  AIC=784.86
## response ~ chk_acct + duration + installment_rate + age + other_install + 
##     housing + foreign
## 
##                    Df Deviance    AIC
## - housing           2   714.62 780.13
## - age               1   708.69 780.75
## - installment_rate  1   710.47 782.53
## - foreign           1   712.68 784.74
## <none>                  706.25 784.86
## - other_install     2   725.39 790.90
## - duration          1   723.85 795.91
## - chk_acct          3   785.27 844.23
## 
## Step:  AIC=780.13
## response ~ chk_acct + duration + installment_rate + age + other_install + 
##     foreign
## 
##                    Df Deviance    AIC
## - age               1   717.95 776.91
## - installment_rate  1   718.04 777.00
## <none>                  714.62 780.13
## - foreign           1   721.34 780.30
## - other_install     2   732.57 784.98
## - duration          1   735.30 794.26
## - chk_acct          3   800.23 846.09
## 
## Step:  AIC=776.91
## response ~ chk_acct + duration + installment_rate + other_install + 
##     foreign
## 
##                    Df Deviance    AIC
## - installment_rate  1   721.02 773.43
## - foreign           1   724.20 776.61
## <none>                  717.95 776.91
## - other_install     2   735.06 780.92
## - duration          1   738.52 790.92
## - chk_acct          3   804.27 843.58
## 
## Step:  AIC=773.43
## response ~ chk_acct + duration + other_install + foreign
## 
##                 Df Deviance    AIC
## <none>               721.02 773.43
## - foreign        1   727.86 773.72
## - other_install  2   738.45 777.76
## - duration       1   741.61 787.47
## - chk_acct       3   806.60 839.36

summary(g.credit.glm.back.BIC)

## 
## Call:
## glm(formula = response ~ chk_acct + duration + other_install + 
##     foreign, family = binomial, data = german_credit_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6638  -0.8233  -0.4746   0.9475   2.4098  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.118756   0.304258   0.390 0.696305    
## chk_acctA12       -0.606083   0.220820  -2.745 0.006057 ** 
## chk_acctA13       -1.122102   0.401245  -2.797 0.005165 ** 
## chk_acctA14       -2.028836   0.241103  -8.415  < 2e-16 ***
## duration           0.032562   0.007252   4.490 7.12e-06 ***
## other_installA142  0.075650   0.467622   0.162 0.871482    
## other_installA143 -0.917145   0.249165  -3.681 0.000232 ***
## foreignA202       -1.735661   0.787882  -2.203 0.027599 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 856.90  on 699  degrees of freedom
## Residual deviance: 721.02  on 692  degrees of freedom
## AIC: 737.02
## 
## Number of Fisher Scoring iterations: 5

Selected model by BIC:

Step: AIC=757.28 response ~ chk_acct + duration

bicmodel<-glance(g.credit.glm.back.BIC)[,c(4,5)]

df<-data.frame(rbind(aicmodel,bicmodel))
row.names(df)<-c("AICmodel","BICmodel")
df

##               AIC      BIC
## AICmodel 711.3967 847.9291
## BICmodel 737.0234 773.4320

Here we observe that BIC being more parsimonious in model selection only selects 2 predictors as significant where as AIC working towards more prediction ability selects 11 predictors.

In this scenario because we desire more prediction ability we select the model selected the AIC criterian.

4.1.4 Model Analysis:

Below is the in sample ROC curve: ROC Curve

In-sample : Training data

pred.gc.train<- predict(g.credit.glm.back, type="response")

pred0 <- prediction(pred.gc.train, german_credit_train$response)
perf0 <- performance(pred0, "tpr", "fpr")

plot(perf0, colorize=TRUE)

#Get the AUC
AUC0<-unlist(slot(performance(pred0, "auc"), "y.values"))
#0.8496752

Observation: The roc curve looks good and we also observe 0.8162998 as AUC value. As per Rule of thumb AUC value > .70 can be considered satisfactory.ROC curve signifies overall measure of goodness of classification, hence a higher value signifies that model has good classification ability.

But this measure is calculated on training sample, which is not a good data to make a decision. Let’s check the same for out of sample..

Confusion Matrix

Considering the data has asymmetric distribution , we choose cutoff probability as 1/6. The cost function will also change accordingly..

The misclassification rate table is as follows:

pcut1=.16

# get binary prediction
class.gc.train<- (pred.gc.train>pcut1)*1
# get confusion matrix
table(german_credit_train$response, class.gc.train, dnn = c("True", "Predicted"))

##     Predicted
## True   0   1
##    0 238 251
##    1  22 189

Intepretation

False positive: 218 False Negative: 20

Misclassification = False positive + False Negative We observe that:

There are 20 candidates that were actually defaulters but were predicted as good borrowers these are called as False negative classification.

There are 218 candidates were actually good borrowers but were predicted as bad borrowers these are called as False positive classification.

If we wish to penalize False negatives more then the weights in cost function will change…

#cost function define
costfunc = function(obs, pred.p){
  pcut=.16
  weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
  weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
  c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
  c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
  cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
  return(cost) # you have to return to a value when you write R functions
} # end of the function

# (equal-weighted) misclassification rate
MR<- mean(german_credit_train$response!=class.gc.train)
#asymettric cose
cost1<-costfunc(class.gc.train,german_credit_train$response)
# False positive rate
FPR<- sum(german_credit_train$response==0 & class.gc.train==1)/sum(german_credit_train$response==0)
# False negative rate (exercise)
FNR<- sum(german_credit_train$response==1 & class.gc.train==0)/sum(german_credit_train$response==1)

trn.val<-data.frame(cbind(MR,cost1,FPR,FNR,AUC0))
colnames(trn.val)<-c("MissClassRt","AsyCost","FPR","FNR","AUC")
rownames(trn.val)<-c("Training Sample")
trn.val

##                 MissClassRt  AsyCost       FPR       FNR       AUC
## Training Sample        0.39 1.824286 0.5132924 0.1042654 0.8162998

Intepretation

MissClassRt: This signifies that 34% of candidates were missclassified.
AsyCost: Asymmetric cost based on asymmetric cost function, we considered a cut-off probability 1/6 (equivalent to 5:1 asymmetric cost).
FPR: False positive rate signifies that out of total 44% were classified as False positive
FNR: False negative rate signifies that out of total 9.4% were classified as False negative
AUC: Area under the curve signifies overall strength of classification.

Out-of-sample : Testing data ROC Curve

#Part 02 Out of sample prediction
# apply fitted model to test sample (predicted probabilities)
predTst <- predict(g.credit.glm.back,german_credit_test, type="response")

pred2 <- prediction(predTst, german_credit_test$response)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)

#Get the AUC
AUC2<-unlist(slot(performance(pred2, "auc"), "y.values"))
#0.7173563

Intepretation Observation: The roc curve looks good and we also observe 0.769583 as AUC value. As per Rule of thumb AUC value > .70 can be considered satisfactory.ROC curve signifies overall measure of goodness of classification, hence a higher value signifies that model has good classification ability.

Confusion Matrix

Considering the data has asymmetric distribution , we choose cutoff probability as 1/6. The cost function will also change accordingly..

The misclassification rate table is as follows:

# step 1. get binary classification
class.predTst<- (predTst>pcut1)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(german_credit_test$response, class.predTst, dnn = c("True", "Predicted"))

##     Predicted
## True   0   1
##    0  99 112
##    1  10  79

Intepretation

False positive: 107 False Negative: 20

Misclassification = False positive + False Negative We observe that:

There are 20 candidates that were actually defaulters but were predicted as good borrowers these are called as False negative classification.

There are 107 candidates were actually good borrowers but were predicted as bad borrowers these are called as False Negative.

If we wish to penalize False negatives more then the weights in cost function will change…

# (equal-weighted) misclassification rate
MR2<- mean(german_credit_test$response!=class.predTst)
#asym cost
cost2<-costfunc(class.predTst,german_credit_test$response)
# False positive rate
FPR2<- sum(german_credit_test$response==0 & class.predTst==1)/sum(german_credit_test$response==0)
# False negative rate (exercise)
FNR2<- sum(german_credit_test$response==1 & class.predTst==0)/sum(german_credit_test$response==1)

test.val<-c(MR2,cost2,FPR2,FNR2,AUC2)
cols<-c("MR","COST","FPR","FNR","AUC")
rows<-c("Training set","Testing set")

report<-rbind(trn.val,test.val)
colnames(report)<-cols
rownames(report)<-rows

costfunc = function(obs, pred.p){
    weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
    weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
    pcut = .16 
    c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
    c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
    cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
    return(cost) # you have to return to a value when you write R functions
} # end of the function
credit.glm.cv<- glm(response~. , family=binomial, data=german_credit) 
cv.result = cv.glm(data=german_credit, glmfit=credit.glm.cv, cost=costfunc,K=10)
cv.result$delta[2]

## [1] 0.5324

The cross validated error =.5345 Table of Comparision

kable(report)  %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")

	MR	COST	FPR	FNR	AUC
Training set	0.3900000	1.824286	0.5132924	0.1042654	0.8162998
Testing set	0.4066667	1.900000	0.5308057	0.1123596	0.7695830

#report

Intepretation

MissClassRt: This signifies the candidates were missclassified, we observe that this has increased on testing sample.
AsyCost: Asymmetric cost based on asymmetric cost function, we considered a cut-off probability 1/6 (equivalent to 5:1 asymmetric cost).we observe that this has increased on testing sample.
FPR: False positive rate signifies that out of total XX% were classified as False positive
FNR: False negative rate signifies that out of total YY% were classified as False Negative
AUC: Area under the curve signifies overall strength of classification.

Our model is not as good to predict the testing values as the training values. FPR and FNR have increased and Area Under the curve has decreased from 85% to 72%. However, 72% is acceptable score and we can conclude that our model is working fairly well.

4.2 Cart

CART stands for classification and regression tree.

The distinctiove feature with this algo is that we generally have a asymmetric cost function. In the credit scoring case it means that false negatives (predicting 0 when truth is 1, or giving out loans that end up in default) will cost more than false positives (predicting 1 when truth is 0, rejecting loans that you should not reject).

Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.

credit.rpart0 <- rpart(formula = response ~ ., data = german_credit_train, method = "class")

credit.rpart <- rpart(formula = response ~ . , data = german_credit_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))

Symmetric Cost

pred0<- predict(credit.rpart0, type="class")
table(german_credit_train$response, pred0, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 446  43
##    1  96 115

False negatives are heavy costing so lets, add more penality of 5:1

Asymmetric Cost insample

pred.in<- predict(credit.rpart, type="class")
table(german_credit_train$response, pred.in, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 259 230
##    1   2 209

As expected our False negatives have reduced.Hence, we select credit.rpart as our model for Binary classification Tree.

Missclassification rate & LOSS Value

cost <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}
#LOSS
(insamplecost.df<-cost(german_credit_train$response,pred.in))

## [1] 0.3428571

#Misclassification rate
(insampleMCR.df<-mean(german_credit_train$response!=pred.in))

## [1] 0.3314286

#roc
credit.train.pred.in.prob.rpart<- predict(credit.rpart, german_credit_train, type="prob")[,2]
pred <- prediction(credit.train.pred.in.prob.rpart, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7986218

LOSS:.342 MCR:.331 AUC:.7986 Out of sample Prediction

pred.out<- predict(credit.rpart, german_credit_test, type="class")
table(german_credit_test$response, pred.out, dnn=c("Truth","Predicted"))

##      Predicted
## Truth   0   1
##     0  98 113
##     1  16  73

Missclassification rate & LOSS Value

#LOSS
(outsamplecost.df<-cost(german_credit_test$response,pred.out))

## [1] 0.6433333

#Misclassification rate
(outsampleMCR.df<-mean(german_credit_test$response!=pred.out))

## [1] 0.43

#roc
credit.test.pred.out.prob.rpart<- predict(credit.rpart, german_credit_test, type="prob")[,2]
pred <- prediction(credit.test.pred.out.prob.rpart, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7234677

LOSS:.643 MCR:.43 AUC:.723 Pruning a tree

The ideal way of creating a binary tree is to construct a large tree and then prune it to an optimum level

credit.rpart_prune <- rpart(formula = response ~ . , data = german_credit_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)),cp = 0.001)

We have a bushy tree, Now we can try to prune up to a optimum level using plotcp function

prp(credit.rpart_prune)

plotcp(credit.rpart_prune)

printcp(credit.rpart_prune)

## 
## Classification tree:
## rpart(formula = response ~ ., data = german_credit_train, method = "class", 
##     parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] age           amount        chk_acct      credit_his    duration     
##  [6] job           n_credits     other_debtor  other_install present_emp  
## [11] property      purpose       saving_acct  
## 
## Root node error: 489/700 = 0.69857
## 
## n= 700 
## 
##           CP nsplit rel error xerror    xstd
## 1  0.1053170      0   1.00000 5.0000 0.12414
## 2  0.0439673      2   0.78937 2.5583 0.12763
## 3  0.0306748      4   0.70143 2.9243 0.13108
## 4  0.0143149      6   0.64008 2.8241 0.13025
## 5  0.0129516     13   0.53988 2.7321 0.12879
## 6  0.0102249     16   0.50102 2.6933 0.12835
## 7  0.0081800     17   0.49080 2.6933 0.12835
## 8  0.0068166     18   0.48262 2.6421 0.12750
## 9  0.0040900     24   0.43967 2.3865 0.12374
## 10 0.0010225     28   0.42331 2.3967 0.12388
## 11 0.0010000     30   0.42127 2.3967 0.12388

xerror gives you the cross-validation (default is 10-fold) error. You can see that the rel error (in-sample error) is always decreasing as model is more complex, while the cross-validation error (measure of performance on future observations) is not. That is why we prune the tree to avoid overfitting the training data.

credit.tree<-prune(credit.rpart_prune, cp = 0.002)

4.2.2 Model Assessment

Insample Prediction

credit.train.pred.in<- predict(credit.tree, german_credit_train, type="class")
table(german_credit_train$response, credit.train.pred.in, dnn=c("Truth","Predicted"))

##      Predicted
## Truth   0   1
##     0 317 172
##     1   7 204

Here, we observe that FN are few more than the last tree(made with default parameters). FP although seems to have improved for insample data.

Missclassification rate & LOSS Value

#LOSS
(insamplecost.pr<-cost(german_credit_train$response,credit.train.pred.in))

## [1] 0.2957143

#Misclassification rate
(insampleMCR.pr<-mean(german_credit_train$response!=credit.train.pred.in))

## [1] 0.2557143

#roc
credit.train.pred.in.prob<- predict(credit.tree, german_credit_train, type="prob")[,2]
pred <- prediction(credit.train.pred.in.prob, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8674343

LOSS:.295 MCR:.255 AUC:.867 Out of sample Prediction

credit.test.pred.out<- predict(credit.tree, german_credit_test, type="class")
table(german_credit_test$response, credit.test.pred.out, dnn=c("Truth","Predicted"))

##      Predicted
## Truth   0   1
##     0 124  87
##     1  23  66

Missclassification rate & LOSS Value

#LOSS
(outsamplecost.pr<-cost(german_credit_test$response,credit.test.pred.out))

## [1] 0.6733333

#Misclassification rate
(outsampleMCR.pr<-mean(german_credit_test$response!=credit.test.pred.out))

## [1] 0.3666667

#roc
credit.test.pred.in.prob<- predict(credit.tree, german_credit_test, type="prob")[,2]
pred <- prediction(credit.test.pred.in.prob, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7409074

LOSS:.673 MCR:.366 AUC:.740

What tree do we choose?

The insample performance is better for pruned tree(but that not what we are concerned about) The out of sample Loss is more in pruned tree,but overall MCR is better.

Hence, to choose a tree we need to choose between these to trade-off parameters. Here, we need lesser loss the better(we see small difference .67 and .64) but a significant difference between MCR(.36 and .48) between the two.Also AUC is beter for prune Model.

Hence, we choose the pruned tree: credit.tree

Please note, we can use different mechanisms to choose the best model such as : AUC, LOSS, MCR.

4.3 Essemble Methods

Advanced Tree Models – Bagging, Random Forests, and Boosting

4.3.1 Bagging

Bagging stands for Bootstrap and Aggregating. It employs the idea of bootstrap but the purpose is not to study bias and standard errors of estimates. Instead, the goal of Bagging is to improve prediction accuracy. It fits a tree for each bootsrap sample, and then aggregate the predicted values from all these different trees. For more details, you may look at Wikepedia, or you can find the original paper Leo Breiman (1996).

To my best knowledge, it seems that bagging() won’t take an argument for asymmetric loss. Therefore, the classification results might not be appropriate.Lets check it out…

credit.bg <- randomForest(as.factor(response)~., data = german_credit_train,mtry=ncol(german_credit_train)-1,ntree=1000)
credit.bg

## 
## Call:
##  randomForest(formula = as.factor(response) ~ ., data = german_credit_train,      mtry = ncol(german_credit_train) - 1, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 20
## 
##         OOB estimate of  error rate: 25.14%
## Confusion matrix:
##     0  1 class.error
## 0 425 64   0.1308793
## 1 112 99   0.5308057

Models sensitivity will be less because the FN are high which is concerning, although FP are less which is good thing.

plot(credit.bg, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))

Insample Analysis

## confusion matrix
credit.bg.pred<- predict(credit.bg, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.bg.pred.class<- (credit.bg.pred>optimal.pcut)*1
table(german_credit_train$response, credit.bg.pred.class, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 206 283
##    1  22 189

Lets calculate the Loss and MCR for insample

#LOSS
(BG.insamplecost.pr<-cost(german_credit_train$response,credit.bg.pred.class))

## [1] 0.5614286

#Misclassification rate
(BG.insampleMCR.pr<-mean(german_credit_train$response!=credit.bg.pred.class))

## [1] 0.4357143

#roc
pred <- prediction(credit.bg.pred, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7789133

LOSS is (.58)very high because rpart does not have a indefined cost assymetric parameter MCR=.44 Approx AUC=.772 Out of sample Analysis

Please higher value of AUC can also be by fluke

## confusion matrix
optimal.pcut= .16#our assumption
credit.bg.pred_test<- predict(credit.bg,newdata=german_credit_test, type = "prob")[,2]
credit.bg.pred.class.test<- (credit.bg.pred_test>optimal.pcut)*1
table(german_credit_test$response, credit.bg.pred.class.test, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0  88 123
##    1   7  82

We can see the FN have reduced in Out of sample as well.

Lets calculate the Loss and MCR for insample and out of sample Missclassification rate & LOSS Value

#LOSS
(BG.outsamplecost.pr<-cost(german_credit_test$response,credit.bg.pred.class.test))

## [1] 0.5266667

#Misclassification rate
(BG.outsampleMCR.pr<-mean(german_credit_test$response!=credit.bg.pred.class.test))

## [1] 0.4333333

#roc
pred <- prediction(credit.bg.pred_test, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8004952

LOSS=.51 MCR=.43 AUC=.800

4.3.2 RandomForest

credit.rf <- randomForest(as.factor(response)~., data = german_credit_train,mtry=sqrt(ncol(german_credit_train)-1),ntree=1000)
credit.rf

## 
## Call:
##  randomForest(formula = as.factor(response) ~ ., data = german_credit_train,      mtry = sqrt(ncol(german_credit_train) - 1), ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 24.57%
## Confusion matrix:
##     0  1 class.error
## 0 446 43  0.08793456
## 1 129 82  0.61137441

Models sensitivity will be less because the FN are high which is concerning, although FP are less which is good thing.

plot(credit.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))

Insample Analysis

## confusion matrix
credit.rf.pred<- predict(credit.rf, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.rf.pred.class<- (credit.rf.pred>optimal.pcut)*1
table(german_credit_train$response, credit.rf.pred.class, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0 167 322
##    1  19 192

Note thatFN are less which is a positive sign..

Lets calculate the Loss and MCR for insample

#LOSS
(rf.insamplecost<-cost(german_credit_train$response,credit.rf.pred.class))

## [1] 0.5957143

#Misclassification rate
(rf.insampleMCR<-mean(german_credit_train$response!=credit.rf.pred.class))

## [1] 0.4871429

#roc
pred <- prediction(credit.rf.pred, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7757053

LOSS=.58 MCR=.44 AUC=.781

Out of sample Analysis

## confusion matrix
credit.rf.pred_test<- predict(credit.rf,newdata=german_credit_test, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.rf.pred.class.test<- (credit.rf.pred_test>optimal.pcut)*1
table(german_credit_test$response, credit.rf.pred.class.test, dnn = c("True", "Pred"))

##     Pred
## True   0   1
##    0  76 135
##    1   4  85

Lets calculate the Loss and MCR for insample and out of sample Missclassification rate & LOSS Value

#LOSS
(rf.outsamplecost<-cost(german_credit_test$response,credit.rf.pred.class.test))

## [1] 0.5166667

#Misclassification rate
(rf.outsampleMCR<-mean(german_credit_test$response!=credit.rf.pred.class.test))

## [1] 0.4633333

#roc
pred <- prediction(credit.rf.pred_test, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8065126

LOSS=.5133 MCR=.46 AUC=.805

4.3.3 Gradient Boosting

Boosting builds a number of small trees, and each time, the response is the residual from last tree. It is a sequential procedure. We use gbm package to build boosted trees.

credit.bo= gbm(response~., data = german_credit_train, distribution = "bernoulli",n.trees = 100, shrinkage = 0.01, interaction.depth = 8)
summary(credit.bo)

##                               var     rel.inf
## chk_acct                 chk_acct 24.79473162
## purpose                   purpose 13.93340620
## duration                 duration 12.26373402
## amount                     amount  7.88413169
## other_install       other_install  6.58860998
## credit_his             credit_his  5.69527536
## present_emp           present_emp  5.05194876
## saving_acct           saving_acct  4.90412351
## property                 property  4.55080555
## age                           age  4.49352334
## installment_rate installment_rate  1.93167495
## present_resid       present_resid  1.88800520
## housing                   housing  1.75965162
## job                           job  1.43809345
## sex                           sex  1.35982765
## other_debtor         other_debtor  0.63424947
## n_credits               n_credits  0.34181637
## telephone               telephone  0.26795525
## n_people                 n_people  0.13894866
## foreign                   foreign  0.07948735

par(mfrow=c(1,3))
plot(credit.bo, i="chk_acct")

plot(credit.bo, i="duration")

plot(credit.bo, i="purpose")

Insample analysis

pred.credit.bo.in<- predict(credit.bo, newdata = german_credit_train,type ="response" ,n.trees =100 )
optimal.pcut= .16#our assumption
credit.bo.pred.class<- (pred.credit.bo.in>optimal.pcut)*1
table(german_credit_train$response, credit.bo.pred.class, dnn = c("True", "Pred"))

##     Pred
## True   1
##    0 489
##    1 211

#LOSS
(bo.insamplecost<-cost(german_credit_train$response,credit.bo.pred.class))

## [1] 0.6985714

#Misclassification rate
(bo.insampleMCR<-mean(german_credit_train$response!=credit.bo.pred.class))

## [1] 0.6985714

#class(pred.credit.bo.in)
pred <- prediction(pred.credit.bo.in, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8853837

LOSS:0.6985714 MCR:0.6985714 AUC:.886

Out of sample analysis

pred.credit.bo.out<- predict(credit.bo, newdata = german_credit_test,type ="response" ,n.trees =100 )
optimal.pcut= .16#our assumption
credit.bo.pred.class<- (pred.credit.bo.out>optimal.pcut)*1
table(german_credit_test$response, credit.bo.pred.class, dnn = c("True", "Pred"))

##     Pred
## True   1
##    0 211
##    1  89

#LOSS
(bo.outsamplecost<-cost(german_credit_test$response,credit.bo.pred.class))

## [1] 0.7033333

#Misclassification rate
(bo.outsampleMCR<-mean(german_credit_test$response!=credit.bo.pred.class))

## [1] 0.7033333

pred <- prediction(pred.credit.bo.out, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8173492

LOSS:0.703 MCR:0.703 AUC:.819

4.4 GAM

n <- names(german_credit_train)
cont<-c("duration","age","amount","response")
f <- as.formula(paste("response ~s(duration)+s(age)+s(amount) +", paste(n[!n %in% cont], collapse = " + ")))
f

## response ~ s(duration) + s(age) + s(amount) + chk_acct + credit_his + 
##     purpose + saving_acct + present_emp + installment_rate + 
##     sex + other_debtor + present_resid + property + other_install + 
##     housing + n_credits + job + n_people + telephone + foreign

credit.gam <- gam(formula = f, family=binomial,data=german_credit_train);
summary(credit.gam)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## response ~ s(duration) + s(age) + s(amount) + chk_acct + credit_his + 
##     purpose + saving_acct + present_emp + installment_rate + 
##     sex + other_debtor + present_resid + property + other_install + 
##     housing + n_credits + job + n_people + telephone + foreign
## 
## Parametric coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -1.713e-01  1.217e+00  -0.141  0.88808    
## chk_acctA12       -3.865e-01  2.670e-01  -1.448  0.14772    
## chk_acctA13       -8.783e-01  4.446e-01  -1.976  0.04819 *  
## chk_acctA14       -1.713e+00  2.785e-01  -6.150 7.73e-10 ***
## credit_hisA31      9.244e-01  7.644e-01   1.209  0.22654    
## credit_hisA32      4.953e-02  5.750e-01   0.086  0.93136    
## credit_hisA33     -3.767e-01  6.105e-01  -0.617  0.53720    
## credit_hisA34     -9.624e-01  5.810e-01  -1.656  0.09764 .  
## purposeA41        -1.385e+00  4.283e-01  -3.234  0.00122 ** 
## purposeA410       -2.374e+00  1.222e+00  -1.943  0.05201 .  
## purposeA42        -8.916e-01  3.238e-01  -2.754  0.00589 ** 
## purposeA43        -8.111e-01  2.949e-01  -2.751  0.00594 ** 
## purposeA44        -8.147e-01  8.736e-01  -0.933  0.35099    
## purposeA45        -3.656e-01  6.160e-01  -0.594  0.55284    
## purposeA46        -1.802e-02  5.118e-01  -0.035  0.97191    
## purposeA48        -3.922e+01  3.355e+07   0.000  1.00000    
## purposeA49        -4.909e-01  3.994e-01  -1.229  0.21906    
## saving_acctA62    -2.220e-01  3.528e-01  -0.629  0.52922    
## saving_acctA63    -9.380e-02  4.548e-01  -0.206  0.83660    
## saving_acctA64    -6.864e-01  6.235e-01  -1.101  0.27090    
## saving_acctA65    -1.074e+00  3.318e-01  -3.236  0.00121 ** 
## present_empA72    -4.915e-01  5.071e-01  -0.969  0.33241    
## present_empA73    -6.510e-01  4.884e-01  -1.333  0.18254    
## present_empA74    -1.243e+00  5.321e-01  -2.337  0.01944 *  
## present_empA75    -6.279e-01  4.946e-01  -1.270  0.20424    
## installment_rate   2.954e-01  1.084e-01   2.726  0.00642 ** 
## sexA92             2.369e-01  4.855e-01   0.488  0.62553    
## sexA93            -3.098e-01  4.767e-01  -0.650  0.51571    
## sexA94            -2.722e-02  5.502e-01  -0.049  0.96054    
## other_debtorA102   8.110e-02  5.064e-01   0.160  0.87276    
## other_debtorA103  -7.361e-01  4.746e-01  -1.551  0.12096    
## present_resid      3.749e-02  1.075e-01   0.349  0.72728    
## propertyA122       4.050e-01  3.033e-01   1.335  0.18176    
## propertyA123       1.077e-01  2.830e-01   0.381  0.70351    
## propertyA124       4.756e-01  5.127e-01   0.928  0.35359    
## other_installA142  7.392e-02  5.406e-01   0.137  0.89124    
## other_installA143 -9.441e-01  2.954e-01  -3.195  0.00140 ** 
## housingA152       -4.925e-01  2.852e-01  -1.727  0.08419 .  
## housingA153       -4.355e-01  5.638e-01  -0.772  0.43993    
## n_credits          5.473e-01  2.365e-01   2.314  0.02067 *  
## jobA172            8.127e-01  7.422e-01   1.095  0.27352    
## jobA173            8.556e-01  7.096e-01   1.206  0.22790    
## jobA174            7.307e-01  7.225e-01   1.011  0.31180    
## n_people           2.842e-01  3.015e-01   0.943  0.34585    
## telephoneA192     -2.878e-01  2.446e-01  -1.176  0.23943    
## foreignA202       -1.550e+00  8.599e-01  -1.803  0.07144 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##               edf Ref.df Chi.sq p-value   
## s(duration) 1.268  1.489  7.828 0.00826 **
## s(age)      1.001  1.002  2.147 0.14294   
## s(amount)   1.531  1.901  4.085 0.18323   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.256   Deviance explained = 26.9%
## UBRE = 0.037162  Scale est. = 1         n = 700

plot(credit.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)

AIC(credit.gam)

## [1] 726.0132

BIC(credit.gam)

## [1] 952.6584

credit.gam$deviance

## [1] 626.4126

We can notice that even the continious variables in the data do not have a non-linear distribution. This is backed up by the above plots where we can see a straight line

Insample Performance

pcut.gam <- .16
prob.gam.in<-predict(credit.gam,german_credit_train,type="response")
pred.gam.in.class<-(prob.gam.in>=pcut.gam)*1
table(german_credit_train$response,pred.gam.in.class,dnn=c("Observed","Predicted"))

##         Predicted
## Observed   0   1
##        0 264 225
##        1  21 190

FN : Looks fairly well but the FP looks really high

#LOSS
(gam.insamplecost<-cost(german_credit_train$response,pred.gam.in.class))

## [1] 0.4714286

#Misclassification rate
(gam.insampleMCR<-mean(german_credit_train$response!=pred.gam.in.class))

## [1] 0.3514286

pred <- prediction(as.numeric(prob.gam.in),german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.8340457

LOSS: .471 MCR: .351 AUC: .834

Out of sample Performance

pcut.gam <- .16
prob.gam.out<-predict(credit.gam,german_credit_test,type="response")
pred.gam.out.class<-(prob.gam.out>=pcut.gam)*1
table(german_credit_test$response,pred.gam.out.class,dnn=c("Observed","Predicted"))

##         Predicted
## Observed   0   1
##        0 102 109
##        1   8  81

FN : Looks fairly well but the FP looks really high

#LOSS
(gam.out.samplecost<-cost(german_credit_test$response,pred.gam.out.class))

## [1] 0.4966667

#Misclassification rate
(gam.out.sampleMCR<-mean(german_credit_test$response!=pred.gam.out.class))

## [1] 0.39

pred <- prediction(as.numeric(prob.gam.out),german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7912029

LOSS:.496 MCR: .390 AUC: .791

4.5 Neural Net

credit.nnet <- nnet(response~., data=german_credit_train, size=8,decay=0.003, maxit=500)

## # weights:  401
## initial  value 199.027104 
## iter  10 value 147.060497
## iter  20 value 134.217896
## iter  30 value 126.707014
## iter  40 value 113.508218
## iter  50 value 103.778762
## iter  60 value 101.224480
## iter  70 value 100.659776
## iter  80 value 100.619145
## iter  90 value 100.618564
## iter 100 value 100.618461
## final  value 100.618332 
## converged

Insample Analysis

prob.nnet= predict(credit.nnet,german_credit_train)
pred.nnet = as.numeric(prob.nnet > 0.16)
table(german_credit_train$response,pred.nnet, dnn=c("Observed","Predicted"))

##         Predicted
## Observed   0   1
##        0 399  90
##        1  65 146

#LOSS
(nnet.insamplecost<-cost(german_credit_train$response,pred.nnet))

## [1] 0.5928571

#Misclassification rate
(nnet.insampleMCR<-mean(german_credit_train$response!=pred.nnet))

## [1] 0.2214286

pred <- prediction(prob.nnet,german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.791789

LOSS:0.274 MCR:.108 AUC:.737

Out of sample Analysis

prob.nnet.out= predict(credit.nnet,german_credit_test)
pred.nnet.out = as.numeric(prob.nnet.out > 0.16)
table(german_credit_test$response,pred.nnet.out, dnn=c("Observed","Predicted"))

##         Predicted
## Observed   0   1
##        0 164  47
##        1  30  59

#LOSS
(nnet.insamplecost<-cost(german_credit_test$response,pred.nnet.out))

## [1] 0.6566667

#Misclassification rate
(nnet.insampleMCR<-mean(german_credit_test$response!=pred.nnet.out))

## [1] 0.2566667

pred <- prediction(prob.nnet.out,german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))

## [1] 0.7842271

LOSS: .74 MCR:.30 AUC: .737

5.0 Insights

Conclusion:

Ensemble methods we observe are not performing well on this data set. The reason is they do not have a asymmetric cost setting functionality as in rpart. GAM do take a lead here and outperform all other algorithms

German Credit Scoring Data

Vidhi Rathod

09 April 2020

German Credit Scoring data

1.0 Objective

2.0 Initial Set Up

2.1 Package Loading

2.2 Data Loading

2.3 Sampling

2.4 Data Snapshot

3.0 Exploratory Data Analysis

3.1 Initial Analysis

3.2 Deep Dive

4.0 Model Selection

4.1 Logistic Regression

4.1.2 Backward AIC

4.1.3 Backward BIC

4.1.4 Model Analysis:

4.2 Cart

4.2.2 Model Assessment

4.3 Essemble Methods

4.3.1 Bagging

4.3.2 RandomForest

4.3.3 Gradient Boosting

4.4 GAM

4.5 Neural Net

5.0 Insights