The German credit score data are downloadable from http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
Goal: Testing out several different supervised learning algorithms over credit data of subjects to find the one that accurately * * * predicts if an individual will be good candidate to loan or will tend to default
Metric: Loss : Asymmetric cost metric is used to calculate Loss.Ever false negative weighs 5 times to a False positive Missclassification Rate: Incorrect Classifications/Total Classifications Area under the curve(ROC): Area covered under a ROC Curve.
More priority of metrics is : Loss> MCR> AUC
Methodology: Random sample a training data set that contains 70% of original data points. Perform exploratory data analysis. Find a best model for Credit Scoring data using logistic regression with AIC and BIC. Draw ROC curve, report the AUC, and present the misclassification rate table of your final model.
Test the out-of-sample performance.
Required Packages
We have used a seed of : 13383645
set.seed(13383645)
#Reading Data
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
#Assigning variable names
colnames(german_credit)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")
#Response is in 1,2 - we need to change it to 0,1
german_credit$response = german_credit$response - 1
Objective: In here we are random sampling the data in 2 parts 70% training and 30% testing
#Dividing into training and testing dataset
index <- sample(nrow(german_credit),size = nrow(german_credit)*0.70)
german_credit_train <- german_credit[index,]
german_credit_test <- german_credit[-index,]
kable(head(german_credit_train,10) ) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "300px")
| chk_acct | duration | credit_his | purpose | amount | saving_acct | present_emp | installment_rate | sex | other_debtor | present_resid | property | age | other_install | housing | n_credits | job | n_people | telephone | foreign | response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | A13 | 10 | A32 | A44 | 1225 | A61 | A73 | 2 | A93 | A101 | 2 | A123 | 37 | A143 | A152 | 1 | A173 | 1 | A192 | A201 | 0 |
| 799 | A14 | 24 | A33 | A40 | 717 | A65 | A75 | 4 | A94 | A101 | 4 | A123 | 54 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 0 |
| 579 | A12 | 36 | A34 | A40 | 2820 | A61 | A72 | 4 | A91 | A101 | 4 | A123 | 27 | A143 | A152 | 2 | A173 | 1 | A191 | A201 | 1 |
| 686 | A14 | 60 | A32 | A40 | 6527 | A65 | A73 | 4 | A93 | A101 | 4 | A124 | 34 | A143 | A153 | 1 | A173 | 2 | A192 | A201 | 0 |
| 344 | A12 | 18 | A32 | A49 | 4439 | A61 | A75 | 1 | A93 | A102 | 1 | A121 | 33 | A141 | A152 | 1 | A174 | 1 | A192 | A201 | 0 |
| 360 | A11 | 30 | A32 | A42 | 2406 | A61 | A74 | 4 | A92 | A101 | 4 | A121 | 23 | A143 | A151 | 1 | A173 | 1 | A191 | A201 | 1 |
| 395 | A14 | 9 | A34 | A42 | 2406 | A61 | A71 | 2 | A93 | A101 | 3 | A123 | 31 | A143 | A152 | 1 | A174 | 1 | A191 | A201 | 0 |
| 721 | A13 | 9 | A30 | A43 | 1337 | A61 | A72 | 4 | A93 | A101 | 2 | A123 | 34 | A143 | A152 | 2 | A174 | 1 | A192 | A201 | 1 |
| 689 | A14 | 9 | A32 | A43 | 2753 | A62 | A75 | 3 | A93 | A102 | 4 | A123 | 35 | A143 | A152 | 1 | A173 | 1 | A192 | A201 | 0 |
| 144 | A11 | 18 | A32 | A42 | 2462 | A61 | A73 | 2 | A93 | A101 | 2 | A123 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 1 |
#head(german_credit,10)
This German Credit Score classifies people as good or bad borrowers based on their attributes. The response variable is 1 for good borrower or loan and 2 refers to bad borrower or loan. For the ease of working on dataset we have changes the response to binary 0,1.
Dimension
dimCr<-dim(german_credit)
The dimension of data: 1000, 21
Structure
str(german_credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving_acct : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ present_emp : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ present_resid : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ n_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ n_people : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ response : num 0 1 0 0 1 0 0 0 0 1 ...
Observation : Data is a mixture of Factors and numeric columns
Description
In order to see how the response variable is affected by other factors, we will regress it on other variables. As it has only 0 and 1 values, we will use binomial regression.
Data Quality Check
Data quality is another very important step in EDA, it’s imperative to have good data quality for a optimum analysis.
metadata<-t(introduce(german_credit))
colnames(metadata)<-"Values"
metadata
## Values
## rows 1000
## columns 21
## discrete_columns 13
## continuous_columns 8
## all_missing_columns 0
## total_missing_values 0
## complete_rows 1000
## total_observations 21000
## memory_usage 101632
plot_intro(german_credit)
The good news is that this dataset has no missing/ Abnormal values in dataset. The structure looks coherent
Analysis based on Visualization
Analysis of Continious Variables
par(mfrow=c(2,4))
boxplot(german_credit$duration, col = "lightblue", xlab="Duration")
boxplot(german_credit$amount, col = "lightblue", xlab="Amount")
boxplot(german_credit$installment_rate, col = "lightblue", xlab="Installment Rate")
boxplot(german_credit$present_resid, col = "lightblue", xlab="Present resid")
boxplot(german_credit$age, col = "lightblue", xlab="Age")
boxplot(german_credit$n_credits, col = "lightblue", xlab="Num Credits")
boxplot(german_credit$n_people, col = "lightblue", xlab="Num People")
#duration
g1<- ggplot(german_credit, aes(x = as.factor(response), y = duration, fill = as.factor(response))) + geom_boxplot() + theme(legend.position = "none")
#amount
g2<-ggplot(german_credit, aes(x = as.factor(response), y = amount, fill = as.factor(response))) + geom_boxplot() +
theme(legend.position = "none")
#AGE
g4<-ggplot(german_credit, aes(x = as.factor(response), y = age, fill = as.factor(response))) +
geom_boxplot() + theme(legend.position = "none")
ggarrange(g1, g2,g4,
labels = c("duration", "amount","age"),
ncol = 3, nrow = 1)
Observation
#Installment Rates
g3<-ggplot(german_credit, aes(factor(installment_rate), ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g4<-ggplot(german_credit, aes(chk_acct, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g5<-ggplot(german_credit, aes(credit_his, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g6<-ggplot(german_credit, aes(purpose, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g7<-ggplot(german_credit, aes(saving_acct, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g8<-ggplot(german_credit, aes(other_debtor, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g9<-ggplot(german_credit, aes(sex, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g10<-ggplot(german_credit, aes(other_install, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g11<-ggplot(german_credit, aes(foreign, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
g12<-ggplot(german_credit, aes(present_emp, ..count..)) +
geom_bar(aes(fill = as.factor(response)), position = "dodge")
ggarrange(g3, g4,
labels = c("Installment Rates", "chk_acct"),
ncol = 1, nrow = 2)
ggarrange(g5, g6,
labels = c("credit_his", "purpose"),
ncol = 1, nrow = 2)
ggarrange(g7, g8,
labels = c("saving_acct", "other_debtor"),
ncol = 1, nrow = 2)
ggarrange(g9, g10,
labels = c("sex", "other_install"),
ncol = 1, nrow = 2)
ggarrange(g11,g12,
labels = c("foreign","Present_emp"),
ncol = 1, nrow = 2)
The installment_rate variable has a significant difference between the good and bad records, we see that bad records have almost the double median value than good ones.
For chk_acct A11 & A12 do not have much difference between good loans & bad loans buT A13 & A14 have considerably large difference.
For credit_his, for categories A32 - A34 we see the number of good credit are greater and for categories A30-A31.
We observe similar trends in other variables: sex, other_debtor, saving_acct, other_install and foreign.
Lets make a Full Model to start with
Objective: As part of best model we try to find the most suitable model by logistic regression using AIC,BIC criterians.
#full model
g.credit.glm0<- glm(response~., family=binomial, data=german_credit_train)
summary(g.credit.glm0)
##
## Call:
## glm(formula = response ~ ., family = binomial, data = german_credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2331 -0.7054 -0.3760 0.7091 2.5299
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.728e-01 1.297e+00 -0.442 0.658826
## chk_acctA12 -3.849e-01 2.668e-01 -1.443 0.149100
## chk_acctA13 -8.735e-01 4.445e-01 -1.965 0.049391 *
## chk_acctA14 -1.711e+00 2.784e-01 -6.146 7.96e-10 ***
## duration 3.154e-02 1.115e-02 2.829 0.004672 **
## credit_hisA31 9.205e-01 7.632e-01 1.206 0.227791
## credit_hisA32 5.860e-02 5.745e-01 0.102 0.918752
## credit_hisA33 -3.645e-01 6.095e-01 -0.598 0.549841
## credit_hisA34 -9.561e-01 5.810e-01 -1.646 0.099832 .
## purposeA41 -1.407e+00 4.247e-01 -3.313 0.000924 ***
## purposeA410 -2.378e+00 1.216e+00 -1.955 0.050604 .
## purposeA42 -9.080e-01 3.227e-01 -2.814 0.004893 **
## purposeA43 -8.173e-01 2.947e-01 -2.774 0.005541 **
## purposeA44 -8.093e-01 8.741e-01 -0.926 0.354502
## purposeA45 -3.644e-01 6.146e-01 -0.593 0.553207
## purposeA46 -1.252e-02 5.116e-01 -0.024 0.980477
## purposeA48 -1.622e+01 6.593e+02 -0.025 0.980378
## purposeA49 -5.032e-01 3.993e-01 -1.260 0.207586
## amount 9.281e-05 5.677e-05 1.635 0.102094
## saving_acctA62 -2.180e-01 3.522e-01 -0.619 0.536042
## saving_acctA63 -8.445e-02 4.551e-01 -0.186 0.852773
## saving_acctA64 -6.878e-01 6.248e-01 -1.101 0.270984
## saving_acctA65 -1.082e+00 3.311e-01 -3.269 0.001081 **
## present_empA72 -4.795e-01 5.057e-01 -0.948 0.343017
## present_empA73 -6.431e-01 4.872e-01 -1.320 0.186844
## present_empA74 -1.227e+00 5.303e-01 -2.314 0.020668 *
## present_empA75 -6.104e-01 4.932e-01 -1.238 0.215837
## installment_rate 3.112e-01 1.072e-01 2.904 0.003689 **
## sexA92 2.397e-01 4.854e-01 0.494 0.621382
## sexA93 -3.204e-01 4.765e-01 -0.672 0.501305
## sexA94 -2.237e-02 5.504e-01 -0.041 0.967578
## other_debtorA102 7.311e-02 5.047e-01 0.145 0.884828
## other_debtorA103 -7.387e-01 4.743e-01 -1.557 0.119376
## present_resid 4.087e-02 1.074e-01 0.381 0.703474
## propertyA122 3.985e-01 3.031e-01 1.315 0.188640
## propertyA123 1.029e-01 2.824e-01 0.365 0.715464
## propertyA124 4.752e-01 5.106e-01 0.931 0.352045
## age -1.630e-02 1.097e-02 -1.486 0.137153
## other_installA142 6.909e-02 5.409e-01 0.128 0.898367
## other_installA143 -9.548e-01 2.951e-01 -3.235 0.001215 **
## housingA152 -4.889e-01 2.847e-01 -1.717 0.085970 .
## housingA153 -4.466e-01 5.612e-01 -0.796 0.426152
## n_credits 5.451e-01 2.358e-01 2.312 0.020774 *
## jobA172 7.983e-01 7.412e-01 1.077 0.281409
## jobA173 8.302e-01 7.084e-01 1.172 0.241185
## jobA174 7.056e-01 7.220e-01 0.977 0.328469
## n_people 2.768e-01 3.010e-01 0.920 0.357681
## telephoneA192 -2.917e-01 2.446e-01 -1.192 0.233093
## foreignA202 -1.553e+00 8.607e-01 -1.804 0.071257 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 856.90 on 699 degrees of freedom
## Residual deviance: 628.28 on 651 degrees of freedom
## AIC: 726.28
##
## Number of Fisher Scoring iterations: 14
Observation: We observe that amongst the 20 predictors only a few are significant. In order to choose the significant variables we decide to use AIC/BIC approach
#backward selection
#AIC
g.credit.glm.back <- step(g.credit.glm0) # backward selection (if you don't specify anything)
## Start: AIC=726.28
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + job + n_people + telephone + foreign
##
## Df Deviance AIC
## - job 3 629.70 721.70
## - property 3 630.59 722.59
## - present_resid 1 628.42 724.42
## - other_debtor 2 630.93 724.93
## - n_people 1 629.12 725.12
## - sex 3 633.23 725.23
## - housing 2 631.26 725.26
## - present_emp 4 635.50 725.50
## - telephone 1 629.71 725.71
## <none> 628.28 726.28
## - age 1 630.52 726.52
## - amount 1 630.94 726.94
## - foreign 1 632.43 728.43
## - n_credits 1 633.66 729.66
## - saving_acct 4 640.92 730.92
## - duration 1 636.40 732.40
## - installment_rate 1 636.97 732.97
## - credit_his 4 643.45 733.45
## - purpose 9 654.76 734.76
## - other_install 2 641.84 735.84
## - chk_acct 3 674.39 766.39
##
## Step: AIC=721.7
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - property 3 632.04 718.04
## - present_emp 4 635.74 719.74
## - present_resid 1 629.96 719.96
## - other_debtor 2 632.14 720.14
## - housing 2 632.40 720.40
## - n_people 1 630.54 720.54
## - sex 3 634.59 720.59
## - telephone 1 631.31 721.31
## <none> 629.70 721.70
## - age 1 632.17 722.17
## - amount 1 632.39 722.39
## - foreign 1 633.90 723.90
## - n_credits 1 634.75 724.75
## - saving_acct 4 643.12 727.12
## - credit_his 4 644.25 728.25
## - duration 1 638.34 728.34
## - installment_rate 1 639.09 729.09
## - purpose 9 655.91 729.91
## - other_install 2 644.00 732.00
## - chk_acct 3 675.18 761.18
##
## Step: AIC=718.04
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + age + other_install + housing + n_credits +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - present_resid 1 632.30 716.30
## - other_debtor 2 634.54 716.54
## - present_emp 4 638.55 716.55
## - n_people 1 632.83 716.83
## - housing 2 634.87 716.87
## - sex 3 637.02 717.02
## - telephone 1 633.37 717.37
## <none> 632.04 718.04
## - age 1 634.36 718.36
## - amount 1 634.93 718.93
## - foreign 1 636.03 720.03
## - n_credits 1 636.75 720.75
## - saving_acct 4 645.15 723.15
## - credit_his 4 646.22 724.22
## - duration 1 641.04 725.04
## - installment_rate 1 641.68 725.68
## - purpose 9 658.65 726.65
## - other_install 2 646.82 728.82
## - chk_acct 3 678.62 758.62
##
## Step: AIC=716.3
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## age + other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - present_emp 4 638.66 714.66
## - other_debtor 2 634.90 714.90
## - n_people 1 633.15 715.15
## - sex 3 637.34 715.34
## - telephone 1 633.60 715.60
## - housing 2 636.04 716.04
## <none> 632.30 716.30
## - age 1 634.44 716.44
## - amount 1 635.12 717.12
## - foreign 1 636.32 718.32
## - n_credits 1 637.07 719.07
## - saving_acct 4 645.37 721.37
## - credit_his 4 646.43 722.43
## - duration 1 641.55 723.55
## - installment_rate 1 641.95 723.95
## - purpose 9 658.88 724.88
## - other_install 2 646.87 726.87
## - chk_acct 3 679.70 757.70
##
## Step: AIC=714.66
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - n_people 1 639.50 713.50
## - telephone 1 639.94 713.94
## - other_debtor 2 642.10 714.10
## - age 1 640.48 714.48
## <none> 638.66 714.66
## - housing 2 642.69 714.69
## - sex 3 644.80 714.80
## - amount 1 641.60 715.60
## - foreign 1 642.58 716.58
## - n_credits 1 642.99 716.99
## - credit_his 4 652.54 720.54
## - duration 1 646.91 720.91
## - saving_acct 4 652.99 720.99
## - purpose 9 664.03 722.03
## - installment_rate 1 648.87 722.87
## - other_install 2 654.12 726.12
## - chk_acct 3 688.29 758.29
##
## Step: AIC=713.5
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + housing + n_credits + telephone + foreign
##
## Df Deviance AIC
## - other_debtor 2 642.76 712.76
## - telephone 1 640.79 712.79
## - sex 3 644.94 712.94
## - age 1 641.19 713.19
## <none> 639.50 713.50
## - housing 2 643.67 713.67
## - amount 1 642.33 714.33
## - foreign 1 643.50 715.50
## - n_credits 1 644.17 716.17
## - saving_acct 4 653.39 719.39
## - duration 1 647.65 719.65
## - credit_his 4 653.75 719.75
## - installment_rate 1 649.38 721.38
## - purpose 9 665.95 721.95
## - other_install 2 654.88 724.88
## - chk_acct 3 689.31 757.31
##
## Step: AIC=712.76
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + age + other_install +
## housing + n_credits + telephone + foreign
##
## Df Deviance AIC
## - telephone 1 643.97 711.97
## - sex 3 648.46 712.46
## - age 1 644.63 712.63
## <none> 642.76 712.76
## - housing 2 647.12 713.12
## - amount 1 645.91 713.91
## - n_credits 1 647.56 715.56
## - foreign 1 647.97 715.97
## - saving_acct 4 656.05 718.05
## - duration 1 650.74 718.74
## - credit_his 4 656.75 718.75
## - installment_rate 1 652.94 720.94
## - purpose 9 670.63 722.63
## - other_install 2 657.77 723.77
## - chk_acct 3 690.81 754.81
##
## Step: AIC=711.97
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + age + other_install +
## housing + n_credits + foreign
##
## Df Deviance AIC
## - sex 3 649.61 711.61
## <none> 643.97 711.97
## - housing 2 648.29 712.29
## - age 1 646.29 712.29
## - amount 1 646.37 712.37
## - n_credits 1 648.49 714.49
## - foreign 1 648.82 714.82
## - saving_acct 4 657.36 717.36
## - credit_his 4 658.09 718.09
## - duration 1 652.65 718.65
## - installment_rate 1 653.42 719.42
## - purpose 9 672.01 722.01
## - other_install 2 658.90 722.90
## - chk_acct 3 693.22 755.22
##
## Step: AIC=711.61
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + age + other_install + housing +
## n_credits + foreign
##
## Df Deviance AIC
## - amount 1 651.40 711.40
## <none> 649.61 711.61
## - age 1 652.53 712.53
## - housing 2 655.51 713.51
## - n_credits 1 653.81 713.81
## - foreign 1 655.25 715.25
## - installment_rate 1 657.13 717.13
## - saving_acct 4 663.51 717.51
## - duration 1 658.50 718.50
## - credit_his 4 664.93 718.93
## - other_install 2 663.65 721.65
## - purpose 9 679.00 723.00
## - chk_acct 3 698.56 754.56
##
## Step: AIC=711.4
## response ~ chk_acct + duration + credit_his + purpose + saving_acct +
## installment_rate + age + other_install + housing + n_credits +
## foreign
##
## Df Deviance AIC
## <none> 651.40 711.40
## - age 1 654.04 712.04
## - housing 2 657.43 713.43
## - n_credits 1 655.82 713.82
## - foreign 1 657.16 715.16
## - installment_rate 1 657.22 715.22
## - saving_acct 4 664.64 716.64
## - credit_his 4 667.45 719.45
## - other_install 2 665.47 721.47
## - purpose 9 679.78 721.78
## - duration 1 673.27 731.27
## - chk_acct 3 699.45 753.45
Selected model:
Step: AIC=666.84 response ~ chk_acct + duration + credit_his + purpose + saving_acct + installment_rate + other_debtor + other_install + housing + telephone + foreign
summary(g.credit.glm.back)
##
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose +
## saving_acct + installment_rate + age + other_install + housing +
## n_credits + foreign, family = binomial, data = german_credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0395 -0.7362 -0.4081 0.7500 2.6596
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.605744 0.874879 0.692 0.488702
## chk_acctA12 -0.372430 0.253976 -1.466 0.142539
## chk_acctA13 -0.903270 0.431385 -2.094 0.036270 *
## chk_acctA14 -1.673746 0.267017 -6.268 3.65e-10 ***
## duration 0.039346 0.008555 4.599 4.25e-06 ***
## credit_hisA31 0.671842 0.728624 0.922 0.356492
## credit_hisA32 -0.116590 0.550797 -0.212 0.832360
## credit_hisA33 -0.462586 0.584424 -0.792 0.428638
## credit_hisA34 -1.123011 0.552559 -2.032 0.042115 *
## purposeA41 -1.312801 0.393048 -3.340 0.000838 ***
## purposeA410 -2.469875 1.161103 -2.127 0.033405 *
## purposeA42 -0.785402 0.306218 -2.565 0.010322 *
## purposeA43 -0.977602 0.282374 -3.462 0.000536 ***
## purposeA44 -0.676233 0.812261 -0.833 0.405109
## purposeA45 -0.453100 0.595793 -0.760 0.446956
## purposeA46 0.020329 0.491951 0.041 0.967038
## purposeA48 -15.994174 640.729808 -0.025 0.980085
## purposeA49 -0.610618 0.379777 -1.608 0.107872
## saving_acctA62 -0.242863 0.334731 -0.726 0.468117
## saving_acctA63 -0.159479 0.431181 -0.370 0.711483
## saving_acctA64 -0.604616 0.583488 -1.036 0.300104
## saving_acctA65 -1.074838 0.317708 -3.383 0.000717 ***
## installment_rate 0.220548 0.092396 2.387 0.016987 *
## age -0.015437 0.009564 -1.614 0.106508
## other_installA142 0.175658 0.508142 0.346 0.729578
## other_installA143 -0.900106 0.283061 -3.180 0.001473 **
## housingA152 -0.583248 0.257559 -2.265 0.023542 *
## housingA153 -0.136733 0.393178 -0.348 0.728019
## n_credits 0.474657 0.225390 2.106 0.035210 *
## foreignA202 -1.729716 0.838006 -2.064 0.039010 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 856.9 on 699 degrees of freedom
## Residual deviance: 651.4 on 670 degrees of freedom
## AIC: 711.4
##
## Number of Fisher Scoring iterations: 14
#deviance
#AIC
#BIC
(aicmodel<-glance(g.credit.glm.back)[,c(4,5)])
## # A tibble: 1 x 2
## AIC BIC
## <dbl> <dbl>
## 1 711. 848.
g.credit.glm.back.BIC <- step(g.credit.glm0, k=log(nrow(german_credit_train)))
## Start: AIC=949.28
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + job + n_people + telephone + foreign
##
## Df Deviance AIC
## - purpose 9 654.76 916.81
## - present_emp 4 635.50 930.30
## - job 3 629.70 931.05
## - property 3 630.59 931.94
## - sex 3 633.23 934.58
## - saving_acct 4 640.92 935.72
## - credit_his 4 643.45 938.25
## - other_debtor 2 630.93 938.83
## - housing 2 631.26 939.16
## - present_resid 1 628.42 942.87
## - n_people 1 629.12 943.57
## - telephone 1 629.71 944.16
## - age 1 630.52 944.97
## - amount 1 630.94 945.39
## - foreign 1 632.43 946.89
## - n_credits 1 633.66 948.11
## <none> 628.28 949.28
## - other_install 2 641.84 949.74
## - duration 1 636.40 950.85
## - installment_rate 1 636.97 951.42
## - chk_acct 3 674.39 975.74
##
## Step: AIC=916.81
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## present_emp + installment_rate + sex + other_debtor + present_resid +
## property + age + other_install + housing + n_credits + job +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - present_emp 4 660.47 896.31
## - job 3 655.91 898.30
## - property 3 657.55 899.94
## - sex 3 660.76 903.15
## - saving_acct 4 667.49 903.33
## - credit_his 4 667.78 903.62
## - housing 2 657.87 906.81
## - other_debtor 2 658.67 907.61
## - present_resid 1 654.90 910.39
## - age 1 655.89 911.38
## - amount 1 656.21 911.70
## - telephone 1 656.27 911.76
## - n_people 1 656.63 912.12
## - foreign 1 658.09 913.58
## - n_credits 1 660.04 915.53
## <none> 654.76 916.81
## - other_install 2 668.30 917.24
## - installment_rate 1 662.72 918.22
## - duration 1 663.07 918.56
## - chk_acct 3 703.39 945.78
##
## Step: AIC=896.31
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## installment_rate + sex + other_debtor + present_resid + property +
## age + other_install + housing + n_credits + job + n_people +
## telephone + foreign
##
## Df Deviance AIC
## - job 3 660.69 876.87
## - property 3 663.74 879.93
## - credit_his 4 673.14 882.78
## - sex 3 667.85 884.04
## - saving_acct 4 674.72 884.35
## - housing 2 663.36 886.10
## - other_debtor 2 665.10 887.84
## - present_resid 1 660.53 889.82
## - age 1 661.58 890.86
## - amount 1 661.91 891.19
## - telephone 1 662.07 891.35
## - n_people 1 662.37 891.66
## - foreign 1 663.63 892.92
## - n_credits 1 665.23 894.52
## <none> 660.47 896.31
## - duration 1 668.17 897.46
## - other_install 2 674.85 897.59
## - installment_rate 1 668.79 898.08
## - chk_acct 3 711.50 927.68
##
## Step: AIC=876.87
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## installment_rate + sex + other_debtor + present_resid + property +
## age + other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - property 3 663.96 860.50
## - credit_his 4 673.20 863.18
## - sex 3 668.10 864.63
## - saving_acct 4 675.17 865.15
## - housing 2 663.48 866.56
## - other_debtor 2 665.19 868.28
## - present_resid 1 660.77 870.41
## - age 1 661.94 871.58
## - amount 1 662.19 871.83
## - telephone 1 662.51 872.15
## - n_people 1 662.59 872.22
## - foreign 1 663.87 873.50
## - n_credits 1 665.35 874.98
## <none> 660.69 876.87
## - other_install 2 675.22 878.30
## - duration 1 668.77 878.40
## - installment_rate 1 669.36 878.99
## - chk_acct 3 711.58 908.11
##
## Step: AIC=860.5
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## installment_rate + sex + other_debtor + present_resid + age +
## other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - credit_his 4 675.84 846.17
## - saving_acct 4 678.02 848.35
## - sex 3 671.80 848.68
## - housing 2 667.43 850.86
## - other_debtor 2 668.91 852.34
## - present_resid 1 664.03 854.01
## - age 1 664.99 854.97
## - telephone 1 665.45 855.43
## - amount 1 665.61 855.59
## - n_people 1 665.87 855.85
## - foreign 1 666.93 856.91
## - n_credits 1 668.07 858.05
## <none> 663.96 860.50
## - duration 1 672.16 862.14
## - other_install 2 679.05 862.48
## - installment_rate 1 672.87 862.85
## - chk_acct 3 717.11 893.99
##
## Step: AIC=846.17
## response ~ chk_acct + duration + amount + saving_acct + installment_rate +
## sex + other_debtor + present_resid + age + other_install +
## housing + n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - saving_acct 4 689.55 833.67
## - sex 3 685.30 835.98
## - other_debtor 2 680.34 837.56
## - housing 2 680.50 837.72
## - present_resid 1 675.85 839.63
## - n_credits 1 676.24 840.02
## - age 1 677.19 840.97
## - telephone 1 677.77 841.54
## - n_people 1 678.24 842.02
## - amount 1 678.31 842.09
## - foreign 1 679.45 843.23
## <none> 675.84 846.17
## - duration 1 683.86 847.64
## - installment_rate 1 685.02 848.79
## - other_install 2 697.98 855.20
## - chk_acct 3 738.13 888.81
##
## Step: AIC=833.67
## response ~ chk_acct + duration + amount + installment_rate +
## sex + other_debtor + present_resid + age + other_install +
## housing + n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - sex 3 699.11 823.58
## - other_debtor 2 693.19 824.21
## - housing 2 694.22 825.24
## - present_resid 1 689.55 827.12
## - n_credits 1 690.19 827.77
## - n_people 1 691.05 828.63
## - age 1 691.06 828.63
## - amount 1 691.42 828.99
## - telephone 1 691.58 829.15
## - foreign 1 694.52 832.09
## <none> 689.55 833.67
## - duration 1 696.91 834.49
## - installment_rate 1 697.74 835.32
## - other_install 2 710.29 841.32
## - chk_acct 3 768.99 893.46
##
## Step: AIC=823.58
## response ~ chk_acct + duration + amount + installment_rate +
## other_debtor + present_resid + age + other_install + housing +
## n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - other_debtor 2 702.93 814.30
## - present_resid 1 699.13 817.05
## - n_credits 1 699.37 817.29
## - n_people 1 699.55 817.47
## - housing 2 706.33 817.70
## - amount 1 700.23 818.15
## - age 1 701.08 819.00
## - telephone 1 701.20 819.12
## - installment_rate 1 704.63 822.55
## - foreign 1 704.67 822.58
## <none> 699.11 823.58
## - duration 1 706.26 824.18
## - other_install 2 718.73 830.10
## - chk_acct 3 778.39 883.21
##
## Step: AIC=814.3
## response ~ chk_acct + duration + amount + installment_rate +
## present_resid + age + other_install + housing + n_credits +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - present_resid 1 702.94 807.76
## - n_credits 1 703.24 808.05
## - n_people 1 703.28 808.09
## - housing 2 710.26 808.53
## - amount 1 704.27 809.09
## - telephone 1 704.93 809.74
## - age 1 705.14 809.96
## - installment_rate 1 708.50 813.32
## <none> 702.93 814.30
## - duration 1 709.82 814.64
## - foreign 1 709.91 814.72
## - other_install 2 721.67 819.93
## - chk_acct 3 779.65 871.36
##
## Step: AIC=807.76
## response ~ chk_acct + duration + amount + installment_rate +
## age + other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - n_credits 1 703.24 801.50
## - n_people 1 703.28 801.54
## - amount 1 704.28 802.55
## - housing 2 711.03 802.74
## - telephone 1 704.93 803.20
## - age 1 705.33 803.60
## - installment_rate 1 708.50 806.77
## <none> 702.94 807.76
## - duration 1 709.83 808.09
## - foreign 1 709.91 808.17
## - other_install 2 721.77 813.48
## - chk_acct 3 779.73 864.90
##
## Step: AIC=801.5
## response ~ chk_acct + duration + amount + installment_rate +
## age + other_install + housing + n_people + telephone + foreign
##
## Df Deviance AIC
## - n_people 1 703.63 795.35
## - amount 1 704.63 796.34
## - housing 2 711.21 796.38
## - telephone 1 705.14 796.85
## - age 1 705.49 797.21
## - installment_rate 1 708.81 800.52
## <none> 703.24 801.50
## - duration 1 710.07 801.78
## - foreign 1 710.14 801.86
## - other_install 2 722.04 807.21
## - chk_acct 3 779.80 858.41
##
## Step: AIC=795.35
## response ~ chk_acct + duration + amount + installment_rate +
## age + other_install + housing + telephone + foreign
##
## Df Deviance AIC
## - amount 1 705.03 790.19
## - housing 2 711.66 790.27
## - telephone 1 705.53 790.70
## - age 1 705.74 790.90
## - installment_rate 1 709.14 794.31
## <none> 703.63 795.35
## - duration 1 710.41 795.58
## - foreign 1 710.54 795.70
## - other_install 2 722.70 801.31
## - chk_acct 3 780.15 852.21
##
## Step: AIC=790.19
## response ~ chk_acct + duration + installment_rate + age + other_install +
## housing + telephone + foreign
##
## Df Deviance AIC
## - telephone 1 706.25 784.86
## - age 1 706.93 785.55
## - housing 2 713.65 785.71
## - installment_rate 1 709.18 787.79
## <none> 705.03 790.19
## - foreign 1 711.84 790.45
## - other_install 2 724.39 796.45
## - duration 1 723.41 802.02
## - chk_acct 3 780.70 846.21
##
## Step: AIC=784.86
## response ~ chk_acct + duration + installment_rate + age + other_install +
## housing + foreign
##
## Df Deviance AIC
## - housing 2 714.62 780.13
## - age 1 708.69 780.75
## - installment_rate 1 710.47 782.53
## - foreign 1 712.68 784.74
## <none> 706.25 784.86
## - other_install 2 725.39 790.90
## - duration 1 723.85 795.91
## - chk_acct 3 785.27 844.23
##
## Step: AIC=780.13
## response ~ chk_acct + duration + installment_rate + age + other_install +
## foreign
##
## Df Deviance AIC
## - age 1 717.95 776.91
## - installment_rate 1 718.04 777.00
## <none> 714.62 780.13
## - foreign 1 721.34 780.30
## - other_install 2 732.57 784.98
## - duration 1 735.30 794.26
## - chk_acct 3 800.23 846.09
##
## Step: AIC=776.91
## response ~ chk_acct + duration + installment_rate + other_install +
## foreign
##
## Df Deviance AIC
## - installment_rate 1 721.02 773.43
## - foreign 1 724.20 776.61
## <none> 717.95 776.91
## - other_install 2 735.06 780.92
## - duration 1 738.52 790.92
## - chk_acct 3 804.27 843.58
##
## Step: AIC=773.43
## response ~ chk_acct + duration + other_install + foreign
##
## Df Deviance AIC
## <none> 721.02 773.43
## - foreign 1 727.86 773.72
## - other_install 2 738.45 777.76
## - duration 1 741.61 787.47
## - chk_acct 3 806.60 839.36
summary(g.credit.glm.back.BIC)
##
## Call:
## glm(formula = response ~ chk_acct + duration + other_install +
## foreign, family = binomial, data = german_credit_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6638 -0.8233 -0.4746 0.9475 2.4098
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.118756 0.304258 0.390 0.696305
## chk_acctA12 -0.606083 0.220820 -2.745 0.006057 **
## chk_acctA13 -1.122102 0.401245 -2.797 0.005165 **
## chk_acctA14 -2.028836 0.241103 -8.415 < 2e-16 ***
## duration 0.032562 0.007252 4.490 7.12e-06 ***
## other_installA142 0.075650 0.467622 0.162 0.871482
## other_installA143 -0.917145 0.249165 -3.681 0.000232 ***
## foreignA202 -1.735661 0.787882 -2.203 0.027599 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 856.90 on 699 degrees of freedom
## Residual deviance: 721.02 on 692 degrees of freedom
## AIC: 737.02
##
## Number of Fisher Scoring iterations: 5
Selected model by BIC:
Step: AIC=757.28 response ~ chk_acct + duration
bicmodel<-glance(g.credit.glm.back.BIC)[,c(4,5)]
df<-data.frame(rbind(aicmodel,bicmodel))
row.names(df)<-c("AICmodel","BICmodel")
df
## AIC BIC
## AICmodel 711.3967 847.9291
## BICmodel 737.0234 773.4320
Here we observe that BIC being more parsimonious in model selection only selects 2 predictors as significant where as AIC working towards more prediction ability selects 11 predictors.
In this scenario because we desire more prediction ability we select the model selected the AIC criterian.
Below is the in sample ROC curve: ROC Curve
In-sample : Training data
pred.gc.train<- predict(g.credit.glm.back, type="response")
pred0 <- prediction(pred.gc.train, german_credit_train$response)
perf0 <- performance(pred0, "tpr", "fpr")
plot(perf0, colorize=TRUE)
#Get the AUC
AUC0<-unlist(slot(performance(pred0, "auc"), "y.values"))
#0.8496752
Observation: The roc curve looks good and we also observe 0.8162998 as AUC value. As per Rule of thumb AUC value > .70 can be considered satisfactory.ROC curve signifies overall measure of goodness of classification, hence a higher value signifies that model has good classification ability.
But this measure is calculated on training sample, which is not a good data to make a decision. Let’s check the same for out of sample..
Confusion Matrix
Considering the data has asymmetric distribution , we choose cutoff probability as 1/6. The cost function will also change accordingly..
The misclassification rate table is as follows:
pcut1=.16
# get binary prediction
class.gc.train<- (pred.gc.train>pcut1)*1
# get confusion matrix
table(german_credit_train$response, class.gc.train, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 238 251
## 1 22 189
Intepretation
False positive: 218 False Negative: 20
Misclassification = False positive + False Negative We observe that:
There are 20 candidates that were actually defaulters but were predicted as good borrowers these are called as False negative classification.
There are 218 candidates were actually good borrowers but were predicted as bad borrowers these are called as False positive classification.
If we wish to penalize False negatives more then the weights in cost function will change…
#cost function define
costfunc = function(obs, pred.p){
pcut=.16
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
# (equal-weighted) misclassification rate
MR<- mean(german_credit_train$response!=class.gc.train)
#asymettric cose
cost1<-costfunc(class.gc.train,german_credit_train$response)
# False positive rate
FPR<- sum(german_credit_train$response==0 & class.gc.train==1)/sum(german_credit_train$response==0)
# False negative rate (exercise)
FNR<- sum(german_credit_train$response==1 & class.gc.train==0)/sum(german_credit_train$response==1)
trn.val<-data.frame(cbind(MR,cost1,FPR,FNR,AUC0))
colnames(trn.val)<-c("MissClassRt","AsyCost","FPR","FNR","AUC")
rownames(trn.val)<-c("Training Sample")
trn.val
## MissClassRt AsyCost FPR FNR AUC
## Training Sample 0.39 1.824286 0.5132924 0.1042654 0.8162998
Intepretation
Out-of-sample : Testing data ROC Curve
#Part 02 Out of sample prediction
# apply fitted model to test sample (predicted probabilities)
predTst <- predict(g.credit.glm.back,german_credit_test, type="response")
pred2 <- prediction(predTst, german_credit_test$response)
perf2 <- performance(pred2, "tpr", "fpr")
plot(perf2, colorize=TRUE)
#Get the AUC
AUC2<-unlist(slot(performance(pred2, "auc"), "y.values"))
#0.7173563
Intepretation Observation: The roc curve looks good and we also observe 0.769583 as AUC value. As per Rule of thumb AUC value > .70 can be considered satisfactory.ROC curve signifies overall measure of goodness of classification, hence a higher value signifies that model has good classification ability.
Confusion Matrix
Considering the data has asymmetric distribution , we choose cutoff probability as 1/6. The cost function will also change accordingly..
The misclassification rate table is as follows:
# step 1. get binary classification
class.predTst<- (predTst>pcut1)*1
# step 2. get confusion matrix, MR, FPR, FNR
table(german_credit_test$response, class.predTst, dnn = c("True", "Predicted"))
## Predicted
## True 0 1
## 0 99 112
## 1 10 79
Intepretation
False positive: 107 False Negative: 20
Misclassification = False positive + False Negative We observe that:
There are 20 candidates that were actually defaulters but were predicted as good borrowers these are called as False negative classification.
There are 107 candidates were actually good borrowers but were predicted as bad borrowers these are called as False Negative.
If we wish to penalize False negatives more then the weights in cost function will change…
# (equal-weighted) misclassification rate
MR2<- mean(german_credit_test$response!=class.predTst)
#asym cost
cost2<-costfunc(class.predTst,german_credit_test$response)
# False positive rate
FPR2<- sum(german_credit_test$response==0 & class.predTst==1)/sum(german_credit_test$response==0)
# False negative rate (exercise)
FNR2<- sum(german_credit_test$response==1 & class.predTst==0)/sum(german_credit_test$response==1)
test.val<-c(MR2,cost2,FPR2,FNR2,AUC2)
cols<-c("MR","COST","FPR","FNR","AUC")
rows<-c("Training set","Testing set")
report<-rbind(trn.val,test.val)
colnames(report)<-cols
rownames(report)<-rows
costfunc = function(obs, pred.p){
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
pcut = .16
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
credit.glm.cv<- glm(response~. , family=binomial, data=german_credit)
cv.result = cv.glm(data=german_credit, glmfit=credit.glm.cv, cost=costfunc,K=10)
cv.result$delta[2]
## [1] 0.5324
The cross validated error =.5345 Table of Comparision
kable(report) %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive")) %>% scroll_box(width = "100%", height = "170px")
| MR | COST | FPR | FNR | AUC | |
|---|---|---|---|---|---|
| Training set | 0.3900000 | 1.824286 | 0.5132924 | 0.1042654 | 0.8162998 |
| Testing set | 0.4066667 | 1.900000 | 0.5308057 | 0.1123596 | 0.7695830 |
#report
Intepretation
MissClassRt: This signifies the candidates were missclassified, we observe that this has increased on testing sample.
AsyCost: Asymmetric cost based on asymmetric cost function, we considered a cut-off probability 1/6 (equivalent to 5:1 asymmetric cost).we observe that this has increased on testing sample.
FPR: False positive rate signifies that out of total XX% were classified as False positive
FNR: False negative rate signifies that out of total YY% were classified as False Negative
AUC: Area under the curve signifies overall strength of classification.
Our model is not as good to predict the testing values as the training values. FPR and FNR have increased and Area Under the curve has decreased from 85% to 72%. However, 72% is acceptable score and we can conclude that our model is working fairly well.
CART stands for classification and regression tree.
The distinctiove feature with this algo is that we generally have a asymmetric cost function. In the credit scoring case it means that false negatives (predicting 0 when truth is 1, or giving out loans that end up in default) will cost more than false positives (predicting 1 when truth is 0, rejecting loans that you should not reject).
Here we make the assumption that false negative cost 5 times of false positive. In real life the cost structure should be carefully researched.
credit.rpart0 <- rpart(formula = response ~ ., data = german_credit_train, method = "class")
credit.rpart <- rpart(formula = response ~ . , data = german_credit_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
Symmetric Cost
pred0<- predict(credit.rpart0, type="class")
table(german_credit_train$response, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 446 43
## 1 96 115
False negatives are heavy costing so lets, add more penality of 5:1
Asymmetric Cost insample
pred.in<- predict(credit.rpart, type="class")
table(german_credit_train$response, pred.in, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 259 230
## 1 2 209
As expected our False negatives have reduced.Hence, we select credit.rpart as our model for Binary classification Tree.
Missclassification rate & LOSS Value
cost <- function(r, pi){
weight1 = 5
weight0 = 1
c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
#LOSS
(insamplecost.df<-cost(german_credit_train$response,pred.in))
## [1] 0.3428571
#Misclassification rate
(insampleMCR.df<-mean(german_credit_train$response!=pred.in))
## [1] 0.3314286
#roc
credit.train.pred.in.prob.rpart<- predict(credit.rpart, german_credit_train, type="prob")[,2]
pred <- prediction(credit.train.pred.in.prob.rpart, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7986218
LOSS:.342 MCR:.331 AUC:.7986 Out of sample Prediction
pred.out<- predict(credit.rpart, german_credit_test, type="class")
table(german_credit_test$response, pred.out, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 98 113
## 1 16 73
Missclassification rate & LOSS Value
#LOSS
(outsamplecost.df<-cost(german_credit_test$response,pred.out))
## [1] 0.6433333
#Misclassification rate
(outsampleMCR.df<-mean(german_credit_test$response!=pred.out))
## [1] 0.43
#roc
credit.test.pred.out.prob.rpart<- predict(credit.rpart, german_credit_test, type="prob")[,2]
pred <- prediction(credit.test.pred.out.prob.rpart, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7234677
LOSS:.643 MCR:.43 AUC:.723 Pruning a tree
The ideal way of creating a binary tree is to construct a large tree and then prune it to an optimum level
credit.rpart_prune <- rpart(formula = response ~ . , data = german_credit_train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)),cp = 0.001)
We have a bushy tree, Now we can try to prune up to a optimum level using plotcp function
prp(credit.rpart_prune)
plotcp(credit.rpart_prune)
printcp(credit.rpart_prune)
##
## Classification tree:
## rpart(formula = response ~ ., data = german_credit_train, method = "class",
## parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)), cp = 0.001)
##
## Variables actually used in tree construction:
## [1] age amount chk_acct credit_his duration
## [6] job n_credits other_debtor other_install present_emp
## [11] property purpose saving_acct
##
## Root node error: 489/700 = 0.69857
##
## n= 700
##
## CP nsplit rel error xerror xstd
## 1 0.1053170 0 1.00000 5.0000 0.12414
## 2 0.0439673 2 0.78937 2.5583 0.12763
## 3 0.0306748 4 0.70143 2.9243 0.13108
## 4 0.0143149 6 0.64008 2.8241 0.13025
## 5 0.0129516 13 0.53988 2.7321 0.12879
## 6 0.0102249 16 0.50102 2.6933 0.12835
## 7 0.0081800 17 0.49080 2.6933 0.12835
## 8 0.0068166 18 0.48262 2.6421 0.12750
## 9 0.0040900 24 0.43967 2.3865 0.12374
## 10 0.0010225 28 0.42331 2.3967 0.12388
## 11 0.0010000 30 0.42127 2.3967 0.12388
xerror gives you the cross-validation (default is 10-fold) error. You can see that the rel error (in-sample error) is always decreasing as model is more complex, while the cross-validation error (measure of performance on future observations) is not. That is why we prune the tree to avoid overfitting the training data.
credit.tree<-prune(credit.rpart_prune, cp = 0.002)
Insample Prediction
credit.train.pred.in<- predict(credit.tree, german_credit_train, type="class")
table(german_credit_train$response, credit.train.pred.in, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 317 172
## 1 7 204
Here, we observe that FN are few more than the last tree(made with default parameters). FP although seems to have improved for insample data.
Missclassification rate & LOSS Value
#LOSS
(insamplecost.pr<-cost(german_credit_train$response,credit.train.pred.in))
## [1] 0.2957143
#Misclassification rate
(insampleMCR.pr<-mean(german_credit_train$response!=credit.train.pred.in))
## [1] 0.2557143
#roc
credit.train.pred.in.prob<- predict(credit.tree, german_credit_train, type="prob")[,2]
pred <- prediction(credit.train.pred.in.prob, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8674343
LOSS:.295 MCR:.255 AUC:.867 Out of sample Prediction
credit.test.pred.out<- predict(credit.tree, german_credit_test, type="class")
table(german_credit_test$response, credit.test.pred.out, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 124 87
## 1 23 66
Missclassification rate & LOSS Value
#LOSS
(outsamplecost.pr<-cost(german_credit_test$response,credit.test.pred.out))
## [1] 0.6733333
#Misclassification rate
(outsampleMCR.pr<-mean(german_credit_test$response!=credit.test.pred.out))
## [1] 0.3666667
#roc
credit.test.pred.in.prob<- predict(credit.tree, german_credit_test, type="prob")[,2]
pred <- prediction(credit.test.pred.in.prob, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7409074
LOSS:.673 MCR:.366 AUC:.740
What tree do we choose?
The insample performance is better for pruned tree(but that not what we are concerned about) The out of sample Loss is more in pruned tree,but overall MCR is better.
Hence, to choose a tree we need to choose between these to trade-off parameters. Here, we need lesser loss the better(we see small difference .67 and .64) but a significant difference between MCR(.36 and .48) between the two.Also AUC is beter for prune Model.
Hence, we choose the pruned tree: credit.tree
Please note, we can use different mechanisms to choose the best model such as : AUC, LOSS, MCR.
Advanced Tree Models – Bagging, Random Forests, and Boosting
Bagging stands for Bootstrap and Aggregating. It employs the idea of bootstrap but the purpose is not to study bias and standard errors of estimates. Instead, the goal of Bagging is to improve prediction accuracy. It fits a tree for each bootsrap sample, and then aggregate the predicted values from all these different trees. For more details, you may look at Wikepedia, or you can find the original paper Leo Breiman (1996).
To my best knowledge, it seems that bagging() won’t take an argument for asymmetric loss. Therefore, the classification results might not be appropriate.Lets check it out…
credit.bg <- randomForest(as.factor(response)~., data = german_credit_train,mtry=ncol(german_credit_train)-1,ntree=1000)
credit.bg
##
## Call:
## randomForest(formula = as.factor(response) ~ ., data = german_credit_train, mtry = ncol(german_credit_train) - 1, ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 20
##
## OOB estimate of error rate: 25.14%
## Confusion matrix:
## 0 1 class.error
## 0 425 64 0.1308793
## 1 112 99 0.5308057
Models sensitivity will be less because the FN are high which is concerning, although FP are less which is good thing.
plot(credit.bg, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))
Insample Analysis
## confusion matrix
credit.bg.pred<- predict(credit.bg, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.bg.pred.class<- (credit.bg.pred>optimal.pcut)*1
table(german_credit_train$response, credit.bg.pred.class, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 206 283
## 1 22 189
Lets calculate the Loss and MCR for insample
#LOSS
(BG.insamplecost.pr<-cost(german_credit_train$response,credit.bg.pred.class))
## [1] 0.5614286
#Misclassification rate
(BG.insampleMCR.pr<-mean(german_credit_train$response!=credit.bg.pred.class))
## [1] 0.4357143
#roc
pred <- prediction(credit.bg.pred, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7789133
LOSS is (.58)very high because rpart does not have a indefined cost assymetric parameter MCR=.44 Approx AUC=.772 Out of sample Analysis
Please higher value of AUC can also be by fluke
## confusion matrix
optimal.pcut= .16#our assumption
credit.bg.pred_test<- predict(credit.bg,newdata=german_credit_test, type = "prob")[,2]
credit.bg.pred.class.test<- (credit.bg.pred_test>optimal.pcut)*1
table(german_credit_test$response, credit.bg.pred.class.test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 88 123
## 1 7 82
We can see the FN have reduced in Out of sample as well.
Lets calculate the Loss and MCR for insample and out of sample Missclassification rate & LOSS Value
#LOSS
(BG.outsamplecost.pr<-cost(german_credit_test$response,credit.bg.pred.class.test))
## [1] 0.5266667
#Misclassification rate
(BG.outsampleMCR.pr<-mean(german_credit_test$response!=credit.bg.pred.class.test))
## [1] 0.4333333
#roc
pred <- prediction(credit.bg.pred_test, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8004952
LOSS=.51 MCR=.43 AUC=.800
credit.rf <- randomForest(as.factor(response)~., data = german_credit_train,mtry=sqrt(ncol(german_credit_train)-1),ntree=1000)
credit.rf
##
## Call:
## randomForest(formula = as.factor(response) ~ ., data = german_credit_train, mtry = sqrt(ncol(german_credit_train) - 1), ntree = 1000)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 24.57%
## Confusion matrix:
## 0 1 class.error
## 0 446 43 0.08793456
## 1 129 82 0.61137441
Models sensitivity will be less because the FN are high which is concerning, although FP are less which is good thing.
plot(credit.rf, lwd=rep(2, 3))
legend("right", legend = c("OOB Error", "FPR", "FNR"), lwd=rep(2, 3), lty = c(1,2,3), col = c("black", "red", "green"))
Insample Analysis
## confusion matrix
credit.rf.pred<- predict(credit.rf, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.rf.pred.class<- (credit.rf.pred>optimal.pcut)*1
table(german_credit_train$response, credit.rf.pred.class, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 167 322
## 1 19 192
Note thatFN are less which is a positive sign..
Lets calculate the Loss and MCR for insample
#LOSS
(rf.insamplecost<-cost(german_credit_train$response,credit.rf.pred.class))
## [1] 0.5957143
#Misclassification rate
(rf.insampleMCR<-mean(german_credit_train$response!=credit.rf.pred.class))
## [1] 0.4871429
#roc
pred <- prediction(credit.rf.pred, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7757053
LOSS=.58 MCR=.44 AUC=.781
Out of sample Analysis
## confusion matrix
credit.rf.pred_test<- predict(credit.rf,newdata=german_credit_test, type = "prob")[,2]
optimal.pcut= .16#our assumption
credit.rf.pred.class.test<- (credit.rf.pred_test>optimal.pcut)*1
table(german_credit_test$response, credit.rf.pred.class.test, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 76 135
## 1 4 85
Lets calculate the Loss and MCR for insample and out of sample Missclassification rate & LOSS Value
#LOSS
(rf.outsamplecost<-cost(german_credit_test$response,credit.rf.pred.class.test))
## [1] 0.5166667
#Misclassification rate
(rf.outsampleMCR<-mean(german_credit_test$response!=credit.rf.pred.class.test))
## [1] 0.4633333
#roc
pred <- prediction(credit.rf.pred_test, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8065126
LOSS=.5133 MCR=.46 AUC=.805
Boosting builds a number of small trees, and each time, the response is the residual from last tree. It is a sequential procedure. We use gbm package to build boosted trees.
credit.bo= gbm(response~., data = german_credit_train, distribution = "bernoulli",n.trees = 100, shrinkage = 0.01, interaction.depth = 8)
summary(credit.bo)
## var rel.inf
## chk_acct chk_acct 24.79473162
## purpose purpose 13.93340620
## duration duration 12.26373402
## amount amount 7.88413169
## other_install other_install 6.58860998
## credit_his credit_his 5.69527536
## present_emp present_emp 5.05194876
## saving_acct saving_acct 4.90412351
## property property 4.55080555
## age age 4.49352334
## installment_rate installment_rate 1.93167495
## present_resid present_resid 1.88800520
## housing housing 1.75965162
## job job 1.43809345
## sex sex 1.35982765
## other_debtor other_debtor 0.63424947
## n_credits n_credits 0.34181637
## telephone telephone 0.26795525
## n_people n_people 0.13894866
## foreign foreign 0.07948735
par(mfrow=c(1,3))
plot(credit.bo, i="chk_acct")
plot(credit.bo, i="duration")
plot(credit.bo, i="purpose")
Insample analysis
pred.credit.bo.in<- predict(credit.bo, newdata = german_credit_train,type ="response" ,n.trees =100 )
optimal.pcut= .16#our assumption
credit.bo.pred.class<- (pred.credit.bo.in>optimal.pcut)*1
table(german_credit_train$response, credit.bo.pred.class, dnn = c("True", "Pred"))
## Pred
## True 1
## 0 489
## 1 211
#LOSS
(bo.insamplecost<-cost(german_credit_train$response,credit.bo.pred.class))
## [1] 0.6985714
#Misclassification rate
(bo.insampleMCR<-mean(german_credit_train$response!=credit.bo.pred.class))
## [1] 0.6985714
#class(pred.credit.bo.in)
pred <- prediction(pred.credit.bo.in, german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8853837
LOSS:0.6985714 MCR:0.6985714 AUC:.886
Out of sample analysis
pred.credit.bo.out<- predict(credit.bo, newdata = german_credit_test,type ="response" ,n.trees =100 )
optimal.pcut= .16#our assumption
credit.bo.pred.class<- (pred.credit.bo.out>optimal.pcut)*1
table(german_credit_test$response, credit.bo.pred.class, dnn = c("True", "Pred"))
## Pred
## True 1
## 0 211
## 1 89
#LOSS
(bo.outsamplecost<-cost(german_credit_test$response,credit.bo.pred.class))
## [1] 0.7033333
#Misclassification rate
(bo.outsampleMCR<-mean(german_credit_test$response!=credit.bo.pred.class))
## [1] 0.7033333
pred <- prediction(pred.credit.bo.out, german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8173492
LOSS:0.703 MCR:0.703 AUC:.819
n <- names(german_credit_train)
cont<-c("duration","age","amount","response")
f <- as.formula(paste("response ~s(duration)+s(age)+s(amount) +", paste(n[!n %in% cont], collapse = " + ")))
f
## response ~ s(duration) + s(age) + s(amount) + chk_acct + credit_his +
## purpose + saving_acct + present_emp + installment_rate +
## sex + other_debtor + present_resid + property + other_install +
## housing + n_credits + job + n_people + telephone + foreign
credit.gam <- gam(formula = f, family=binomial,data=german_credit_train);
summary(credit.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## response ~ s(duration) + s(age) + s(amount) + chk_acct + credit_his +
## purpose + saving_acct + present_emp + installment_rate +
## sex + other_debtor + present_resid + property + other_install +
## housing + n_credits + job + n_people + telephone + foreign
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.713e-01 1.217e+00 -0.141 0.88808
## chk_acctA12 -3.865e-01 2.670e-01 -1.448 0.14772
## chk_acctA13 -8.783e-01 4.446e-01 -1.976 0.04819 *
## chk_acctA14 -1.713e+00 2.785e-01 -6.150 7.73e-10 ***
## credit_hisA31 9.244e-01 7.644e-01 1.209 0.22654
## credit_hisA32 4.953e-02 5.750e-01 0.086 0.93136
## credit_hisA33 -3.767e-01 6.105e-01 -0.617 0.53720
## credit_hisA34 -9.624e-01 5.810e-01 -1.656 0.09764 .
## purposeA41 -1.385e+00 4.283e-01 -3.234 0.00122 **
## purposeA410 -2.374e+00 1.222e+00 -1.943 0.05201 .
## purposeA42 -8.916e-01 3.238e-01 -2.754 0.00589 **
## purposeA43 -8.111e-01 2.949e-01 -2.751 0.00594 **
## purposeA44 -8.147e-01 8.736e-01 -0.933 0.35099
## purposeA45 -3.656e-01 6.160e-01 -0.594 0.55284
## purposeA46 -1.802e-02 5.118e-01 -0.035 0.97191
## purposeA48 -3.922e+01 3.355e+07 0.000 1.00000
## purposeA49 -4.909e-01 3.994e-01 -1.229 0.21906
## saving_acctA62 -2.220e-01 3.528e-01 -0.629 0.52922
## saving_acctA63 -9.380e-02 4.548e-01 -0.206 0.83660
## saving_acctA64 -6.864e-01 6.235e-01 -1.101 0.27090
## saving_acctA65 -1.074e+00 3.318e-01 -3.236 0.00121 **
## present_empA72 -4.915e-01 5.071e-01 -0.969 0.33241
## present_empA73 -6.510e-01 4.884e-01 -1.333 0.18254
## present_empA74 -1.243e+00 5.321e-01 -2.337 0.01944 *
## present_empA75 -6.279e-01 4.946e-01 -1.270 0.20424
## installment_rate 2.954e-01 1.084e-01 2.726 0.00642 **
## sexA92 2.369e-01 4.855e-01 0.488 0.62553
## sexA93 -3.098e-01 4.767e-01 -0.650 0.51571
## sexA94 -2.722e-02 5.502e-01 -0.049 0.96054
## other_debtorA102 8.110e-02 5.064e-01 0.160 0.87276
## other_debtorA103 -7.361e-01 4.746e-01 -1.551 0.12096
## present_resid 3.749e-02 1.075e-01 0.349 0.72728
## propertyA122 4.050e-01 3.033e-01 1.335 0.18176
## propertyA123 1.077e-01 2.830e-01 0.381 0.70351
## propertyA124 4.756e-01 5.127e-01 0.928 0.35359
## other_installA142 7.392e-02 5.406e-01 0.137 0.89124
## other_installA143 -9.441e-01 2.954e-01 -3.195 0.00140 **
## housingA152 -4.925e-01 2.852e-01 -1.727 0.08419 .
## housingA153 -4.355e-01 5.638e-01 -0.772 0.43993
## n_credits 5.473e-01 2.365e-01 2.314 0.02067 *
## jobA172 8.127e-01 7.422e-01 1.095 0.27352
## jobA173 8.556e-01 7.096e-01 1.206 0.22790
## jobA174 7.307e-01 7.225e-01 1.011 0.31180
## n_people 2.842e-01 3.015e-01 0.943 0.34585
## telephoneA192 -2.878e-01 2.446e-01 -1.176 0.23943
## foreignA202 -1.550e+00 8.599e-01 -1.803 0.07144 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(duration) 1.268 1.489 7.828 0.00826 **
## s(age) 1.001 1.002 2.147 0.14294
## s(amount) 1.531 1.901 4.085 0.18323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.256 Deviance explained = 26.9%
## UBRE = 0.037162 Scale est. = 1 n = 700
plot(credit.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)
AIC(credit.gam)
## [1] 726.0132
BIC(credit.gam)
## [1] 952.6584
credit.gam$deviance
## [1] 626.4126
We can notice that even the continious variables in the data do not have a non-linear distribution. This is backed up by the above plots where we can see a straight line
Insample Performance
pcut.gam <- .16
prob.gam.in<-predict(credit.gam,german_credit_train,type="response")
pred.gam.in.class<-(prob.gam.in>=pcut.gam)*1
table(german_credit_train$response,pred.gam.in.class,dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 264 225
## 1 21 190
FN : Looks fairly well but the FP looks really high
#LOSS
(gam.insamplecost<-cost(german_credit_train$response,pred.gam.in.class))
## [1] 0.4714286
#Misclassification rate
(gam.insampleMCR<-mean(german_credit_train$response!=pred.gam.in.class))
## [1] 0.3514286
pred <- prediction(as.numeric(prob.gam.in),german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.8340457
LOSS: .471 MCR: .351 AUC: .834
Out of sample Performance
pcut.gam <- .16
prob.gam.out<-predict(credit.gam,german_credit_test,type="response")
pred.gam.out.class<-(prob.gam.out>=pcut.gam)*1
table(german_credit_test$response,pred.gam.out.class,dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 102 109
## 1 8 81
FN : Looks fairly well but the FP looks really high
#LOSS
(gam.out.samplecost<-cost(german_credit_test$response,pred.gam.out.class))
## [1] 0.4966667
#Misclassification rate
(gam.out.sampleMCR<-mean(german_credit_test$response!=pred.gam.out.class))
## [1] 0.39
pred <- prediction(as.numeric(prob.gam.out),german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7912029
LOSS:.496 MCR: .390 AUC: .791
credit.nnet <- nnet(response~., data=german_credit_train, size=8,decay=0.003, maxit=500)
## # weights: 401
## initial value 199.027104
## iter 10 value 147.060497
## iter 20 value 134.217896
## iter 30 value 126.707014
## iter 40 value 113.508218
## iter 50 value 103.778762
## iter 60 value 101.224480
## iter 70 value 100.659776
## iter 80 value 100.619145
## iter 90 value 100.618564
## iter 100 value 100.618461
## final value 100.618332
## converged
Insample Analysis
prob.nnet= predict(credit.nnet,german_credit_train)
pred.nnet = as.numeric(prob.nnet > 0.16)
table(german_credit_train$response,pred.nnet, dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 399 90
## 1 65 146
#LOSS
(nnet.insamplecost<-cost(german_credit_train$response,pred.nnet))
## [1] 0.5928571
#Misclassification rate
(nnet.insampleMCR<-mean(german_credit_train$response!=pred.nnet))
## [1] 0.2214286
pred <- prediction(prob.nnet,german_credit_train$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.791789
LOSS:0.274 MCR:.108 AUC:.737
Out of sample Analysis
prob.nnet.out= predict(credit.nnet,german_credit_test)
pred.nnet.out = as.numeric(prob.nnet.out > 0.16)
table(german_credit_test$response,pred.nnet.out, dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 164 47
## 1 30 59
#LOSS
(nnet.insamplecost<-cost(german_credit_test$response,pred.nnet.out))
## [1] 0.6566667
#Misclassification rate
(nnet.insampleMCR<-mean(german_credit_test$response!=pred.nnet.out))
## [1] 0.2566667
pred <- prediction(prob.nnet.out,german_credit_test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7842271
LOSS: .74 MCR:.30 AUC: .737
Conclusion:
Ensemble methods we observe are not performing well on this data set. The reason is they do not have a asymmetric cost setting functionality as in rpart. GAM do take a lead here and outperform all other algorithms