Problem Testing out several different supervised learning algorithms over credit data of subjects to find the one that accurately predicts if an individual will be good candidate to loan or will tend to default. Random sample a training data set that contains 80% of the original data points. Please use 5:1 cost ratio and pcut=1/6. Methods of analysis includes the Summary statistics of the variables and finding correlation between variables, Exploratory data analysis using visualization. Random sample a training data set that contains 80% of original data points. Build a generalized linear regression as in HW3.Try different link functions (logistic, probit, complementary log-log link) and compare. Fitting various models such as generalized linear regression using different variable selections and finding the best model. Find a best model for Credit Scoring data using logistic regression with AIC and BIC. Draw ROC curve, report the AUC, and present the misclassification rate table of your final model. Test the out-of-sample performance. Calculate 5-fold cross validation, fit a regression tree (CART) on the same data and compare it with the logistic regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.
This report provides an analysis and evaluation of the factors affecting the risk level of the individuals for taking loan. To start with, the random sample with 80:20 dataset has to be created, find a best model using logistic regression with AIC and BIC and LASSO variable selection. Using the final model, Report in-sample (MSE) and out-of-sample prediction errors (MSPE). Build a regression tree and compare it with the logistic regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.
Initially, exploratory data analysis on the dataset has been performed. Further, a logistic regression model has been built to predict customers as risky or not, along with variable selection for the model building process.
The German credit scoring data is a dataset provided by Prof. Hogmann in the file german.data. The data set has information about 1000 individuals, on the basis of which they have been classified as risky or not. The variable response in the dataset corresponds to the risk label, 1 has been classified as bad and 2 has been classified as good.
The Final model has 44% accurate in predicting the defaulters. And, the Asymmetric cost is 0.56125 in the testing dataset. To further the analysis, I will have a look at the model building and variable selection, there are both positive and negative correlation with the response variable which explains that the final model is not as good to predict the testing values as the training values. However, Area Under the curve for the final model has increased from 81.5% to 57.7% which clearly shows that it is not the right model for the prediction. However, AUC for the regression tree denotes 81% which changes slightly to 79% which is an acceptable score and I can conclude that our model is working fairly well for the regression tree.
german_data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
#Assigning variable names
colnames(german_data)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")
dim(german_data)
## [1] 1000 21
#Response is in 1,2 - we need to change it to 0,1
german_data$response = german_data$response - 1
head(german_data)
## chk_acct duration credit_his purpose amount saving_acct present_emp
## 1 A11 6 A34 A43 1169 A65 A75
## 2 A12 48 A32 A43 5951 A61 A73
## 3 A14 12 A34 A46 2096 A61 A74
## 4 A11 42 A32 A42 7882 A61 A74
## 5 A11 24 A33 A40 4870 A61 A73
## 6 A14 36 A32 A46 9055 A65 A73
## installment_rate sex other_debtor present_resid property age other_install
## 1 4 A93 A101 4 A121 67 A143
## 2 2 A92 A101 2 A121 22 A143
## 3 2 A93 A101 3 A121 49 A143
## 4 2 A93 A103 4 A122 45 A143
## 5 3 A93 A101 4 A124 53 A143
## 6 2 A93 A101 4 A124 35 A143
## housing n_credits job n_people telephone foreign response
## 1 A152 2 A173 1 A192 A201 0
## 2 A152 1 A173 1 A191 A201 1
## 3 A152 1 A172 2 A191 A201 0
## 4 A153 1 A173 2 A191 A201 0
## 5 A153 2 A173 2 A191 A201 1
## 6 A153 1 A172 2 A192 A201 0
glimpse(german_data)
## Rows: 1,000
## Columns: 21
## $ chk_acct <chr> "A11", "A12", "A14", "A11", "A11", "A14", "A14", "A1…
## $ duration <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 2…
## $ credit_his <chr> "A34", "A32", "A34", "A32", "A33", "A32", "A32", "A3…
## $ purpose <chr> "A43", "A43", "A46", "A42", "A40", "A46", "A42", "A4…
## $ amount <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059…
## $ saving_acct <chr> "A65", "A61", "A61", "A61", "A61", "A65", "A63", "A6…
## $ present_emp <chr> "A75", "A73", "A74", "A74", "A73", "A73", "A75", "A7…
## $ installment_rate <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2…
## $ sex <chr> "A93", "A92", "A93", "A93", "A93", "A93", "A93", "A9…
## $ other_debtor <chr> "A101", "A101", "A101", "A103", "A101", "A101", "A10…
## $ present_resid <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3…
## $ property <chr> "A121", "A121", "A121", "A122", "A124", "A124", "A12…
## $ age <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, …
## $ other_install <chr> "A143", "A143", "A143", "A143", "A143", "A143", "A14…
## $ housing <chr> "A152", "A152", "A152", "A153", "A153", "A153", "A15…
## $ n_credits <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3…
## $ job <chr> "A173", "A173", "A172", "A173", "A173", "A172", "A17…
## $ n_people <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ telephone <chr> "A192", "A191", "A191", "A191", "A191", "A192", "A19…
## $ foreign <chr> "A201", "A201", "A201", "A201", "A201", "A201", "A20…
## $ response <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0…
german_data$sex<- as.factor(german_data$sex)
set.seed(14006542)
#in.train <- createDataPartition(as.factor(german_data$response), p=0.8, list=FALSE)
in.train <- sample(nrow(german_data),nrow(german_data)*0.70)
train <- german_data[in.train,]
test <- german_data[-in.train,]
summary(german_data)
## chk_acct duration credit_his purpose
## Length:1000 Min. : 4.0 Length:1000 Length:1000
## Class :character 1st Qu.:12.0 Class :character Class :character
## Mode :character Median :18.0 Mode :character Mode :character
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## amount saving_acct present_emp installment_rate
## Min. : 250 Length:1000 Length:1000 Min. :1.000
## 1st Qu.: 1366 Class :character Class :character 1st Qu.:2.000
## Median : 2320 Mode :character Mode :character Median :3.000
## Mean : 3271 Mean :2.973
## 3rd Qu.: 3972 3rd Qu.:4.000
## Max. :18424 Max. :4.000
## sex other_debtor present_resid property
## A91: 50 Length:1000 Min. :1.000 Length:1000
## A92:310 Class :character 1st Qu.:2.000 Class :character
## A93:548 Mode :character Median :3.000 Mode :character
## A94: 92 Mean :2.845
## 3rd Qu.:4.000
## Max. :4.000
## age other_install housing n_credits
## Min. :19.00 Length:1000 Length:1000 Min. :1.000
## 1st Qu.:27.00 Class :character Class :character 1st Qu.:1.000
## Median :33.00 Mode :character Mode :character Median :1.000
## Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
## job n_people telephone foreign
## Length:1000 Min. :1.000 Length:1000 Length:1000
## Class :character 1st Qu.:1.000 Class :character Class :character
## Mode :character Median :1.000 Mode :character Mode :character
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
## response
## Min. :0.0
## 1st Qu.:0.0
## Median :0.0
## Mean :0.3
## 3rd Qu.:1.0
## Max. :1.0
str(german_data)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : chr "A11" "A12" "A14" "A11" ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : chr "A34" "A32" "A34" "A32" ...
## $ purpose : chr "A43" "A43" "A46" "A42" ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving_acct : chr "A65" "A61" "A61" "A61" ...
## $ present_emp : chr "A75" "A73" "A74" "A74" ...
## $ installment_rate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : chr "A101" "A101" "A101" "A103" ...
## $ present_resid : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : chr "A121" "A121" "A121" "A122" ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : chr "A143" "A143" "A143" "A143" ...
## $ housing : chr "A152" "A152" "A152" "A153" ...
## $ n_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : chr "A173" "A173" "A172" "A173" ...
## $ n_people : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : chr "A192" "A191" "A191" "A191" ...
## $ foreign : chr "A201" "A201" "A201" "A201" ...
## $ response : num 0 1 0 0 1 0 0 0 0 1 ...
sum(is.na(german_data[,colnames(german_data)]))
## [1] 0
#duration
g1<- ggplot(german_data, aes(x = as.factor(response), y = duration, fill = as.factor(response))) + geom_boxplot() + theme(legend.position = "none")
#amount
g2<-ggplot(german_data, aes(x = as.factor(response), y = amount, fill = as.factor(response))) + geom_boxplot() +
theme(legend.position = "none")
#AGE
g3<-ggplot(german_data, aes(x = as.factor(response), y = age, fill = as.factor(response))) +
geom_boxplot() + theme(legend.position = "none")
ggarrange(g1, g2,g3, labels = c("duration", "amount","age"),
ncol = 3, nrow = 1)
g4 <- ggplot(german_data, aes(factor(installment_rate), ..count..)) +
geom_bar(aes(fill = response), position = "dodge") + xlab("Installment Rates")
ggplot(german_data, aes(chk_acct, ..count..)) + geom_bar(aes(fill = response), position = "dodge")
## Model Selection
german_model<- glm(response~., family=binomial, data=train)
summary(german_model)
##
## Call:
## glm(formula = response ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4484 -0.7036 -0.3852 0.7243 2.4659
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8851026 1.3845270 0.639 0.522640
## chk_acctA12 -0.5030711 0.2673708 -1.882 0.059897 .
## chk_acctA13 -0.4253927 0.4200581 -1.013 0.311204
## chk_acctA14 -1.6142326 0.2775579 -5.816 6.03e-09 ***
## duration 0.0184715 0.0112729 1.639 0.101303
## credit_hisA31 -0.1722079 0.6821243 -0.252 0.800687
## credit_hisA32 -0.9648772 0.5343126 -1.806 0.070945 .
## credit_hisA33 -1.1817914 0.5741403 -2.058 0.039555 *
## credit_hisA34 -2.0534852 0.5532425 -3.712 0.000206 ***
## purposeA41 -1.5070005 0.4352014 -3.463 0.000535 ***
## purposeA410 -1.4876300 0.8887980 -1.674 0.094179 .
## purposeA42 -0.9582826 0.3093139 -3.098 0.001948 **
## purposeA43 -1.0555883 0.3014197 -3.502 0.000462 ***
## purposeA44 0.4579776 1.0787464 0.425 0.671168
## purposeA45 -0.7796829 0.7167067 -1.088 0.276653
## purposeA46 0.4725778 0.4886015 0.967 0.333442
## purposeA48 -1.9697065 1.2389177 -1.590 0.111866
## purposeA49 -0.6796107 0.4099511 -1.658 0.097361 .
## amount 0.0001302 0.0000520 2.504 0.012294 *
## saving_acctA62 -0.2421310 0.3338108 -0.725 0.468235
## saving_acctA63 -0.6939914 0.5295323 -1.311 0.190002
## saving_acctA64 -0.8499188 0.5955164 -1.427 0.153523
## saving_acctA65 -0.6238339 0.2992676 -2.085 0.037111 *
## present_empA72 0.2094153 0.5311415 0.394 0.693379
## present_empA73 0.1673502 0.5108755 0.328 0.743233
## present_empA74 -0.3912869 0.5457785 -0.717 0.473415
## present_empA75 0.0197675 0.5069837 0.039 0.968898
## installment_rate 0.2857535 0.1083350 2.638 0.008347 **
## sexA92 0.0167802 0.4915429 0.034 0.972767
## sexA93 -0.5008504 0.4739034 -1.057 0.290575
## sexA94 -0.1609017 0.5728093 -0.281 0.778788
## other_debtorA102 0.5149136 0.4805445 1.072 0.283935
## other_debtorA103 -0.8208256 0.5162608 -1.590 0.111848
## present_resid 0.0160511 0.1088301 0.147 0.882747
## propertyA122 0.2391509 0.3059571 0.782 0.434421
## propertyA123 0.0252650 0.2861699 0.088 0.929649
## propertyA124 0.5562042 0.4908042 1.133 0.257109
## age -0.0183609 0.0113005 -1.625 0.104209
## other_installA142 0.0420092 0.4737015 0.089 0.929334
## other_installA143 -0.6097567 0.2942054 -2.073 0.038214 *
## housingA152 -0.2891686 0.2838236 -1.019 0.308283
## housingA153 -0.3839094 0.5566959 -0.690 0.490432
## n_credits 0.1383620 0.2320672 0.596 0.551031
## jobA172 0.4854772 0.8863314 0.548 0.583872
## jobA173 0.4167046 0.8676803 0.480 0.631049
## jobA174 0.8721011 0.8788201 0.992 0.321025
## n_people 0.1470461 0.2955121 0.498 0.618768
## telephoneA192 -0.3908976 0.2416691 -1.617 0.105772
## foreignA202 -1.0824703 0.6835033 -1.584 0.113260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 629.88 on 651 degrees of freedom
## AIC: 727.88
##
## Number of Fisher Scoring iterations: 5
predict_logit<- predict(german_model, test, type = "response")
prob.glm1.insample <- predict(german_model, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.3514286
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5342857
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area
## [1] 0.8311824
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length
## [1] 0.480079
german_model_probit<- glm(response~., family=binomial(link = "probit"), data=train)
summary(german_model_probit)
##
## Call:
## glm(formula = response ~ ., family = binomial(link = "probit"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4086 -0.7191 -0.3714 0.7540 2.4914
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.495e-01 7.947e-01 0.566 0.571677
## chk_acctA12 -2.919e-01 1.563e-01 -1.868 0.061767 .
## chk_acctA13 -2.578e-01 2.465e-01 -1.046 0.295592
## chk_acctA14 -9.491e-01 1.576e-01 -6.024 1.7e-09 ***
## duration 1.036e-02 6.591e-03 1.571 0.116116
## credit_hisA31 -1.355e-01 3.973e-01 -0.341 0.733058
## credit_hisA32 -5.626e-01 3.095e-01 -1.818 0.069129 .
## credit_hisA33 -6.827e-01 3.332e-01 -2.049 0.040444 *
## credit_hisA34 -1.185e+00 3.182e-01 -3.723 0.000197 ***
## purposeA41 -8.863e-01 2.479e-01 -3.575 0.000350 ***
## purposeA410 -8.715e-01 5.177e-01 -1.683 0.092284 .
## purposeA42 -5.377e-01 1.792e-01 -3.001 0.002691 **
## purposeA43 -5.884e-01 1.726e-01 -3.409 0.000652 ***
## purposeA44 2.305e-01 6.061e-01 0.380 0.703660
## purposeA45 -3.915e-01 4.050e-01 -0.967 0.333734
## purposeA46 2.799e-01 2.828e-01 0.990 0.322272
## purposeA48 -1.151e+00 6.984e-01 -1.648 0.099395 .
## purposeA49 -3.676e-01 2.351e-01 -1.564 0.117813
## amount 7.482e-05 3.014e-05 2.482 0.013065 *
## saving_acctA62 -1.485e-01 1.942e-01 -0.765 0.444380
## saving_acctA63 -3.927e-01 2.868e-01 -1.369 0.170944
## saving_acctA64 -4.477e-01 3.253e-01 -1.376 0.168753
## saving_acctA65 -3.625e-01 1.703e-01 -2.129 0.033237 *
## present_empA72 1.193e-01 3.104e-01 0.385 0.700602
## present_empA73 1.161e-01 2.983e-01 0.389 0.697111
## present_empA74 -2.054e-01 3.165e-01 -0.649 0.516263
## present_empA75 2.104e-02 2.956e-01 0.071 0.943251
## installment_rate 1.621e-01 6.218e-02 2.606 0.009152 **
## sexA92 2.404e-02 2.873e-01 0.084 0.933299
## sexA93 -2.610e-01 2.767e-01 -0.943 0.345638
## sexA94 -8.603e-02 3.348e-01 -0.257 0.797214
## other_debtorA102 2.462e-01 2.843e-01 0.866 0.386491
## other_debtorA103 -4.622e-01 2.885e-01 -1.602 0.109110
## present_resid 4.170e-03 6.280e-02 0.066 0.947055
## propertyA122 1.748e-01 1.748e-01 1.000 0.317216
## propertyA123 2.206e-02 1.652e-01 0.134 0.893731
## propertyA124 3.308e-01 2.808e-01 1.178 0.238777
## age -1.008e-02 6.460e-03 -1.561 0.118577
## other_installA142 2.403e-02 2.787e-01 0.086 0.931297
## other_installA143 -3.646e-01 1.705e-01 -2.138 0.032503 *
## housingA152 -1.854e-01 1.651e-01 -1.123 0.261365
## housingA153 -2.067e-01 3.182e-01 -0.650 0.515948
## n_credits 8.351e-02 1.329e-01 0.628 0.529906
## jobA172 3.226e-01 5.022e-01 0.642 0.520639
## jobA173 2.938e-01 4.916e-01 0.598 0.550054
## jobA174 5.674e-01 4.992e-01 1.137 0.255681
## n_people 8.344e-02 1.703e-01 0.490 0.624223
## telephoneA192 -2.497e-01 1.382e-01 -1.806 0.070857 .
## foreignA202 -6.221e-01 3.751e-01 -1.658 0.097236 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 630.34 on 651 degrees of freedom
## AIC: 728.34
##
## Number of Fisher Scoring iterations: 6
par(mfrow = c(2,2))
plot(german_model_probit)
AIC(german_model_probit)
## [1] 728.3355
BIC(german_model_probit)
## [1] 951.3384
german_model_cloglog<- glm(response~., family=binomial(link = "cloglog"), data=train)
## Warning: glm.fit: algorithm did not converge
summary(german_model_cloglog)
##
## Call:
## glm(formula = response ~ ., family = binomial(link = "cloglog"),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4002 -0.6914 -0.4134 0.6267 2.4210
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.450e-02 9.952e-01 -0.055 0.956325
## chk_acctA12 -3.900e-01 1.913e-01 -2.038 0.041503 *
## chk_acctA13 -3.044e-01 3.280e-01 -0.928 0.353381
## chk_acctA14 -1.305e+00 2.202e-01 -5.926 3.10e-09 ***
## duration 1.588e-02 8.139e-03 1.951 0.051054 .
## credit_hisA31 -2.296e-01 4.360e-01 -0.527 0.598441
## credit_hisA32 -6.889e-01 3.426e-01 -2.011 0.044336 *
## credit_hisA33 -8.745e-01 3.830e-01 -2.283 0.022424 *
## credit_hisA34 -1.572e+00 3.688e-01 -4.264 2.01e-05 ***
## purposeA41 -1.071e+00 3.345e-01 -3.203 0.001361 **
## purposeA410 -1.047e+00 6.438e-01 -1.626 0.103972
## purposeA42 -7.302e-01 2.278e-01 -3.205 0.001349 **
## purposeA43 -8.351e-01 2.265e-01 -3.687 0.000227 ***
## purposeA44 1.171e-02 6.099e-01 0.019 0.984681
## purposeA45 -5.681e-01 5.482e-01 -1.036 0.300048
## purposeA46 3.346e-01 3.300e-01 1.014 0.310596
## purposeA48 -1.537e+00 1.045e+00 -1.470 0.141602
## purposeA49 -6.221e-01 3.036e-01 -2.049 0.040454 *
## amount 7.678e-05 3.632e-05 2.114 0.034524 *
## saving_acctA62 -7.698e-02 2.447e-01 -0.315 0.753104
## saving_acctA63 -7.322e-01 4.572e-01 -1.602 0.109234
## saving_acctA64 -7.594e-01 4.919e-01 -1.544 0.122648
## saving_acctA65 -6.267e-01 2.337e-01 -2.681 0.007335 **
## present_empA72 1.760e-01 3.744e-01 0.470 0.638378
## present_empA73 2.487e-01 3.612e-01 0.688 0.491141
## present_empA74 -2.134e-01 3.961e-01 -0.539 0.590162
## present_empA75 1.584e-01 3.646e-01 0.434 0.664010
## installment_rate 2.341e-01 8.116e-02 2.884 0.003921 **
## sexA92 2.353e-02 3.669e-01 0.064 0.948862
## sexA93 -3.845e-01 3.534e-01 -1.088 0.276612
## sexA94 -1.190e-01 4.353e-01 -0.273 0.784546
## other_debtorA102 2.049e-01 3.402e-01 0.602 0.547109
## other_debtorA103 -7.417e-01 4.413e-01 -1.680 0.092861 .
## present_resid 2.612e-02 8.045e-02 0.325 0.745475
## propertyA122 1.951e-01 2.366e-01 0.824 0.409682
## propertyA123 -1.172e-02 2.212e-01 -0.053 0.957752
## propertyA124 4.575e-01 3.489e-01 1.311 0.189783
## age -1.230e-02 8.498e-03 -1.447 0.147904
## other_installA142 1.054e-01 3.405e-01 0.309 0.756964
## other_installA143 -4.462e-01 2.160e-01 -2.066 0.038833 *
## housingA152 -2.544e-01 2.083e-01 -1.221 0.222004
## housingA153 -4.612e-01 3.821e-01 -1.207 0.227470
## n_credits 1.263e-01 1.686e-01 0.749 0.453797
## jobA172 3.958e-01 6.138e-01 0.645 0.519035
## jobA173 3.515e-01 6.006e-01 0.585 0.558377
## jobA174 6.752e-01 6.194e-01 1.090 0.275709
## n_people 1.736e-01 2.246e-01 0.773 0.439636
## telephoneA192 -2.814e-01 1.820e-01 -1.546 0.122026
## foreignA202 -7.722e-01 5.437e-01 -1.420 0.155540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 632.40 on 651 degrees of freedom
## AIC: 730.4
##
## Number of Fisher Scoring iterations: 25
par(mfrow = c(2,2))
plot(german_model_cloglog)
AIC(german_model_cloglog)
## [1] 730.4027
BIC(german_model_cloglog)
## [1] 953.4057
german_model$deviance
## [1] 629.8755
AIC(german_model)
## [1] 727.8755
BIC(german_model)
## [1] 950.8784
german_model_back_aic <- step(german_model, direction = "backward") # backward selection (if you don't specify anything)
## Start: AIC=727.88
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + job + n_people + telephone + foreign
##
## Df Deviance AIC
## - present_emp 4 633.62 723.62
## - property 3 631.70 723.70
## - job 3 632.18 724.18
## - housing 2 631.04 725.04
## - present_resid 1 629.90 725.90
## - n_people 1 630.12 726.12
## - n_credits 1 630.23 726.23
## - sex 3 634.53 726.53
## - saving_acct 4 636.89 726.89
## <none> 629.88 727.88
## - other_debtor 2 634.06 728.06
## - telephone 1 632.53 728.53
## - duration 1 632.56 728.56
## - age 1 632.58 728.58
## - foreign 1 632.76 728.76
## - other_install 2 635.58 729.58
## - amount 1 636.24 732.24
## - installment_rate 1 637.03 733.03
## - purpose 9 660.81 740.81
## - credit_his 4 653.65 743.65
## - chk_acct 3 668.73 760.73
##
## Step: AIC=723.62
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + present_resid +
## property + age + other_install + housing + n_credits + job +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - property 3 635.55 719.55
## - job 3 635.79 719.79
## - housing 2 634.74 720.74
## - present_resid 1 633.62 721.62
## - n_people 1 633.77 721.77
## - n_credits 1 633.95 721.95
## - saving_acct 4 641.36 723.36
## - sex 3 639.40 723.40
## <none> 633.62 723.62
## - duration 1 635.76 723.76
## - other_debtor 2 638.11 724.11
## - age 1 636.45 724.45
## - telephone 1 636.65 724.65
## - foreign 1 636.73 724.73
## - other_install 2 639.21 725.21
## - amount 1 639.81 727.81
## - installment_rate 1 640.88 728.88
## - purpose 9 664.77 736.77
## - credit_his 4 659.78 741.78
## - chk_acct 3 673.27 757.27
##
## Step: AIC=719.55
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + present_resid +
## age + other_install + housing + n_credits + job + n_people +
## telephone + foreign
##
## Df Deviance AIC
## - job 3 637.97 715.97
## - housing 2 637.52 717.52
## - present_resid 1 635.55 717.55
## - n_people 1 635.67 717.67
## - n_credits 1 635.78 717.78
## - sex 3 641.03 719.03
## - saving_acct 4 643.08 719.08
## <none> 635.55 719.55
## - duration 1 637.68 719.68
## - telephone 1 638.15 720.15
## - foreign 1 638.33 720.33
## - age 1 638.45 720.45
## - other_debtor 2 640.46 720.46
## - other_install 2 641.33 721.33
## - amount 1 642.38 724.38
## - installment_rate 1 643.21 725.21
## - purpose 9 668.17 734.17
## - credit_his 4 662.05 738.05
## - chk_acct 3 676.25 754.25
##
## Step: AIC=715.97
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + present_resid +
## age + other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - present_resid 1 637.98 713.98
## - n_people 1 638.10 714.10
## - n_credits 1 638.12 714.12
## - housing 2 640.15 714.15
## - sex 3 643.52 715.52
## - telephone 1 639.55 715.55
## - duration 1 639.70 715.70
## <none> 637.97 715.97
## - age 1 640.51 716.51
## - foreign 1 640.68 716.68
## - saving_acct 4 646.74 716.74
## - other_debtor 2 643.17 717.17
## - other_install 2 644.34 718.34
## - amount 1 647.10 723.10
## - installment_rate 1 647.44 723.44
## - purpose 9 669.34 729.34
## - credit_his 4 664.24 734.24
## - chk_acct 3 678.43 750.43
##
## Step: AIC=713.98
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + housing + n_credits + n_people + telephone +
## foreign
##
## Df Deviance AIC
## - n_people 1 638.11 712.11
## - n_credits 1 638.13 712.13
## - housing 2 640.31 712.31
## - sex 3 643.52 713.52
## - telephone 1 639.58 713.58
## - duration 1 639.70 713.70
## <none> 637.98 713.98
## - foreign 1 640.68 714.68
## - age 1 640.74 714.74
## - saving_acct 4 646.88 714.88
## - other_debtor 2 643.18 715.18
## - other_install 2 644.41 716.41
## - amount 1 647.15 721.15
## - installment_rate 1 647.45 721.45
## - purpose 9 669.45 727.45
## - credit_his 4 664.30 732.30
## - chk_acct 3 678.48 748.48
##
## Step: AIC=712.11
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + housing + n_credits + telephone + foreign
##
## Df Deviance AIC
## - n_credits 1 638.28 710.28
## - housing 2 640.51 710.51
## - sex 3 643.57 711.57
## - telephone 1 639.74 711.74
## - duration 1 639.83 711.83
## <none> 638.11 712.11
## - foreign 1 640.82 712.82
## - age 1 640.83 712.83
## - saving_acct 4 646.93 712.93
## - other_debtor 2 643.20 713.20
## - other_install 2 644.56 714.56
## - amount 1 647.20 719.20
## - installment_rate 1 647.46 719.46
## - purpose 9 669.57 725.57
## - credit_his 4 664.83 730.83
## - chk_acct 3 678.75 746.75
##
## Step: AIC=710.28
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + housing + telephone + foreign
##
## Df Deviance AIC
## - housing 2 640.67 708.67
## - sex 3 643.65 709.65
## - telephone 1 639.83 709.83
## - duration 1 639.94 709.94
## <none> 638.28 710.28
## - age 1 640.94 710.94
## - foreign 1 641.08 711.08
## - saving_acct 4 647.20 711.20
## - other_debtor 2 643.35 711.35
## - other_install 2 645.15 713.15
## - amount 1 647.41 717.41
## - installment_rate 1 647.54 717.54
## - purpose 9 669.97 723.97
## - credit_his 4 667.24 731.24
## - chk_acct 3 678.87 744.87
##
## Step: AIC=708.67
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + telephone + foreign
##
## Df Deviance AIC
## - telephone 1 642.20 708.20
## - duration 1 642.51 708.51
## - sex 3 646.67 708.67
## <none> 640.67 708.67
## - age 1 643.24 709.24
## - foreign 1 643.72 709.72
## - saving_acct 4 649.77 709.77
## - other_debtor 2 646.16 710.16
## - other_install 2 647.17 711.17
## - amount 1 650.17 716.17
## - installment_rate 1 650.40 716.40
## - purpose 9 672.97 722.97
## - credit_his 4 671.87 731.87
## - chk_acct 3 684.54 746.54
##
## Step: AIC=708.2
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + sex + other_debtor + age +
## other_install + foreign
##
## Df Deviance AIC
## - sex 3 648.03 708.03
## <none> 642.20 708.20
## - duration 1 644.21 708.21
## - foreign 1 645.01 709.01
## - saving_acct 4 651.53 709.53
## - other_debtor 2 647.58 709.58
## - age 1 645.59 709.59
## - other_install 2 648.82 710.82
## - amount 1 650.48 714.48
## - installment_rate 1 651.48 715.48
## - purpose 9 675.00 723.00
## - credit_his 4 673.71 731.71
## - chk_acct 3 686.96 746.96
##
## Step: AIC=708.03
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + installment_rate + other_debtor + age + other_install +
## foreign
##
## Df Deviance AIC
## - duration 1 649.89 707.89
## <none> 648.03 708.03
## - foreign 1 651.18 709.18
## - saving_acct 4 657.61 709.61
## - other_install 2 653.69 709.69
## - other_debtor 2 653.85 709.85
## - age 1 653.02 711.02
## - amount 1 654.76 712.76
## - installment_rate 1 655.49 713.49
## - purpose 9 680.16 722.16
## - credit_his 4 680.78 732.78
## - chk_acct 3 693.23 747.23
##
## Step: AIC=707.89
## response ~ chk_acct + credit_his + purpose + amount + saving_acct +
## installment_rate + other_debtor + age + other_install + foreign
##
## Df Deviance AIC
## <none> 649.89 707.89
## - saving_acct 4 659.21 709.21
## - other_install 2 655.48 709.48
## - foreign 1 653.56 709.56
## - other_debtor 2 655.62 709.62
## - age 1 655.42 711.42
## - installment_rate 1 660.55 716.55
## - purpose 9 681.44 721.44
## - amount 1 667.80 723.80
## - credit_his 4 683.96 733.96
## - chk_acct 3 697.12 749.12
summary(german_model_back_aic)
##
## Call:
## glm(formula = response ~ chk_acct + credit_his + purpose + amount +
## saving_acct + installment_rate + other_debtor + age + other_install +
## foreign, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3321 -0.7029 -0.4257 0.7341 2.5649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.958e+00 7.282e-01 2.689 0.007161 **
## chk_acctA12 -5.317e-01 2.534e-01 -2.098 0.035918 *
## chk_acctA13 -5.632e-01 4.136e-01 -1.362 0.173217
## chk_acctA14 -1.708e+00 2.674e-01 -6.385 1.71e-10 ***
## credit_hisA31 -4.705e-01 6.335e-01 -0.743 0.457670
## credit_hisA32 -1.286e+00 4.874e-01 -2.639 0.008319 **
## credit_hisA33 -1.406e+00 5.425e-01 -2.592 0.009534 **
## credit_hisA34 -2.325e+00 5.194e-01 -4.476 7.60e-06 ***
## purposeA41 -1.281e+00 4.125e-01 -3.105 0.001903 **
## purposeA410 -1.455e+00 8.476e-01 -1.717 0.086015 .
## purposeA42 -8.260e-01 2.944e-01 -2.806 0.005015 **
## purposeA43 -1.017e+00 2.867e-01 -3.548 0.000389 ***
## purposeA44 5.379e-01 9.912e-01 0.543 0.587332
## purposeA45 -7.916e-01 6.726e-01 -1.177 0.239242
## purposeA46 6.498e-01 4.769e-01 1.363 0.173020
## purposeA48 -2.212e+00 1.209e+00 -1.829 0.067347 .
## purposeA49 -7.256e-01 3.881e-01 -1.870 0.061529 .
## amount 1.576e-04 3.803e-05 4.144 3.42e-05 ***
## saving_acctA62 -1.907e-01 3.190e-01 -0.598 0.550084
## saving_acctA63 -7.569e-01 5.227e-01 -1.448 0.147609
## saving_acctA64 -9.867e-01 5.826e-01 -1.693 0.090364 .
## saving_acctA65 -6.688e-01 2.856e-01 -2.341 0.019216 *
## installment_rate 3.128e-01 9.794e-02 3.194 0.001403 **
## other_debtorA102 5.970e-01 4.568e-01 1.307 0.191278
## other_debtorA103 -9.002e-01 5.017e-01 -1.794 0.072786 .
## age -2.165e-02 9.373e-03 -2.310 0.020903 *
## other_installA142 -1.198e-01 4.564e-01 -0.263 0.792902
## other_installA143 -6.231e-01 2.833e-01 -2.199 0.027867 *
## foreignA202 -1.156e+00 6.532e-01 -1.769 0.076863 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 649.89 on 671 degrees of freedom
## AIC: 707.89
##
## Number of Fisher Scoring iterations: 5
german_model_back_aic$deviance
## [1] 649.8939
AIC(german_model_back_aic)
## [1] 707.8939
BIC(german_model_back_aic)
## [1] 839.8752
german_model.back_bic <- step(german_model, k=log(nrow(train)))
## Start: AIC=950.88
## response ~ chk_acct + duration + credit_his + purpose + amount +
## saving_acct + present_emp + installment_rate + sex + other_debtor +
## present_resid + property + age + other_install + housing +
## n_credits + job + n_people + telephone + foreign
##
## Df Deviance AIC
## - purpose 9 660.81 922.85
## - present_emp 4 633.62 928.41
## - saving_acct 4 636.89 931.69
## - property 3 631.70 933.05
## - job 3 632.18 933.53
## - sex 3 634.53 935.88
## - housing 2 631.04 938.94
## - other_debtor 2 634.06 941.96
## - other_install 2 635.58 943.48
## - present_resid 1 629.90 944.35
## - n_people 1 630.12 944.57
## - n_credits 1 630.23 944.68
## - telephone 1 632.53 946.98
## - duration 1 632.56 947.02
## - age 1 632.58 947.03
## - foreign 1 632.76 947.21
## - credit_his 4 653.65 948.45
## - amount 1 636.24 950.69
## <none> 629.88 950.88
## - installment_rate 1 637.03 951.48
## - chk_acct 3 668.73 970.08
##
## Step: AIC=922.85
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## present_emp + installment_rate + sex + other_debtor + present_resid +
## property + age + other_install + housing + n_credits + job +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - present_emp 4 664.77 900.61
## - saving_acct 4 667.68 903.51
## - job 3 661.97 904.36
## - property 3 664.25 906.64
## - sex 3 664.79 907.18
## - housing 2 662.29 911.23
## - other_install 2 665.17 914.12
## - other_debtor 2 666.44 915.38
## - present_resid 1 660.81 916.30
## - n_people 1 661.05 916.54
## - n_credits 1 661.47 916.96
## - age 1 661.90 917.39
## - foreign 1 662.36 917.85
## - duration 1 662.67 918.16
## - credit_his 4 682.77 918.61
## - telephone 1 663.83 919.33
## - amount 1 665.80 921.29
## <none> 660.81 922.85
## - installment_rate 1 667.73 923.22
## - chk_acct 3 704.27 946.66
##
## Step: AIC=900.61
## response ~ chk_acct + duration + credit_his + amount + saving_acct +
## installment_rate + sex + other_debtor + present_resid + property +
## age + other_install + housing + n_credits + job + n_people +
## telephone + foreign
##
## Df Deviance AIC
## - saving_acct 4 672.10 881.73
## - job 3 665.71 881.89
## - property 3 668.17 884.36
## - sex 3 669.79 885.98
## - housing 2 666.21 888.94
## - other_install 2 668.91 891.65
## - other_debtor 2 670.55 893.29
## - present_resid 1 664.86 894.14
## - n_people 1 664.95 894.24
## - n_credits 1 665.41 894.70
## - age 1 666.23 895.52
## - duration 1 666.25 895.54
## - foreign 1 666.38 895.67
## - telephone 1 668.18 897.46
## - credit_his 4 689.10 898.73
## - amount 1 669.73 899.02
## <none> 664.77 900.61
## - installment_rate 1 671.54 900.83
## - chk_acct 3 708.68 924.86
##
## Step: AIC=881.73
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## sex + other_debtor + present_resid + property + age + other_install +
## housing + n_credits + job + n_people + telephone + foreign
##
## Df Deviance AIC
## - job 3 673.85 863.83
## - property 3 675.28 865.26
## - sex 3 677.13 867.12
## - housing 2 673.32 869.85
## - other_install 2 676.47 873.00
## - other_debtor 2 677.15 873.68
## - n_people 1 672.17 875.26
## - present_resid 1 672.34 875.42
## - n_credits 1 672.97 876.05
## - duration 1 673.59 876.67
## - age 1 673.94 877.02
## - foreign 1 674.11 877.19
## - telephone 1 676.14 879.22
## - amount 1 676.79 879.87
## - credit_his 4 696.94 880.37
## - installment_rate 1 678.55 881.64
## <none> 672.10 881.73
## - chk_acct 3 726.06 916.04
##
## Step: AIC=863.83
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## sex + other_debtor + present_resid + property + age + other_install +
## housing + n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - property 3 677.20 847.53
## - sex 3 679.11 849.44
## - housing 2 675.06 851.94
## - other_install 2 678.93 855.81
## - other_debtor 2 679.24 856.12
## - n_people 1 673.94 857.37
## - present_resid 1 674.13 857.56
## - n_credits 1 674.60 858.03
## - duration 1 674.95 858.38
## - age 1 675.41 858.84
## - foreign 1 675.79 859.22
## - telephone 1 677.00 860.43
## - credit_his 4 698.46 862.24
## <none> 673.85 863.83
## - amount 1 680.55 863.98
## - installment_rate 1 681.64 865.07
## - chk_acct 3 727.96 898.29
##
## Step: AIC=847.53
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## sex + other_debtor + present_resid + age + other_install +
## housing + n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - sex 3 682.04 832.72
## - housing 2 680.56 837.78
## - other_install 2 682.51 839.74
## - other_debtor 2 683.49 840.72
## - n_people 1 677.25 841.02
## - present_resid 1 677.47 841.25
## - n_credits 1 677.75 841.53
## - duration 1 678.28 842.05
## - age 1 678.76 842.53
## - foreign 1 678.83 842.60
## - telephone 1 679.66 843.43
## - credit_his 4 701.88 846.01
## <none> 677.20 847.53
## - amount 1 684.95 848.72
## - installment_rate 1 685.88 849.66
## - chk_acct 3 732.77 883.45
##
## Step: AIC=832.72
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## other_debtor + present_resid + age + other_install + housing +
## n_credits + n_people + telephone + foreign
##
## Df Deviance AIC
## - housing 2 685.83 823.41
## - other_install 2 686.93 824.50
## - other_debtor 2 688.39 825.97
## - n_people 1 682.13 826.25
## - present_resid 1 682.29 826.41
## - n_credits 1 682.48 826.61
## - duration 1 683.14 827.26
## - foreign 1 683.91 828.03
## - age 1 683.96 828.08
## - telephone 1 684.40 828.53
## - credit_his 4 707.36 831.83
## <none> 682.04 832.72
## - amount 1 688.62 832.74
## - installment_rate 1 688.87 832.99
## - chk_acct 3 739.52 870.54
##
## Step: AIC=823.41
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## other_debtor + present_resid + age + other_install + n_credits +
## n_people + telephone + foreign
##
## Df Deviance AIC
## - other_install 2 690.51 814.98
## - n_people 1 685.87 816.89
## - present_resid 1 685.88 816.90
## - n_credits 1 686.20 817.23
## - other_debtor 2 692.82 817.29
## - duration 1 686.91 817.93
## - foreign 1 687.87 818.89
## - telephone 1 688.19 819.21
## - age 1 688.57 819.59
## <none> 685.83 823.41
## - installment_rate 1 692.73 823.75
## - amount 1 692.98 824.00
## - credit_his 4 713.76 825.12
## - chk_acct 3 746.20 864.12
##
## Step: AIC=814.98
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## other_debtor + present_resid + age + n_credits + n_people +
## telephone + foreign
##
## Df Deviance AIC
## - n_people 1 690.52 808.43
## - present_resid 1 690.52 808.44
## - other_debtor 2 697.08 808.45
## - n_credits 1 691.25 809.17
## - duration 1 691.58 809.50
## - foreign 1 692.65 810.57
## - age 1 692.96 810.88
## - telephone 1 693.00 810.92
## <none> 690.51 814.98
## - installment_rate 1 697.36 815.28
## - amount 1 698.02 815.94
## - credit_his 4 724.25 822.51
## - chk_acct 3 749.45 854.27
##
## Step: AIC=808.43
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## other_debtor + present_resid + age + n_credits + telephone +
## foreign
##
## Df Deviance AIC
## - present_resid 1 690.52 801.89
## - other_debtor 2 697.19 802.01
## - n_credits 1 691.25 802.62
## - duration 1 691.59 802.95
## - foreign 1 692.66 804.02
## - telephone 1 693.00 804.37
## - age 1 693.02 804.39
## <none> 690.52 808.43
## - installment_rate 1 697.41 808.78
## - amount 1 698.02 809.38
## - credit_his 4 724.43 816.15
## - chk_acct 3 749.45 847.72
##
## Step: AIC=801.89
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## other_debtor + age + n_credits + telephone + foreign
##
## Df Deviance AIC
## - other_debtor 2 697.24 795.50
## - n_credits 1 691.28 796.09
## - duration 1 691.61 796.42
## - foreign 1 692.69 797.51
## - telephone 1 693.01 797.83
## - age 1 693.12 797.94
## <none> 690.52 801.89
## - installment_rate 1 697.43 802.25
## - amount 1 698.02 802.84
## - credit_his 4 724.43 809.60
## - chk_acct 3 749.81 841.52
##
## Step: AIC=795.5
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## age + n_credits + telephone + foreign
##
## Df Deviance AIC
## - n_credits 1 697.86 789.58
## - duration 1 698.20 789.91
## - telephone 1 699.47 791.19
## - foreign 1 699.61 791.33
## - age 1 700.02 791.73
## <none> 697.24 795.50
## - installment_rate 1 704.59 796.30
## - amount 1 706.24 797.95
## - credit_his 4 731.35 803.41
## - chk_acct 3 755.79 834.40
##
## Step: AIC=789.58
## response ~ chk_acct + duration + credit_his + amount + installment_rate +
## age + telephone + foreign
##
## Df Deviance AIC
## - duration 1 698.75 783.92
## - telephone 1 699.96 785.12
## - foreign 1 700.32 785.48
## - age 1 700.57 785.73
## <none> 697.86 789.58
## - installment_rate 1 705.10 790.27
## - amount 1 706.98 792.15
## - credit_his 4 732.32 797.83
## - chk_acct 3 756.20 828.26
##
## Step: AIC=783.92
## response ~ chk_acct + credit_his + amount + installment_rate +
## age + telephone + foreign
##
## Df Deviance AIC
## - telephone 1 700.95 779.57
## - foreign 1 701.59 780.20
## - age 1 701.75 780.36
## <none> 698.75 783.92
## - installment_rate 1 708.53 787.15
## - credit_his 4 734.13 793.09
## - amount 1 719.40 798.01
## - chk_acct 3 758.47 823.98
##
## Step: AIC=779.57
## response ~ chk_acct + credit_his + amount + installment_rate +
## age + foreign
##
## Df Deviance AIC
## - foreign 1 703.50 775.56
## - age 1 704.92 776.98
## <none> 700.95 779.57
## - installment_rate 1 710.38 782.44
## - credit_his 4 737.41 789.82
## - amount 1 719.44 791.50
## - chk_acct 3 762.47 821.43
##
## Step: AIC=775.56
## response ~ chk_acct + credit_his + amount + installment_rate +
## age
##
## Df Deviance AIC
## - age 1 707.36 772.87
## <none> 703.50 775.56
## - installment_rate 1 713.86 779.37
## - credit_his 4 740.14 785.99
## - amount 1 722.29 787.80
## - chk_acct 3 764.99 817.40
##
## Step: AIC=772.87
## response ~ chk_acct + credit_his + amount + installment_rate
##
## Df Deviance AIC
## <none> 707.36 772.87
## - installment_rate 1 716.76 775.72
## - amount 1 724.37 783.33
## - credit_his 4 746.30 785.60
## - chk_acct 3 769.56 815.41
summary(german_model.back_bic)
##
## Call:
## glm(formula = response ~ chk_acct + credit_his + amount + installment_rate,
## family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0031 -0.8105 -0.4888 0.8972 2.4994
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1529180 0.5610828 0.273 0.78521
## chk_acctA12 -0.5908353 0.2259036 -2.615 0.00891 **
## chk_acctA13 -0.7132500 0.3852302 -1.851 0.06410 .
## chk_acctA14 -1.8169105 0.2472187 -7.349 1.99e-13 ***
## credit_hisA31 -0.4375497 0.5847990 -0.748 0.45434
## credit_hisA32 -1.3956104 0.4592054 -3.039 0.00237 **
## credit_hisA33 -1.3501579 0.5228587 -2.582 0.00982 **
## credit_hisA34 -2.3322972 0.4921102 -4.739 2.14e-06 ***
## amount 0.0001362 0.0000337 4.043 5.27e-05 ***
## installment_rate 0.2698017 0.0897746 3.005 0.00265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 707.36 on 690 degrees of freedom
## AIC: 727.36
##
## Number of Fisher Scoring iterations: 4
german_model.back_bic$deviance
## [1] 707.3607
AIC(german_model.back_bic)
## [1] 727.3607
BIC(german_model.back_bic)
## [1] 772.8715
nums <- sapply(train, is.numeric)
nums <- as.vector(which(nums==TRUE))
germandata.nums.std <- scale(german_data[,nums])
# change categorical variables to dummy variables
dummy <- model.matrix( ~ chk_acct+credit_his+purpose+saving_acct+present_emp+ sex+other_debtor+property+other_install+housing+job+telephone+foreign-1,
data= german_data)
y.std <- german_data$response
germandata.std <- cbind(dummy,germandata.nums.std,y.std)
dim(germandata.std)
## [1] 1000 51
X.train <- as.matrix(germandata.std[in.train,-50])
Y.train <- as.factor(germandata.std[in.train,50])
lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
summary(lasso.fit)
## Length Class Mode
## a0 72 -none- numeric
## beta 3600 dgCMatrix S4
## df 72 -none- numeric
## dim 2 -none- numeric
## lambda 72 -none- numeric
## dev.ratio 72 -none- numeric
## nulldev 1 -none- numeric
## npasses 1 -none- numeric
## jerr 1 -none- numeric
## offset 1 -none- logical
## classnames 2 -none- character
## call 5 -none- call
## nobs 1 -none- numeric
plot(lasso.fit, xvar = "lambda", label=TRUE)
## Warning in plotCoef(x$beta, lambda = x$lambda, df = x$df, dev = x$dev.ratio, : 1
## or less nonzero coefficients; glmnet plot is not meaningful
lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)
## Warning in plotCoef(x$beta, lambda = x$lambda, df = x$df, dev = x$dev.ratio, : 1
## or less nonzero coefficients; glmnet plot is not meaningful
Using cross validation to find perfect lambda value
cv.lasso<- cv.glmnet(x = X.train, y = Y.train, family = "binomial", alpha = 1, nfolds = 10)
plot(cv.lasso)
coef(lasso.fit, s=cv.lasso$lambda.min)
## 51 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -7.822736
## chk_acctA11 .
## chk_acctA12 .
## chk_acctA13 .
## chk_acctA14 .
## credit_hisA31 .
## credit_hisA32 .
## credit_hisA33 .
## credit_hisA34 .
## purposeA41 .
## purposeA410 .
## purposeA42 .
## purposeA43 .
## purposeA44 .
## purposeA45 .
## purposeA46 .
## purposeA48 .
## purposeA49 .
## saving_acctA62 .
## saving_acctA63 .
## saving_acctA64 .
## saving_acctA65 .
## present_empA72 .
## present_empA73 .
## present_empA74 .
## present_empA75 .
## sexA92 .
## sexA93 .
## sexA94 .
## other_debtorA102 .
## other_debtorA103 .
## propertyA122 .
## propertyA123 .
## propertyA124 .
## other_installA142 .
## other_installA143 .
## housingA152 .
## housingA153 .
## jobA172 .
## jobA173 .
## jobA174 .
## telephoneA192 .
## foreignA202 .
## duration .
## amount .
## installment_rate .
## present_resid .
## age .
## n_credits .
## n_people .
## y.std 14.777128
cv.lasso$lambda.min
## [1] 0.0006174812
For our final model, we select the following variables:
credit.glm.final <- glm(response ~ chk_acct + duration + credit_his + purpose + amount + saving_acct + installment_rate + sex + other_debtor + age + other_install + n_people + foreign, family = "binomial", train)
summary(credit.glm.final)
##
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose +
## amount + saving_acct + installment_rate + sex + other_debtor +
## age + other_install + n_people + foreign, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3337 -0.7165 -0.4110 0.7016 2.6315
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.655e+00 9.572e-01 1.729 0.083853 .
## chk_acctA12 -5.197e-01 2.584e-01 -2.012 0.044262 *
## chk_acctA13 -4.977e-01 4.143e-01 -1.201 0.229708
## chk_acctA14 -1.680e+00 2.710e-01 -6.199 5.68e-10 ***
## duration 1.523e-02 1.073e-02 1.420 0.155683
## credit_hisA31 -4.279e-01 6.440e-01 -0.664 0.506398
## credit_hisA32 -1.248e+00 5.014e-01 -2.488 0.012835 *
## credit_hisA33 -1.358e+00 5.567e-01 -2.439 0.014717 *
## credit_hisA34 -2.257e+00 5.337e-01 -4.229 2.35e-05 ***
## purposeA41 -1.358e+00 4.201e-01 -3.233 0.001224 **
## purposeA410 -1.401e+00 8.679e-01 -1.614 0.106567
## purposeA42 -8.784e-01 2.984e-01 -2.943 0.003246 **
## purposeA43 -1.112e+00 2.949e-01 -3.769 0.000164 ***
## purposeA44 3.808e-01 1.033e+00 0.369 0.712411
## purposeA45 -7.828e-01 6.784e-01 -1.154 0.248492
## purposeA46 5.942e-01 4.801e-01 1.238 0.215901
## purposeA48 -2.166e+00 1.238e+00 -1.750 0.080178 .
## purposeA49 -8.025e-01 3.981e-01 -2.016 0.043822 *
## amount 1.368e-04 4.773e-05 2.866 0.004153 **
## saving_acctA62 -2.121e-01 3.206e-01 -0.661 0.508303
## saving_acctA63 -7.487e-01 5.271e-01 -1.421 0.155444
## saving_acctA64 -1.007e+00 5.861e-01 -1.719 0.085703 .
## saving_acctA65 -6.844e-01 2.886e-01 -2.371 0.017738 *
## installment_rate 3.153e-01 1.042e-01 3.025 0.002490 **
## sexA92 6.420e-02 4.699e-01 0.137 0.891328
## sexA93 -5.053e-01 4.560e-01 -1.108 0.267723
## sexA94 -1.236e-01 5.523e-01 -0.224 0.822918
## other_debtorA102 6.289e-01 4.605e-01 1.366 0.172016
## other_debtorA103 -8.786e-01 5.104e-01 -1.721 0.085186 .
## age -1.770e-02 9.598e-03 -1.845 0.065079 .
## other_installA142 -3.485e-02 4.612e-01 -0.076 0.939762
## other_installA143 -6.559e-01 2.852e-01 -2.300 0.021470 *
## n_people 1.447e-01 2.889e-01 0.501 0.616478
## foreignA202 -1.037e+00 6.682e-01 -1.551 0.120804
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 850.06 on 699 degrees of freedom
## Residual deviance: 641.95 on 666 degrees of freedom
## AIC: 709.95
##
## Number of Fisher Scoring iterations: 5
AIC(credit.glm.final)
## [1] 709.9487
BIC(credit.glm.final)
## [1] 864.6855
credit.glm.final$deviance
## [1] 641.9487
prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.37
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5414286
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area
## [1] 0.8239214
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length
## [1] 0.4890136
prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.37
Confusion Matrix
table(train$response, predicted.glm1.insample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 259 234
## 1 25 182
credit.glm.final$deviance
## [1] 641.9487
roc.plot(train$response == "1", prob.glm1.insample)
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area
## [1] 0.8239214
pred_aic <- predict(credit.glm.final,type="response")
hist(pred_aic)
#### Cut off 1/6
table(train$response, (pred_aic > 0.167)*1, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 259 234
## 1 25 182
#out of sample
prob.glm2.outsample <- predict(credit.glm.final, test, type = "response")
predicted.glm2.outsample <- prob.glm2.outsample > 0.1667
predicted.glm2.outsample <- as.numeric(predicted.glm2.outsample)
table(test$response, predicted.glm2.outsample, dnn = c("Truth", "Predicted"))
## Predicted
## Truth 0 1
## 0 116 91
## 1 8 85
mean(ifelse(test$response != predicted.glm2.outsample, 1, 0))
## [1] 0.33
roc.plot(test$response == "1", prob.glm2.outsample)
roc.plot(test$response == "1", prob.glm2.outsample)$roc.vol$Area
## [1] 0.8043738
#Cross Validation
model1.glm <- glm(response ~ ., data = german_data)
model2.glm <- glm(response ~ chk_acct + duration + credit_his + purpose + amount + saving_acct + installment_rate + sex + other_debtor + age + other_install + n_people + foreign, family = "binomial", german_data)
cv.glm(data = german_data, glmfit = model1.glm, K = 5)$delta[2]
## [1] 0.1690591
cv.glm(data = german_data, glmfit = model2.glm, K = 5)$delta[2]
## [1] 0.1631477
#cost function define
costfunc = function(obs, pred.p){
pcut=.167
weight1 = 5 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return to a value when you write R functions
} # end of the function
german_glm<- glm(response ~ . , family=binomial, data=german_data);
cv_result <- cv.glm(data= german_data,
glmfit=german_glm, cost=costfunc, K=5)
cv_result$delta[2]
## [1] 0.5676
#asymettric cost
costfunc((pred_aic > 0.167)*1,train$response)
## [1] 1.707143
cost1 <- function(r, pi) {
weight1 = 5
weight0 = 1
c1 = (r == 1) & (pi < 0.17) #logical vector - true if actual 1 but predict 0
c0 = (r == 0) & (pi > 0.17) #logical vecotr - true if actual 0 but predict 1
return(mean(weight1 * c1 + weight0 * c0))
}
#default value of cp = 0.01
german.tree <- rpart(response ~ ., data = train)
german.tree
## n= 700
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 700 145.787100 0.2957143
## 2) chk_acct=A13,A14 323 39.448920 0.1424149
## 4) other_install=A143 265 25.826420 0.1094340 *
## 5) other_install=A141,A142 58 12.017240 0.2931034
## 10) purpose=A41,A410,A42,A43,A46 38 5.052632 0.1578947 *
## 11) purpose=A40,A49 20 4.950000 0.5500000 *
## 3) chk_acct=A11,A12 377 92.244030 0.4270557
## 6) credit_his=A32,A33,A34 332 78.876510 0.3885542
## 12) duration< 27.5 252 55.317460 0.3253968
## 24) amount< 8724.5 245 52.040820 0.3061224
## 48) purpose=A41,A410,A42,A43,A45,A48,A49 165 29.781820 0.2363636 *
## 49) purpose=A40,A44,A46 80 19.800000 0.4500000
## 98) amount>=1551.5 41 7.560976 0.2439024 *
## 99) amount< 1551.5 39 8.666667 0.6666667
## 198) age>=36.5 14 2.857143 0.2857143 *
## 199) age< 36.5 25 2.640000 0.8800000 *
## 25) amount>=8724.5 7 0.000000 1.0000000 *
## 13) duration>=27.5 80 19.387500 0.5875000
## 26) property=A122 20 4.550000 0.3500000 *
## 27) property=A121,A123,A124 60 13.333330 0.6666667 *
## 7) credit_his=A30,A31 45 9.244444 0.7111111
## 14) housing=A152,A153 32 7.718750 0.5937500
## 28) property=A121,A123 13 2.769231 0.3076923 *
## 29) property=A122,A124 19 3.157895 0.7894737 *
## 15) housing=A151 13 0.000000 1.0000000 *
plotcp(german.tree)
#Plotting the tree
rpart.plot(german.tree, type = 3, fallen.leaves = TRUE)
credit_rpart0 <- rpart(formula = response ~ ., data = train, method = "class")
credit_rpart <- rpart(formula = response ~ . , data = train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))
pred0 <- predict(credit_rpart0, type="class")
table(train$response, pred0, dnn = c("True", "Pred"))
## Pred
## True 0 1
## 0 474 19
## 1 115 92
pred_german_train<- predict(credit_rpart, train, type="class")
credit_test_prob_rpart = predict(credit_rpart, test, type="prob")
table(train$response, pred_german_train, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 284 209
## 1 16 191
cost <- function(r, phat){
weight1 <- 5
weight0 <- 1
pcut <- weight0/(weight1+weight0)
c1 <- (r==1)&(phat<pcut) #logical vector - true if actual 1 but predict 0
c0 <-(r==0)&(phat>pcut) #logical vector - true if actual 0 but predict 1
return(mean(weight1*c1+weight0*c0))
}
pred_german_train<- predict(credit_rpart, train, type="prob")
cost(train$response, predict(credit_rpart, train, type="prob"))
## [1] 0.5585714
cost(test$response, predict(credit_rpart, test, type="prob"))
## [1] 0.655
pred1 <- prediction(pred_german_train[,2], train$response)
pred = prediction(credit_test_prob_rpart[,2], test$response)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
slot(performance(pred1, "auc"), "y.values")[[1]]
## [1] 0.7719768
slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.6974443
credit_test_pred_rpart <- as.numeric(credit_test_prob_rpart[,2] > 1/(5+1))
table(test$response, credit_test_pred_rpart, dnn=c("Truth","Predicted"))
## Predicted
## Truth 0 1
## 0 116 91
## 1 19 74
#Building a large tree
german.large.tree <- rpart(formula = response ~ . , data = train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)), cp=0.001)
german.large.tree
## n= 700
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 700 493 1 (0.70428571 0.29571429)
## 2) chk_acct=A14 276 170 0 (0.87681159 0.12318841)
## 4) purpose=A41,A410,A43,A44,A48 123 30 0 (0.95121951 0.04878049)
## 8) other_install=A142,A143 109 15 0 (0.97247706 0.02752294)
## 16) age>=31.5 69 0 0 (1.00000000 0.00000000) *
## 17) age< 31.5 40 15 0 (0.92500000 0.07500000)
## 34) sex=A92,A94 23 0 0 (1.00000000 0.00000000) *
## 35) sex=A93 17 14 1 (0.82352941 0.17647059) *
## 9) other_install=A141 14 11 1 (0.78571429 0.21428571) *
## 5) purpose=A40,A42,A45,A46,A49 153 125 1 (0.81699346 0.18300654)
## 10) amount< 4158 125 80 0 (0.87200000 0.12800000)
## 20) duration< 9.5 24 0 0 (1.00000000 0.00000000) *
## 21) duration>=9.5 101 80 0 (0.84158416 0.15841584)
## 42) amount< 1112 15 0 0 (1.00000000 0.00000000) *
## 43) amount>=1112 86 70 1 (0.81395349 0.18604651)
## 86) amount>=3385.5 10 0 0 (1.00000000 0.00000000) *
## 87) amount< 3385.5 76 60 1 (0.78947368 0.21052632)
## 174) present_emp=A74,A75 36 20 0 (0.88888889 0.11111111)
## 348) credit_his=A30,A32 16 0 0 (1.00000000 0.00000000) *
## 349) credit_his=A33,A34 20 16 1 (0.80000000 0.20000000)
## 698) purpose=A42,A46,A49 11 5 0 (0.90909091 0.09090909) *
## 699) purpose=A40,A45 9 6 1 (0.66666667 0.33333333) *
## 175) present_emp=A71,A72,A73 40 28 1 (0.70000000 0.30000000)
## 350) other_install=A143 33 26 1 (0.78787879 0.21212121)
## 700) age>=30.5 16 5 0 (0.93750000 0.06250000) *
## 701) age< 30.5 17 11 1 (0.64705882 0.35294118) *
## 351) other_install=A141,A142 7 2 1 (0.28571429 0.71428571) *
## 11) amount>=4158 28 16 1 (0.57142857 0.42857143)
## 22) present_emp=A74,A75 15 10 0 (0.86666667 0.13333333) *
## 23) present_emp=A71,A72,A73 13 3 1 (0.23076923 0.76923077) *
## 3) chk_acct=A11,A12,A13 424 251 1 (0.59198113 0.40801887)
## 6) credit_his=A34 97 76 1 (0.78350515 0.21649485)
## 12) duration< 8.5 15 0 0 (1.00000000 0.00000000) *
## 13) duration>=8.5 82 61 1 (0.74390244 0.25609756)
## 26) purpose=A41,A42,A43,A45,A49 52 35 0 (0.86538462 0.13461538)
## 52) amount< 4864.5 40 15 0 (0.92500000 0.07500000)
## 104) amount>=2291.5 17 0 0 (1.00000000 0.00000000) *
## 105) amount< 2291.5 23 15 0 (0.86956522 0.13043478)
## 210) present_emp=A71,A72,A74 10 0 0 (1.00000000 0.00000000) *
## 211) present_emp=A73,A75 13 10 1 (0.76923077 0.23076923) *
## 53) amount>=4864.5 12 8 1 (0.66666667 0.33333333) *
## 27) purpose=A40,A410,A46 30 16 1 (0.53333333 0.46666667) *
## 7) credit_his=A30,A31,A32,A33 327 175 1 (0.53516820 0.46483180)
## 14) other_debtor=A103 21 10 0 (0.90476190 0.09523810)
## 28) present_emp=A73,A75 13 0 0 (1.00000000 0.00000000) *
## 29) present_emp=A72,A74 8 6 1 (0.75000000 0.25000000) *
## 15) other_debtor=A101,A102 306 156 1 (0.50980392 0.49019608)
## 30) chk_acct=A12,A13 175 105 1 (0.60000000 0.40000000)
## 60) saving_acct=A63,A64,A65 47 36 1 (0.76595745 0.23404255)
## 120) credit_his=A30,A33 8 0 0 (1.00000000 0.00000000) *
## 121) credit_his=A31,A32 39 28 1 (0.71794872 0.28205128)
## 242) amount< 4051.5 31 25 1 (0.80645161 0.19354839)
## 484) property=A123,A124 12 0 0 (1.00000000 0.00000000) *
## 485) property=A121,A122 19 13 1 (0.68421053 0.31578947) *
## 243) amount>=4051.5 8 3 1 (0.37500000 0.62500000) *
## 61) saving_acct=A61,A62 128 69 1 (0.53906250 0.46093750)
## 122) job=A173 72 47 1 (0.65277778 0.34722222)
## 244) present_emp=A71,A74,A75 25 20 0 (0.84000000 0.16000000)
## 488) purpose=A43,A45,A49 14 0 0 (1.00000000 0.00000000) *
## 489) purpose=A40,A42,A46 11 7 1 (0.63636364 0.36363636) *
## 245) present_emp=A72,A73 47 26 1 (0.55319149 0.44680851)
## 490) housing=A152 32 23 1 (0.71875000 0.28125000)
## 980) purpose=A40,A41,A410,A49 11 0 0 (1.00000000 0.00000000) *
## 981) purpose=A42,A43,A45 21 12 1 (0.57142857 0.42857143) *
## 491) housing=A151,A153 15 3 1 (0.20000000 0.80000000) *
## 123) job=A171,A172,A174 56 22 1 (0.39285714 0.60714286) *
## 31) chk_acct=A11 131 51 1 (0.38931298 0.61068702)
## 62) saving_acct=A63,A64 7 5 0 (0.85714286 0.14285714) *
## 63) saving_acct=A61,A62,A65 124 45 1 (0.36290323 0.63709677)
## 126) installment_rate< 2.5 40 21 1 (0.52500000 0.47500000)
## 252) property=A121,A123 24 16 1 (0.66666667 0.33333333)
## 504) job=A172,A174 8 0 0 (1.00000000 0.00000000) *
## 505) job=A173 16 8 1 (0.50000000 0.50000000) *
## 253) property=A122,A124 16 5 1 (0.31250000 0.68750000) *
## 127) installment_rate>=2.5 84 24 1 (0.28571429 0.71428571) *
plotcp(german.large.tree)
#Plotting the tree
rpart.plot(german.large.tree, type = 3, fallen.leaves = TRUE)
pred_german_train3 <- predict(german.large.tree, train, type="prob")
pred_german_train <- predict(german.large.tree, test, type="prob")
cost(train$response, predict(german.large.tree, train, type="prob"))
## [1] 0.5014286
cost(test$response, predict(german.large.tree, test, type="prob"))
## [1] 0.7216667
pred2 <- prediction(pred_german_train3[,2], train$response)
pred3 = prediction(pred_german_train[,2], test$response)
slot(performance(pred2, "auc"), "y.values")[[1]]
## [1] 0.8864489
slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.6677056
#however, from plotcp, we observe that a tree with more than 7 to 9 splits is not very helpful.
#further pruning the tree to limit to 9 splits;corresponding cp value from plot is 0.0072
pruned.tree <- prune(german.large.tree, cp = 0.04)
pruned.tree
## n= 700
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 700 493 1 (0.70428571 0.29571429)
## 2) chk_acct=A14 276 170 0 (0.87681159 0.12318841)
## 4) purpose=A41,A410,A43,A44,A48 123 30 0 (0.95121951 0.04878049) *
## 5) purpose=A40,A42,A45,A46,A49 153 125 1 (0.81699346 0.18300654)
## 10) amount< 4158 125 80 0 (0.87200000 0.12800000) *
## 11) amount>=4158 28 16 1 (0.57142857 0.42857143) *
## 3) chk_acct=A11,A12,A13 424 251 1 (0.59198113 0.40801887) *
rpart.plot(pruned.tree, type = 3, fallen.leaves = TRUE, extra = 1)
pred_german_train2 <- predict(pruned.tree, train, type="prob")
pred_german_train4 <- predict(pruned.tree, test, type="prob")
cost(train$response, predict(pruned.tree, train, type="prob"))
## [1] 0.6214286
cost(test$response, predict(pruned.tree, test, type="prob"))
## [1] 0.5966667
pred2 <- prediction(pred_german_train2[,2], train$response)
pred3 <- prediction(pred_german_train4[,2], test$response)
slot(performance(pred2, "auc"), "y.values")[[1]]
## [1] 0.6832319
slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.6841463
#In-sample MSE
mean((predict(german.tree) - train$response) ^ 2)
## [1] 0.1463992
#out-of-sample performance
mean((predict(german.tree, newdata = test) - test$response) ^ 2) #default tree
## [1] 0.1829166
mean((predict(german.large.tree) - train$response) ^ 2) #large tree
## [1] 0.3759564
mean((predict(german.large.tree, newdata = test) - test$response) ^ 2) #large tree
## [1] 0.3805692
mean((predict(pruned.tree) - train$response) ^ 2) #pruned tree
## [1] 0.3158154
mean((predict(pruned.tree, newdata = test) - test$response) ^ 2) #large tree
## [1] 0.3179086
glm_model <- glm(formula = response ~ ., data = train)
prob.glm1.insample <- predict(glm_model, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.4342857
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5714286
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area
## [1] 0.8299282
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length
## [1] 0.4813776