Executive Summary:

Problem Testing out several different supervised learning algorithms over credit data of subjects to find the one that accurately predicts if an individual will be good candidate to loan or will tend to default. Random sample a training data set that contains 80% of the original data points. Please use 5:1 cost ratio and pcut=1/6. Methods of analysis includes the Summary statistics of the variables and finding correlation between variables, Exploratory data analysis using visualization. Random sample a training data set that contains 80% of original data points. Build a generalized linear regression as in HW3.Try different link functions (logistic, probit, complementary log-log link) and compare. Fitting various models such as generalized linear regression using different variable selections and finding the best model. Find a best model for Credit Scoring data using logistic regression with AIC and BIC. Draw ROC curve, report the AUC, and present the misclassification rate table of your final model. Test the out-of-sample performance. Calculate 5-fold cross validation, fit a regression tree (CART) on the same data and compare it with the logistic regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.

Approach

This report provides an analysis and evaluation of the factors affecting the risk level of the individuals for taking loan. To start with, the random sample with 80:20 dataset has to be created, find a best model using logistic regression with AIC and BIC and LASSO variable selection. Using the final model, Report in-sample (MSE) and out-of-sample prediction errors (MSPE). Build a regression tree and compare it with the logistic regression model. Finally, repeat the sample process for another random sample with conclusions and detailed comparison tables.

Initially, exploratory data analysis on the dataset has been performed. Further, a logistic regression model has been built to predict customers as risky or not, along with variable selection for the model building process.

Major Results:

The German credit scoring data is a dataset provided by Prof. Hogmann in the file german.data. The data set has information about 1000 individuals, on the basis of which they have been classified as risky or not. The variable response in the dataset corresponds to the risk label, 1 has been classified as bad and 2 has been classified as good.

The Final model has 44% accurate in predicting the defaulters. And, the Asymmetric cost is 0.56125 in the testing dataset. To further the analysis, I will have a look at the model building and variable selection, there are both positive and negative correlation with the response variable which explains that the final model is not as good to predict the testing values as the training values. However, Area Under the curve for the final model has increased from 81.5% to 57.7% which clearly shows that it is not the right model for the prediction. However, AUC for the regression tree denotes 81% which changes slightly to 79% which is an acceptable score and I can conclude that our model is working fairly well for the regression tree.

german_data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

#Assigning variable names
colnames(german_data)=c("chk_acct","duration","credit_his","purpose","amount","saving_acct","present_emp","installment_rate","sex","other_debtor","present_resid","property","age","other_install","housing","n_credits","job","n_people","telephone","foreign","response")

dim(german_data)
## [1] 1000   21
#Response is in 1,2 - we need to change it to 0,1
german_data$response = german_data$response - 1

head(german_data)
##   chk_acct duration credit_his purpose amount saving_acct present_emp
## 1      A11        6        A34     A43   1169         A65         A75
## 2      A12       48        A32     A43   5951         A61         A73
## 3      A14       12        A34     A46   2096         A61         A74
## 4      A11       42        A32     A42   7882         A61         A74
## 5      A11       24        A33     A40   4870         A61         A73
## 6      A14       36        A32     A46   9055         A65         A73
##   installment_rate sex other_debtor present_resid property age other_install
## 1                4 A93         A101             4     A121  67          A143
## 2                2 A92         A101             2     A121  22          A143
## 3                2 A93         A101             3     A121  49          A143
## 4                2 A93         A103             4     A122  45          A143
## 5                3 A93         A101             4     A124  53          A143
## 6                2 A93         A101             4     A124  35          A143
##   housing n_credits  job n_people telephone foreign response
## 1    A152         2 A173        1      A192    A201        0
## 2    A152         1 A173        1      A191    A201        1
## 3    A152         1 A172        2      A191    A201        0
## 4    A153         1 A173        2      A191    A201        0
## 5    A153         2 A173        2      A191    A201        1
## 6    A153         1 A172        2      A192    A201        0
glimpse(german_data)
## Rows: 1,000
## Columns: 21
## $ chk_acct         <chr> "A11", "A12", "A14", "A11", "A11", "A14", "A14", "A1…
## $ duration         <int> 6, 48, 12, 42, 24, 36, 24, 36, 12, 30, 12, 48, 12, 2…
## $ credit_his       <chr> "A34", "A32", "A34", "A32", "A33", "A32", "A32", "A3…
## $ purpose          <chr> "A43", "A43", "A46", "A42", "A40", "A46", "A42", "A4…
## $ amount           <int> 1169, 5951, 2096, 7882, 4870, 9055, 2835, 6948, 3059…
## $ saving_acct      <chr> "A65", "A61", "A61", "A61", "A61", "A65", "A63", "A6…
## $ present_emp      <chr> "A75", "A73", "A74", "A74", "A73", "A73", "A75", "A7…
## $ installment_rate <int> 4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 3, 3, 1, 4, 2, 4, 4, 2…
## $ sex              <chr> "A93", "A92", "A93", "A93", "A93", "A93", "A93", "A9…
## $ other_debtor     <chr> "A101", "A101", "A101", "A103", "A101", "A101", "A10…
## $ present_resid    <int> 4, 2, 3, 4, 4, 4, 4, 2, 4, 2, 1, 4, 1, 4, 4, 2, 4, 3…
## $ property         <chr> "A121", "A121", "A121", "A122", "A124", "A124", "A12…
## $ age              <int> 67, 22, 49, 45, 53, 35, 53, 35, 61, 28, 25, 24, 22, …
## $ other_install    <chr> "A143", "A143", "A143", "A143", "A143", "A143", "A14…
## $ housing          <chr> "A152", "A152", "A152", "A153", "A153", "A153", "A15…
## $ n_credits        <int> 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 3…
## $ job              <chr> "A173", "A173", "A172", "A173", "A173", "A172", "A17…
## $ n_people         <int> 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ telephone        <chr> "A192", "A191", "A191", "A191", "A191", "A192", "A19…
## $ foreign          <chr> "A201", "A201", "A201", "A201", "A201", "A201", "A20…
## $ response         <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0…

Sampling

german_data$sex<- as.factor(german_data$sex)
set.seed(14006542)

#in.train <- createDataPartition(as.factor(german_data$response), p=0.8, list=FALSE)
in.train <- sample(nrow(german_data),nrow(german_data)*0.70)
train <- german_data[in.train,]
test <- german_data[-in.train,]
summary(german_data)
##    chk_acct            duration     credit_his          purpose         
##  Length:1000        Min.   : 4.0   Length:1000        Length:1000       
##  Class :character   1st Qu.:12.0   Class :character   Class :character  
##  Mode  :character   Median :18.0   Mode  :character   Mode  :character  
##                     Mean   :20.9                                        
##                     3rd Qu.:24.0                                        
##                     Max.   :72.0                                        
##      amount      saving_acct        present_emp        installment_rate
##  Min.   :  250   Length:1000        Length:1000        Min.   :1.000   
##  1st Qu.: 1366   Class :character   Class :character   1st Qu.:2.000   
##  Median : 2320   Mode  :character   Mode  :character   Median :3.000   
##  Mean   : 3271                                         Mean   :2.973   
##  3rd Qu.: 3972                                         3rd Qu.:4.000   
##  Max.   :18424                                         Max.   :4.000   
##   sex      other_debtor       present_resid     property        
##  A91: 50   Length:1000        Min.   :1.000   Length:1000       
##  A92:310   Class :character   1st Qu.:2.000   Class :character  
##  A93:548   Mode  :character   Median :3.000   Mode  :character  
##  A94: 92                      Mean   :2.845                     
##                               3rd Qu.:4.000                     
##                               Max.   :4.000                     
##       age        other_install        housing            n_credits    
##  Min.   :19.00   Length:1000        Length:1000        Min.   :1.000  
##  1st Qu.:27.00   Class :character   Class :character   1st Qu.:1.000  
##  Median :33.00   Mode  :character   Mode  :character   Median :1.000  
##  Mean   :35.55                                         Mean   :1.407  
##  3rd Qu.:42.00                                         3rd Qu.:2.000  
##  Max.   :75.00                                         Max.   :4.000  
##      job               n_people      telephone           foreign         
##  Length:1000        Min.   :1.000   Length:1000        Length:1000       
##  Class :character   1st Qu.:1.000   Class :character   Class :character  
##  Mode  :character   Median :1.000   Mode  :character   Mode  :character  
##                     Mean   :1.155                                        
##                     3rd Qu.:1.000                                        
##                     Max.   :2.000                                        
##     response  
##  Min.   :0.0  
##  1st Qu.:0.0  
##  Median :0.0  
##  Mean   :0.3  
##  3rd Qu.:1.0  
##  Max.   :1.0
str(german_data)
## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : chr  "A11" "A12" "A14" "A11" ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : chr  "A34" "A32" "A34" "A32" ...
##  $ purpose         : chr  "A43" "A43" "A46" "A42" ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : chr  "A65" "A61" "A61" "A61" ...
##  $ present_emp     : chr  "A75" "A73" "A74" "A74" ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : chr  "A101" "A101" "A101" "A103" ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : chr  "A121" "A121" "A121" "A122" ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : chr  "A143" "A143" "A143" "A143" ...
##  $ housing         : chr  "A152" "A152" "A152" "A153" ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : chr  "A173" "A173" "A172" "A173" ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : chr  "A192" "A191" "A191" "A191" ...
##  $ foreign         : chr  "A201" "A201" "A201" "A201" ...
##  $ response        : num  0 1 0 0 1 0 0 0 0 1 ...

Data Quality Check

sum(is.na(german_data[,colnames(german_data)]))
## [1] 0
#duration
g1<- ggplot(german_data, aes(x = as.factor(response), y = duration, fill = as.factor(response))) + geom_boxplot() + theme(legend.position = "none")

#amount
g2<-ggplot(german_data, aes(x = as.factor(response), y = amount, fill = as.factor(response))) + geom_boxplot() +
  theme(legend.position = "none")

#AGE
g3<-ggplot(german_data, aes(x = as.factor(response), y = age, fill = as.factor(response))) + 
  geom_boxplot() + theme(legend.position = "none")

ggarrange(g1, g2,g3, labels = c("duration", "amount","age"),
          ncol = 3, nrow = 1)

g4 <- ggplot(german_data, aes(factor(installment_rate), ..count..)) + 
  geom_bar(aes(fill = response), position = "dodge") + xlab("Installment Rates")

EDA for continuous variables

ggplot(german_data, aes(chk_acct, ..count..)) + geom_bar(aes(fill = response), position = "dodge") 

## Model Selection

german_model<- glm(response~., family=binomial, data=train)
summary(german_model)
## 
## Call:
## glm(formula = response ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4484  -0.7036  -0.3852   0.7243   2.4659  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.8851026  1.3845270   0.639 0.522640    
## chk_acctA12       -0.5030711  0.2673708  -1.882 0.059897 .  
## chk_acctA13       -0.4253927  0.4200581  -1.013 0.311204    
## chk_acctA14       -1.6142326  0.2775579  -5.816 6.03e-09 ***
## duration           0.0184715  0.0112729   1.639 0.101303    
## credit_hisA31     -0.1722079  0.6821243  -0.252 0.800687    
## credit_hisA32     -0.9648772  0.5343126  -1.806 0.070945 .  
## credit_hisA33     -1.1817914  0.5741403  -2.058 0.039555 *  
## credit_hisA34     -2.0534852  0.5532425  -3.712 0.000206 ***
## purposeA41        -1.5070005  0.4352014  -3.463 0.000535 ***
## purposeA410       -1.4876300  0.8887980  -1.674 0.094179 .  
## purposeA42        -0.9582826  0.3093139  -3.098 0.001948 ** 
## purposeA43        -1.0555883  0.3014197  -3.502 0.000462 ***
## purposeA44         0.4579776  1.0787464   0.425 0.671168    
## purposeA45        -0.7796829  0.7167067  -1.088 0.276653    
## purposeA46         0.4725778  0.4886015   0.967 0.333442    
## purposeA48        -1.9697065  1.2389177  -1.590 0.111866    
## purposeA49        -0.6796107  0.4099511  -1.658 0.097361 .  
## amount             0.0001302  0.0000520   2.504 0.012294 *  
## saving_acctA62    -0.2421310  0.3338108  -0.725 0.468235    
## saving_acctA63    -0.6939914  0.5295323  -1.311 0.190002    
## saving_acctA64    -0.8499188  0.5955164  -1.427 0.153523    
## saving_acctA65    -0.6238339  0.2992676  -2.085 0.037111 *  
## present_empA72     0.2094153  0.5311415   0.394 0.693379    
## present_empA73     0.1673502  0.5108755   0.328 0.743233    
## present_empA74    -0.3912869  0.5457785  -0.717 0.473415    
## present_empA75     0.0197675  0.5069837   0.039 0.968898    
## installment_rate   0.2857535  0.1083350   2.638 0.008347 ** 
## sexA92             0.0167802  0.4915429   0.034 0.972767    
## sexA93            -0.5008504  0.4739034  -1.057 0.290575    
## sexA94            -0.1609017  0.5728093  -0.281 0.778788    
## other_debtorA102   0.5149136  0.4805445   1.072 0.283935    
## other_debtorA103  -0.8208256  0.5162608  -1.590 0.111848    
## present_resid      0.0160511  0.1088301   0.147 0.882747    
## propertyA122       0.2391509  0.3059571   0.782 0.434421    
## propertyA123       0.0252650  0.2861699   0.088 0.929649    
## propertyA124       0.5562042  0.4908042   1.133 0.257109    
## age               -0.0183609  0.0113005  -1.625 0.104209    
## other_installA142  0.0420092  0.4737015   0.089 0.929334    
## other_installA143 -0.6097567  0.2942054  -2.073 0.038214 *  
## housingA152       -0.2891686  0.2838236  -1.019 0.308283    
## housingA153       -0.3839094  0.5566959  -0.690 0.490432    
## n_credits          0.1383620  0.2320672   0.596 0.551031    
## jobA172            0.4854772  0.8863314   0.548 0.583872    
## jobA173            0.4167046  0.8676803   0.480 0.631049    
## jobA174            0.8721011  0.8788201   0.992 0.321025    
## n_people           0.1470461  0.2955121   0.498 0.618768    
## telephoneA192     -0.3908976  0.2416691  -1.617 0.105772    
## foreignA202       -1.0824703  0.6835033  -1.584 0.113260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 629.88  on 651  degrees of freedom
## AIC: 727.88
## 
## Number of Fisher Scoring iterations: 5
predict_logit<- predict(german_model, test, type = "response")

prob.glm1.insample <- predict(german_model, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)

mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.3514286
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5342857
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area

## [1] 0.8311824
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length

## [1] 0.480079
german_model_probit<- glm(response~., family=binomial(link = "probit"), data=train)
summary(german_model_probit)
## 
## Call:
## glm(formula = response ~ ., family = binomial(link = "probit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4086  -0.7191  -0.3714   0.7540   2.4914  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.495e-01  7.947e-01   0.566 0.571677    
## chk_acctA12       -2.919e-01  1.563e-01  -1.868 0.061767 .  
## chk_acctA13       -2.578e-01  2.465e-01  -1.046 0.295592    
## chk_acctA14       -9.491e-01  1.576e-01  -6.024  1.7e-09 ***
## duration           1.036e-02  6.591e-03   1.571 0.116116    
## credit_hisA31     -1.355e-01  3.973e-01  -0.341 0.733058    
## credit_hisA32     -5.626e-01  3.095e-01  -1.818 0.069129 .  
## credit_hisA33     -6.827e-01  3.332e-01  -2.049 0.040444 *  
## credit_hisA34     -1.185e+00  3.182e-01  -3.723 0.000197 ***
## purposeA41        -8.863e-01  2.479e-01  -3.575 0.000350 ***
## purposeA410       -8.715e-01  5.177e-01  -1.683 0.092284 .  
## purposeA42        -5.377e-01  1.792e-01  -3.001 0.002691 ** 
## purposeA43        -5.884e-01  1.726e-01  -3.409 0.000652 ***
## purposeA44         2.305e-01  6.061e-01   0.380 0.703660    
## purposeA45        -3.915e-01  4.050e-01  -0.967 0.333734    
## purposeA46         2.799e-01  2.828e-01   0.990 0.322272    
## purposeA48        -1.151e+00  6.984e-01  -1.648 0.099395 .  
## purposeA49        -3.676e-01  2.351e-01  -1.564 0.117813    
## amount             7.482e-05  3.014e-05   2.482 0.013065 *  
## saving_acctA62    -1.485e-01  1.942e-01  -0.765 0.444380    
## saving_acctA63    -3.927e-01  2.868e-01  -1.369 0.170944    
## saving_acctA64    -4.477e-01  3.253e-01  -1.376 0.168753    
## saving_acctA65    -3.625e-01  1.703e-01  -2.129 0.033237 *  
## present_empA72     1.193e-01  3.104e-01   0.385 0.700602    
## present_empA73     1.161e-01  2.983e-01   0.389 0.697111    
## present_empA74    -2.054e-01  3.165e-01  -0.649 0.516263    
## present_empA75     2.104e-02  2.956e-01   0.071 0.943251    
## installment_rate   1.621e-01  6.218e-02   2.606 0.009152 ** 
## sexA92             2.404e-02  2.873e-01   0.084 0.933299    
## sexA93            -2.610e-01  2.767e-01  -0.943 0.345638    
## sexA94            -8.603e-02  3.348e-01  -0.257 0.797214    
## other_debtorA102   2.462e-01  2.843e-01   0.866 0.386491    
## other_debtorA103  -4.622e-01  2.885e-01  -1.602 0.109110    
## present_resid      4.170e-03  6.280e-02   0.066 0.947055    
## propertyA122       1.748e-01  1.748e-01   1.000 0.317216    
## propertyA123       2.206e-02  1.652e-01   0.134 0.893731    
## propertyA124       3.308e-01  2.808e-01   1.178 0.238777    
## age               -1.008e-02  6.460e-03  -1.561 0.118577    
## other_installA142  2.403e-02  2.787e-01   0.086 0.931297    
## other_installA143 -3.646e-01  1.705e-01  -2.138 0.032503 *  
## housingA152       -1.854e-01  1.651e-01  -1.123 0.261365    
## housingA153       -2.067e-01  3.182e-01  -0.650 0.515948    
## n_credits          8.351e-02  1.329e-01   0.628 0.529906    
## jobA172            3.226e-01  5.022e-01   0.642 0.520639    
## jobA173            2.938e-01  4.916e-01   0.598 0.550054    
## jobA174            5.674e-01  4.992e-01   1.137 0.255681    
## n_people           8.344e-02  1.703e-01   0.490 0.624223    
## telephoneA192     -2.497e-01  1.382e-01  -1.806 0.070857 .  
## foreignA202       -6.221e-01  3.751e-01  -1.658 0.097236 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 630.34  on 651  degrees of freedom
## AIC: 728.34
## 
## Number of Fisher Scoring iterations: 6
par(mfrow = c(2,2))
plot(german_model_probit)

AIC(german_model_probit)
## [1] 728.3355
BIC(german_model_probit)
## [1] 951.3384
german_model_cloglog<- glm(response~., family=binomial(link = "cloglog"), data=train)
## Warning: glm.fit: algorithm did not converge
summary(german_model_cloglog)
## 
## Call:
## glm(formula = response ~ ., family = binomial(link = "cloglog"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4002  -0.6914  -0.4134   0.6267   2.4210  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -5.450e-02  9.952e-01  -0.055 0.956325    
## chk_acctA12       -3.900e-01  1.913e-01  -2.038 0.041503 *  
## chk_acctA13       -3.044e-01  3.280e-01  -0.928 0.353381    
## chk_acctA14       -1.305e+00  2.202e-01  -5.926 3.10e-09 ***
## duration           1.588e-02  8.139e-03   1.951 0.051054 .  
## credit_hisA31     -2.296e-01  4.360e-01  -0.527 0.598441    
## credit_hisA32     -6.889e-01  3.426e-01  -2.011 0.044336 *  
## credit_hisA33     -8.745e-01  3.830e-01  -2.283 0.022424 *  
## credit_hisA34     -1.572e+00  3.688e-01  -4.264 2.01e-05 ***
## purposeA41        -1.071e+00  3.345e-01  -3.203 0.001361 ** 
## purposeA410       -1.047e+00  6.438e-01  -1.626 0.103972    
## purposeA42        -7.302e-01  2.278e-01  -3.205 0.001349 ** 
## purposeA43        -8.351e-01  2.265e-01  -3.687 0.000227 ***
## purposeA44         1.171e-02  6.099e-01   0.019 0.984681    
## purposeA45        -5.681e-01  5.482e-01  -1.036 0.300048    
## purposeA46         3.346e-01  3.300e-01   1.014 0.310596    
## purposeA48        -1.537e+00  1.045e+00  -1.470 0.141602    
## purposeA49        -6.221e-01  3.036e-01  -2.049 0.040454 *  
## amount             7.678e-05  3.632e-05   2.114 0.034524 *  
## saving_acctA62    -7.698e-02  2.447e-01  -0.315 0.753104    
## saving_acctA63    -7.322e-01  4.572e-01  -1.602 0.109234    
## saving_acctA64    -7.594e-01  4.919e-01  -1.544 0.122648    
## saving_acctA65    -6.267e-01  2.337e-01  -2.681 0.007335 ** 
## present_empA72     1.760e-01  3.744e-01   0.470 0.638378    
## present_empA73     2.487e-01  3.612e-01   0.688 0.491141    
## present_empA74    -2.134e-01  3.961e-01  -0.539 0.590162    
## present_empA75     1.584e-01  3.646e-01   0.434 0.664010    
## installment_rate   2.341e-01  8.116e-02   2.884 0.003921 ** 
## sexA92             2.353e-02  3.669e-01   0.064 0.948862    
## sexA93            -3.845e-01  3.534e-01  -1.088 0.276612    
## sexA94            -1.190e-01  4.353e-01  -0.273 0.784546    
## other_debtorA102   2.049e-01  3.402e-01   0.602 0.547109    
## other_debtorA103  -7.417e-01  4.413e-01  -1.680 0.092861 .  
## present_resid      2.612e-02  8.045e-02   0.325 0.745475    
## propertyA122       1.951e-01  2.366e-01   0.824 0.409682    
## propertyA123      -1.172e-02  2.212e-01  -0.053 0.957752    
## propertyA124       4.575e-01  3.489e-01   1.311 0.189783    
## age               -1.230e-02  8.498e-03  -1.447 0.147904    
## other_installA142  1.054e-01  3.405e-01   0.309 0.756964    
## other_installA143 -4.462e-01  2.160e-01  -2.066 0.038833 *  
## housingA152       -2.544e-01  2.083e-01  -1.221 0.222004    
## housingA153       -4.612e-01  3.821e-01  -1.207 0.227470    
## n_credits          1.263e-01  1.686e-01   0.749 0.453797    
## jobA172            3.958e-01  6.138e-01   0.645 0.519035    
## jobA173            3.515e-01  6.006e-01   0.585 0.558377    
## jobA174            6.752e-01  6.194e-01   1.090 0.275709    
## n_people           1.736e-01  2.246e-01   0.773 0.439636    
## telephoneA192     -2.814e-01  1.820e-01  -1.546 0.122026    
## foreignA202       -7.722e-01  5.437e-01  -1.420 0.155540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 632.40  on 651  degrees of freedom
## AIC: 730.4
## 
## Number of Fisher Scoring iterations: 25
par(mfrow = c(2,2))
plot(german_model_cloglog)

AIC(german_model_cloglog)
## [1] 730.4027
BIC(german_model_cloglog)
## [1] 953.4057
german_model$deviance
## [1] 629.8755
AIC(german_model)
## [1] 727.8755
BIC(german_model)
## [1] 950.8784

Backward AIC

german_model_back_aic <- step(german_model,  direction = "backward") # backward selection (if you don't specify anything)
## Start:  AIC=727.88
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + job + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - present_emp       4   633.62 723.62
## - property          3   631.70 723.70
## - job               3   632.18 724.18
## - housing           2   631.04 725.04
## - present_resid     1   629.90 725.90
## - n_people          1   630.12 726.12
## - n_credits         1   630.23 726.23
## - sex               3   634.53 726.53
## - saving_acct       4   636.89 726.89
## <none>                  629.88 727.88
## - other_debtor      2   634.06 728.06
## - telephone         1   632.53 728.53
## - duration          1   632.56 728.56
## - age               1   632.58 728.58
## - foreign           1   632.76 728.76
## - other_install     2   635.58 729.58
## - amount            1   636.24 732.24
## - installment_rate  1   637.03 733.03
## - purpose           9   660.81 740.81
## - credit_his        4   653.65 743.65
## - chk_acct          3   668.73 760.73
## 
## Step:  AIC=723.62
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + present_resid + 
##     property + age + other_install + housing + n_credits + job + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - property          3   635.55 719.55
## - job               3   635.79 719.79
## - housing           2   634.74 720.74
## - present_resid     1   633.62 721.62
## - n_people          1   633.77 721.77
## - n_credits         1   633.95 721.95
## - saving_acct       4   641.36 723.36
## - sex               3   639.40 723.40
## <none>                  633.62 723.62
## - duration          1   635.76 723.76
## - other_debtor      2   638.11 724.11
## - age               1   636.45 724.45
## - telephone         1   636.65 724.65
## - foreign           1   636.73 724.73
## - other_install     2   639.21 725.21
## - amount            1   639.81 727.81
## - installment_rate  1   640.88 728.88
## - purpose           9   664.77 736.77
## - credit_his        4   659.78 741.78
## - chk_acct          3   673.27 757.27
## 
## Step:  AIC=719.55
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + present_resid + 
##     age + other_install + housing + n_credits + job + n_people + 
##     telephone + foreign
## 
##                    Df Deviance    AIC
## - job               3   637.97 715.97
## - housing           2   637.52 717.52
## - present_resid     1   635.55 717.55
## - n_people          1   635.67 717.67
## - n_credits         1   635.78 717.78
## - sex               3   641.03 719.03
## - saving_acct       4   643.08 719.08
## <none>                  635.55 719.55
## - duration          1   637.68 719.68
## - telephone         1   638.15 720.15
## - foreign           1   638.33 720.33
## - age               1   638.45 720.45
## - other_debtor      2   640.46 720.46
## - other_install     2   641.33 721.33
## - amount            1   642.38 724.38
## - installment_rate  1   643.21 725.21
## - purpose           9   668.17 734.17
## - credit_his        4   662.05 738.05
## - chk_acct          3   676.25 754.25
## 
## Step:  AIC=715.97
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + present_resid + 
##     age + other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - present_resid     1   637.98 713.98
## - n_people          1   638.10 714.10
## - n_credits         1   638.12 714.12
## - housing           2   640.15 714.15
## - sex               3   643.52 715.52
## - telephone         1   639.55 715.55
## - duration          1   639.70 715.70
## <none>                  637.97 715.97
## - age               1   640.51 716.51
## - foreign           1   640.68 716.68
## - saving_acct       4   646.74 716.74
## - other_debtor      2   643.17 717.17
## - other_install     2   644.34 718.34
## - amount            1   647.10 723.10
## - installment_rate  1   647.44 723.44
## - purpose           9   669.34 729.34
## - credit_his        4   664.24 734.24
## - chk_acct          3   678.43 750.43
## 
## Step:  AIC=713.98
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + housing + n_credits + n_people + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - n_people          1   638.11 712.11
## - n_credits         1   638.13 712.13
## - housing           2   640.31 712.31
## - sex               3   643.52 713.52
## - telephone         1   639.58 713.58
## - duration          1   639.70 713.70
## <none>                  637.98 713.98
## - foreign           1   640.68 714.68
## - age               1   640.74 714.74
## - saving_acct       4   646.88 714.88
## - other_debtor      2   643.18 715.18
## - other_install     2   644.41 716.41
## - amount            1   647.15 721.15
## - installment_rate  1   647.45 721.45
## - purpose           9   669.45 727.45
## - credit_his        4   664.30 732.30
## - chk_acct          3   678.48 748.48
## 
## Step:  AIC=712.11
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + housing + n_credits + telephone + foreign
## 
##                    Df Deviance    AIC
## - n_credits         1   638.28 710.28
## - housing           2   640.51 710.51
## - sex               3   643.57 711.57
## - telephone         1   639.74 711.74
## - duration          1   639.83 711.83
## <none>                  638.11 712.11
## - foreign           1   640.82 712.82
## - age               1   640.83 712.83
## - saving_acct       4   646.93 712.93
## - other_debtor      2   643.20 713.20
## - other_install     2   644.56 714.56
## - amount            1   647.20 719.20
## - installment_rate  1   647.46 719.46
## - purpose           9   669.57 725.57
## - credit_his        4   664.83 730.83
## - chk_acct          3   678.75 746.75
## 
## Step:  AIC=710.28
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + housing + telephone + foreign
## 
##                    Df Deviance    AIC
## - housing           2   640.67 708.67
## - sex               3   643.65 709.65
## - telephone         1   639.83 709.83
## - duration          1   639.94 709.94
## <none>                  638.28 710.28
## - age               1   640.94 710.94
## - foreign           1   641.08 711.08
## - saving_acct       4   647.20 711.20
## - other_debtor      2   643.35 711.35
## - other_install     2   645.15 713.15
## - amount            1   647.41 717.41
## - installment_rate  1   647.54 717.54
## - purpose           9   669.97 723.97
## - credit_his        4   667.24 731.24
## - chk_acct          3   678.87 744.87
## 
## Step:  AIC=708.67
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + telephone + foreign
## 
##                    Df Deviance    AIC
## - telephone         1   642.20 708.20
## - duration          1   642.51 708.51
## - sex               3   646.67 708.67
## <none>                  640.67 708.67
## - age               1   643.24 709.24
## - foreign           1   643.72 709.72
## - saving_acct       4   649.77 709.77
## - other_debtor      2   646.16 710.16
## - other_install     2   647.17 711.17
## - amount            1   650.17 716.17
## - installment_rate  1   650.40 716.40
## - purpose           9   672.97 722.97
## - credit_his        4   671.87 731.87
## - chk_acct          3   684.54 746.54
## 
## Step:  AIC=708.2
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + sex + other_debtor + age + 
##     other_install + foreign
## 
##                    Df Deviance    AIC
## - sex               3   648.03 708.03
## <none>                  642.20 708.20
## - duration          1   644.21 708.21
## - foreign           1   645.01 709.01
## - saving_acct       4   651.53 709.53
## - other_debtor      2   647.58 709.58
## - age               1   645.59 709.59
## - other_install     2   648.82 710.82
## - amount            1   650.48 714.48
## - installment_rate  1   651.48 715.48
## - purpose           9   675.00 723.00
## - credit_his        4   673.71 731.71
## - chk_acct          3   686.96 746.96
## 
## Step:  AIC=708.03
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + installment_rate + other_debtor + age + other_install + 
##     foreign
## 
##                    Df Deviance    AIC
## - duration          1   649.89 707.89
## <none>                  648.03 708.03
## - foreign           1   651.18 709.18
## - saving_acct       4   657.61 709.61
## - other_install     2   653.69 709.69
## - other_debtor      2   653.85 709.85
## - age               1   653.02 711.02
## - amount            1   654.76 712.76
## - installment_rate  1   655.49 713.49
## - purpose           9   680.16 722.16
## - credit_his        4   680.78 732.78
## - chk_acct          3   693.23 747.23
## 
## Step:  AIC=707.89
## response ~ chk_acct + credit_his + purpose + amount + saving_acct + 
##     installment_rate + other_debtor + age + other_install + foreign
## 
##                    Df Deviance    AIC
## <none>                  649.89 707.89
## - saving_acct       4   659.21 709.21
## - other_install     2   655.48 709.48
## - foreign           1   653.56 709.56
## - other_debtor      2   655.62 709.62
## - age               1   655.42 711.42
## - installment_rate  1   660.55 716.55
## - purpose           9   681.44 721.44
## - amount            1   667.80 723.80
## - credit_his        4   683.96 733.96
## - chk_acct          3   697.12 749.12
summary(german_model_back_aic)
## 
## Call:
## glm(formula = response ~ chk_acct + credit_his + purpose + amount + 
##     saving_acct + installment_rate + other_debtor + age + other_install + 
##     foreign, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3321  -0.7029  -0.4257   0.7341   2.5649  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.958e+00  7.282e-01   2.689 0.007161 ** 
## chk_acctA12       -5.317e-01  2.534e-01  -2.098 0.035918 *  
## chk_acctA13       -5.632e-01  4.136e-01  -1.362 0.173217    
## chk_acctA14       -1.708e+00  2.674e-01  -6.385 1.71e-10 ***
## credit_hisA31     -4.705e-01  6.335e-01  -0.743 0.457670    
## credit_hisA32     -1.286e+00  4.874e-01  -2.639 0.008319 ** 
## credit_hisA33     -1.406e+00  5.425e-01  -2.592 0.009534 ** 
## credit_hisA34     -2.325e+00  5.194e-01  -4.476 7.60e-06 ***
## purposeA41        -1.281e+00  4.125e-01  -3.105 0.001903 ** 
## purposeA410       -1.455e+00  8.476e-01  -1.717 0.086015 .  
## purposeA42        -8.260e-01  2.944e-01  -2.806 0.005015 ** 
## purposeA43        -1.017e+00  2.867e-01  -3.548 0.000389 ***
## purposeA44         5.379e-01  9.912e-01   0.543 0.587332    
## purposeA45        -7.916e-01  6.726e-01  -1.177 0.239242    
## purposeA46         6.498e-01  4.769e-01   1.363 0.173020    
## purposeA48        -2.212e+00  1.209e+00  -1.829 0.067347 .  
## purposeA49        -7.256e-01  3.881e-01  -1.870 0.061529 .  
## amount             1.576e-04  3.803e-05   4.144 3.42e-05 ***
## saving_acctA62    -1.907e-01  3.190e-01  -0.598 0.550084    
## saving_acctA63    -7.569e-01  5.227e-01  -1.448 0.147609    
## saving_acctA64    -9.867e-01  5.826e-01  -1.693 0.090364 .  
## saving_acctA65    -6.688e-01  2.856e-01  -2.341 0.019216 *  
## installment_rate   3.128e-01  9.794e-02   3.194 0.001403 ** 
## other_debtorA102   5.970e-01  4.568e-01   1.307 0.191278    
## other_debtorA103  -9.002e-01  5.017e-01  -1.794 0.072786 .  
## age               -2.165e-02  9.373e-03  -2.310 0.020903 *  
## other_installA142 -1.198e-01  4.564e-01  -0.263 0.792902    
## other_installA143 -6.231e-01  2.833e-01  -2.199 0.027867 *  
## foreignA202       -1.156e+00  6.532e-01  -1.769 0.076863 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 649.89  on 671  degrees of freedom
## AIC: 707.89
## 
## Number of Fisher Scoring iterations: 5
german_model_back_aic$deviance
## [1] 649.8939
AIC(german_model_back_aic)
## [1] 707.8939
BIC(german_model_back_aic)
## [1] 839.8752
german_model.back_bic <- step(german_model, k=log(nrow(train))) 
## Start:  AIC=950.88
## response ~ chk_acct + duration + credit_his + purpose + amount + 
##     saving_acct + present_emp + installment_rate + sex + other_debtor + 
##     present_resid + property + age + other_install + housing + 
##     n_credits + job + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - purpose           9   660.81 922.85
## - present_emp       4   633.62 928.41
## - saving_acct       4   636.89 931.69
## - property          3   631.70 933.05
## - job               3   632.18 933.53
## - sex               3   634.53 935.88
## - housing           2   631.04 938.94
## - other_debtor      2   634.06 941.96
## - other_install     2   635.58 943.48
## - present_resid     1   629.90 944.35
## - n_people          1   630.12 944.57
## - n_credits         1   630.23 944.68
## - telephone         1   632.53 946.98
## - duration          1   632.56 947.02
## - age               1   632.58 947.03
## - foreign           1   632.76 947.21
## - credit_his        4   653.65 948.45
## - amount            1   636.24 950.69
## <none>                  629.88 950.88
## - installment_rate  1   637.03 951.48
## - chk_acct          3   668.73 970.08
## 
## Step:  AIC=922.85
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     present_emp + installment_rate + sex + other_debtor + present_resid + 
##     property + age + other_install + housing + n_credits + job + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - present_emp       4   664.77 900.61
## - saving_acct       4   667.68 903.51
## - job               3   661.97 904.36
## - property          3   664.25 906.64
## - sex               3   664.79 907.18
## - housing           2   662.29 911.23
## - other_install     2   665.17 914.12
## - other_debtor      2   666.44 915.38
## - present_resid     1   660.81 916.30
## - n_people          1   661.05 916.54
## - n_credits         1   661.47 916.96
## - age               1   661.90 917.39
## - foreign           1   662.36 917.85
## - duration          1   662.67 918.16
## - credit_his        4   682.77 918.61
## - telephone         1   663.83 919.33
## - amount            1   665.80 921.29
## <none>                  660.81 922.85
## - installment_rate  1   667.73 923.22
## - chk_acct          3   704.27 946.66
## 
## Step:  AIC=900.61
## response ~ chk_acct + duration + credit_his + amount + saving_acct + 
##     installment_rate + sex + other_debtor + present_resid + property + 
##     age + other_install + housing + n_credits + job + n_people + 
##     telephone + foreign
## 
##                    Df Deviance    AIC
## - saving_acct       4   672.10 881.73
## - job               3   665.71 881.89
## - property          3   668.17 884.36
## - sex               3   669.79 885.98
## - housing           2   666.21 888.94
## - other_install     2   668.91 891.65
## - other_debtor      2   670.55 893.29
## - present_resid     1   664.86 894.14
## - n_people          1   664.95 894.24
## - n_credits         1   665.41 894.70
## - age               1   666.23 895.52
## - duration          1   666.25 895.54
## - foreign           1   666.38 895.67
## - telephone         1   668.18 897.46
## - credit_his        4   689.10 898.73
## - amount            1   669.73 899.02
## <none>                  664.77 900.61
## - installment_rate  1   671.54 900.83
## - chk_acct          3   708.68 924.86
## 
## Step:  AIC=881.73
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     sex + other_debtor + present_resid + property + age + other_install + 
##     housing + n_credits + job + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - job               3   673.85 863.83
## - property          3   675.28 865.26
## - sex               3   677.13 867.12
## - housing           2   673.32 869.85
## - other_install     2   676.47 873.00
## - other_debtor      2   677.15 873.68
## - n_people          1   672.17 875.26
## - present_resid     1   672.34 875.42
## - n_credits         1   672.97 876.05
## - duration          1   673.59 876.67
## - age               1   673.94 877.02
## - foreign           1   674.11 877.19
## - telephone         1   676.14 879.22
## - amount            1   676.79 879.87
## - credit_his        4   696.94 880.37
## - installment_rate  1   678.55 881.64
## <none>                  672.10 881.73
## - chk_acct          3   726.06 916.04
## 
## Step:  AIC=863.83
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     sex + other_debtor + present_resid + property + age + other_install + 
##     housing + n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - property          3   677.20 847.53
## - sex               3   679.11 849.44
## - housing           2   675.06 851.94
## - other_install     2   678.93 855.81
## - other_debtor      2   679.24 856.12
## - n_people          1   673.94 857.37
## - present_resid     1   674.13 857.56
## - n_credits         1   674.60 858.03
## - duration          1   674.95 858.38
## - age               1   675.41 858.84
## - foreign           1   675.79 859.22
## - telephone         1   677.00 860.43
## - credit_his        4   698.46 862.24
## <none>                  673.85 863.83
## - amount            1   680.55 863.98
## - installment_rate  1   681.64 865.07
## - chk_acct          3   727.96 898.29
## 
## Step:  AIC=847.53
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     sex + other_debtor + present_resid + age + other_install + 
##     housing + n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - sex               3   682.04 832.72
## - housing           2   680.56 837.78
## - other_install     2   682.51 839.74
## - other_debtor      2   683.49 840.72
## - n_people          1   677.25 841.02
## - present_resid     1   677.47 841.25
## - n_credits         1   677.75 841.53
## - duration          1   678.28 842.05
## - age               1   678.76 842.53
## - foreign           1   678.83 842.60
## - telephone         1   679.66 843.43
## - credit_his        4   701.88 846.01
## <none>                  677.20 847.53
## - amount            1   684.95 848.72
## - installment_rate  1   685.88 849.66
## - chk_acct          3   732.77 883.45
## 
## Step:  AIC=832.72
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     other_debtor + present_resid + age + other_install + housing + 
##     n_credits + n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - housing           2   685.83 823.41
## - other_install     2   686.93 824.50
## - other_debtor      2   688.39 825.97
## - n_people          1   682.13 826.25
## - present_resid     1   682.29 826.41
## - n_credits         1   682.48 826.61
## - duration          1   683.14 827.26
## - foreign           1   683.91 828.03
## - age               1   683.96 828.08
## - telephone         1   684.40 828.53
## - credit_his        4   707.36 831.83
## <none>                  682.04 832.72
## - amount            1   688.62 832.74
## - installment_rate  1   688.87 832.99
## - chk_acct          3   739.52 870.54
## 
## Step:  AIC=823.41
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     other_debtor + present_resid + age + other_install + n_credits + 
##     n_people + telephone + foreign
## 
##                    Df Deviance    AIC
## - other_install     2   690.51 814.98
## - n_people          1   685.87 816.89
## - present_resid     1   685.88 816.90
## - n_credits         1   686.20 817.23
## - other_debtor      2   692.82 817.29
## - duration          1   686.91 817.93
## - foreign           1   687.87 818.89
## - telephone         1   688.19 819.21
## - age               1   688.57 819.59
## <none>                  685.83 823.41
## - installment_rate  1   692.73 823.75
## - amount            1   692.98 824.00
## - credit_his        4   713.76 825.12
## - chk_acct          3   746.20 864.12
## 
## Step:  AIC=814.98
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     other_debtor + present_resid + age + n_credits + n_people + 
##     telephone + foreign
## 
##                    Df Deviance    AIC
## - n_people          1   690.52 808.43
## - present_resid     1   690.52 808.44
## - other_debtor      2   697.08 808.45
## - n_credits         1   691.25 809.17
## - duration          1   691.58 809.50
## - foreign           1   692.65 810.57
## - age               1   692.96 810.88
## - telephone         1   693.00 810.92
## <none>                  690.51 814.98
## - installment_rate  1   697.36 815.28
## - amount            1   698.02 815.94
## - credit_his        4   724.25 822.51
## - chk_acct          3   749.45 854.27
## 
## Step:  AIC=808.43
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     other_debtor + present_resid + age + n_credits + telephone + 
##     foreign
## 
##                    Df Deviance    AIC
## - present_resid     1   690.52 801.89
## - other_debtor      2   697.19 802.01
## - n_credits         1   691.25 802.62
## - duration          1   691.59 802.95
## - foreign           1   692.66 804.02
## - telephone         1   693.00 804.37
## - age               1   693.02 804.39
## <none>                  690.52 808.43
## - installment_rate  1   697.41 808.78
## - amount            1   698.02 809.38
## - credit_his        4   724.43 816.15
## - chk_acct          3   749.45 847.72
## 
## Step:  AIC=801.89
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     other_debtor + age + n_credits + telephone + foreign
## 
##                    Df Deviance    AIC
## - other_debtor      2   697.24 795.50
## - n_credits         1   691.28 796.09
## - duration          1   691.61 796.42
## - foreign           1   692.69 797.51
## - telephone         1   693.01 797.83
## - age               1   693.12 797.94
## <none>                  690.52 801.89
## - installment_rate  1   697.43 802.25
## - amount            1   698.02 802.84
## - credit_his        4   724.43 809.60
## - chk_acct          3   749.81 841.52
## 
## Step:  AIC=795.5
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     age + n_credits + telephone + foreign
## 
##                    Df Deviance    AIC
## - n_credits         1   697.86 789.58
## - duration          1   698.20 789.91
## - telephone         1   699.47 791.19
## - foreign           1   699.61 791.33
## - age               1   700.02 791.73
## <none>                  697.24 795.50
## - installment_rate  1   704.59 796.30
## - amount            1   706.24 797.95
## - credit_his        4   731.35 803.41
## - chk_acct          3   755.79 834.40
## 
## Step:  AIC=789.58
## response ~ chk_acct + duration + credit_his + amount + installment_rate + 
##     age + telephone + foreign
## 
##                    Df Deviance    AIC
## - duration          1   698.75 783.92
## - telephone         1   699.96 785.12
## - foreign           1   700.32 785.48
## - age               1   700.57 785.73
## <none>                  697.86 789.58
## - installment_rate  1   705.10 790.27
## - amount            1   706.98 792.15
## - credit_his        4   732.32 797.83
## - chk_acct          3   756.20 828.26
## 
## Step:  AIC=783.92
## response ~ chk_acct + credit_his + amount + installment_rate + 
##     age + telephone + foreign
## 
##                    Df Deviance    AIC
## - telephone         1   700.95 779.57
## - foreign           1   701.59 780.20
## - age               1   701.75 780.36
## <none>                  698.75 783.92
## - installment_rate  1   708.53 787.15
## - credit_his        4   734.13 793.09
## - amount            1   719.40 798.01
## - chk_acct          3   758.47 823.98
## 
## Step:  AIC=779.57
## response ~ chk_acct + credit_his + amount + installment_rate + 
##     age + foreign
## 
##                    Df Deviance    AIC
## - foreign           1   703.50 775.56
## - age               1   704.92 776.98
## <none>                  700.95 779.57
## - installment_rate  1   710.38 782.44
## - credit_his        4   737.41 789.82
## - amount            1   719.44 791.50
## - chk_acct          3   762.47 821.43
## 
## Step:  AIC=775.56
## response ~ chk_acct + credit_his + amount + installment_rate + 
##     age
## 
##                    Df Deviance    AIC
## - age               1   707.36 772.87
## <none>                  703.50 775.56
## - installment_rate  1   713.86 779.37
## - credit_his        4   740.14 785.99
## - amount            1   722.29 787.80
## - chk_acct          3   764.99 817.40
## 
## Step:  AIC=772.87
## response ~ chk_acct + credit_his + amount + installment_rate
## 
##                    Df Deviance    AIC
## <none>                  707.36 772.87
## - installment_rate  1   716.76 775.72
## - amount            1   724.37 783.33
## - credit_his        4   746.30 785.60
## - chk_acct          3   769.56 815.41
summary(german_model.back_bic)
## 
## Call:
## glm(formula = response ~ chk_acct + credit_his + amount + installment_rate, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0031  -0.8105  -0.4888   0.8972   2.4994  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       0.1529180  0.5610828   0.273  0.78521    
## chk_acctA12      -0.5908353  0.2259036  -2.615  0.00891 ** 
## chk_acctA13      -0.7132500  0.3852302  -1.851  0.06410 .  
## chk_acctA14      -1.8169105  0.2472187  -7.349 1.99e-13 ***
## credit_hisA31    -0.4375497  0.5847990  -0.748  0.45434    
## credit_hisA32    -1.3956104  0.4592054  -3.039  0.00237 ** 
## credit_hisA33    -1.3501579  0.5228587  -2.582  0.00982 ** 
## credit_hisA34    -2.3322972  0.4921102  -4.739 2.14e-06 ***
## amount            0.0001362  0.0000337   4.043 5.27e-05 ***
## installment_rate  0.2698017  0.0897746   3.005  0.00265 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 707.36  on 690  degrees of freedom
## AIC: 727.36
## 
## Number of Fisher Scoring iterations: 4
german_model.back_bic$deviance
## [1] 707.3607
AIC(german_model.back_bic)
## [1] 727.3607
BIC(german_model.back_bic)
## [1] 772.8715

Lasso variable selection

nums <- sapply(train, is.numeric)
nums <- as.vector(which(nums==TRUE))
germandata.nums.std <- scale(german_data[,nums])

# change categorical variables to dummy variables
dummy <- model.matrix( ~ chk_acct+credit_his+purpose+saving_acct+present_emp+               sex+other_debtor+property+other_install+housing+job+telephone+foreign-1, 
                       data= german_data)
y.std <- german_data$response
germandata.std <- cbind(dummy,germandata.nums.std,y.std)
dim(germandata.std)
## [1] 1000   51
X.train <- as.matrix(germandata.std[in.train,-50])
Y.train <- as.factor(germandata.std[in.train,50])

lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
summary(lasso.fit)
##            Length Class     Mode     
## a0           72   -none-    numeric  
## beta       3600   dgCMatrix S4       
## df           72   -none-    numeric  
## dim           2   -none-    numeric  
## lambda       72   -none-    numeric  
## dev.ratio    72   -none-    numeric  
## nulldev       1   -none-    numeric  
## npasses       1   -none-    numeric  
## jerr          1   -none-    numeric  
## offset        1   -none-    logical  
## classnames    2   -none-    character
## call          5   -none-    call     
## nobs          1   -none-    numeric
plot(lasso.fit, xvar = "lambda", label=TRUE)
## Warning in plotCoef(x$beta, lambda = x$lambda, df = x$df, dev = x$dev.ratio, : 1
## or less nonzero coefficients; glmnet plot is not meaningful

lasso.fit<- glmnet(x=X.train, y=Y.train, family = "binomial", alpha = 1)
plot(lasso.fit, xvar = "lambda", label=TRUE)
## Warning in plotCoef(x$beta, lambda = x$lambda, df = x$df, dev = x$dev.ratio, : 1
## or less nonzero coefficients; glmnet plot is not meaningful

Using cross validation to find perfect lambda value

cv.lasso<- cv.glmnet(x = X.train, y = Y.train, family = "binomial", alpha = 1, nfolds = 10)
plot(cv.lasso)

coef(lasso.fit, s=cv.lasso$lambda.min)
## 51 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)       -7.822736
## chk_acctA11        .       
## chk_acctA12        .       
## chk_acctA13        .       
## chk_acctA14        .       
## credit_hisA31      .       
## credit_hisA32      .       
## credit_hisA33      .       
## credit_hisA34      .       
## purposeA41         .       
## purposeA410        .       
## purposeA42         .       
## purposeA43         .       
## purposeA44         .       
## purposeA45         .       
## purposeA46         .       
## purposeA48         .       
## purposeA49         .       
## saving_acctA62     .       
## saving_acctA63     .       
## saving_acctA64     .       
## saving_acctA65     .       
## present_empA72     .       
## present_empA73     .       
## present_empA74     .       
## present_empA75     .       
## sexA92             .       
## sexA93             .       
## sexA94             .       
## other_debtorA102   .       
## other_debtorA103   .       
## propertyA122       .       
## propertyA123       .       
## propertyA124       .       
## other_installA142  .       
## other_installA143  .       
## housingA152        .       
## housingA153        .       
## jobA172            .       
## jobA173            .       
## jobA174            .       
## telephoneA192      .       
## foreignA202        .       
## duration           .       
## amount             .       
## installment_rate   .       
## present_resid      .       
## age                .       
## n_credits          .       
## n_people           .       
## y.std             14.777128
cv.lasso$lambda.min
## [1] 0.0006174812

Final logistic model for GLM

For our final model, we select the following variables:

credit.glm.final <- glm(response ~ chk_acct + duration + credit_his + purpose + amount + saving_acct + installment_rate + sex + other_debtor + age + other_install + n_people + foreign, family = "binomial", train)

summary(credit.glm.final)
## 
## Call:
## glm(formula = response ~ chk_acct + duration + credit_his + purpose + 
##     amount + saving_acct + installment_rate + sex + other_debtor + 
##     age + other_install + n_people + foreign, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3337  -0.7165  -0.4110   0.7016   2.6315  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        1.655e+00  9.572e-01   1.729 0.083853 .  
## chk_acctA12       -5.197e-01  2.584e-01  -2.012 0.044262 *  
## chk_acctA13       -4.977e-01  4.143e-01  -1.201 0.229708    
## chk_acctA14       -1.680e+00  2.710e-01  -6.199 5.68e-10 ***
## duration           1.523e-02  1.073e-02   1.420 0.155683    
## credit_hisA31     -4.279e-01  6.440e-01  -0.664 0.506398    
## credit_hisA32     -1.248e+00  5.014e-01  -2.488 0.012835 *  
## credit_hisA33     -1.358e+00  5.567e-01  -2.439 0.014717 *  
## credit_hisA34     -2.257e+00  5.337e-01  -4.229 2.35e-05 ***
## purposeA41        -1.358e+00  4.201e-01  -3.233 0.001224 ** 
## purposeA410       -1.401e+00  8.679e-01  -1.614 0.106567    
## purposeA42        -8.784e-01  2.984e-01  -2.943 0.003246 ** 
## purposeA43        -1.112e+00  2.949e-01  -3.769 0.000164 ***
## purposeA44         3.808e-01  1.033e+00   0.369 0.712411    
## purposeA45        -7.828e-01  6.784e-01  -1.154 0.248492    
## purposeA46         5.942e-01  4.801e-01   1.238 0.215901    
## purposeA48        -2.166e+00  1.238e+00  -1.750 0.080178 .  
## purposeA49        -8.025e-01  3.981e-01  -2.016 0.043822 *  
## amount             1.368e-04  4.773e-05   2.866 0.004153 ** 
## saving_acctA62    -2.121e-01  3.206e-01  -0.661 0.508303    
## saving_acctA63    -7.487e-01  5.271e-01  -1.421 0.155444    
## saving_acctA64    -1.007e+00  5.861e-01  -1.719 0.085703 .  
## saving_acctA65    -6.844e-01  2.886e-01  -2.371 0.017738 *  
## installment_rate   3.153e-01  1.042e-01   3.025 0.002490 ** 
## sexA92             6.420e-02  4.699e-01   0.137 0.891328    
## sexA93            -5.053e-01  4.560e-01  -1.108 0.267723    
## sexA94            -1.236e-01  5.523e-01  -0.224 0.822918    
## other_debtorA102   6.289e-01  4.605e-01   1.366 0.172016    
## other_debtorA103  -8.786e-01  5.104e-01  -1.721 0.085186 .  
## age               -1.770e-02  9.598e-03  -1.845 0.065079 .  
## other_installA142 -3.485e-02  4.612e-01  -0.076 0.939762    
## other_installA143 -6.559e-01  2.852e-01  -2.300 0.021470 *  
## n_people           1.447e-01  2.889e-01   0.501 0.616478    
## foreignA202       -1.037e+00  6.682e-01  -1.551 0.120804    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 850.06  on 699  degrees of freedom
## Residual deviance: 641.95  on 666  degrees of freedom
## AIC: 709.95
## 
## Number of Fisher Scoring iterations: 5
AIC(credit.glm.final)
## [1] 709.9487
BIC(credit.glm.final)
## [1] 864.6855
credit.glm.final$deviance
## [1] 641.9487
prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)

mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.37
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5414286
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area

## [1] 0.8239214
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length

## [1] 0.4890136

Model Evaluation:

prob.glm1.insample <- predict(credit.glm.final, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)
mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.37

Confusion Matrix

table(train$response, predicted.glm1.insample, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth   0   1
##     0 259 234
##     1  25 182
credit.glm.final$deviance
## [1] 641.9487
roc.plot(train$response == "1", prob.glm1.insample)
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area

## [1] 0.8239214

Prediction

pred_aic <- predict(credit.glm.final,type="response")
hist(pred_aic)

#### Cut off 1/6

table(train$response, (pred_aic > 0.167)*1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 259 234
##     1  25 182

Out-of-sample prediction (more important)

Misclassificatuon Rate of Final Rate for out sample

#out of sample
prob.glm2.outsample <- predict(credit.glm.final, test, type = "response")
predicted.glm2.outsample <- prob.glm2.outsample > 0.1667
predicted.glm2.outsample <- as.numeric(predicted.glm2.outsample)
table(test$response, predicted.glm2.outsample, dnn = c("Truth", "Predicted"))
##      Predicted
## Truth   0   1
##     0 116  91
##     1   8  85
mean(ifelse(test$response != predicted.glm2.outsample, 1, 0))
## [1] 0.33

AUC of Final Rate for insample and outsample

roc.plot(test$response == "1", prob.glm2.outsample)
roc.plot(test$response == "1", prob.glm2.outsample)$roc.vol$Area

## [1] 0.8043738
#Cross Validation
model1.glm <- glm(response ~ ., data = german_data)

model2.glm <- glm(response ~ chk_acct + duration + credit_his + purpose + amount + saving_acct + installment_rate + sex + other_debtor + age + other_install + n_people + foreign, family = "binomial", german_data)

cv.glm(data = german_data, glmfit = model1.glm, K = 5)$delta[2]
## [1] 0.1690591
cv.glm(data = german_data, glmfit = model2.glm, K = 5)$delta[2]
## [1] 0.1631477

Intepretation

#cost function define
costfunc = function(obs, pred.p){
  pcut=.167
  weight1 = 5   # define the weight for "true=1 but pred=0" (FN)
  weight0 = 1    # define the weight for "true=0 but pred=1" (FP)
  c1 = (obs==1)&(pred.p<pcut)    # count for "true=1 but pred=0"   (FN)
  c0 = (obs==0)&(pred.p>=pcut)   # count for "true=0 but pred=1"   (FP)
  cost = mean(weight1*c1 + weight0*c0)  # misclassification with weight
  return(cost) # you have to return to a value when you write R functions
} # end of the function
german_glm<- glm(response ~ . , family=binomial, data=german_data);  
cv_result  <- cv.glm(data= german_data, 
                     glmfit=german_glm, cost=costfunc, K=5)  
cv_result$delta[2]
## [1] 0.5676
#asymettric cost
costfunc((pred_aic > 0.167)*1,train$response)
## [1] 1.707143
cost1 <- function(r, pi) {
  weight1 = 5
  weight0 = 1
  c1 = (r == 1) & (pi < 0.17)  #logical vector - true if actual 1 but predict 0
  c0 = (r == 0) & (pi > 0.17)  #logical vecotr - true if actual 0 but predict 1
  return(mean(weight1 * c1 + weight0 * c0))
}

(v) Fit a regression tree (CART) on the same data; repeat the above step (iii).

#default value of cp = 0.01
german.tree <- rpart(response ~ ., data = train)

german.tree
## n= 700 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 700 145.787100 0.2957143  
##     2) chk_acct=A13,A14 323  39.448920 0.1424149  
##       4) other_install=A143 265  25.826420 0.1094340 *
##       5) other_install=A141,A142 58  12.017240 0.2931034  
##        10) purpose=A41,A410,A42,A43,A46 38   5.052632 0.1578947 *
##        11) purpose=A40,A49 20   4.950000 0.5500000 *
##     3) chk_acct=A11,A12 377  92.244030 0.4270557  
##       6) credit_his=A32,A33,A34 332  78.876510 0.3885542  
##        12) duration< 27.5 252  55.317460 0.3253968  
##          24) amount< 8724.5 245  52.040820 0.3061224  
##            48) purpose=A41,A410,A42,A43,A45,A48,A49 165  29.781820 0.2363636 *
##            49) purpose=A40,A44,A46 80  19.800000 0.4500000  
##              98) amount>=1551.5 41   7.560976 0.2439024 *
##              99) amount< 1551.5 39   8.666667 0.6666667  
##               198) age>=36.5 14   2.857143 0.2857143 *
##               199) age< 36.5 25   2.640000 0.8800000 *
##          25) amount>=8724.5 7   0.000000 1.0000000 *
##        13) duration>=27.5 80  19.387500 0.5875000  
##          26) property=A122 20   4.550000 0.3500000 *
##          27) property=A121,A123,A124 60  13.333330 0.6666667 *
##       7) credit_his=A30,A31 45   9.244444 0.7111111  
##        14) housing=A152,A153 32   7.718750 0.5937500  
##          28) property=A121,A123 13   2.769231 0.3076923 *
##          29) property=A122,A124 19   3.157895 0.7894737 *
##        15) housing=A151 13   0.000000 1.0000000 *
plotcp(german.tree)

#Plotting the tree
rpart.plot(german.tree, type = 3, fallen.leaves = TRUE)

credit_rpart0 <- rpart(formula = response ~ ., data = train, method = "class")

credit_rpart <- rpart(formula = response ~ . , data = train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))

pred0 <- predict(credit_rpart0, type="class")
table(train$response, pred0, dnn = c("True", "Pred"))
##     Pred
## True   0   1
##    0 474  19
##    1 115  92
pred_german_train<- predict(credit_rpart, train, type="class")
credit_test_prob_rpart = predict(credit_rpart, test, type="prob")

table(train$response, pred_german_train, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 284 209
##     1  16 191
cost <- function(r, phat){
  weight1 <- 5
  weight0 <- 1
  pcut <- weight0/(weight1+weight0) 
  c1 <- (r==1)&(phat<pcut) #logical vector - true if actual 1 but predict 0
  c0 <-(r==0)&(phat>pcut) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

pred_german_train<- predict(credit_rpart, train, type="prob")
cost(train$response, predict(credit_rpart, train, type="prob"))
## [1] 0.5585714
cost(test$response,  predict(credit_rpart, test, type="prob"))
## [1] 0.655
pred1 <- prediction(pred_german_train[,2], train$response)
pred = prediction(credit_test_prob_rpart[,2], test$response)

perf = performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

slot(performance(pred1, "auc"), "y.values")[[1]]
## [1] 0.7719768
slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.6974443
credit_test_pred_rpart <- as.numeric(credit_test_prob_rpart[,2] > 1/(5+1))
table(test$response, credit_test_pred_rpart, dnn=c("Truth","Predicted"))
##      Predicted
## Truth   0   1
##     0 116  91
##     1  19  74
#Building a large tree
german.large.tree <- rpart(formula = response ~ . , data = train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)), cp=0.001)

german.large.tree
## n= 700 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 700 493 1 (0.70428571 0.29571429)  
##     2) chk_acct=A14 276 170 0 (0.87681159 0.12318841)  
##       4) purpose=A41,A410,A43,A44,A48 123  30 0 (0.95121951 0.04878049)  
##         8) other_install=A142,A143 109  15 0 (0.97247706 0.02752294)  
##          16) age>=31.5 69   0 0 (1.00000000 0.00000000) *
##          17) age< 31.5 40  15 0 (0.92500000 0.07500000)  
##            34) sex=A92,A94 23   0 0 (1.00000000 0.00000000) *
##            35) sex=A93 17  14 1 (0.82352941 0.17647059) *
##         9) other_install=A141 14  11 1 (0.78571429 0.21428571) *
##       5) purpose=A40,A42,A45,A46,A49 153 125 1 (0.81699346 0.18300654)  
##        10) amount< 4158 125  80 0 (0.87200000 0.12800000)  
##          20) duration< 9.5 24   0 0 (1.00000000 0.00000000) *
##          21) duration>=9.5 101  80 0 (0.84158416 0.15841584)  
##            42) amount< 1112 15   0 0 (1.00000000 0.00000000) *
##            43) amount>=1112 86  70 1 (0.81395349 0.18604651)  
##              86) amount>=3385.5 10   0 0 (1.00000000 0.00000000) *
##              87) amount< 3385.5 76  60 1 (0.78947368 0.21052632)  
##               174) present_emp=A74,A75 36  20 0 (0.88888889 0.11111111)  
##                 348) credit_his=A30,A32 16   0 0 (1.00000000 0.00000000) *
##                 349) credit_his=A33,A34 20  16 1 (0.80000000 0.20000000)  
##                   698) purpose=A42,A46,A49 11   5 0 (0.90909091 0.09090909) *
##                   699) purpose=A40,A45 9   6 1 (0.66666667 0.33333333) *
##               175) present_emp=A71,A72,A73 40  28 1 (0.70000000 0.30000000)  
##                 350) other_install=A143 33  26 1 (0.78787879 0.21212121)  
##                   700) age>=30.5 16   5 0 (0.93750000 0.06250000) *
##                   701) age< 30.5 17  11 1 (0.64705882 0.35294118) *
##                 351) other_install=A141,A142 7   2 1 (0.28571429 0.71428571) *
##        11) amount>=4158 28  16 1 (0.57142857 0.42857143)  
##          22) present_emp=A74,A75 15  10 0 (0.86666667 0.13333333) *
##          23) present_emp=A71,A72,A73 13   3 1 (0.23076923 0.76923077) *
##     3) chk_acct=A11,A12,A13 424 251 1 (0.59198113 0.40801887)  
##       6) credit_his=A34 97  76 1 (0.78350515 0.21649485)  
##        12) duration< 8.5 15   0 0 (1.00000000 0.00000000) *
##        13) duration>=8.5 82  61 1 (0.74390244 0.25609756)  
##          26) purpose=A41,A42,A43,A45,A49 52  35 0 (0.86538462 0.13461538)  
##            52) amount< 4864.5 40  15 0 (0.92500000 0.07500000)  
##             104) amount>=2291.5 17   0 0 (1.00000000 0.00000000) *
##             105) amount< 2291.5 23  15 0 (0.86956522 0.13043478)  
##               210) present_emp=A71,A72,A74 10   0 0 (1.00000000 0.00000000) *
##               211) present_emp=A73,A75 13  10 1 (0.76923077 0.23076923) *
##            53) amount>=4864.5 12   8 1 (0.66666667 0.33333333) *
##          27) purpose=A40,A410,A46 30  16 1 (0.53333333 0.46666667) *
##       7) credit_his=A30,A31,A32,A33 327 175 1 (0.53516820 0.46483180)  
##        14) other_debtor=A103 21  10 0 (0.90476190 0.09523810)  
##          28) present_emp=A73,A75 13   0 0 (1.00000000 0.00000000) *
##          29) present_emp=A72,A74 8   6 1 (0.75000000 0.25000000) *
##        15) other_debtor=A101,A102 306 156 1 (0.50980392 0.49019608)  
##          30) chk_acct=A12,A13 175 105 1 (0.60000000 0.40000000)  
##            60) saving_acct=A63,A64,A65 47  36 1 (0.76595745 0.23404255)  
##             120) credit_his=A30,A33 8   0 0 (1.00000000 0.00000000) *
##             121) credit_his=A31,A32 39  28 1 (0.71794872 0.28205128)  
##               242) amount< 4051.5 31  25 1 (0.80645161 0.19354839)  
##                 484) property=A123,A124 12   0 0 (1.00000000 0.00000000) *
##                 485) property=A121,A122 19  13 1 (0.68421053 0.31578947) *
##               243) amount>=4051.5 8   3 1 (0.37500000 0.62500000) *
##            61) saving_acct=A61,A62 128  69 1 (0.53906250 0.46093750)  
##             122) job=A173 72  47 1 (0.65277778 0.34722222)  
##               244) present_emp=A71,A74,A75 25  20 0 (0.84000000 0.16000000)  
##                 488) purpose=A43,A45,A49 14   0 0 (1.00000000 0.00000000) *
##                 489) purpose=A40,A42,A46 11   7 1 (0.63636364 0.36363636) *
##               245) present_emp=A72,A73 47  26 1 (0.55319149 0.44680851)  
##                 490) housing=A152 32  23 1 (0.71875000 0.28125000)  
##                   980) purpose=A40,A41,A410,A49 11   0 0 (1.00000000 0.00000000) *
##                   981) purpose=A42,A43,A45 21  12 1 (0.57142857 0.42857143) *
##                 491) housing=A151,A153 15   3 1 (0.20000000 0.80000000) *
##             123) job=A171,A172,A174 56  22 1 (0.39285714 0.60714286) *
##          31) chk_acct=A11 131  51 1 (0.38931298 0.61068702)  
##            62) saving_acct=A63,A64 7   5 0 (0.85714286 0.14285714) *
##            63) saving_acct=A61,A62,A65 124  45 1 (0.36290323 0.63709677)  
##             126) installment_rate< 2.5 40  21 1 (0.52500000 0.47500000)  
##               252) property=A121,A123 24  16 1 (0.66666667 0.33333333)  
##                 504) job=A172,A174 8   0 0 (1.00000000 0.00000000) *
##                 505) job=A173 16   8 1 (0.50000000 0.50000000) *
##               253) property=A122,A124 16   5 1 (0.31250000 0.68750000) *
##             127) installment_rate>=2.5 84  24 1 (0.28571429 0.71428571) *
plotcp(german.large.tree)

#Plotting the tree
rpart.plot(german.large.tree, type = 3, fallen.leaves = TRUE)

pred_german_train3 <- predict(german.large.tree, train, type="prob")
pred_german_train <- predict(german.large.tree, test, type="prob")

cost(train$response, predict(german.large.tree, train, type="prob"))
## [1] 0.5014286
cost(test$response,  predict(german.large.tree, test, type="prob"))
## [1] 0.7216667
pred2 <- prediction(pred_german_train3[,2], train$response)
pred3 = prediction(pred_german_train[,2], test$response)

slot(performance(pred2, "auc"), "y.values")[[1]]
## [1] 0.8864489
slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.6677056
#however, from plotcp, we observe that a tree with more than 7 to 9 splits is not very helpful.
#further pruning the tree to limit to 9 splits;corresponding cp value from plot is 0.0072
pruned.tree <- prune(german.large.tree, cp = 0.04)
pruned.tree
## n= 700 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 700 493 1 (0.70428571 0.29571429)  
##    2) chk_acct=A14 276 170 0 (0.87681159 0.12318841)  
##      4) purpose=A41,A410,A43,A44,A48 123  30 0 (0.95121951 0.04878049) *
##      5) purpose=A40,A42,A45,A46,A49 153 125 1 (0.81699346 0.18300654)  
##       10) amount< 4158 125  80 0 (0.87200000 0.12800000) *
##       11) amount>=4158 28  16 1 (0.57142857 0.42857143) *
##    3) chk_acct=A11,A12,A13 424 251 1 (0.59198113 0.40801887) *
rpart.plot(pruned.tree, type = 3, fallen.leaves = TRUE, extra = 1)

pred_german_train2 <- predict(pruned.tree, train, type="prob")
pred_german_train4 <- predict(pruned.tree, test, type="prob")

cost(train$response, predict(pruned.tree, train, type="prob"))
## [1] 0.6214286
cost(test$response,  predict(pruned.tree, test, type="prob"))
## [1] 0.5966667
pred2 <- prediction(pred_german_train2[,2], train$response)
pred3 <- prediction(pred_german_train4[,2], test$response)

slot(performance(pred2, "auc"), "y.values")[[1]]
## [1] 0.6832319
slot(performance(pred3, "auc"), "y.values")[[1]]
## [1] 0.6841463
#In-sample MSE
mean((predict(german.tree) - train$response) ^ 2)      
## [1] 0.1463992
#out-of-sample performance
mean((predict(german.tree, newdata = test) - test$response) ^ 2)  #default tree
## [1] 0.1829166
mean((predict(german.large.tree) - train$response) ^ 2)  #large tree
## [1] 0.3759564
mean((predict(german.large.tree, newdata = test) - test$response) ^ 2)   #large tree
## [1] 0.3805692
mean((predict(pruned.tree) - train$response) ^ 2)       #pruned tree
## [1] 0.3158154
mean((predict(pruned.tree, newdata = test) - test$response) ^ 2)   #large tree
## [1] 0.3179086

Comparison with Linear Model

glm_model <- glm(formula = response ~ ., data = train)

prob.glm1.insample <- predict(glm_model, type = "response")
predicted.glm1.insample <- prob.glm1.insample > 0.1667
predicted.glm1.insample <- as.numeric(predicted.glm1.insample)

mean(ifelse(train$response != predicted.glm1.insample, 1, 0))
## [1] 0.4342857
mean(ifelse(test$response != predicted.glm1.insample, 1, 0))
## Warning in test$response != predicted.glm1.insample: longer object length is not
## a multiple of shorter object length
## [1] 0.5714286
roc.plot(train$response == "1", prob.glm1.insample)$roc.vol$Area

## [1] 0.8299282
roc.plot(test$response == "1", prob.glm1.insample)$roc.vol$Area
## Warning in is.finite(x) & apply(pred, 1, f): longer object length is not a
## multiple of shorter object length

## [1] 0.4813776