Homework7

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test set. Please use the random seed as `2025` for reproducibility. (2 pts)

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.5.2

dim(GermanCredit)

## [1] 1000   49

set.seed(2025)
index <- sample(1:1000, 700)
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]
dim(GermanCredit_train)

## [1] 700  49

dim(GermanCredit_test)

## [1] 300  49

Your observation: Training data has 700 observations while Testing data has 300 observations. Both datasets have 49 variables.

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

Credit_Tree <- rpart(formula = Class ~ ., 
                              data = GermanCredit_train,
                              method = "class")
prp(Credit_Tree)

rpart.plot(Credit_Tree, extra = 3)

Your observation: Applicants with a favorable checking account status tend to be deemed as good credit more of the time and if they have a poor checking account status then they are deemed to have bad credit #### 2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_train_prob <- predict(Credit_Tree, type = "prob")
head(pred_train_prob)

##             0         1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3095238 0.6904762
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750

head(GermanCredit_train)

##     Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 909       15   3594                         1                 2  46
## 460       18   4594                         3                 2  32
## 932        9   1670                         4                 2  22
## 922       48  12749                         4                 1  37
## 961        6   1740                         2                 2  30
## 279        6   4611                         1                 4  32
##     NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 909                     2                       1         1             1     1
## 460                     1                       1         0             1     1
## 932                     1                       1         0             1     0
## 922                     1                       1         0             1     1
## 961                     2                       1         1             1     1
## 279                     1                       1         1             1     0
##     CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 909                          0                              0
## 460                          0                              0
## 932                          0                              1
## 922                          0                              0
## 961                          0                              0
## 279                          0                              0
##     CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 909                            0                              0
## 460                            0                              0
## 932                            0                              0
## 922                            0                              0
## 961                            0                              0
## 279                            0                              0
##     CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 909                              0                      0                   1
## 460                              0                      1                   0
## 932                              0                      1                   0
## 922                              0                      0                   1
## 961                              0                      0                   0
## 279                              0                      1                   0
##     Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 909              0               1                           0
## 460              0               0                           0
## 932              0               0                           0
## 922              0               0                           0
## 961              0               0                           0
## 279              0               0                           1
##     Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 909                        0                         0               0
## 460                        1                         0               0
## 932                        1                         0               0
## 922                        1                         0               0
## 961                        1                         0               0
## 279                        0                         0               0
##     Purpose.Education Purpose.Retraining Purpose.Business
## 909                 0                  0                0
## 460                 0                  0                0
## 932                 0                  0                0
## 922                 0                  0                0
## 961                 0                  0                0
## 279                 0                  0                0
##     SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 909                          1                              0
## 460                          1                              0
## 932                          1                              0
## 922                          0                              0
## 961                          1                              0
## 279                          1                              0
##     SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 909                               0                           0
## 460                               0                           0
## 932                               0                           0
## 922                               1                           0
## 961                               0                           0
## 279                               0                           0
##     EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 909                       1                         0                         0
## 460                       1                         0                         0
## 932                       1                         0                         0
## 922                       0                         0                         1
## 961                       0                         0                         0
## 279                       1                         0                         0
##     EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 909                       0                                0
## 460                       0                                0
## 932                       0                                0
## 922                       0                                0
## 961                       1                                0
## 279                       0                                0
##     Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 909                         1                    0                           1
## 460                         0                    1                           1
## 932                         1                    0                           1
## 922                         0                    1                           1
## 961                         0                    0                           1
## 279                         1                    0                           1
##     OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 909                                  0                   0                  1
## 460                                  0                   0                  0
## 932                                  0                   0                  0
## 922                                  0                   0                  0
## 961                                  0                   1                  0
## 279                                  0                   0                  1
##     Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 909                 0                          0                            0
## 460                 1                          0                            0
## 932                 1                          0                            0
## 922                 1                          0                            0
## 961                 0                          0                            0
## 279                 0                          0                            0
##     Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 909            0           1                       0                     1
## 460            0           1                       0                     0
## 932            0           1                       0                     0
## 922            0           1                       0                     0
## 961            1           0                       0                     0
## 279            0           1                       0                     0
##     Job.SkilledEmployee
## 909                   0
## 460                   1
## 932                   1
## 922                   0
## 961                   1
## 279                   1

Your observation: It seems to be more likely to have Good Credit than Bad Credit

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

act_value <- GermanCredit_train$Class
pred_value <- pred_train_prob[, 2]
pcut <-  0.5
pred_value <- 1 * (pred_value > pcut)

confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat

##       predict
## actual   0   1
##      0 112 113
##      1  33 442

MR_rate <- (113+33)/(112+113+33+442)
MR_rate

## [1] 0.2085714

Your observation: About 20.9% of the time does the model predict the incorrect outcome

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_test_prob <- predict(Credit_Tree, GermanCredit_test, type = "prob")
head(pred_test_prob)

##            0         1
## 3  0.1406250 0.8593750
## 4  0.8823529 0.1176471
## 7  0.1406250 0.8593750
## 9  0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750

Your observation: It seems to be more likely to have good credit than bad credit with the testing set too #### 5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

act_value1 <- GermanCredit_test$Class
pred_value <- pred_test_prob[, 2]
pcut <-  0.5
pred_value1 <- 1 * (pred_value > pcut)

confusion_mat <- table(actual = act_value1, predict = pred_value1)
confusion_mat

##       predict
## actual   0   1
##      0  24  51
##      1  24 201

MR_rate <- (51+24)/(24+51+24+201)
MR_rate

## [1] 0.25

Your observation: 25% of the time the model will predict the wrong outcome

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

cost_matrix <-  matrix(c(0, 1,
                         2, 0),
                       nrow = 2, byrow = TRUE)
Credit_Tree1 <- rpart(formula = Class ~ .,
                      data = GermanCredit_train,
                      method = "class",
                      parms = list(loss = cost_matrix)
                      )
prp(Credit_Tree1)

rpart.plot(Credit_Tree1, extra = 3)

Your observation: The model favors cheaper end vs the more expensive end when determining good vs bad credit status #### 2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_train_prob1 <- predict(Credit_Tree1, GermanCredit_train, type = "prob")
head(pred_train_prob1)

##             0         1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3882353 0.6117647
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750

Your observation: It is more likely to have good credit rather than bad credit #### 3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

act_value2 <- GermanCredit_train$Class
pred_value <- pred_train_prob1[, 2]
pcut <-  0.5
pred_value2 <- 1 * (pred_value > pcut)

confusion_mat2 <- table(actual = act_value2, predict = pred_value2)
confusion_mat2

##       predict
## actual   0   1
##      0  54 171
##      1   9 466

MR_rate <- (171+9)/(54+171+9+466)
MR_rate

## [1] 0.2571429

Your observation: About 25.7% of the time does this model predict the incorrect outcome #### 4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

plot(roc(GermanCredit_train$Class, pred_train_prob1[, 2]))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(GermanCredit_train$Class, pred_train_prob1[, 2])

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.7439

Your observation: The model can kind of separate good vs bad credit however, 25% of the time it’ll be incorrect

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_test_prob1 <- predict(Credit_Tree1, GermanCredit_test, type = "prob")
head(pred_test_prob1)

##            0         1
## 3  0.1406250 0.8593750
## 4  0.8823529 0.1176471
## 7  0.1406250 0.8593750
## 9  0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750

Your observation: It is more likely people have good credit than bad credit #### 6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

act_value3 <- GermanCredit_test$Class
pred_value <- pred_test_prob1[, 2]
pcut <-  0.5
pred_value3 <- 1 * (pred_value > pcut)

confusion_mat3 <- table(actual = act_value3, predict = pred_value3)
confusion_mat3

##       predict
## actual   0   1
##      0  12  63
##      1   7 218

MR_rate <- (63+7)/(12+63+7+218)
MR_rate

## [1] 0.2333333

Your observation: About 23.3% of the time the model will give the incorrect classification of Good against Bad credit #### 7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

plot(roc(GermanCredit_test$Class, pred_test_prob1[, 2]))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(GermanCredit_test$Class, pred_test_prob1[, 2])

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.7049

Your observation: Around 30% of the time the model will be incorrect.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The model without the weights performs better than the model with weights. This is because the importance of errors were penalized higher so the FP and FN were more influential in the model overall. The weighted model could be better, .75 AUC for the training and .7 AUC for the testing data. The tree models show if your checking account is in good standing than you a most likely going to have a good credit score rather than bad.