Homework5

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test set. Please use the random seed as `2025` for reproducibility. (2 pts)

dim(GermanCredit)

## [1] 1000   49

set.seed(2025)
index <- sample(1:1000, 700)
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]
dim(GermanCredit_train)

## [1] 700  49

dim(GermanCredit_test)

## [1] 300  49

Your observation: The training set has 700 observations while the testing set only has 300 observations. Both have 49 variables.

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)
Credit_Tree <- rpart(formula = Class ~ ., 
                    data = GermanCredit_train, 
                    method = "class") 
prp(Credit_Tree)

rpart.plot(Credit_Tree,extra = 3)

Your observation: Applicants with a favorable checking account status tend to be deemed as good credit more of the time and if they have a poor checking acount status then they are deemed to have bad credit.

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_train_prob <- predict(Credit_Tree, GermanCredit_train, type = "prob")
head(pred_train_prob)

##             0         1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3095238 0.6904762
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750

head(GermanCredit_train)

##     Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 909       15   3594                         1                 2  46
## 460       18   4594                         3                 2  32
## 932        9   1670                         4                 2  22
## 922       48  12749                         4                 1  37
## 961        6   1740                         2                 2  30
## 279        6   4611                         1                 4  32
##     NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 909                     2                       1         1             1     1
## 460                     1                       1         0             1     1
## 932                     1                       1         0             1     0
## 922                     1                       1         0             1     1
## 961                     2                       1         1             1     1
## 279                     1                       1         1             1     0
##     CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 909                          0                              0
## 460                          0                              0
## 932                          0                              1
## 922                          0                              0
## 961                          0                              0
## 279                          0                              0
##     CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 909                            0                              0
## 460                            0                              0
## 932                            0                              0
## 922                            0                              0
## 961                            0                              0
## 279                            0                              0
##     CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 909                              0                      0                   1
## 460                              0                      1                   0
## 932                              0                      1                   0
## 922                              0                      0                   1
## 961                              0                      0                   0
## 279                              0                      1                   0
##     Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 909              0               1                           0
## 460              0               0                           0
## 932              0               0                           0
## 922              0               0                           0
## 961              0               0                           0
## 279              0               0                           1
##     Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 909                        0                         0               0
## 460                        1                         0               0
## 932                        1                         0               0
## 922                        1                         0               0
## 961                        1                         0               0
## 279                        0                         0               0
##     Purpose.Education Purpose.Retraining Purpose.Business
## 909                 0                  0                0
## 460                 0                  0                0
## 932                 0                  0                0
## 922                 0                  0                0
## 961                 0                  0                0
## 279                 0                  0                0
##     SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 909                          1                              0
## 460                          1                              0
## 932                          1                              0
## 922                          0                              0
## 961                          1                              0
## 279                          1                              0
##     SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 909                               0                           0
## 460                               0                           0
## 932                               0                           0
## 922                               1                           0
## 961                               0                           0
## 279                               0                           0
##     EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 909                       1                         0                         0
## 460                       1                         0                         0
## 932                       1                         0                         0
## 922                       0                         0                         1
## 961                       0                         0                         0
## 279                       1                         0                         0
##     EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 909                       0                                0
## 460                       0                                0
## 932                       0                                0
## 922                       0                                0
## 961                       1                                0
## 279                       0                                0
##     Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 909                         1                    0                           1
## 460                         0                    1                           1
## 932                         1                    0                           1
## 922                         0                    1                           1
## 961                         0                    0                           1
## 279                         1                    0                           1
##     OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 909                                  0                   0                  1
## 460                                  0                   0                  0
## 932                                  0                   0                  0
## 922                                  0                   0                  0
## 961                                  0                   1                  0
## 279                                  0                   0                  1
##     Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 909                 0                          0                            0
## 460                 1                          0                            0
## 932                 1                          0                            0
## 922                 1                          0                            0
## 961                 0                          0                            0
## 279                 0                          0                            0
##     Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 909            0           1                       0                     1
## 460            0           1                       0                     0
## 932            0           1                       0                     0
## 922            0           1                       0                     0
## 961            1           0                       0                     0
## 279            0           1                       0                     0
##     Job.SkilledEmployee
## 909                   0
## 460                   1
## 932                   1
## 922                   0
## 961                   1
## 279                   1

Your observation: It seems to be more likely to have Good credit than bad credit

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

act_value <- GermanCredit_train$Class
pred_value <- pred_train_prob[, 2]
pcut <- 0.5
pred_value <- 1 * (pred_value > pcut)

confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat

##       predict
## actual   0   1
##      0 112 113
##      1  33 442

MR_rate <- (113+33)/(112+113+33+442)
MR_rate

## [1] 0.2085714

Your observation: Around 20.9% of the time does this model predict the incorrect outcome

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_test_prob <- predict(Credit_Tree, GermanCredit_test, type = "prob")
head(pred_test_prob)

##            0         1
## 3  0.1406250 0.8593750
## 4  0.8823529 0.1176471
## 7  0.1406250 0.8593750
## 9  0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750

Your observation: It seems to be more likely to have Good credit than bad credit with the testing set too…which makes sense.

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

act_value1 <- GermanCredit_test$Class
pred_value <- pred_test_prob[, 2]
pcut <- 0.5
pred_value1 <- 1 * (pred_value > pcut)

confusion_mat1 <- table(actual = act_value1, predict = pred_value1)
confusion_mat1

##       predict
## actual   0   1
##      0  24  51
##      1  24 201

MR_rate <- (51+24)/(24+51+24+201)
MR_rate

## [1] 0.25

Your observation: 25% of the time does this model predict the incorrect outcome

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

cost_matrix <- matrix(c(0, 1,
                        2, 0),
                      nrow = 2, byrow = TRUE)
Credit_Tree1 <- rpart(
  Class ~ ., 
  data = GermanCredit_train,
  method = "class",
  parms = list(loss = cost_matrix)
)
prp(Credit_Tree1)

rpart.plot(Credit_Tree1,extra = 3)

Your observation: This model favors the cheaper end vs the more expensive end when determining good vs bad credit status.

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_train_prob1 <- predict(Credit_Tree1, GermanCredit_train, type = "prob")
head(pred_train_prob1)

##             0         1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3882353 0.6117647
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750

Your observation: It is more likely people have good credit vs bad credit

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

act_value2 <- GermanCredit_train$Class
pred_value <- pred_train_prob1[, 2]
pcut <- 0.5
pred_value2 <- 1 * (pred_value > pcut)

confusion_mat2 <- table(actual = act_value2, predict = pred_value2)
confusion_mat2

##       predict
## actual   0   1
##      0  54 171
##      1   9 466

MR_rate <- (171+9)/(54+171+9+466)
MR_rate

## [1] 0.2571429

Your observation: Around 25.7% of the time does this model predict the incorrect outcome

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

plot(roc(GermanCredit_train$Class, pred_train_prob1[,2]))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(GermanCredit_train$Class, pred_train_prob1[,2])

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.7439

Your observation: The model can kind of separate Good vs Bad credit but around 25% of the time it will be incorrect. Not a bad model but definietly could be better.

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

pred_test_prob1 <- predict(Credit_Tree1, GermanCredit_test, type = "prob")
head(pred_test_prob1)

##            0         1
## 3  0.1406250 0.8593750
## 4  0.8823529 0.1176471
## 7  0.1406250 0.8593750
## 9  0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750

Your observation: It is more likely people have good credit vs bad credit

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

act_value3 <- GermanCredit_test$Class
pred_value <- pred_test_prob1[, 2]
pcut <- 0.5
pred_value3 <- 1 * (pred_value > pcut)

confusion_mat3 <- table(actual = act_value3, predict = pred_value3)
confusion_mat3

##       predict
## actual   0   1
##      0  12  63
##      1   7 218

MR_rate <- (63+7)/(12+63+7+218)
MR_rate

## [1] 0.2333333

Your observation: 23% of the time the model will give the incorrect classification of Good vs Bad credit

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

plot(roc(GermanCredit_test$Class, pred_test_prob1[,2]))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

auc(GermanCredit_test$Class, pred_test_prob1[,2])

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Area under the curve: 0.7049

Your observation: The model can kind of separate Good vs Bad credit but around 30% of the time it will be incorrect. Not a bad model but definietly could be better.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

The model without the weights performed better than the model with the weights. This makes sense as the importance of errors was penalized a lot more so the FP and FN were more influential in the model itself. The weighted model could be better .75 AUC for the training and .70 AUC. Based off the tree models, if your checking account is in good standing then you are more likely to have a good credit score vs a bad one.

Homework5

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Split the dataset into training and test set. Please use the random seed as 2025 for reproducibility. (2 pts)

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

2. Split the dataset into training and test set. Please use the random seed as `2025` for reproducibility. (2 pts)