Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

install.packages('caret')

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: ggplot2
## Loading required package: lattice
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation: After using the as.numeric format, a lot of the data is notably binary with expectations such as Duration, Amount, etc. There are 7 that are labeled as integers, the rest if the variables are labeled as binary.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Explore the dataset to understand its structure.

summary(GermanCredit)
##     Duration        Amount      InstallmentRatePercentage ResidenceDuration
##  Min.   : 4.0   Min.   :  250   Min.   :1.000             Min.   :1.000    
##  1st Qu.:12.0   1st Qu.: 1366   1st Qu.:2.000             1st Qu.:2.000    
##  Median :18.0   Median : 2320   Median :3.000             Median :3.000    
##  Mean   :20.9   Mean   : 3271   Mean   :2.973             Mean   :2.845    
##  3rd Qu.:24.0   3rd Qu.: 3972   3rd Qu.:4.000             3rd Qu.:4.000    
##  Max.   :72.0   Max.   :18424   Max.   :4.000             Max.   :4.000    
##       Age        NumberExistingCredits NumberPeopleMaintenance   Telephone    
##  Min.   :19.00   Min.   :1.000         Min.   :1.000           Min.   :0.000  
##  1st Qu.:27.00   1st Qu.:1.000         1st Qu.:1.000           1st Qu.:0.000  
##  Median :33.00   Median :1.000         Median :1.000           Median :1.000  
##  Mean   :35.55   Mean   :1.407         Mean   :1.155           Mean   :0.596  
##  3rd Qu.:42.00   3rd Qu.:2.000         3rd Qu.:1.000           3rd Qu.:1.000  
##  Max.   :75.00   Max.   :4.000         Max.   :2.000           Max.   :1.000  
##  ForeignWorker       Class     CheckingAccountStatus.lt.0
##  Min.   :0.000   Min.   :0.0   Min.   :0.000             
##  1st Qu.:1.000   1st Qu.:0.0   1st Qu.:0.000             
##  Median :1.000   Median :1.0   Median :0.000             
##  Mean   :0.963   Mean   :0.7   Mean   :0.274             
##  3rd Qu.:1.000   3rd Qu.:1.0   3rd Qu.:1.000             
##  Max.   :1.000   Max.   :1.0   Max.   :1.000             
##  CheckingAccountStatus.0.to.200 CheckingAccountStatus.gt.200
##  Min.   :0.000                  Min.   :0.000               
##  1st Qu.:0.000                  1st Qu.:0.000               
##  Median :0.000                  Median :0.000               
##  Mean   :0.269                  Mean   :0.063               
##  3rd Qu.:1.000                  3rd Qu.:0.000               
##  Max.   :1.000                  Max.   :1.000               
##  CreditHistory.NoCredit.AllPaid CreditHistory.ThisBank.AllPaid
##  Min.   :0.00                   Min.   :0.000                 
##  1st Qu.:0.00                   1st Qu.:0.000                 
##  Median :0.00                   Median :0.000                 
##  Mean   :0.04                   Mean   :0.049                 
##  3rd Qu.:0.00                   3rd Qu.:0.000                 
##  Max.   :1.00                   Max.   :1.000                 
##  CreditHistory.PaidDuly CreditHistory.Delay Purpose.NewCar  Purpose.UsedCar
##  Min.   :0.00           Min.   :0.000       Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00           1st Qu.:0.000       1st Qu.:0.000   1st Qu.:0.000  
##  Median :1.00           Median :0.000       Median :0.000   Median :0.000  
##  Mean   :0.53           Mean   :0.088       Mean   :0.234   Mean   :0.103  
##  3rd Qu.:1.00           3rd Qu.:0.000       3rd Qu.:0.000   3rd Qu.:0.000  
##  Max.   :1.00           Max.   :1.000       Max.   :1.000   Max.   :1.000  
##  Purpose.Furniture.Equipment Purpose.Radio.Television Purpose.DomesticAppliance
##  Min.   :0.000               Min.   :0.00             Min.   :0.000            
##  1st Qu.:0.000               1st Qu.:0.00             1st Qu.:0.000            
##  Median :0.000               Median :0.00             Median :0.000            
##  Mean   :0.181               Mean   :0.28             Mean   :0.012            
##  3rd Qu.:0.000               3rd Qu.:1.00             3rd Qu.:0.000            
##  Max.   :1.000               Max.   :1.00             Max.   :1.000            
##  Purpose.Repairs Purpose.Education Purpose.Retraining Purpose.Business
##  Min.   :0.000   Min.   :0.00      Min.   :0.000      Min.   :0.000   
##  1st Qu.:0.000   1st Qu.:0.00      1st Qu.:0.000      1st Qu.:0.000   
##  Median :0.000   Median :0.00      Median :0.000      Median :0.000   
##  Mean   :0.022   Mean   :0.05      Mean   :0.009      Mean   :0.097   
##  3rd Qu.:0.000   3rd Qu.:0.00      3rd Qu.:0.000      3rd Qu.:0.000   
##  Max.   :1.000   Max.   :1.00      Max.   :1.000      Max.   :1.000   
##  SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
##  Min.   :0.000              Min.   :0.000                 
##  1st Qu.:0.000              1st Qu.:0.000                 
##  Median :1.000              Median :0.000                 
##  Mean   :0.603              Mean   :0.103                 
##  3rd Qu.:1.000              3rd Qu.:0.000                 
##  Max.   :1.000              Max.   :1.000                 
##  SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
##  Min.   :0.000                   Min.   :0.000              
##  1st Qu.:0.000                   1st Qu.:0.000              
##  Median :0.000                   Median :0.000              
##  Mean   :0.063                   Mean   :0.048              
##  3rd Qu.:0.000                   3rd Qu.:0.000              
##  Max.   :1.000                   Max.   :1.000              
##  EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
##  Min.   :0.000           Min.   :0.000             Min.   :0.000            
##  1st Qu.:0.000           1st Qu.:0.000             1st Qu.:0.000            
##  Median :0.000           Median :0.000             Median :0.000            
##  Mean   :0.172           Mean   :0.339             Mean   :0.174            
##  3rd Qu.:0.000           3rd Qu.:1.000             3rd Qu.:0.000            
##  Max.   :1.000           Max.   :1.000             Max.   :1.000            
##  EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
##  Min.   :0.000           Min.   :0.00                    
##  1st Qu.:0.000           1st Qu.:0.00                    
##  Median :0.000           Median :0.00                    
##  Mean   :0.253           Mean   :0.05                    
##  3rd Qu.:1.000           3rd Qu.:0.00                    
##  Max.   :1.000           Max.   :1.00                    
##  Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
##  Min.   :0.00              Min.   :0.000        Min.   :0.000              
##  1st Qu.:0.00              1st Qu.:0.000        1st Qu.:1.000              
##  Median :0.00              Median :1.000        Median :1.000              
##  Mean   :0.31              Mean   :0.548        Mean   :0.907              
##  3rd Qu.:1.00              3rd Qu.:1.000        3rd Qu.:1.000              
##  Max.   :1.00              Max.   :1.000        Max.   :1.000              
##  OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
##  Min.   :0.000                      Min.   :0.000       Min.   :0.000     
##  1st Qu.:0.000                      1st Qu.:0.000       1st Qu.:0.000     
##  Median :0.000                      Median :0.000       Median :0.000     
##  Mean   :0.041                      Mean   :0.282       Mean   :0.232     
##  3rd Qu.:0.000                      3rd Qu.:1.000       3rd Qu.:0.000     
##  Max.   :1.000                      Max.   :1.000       Max.   :1.000     
##  Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
##  Min.   :0.000     Min.   :0.000              Min.   :0.000               
##  1st Qu.:0.000     1st Qu.:0.000              1st Qu.:0.000               
##  Median :0.000     Median :0.000              Median :0.000               
##  Mean   :0.332     Mean   :0.139              Mean   :0.047               
##  3rd Qu.:1.000     3rd Qu.:0.000              3rd Qu.:0.000               
##  Max.   :1.000     Max.   :1.000              Max.   :1.000               
##   Housing.Rent    Housing.Own    Job.UnemployedUnskilled Job.UnskilledResident
##  Min.   :0.000   Min.   :0.000   Min.   :0.000           Min.   :0.0          
##  1st Qu.:0.000   1st Qu.:0.000   1st Qu.:0.000           1st Qu.:0.0          
##  Median :0.000   Median :1.000   Median :0.000           Median :0.0          
##  Mean   :0.179   Mean   :0.713   Mean   :0.022           Mean   :0.2          
##  3rd Qu.:0.000   3rd Qu.:1.000   3rd Qu.:0.000           3rd Qu.:0.0          
##  Max.   :1.000   Max.   :1.000   Max.   :1.000           Max.   :1.0          
##  Job.SkilledEmployee
##  Min.   :0.00       
##  1st Qu.:0.00       
##  Median :1.00       
##  Mean   :0.63       
##  3rd Qu.:1.00       
##  Max.   :1.00
#correlation matrix
#correlation_matrix <- cor(GermanCredit)
#print(correlation_matrix) 

Your observation: The correlation matrix shows the highly correlated verus weak correlations in the dataset. The range of Age are 12 through 75. A lot of the varibales have means between 0 and 1 due to the responses only being “0” and “1”.

3. Split the dataset into training and test set. Please use the random seed as 2023 for reproducibility.

set.seed(2023)
index <- sample(1:nrow(GermanCredit),nrow(GermanCredit)*0.80)
German.train = GermanCredit[index,]
German.test = GermanCredit[-index,]

Your observation: Data spitted the data using the ‘2023’ for reproducibility.

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right.

library(rpart)
library(rpart.plot)
# fit the model
fit_tree <- rpart(as.factor(Class) ~ ., data=German.train)
rpart.plot(fit_tree,extra=4, yesno=2)

Your observation: This fitted tree has a bunch of branches and nodes. The “yes” leaf nodes have a hihger number of observations with a predicted of class 0. The default as factor is not used for regression tree. ### 2. Use the training set to get prediected classes.

# Make predictions on the train data
pred_credit_train <- predict(fit_tree, German.train, type="class")
summary(pred_credit_train)
##   0   1 
## 196 604

Your observation: Made predictions on the train data using the German. train dataset. The class one has 604 values compareed to 196. ### 3. Obtain confusion matrix and MR on training set.

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = German.train$Class,
                      pred = pred_credit_train)
Cmatrix_train
##     pred
## true   0   1
##    0 146  96
##    1  50 508
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1825

Your observation: The error rate is a measure of misclassification, or the incorrect predictions for approximately 18.25% of the observations in the German training data.

4. Use the testing set to get prediected classes.

# Make predictions on the test data
pred_credit_test <- predict(fit_tree, German.test, type="class")

Your observation: Made predictions on the test data. ### 5. Obtain confusion matrix and MR on testing set.

# Confusion matrix to evaluate the model on testing data
Cmatrix_test = table(true = German.test$Class,
                      pred = pred_credit_test)
Cmatrix_test
##     pred
## true   0   1
##    0  35  23
##    1  27 115
1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.25

Your observation: The Cmatrix_test calculates the error rate for the testing data.

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right.

# We need to define a cost matrix first, don't change 0 there
cost_matrix <- matrix(c(0, 2,  # cost of 2 for FN
                        1, 0),  # cost of 1 for FP
                      byrow = TRUE, nrow = 2)

fit_tree_asym <- rpart(as.factor(Class) ~ ., data=GermanCredit,
                  parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)

Your observation:

2. Use the training set to get prediected probabilities and classes.

# probabilities
pred_prob_train = predict(fit_tree_asym, German.train, type = "prob")
pred_prob_train = pred_prob_train[,"1"] 

# classes
pred_germancredit_train2 <- predict(fit_tree_asym, German.train,
                             type = "class")
summary(pred_germancredit_train2)
##   0   1 
## 345 455

Your observation: The class 1 has more values with 455 values compared to 345.

3. Obtain confusion matrix and MR on training set (use predicted classes).

#C matrix for training
table( true = German.train$Class, pred = pred_credit_train)
##     pred
## true   0   1
##    0 146  96
##    1  50 508
# Confusion matrix to evaluate the model on testing data
Cmatrix_train = table(true = German.train$Class,
                      pred = pred_credit_train)
1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1825

Your observation: The MR is 0.28375, which means that about 28% of the instances are moderately high misclassified by the model.

4. Obtain ROC and AUC on training set (use predicted probabilities).

library(ROCR)
## Warning: package 'ROCR' was built under R version 4.3.1
pred <- prediction(pred_prob_train ,  German.train$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7416541

Your observation: The ROC apporoaches a false positive rate and the true positive rate. The AUC us 0.74, which is relatively okay given the thresholds of the data.

5. Use the testing set to get prediected probabilities and classes.

# probabilities
pred_prob_test = predict(fit_tree, German.test, type = "prob")
pred_prob_test = pred_prob_test[,"1"]

# classes
pred_germancredit_test2 <- predict(fit_tree_asym, German.test,
                             type = "class")
summary(pred_germancredit_test2)
##   0   1 
##  95 105

Your observation: The testing data set is used similarly to training dataset. Also, the class 1 has more values.

6. Obtain confusion matrix and MR on testing set. (use predicted classes).

# Confusion matrix
Cmatrix_test2 = table(true = German.test$Class,
                      pred = pred_germancredit_test2)
Cmatrix_test2
##     pred
## true  0  1
##    0 44 14
##    1 51 91

Your observation: The elements in the diagonal (44 and 91) represent the correct predictions. The off-diagonal elements (14 and 51) represent the incorrect predictions. The error rate for the testing data based on Cmatrix_test2 32.5%, meaning the model has incorrect predictions for approximately 32.5% of the observations in the testing data.

7. Obtain ROC and AUC on testing set. (use predicted probabilities).

# obtain predicted probability
pred_prob_test = predict(fit_tree, German.test, type = "prob")
# This is necessary again, as predict() for tree model return two values, one for 0 and one for 1.
pred_prob_test = pred_prob_test[,"1"] #replace "1" with the actual category if reponse variable is a factor
#ROC
pred <- prediction(pred_prob_test, German.test$Class)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7279626

Your observation: The true positive rate (“tpr”) and false positive rate (“fpr”) are obtained from the prediction and performance functions. The AUC of 0.7279 indicates reasonably good performance, with a higher AUC generally indicating better discrimination between classes.

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis.

Both the testing and training sets without weighted cost have an imbalance with class one.The training predicted class had a MR of 0.18 and the testing set had a value of 0.25 for the MR. When the tree model was weighted using the weight of 2 on FP and weight of 1 on FN the training and testing data changed. The training data is 0.283 with a MR of .28275 and AIC of 0.741 and the testing set had a AUC of 0.7279 which could result in a good performance.

2. How do you compare Tree model to logistic regression and SVM? Only for this question, you don’t need to show numbers, just answer based on your understanding.

-The decision tree model might capture non-linear relationships better but is susceptible to over fitting. -SVM powerful in high-dimensional spaces and can handle non-linear relationship.

Decision Tree:

  • Pros: Can capture complex relationships in the data, doesn’t assume linearity, handles categorical and numerical features, interrupt
  • Cons: Prone to over-fitting(noise, pruning), sensitive to small variations in the training data.

Support Vector Machine (SVM):

  • Pros: Effective in high-dimensional spaces, versatile with different kernel functions.
  • Cons: Choice of kernel and parameters can be critical, may not perform well on noisy datasets.