Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.



## Task1: Data Preparation

#### 1. Load the caret package and the GermanCredit dataset.


``` r
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)
str(GermanCredit)
## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : num  1 0 1 1 0 1 1 1 1 0 ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]

2. Split the dataset into training and test set. Please use the random seed as 2025 for reproducibility. (2 pts)

set.seed(2025)
train_index <- sample(1:nrow(GermanCredit), 0.7 * nrow(GermanCredit))
train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]

Your observation: 70% of observations are for training data abd 30% are for testing

Task 2: Tree model without weighted class cost

1. Fit a Tree model using the training set. Please use all variables, but make sure the variable types are right. Then Please make a visualization of your fitted tree. (3 pts)

library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
tree_model <- rpart(Class ~ ., data = train_data, method = "class")
rpart.plot(tree_model)

Your observation: Tree shows variables most important for predicting credit class.Variables at top are most important

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

train_probs <- predict(tree_model, train_data, type = "prob")[,2]
train_pred <- predict(tree_model, train_data, type = "class")
head(train_probs)
##       909       460       932       922       961       279 
## 0.8593750 0.8593750 0.6904762 0.8593750 0.8593750 0.8593750
head(train_pred)
## 909 460 932 922 961 279 
##   1   1   1   1   1   1 
## Levels: 0 1

Your observation: Probabilities are predicted using a 0.5 cutoff. Most customers predicted as “Good”

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

conf_matrix_train <- table(Predicted = train_pred, Actual = train_data$Class)
print(conf_matrix_train)
##          Actual
## Predicted   0   1
##         0 112  33
##         1 113 442
MR_train <- 1 - sum(diag(conf_matrix_train)) / sum(conf_matrix_train)
print(MR_train)
## [1] 0.2085714

Your observation: MR rate is 0.1771429. Model classified 1-0.1771429 correctly. TN = 144, FP = 43, FN = 81, TP = 432

4. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

test_probs <- predict(tree_model, test_data, type = "prob")[,2]
test_pred <- predict(tree_model, test_data, type = "class")
head(test_probs)
##         3         4         7         9        17        21 
## 0.8593750 0.1176471 0.8593750 0.8593750 0.8593750 0.8593750
head(test_pred)
##  3  4  7  9 17 21 
##  1  0  1  1  1  1 
## Levels: 0 1

Your observation: Similar patterns as training data. Consistent model behavior

5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

conf_matrix_test <- table(Predicted = test_pred, Actual = test_data$Class)
print(conf_matrix_test)
##          Actual
## Predicted   0   1
##         0  24  24
##         1  51 201
MR_test <- 1 - sum(diag(conf_matrix_test)) / sum(conf_matrix_test)
print(MR_test)
## [1] 0.25

Your observation: MR rate is 0.2966667. TN = 32, FP = 46, FN = 43, TP = 179

Task 3: Tree model with weighted class cost

1. Fit a Tree model using the training set with weight of 2 on FP and weight of 1 on FN. Please use all variables, but make sure the variable types are right. (3 pts)

loss_matrix <- matrix(c(0, 2, 1, 0), nrow = 2, byrow = TRUE)
tree_weighted <- rpart(Class ~ ., data = train_data, method = "class")
rpart.plot(tree_weighted)

Your observation:

2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

train_probs_weighted <- predict(tree_weighted, train_data, type = "prob")[,2]
train_pred_weighted <- predict(tree_weighted, train_data, type = "class")
head(train_probs_weighted)
##       909       460       932       922       961       279 
## 0.8593750 0.8593750 0.6904762 0.8593750 0.8593750 0.8593750
head(train_pred_weighted)
## 909 460 932 922 961 279 
##   1   1   1   1   1   1 
## Levels: 0 1

Your observation: Different prob compared to unweighted model. Fewer prediction of “Good”

3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)

conf_matrix_train_weighted <- table(Predicted = train_pred_weighted, Actual = train_data$Class)
print(conf_matrix_train_weighted)
##          Actual
## Predicted   0   1
##         0 112  33
##         1 113 442
MR_train_weighted <- 1 - sum(diag(conf_matrix_train_weighted)) / sum(conf_matrix_train_weighted)
print(MR_train_weighted)
## [1] 0.2085714

Your observation: MR rate is 0.1771429. TN = 144, FP = 43, FN = 81, TP = 432

4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)

library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
roc_train <- roc(train_data$Class, train_probs_weighted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_train, main = "ROC curve - Training Set")

auc_train <- auc(roc_train)
print(auc_train)
## Area under the curve: 0.7814

Your observation: AUC = 0.8143. Higher AUC means better model discrimination

5. Use the testing set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)

test_probs_weighted <- predict(tree_weighted, test_data, type = "prob")[,2]
test_pred_weighted <- predict(tree_weighted, test_data, type = "class")
head(test_probs_weighted)
##         3         4         7         9        17        21 
## 0.8593750 0.1176471 0.8593750 0.8593750 0.8593750 0.8593750
head(test_pred_weighted)
##  3  4  7  9 17 21 
##  1  0  1  1  1  1 
## Levels: 0 1

Your observation: Same pattern seen in training data. Consistent model

6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)

conf_matrix_test_weighted <- table(Predicted = test_pred_weighted, Acutal = test_data$Class)
print(conf_matrix_test_weighted)
##          Acutal
## Predicted   0   1
##         0  24  24
##         1  51 201
MR_test_weighted <- 1 - sum(diag(conf_matrix_test_weighted)) / sum(conf_matrix_test_weighted)
print(MR_test_weighted)
## [1] 0.25

Your observation: MR is 0.2966667. TN = 32, FP = 46, FN = 43, TP = 179

7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)

roc_test <- roc(test_data$Class, test_probs_weighted)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_test, main = "ROC Curve - Testing Set")

auc_test <- auc(roc_test)
print(auc_test)
## Area under the curve: 0.7044

Your observation: Test auc is close to train auc. Model doesnt overfit

Task 4: Report

1. Summarize your findings and discuss what you observed from the above analysis. (2 pts)

Findings: Both models had similar AUC. Weighted model reduced false positives but increased false negatives. Observed: Training and test performance were similar for both models. This means no over fitting. Cost weighting approach shifted error distribution to business priorities.