Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
2025 for reproducibility. (2
pts)dim(GermanCredit)
## [1] 1000 49
set.seed(2025)
index <- sample(1:1000, 700)
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]
dim(GermanCredit_train)
## [1] 700 49
dim(GermanCredit_test)
## [1] 300 49
Your observation: The training set has 700 observations while the testing set only has 300 observations. Both have 49 variables.
library(rpart)
library(rpart.plot)
Credit_Tree <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class")
prp(Credit_Tree)
rpart.plot(Credit_Tree,extra = 3)
Your observation: Applicants with a favorable checking account status tend to be deemed as good credit more of the time and if they have a poor checking acount status then they are deemed to have bad credit.
pred_train_prob <- predict(Credit_Tree, GermanCredit_train, type = "prob")
head(pred_train_prob)
## 0 1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3095238 0.6904762
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750
head(GermanCredit_train)
## Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 909 15 3594 1 2 46
## 460 18 4594 3 2 32
## 932 9 1670 4 2 22
## 922 48 12749 4 1 37
## 961 6 1740 2 2 30
## 279 6 4611 1 4 32
## NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 909 2 1 1 1 1
## 460 1 1 0 1 1
## 932 1 1 0 1 0
## 922 1 1 0 1 1
## 961 2 1 1 1 1
## 279 1 1 1 1 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 909 0 0
## 460 0 0
## 932 0 1
## 922 0 0
## 961 0 0
## 279 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 909 0 0
## 460 0 0
## 932 0 0
## 922 0 0
## 961 0 0
## 279 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 909 0 0 1
## 460 0 1 0
## 932 0 1 0
## 922 0 0 1
## 961 0 0 0
## 279 0 1 0
## Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 909 0 1 0
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 0 0
## 279 0 0 1
## Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 909 0 0 0
## 460 1 0 0
## 932 1 0 0
## 922 1 0 0
## 961 1 0 0
## 279 0 0 0
## Purpose.Education Purpose.Retraining Purpose.Business
## 909 0 0 0
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 0 0
## 279 0 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 909 1 0
## 460 1 0
## 932 1 0
## 922 0 0
## 961 1 0
## 279 1 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 909 0 0
## 460 0 0
## 932 0 0
## 922 1 0
## 961 0 0
## 279 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 909 1 0 0
## 460 1 0 0
## 932 1 0 0
## 922 0 0 1
## 961 0 0 0
## 279 1 0 0
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 909 0 0
## 460 0 0
## 932 0 0
## 922 0 0
## 961 1 0
## 279 0 0
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 909 1 0 1
## 460 0 1 1
## 932 1 0 1
## 922 0 1 1
## 961 0 0 1
## 279 1 0 1
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 909 0 0 1
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 1 0
## 279 0 0 1
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 909 0 0 0
## 460 1 0 0
## 932 1 0 0
## 922 1 0 0
## 961 0 0 0
## 279 0 0 0
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 909 0 1 0 1
## 460 0 1 0 0
## 932 0 1 0 0
## 922 0 1 0 0
## 961 1 0 0 0
## 279 0 1 0 0
## Job.SkilledEmployee
## 909 0
## 460 1
## 932 1
## 922 0
## 961 1
## 279 1
Your observation: It seems to be more likely to have Good credit than bad credit
act_value <- GermanCredit_train$Class
pred_value <- pred_train_prob[, 2]
pcut <- 0.5
pred_value <- 1 * (pred_value > pcut)
confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat
## predict
## actual 0 1
## 0 112 113
## 1 33 442
MR_rate <- (113+33)/(112+113+33+442)
MR_rate
## [1] 0.2085714
Your observation: Around 20.9% of the time does this model predict the incorrect outcome
pred_test_prob <- predict(Credit_Tree, GermanCredit_test, type = "prob")
head(pred_test_prob)
## 0 1
## 3 0.1406250 0.8593750
## 4 0.8823529 0.1176471
## 7 0.1406250 0.8593750
## 9 0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750
Your observation: It seems to be more likely to have Good credit than bad credit with the testing set too…which makes sense.
act_value1 <- GermanCredit_test$Class
pred_value <- pred_test_prob[, 2]
pcut <- 0.5
pred_value1 <- 1 * (pred_value > pcut)
confusion_mat1 <- table(actual = act_value1, predict = pred_value1)
confusion_mat1
## predict
## actual 0 1
## 0 24 51
## 1 24 201
MR_rate <- (51+24)/(24+51+24+201)
MR_rate
## [1] 0.25
Your observation: 25% of the time does this model predict the incorrect outcome
cost_matrix <- matrix(c(0, 1,
2, 0),
nrow = 2, byrow = TRUE)
Credit_Tree1 <- rpart(
Class ~ .,
data = GermanCredit_train,
method = "class",
parms = list(loss = cost_matrix)
)
prp(Credit_Tree1)
rpart.plot(Credit_Tree1,extra = 3)
Your observation: This model favors the cheaper end vs the more expensive end when determining good vs bad credit status.
pred_train_prob1 <- predict(Credit_Tree1, GermanCredit_train, type = "prob")
head(pred_train_prob1)
## 0 1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3882353 0.6117647
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750
Your observation: It is more likely people have good credit vs bad credit
act_value2 <- GermanCredit_train$Class
pred_value <- pred_train_prob1[, 2]
pcut <- 0.5
pred_value2 <- 1 * (pred_value > pcut)
confusion_mat2 <- table(actual = act_value2, predict = pred_value2)
confusion_mat2
## predict
## actual 0 1
## 0 54 171
## 1 9 466
MR_rate <- (171+9)/(54+171+9+466)
MR_rate
## [1] 0.2571429
Your observation: Around 25.7% of the time does this model predict the incorrect outcome
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
plot(roc(GermanCredit_train$Class, pred_train_prob1[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(GermanCredit_train$Class, pred_train_prob1[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7439
Your observation: The model can kind of separate Good vs Bad credit but around 25% of the time it will be incorrect. Not a bad model but definietly could be better.
pred_test_prob1 <- predict(Credit_Tree1, GermanCredit_test, type = "prob")
head(pred_test_prob1)
## 0 1
## 3 0.1406250 0.8593750
## 4 0.8823529 0.1176471
## 7 0.1406250 0.8593750
## 9 0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750
Your observation: It is more likely people have good credit vs bad credit
act_value3 <- GermanCredit_test$Class
pred_value <- pred_test_prob1[, 2]
pcut <- 0.5
pred_value3 <- 1 * (pred_value > pcut)
confusion_mat3 <- table(actual = act_value3, predict = pred_value3)
confusion_mat3
## predict
## actual 0 1
## 0 12 63
## 1 7 218
MR_rate <- (63+7)/(12+63+7+218)
MR_rate
## [1] 0.2333333
Your observation: 23% of the time the model will give the incorrect classification of Good vs Bad credit
plot(roc(GermanCredit_test$Class, pred_test_prob1[,2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(GermanCredit_test$Class, pred_test_prob1[,2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7049
Your observation: The model can kind of separate Good vs Bad credit but around 30% of the time it will be incorrect. Not a bad model but definietly could be better.
The model without the weights performed better than the model with the weights. This makes sense as the importance of errors was penalized a lot more so the FP and FN were more influential in the model itself. The weighted model could be better .75 AUC for the training and .70 AUC. Based off the tree models, if your checking account is in good standing then you are more likely to have a good credit score vs a bad one.