Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
install.packages('caret')
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- as.numeric(GermanCredit$Class == "Good") # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : num 1 0 1 1 0 1 1 1 1 0 ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)]
2025 for reproducibility. (2
pts)library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.5.2
dim(GermanCredit)
## [1] 1000 49
set.seed(2025)
index <- sample(1:1000, 700)
GermanCredit_train <- GermanCredit[index, ]
GermanCredit_test <- GermanCredit[-index, ]
dim(GermanCredit_train)
## [1] 700 49
dim(GermanCredit_test)
## [1] 300 49
Your observation: Training data has 700 observations while Testing data has 300 observations. Both datasets have 49 variables.
Credit_Tree <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class")
prp(Credit_Tree)
rpart.plot(Credit_Tree, extra = 3)
Your observation: Applicants with a favorable checking account status tend to be deemed as good credit more of the time and if they have a poor checking account status then they are deemed to have bad credit #### 2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)
pred_train_prob <- predict(Credit_Tree, type = "prob")
head(pred_train_prob)
## 0 1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3095238 0.6904762
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750
head(GermanCredit_train)
## Duration Amount InstallmentRatePercentage ResidenceDuration Age
## 909 15 3594 1 2 46
## 460 18 4594 3 2 32
## 932 9 1670 4 2 22
## 922 48 12749 4 1 37
## 961 6 1740 2 2 30
## 279 6 4611 1 4 32
## NumberExistingCredits NumberPeopleMaintenance Telephone ForeignWorker Class
## 909 2 1 1 1 1
## 460 1 1 0 1 1
## 932 1 1 0 1 0
## 922 1 1 0 1 1
## 961 2 1 1 1 1
## 279 1 1 1 1 0
## CheckingAccountStatus.lt.0 CheckingAccountStatus.0.to.200
## 909 0 0
## 460 0 0
## 932 0 1
## 922 0 0
## 961 0 0
## 279 0 0
## CheckingAccountStatus.gt.200 CreditHistory.NoCredit.AllPaid
## 909 0 0
## 460 0 0
## 932 0 0
## 922 0 0
## 961 0 0
## 279 0 0
## CreditHistory.ThisBank.AllPaid CreditHistory.PaidDuly CreditHistory.Delay
## 909 0 0 1
## 460 0 1 0
## 932 0 1 0
## 922 0 0 1
## 961 0 0 0
## 279 0 1 0
## Purpose.NewCar Purpose.UsedCar Purpose.Furniture.Equipment
## 909 0 1 0
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 0 0
## 279 0 0 1
## Purpose.Radio.Television Purpose.DomesticAppliance Purpose.Repairs
## 909 0 0 0
## 460 1 0 0
## 932 1 0 0
## 922 1 0 0
## 961 1 0 0
## 279 0 0 0
## Purpose.Education Purpose.Retraining Purpose.Business
## 909 0 0 0
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 0 0
## 279 0 0 0
## SavingsAccountBonds.lt.100 SavingsAccountBonds.100.to.500
## 909 1 0
## 460 1 0
## 932 1 0
## 922 0 0
## 961 1 0
## 279 1 0
## SavingsAccountBonds.500.to.1000 SavingsAccountBonds.gt.1000
## 909 0 0
## 460 0 0
## 932 0 0
## 922 1 0
## 961 0 0
## 279 0 0
## EmploymentDuration.lt.1 EmploymentDuration.1.to.4 EmploymentDuration.4.to.7
## 909 1 0 0
## 460 1 0 0
## 932 1 0 0
## 922 0 0 1
## 961 0 0 0
## 279 1 0 0
## EmploymentDuration.gt.7 Personal.Male.Divorced.Seperated
## 909 0 0
## 460 0 0
## 932 0 0
## 922 0 0
## 961 1 0
## 279 0 0
## Personal.Female.NotSingle Personal.Male.Single OtherDebtorsGuarantors.None
## 909 1 0 1
## 460 0 1 1
## 932 1 0 1
## 922 0 1 1
## 961 0 0 1
## 279 1 0 1
## OtherDebtorsGuarantors.CoApplicant Property.RealEstate Property.Insurance
## 909 0 0 1
## 460 0 0 0
## 932 0 0 0
## 922 0 0 0
## 961 0 1 0
## 279 0 0 1
## Property.CarOther OtherInstallmentPlans.Bank OtherInstallmentPlans.Stores
## 909 0 0 0
## 460 1 0 0
## 932 1 0 0
## 922 1 0 0
## 961 0 0 0
## 279 0 0 0
## Housing.Rent Housing.Own Job.UnemployedUnskilled Job.UnskilledResident
## 909 0 1 0 1
## 460 0 1 0 0
## 932 0 1 0 0
## 922 0 1 0 0
## 961 1 0 0 0
## 279 0 1 0 0
## Job.SkilledEmployee
## 909 0
## 460 1
## 932 1
## 922 0
## 961 1
## 279 1
Your observation: It seems to be more likely to have Good Credit than Bad Credit
act_value <- GermanCredit_train$Class
pred_value <- pred_train_prob[, 2]
pcut <- 0.5
pred_value <- 1 * (pred_value > pcut)
confusion_mat <- table(actual = act_value, predict = pred_value)
confusion_mat
## predict
## actual 0 1
## 0 112 113
## 1 33 442
MR_rate <- (113+33)/(112+113+33+442)
MR_rate
## [1] 0.2085714
Your observation: About 20.9% of the time does the model predict the incorrect outcome
pred_test_prob <- predict(Credit_Tree, GermanCredit_test, type = "prob")
head(pred_test_prob)
## 0 1
## 3 0.1406250 0.8593750
## 4 0.8823529 0.1176471
## 7 0.1406250 0.8593750
## 9 0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750
Your observation: It seems to be more likely to have good credit than bad credit with the testing set too #### 5. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)
act_value1 <- GermanCredit_test$Class
pred_value <- pred_test_prob[, 2]
pcut <- 0.5
pred_value1 <- 1 * (pred_value > pcut)
confusion_mat <- table(actual = act_value1, predict = pred_value1)
confusion_mat
## predict
## actual 0 1
## 0 24 51
## 1 24 201
MR_rate <- (51+24)/(24+51+24+201)
MR_rate
## [1] 0.25
Your observation: 25% of the time the model will predict the wrong outcome
cost_matrix <- matrix(c(0, 1,
2, 0),
nrow = 2, byrow = TRUE)
Credit_Tree1 <- rpart(formula = Class ~ .,
data = GermanCredit_train,
method = "class",
parms = list(loss = cost_matrix)
)
prp(Credit_Tree1)
rpart.plot(Credit_Tree1, extra = 3)
Your observation: The model favors cheaper end vs the more expensive end when determining good vs bad credit status #### 2. Use the training set to get prediected probabilities and classes (Please use the default cutoff probability). (2 pts)
pred_train_prob1 <- predict(Credit_Tree1, GermanCredit_train, type = "prob")
head(pred_train_prob1)
## 0 1
## 909 0.1406250 0.8593750
## 460 0.1406250 0.8593750
## 932 0.3882353 0.6117647
## 922 0.1406250 0.8593750
## 961 0.1406250 0.8593750
## 279 0.1406250 0.8593750
Your observation: It is more likely to have good credit rather than bad credit #### 3. Obtain confusion matrix and MR on training set (Please use the predicted class in previous question). (2 pts)
act_value2 <- GermanCredit_train$Class
pred_value <- pred_train_prob1[, 2]
pcut <- 0.5
pred_value2 <- 1 * (pred_value > pcut)
confusion_mat2 <- table(actual = act_value2, predict = pred_value2)
confusion_mat2
## predict
## actual 0 1
## 0 54 171
## 1 9 466
MR_rate <- (171+9)/(54+171+9+466)
MR_rate
## [1] 0.2571429
Your observation: About 25.7% of the time does this model predict the incorrect outcome #### 4. Obtain ROC and AUC on training set (use predicted probabilities). (2 pts)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
plot(roc(GermanCredit_train$Class, pred_train_prob1[, 2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(GermanCredit_train$Class, pred_train_prob1[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7439
Your observation: The model can kind of separate good vs bad credit however, 25% of the time it’ll be incorrect
pred_test_prob1 <- predict(Credit_Tree1, GermanCredit_test, type = "prob")
head(pred_test_prob1)
## 0 1
## 3 0.1406250 0.8593750
## 4 0.8823529 0.1176471
## 7 0.1406250 0.8593750
## 9 0.1406250 0.8593750
## 17 0.1406250 0.8593750
## 21 0.1406250 0.8593750
Your observation: It is more likely people have good credit than bad credit #### 6. Obtain confusion matrix and MR on testing set. (Please use the predicted class in previous question). (2 pts)
act_value3 <- GermanCredit_test$Class
pred_value <- pred_test_prob1[, 2]
pcut <- 0.5
pred_value3 <- 1 * (pred_value > pcut)
confusion_mat3 <- table(actual = act_value3, predict = pred_value3)
confusion_mat3
## predict
## actual 0 1
## 0 12 63
## 1 7 218
MR_rate <- (63+7)/(12+63+7+218)
MR_rate
## [1] 0.2333333
Your observation: About 23.3% of the time the model will give the incorrect classification of Good against Bad credit #### 7. Obtain ROC and AUC on testing set. (use predicted probabilities). (2 pts)
plot(roc(GermanCredit_test$Class, pred_test_prob1[, 2]))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
auc(GermanCredit_test$Class, pred_test_prob1[, 2])
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7049
Your observation: Around 30% of the time the model will be incorrect.
The model without the weights performs better than the model with weights. This is because the importance of errors were penalized higher so the FP and FN were more influential in the model overall. The weighted model could be better, .75 AUC for the training and .7 AUC for the testing data. The tree models show if your checking account is in good standing than you a most likely going to have a good credit score rather than bad.