Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)
Your observation:
The GermanCredit dataset contains both numeric and factor variables that describe characteristics of loan applicants (e.g., credit history, duration, amount, age, etc.), and the target variable Class indicates whether credit is good or bad.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
# your code here:
dim(GermanCredit)
## [1] 1000 49
Your observation:
There are 1000 observations and approximately 49 variables after dropping uninformative ones.
class
(use table() function). How many observations are classed
as “good” and how many are “bad”? (2 pts)# your code here:
table(GermanCredit$Class)
##
## FALSE TRUE
## 300 700
Your observation:
There are 700 “Good” credit cases and 300 “Bad” credit cases, showing the dataset is imbalanced.
class. Please add titles and labels to axis. (2 pts)# your code here:
barplot(table(GermanCredit$Class),
col = c("steelblue", "tomato"),
main = "Distribution of Credit Class",
xlab = "Credit Class (TRUE = Good, FALSE = Bad)",
ylab = "Count")
2025 is set for reproducibility. Please
comment on what is the split proportion you choose for training and
testing data? (2 pts)set.seed(2025)
train_index <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
German_train <- GermanCredit[train_index, ]
German_test <- GermanCredit[-train_index, ]
Your comment:
I used a 70/30 split to ensure enough data for training and a fair evaluation on unseen test data.
# your code here:
model_logit <- glm(Class ~ ., data = German_train, family = binomial)
summary(model_logit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = German_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.941e+00 1.617e+00 4.910 9.12e-07 ***
## Duration -2.901e-02 1.151e-02 -2.521 0.011705 *
## Amount -1.075e-04 5.272e-05 -2.039 0.041401 *
## InstallmentRatePercentage -3.898e-01 1.092e-01 -3.569 0.000358 ***
## ResidenceDuration -6.141e-02 1.015e-01 -0.605 0.545004
## Age 2.488e-02 1.142e-02 2.178 0.029393 *
## NumberExistingCredits -2.077e-01 2.276e-01 -0.913 0.361504
## NumberPeopleMaintenance -2.455e-01 3.068e-01 -0.800 0.423583
## Telephone -4.703e-02 2.442e-01 -0.193 0.847267
## ForeignWorker -1.116e+00 6.633e-01 -1.682 0.092485 .
## CheckingAccountStatus.lt.0 -1.539e+00 2.694e-01 -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200 -1.186e+00 2.786e-01 -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200 -2.359e-01 4.652e-01 -0.507 0.612139
## CreditHistory.NoCredit.AllPaid -1.508e+00 5.144e-01 -2.931 0.003383 **
## CreditHistory.ThisBank.AllPaid -1.970e+00 5.276e-01 -3.733 0.000189 ***
## CreditHistory.PaidDuly -9.283e-01 3.350e-01 -2.771 0.005588 **
## CreditHistory.Delay -7.730e-01 4.062e-01 -1.903 0.057032 .
## Purpose.NewCar -1.616e+00 8.906e-01 -1.815 0.069570 .
## Purpose.UsedCar 1.418e-02 9.328e-01 0.015 0.987872
## Purpose.Furniture.Equipment -8.664e-01 8.985e-01 -0.964 0.334875
## Purpose.Radio.Television -6.736e-01 8.958e-01 -0.752 0.452093
## Purpose.DomesticAppliance -1.316e+00 1.198e+00 -1.099 0.271980
## Purpose.Repairs -1.125e+00 1.077e+00 -1.045 0.295888
## Purpose.Education -1.842e+00 9.842e-01 -1.872 0.061248 .
## Purpose.Retraining 5.452e-01 1.534e+00 0.355 0.722233
## Purpose.Business -1.013e+00 9.128e-01 -1.110 0.266924
## SavingsAccountBonds.lt.100 -1.011e+00 3.130e-01 -3.230 0.001238 **
## SavingsAccountBonds.100.to.500 -6.076e-01 4.251e-01 -1.429 0.152873
## SavingsAccountBonds.500.to.1000 -3.826e-01 5.583e-01 -0.685 0.493160
## SavingsAccountBonds.gt.1000 2.995e-01 7.103e-01 0.422 0.673243
## EmploymentDuration.lt.1 3.306e-01 4.957e-01 0.667 0.504788
## EmploymentDuration.1.to.4 5.461e-01 4.722e-01 1.156 0.247501
## EmploymentDuration.4.to.7 6.790e-01 5.128e-01 1.324 0.185449
## EmploymentDuration.gt.7 4.548e-01 4.684e-01 0.971 0.331585
## Personal.Male.Divorced.Seperated -7.670e-01 5.438e-01 -1.410 0.158402
## Personal.Female.NotSingle -2.582e-01 3.681e-01 -0.701 0.483014
## Personal.Male.Single 4.698e-01 3.750e-01 1.253 0.210300
## OtherDebtorsGuarantors.None -1.044e+00 5.749e-01 -1.817 0.069292 .
## OtherDebtorsGuarantors.CoApplicant -1.167e+00 7.453e-01 -1.566 0.117349
## Property.RealEstate 7.808e-01 5.317e-01 1.468 0.141990
## Property.Insurance 3.269e-01 5.206e-01 0.628 0.530048
## Property.CarOther 7.458e-01 5.089e-01 1.466 0.142739
## OtherInstallmentPlans.Bank -5.795e-01 2.937e-01 -1.973 0.048493 *
## OtherInstallmentPlans.Stores -5.174e-01 4.619e-01 -1.120 0.262692
## Housing.Rent -8.942e-01 5.908e-01 -1.514 0.130116
## Housing.Own -4.131e-01 5.593e-01 -0.739 0.460202
## Job.UnemployedUnskilled 1.175e+00 9.763e-01 1.204 0.228626
## Job.UnskilledResident -2.144e-01 4.270e-01 -0.502 0.615535
## Job.SkilledEmployee -1.170e-01 3.437e-01 -0.340 0.733609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 621.15 on 651 degrees of freedom
## AIC: 719.15
##
## Number of Fisher Scoring iterations: 5
InstallmentRatePercentage? Is it significant, and why? (2
pts)# your code here:
coef(summary(model_logit))["InstallmentRatePercentage", ]
## Estimate Std. Error z value Pr(>|z|)
## -0.3897870155 0.1091997206 -3.5694872977 0.0003576806
Your comment:
If the coefficient is negative, higher installment rates are associated with a lower likelihood of good credit. The significance (p-value < 0.05) would mean this variable is an important predictor.
# you might need some code for calculation:
exp(coef(model_logit)["InstallmentRatePercentage"])
## InstallmentRatePercentage
## 0.6772011
Your comment:
The odds ratio (OR) indicates how the odds of having good credit change with a one-unit increase in InstallmentRatePercentage.
If OR < 1 → higher installment rates reduce creditworthiness.
If OR > 1 → higher installment rates increase creditworthiness.
# your code here:
train_pred_prob <- predict(model_logit, newdata = German_train, type = "response")
head(train_pred_prob)
## 1 3 6 11 12 13
## 0.97453792 0.95952716 0.71928110 0.32074645 0.07608803 0.83362826
# your code here:
train_pred_class <- ifelse(train_pred_prob >= 0.5, TRUE, FALSE)
conf_matrix_train <- table(Predicted = train_pred_class, Actual = German_train$Class)
conf_matrix_train
## Actual
## Predicted FALSE TRUE
## FALSE 122 52
## TRUE 88 438
MR_train <- mean(train_pred_class != German_train$Class)
MR_train
## [1] 0.2
Your comment:
The misclassification rate (MR) measures the fraction of incorrect predictions. A lower MR indicates better model accuracy.
# your code here:
cutoffs <- seq(0.1, 0.9, by = 0.05)
mr_values <- sapply(cutoffs, function(c) {
pred_class <- ifelse(train_pred_prob >= c, TRUE, FALSE)
mean(pred_class != German_train$Class)
})
plot(cutoffs, mr_values, type = "b", col = "blue", pch = 16,
xlab = "Cutoff Probability", ylab = "Misclassification Rate",
main = "MR vs. Cutoff Probability")
Your comment:
The optimal cutoff is where the MR is minimized, usually between 0.4 and 0.6, depending on class balance.
# your code here:
library(pROC)
## Warning: package 'pROC' was built under R version 4.5.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_train <- roc(German_train$Class, train_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_train, col = "blue", main = "ROC Curve - Training Set")
auc(roc_train)
## Area under the curve: 0.8346
Your comment:
An AUC > 0.7 indicates acceptable discriminatory power; AUC > 0.8 → excellent model performance.
# your code here:
test_pred_prob <- predict(model_logit, newdata = German_test, type = "response")
head(test_pred_prob)
## 2 4 5 7 8 9
## 0.4091914 0.7543049 0.3706810 0.9386210 0.7791341 0.9793484
# your code here:
test_pred_class <- ifelse(test_pred_prob >= 0.5, TRUE, FALSE)
conf_matrix_test <- table(Predicted = test_pred_class, Actual = German_test$Class)
conf_matrix_test
## Actual
## Predicted FALSE TRUE
## FALSE 40 29
## TRUE 50 181
MR_test <- mean(test_pred_class != German_test$Class)
MR_test
## [1] 0.2633333
Your comment:
The test MR is usually slightly higher than the training MR, reflecting how well the model generalizes. If the difference is small, the model is stable and not overfitted.
# your code here:
roc_test <- roc(German_test$Class, test_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_test, col = "red", main = "ROC Curve - Test Set")
auc(roc_test)
## Area under the curve: 0.7964
Your comment:
The AUC for the test set provides an unbiased evaluation of the model’s discrimination ability on unseen data. Values above 0.7 indicate good predictive power.