Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
if (!require(package_name, quietly = TRUE)) {
install.packages("caret")
}
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
# str(GermanCredit)
Your observation: After loading and cleaning, we have a binary outcome coded as TRUE for good credit and FALSE for bad credit, plus numeric/dummy predictors for logistic regressions.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
dim(GermanCredit)
## [1] 1000 49
Your observation: There are 1000 observations and 49 variables
class
(use table() function). How many observations are classed
as “good” and how many are “bad”? (2 pts)table(GermanCredit$Class)
##
## FALSE TRUE
## 300 700
Your observation: There are 700 good and 300 bad credit applicants.
##### (3) Please make a barplot of of response variable
class. Please add titles and labels to axis. (2 pts)
class_tab <- table(GermanCredit$Class)
barplot(class_tab,
names.arg = c("Bad", "Good"),
main = "Distribution of Credit Class",
xlab = "Class",
ylab = "Count")
2025 is set for reproducibility. Please
comment on what is the split proportion you choose for training and
testing data? (2 pts)set.seed(2025) # set random seed for reproducibility.
train_index <- createDataPartition(GermanCredit$Class,
p = 0.7, # 70% training
list = FALSE)
train_data <- GermanCredit[train_index, ]
test_data <- GermanCredit[-train_index, ]
dim(train_data)
## [1] 700 49
dim(test_data)
## [1] 300 49
Your comment: I used a 70%/30% split. 700 observations for training and 300 for testing.
# Logistic regression with all predictors
logit_model <- glm(Class ~ .,
data = train_data,
family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = train_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.941e+00 1.617e+00 4.910 9.12e-07 ***
## Duration -2.901e-02 1.151e-02 -2.521 0.011705 *
## Amount -1.075e-04 5.272e-05 -2.039 0.041401 *
## InstallmentRatePercentage -3.898e-01 1.092e-01 -3.569 0.000358 ***
## ResidenceDuration -6.141e-02 1.015e-01 -0.605 0.545004
## Age 2.488e-02 1.142e-02 2.178 0.029393 *
## NumberExistingCredits -2.077e-01 2.276e-01 -0.913 0.361504
## NumberPeopleMaintenance -2.455e-01 3.068e-01 -0.800 0.423583
## Telephone -4.703e-02 2.442e-01 -0.193 0.847267
## ForeignWorker -1.116e+00 6.633e-01 -1.682 0.092485 .
## CheckingAccountStatus.lt.0 -1.539e+00 2.694e-01 -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200 -1.186e+00 2.786e-01 -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200 -2.359e-01 4.652e-01 -0.507 0.612139
## CreditHistory.NoCredit.AllPaid -1.508e+00 5.144e-01 -2.931 0.003383 **
## CreditHistory.ThisBank.AllPaid -1.970e+00 5.276e-01 -3.733 0.000189 ***
## CreditHistory.PaidDuly -9.283e-01 3.350e-01 -2.771 0.005588 **
## CreditHistory.Delay -7.730e-01 4.062e-01 -1.903 0.057032 .
## Purpose.NewCar -1.616e+00 8.906e-01 -1.815 0.069570 .
## Purpose.UsedCar 1.418e-02 9.328e-01 0.015 0.987872
## Purpose.Furniture.Equipment -8.664e-01 8.985e-01 -0.964 0.334875
## Purpose.Radio.Television -6.736e-01 8.958e-01 -0.752 0.452093
## Purpose.DomesticAppliance -1.316e+00 1.198e+00 -1.099 0.271980
## Purpose.Repairs -1.125e+00 1.077e+00 -1.045 0.295888
## Purpose.Education -1.842e+00 9.842e-01 -1.872 0.061248 .
## Purpose.Retraining 5.452e-01 1.534e+00 0.355 0.722233
## Purpose.Business -1.013e+00 9.128e-01 -1.110 0.266924
## SavingsAccountBonds.lt.100 -1.011e+00 3.130e-01 -3.230 0.001238 **
## SavingsAccountBonds.100.to.500 -6.076e-01 4.251e-01 -1.429 0.152873
## SavingsAccountBonds.500.to.1000 -3.826e-01 5.583e-01 -0.685 0.493160
## SavingsAccountBonds.gt.1000 2.995e-01 7.103e-01 0.422 0.673243
## EmploymentDuration.lt.1 3.306e-01 4.957e-01 0.667 0.504788
## EmploymentDuration.1.to.4 5.461e-01 4.722e-01 1.156 0.247501
## EmploymentDuration.4.to.7 6.790e-01 5.128e-01 1.324 0.185449
## EmploymentDuration.gt.7 4.548e-01 4.684e-01 0.971 0.331585
## Personal.Male.Divorced.Seperated -7.670e-01 5.438e-01 -1.410 0.158402
## Personal.Female.NotSingle -2.582e-01 3.681e-01 -0.701 0.483014
## Personal.Male.Single 4.698e-01 3.750e-01 1.253 0.210300
## OtherDebtorsGuarantors.None -1.044e+00 5.749e-01 -1.817 0.069292 .
## OtherDebtorsGuarantors.CoApplicant -1.167e+00 7.453e-01 -1.566 0.117349
## Property.RealEstate 7.808e-01 5.317e-01 1.468 0.141990
## Property.Insurance 3.269e-01 5.206e-01 0.628 0.530048
## Property.CarOther 7.458e-01 5.089e-01 1.466 0.142739
## OtherInstallmentPlans.Bank -5.795e-01 2.937e-01 -1.973 0.048493 *
## OtherInstallmentPlans.Stores -5.174e-01 4.619e-01 -1.120 0.262692
## Housing.Rent -8.942e-01 5.908e-01 -1.514 0.130116
## Housing.Own -4.131e-01 5.593e-01 -0.739 0.460202
## Job.UnemployedUnskilled 1.175e+00 9.763e-01 1.204 0.228626
## Job.UnskilledResident -2.144e-01 4.270e-01 -0.502 0.615535
## Job.SkilledEmployee -1.170e-01 3.437e-01 -0.340 0.733609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 621.15 on 651 degrees of freedom
## AIC: 719.15
##
## Number of Fisher Scoring iterations: 5
InstallmentRatePercentage? Is it significant, and why? (2
pts)summary(logit_model)$coefficients["InstallmentRatePercentage", ]
## Estimate Std. Error z value Pr(>|z|)
## -0.3897870155 0.1091997206 -3.5694872977 0.0003576806
Your comment: The estimated coefficient for InstallmentRatePercentage is approximately -0.39 with a small p-value (<0.001). This means it is statistically significant at the 5% level.
beta_install <- coef(logit_model)["InstallmentRatePercentage"]
odds_ratio_install <- exp(beta_install)
odds_ratio_install
## InstallmentRatePercentage
## 0.6772011
Your comment: The odds ratio is about 0.68. This means that for each one-unit increase in InstallmentRatePercentage, the odds of being classified as Good are multiplied by about 0.68. (About a 32% decrease in the odds of good credit) ### Task 3: Model Evaluation (Part I)
train_pred_prob <- predict(logit_model,
newdata = train_data,
type = "response")
summary(train_pred_prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04427 0.50110 0.78645 0.70000 0.92189 0.99851
cutoff <- 0.5
# Predicted class: 1 = Good, 0 = Bad
train_pred_class <- ifelse(train_pred_prob >= cutoff, 1, 0)
# Actual: convert logical TRUE/FALSE to 1/0
actual_train <- ifelse(train_data$Class, 1, 0)
confusion_train <- table(Predicted = train_pred_class,
Actual = actual_train)
confusion_train
## Actual
## Predicted 0 1
## 0 122 52
## 1 88 438
# Misclassification rate
FP_train <- sum(actual_train == 0 & train_pred_class == 1)
FN_train <- sum(actual_train == 1 & train_pred_class == 0)
MR_train <- (FP_train + FN_train) / length(actual_train)
MR_train
## [1] 0.2
Your comment: The training confusion matrix shows about 80% correctly classified and MR is about 0.20 (20% of training observations misclassified). Performance is decent given the problem and imbalance.
pcut_seq <- seq(0, 1, by = 0.01)
MR_seq <- numeric(length(pcut_seq))
for(i in seq_along(pcut_seq)){
pcut <- pcut_seq[i]
pred_class_i <- ifelse(train_pred_prob >= pcut, 1, 0)
FP_i <- sum(actual_train == 0 & pred_class_i == 1)
FN_i <- sum(actual_train == 1 & pred_class_i == 0)
MR_seq[i] <- (FP_i + FN_i) / length(actual_train)
}
plot(pcut_seq, MR_seq, type = "l",
xlab = "Cut-off probability",
ylab = "Misclassification Rate (MR)",
main = "MR vs Cut-off Probability (Training Set)")
best_cut <- pcut_seq[which.min(MR_seq)]
best_cut
## [1] 0.49
min(MR_seq)
## [1] 0.1957143
Your comment: The minimum MR occurs at a cut-off around 0.49 (very close to 0.5), with MR slightly below 0.20. So the usual 0.5 threshold is nearly optimal.
library(ROCR)
pred_train <- prediction(train_pred_prob, actual_train)
perf_train <- performance(pred_train, "tpr", "fpr")
plot(perf_train,
col = "blue",
lwd = 2,
main = "ROC Curve - Training Set")
auc_train <- performance(pred_train, "auc")@y.values[[1]]
auc_train
## [1] 0.8345578
Your comment: The training AUC is about 0.83, which indicates good discrimination: the model has a strong ability to rank good vs. bad credit applicants better than random guessing. ## Task 4: Model Evaluation (Part II)
test_pred_prob <- predict(logit_model,
newdata = test_data,
type = "response")
summary(test_pred_prob)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06043 0.52355 0.80110 0.70132 0.91777 0.99664
cutoff <- 0.5
test_pred_class <- ifelse(test_pred_prob >= cutoff, 1, 0)
actual_test <- ifelse(test_data$Class, 1, 0)
confusion_test <- table(Predicted = test_pred_class,
Actual = actual_test)
confusion_test
## Actual
## Predicted 0 1
## 0 40 29
## 1 50 181
FP_test <- sum(actual_test == 0 & test_pred_class == 1)
FN_test <- sum(actual_test == 1 & test_pred_class == 0)
MR_test <- (FP_test + FN_test) / length(actual_test)
MR_test
## [1] 0.2633333
Your comment: On the test set, MR is about 0.26 (about 26% misclassified), worse than the training MR (about 20%). This drop is expected and suggests some overfitting but still reasonable generalization.
pred_test <- prediction(test_pred_prob, actual_test)
perf_test <- performance(pred_test, "tpr", "fpr")
plot(perf_test,
col = "red",
lwd = 2,
main = "ROC Curve - Test Set")
auc_test <- performance(pred_test, "auc")@y.values[[1]]
auc_test
## [1] 0.7964021
Your comment: The test AUC is about 0.80, slightly lower than the training AUC but still in a good range. This indicates the model maintains solid predictive power on unseen data with only modest performance degradation.