Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
if (!require(package_name, quietly = TRUE)) {
install.packages("caret")
}
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
#
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation:
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
# your code here:
dim(GermanCredit)
## [1] 1000 49
Your observation: There are 1000 observations and 49 variables.
class
(use table() function). How many observations are classed
as “good” and how many are “bad”? (2 pts)# your code here:
table(GermanCredit$Class)
##
## FALSE TRUE
## 300 700
Your observation: 300 are bad and 700 are good
class. Please add titles and labels to axis. (2 pts)# your code here:
barplot(table(GermanCredit$Class), main = "Distribution of Germany's Credit Class", xlab = "Class", ylab = "Frequency")
2025 is set for reproducibility. Please
comment on what is the split proportion you choose for training and
testing data? (2 pts)set.seed(2025) # set random seed for reproducibility.
# your code here:
train_ind <- sample(1:nrow(GermanCredit), 0.75 * nrow(GermanCredit))
GermanCredit_train <- GermanCredit[train_ind, ]
GermanCredit_test <- GermanCredit[-train_ind, ]
Your comment:I chose a 75% / 25% split for the training and testing data. This means 75% of the observations are used to train the logistic regression model, allowing it to learn patterns from most of the data, while 25% of the observations are held out for testing to evaluate the model’s performance on unseen data.
# your code here:
GermanCredit_logit <- glm(Class ~., data = GermanCredit_train, family = binomial)
InstallmentRatePercentage? Is it significant, and why? (2
pts)# your code here:
summary(GermanCredit_logit)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = GermanCredit_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.138e+00 1.659e+00 4.905 9.33e-07 ***
## Duration -2.287e-02 1.051e-02 -2.176 0.029549 *
## Amount -1.407e-04 5.069e-05 -2.775 0.005517 **
## InstallmentRatePercentage -3.080e-01 1.007e-01 -3.059 0.002223 **
## ResidenceDuration -8.627e-02 1.017e-01 -0.848 0.396285
## Age 1.401e-02 1.045e-02 1.340 0.180091
## NumberExistingCredits -3.447e-01 2.206e-01 -1.562 0.118192
## NumberPeopleMaintenance -3.991e-01 2.749e-01 -1.452 0.146546
## Telephone -3.329e-01 2.339e-01 -1.423 0.154683
## ForeignWorker -1.418e+00 7.389e-01 -1.919 0.055021 .
## CheckingAccountStatus.lt.0 -1.775e+00 2.712e-01 -6.544 5.98e-11 ***
## CheckingAccountStatus.0.to.200 -1.314e+00 2.678e-01 -4.906 9.30e-07 ***
## CheckingAccountStatus.gt.200 -1.009e+00 4.333e-01 -2.330 0.019815 *
## CreditHistory.NoCredit.AllPaid -1.034e+00 5.010e-01 -2.063 0.039097 *
## CreditHistory.ThisBank.AllPaid -1.433e+00 4.930e-01 -2.906 0.003663 **
## CreditHistory.PaidDuly -8.772e-01 2.975e-01 -2.949 0.003193 **
## CreditHistory.Delay -4.631e-01 3.826e-01 -1.210 0.226131
## Purpose.NewCar -1.169e+00 8.451e-01 -1.384 0.166449
## Purpose.UsedCar 5.667e-01 8.977e-01 0.631 0.527881
## Purpose.Furniture.Equipment -2.726e-01 8.570e-01 -0.318 0.750388
## Purpose.Radio.Television -2.016e-01 8.540e-01 -0.236 0.813402
## Purpose.DomesticAppliance -4.521e-01 1.222e+00 -0.370 0.711306
## Purpose.Repairs -1.303e+00 1.030e+00 -1.265 0.205940
## Purpose.Education -1.009e+00 9.280e-01 -1.087 0.277076
## Purpose.Retraining 7.012e-01 1.518e+00 0.462 0.644074
## Purpose.Business -5.485e-01 8.790e-01 -0.624 0.532654
## SavingsAccountBonds.lt.100 -1.092e+00 3.184e-01 -3.428 0.000608 ***
## SavingsAccountBonds.100.to.500 -8.785e-01 4.048e-01 -2.170 0.029972 *
## SavingsAccountBonds.500.to.1000 -8.902e-01 5.226e-01 -1.703 0.088478 .
## SavingsAccountBonds.gt.1000 3.225e-01 6.318e-01 0.510 0.609733
## EmploymentDuration.lt.1 4.102e-01 4.885e-01 0.840 0.401148
## EmploymentDuration.1.to.4 4.354e-01 4.675e-01 0.931 0.351677
## EmploymentDuration.4.to.7 1.268e+00 5.081e-01 2.496 0.012563 *
## EmploymentDuration.gt.7 5.514e-01 4.803e-01 1.148 0.250921
## Personal.Male.Divorced.Seperated -1.034e-01 5.311e-01 -0.195 0.845651
## Personal.Female.NotSingle -6.919e-02 3.665e-01 -0.189 0.850244
## Personal.Male.Single 5.270e-01 3.657e-01 1.441 0.149528
## OtherDebtorsGuarantors.None -6.635e-01 4.785e-01 -1.387 0.165542
## OtherDebtorsGuarantors.CoApplicant -9.064e-01 6.538e-01 -1.386 0.165638
## Property.RealEstate 3.235e-01 5.026e-01 0.644 0.519794
## Property.Insurance 2.209e-01 4.940e-01 0.447 0.654739
## Property.CarOther 2.288e-01 4.772e-01 0.480 0.631544
## OtherInstallmentPlans.Bank -7.996e-01 2.707e-01 -2.953 0.003142 **
## OtherInstallmentPlans.Stores -2.339e-01 4.453e-01 -0.525 0.599450
## Housing.Rent -4.498e-01 5.573e-01 -0.807 0.419647
## Housing.Own -2.336e-01 5.368e-01 -0.435 0.663475
## Job.UnemployedUnskilled 7.239e-01 7.837e-01 0.924 0.355629
## Job.UnskilledResident -4.570e-02 3.959e-01 -0.115 0.908101
## Job.SkilledEmployee -1.314e-01 3.274e-01 -0.402 0.688025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 931.04 on 749 degrees of freedom
## Residual deviance: 687.08 on 701 degrees of freedom
## AIC: 785.08
##
## Number of Fisher Scoring iterations: 5
Your comment: The coefficient is negative, which means that higher installment rates are associated with a lower probability of being a good credit risk. The InstallmentRatepercentage is -3.080e-01 The p-value is 0.002, which is less than 0.05. This indicates that the effect of InstallmentRatePercentage is statistically significant.
# you might need some code for calculation:
coef_GermanCredit <- coef(GermanCredit_logit)["InstallmentRatePercentage"]
odds_ratio <- exp(coef_GermanCredit)
odds_ratio
## InstallmentRatePercentage
## 0.7349491
Your comment: The estimated coefficient for InstallmentRatePercentage is -0.308. The corresponding odds ratio is 0.735, which means that for every 1-unit increase in InstallmentRatePercentage, the odds of having good credit decrease by about 26.5%. The p-value is 0.002, indicating this effect is statistically significant. Higher installment rates are therefore associated with a lower probability of being a good credit risk.
# your code here:
train_pred_prob <- predict(GermanCredit_logit, newdata= GermanCredit_train, type = "response")
head(train_pred_prob)
## 909 460 932 922 961 279
## 0.9624772 0.9035098 0.6215123 0.8649416 0.9471023 0.8748184
# your code here:
train_pred_class <- ifelse(train_pred_prob > 0.5, TRUE, FALSE)
conf_matrix_train <- table(Predicted = train_pred_class, Actual = GermanCredit_train$Class)
conf_matrix_train
## Actual
## Predicted FALSE TRUE
## FALSE 129 59
## TRUE 105 457
MR_train <- mean(train_pred_class != GermanCredit_train$Class)
MR_train
## [1] 0.2186667
Your comment: The confusion matrix shows that the model correctly predicted 129 bad credit cases and 457 good credit cases. It misclassified 59 good credit cases as bad and 105 bad credit cases as good. Overall, the model performs reasonably well, but it tends to misclassify some bad credit cases as good, which could be risky for lenders.The misclassification rate on the training data is 0.2187, meaning the model incorrectly classifies about 21.9% of the training observations.
# your code here:
cutoffs <- seq(0.01, 0.99, by = 0.01)
MR_values <- numeric(length(cutoffs))
for (i in 1:length(cutoffs)) {
pred_class <- ifelse(train_pred_prob > cutoffs[i], TRUE, FALSE)
MR_values[i] <- mean(pred_class != GermanCredit_train$Class)
}
optimal_cutoff <- cutoffs[which.min(MR_values)]
optimal_cutoff
## [1] 0.47
min(MR_values)
## [1] 0.2146667
plot(cutoffs, MR_values, type = "l", col = "blue", lwd = 2,
xlab = "Probability Cut-off", ylab = "Misclassification Rate (MR)",
main = "MR vs. Probability Cut-off")
abline(v = optimal_cutoff, col = "red", lty = 2)
text(optimal_cutoff, min(MR_values), labels = round(optimal_cutoff, 2), pos = 4, col = "red")
Your comment: The optimal is 0.47
# your code here:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_train <- roc(GermanCredit_train$Class, train_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_train, col = "blue", main = "ROC Curve Training Set")
auc_train <- auc(roc_train)
auc_train
## Area under the curve: 0.8306
Your comment: The AUC for the training set is 0.8306, which indicates strong model performance. Since the AUC is much higher than 0.5, the logistic regression model has good discriminative ability, it can effectively distinguish between “Good” and “Bad” credit applicants.
# your code here:
test_pred_prob <- predict(GermanCredit_logit, newdata= GermanCredit_test, type = "response")
head(test_pred_prob)
## 3 4 7 9 17 21
## 0.9723991 0.6708878 0.9167871 0.9902534 0.9744949 0.8554000
# your code here:
test_pred_class <- ifelse(test_pred_prob > 0.5, TRUE, FALSE)
conf_matrix_test <- table(Predicted = test_pred_class, Actual = GermanCredit_test$Class)
conf_matrix_test
## Actual
## Predicted FALSE TRUE
## FALSE 26 19
## TRUE 40 165
MR_test <- mean(train_pred_class != GermanCredit_test$Class)
MR_test
## [1] 0.376
Your comment: The confusion matrix shows the model’s performance on the test set. The model correctly classified 26 bad credit cases and 165 good credit cases. It misclassified 19 good credit cases as bad and 40 bad credit cases as good. The misclassification rate is 0.376, meaning approximately 37.6% of predictions were incorrect. This MR is higher than the training MR, indicating the model’s accuracy decreases on unseen data, though it still captures many correct classifications.
# your code here:
roc_test <- roc(GermanCredit_test$Class, test_pred_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_test, col = "blue", main = "ROC Curve Test Set")
auc_test <- auc(roc_test)
auc_test
## Area under the curve: 0.8167