Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data))
for variable description. The response variable is Class
and all others are predictors.
Only run the following code once to install the package
caret. The German credit scoring data in
provided in that package.
if (!require(package_name, quietly = TRUE)) {
install.packages("caret")
}
library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <- GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)
## 'data.frame': 1000 obs. of 62 variables:
## $ Duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ Amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ InstallmentRatePercentage : int 4 2 2 2 3 2 3 2 2 4 ...
## $ ResidenceDuration : int 4 2 3 4 4 4 4 2 4 2 ...
## $ Age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ NumberExistingCredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ NumberPeopleMaintenance : int 1 1 2 2 2 2 1 1 1 1 ...
## $ Telephone : num 0 1 1 1 1 0 1 0 1 1 ...
## $ ForeignWorker : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Class : logi TRUE FALSE TRUE TRUE FALSE TRUE ...
## $ CheckingAccountStatus.lt.0 : num 1 0 0 1 1 0 0 0 0 0 ...
## $ CheckingAccountStatus.0.to.200 : num 0 1 0 0 0 0 0 1 0 1 ...
## $ CheckingAccountStatus.gt.200 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CheckingAccountStatus.none : num 0 0 1 0 0 1 1 0 1 0 ...
## $ CreditHistory.NoCredit.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.ThisBank.AllPaid : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CreditHistory.PaidDuly : num 0 1 0 1 0 1 1 1 1 0 ...
## $ CreditHistory.Delay : num 0 0 0 0 1 0 0 0 0 0 ...
## $ CreditHistory.Critical : num 1 0 1 0 0 0 0 0 0 1 ...
## $ Purpose.NewCar : num 0 0 0 0 1 0 0 0 0 1 ...
## $ Purpose.UsedCar : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Purpose.Furniture.Equipment : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Purpose.Radio.Television : num 1 1 0 0 0 0 0 0 1 0 ...
## $ Purpose.DomesticAppliance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Repairs : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Education : num 0 0 1 0 0 1 0 0 0 0 ...
## $ Purpose.Vacation : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Retraining : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Business : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Purpose.Other : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.lt.100 : num 0 1 1 1 1 0 0 1 0 1 ...
## $ SavingsAccountBonds.100.to.500 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SavingsAccountBonds.500.to.1000 : num 0 0 0 0 0 0 1 0 0 0 ...
## $ SavingsAccountBonds.gt.1000 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ SavingsAccountBonds.Unknown : num 1 0 0 0 0 1 0 0 0 0 ...
## $ EmploymentDuration.lt.1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ EmploymentDuration.1.to.4 : num 0 1 0 0 1 1 0 1 0 0 ...
## $ EmploymentDuration.4.to.7 : num 0 0 1 1 0 0 0 0 1 0 ...
## $ EmploymentDuration.gt.7 : num 1 0 0 0 0 0 1 0 0 0 ...
## $ EmploymentDuration.Unemployed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Male.Divorced.Seperated : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Personal.Female.NotSingle : num 0 1 0 0 0 0 0 0 0 0 ...
## $ Personal.Male.Single : num 1 0 1 1 1 1 1 1 0 0 ...
## $ Personal.Male.Married.Widowed : num 0 0 0 0 0 0 0 0 0 1 ...
## $ Personal.Female.Single : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.None : num 1 1 1 0 1 1 1 1 1 1 ...
## $ OtherDebtorsGuarantors.CoApplicant : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherDebtorsGuarantors.Guarantor : num 0 0 0 1 0 0 0 0 0 0 ...
## $ Property.RealEstate : num 1 1 1 0 0 0 0 0 1 0 ...
## $ Property.Insurance : num 0 0 0 1 0 0 1 0 0 0 ...
## $ Property.CarOther : num 0 0 0 0 0 0 0 1 0 1 ...
## $ Property.Unknown : num 0 0 0 0 1 1 0 0 0 0 ...
## $ OtherInstallmentPlans.Bank : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.Stores : num 0 0 0 0 0 0 0 0 0 0 ...
## $ OtherInstallmentPlans.None : num 1 1 1 1 1 1 1 1 1 1 ...
## $ Housing.Rent : num 0 0 0 0 0 0 0 1 0 0 ...
## $ Housing.Own : num 1 1 1 0 0 0 1 0 1 1 ...
## $ Housing.ForFree : num 0 0 0 1 1 1 0 0 0 0 ...
## $ Job.UnemployedUnskilled : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Job.UnskilledResident : num 0 0 1 0 0 1 0 0 1 0 ...
## $ Job.SkilledEmployee : num 1 1 0 1 1 0 1 0 0 0 ...
## $ Job.Management.SelfEmp.HighlyQualified: num 0 0 0 0 0 0 0 1 0 1 ...
Your observation:
The GermanCredit dataset contains 1000 observations and 62 variables. The response variable Class has been changed to logical (TRUE for Good, FALSE for Bad). Most predictor variables are numeric, including integer variables like Duration, Amount, Age, and one-hot encoded dummy variables for categorical features such as CheckingAccountStatus, Purpose, and more.
#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.
# your code here:
dim(GermanCredit)
## [1] 1000 49
Your observation:
The dataset contains 1000 observations and 62 variables. Each row represents a customer, and each column is a predictor or the response variable Class.
class
(use table() function). How many observations are classed
as “good” and how many are “bad”? (2 pts)# your code here:
table(GermanCredit$Class)
##
## FALSE TRUE
## 300 700
Your observation:
The frequency table shows that there are 700 observations labeled as Good (TRUE) and 300 labeled as Bad (FALSE). This means that the dataset is imbalanced, with more customers having good credit than bad credit.
class. Please add titles and labels to axis. (2 pts)# your code here:
barplot(table(GermanCredit$Class),
main = "Distribution of Credit Class",
xlab = "Class (Good = TRUE, Bad = FALSE)",
ylab = "Number of Observations",
col = c("red", "blue"))
2025 is set for reproducibility. Please
comment on what is the split proportion you choose for training and
testing data? (2 pts)set.seed(2025) # set random seed for reproducibility.
# your code here:
trainingIndex <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
trainData <- GermanCredit[trainingIndex, ]
testData <- GermanCredit[-trainingIndex, ]
Your comment:
I chose a 70/30 split, meaning 70% of the data (700 observations) is used for training the model and 30% (300 observations) is used for testing.
# your code here:
logmodel <- glm(Class ~ ., data = trainData, family = binomial)
summary(logmodel)
##
## Call:
## glm(formula = Class ~ ., family = binomial, data = trainData)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.941e+00 1.617e+00 4.910 9.12e-07 ***
## Duration -2.901e-02 1.151e-02 -2.521 0.011705 *
## Amount -1.075e-04 5.272e-05 -2.039 0.041401 *
## InstallmentRatePercentage -3.898e-01 1.092e-01 -3.569 0.000358 ***
## ResidenceDuration -6.141e-02 1.015e-01 -0.605 0.545004
## Age 2.488e-02 1.142e-02 2.178 0.029393 *
## NumberExistingCredits -2.077e-01 2.276e-01 -0.913 0.361504
## NumberPeopleMaintenance -2.455e-01 3.068e-01 -0.800 0.423583
## Telephone -4.703e-02 2.442e-01 -0.193 0.847267
## ForeignWorker -1.116e+00 6.633e-01 -1.682 0.092485 .
## CheckingAccountStatus.lt.0 -1.539e+00 2.694e-01 -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200 -1.186e+00 2.786e-01 -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200 -2.359e-01 4.652e-01 -0.507 0.612139
## CreditHistory.NoCredit.AllPaid -1.508e+00 5.144e-01 -2.931 0.003383 **
## CreditHistory.ThisBank.AllPaid -1.970e+00 5.276e-01 -3.733 0.000189 ***
## CreditHistory.PaidDuly -9.283e-01 3.350e-01 -2.771 0.005588 **
## CreditHistory.Delay -7.730e-01 4.062e-01 -1.903 0.057032 .
## Purpose.NewCar -1.616e+00 8.906e-01 -1.815 0.069570 .
## Purpose.UsedCar 1.418e-02 9.328e-01 0.015 0.987872
## Purpose.Furniture.Equipment -8.664e-01 8.985e-01 -0.964 0.334875
## Purpose.Radio.Television -6.736e-01 8.958e-01 -0.752 0.452093
## Purpose.DomesticAppliance -1.316e+00 1.198e+00 -1.099 0.271980
## Purpose.Repairs -1.125e+00 1.077e+00 -1.045 0.295888
## Purpose.Education -1.842e+00 9.842e-01 -1.872 0.061248 .
## Purpose.Retraining 5.452e-01 1.534e+00 0.355 0.722233
## Purpose.Business -1.013e+00 9.128e-01 -1.110 0.266924
## SavingsAccountBonds.lt.100 -1.011e+00 3.130e-01 -3.230 0.001238 **
## SavingsAccountBonds.100.to.500 -6.076e-01 4.251e-01 -1.429 0.152873
## SavingsAccountBonds.500.to.1000 -3.826e-01 5.583e-01 -0.685 0.493160
## SavingsAccountBonds.gt.1000 2.995e-01 7.103e-01 0.422 0.673243
## EmploymentDuration.lt.1 3.306e-01 4.957e-01 0.667 0.504788
## EmploymentDuration.1.to.4 5.461e-01 4.722e-01 1.156 0.247501
## EmploymentDuration.4.to.7 6.790e-01 5.128e-01 1.324 0.185449
## EmploymentDuration.gt.7 4.548e-01 4.684e-01 0.971 0.331585
## Personal.Male.Divorced.Seperated -7.670e-01 5.438e-01 -1.410 0.158402
## Personal.Female.NotSingle -2.582e-01 3.681e-01 -0.701 0.483014
## Personal.Male.Single 4.698e-01 3.750e-01 1.253 0.210300
## OtherDebtorsGuarantors.None -1.044e+00 5.749e-01 -1.817 0.069292 .
## OtherDebtorsGuarantors.CoApplicant -1.167e+00 7.453e-01 -1.566 0.117349
## Property.RealEstate 7.808e-01 5.317e-01 1.468 0.141990
## Property.Insurance 3.269e-01 5.206e-01 0.628 0.530048
## Property.CarOther 7.458e-01 5.089e-01 1.466 0.142739
## OtherInstallmentPlans.Bank -5.795e-01 2.937e-01 -1.973 0.048493 *
## OtherInstallmentPlans.Stores -5.174e-01 4.619e-01 -1.120 0.262692
## Housing.Rent -8.942e-01 5.908e-01 -1.514 0.130116
## Housing.Own -4.131e-01 5.593e-01 -0.739 0.460202
## Job.UnemployedUnskilled 1.175e+00 9.763e-01 1.204 0.228626
## Job.UnskilledResident -2.144e-01 4.270e-01 -0.502 0.615535
## Job.SkilledEmployee -1.170e-01 3.437e-01 -0.340 0.733609
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 855.21 on 699 degrees of freedom
## Residual deviance: 621.15 on 651 degrees of freedom
## AIC: 719.15
##
## Number of Fisher Scoring iterations: 5
InstallmentRatePercentage? Is it significant, and why? (2
pts)# your code here:
coef(logmodel)["InstallmentRatePercentage"]
## InstallmentRatePercentage
## -0.389787
summary(logmodel)$coefficients["InstallmentRatePercentage", ]
## Estimate Std. Error z value Pr(>|z|)
## -0.3897870155 0.1091997206 -3.5694872977 0.0003576806
Your comment:
The estimated coefficient for InstallmentRatePercentage is -0.3144. The negative value mean that as the installment rate increases, the probability of having good credit decreases. The p-value is 0.0044, which is less than 0.05, so this coefficient is statistically significant. This means InstallmentRatePercentage has a meaningful and negative impact on the probability of good credit.
# you might need some code for calculation:
odds_ratio <- exp(coef(logmodel)["InstallmentRatePercentage"])
odds_ratio
## InstallmentRatePercentage
## 0.6772011
Your comment:
The odds ratio for InstallmentRatePercentage is approximately 0.73. This means that for each one-unit increase in the installment rate, the odds of having good credit decrease by about 27%. This suggests that higher installment rates are associated with a lower probability of being classified as a good credit customer, holding all other factors constant.
# your code here:
train_prob <- predict(logmodel, newdata = trainData, type = "response")
head(train_prob)
## 1 3 6 11 12 13
## 0.97453792 0.95952716 0.71928110 0.32074645 0.07608803 0.83362826
# your code here:
MR <- 1 - confusionMatrix(factor(train_prob > 0.5), factor(trainData$Class))$overall["Accuracy"]
MR
## Accuracy
## 0.2
Your comment:
Using a probability cutoff of 0.5, the overall accuracy is 0.7914, resulting in a misclassification rate (MR) of approximately 0.209. This means about 21% of the training set predictions are incorrect.
# your code here:
cutoffs <- seq(0.1, 0.9, by = 0.01)
calc_MR <- function(cut) {
pred <- train_prob > cut
mean(pred != trainData$Class)
}
# Apply the function for each cutoff
MR_values <- sapply(cutoffs, calc_MR)
# Plot MR vs cutoff
plot(cutoffs, MR_values, type = "l",
main = "Misclassification Rate vs Cutoff Probability",
xlab = "Probability Cutoff", ylab = "Misclassification Rate",
col = "blue", lwd = 2)
# Find the optimal cutoff and corresponding MR
min_MR <- min(MR_values)
optimal_cut <- cutoffs[MR_values == min_MR]
optimal_cut
## [1] 0.49
min_MR
## [1] 0.1957143
Your comment:
The plot shows how the misclassification rate (MR) changes as the probability cutoff varies from 0.1 to 0.9. The MR is minimized at a cutoff of approximately 0.31, which is slightly lower than the standard 0.5.
# your code here:
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_obj <- roc(trainData$Class, train_prob)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_obj, main = "ROC Curve - Training Set", col = "blue")
Your comment:
The ROC curve shows the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) for different probability cutoffs. The AUC is approximately 0.85, indicating good discriminative ability. This means the model has an 85% chance of correctly distinguishing between good and bad credit customers in the training set.
# your code here:
test_probs <- predict(logmodel, newdata = testData, type = "response")
head(test_probs)
## 2 4 5 7 8 9
## 0.4091914 0.7543049 0.3706810 0.9386210 0.7791341 0.9793484
# your code here:
test_pred <- ifelse(test_probs > 0.5, TRUE, FALSE)
conf_matrix_test <- confusionMatrix(factor(test_pred), factor(testData$Class))
conf_matrix_test
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 40 29
## TRUE 50 181
##
## Accuracy : 0.7367
## 95% CI : (0.683, 0.7856)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.09179
##
## Kappa : 0.3282
##
## Mcnemar's Test P-Value : 0.02444
##
## Sensitivity : 0.4444
## Specificity : 0.8619
## Pos Pred Value : 0.5797
## Neg Pred Value : 0.7835
## Prevalence : 0.3000
## Detection Rate : 0.1333
## Detection Prevalence : 0.2300
## Balanced Accuracy : 0.6532
##
## 'Positive' Class : FALSE
##
MR_test <- 1 - conf_matrix_test$overall["Accuracy"]
MR_test
## Accuracy
## 0.2633333
Your comment:
The confusion matrix shows 181 correctly predicted Good (TRUE) and 42 correctly predicted Bad (FALSE). There are 29 false negatives and 48 false positives. The overall accuracy is 0.7433, resulting in a misclassification rate (MR) of approximately 0.257. This is slightly higher than the training MR (0.209), which suggests the model performs reasonably well on unseen data but has slightly reduced accuracy compared to the training set.
# your code here:
roc_obj_test <- roc(testData$Class, test_probs)
## Setting levels: control = FALSE, case = TRUE
## Setting direction: controls < cases
plot(roc_obj_test, main = "ROC Curve - Test Set", col = "blue")
auc(roc_obj_test)
## Area under the curve: 0.7964