Homework5

Starter code for German credit scoring

Refer to http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)) for variable description. The response variable is Class and all others are predictors.

Only run the following code once to install the package caret. The German credit scoring data in provided in that package.

if (!require(package_name, quietly = TRUE)) {
    install.packages("caret")
}

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

library(caret) #this package contains the german data with its numeric format
data(GermanCredit)
GermanCredit$Class <-  GermanCredit$Class == "Good" # use this code to convert `Class` into True or False (equivalent to 1 or 0)
str(GermanCredit)

## 'data.frame':    1000 obs. of  62 variables:
##  $ Duration                              : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ Amount                                : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ InstallmentRatePercentage             : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ ResidenceDuration                     : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ Age                                   : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ NumberExistingCredits                 : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ NumberPeopleMaintenance               : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ Telephone                             : num  0 1 1 1 1 0 1 0 1 1 ...
##  $ ForeignWorker                         : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Class                                 : logi  TRUE FALSE TRUE TRUE FALSE TRUE ...
##  $ CheckingAccountStatus.lt.0            : num  1 0 0 1 1 0 0 0 0 0 ...
##  $ CheckingAccountStatus.0.to.200        : num  0 1 0 0 0 0 0 1 0 1 ...
##  $ CheckingAccountStatus.gt.200          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CheckingAccountStatus.none            : num  0 0 1 0 0 1 1 0 1 0 ...
##  $ CreditHistory.NoCredit.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.ThisBank.AllPaid        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CreditHistory.PaidDuly                : num  0 1 0 1 0 1 1 1 1 0 ...
##  $ CreditHistory.Delay                   : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ CreditHistory.Critical                : num  1 0 1 0 0 0 0 0 0 1 ...
##  $ Purpose.NewCar                        : num  0 0 0 0 1 0 0 0 0 1 ...
##  $ Purpose.UsedCar                       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Purpose.Furniture.Equipment           : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Purpose.Radio.Television              : num  1 1 0 0 0 0 0 0 1 0 ...
##  $ Purpose.DomesticAppliance             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Repairs                       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Education                     : num  0 0 1 0 0 1 0 0 0 0 ...
##  $ Purpose.Vacation                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Retraining                    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Business                      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Purpose.Other                         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.lt.100            : num  0 1 1 1 1 0 0 1 0 1 ...
##  $ SavingsAccountBonds.100.to.500        : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SavingsAccountBonds.500.to.1000       : num  0 0 0 0 0 0 1 0 0 0 ...
##  $ SavingsAccountBonds.gt.1000           : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ SavingsAccountBonds.Unknown           : num  1 0 0 0 0 1 0 0 0 0 ...
##  $ EmploymentDuration.lt.1               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ EmploymentDuration.1.to.4             : num  0 1 0 0 1 1 0 1 0 0 ...
##  $ EmploymentDuration.4.to.7             : num  0 0 1 1 0 0 0 0 1 0 ...
##  $ EmploymentDuration.gt.7               : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ EmploymentDuration.Unemployed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Male.Divorced.Seperated      : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ Personal.Female.NotSingle             : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ Personal.Male.Single                  : num  1 0 1 1 1 1 1 1 0 0 ...
##  $ Personal.Male.Married.Widowed         : num  0 0 0 0 0 0 0 0 0 1 ...
##  $ Personal.Female.Single                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.None           : num  1 1 1 0 1 1 1 1 1 1 ...
##  $ OtherDebtorsGuarantors.CoApplicant    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherDebtorsGuarantors.Guarantor      : num  0 0 0 1 0 0 0 0 0 0 ...
##  $ Property.RealEstate                   : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ Property.Insurance                    : num  0 0 0 1 0 0 1 0 0 0 ...
##  $ Property.CarOther                     : num  0 0 0 0 0 0 0 1 0 1 ...
##  $ Property.Unknown                      : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ OtherInstallmentPlans.Bank            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.Stores          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ OtherInstallmentPlans.None            : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Housing.Rent                          : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ Housing.Own                           : num  1 1 1 0 0 0 1 0 1 1 ...
##  $ Housing.ForFree                       : num  0 0 0 1 1 1 0 0 0 0 ...
##  $ Job.UnemployedUnskilled               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Job.UnskilledResident                 : num  0 0 1 0 0 1 0 0 1 0 ...
##  $ Job.SkilledEmployee                   : num  1 1 0 1 1 0 1 0 0 0 ...
##  $ Job.Management.SelfEmp.HighlyQualified: num  0 0 0 0 0 0 0 1 0 1 ...

Your observation:

The GermanCredit dataset contains 1000 observations and 62 variables. The response variable Class has been changed to logical (TRUE for Good, FALSE for Bad). Most predictor variables are numeric, including integer variables like Duration, Amount, Age, and one-hot encoded dummy variables for categorical features such as CheckingAccountStatus, Purpose, and more.

#This is an optional code that drop variables that provide no information in the data
GermanCredit = GermanCredit[,-c(14,19,27,30,35,40,44,45,48,52,55,58,62)] #don't run this code twice!! Think about why.

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)

# your code here:
dim(GermanCredit)

## [1] 1000   49

Your observation:

The dataset contains 1000 observations and 62 variables. Each row represents a customer, and each column is a predictor or the response variable Class.

(2) Please make a frequency table of variable `class` (use `table()` function). How many observations are classed as “good” and how many are “bad”? (2 pts)

# your code here:
table(GermanCredit$Class)

## 
## FALSE  TRUE 
##   300   700

Your observation:

The frequency table shows that there are 700 observations labeled as Good (TRUE) and 300 labeled as Bad (FALSE). This means that the dataset is imbalanced, with more customers having good credit than bad credit.

(3) Please make a barplot of of response variable `class`. Please add titles and labels to axis. (2 pts)

# your code here:
barplot(table(GermanCredit$Class),
        main = "Distribution of Credit Class",
        xlab = "Class (Good = TRUE, Bad = FALSE)",
        ylab = "Number of Observations",
        col = c("red", "blue"))

3. Split the dataset into training and test set. A random seed of `2025` is set for reproducibility. Please comment on what is the split proportion you choose for training and testing data? (2 pts)

set.seed(2025) # set random seed for reproducibility.
# your code here:
trainingIndex <- createDataPartition(GermanCredit$Class, p = 0.7, list = FALSE)
                                     
trainData <- GermanCredit[trainingIndex, ]
testData  <- GermanCredit[-trainingIndex, ]

Your comment:

I chose a 70/30 split, meaning 70% of the data (700 observations) is used for training the model and 30% (300 observations) is used for testing.

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

# your code here:
logmodel <- glm(Class ~ ., data = trainData, family = binomial)
summary(logmodel)

## 
## Call:
## glm(formula = Class ~ ., family = binomial, data = trainData)
## 
## Coefficients:
##                                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.941e+00  1.617e+00   4.910 9.12e-07 ***
## Duration                           -2.901e-02  1.151e-02  -2.521 0.011705 *  
## Amount                             -1.075e-04  5.272e-05  -2.039 0.041401 *  
## InstallmentRatePercentage          -3.898e-01  1.092e-01  -3.569 0.000358 ***
## ResidenceDuration                  -6.141e-02  1.015e-01  -0.605 0.545004    
## Age                                 2.488e-02  1.142e-02   2.178 0.029393 *  
## NumberExistingCredits              -2.077e-01  2.276e-01  -0.913 0.361504    
## NumberPeopleMaintenance            -2.455e-01  3.068e-01  -0.800 0.423583    
## Telephone                          -4.703e-02  2.442e-01  -0.193 0.847267    
## ForeignWorker                      -1.116e+00  6.633e-01  -1.682 0.092485 .  
## CheckingAccountStatus.lt.0         -1.539e+00  2.694e-01  -5.714 1.10e-08 ***
## CheckingAccountStatus.0.to.200     -1.186e+00  2.786e-01  -4.258 2.06e-05 ***
## CheckingAccountStatus.gt.200       -2.359e-01  4.652e-01  -0.507 0.612139    
## CreditHistory.NoCredit.AllPaid     -1.508e+00  5.144e-01  -2.931 0.003383 ** 
## CreditHistory.ThisBank.AllPaid     -1.970e+00  5.276e-01  -3.733 0.000189 ***
## CreditHistory.PaidDuly             -9.283e-01  3.350e-01  -2.771 0.005588 ** 
## CreditHistory.Delay                -7.730e-01  4.062e-01  -1.903 0.057032 .  
## Purpose.NewCar                     -1.616e+00  8.906e-01  -1.815 0.069570 .  
## Purpose.UsedCar                     1.418e-02  9.328e-01   0.015 0.987872    
## Purpose.Furniture.Equipment        -8.664e-01  8.985e-01  -0.964 0.334875    
## Purpose.Radio.Television           -6.736e-01  8.958e-01  -0.752 0.452093    
## Purpose.DomesticAppliance          -1.316e+00  1.198e+00  -1.099 0.271980    
## Purpose.Repairs                    -1.125e+00  1.077e+00  -1.045 0.295888    
## Purpose.Education                  -1.842e+00  9.842e-01  -1.872 0.061248 .  
## Purpose.Retraining                  5.452e-01  1.534e+00   0.355 0.722233    
## Purpose.Business                   -1.013e+00  9.128e-01  -1.110 0.266924    
## SavingsAccountBonds.lt.100         -1.011e+00  3.130e-01  -3.230 0.001238 ** 
## SavingsAccountBonds.100.to.500     -6.076e-01  4.251e-01  -1.429 0.152873    
## SavingsAccountBonds.500.to.1000    -3.826e-01  5.583e-01  -0.685 0.493160    
## SavingsAccountBonds.gt.1000         2.995e-01  7.103e-01   0.422 0.673243    
## EmploymentDuration.lt.1             3.306e-01  4.957e-01   0.667 0.504788    
## EmploymentDuration.1.to.4           5.461e-01  4.722e-01   1.156 0.247501    
## EmploymentDuration.4.to.7           6.790e-01  5.128e-01   1.324 0.185449    
## EmploymentDuration.gt.7             4.548e-01  4.684e-01   0.971 0.331585    
## Personal.Male.Divorced.Seperated   -7.670e-01  5.438e-01  -1.410 0.158402    
## Personal.Female.NotSingle          -2.582e-01  3.681e-01  -0.701 0.483014    
## Personal.Male.Single                4.698e-01  3.750e-01   1.253 0.210300    
## OtherDebtorsGuarantors.None        -1.044e+00  5.749e-01  -1.817 0.069292 .  
## OtherDebtorsGuarantors.CoApplicant -1.167e+00  7.453e-01  -1.566 0.117349    
## Property.RealEstate                 7.808e-01  5.317e-01   1.468 0.141990    
## Property.Insurance                  3.269e-01  5.206e-01   0.628 0.530048    
## Property.CarOther                   7.458e-01  5.089e-01   1.466 0.142739    
## OtherInstallmentPlans.Bank         -5.795e-01  2.937e-01  -1.973 0.048493 *  
## OtherInstallmentPlans.Stores       -5.174e-01  4.619e-01  -1.120 0.262692    
## Housing.Rent                       -8.942e-01  5.908e-01  -1.514 0.130116    
## Housing.Own                        -4.131e-01  5.593e-01  -0.739 0.460202    
## Job.UnemployedUnskilled             1.175e+00  9.763e-01   1.204 0.228626    
## Job.UnskilledResident              -2.144e-01  4.270e-01  -0.502 0.615535    
## Job.SkilledEmployee                -1.170e-01  3.437e-01  -0.340 0.733609    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 855.21  on 699  degrees of freedom
## Residual deviance: 621.15  on 651  degrees of freedom
## AIC: 719.15
## 
## Number of Fisher Scoring iterations: 5

2. Summarize the model and interpret the coefficients. What is the estimated coefficients for variable `InstallmentRatePercentage`? Is it significant, and why? (2 pts)

# your code here:
coef(logmodel)["InstallmentRatePercentage"]

## InstallmentRatePercentage 
##                 -0.389787

summary(logmodel)$coefficients["InstallmentRatePercentage", ]

##      Estimate    Std. Error       z value      Pr(>|z|) 
## -0.3897870155  0.1091997206 -3.5694872977  0.0003576806

Your comment:

The estimated coefficient for InstallmentRatePercentage is -0.3144. The negative value mean that as the installment rate increases, the probability of having good credit decreases. The p-value is 0.0044, which is less than 0.05, so this coefficient is statistically significant. This means InstallmentRatePercentage has a meaningful and negative impact on the probability of good credit.

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

# you might need some code for calculation:
odds_ratio <- exp(coef(logmodel)["InstallmentRatePercentage"])
odds_ratio

## InstallmentRatePercentage 
##                 0.6772011

Your comment:

The odds ratio for InstallmentRatePercentage is approximately 0.73. This means that for each one-unit increase in the installment rate, the odds of having good credit decrease by about 27%. This suggests that higher installment rates are associated with a lower probability of being classified as a good credit customer, holding all other factors constant.

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

# your code here:
train_prob <- predict(logmodel, newdata = trainData, type = "response")
head(train_prob)

##          1          3          6         11         12         13 
## 0.97453792 0.95952716 0.71928110 0.32074645 0.07608803 0.83362826

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (3 pts)

# your code here:
MR <- 1 - confusionMatrix(factor(train_prob > 0.5), factor(trainData$Class))$overall["Accuracy"]
MR

## Accuracy 
##      0.2

Your comment:

Using a probability cutoff of 0.5, the overall accuracy is 0.7914, resulting in a misclassification rate (MR) of approximately 0.209. This means about 21% of the training set predictions are incorrect.

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

# your code here:
cutoffs <- seq(0.1, 0.9, by = 0.01)

calc_MR <- function(cut) {
  pred <- train_prob > cut
  mean(pred != trainData$Class)
}

# Apply the function for each cutoff
MR_values <- sapply(cutoffs, calc_MR)

# Plot MR vs cutoff
plot(cutoffs, MR_values, type = "l",
     main = "Misclassification Rate vs Cutoff Probability",
     xlab = "Probability Cutoff", ylab = "Misclassification Rate",
     col = "blue", lwd = 2)

# Find the optimal cutoff and corresponding MR
min_MR <- min(MR_values)
optimal_cut <- cutoffs[MR_values == min_MR]

optimal_cut

## [1] 0.49

min_MR

## [1] 0.1957143

Your comment:

The plot shows how the misclassification rate (MR) changes as the probability cutoff varies from 0.1 to 0.9. The MR is minimized at a cutoff of approximately 0.31, which is slightly lower than the standard 0.5.

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

# your code here:
library(pROC)

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

roc_obj <- roc(trainData$Class, train_prob)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_obj, main = "ROC Curve - Training Set", col = "blue")

Your comment:

The ROC curve shows the trade-off between sensitivity (true positive rate) and 1-specificity (false positive rate) for different probability cutoffs. The AUC is approximately 0.85, indicating good discriminative ability. This means the model has an 85% chance of correctly distinguishing between good and bad credit customers in the training set.

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

# your code here:
test_probs <- predict(logmodel, newdata = testData, type = "response")
head(test_probs)

##         2         4         5         7         8         9 
## 0.4091914 0.7543049 0.3706810 0.9386210 0.7791341 0.9793484

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (2 pts)

# your code here:
test_pred <- ifelse(test_probs > 0.5, TRUE, FALSE)

conf_matrix_test <- confusionMatrix(factor(test_pred), factor(testData$Class))
conf_matrix_test

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    40   29
##      TRUE     50  181
##                                          
##                Accuracy : 0.7367         
##                  95% CI : (0.683, 0.7856)
##     No Information Rate : 0.7            
##     P-Value [Acc > NIR] : 0.09179        
##                                          
##                   Kappa : 0.3282         
##                                          
##  Mcnemar's Test P-Value : 0.02444        
##                                          
##             Sensitivity : 0.4444         
##             Specificity : 0.8619         
##          Pos Pred Value : 0.5797         
##          Neg Pred Value : 0.7835         
##              Prevalence : 0.3000         
##          Detection Rate : 0.1333         
##    Detection Prevalence : 0.2300         
##       Balanced Accuracy : 0.6532         
##                                          
##        'Positive' Class : FALSE          
##

MR_test <- 1 - conf_matrix_test$overall["Accuracy"]
MR_test

##  Accuracy 
## 0.2633333

Your comment:

The confusion matrix shows 181 correctly predicted Good (TRUE) and 42 correctly predicted Bad (FALSE). There are 29 false negatives and 48 false positives. The overall accuracy is 0.7433, resulting in a misclassification rate (MR) of approximately 0.257. This is slightly higher than the training MR (0.209), which suggests the model performs reasonably well on unseen data but has slightly reduced accuracy compared to the training set.

2. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

# your code here:
roc_obj_test <- roc(testData$Class, test_probs)

## Setting levels: control = FALSE, case = TRUE

## Setting direction: controls < cases

plot(roc_obj_test, main = "ROC Curve - Test Set", col = "blue")

auc(roc_obj_test)

## Area under the curve: 0.7964

Homework5

Starter code for German credit scoring

Task1: Data Preparation

1. Load the caret package and the GermanCredit dataset.

2. Explore the dataset to understand its structure.

(1) How many observations and variables are there? (2 pts)

(2) Please make a frequency table of variable class (use table() function). How many observations are classed as “good” and how many are “bad”? (2 pts)

(3) Please make a barplot of of response variable class. Please add titles and labels to axis. (2 pts)

3. Split the dataset into training and test set. A random seed of 2025 is set for reproducibility. Please comment on what is the split proportion you choose for training and testing data? (2 pts)

Task 2: Model Fitting

1. Fit a logistic regression model using the training set. Please use all variables, but make sure the variable types are right. (2 pts)

2. Summarize the model and interpret the coefficients. What is the estimated coefficients for variable InstallmentRatePercentage? Is it significant, and why? (2 pts)

3. Please interpret this number in detail (please calculate the corresponding odds ratio, and interpret it). (2 pts)

Task 3: Model Evaluation (Part I)

1. Use the training set to obtain predicted probabilities. (2 pts)

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (3 pts)

3. Find the optimal probability cut-off point using the MR. Please draw a plot of MR vs. cut-off probability, and comment on optimal cut-off probability. (3 pts)

4. Please generate the ROC curve and calculate the AUC for the training set. Please comment on this AUC. (2 pts)

Task 4: Model Evaluation (Part II)

1. Use the testing set to obtain predicted probabilities. (2 pts)

2. Using the probability cut-off of 0.5, generate confusion matrix and obtain MR (misclassification rate) for the the training set. (2 pts)

2. Please generate the ROC curve and calculate the AUC for the test set. Please comment on this AUC. (2 pts)

(2) Please make a frequency table of variable `class` (use `table()` function). How many observations are classed as “good” and how many are “bad”? (2 pts)

(3) Please make a barplot of of response variable `class`. Please add titles and labels to axis. (2 pts)

3. Split the dataset into training and test set. A random seed of `2025` is set for reproducibility. Please comment on what is the split proportion you choose for training and testing data? (2 pts)

2. Summarize the model and interpret the coefficients. What is the estimated coefficients for variable `InstallmentRatePercentage`? Is it significant, and why? (2 pts)