Introduction

In this tutorial, you will learn how to perform logistic regression in R. Logistic regression is used when the dependent variable is binary (i.e., has two possible outcomes). We will use the German Credit Data from the UCI Machine Learning Repository to predict credit risk (good vs bad credit).

Objectives:

  • Load and explore the dataset
  • Prepare the data for modeling
  • Build a logistic regression model
  • Evaluate model performance
  • Visualize logistic regression results

Logistic Regression Workflow

  1. Problem Understanding
    • Define the problem, ensuring that the target variable is binary (e.g., Yes/No, 0/1).
    • Identify the features (independent variables) that will be used to predict the target.
  2. Data Preparation
    • Load the data and inspect its structure.
    • Handle missing values and outliers if present.
  3. Exploratory Data Analysis (EDA)
    • Descriptive statistics: Generate summary statistics to understand data distributions.
    • Visualize the relationships between the target and independent variables.
    • Check for correlations, distributions, and any visible patterns in the data.
  4. Splitting the Data
  5. Data Preprocessing
    • Convert categorical variables into numerical format (e.g., using one-hot encoding or factor conversion in R).
    • Scale or normalize numerical features if necessary, especially for algorithms sensitive to feature magnitude (though not essential for logistic regression).
  6. Feature Selection
    • Remove irrelevant or redundant features.
    • Remove unethical or biased features.
  7. Handle Imbalanced Data (Oversampling/Undersampling)
    • Oversampling (e.g., SMOTE): Increase the number of minority class instances by generating synthetic data points.
    • Undersampling: Reduce the number of majority class instances by randomly removing data points.
    • Apply resampling techniques only on the training set to handle class imbalance. This step ensures that the model doesn’t become biased toward the majority class but generalizes well to both classes.
  8. Model Building
    • Fit a logistic regression model using the balanced training set (after feature selection and resampling).
    • Review the model summary for coefficients, significance (p-values), and fit statistics.
  9. Model Evaluation
    • Make predictions on the test set using the trained logistic regression model.
    • Convert probabilities to binary outcomes using a threshold (e.g., 0.5).
    • Evaluate the model using single threshold metrics:
      • Confusion Matrix: Summarizes true positives, true negatives, false positives, and false negatives.
      \[ \begin{array}{|c|c|c|} \hline \textbf{ } & Actual \textbf{ Positive} & Actual \textbf{ Negative} \\ \hline Predicted \textbf{ Positive} & TP & FP \\ \hline Predicted \textbf{ Negative} & FN & TN \\ \hline \end{array} \]
      • Sensitivity (Recall): Proportion of true positives correctly identified. \[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
      • Specificity: Proportion of true negatives correctly identified. \[ \text{Specificity} = \frac{TN}{TN + FP} \]
      • Accuracy: The average of sensitivity and specificity. \[ \text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \]
    • Evaluate the model using multiple threshold metrics:
      • ROC Curve (Receiver Operating Characteristic): Plots the sensitivity (True Positive Rate) against 1 - Specificity (False Positive Rate) for different threshold values.
      • AUC (Area Under the Curve): AUC value close to 1 indicates a good model.
  10. Model Interpretation
    • Interpret coefficients: Convert logistic regression coefficients to odds ratios to understand how each feature affects the likelihood of the target variable.
    • Examine statistical significance: Check p-values to determine the significance of each predictor.
  11. Model Deployment

1. Loading the Data

First, we will load the dataset and rename the columns for better readability.

# Load the required libraries
if (!require("dplyr")) install.packages("dplyr", dependencies=TRUE)
library(dplyr)

# Load the dataset from UCI website
german <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", quote="\"", comment.char="")

# View head of the dataset
head(german)
##    V1 V2  V3  V4   V5  V6  V7 V8  V9  V10 V11  V12 V13  V14  V15 V16  V17 V18
## 1 A11  6 A34 A43 1169 A65 A75  4 A93 A101   4 A121  67 A143 A152   2 A173   1
## 2 A12 48 A32 A43 5951 A61 A73  2 A92 A101   2 A121  22 A143 A152   1 A173   1
## 3 A14 12 A34 A46 2096 A61 A74  2 A93 A101   3 A121  49 A143 A152   1 A172   2
## 4 A11 42 A32 A42 7882 A61 A74  2 A93 A103   4 A122  45 A143 A153   1 A173   2
## 5 A11 24 A33 A40 4870 A61 A73  3 A93 A101   4 A124  53 A143 A153   2 A173   2
## 6 A14 36 A32 A46 9055 A65 A73  2 A93 A101   4 A124  35 A143 A153   1 A172   2
##    V19  V20 V21
## 1 A192 A201   1
## 2 A191 A201   2
## 3 A191 A201   1
## 4 A191 A201   1
## 5 A191 A201   2
## 6 A192 A201   1

Comment - Column names need to be renamed for better understanding. - The target variable is in the last column are coded as 1 (good credit) and 2 (bad credit). The target variable is to be recoded to be binary, with “good” credit as 1 and “bad” credit as 0.

# Rename the columns for better readability
german <- german %>%
  rename("checking" = "V1",
         "duration" = "V2",
         "credit_history" = "V3",
         "purpose" = "V4",
         "credit_amount" = "V5",
         "savings" = "V6",
         "employment_since" = "V7",
         "installment_rate" = "V8",
         "marital_status" = "V9",
         "debtors" = "V10",
         "residence_since" = "V11",
         "property" = "V12",
         "age" = "V13",
         "other_installment" = "V14",
         "housing" = "V15",
         "existing_credits" = "V16",
         "job" = "V17",
         "maintenance_people" = "V18",
         "telephone" = "V19",
         "foreign" = "V20",
         "target" = "V21")

# Recode the target variable
german$target[german$target == 2] <- 0

# Explore the dataset
skimr::skim(german)
Data summary
Name german
Number of rows 1000
Number of columns 21
_______________________
Column type frequency:
character 13
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
checking 0 1 3 3 0 4 0
credit_history 0 1 3 3 0 5 0
purpose 0 1 3 4 0 10 0
savings 0 1 3 3 0 5 0
employment_since 0 1 3 3 0 5 0
marital_status 0 1 3 3 0 4 0
debtors 0 1 4 4 0 3 0
property 0 1 4 4 0 4 0
other_installment 0 1 4 4 0 3 0
housing 0 1 4 4 0 3 0
job 0 1 4 4 0 4 0
telephone 0 1 4 4 0 2 0
foreign 0 1 4 4 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
duration 0 1 20.90 12.06 4 12.0 18.0 24.00 72 ▇▇▂▁▁
credit_amount 0 1 3271.26 2822.74 250 1365.5 2319.5 3972.25 18424 ▇▂▁▁▁
installment_rate 0 1 2.97 1.12 1 2.0 3.0 4.00 4 ▂▃▁▂▇
residence_since 0 1 2.85 1.10 1 2.0 3.0 4.00 4 ▂▆▁▃▇
age 0 1 35.55 11.38 19 27.0 33.0 42.00 75 ▇▆▃▁▁
existing_credits 0 1 1.41 0.58 1 1.0 1.0 2.00 4 ▇▅▁▁▁
maintenance_people 0 1 1.16 0.36 1 1.0 1.0 1.00 2 ▇▁▁▁▂
target 0 1 0.70 0.46 0 0.0 1.0 1.00 1 ▃▁▁▁▇
str(german)
## 'data.frame':    1000 obs. of  21 variables:
##  $ checking          : chr  "A11" "A12" "A14" "A11" ...
##  $ duration          : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history    : chr  "A34" "A32" "A34" "A32" ...
##  $ purpose           : chr  "A43" "A43" "A46" "A42" ...
##  $ credit_amount     : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings           : chr  "A65" "A61" "A61" "A61" ...
##  $ employment_since  : chr  "A75" "A73" "A74" "A74" ...
##  $ installment_rate  : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ marital_status    : chr  "A93" "A92" "A93" "A93" ...
##  $ debtors           : chr  "A101" "A101" "A101" "A103" ...
##  $ residence_since   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property          : chr  "A121" "A121" "A121" "A122" ...
##  $ age               : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_installment : chr  "A143" "A143" "A143" "A143" ...
##  $ housing           : chr  "A152" "A152" "A152" "A153" ...
##  $ existing_credits  : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job               : chr  "A173" "A173" "A172" "A173" ...
##  $ maintenance_people: int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone         : chr  "A192" "A191" "A191" "A191" ...
##  $ foreign           : chr  "A201" "A201" "A201" "A201" ...
##  $ target            : num  1 0 1 1 0 1 1 1 1 0 ...

2. Building a Logistic Regression Model

We will now build a logistic regression model to predict the credit risk based on several numerical predictors. In this section, we fit the logistic regression model using the glm() function with the binomial family. This fits a logistic model where the dependent variable is binary.

# Build a logistic regression model with numeric variables
model0 <- glm(target ~ duration + credit_amount + installment_rate + residence_since + age + existing_credits + maintenance_people, 
             family = "binomial", 
             data = german)

# Summary of the model
summary(model0)
## 
## Call:
## glm(formula = target ~ duration + credit_amount + installment_rate + 
##     residence_since + age + existing_credits + maintenance_people, 
##     family = "binomial", data = german)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.570e+00  4.300e-01   3.651 0.000261 ***
## duration           -2.621e-02  7.703e-03  -3.403 0.000667 ***
## credit_amount      -7.060e-05  3.404e-05  -2.074 0.038053 *  
## installment_rate   -2.036e-01  7.252e-02  -2.807 0.004999 ** 
## residence_since    -4.091e-02  6.691e-02  -0.611 0.540923    
## age                 2.143e-02  7.083e-03   3.026 0.002482 ** 
## existing_credits    1.569e-01  1.305e-01   1.202 0.229276    
## maintenance_people -1.280e-01  2.013e-01  -0.636 0.524880    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1221.7  on 999  degrees of freedom
## Residual deviance: 1158.4  on 992  degrees of freedom
## AIC: 1174.4
## 
## Number of Fisher Scoring iterations: 4

3. Splitting the Data into Training and Testing Sets

To assess the model’s performance, we split the data into training and testing sets. We split the data into training (80%) and testing (20%) sets using the caTools package. This ensures that we can evaluate the model on unseen data.

# Load the required library
if (!require("caTools")) install.packages("caTools", dependencies=TRUE)
## Loading required package: caTools
library(caTools)

# Split the data into training and testing sets
set.seed(123)  # for reproducibility
split <- sample.split(german$target, SplitRatio = 0.8)
train_german <- subset(german, split == TRUE)
test_german <- subset(german, split == FALSE)

# Check the distribution of the target variable in both sets
table(train_german$target)
## 
##   0   1 
## 240 560
table(test_german$target)
## 
##   0   1 
##  60 140

Comment A better way to visualize the distribution of the target variable in the training and testing sets is through bar plots.

# Bar plot of the target variable in the training set and testing set (0: lightblue, 1: lightcoral), the y-axis of these plots use the same maximum value to allow for comparison, show values on top of the bars, x-axis label is credit risk
par(mfrow=c(1,2))
barplot(table(train_german$target), col = c("lightblue", "lightcoral"), ylim = c(0, 600), main = "Training Set", ylab = "Frequency", xlab = "Credit Risk")
text(x = 1:2, y = table(train_german$target) + 20, labels = table(train_german$target), pos = 2)
barplot(table(test_german$target), col = c("lightblue", "lightcoral"), ylim = c(0, 600), main = "Testing Set", ylab = "Frequency", xlab = "Credit Risk")
text(x = 1:2, y = table(test_german$target) + 20, labels = table(test_german$target), pos = 2)

# stack bar plot
barplot(cbind(Train = table(train_german$target), Test = table(test_german$target)),
        beside = FALSE, 
        col = c("lightblue", "lightcoral"), 
        legend = c(0, 1), 
        ylab = "Frequency")

Comment The bar plots show the distribution of the target variable in the training and testing sets. - The distribution of the target variable is similar in both sets, which is important for model evaluation. - The target variable is imbalanced, with a significantly higher proportion of [majority class] compared to [minority class]. This imbalance may lead to biased model predictions favoring the majority class, potentially less sensitive to the minority class. To address this, resampling techniques such as oversampling the minority class or undersampling the majority class will be considered to ensure better model performance and fairness.

4. Evaluating the Model

Now, we will evaluate the logistic regression model’s performance using the testing set. We retrain the model using the training set and make predictions on the test set. The predicted values are probabilities ranging between 0 and 1.

# Train the logistic regression model
model_train<-glm(target~
                   ## categorical variables ##
                   # checking +
                   # credit_history +
                   # purpose +
                   # savings +
                   # employment_since +
                   # marital_status +
                   # debtors +
                   # property +
                   # other_installment +
                   # housing+
                   credit_history + 
                   ## numerical variables ##
                   duration + 
                   credit_amount +
                   installment_rate + 
                   residence_since +
                   age +
                   existing_credits +
                   maintenance_people,
                 family = "binomial", 
                 data=train_german)

# Summary of the trained model
summary(model_train)
## 
## Call:
## glm(formula = target ~ credit_history + duration + credit_amount + 
##     installment_rate + residence_since + age + existing_credits + 
##     maintenance_people, family = "binomial", data = train_german)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         8.917e-01  6.659e-01   1.339 0.180553    
## credit_historyA31   1.419e-01  5.397e-01   0.263 0.792563    
## credit_historyA32   1.114e+00  4.163e-01   2.675 0.007474 ** 
## credit_historyA33   1.206e+00  4.613e-01   2.614 0.008940 ** 
## credit_historyA34   1.992e+00  4.261e-01   4.676 2.93e-06 ***
## duration           -2.927e-02  8.893e-03  -3.291 0.000997 ***
## credit_amount      -5.254e-05  4.078e-05  -1.288 0.197643    
## installment_rate   -2.110e-01  8.303e-02  -2.541 0.011047 *  
## residence_since    -6.021e-02  7.707e-02  -0.781 0.434715    
## age                 1.331e-02  8.186e-03   1.625 0.104056    
## existing_credits   -1.252e-01  1.786e-01  -0.701 0.483464    
## maintenance_people  4.710e-02  2.444e-01   0.193 0.847198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 977.38  on 799  degrees of freedom
## Residual deviance: 883.42  on 788  degrees of freedom
## AIC: 907.42
## 
## Number of Fisher Scoring iterations: 4
# Make predictions on the test set
predicted_test <- predict(model_train, newdata = test_german, type = "response")

# View the first 10 actual vs predicted values
data.frame(Actual = test_german$target, Predicted = predicted_test)[1:10, ]
##    Actual Predicted
## 2       0 0.4899381
## 6       1 0.5615767
## 7       1 0.7123913
## 11      0 0.7590096
## 13      1 0.8198102
## 18      1 0.2671756
## 24      1 0.8881059
## 28      1 0.5505340
## 29      1 0.7848754
## 33      1 0.6950008

5. Confusion Matrix and Model Performance

To evaluate the classification performance, we create a confusion matrix.

# Load the required libraries
if (!require("caret")) install.packages("caret", dependencies=TRUE)
library(caret)
# Create a confusion matrix
confusionMatrix(data = as.factor(predicted_test > 0.5), reference = as.factor(test_german$target > 0.5))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction FALSE TRUE
##      FALSE    14    9
##      TRUE     46  131
##                                           
##                Accuracy : 0.725           
##                  95% CI : (0.6576, 0.7856)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 0.2455          
##                                           
##                   Kappa : 0.2052          
##                                           
##  Mcnemar's Test P-Value : 1.208e-06       
##                                           
##             Sensitivity : 0.2333          
##             Specificity : 0.9357          
##          Pos Pred Value : 0.6087          
##          Neg Pred Value : 0.7401          
##              Prevalence : 0.3000          
##          Detection Rate : 0.0700          
##    Detection Prevalence : 0.1150          
##       Balanced Accuracy : 0.5845          
##                                           
##        'Positive' Class : FALSE           
## 

Explanation:

The confusion matrix compares the predicted classifications (based on a 0.5 threshold) to the actual classifications. It provides metrics such as accuracy, sensitivity, and specificity.

6. Model Diagnostics and Performance Metrics

We will now assess the goodness of fit using pseudo-R-squared and an ROC curve.

# Load the required libraries
if (!require("lmtest")) install.packages("lmtest", dependencies=TRUE)
if (!require("rcompanion")) install.packages("rcompanion", dependencies=TRUE)
library(lmtest)
library(rcompanion)
# Pseudo-R-squared
nagelkerke(model_train)
## $Models
##                                                                                                                                                                            
## Model: "glm, target ~ credit_history + duration + credit_amount + installment_rate + residence_since + age + existing_credits + maintenance_people, binomial, train_german"
## Null:  "glm, target ~ 1, binomial, train_german"                                                                                                                           
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                            0.0961335
## Cox and Snell (ML)                  0.1108140
## Nagelkerke (Cragg and Uhler)        0.1571210
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##      -11      -46.98 93.959 2.7825e-15
## 
## $Number.of.observations
##           
## Model: 800
## Null:  800
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"
# ROC Curve
if (!require("ROCR")) install.packages("ROCR", dependencies=TRUE)
## Loading required package: ROCR
library(ROCR)

# ROC curve
LR_pred <- predict(model_train, type = 'response', newdata = test_german)
german_pred <- prediction(LR_pred, test_german$target)
german_perf <- performance(german_pred, "tpr", "fpr")

# Calculate AUC
perf2 <- performance(german_pred, "auc")
auroc <- perf2@y.values
print(auroc)
## [[1]]
## [1] 0.6759524
# Plot the ROC curve in red
plot(german_perf, col = "red", print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.25, 0.5), print.cutoffs.col = "blue")
abline(a = 0, b = 1, lty = 2)

# Add AUC to the plot, and legend for cutoffs
legend("bottomright", legend = c(paste("AUC =", round(as.numeric(auroc), 2)), "Cutoffs"), col = c("red","black"), lty = c(1, 1), pch = c(NA, 1))

Explanation:

We calculate the Nagelkerke pseudo-R-squared, which measures how well the model fits the data. The ROC curve visualizes the trade-off between true positives and false positives, and the AUC (area under the curve) indicates the overall performance of the model.