In this tutorial, you will learn how to perform logistic regression in R. Logistic regression is used when the dependent variable is binary (i.e., has two possible outcomes). We will use the German Credit Data from the UCI Machine Learning Repository to predict credit risk (good vs bad credit).
First, we will load the dataset and rename the columns for better readability.
# Load the required libraries
if (!require("dplyr")) install.packages("dplyr", dependencies=TRUE)
library(dplyr)
# Load the dataset from UCI website
german <- read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data", quote="\"", comment.char="")
# View head of the dataset
head(german)
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18
## 1 A11 6 A34 A43 1169 A65 A75 4 A93 A101 4 A121 67 A143 A152 2 A173 1
## 2 A12 48 A32 A43 5951 A61 A73 2 A92 A101 2 A121 22 A143 A152 1 A173 1
## 3 A14 12 A34 A46 2096 A61 A74 2 A93 A101 3 A121 49 A143 A152 1 A172 2
## 4 A11 42 A32 A42 7882 A61 A74 2 A93 A103 4 A122 45 A143 A153 1 A173 2
## 5 A11 24 A33 A40 4870 A61 A73 3 A93 A101 4 A124 53 A143 A153 2 A173 2
## 6 A14 36 A32 A46 9055 A65 A73 2 A93 A101 4 A124 35 A143 A153 1 A172 2
## V19 V20 V21
## 1 A192 A201 1
## 2 A191 A201 2
## 3 A191 A201 1
## 4 A191 A201 1
## 5 A191 A201 2
## 6 A192 A201 1
Comment
- Column names need to be renamed for better readability.
- The target variable is in the last column and is coded as 1 (good credit) and 2 (bad credit). It will be recoded to a binary 0/1 variable, with "good" credit as 1 and "bad" credit as 0.
# Rename the columns for better readability
german <- german %>%
rename("checking" = "V1",
"duration" = "V2",
"credit_history" = "V3",
"purpose" = "V4",
"credit_amount" = "V5",
"savings" = "V6",
"employment_since" = "V7",
"installment_rate" = "V8",
"marital_status" = "V9",
"debtors" = "V10",
"residence_since" = "V11",
"property" = "V12",
"age" = "V13",
"other_installment" = "V14",
"housing" = "V15",
"existing_credits" = "V16",
"job" = "V17",
"maintenance_people" = "V18",
"telephone" = "V19",
"foreign" = "V20",
"target" = "V21")
# Recode the target variable: 2 (bad credit) becomes 0; 1 (good credit) stays 1
german$target[german$target == 2] <- 0
# Explore the dataset
if (!require("skimr")) install.packages("skimr", dependencies=TRUE)
skimr::skim(german)
Data summary | |
---|---|
Name | german |
Number of rows | 1000 |
Number of columns | 21 |
Column type frequency: character | 13 |
Column type frequency: numeric | 8 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
checking | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
credit_history | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
purpose | 0 | 1 | 3 | 4 | 0 | 10 | 0 |
savings | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
employment_since | 0 | 1 | 3 | 3 | 0 | 5 | 0 |
marital_status | 0 | 1 | 3 | 3 | 0 | 4 | 0 |
debtors | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
property | 0 | 1 | 4 | 4 | 0 | 4 | 0 |
other_installment | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
housing | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
job | 0 | 1 | 4 | 4 | 0 | 4 | 0 |
telephone | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
foreign | 0 | 1 | 4 | 4 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
duration | 0 | 1 | 20.90 | 12.06 | 4 | 12.0 | 18.0 | 24.00 | 72 | ▇▇▂▁▁ |
credit_amount | 0 | 1 | 3271.26 | 2822.74 | 250 | 1365.5 | 2319.5 | 3972.25 | 18424 | ▇▂▁▁▁ |
installment_rate | 0 | 1 | 2.97 | 1.12 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▂▃▁▂▇ |
residence_since | 0 | 1 | 2.85 | 1.10 | 1 | 2.0 | 3.0 | 4.00 | 4 | ▂▆▁▃▇ |
age | 0 | 1 | 35.55 | 11.38 | 19 | 27.0 | 33.0 | 42.00 | 75 | ▇▆▃▁▁ |
existing_credits | 0 | 1 | 1.41 | 0.58 | 1 | 1.0 | 1.0 | 2.00 | 4 | ▇▅▁▁▁ |
maintenance_people | 0 | 1 | 1.16 | 0.36 | 1 | 1.0 | 1.0 | 1.00 | 2 | ▇▁▁▁▂ |
target | 0 | 1 | 0.70 | 0.46 | 0 | 0.0 | 1.0 | 1.00 | 1 | ▃▁▁▁▇ |
str(german)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking : chr "A11" "A12" "A14" "A11" ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : chr "A34" "A32" "A34" "A32" ...
## $ purpose : chr "A43" "A43" "A46" "A42" ...
## $ credit_amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings : chr "A65" "A61" "A61" "A61" ...
## $ employment_since : chr "A75" "A73" "A74" "A74" ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ marital_status : chr "A93" "A92" "A93" "A93" ...
## $ debtors : chr "A101" "A101" "A101" "A103" ...
## $ residence_since : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : chr "A121" "A121" "A121" "A122" ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_installment : chr "A143" "A143" "A143" "A143" ...
## $ housing : chr "A152" "A152" "A152" "A153" ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : chr "A173" "A173" "A172" "A173" ...
## $ maintenance_people: int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : chr "A192" "A191" "A191" "A191" ...
## $ foreign : chr "A201" "A201" "A201" "A201" ...
## $ target : num 1 0 1 1 0 1 1 1 1 0 ...
We will now build a logistic regression model to predict credit risk from several numerical predictors. We fit the model using the glm() function with the binomial family, which fits a logistic model for a binary dependent variable.
# Build a logistic regression model with numeric variables
model0 <- glm(target ~ duration + credit_amount + installment_rate + residence_since + age + existing_credits + maintenance_people,
family = "binomial",
data = german)
# Summary of the model
summary(model0)
##
## Call:
## glm(formula = target ~ duration + credit_amount + installment_rate +
## residence_since + age + existing_credits + maintenance_people,
## family = "binomial", data = german)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.570e+00 4.300e-01 3.651 0.000261 ***
## duration -2.621e-02 7.703e-03 -3.403 0.000667 ***
## credit_amount -7.060e-05 3.404e-05 -2.074 0.038053 *
## installment_rate -2.036e-01 7.252e-02 -2.807 0.004999 **
## residence_since -4.091e-02 6.691e-02 -0.611 0.540923
## age 2.143e-02 7.083e-03 3.026 0.002482 **
## existing_credits 1.569e-01 1.305e-01 1.202 0.229276
## maintenance_people -1.280e-01 2.013e-01 -0.636 0.524880
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1221.7 on 999 degrees of freedom
## Residual deviance: 1158.4 on 992 degrees of freedom
## AIC: 1174.4
##
## Number of Fisher Scoring iterations: 4
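The coefficient estimates above are on the log-odds scale. As an optional interpretation aid (not part of the original output), they can be exponentiated to odds ratios; the snippet below is a minimal sketch using only base R and the fitted model0.
# Exponentiate the coefficients to obtain odds ratios. For example,
# exp(coef(model0))["age"] gives the multiplicative change in the odds of
# good credit for each additional year of age, holding the other predictors fixed.
exp(coef(model0))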
To assess the model’s performance, we split the data into training (80%) and testing (20%) sets using the caTools package. This ensures that we can evaluate the model on unseen data.
# Load the required library
if (!require("caTools")) install.packages("caTools", dependencies=TRUE)
## Loading required package: caTools
library(caTools)
# Split the data into training and testing sets
set.seed(123) # for reproducibility
split <- sample.split(german$target, SplitRatio = 0.8)
train_german <- subset(german, split == TRUE)
test_german <- subset(german, split == FALSE)
# Check the distribution of the target variable in both sets
table(train_german$target)
##
## 0 1
## 240 560
table(test_german$target)
##
## 0 1
## 60 140
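Because sample.split() stratifies on the target variable, the class proportions should be essentially identical in the two sets. A quick optional check (not in the original output) is to compare the proportions directly:
# Proportion of bad (0) vs. good (1) credit in each set; both should be about 30/70
prop.table(table(train_german$target))
prop.table(table(test_german$target))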
Comment A better way to visualize the distribution of the target variable in the training and testing sets is through bar plots.
# Bar plots of the target variable in the training and testing sets (0: lightblue, 1: lightcoral); both plots share the same y-axis limits for comparison, counts are shown above the bars, and the x-axis is labelled "Credit Risk"
par(mfrow = c(1, 2))
# barplot() returns the bar midpoints, which we reuse to centre the count labels
bp_train <- barplot(table(train_german$target), col = c("lightblue", "lightcoral"), ylim = c(0, 600), main = "Training Set", ylab = "Frequency", xlab = "Credit Risk")
text(x = bp_train, y = table(train_german$target) + 20, labels = table(train_german$target))
bp_test <- barplot(table(test_german$target), col = c("lightblue", "lightcoral"), ylim = c(0, 600), main = "Testing Set", ylab = "Frequency", xlab = "Credit Risk")
text(x = bp_test, y = table(test_german$target) + 20, labels = table(test_german$target))
par(mfrow = c(1, 1))  # reset the plotting layout for the stacked bar plot below
# Stacked bar plot of the target variable across the training and testing sets
barplot(cbind(Train = table(train_german$target), Test = table(test_german$target)),
beside = FALSE,
col = c("lightblue", "lightcoral"),
legend = c(0, 1),
ylab = "Frequency")
Comment
- The bar plots show that the distribution of the target variable is similar in the training and testing sets, which is important for a fair model evaluation.
- The target variable is imbalanced, with a significantly higher proportion of good credit (target = 1, about 70%) than bad credit (target = 0, about 30%). This imbalance may bias the model’s predictions toward the majority class and make it less sensitive to the minority class. To address it, resampling techniques such as oversampling the minority class or undersampling the majority class can be considered (a minimal sketch follows below).
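As a minimal sketch of the oversampling idea mentioned above (using only base R; the rest of the tutorial keeps the original, imbalanced training set), the minority class can be resampled with replacement until it matches the majority class:
# Oversample the minority class (bad credit, target = 0) with replacement
set.seed(123)
minority <- train_german[train_german$target == 0, ]
majority <- train_german[train_german$target == 1, ]
train_upsampled <- rbind(majority,
                         minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])
# Both classes now have the same number of rows (560 each)
table(train_upsampled$target)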
Now, we will evaluate the logistic regression model’s performance using the testing set. We retrain the model using the training set and make predictions on the test set. The predicted values are probabilities ranging between 0 and 1.
# Train the logistic regression model on the training set
model_train <- glm(target ~
                     ## categorical variables ##
                     # checking +
                     # credit_history +
                     # purpose +
                     # savings +
                     # employment_since +
                     # marital_status +
                     # debtors +
                     # property +
                     # other_installment +
                     # housing +
                     credit_history +
                     ## numerical variables ##
                     duration +
                     credit_amount +
                     installment_rate +
                     residence_since +
                     age +
                     existing_credits +
                     maintenance_people,
                   family = "binomial",
                   data = train_german)
# Summary of the trained model
summary(model_train)
##
## Call:
## glm(formula = target ~ credit_history + duration + credit_amount +
## installment_rate + residence_since + age + existing_credits +
## maintenance_people, family = "binomial", data = train_german)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.917e-01 6.659e-01 1.339 0.180553
## credit_historyA31 1.419e-01 5.397e-01 0.263 0.792563
## credit_historyA32 1.114e+00 4.163e-01 2.675 0.007474 **
## credit_historyA33 1.206e+00 4.613e-01 2.614 0.008940 **
## credit_historyA34 1.992e+00 4.261e-01 4.676 2.93e-06 ***
## duration -2.927e-02 8.893e-03 -3.291 0.000997 ***
## credit_amount -5.254e-05 4.078e-05 -1.288 0.197643
## installment_rate -2.110e-01 8.303e-02 -2.541 0.011047 *
## residence_since -6.021e-02 7.707e-02 -0.781 0.434715
## age 1.331e-02 8.186e-03 1.625 0.104056
## existing_credits -1.252e-01 1.786e-01 -0.701 0.483464
## maintenance_people 4.710e-02 2.444e-01 0.193 0.847198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 977.38 on 799 degrees of freedom
## Residual deviance: 883.42 on 788 degrees of freedom
## AIC: 907.42
##
## Number of Fisher Scoring iterations: 4
# Make predictions on the test set
predicted_test <- predict(model_train, newdata = test_german, type = "response")
# View the first 10 actual vs predicted values
data.frame(Actual = test_german$target, Predicted = predicted_test)[1:10, ]
## Actual Predicted
## 2 0 0.4899381
## 6 1 0.5615767
## 7 1 0.7123913
## 11 0 0.7590096
## 13 1 0.8198102
## 18 1 0.2671756
## 24 1 0.8881059
## 28 1 0.5505340
## 29 1 0.7848754
## 33 1 0.6950008
To evaluate the classification performance, we create a confusion matrix.
# Load the required libraries
if (!require("caret")) install.packages("caret", dependencies=TRUE)
library(caret)
# Create a confusion matrix
confusionMatrix(data = as.factor(predicted_test > 0.5), reference = as.factor(test_german$target > 0.5))
## Confusion Matrix and Statistics
##
## Reference
## Prediction FALSE TRUE
## FALSE 14 9
## TRUE 46 131
##
## Accuracy : 0.725
## 95% CI : (0.6576, 0.7856)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 0.2455
##
## Kappa : 0.2052
##
## Mcnemar's Test P-Value : 1.208e-06
##
## Sensitivity : 0.2333
## Specificity : 0.9357
## Pos Pred Value : 0.6087
## Neg Pred Value : 0.7401
## Prevalence : 0.3000
## Detection Rate : 0.0700
## Detection Prevalence : 0.1150
## Balanced Accuracy : 0.5845
##
## 'Positive' Class : FALSE
##
The confusion matrix compares the predicted classes (based on a 0.5 probability threshold) with the actual classes and reports metrics such as accuracy, sensitivity, and specificity. Note that caret treats the first factor level, FALSE (bad credit), as the positive class here, so the sensitivity of 0.23 measures how well the model detects bad credit.
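If you prefer to report sensitivity and specificity with good credit (1) as the positive class, one option (a sketch, not part of the original analysis) is to pass explicit factors and use the positive argument of confusionMatrix():
# Convert probabilities to class labels and set 1 (good credit) as the positive class
pred_class   <- factor(ifelse(predicted_test > 0.5, 1, 0), levels = c(0, 1))
actual_class <- factor(test_german$target, levels = c(0, 1))
confusionMatrix(data = pred_class, reference = actual_class, positive = "1")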
We will now assess the goodness of fit using pseudo-R-squared and an ROC curve.
# Load the required libraries
if (!require("lmtest")) install.packages("lmtest", dependencies=TRUE)
if (!require("rcompanion")) install.packages("rcompanion", dependencies=TRUE)
library(lmtest)
library(rcompanion)
# Pseudo-R-squared
nagelkerke(model_train)
## $Models
##
## Model: "glm, target ~ credit_history + duration + credit_amount + installment_rate + residence_since + age + existing_credits + maintenance_people, binomial, train_german"
## Null: "glm, target ~ 1, binomial, train_german"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.0961335
## Cox and Snell (ML) 0.1108140
## Nagelkerke (Cragg and Uhler) 0.1571210
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -11 -46.98 93.959 2.7825e-15
##
## $Number.of.observations
##
## Model: 800
## Null: 800
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
# ROC Curve
if (!require("ROCR")) install.packages("ROCR", dependencies=TRUE)
## Loading required package: ROCR
library(ROCR)
# ROC curve
LR_pred <- predict(model_train, type = 'response', newdata = test_german)
german_pred <- prediction(LR_pred, test_german$target)
german_perf <- performance(german_pred, "tpr", "fpr")
# Calculate AUC
perf2 <- performance(german_pred, "auc")
auroc <- perf2@y.values
print(auroc)
## [[1]]
## [1] 0.6759524
# Plot the ROC curve in red
plot(german_perf, col = "red", print.cutoffs.at = seq(0, 1, 0.1), text.adj = c(-0.25, 0.5), print.cutoffs.col = "blue")
abline(a = 0, b = 1, lty = 2)
# Add AUC to the plot, and legend for cutoffs
legend("bottomright", legend = c(paste("AUC =", round(as.numeric(auroc), 2)), "Cutoffs"), col = c("red","black"), lty = c(1, 1), pch = c(NA, 1))
Explanation:
We calculate the McFadden, Cox and Snell, and Nagelkerke pseudo-R-squared values, which measure how much better the model fits than the intercept-only (null) model. The ROC curve visualizes the trade-off between the true positive rate and the false positive rate across cutoffs, and the AUC (area under the curve) summarizes the model’s overall ability to discriminate between good and bad credit.
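The 0.5 cutoff used earlier is not the only possible choice. As an optional extension (a sketch reusing the ROCR objects created above), the cutoff that maximizes Youden's J statistic (TPR minus FPR) can be read off the same performance object:
# Extract the false positive rate, true positive rate, and cutoffs from the ROCR object
fpr     <- german_perf@x.values[[1]]
tpr     <- german_perf@y.values[[1]]
cutoffs <- german_perf@alpha.values[[1]]
# Youden's J = TPR - FPR; the cutoff with the largest J balances sensitivity and specificity
best <- which.max(tpr - fpr)
cutoffs[best]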