Using F1 Score to Evaluate Logistic Regression Models

Introduction

The F1 score is a measure of a model’s accuracy that combines precision and recall into a single number. Precision is the number of true positives divided by the number of all positive predictions, and recall is the number of true positives divided by the number of actual positive instances in the dataset. The F1 score is especially useful when the classes are imbalanced, because overall accuracy can look high even when the minority class is rarely predicted correctly. It is calculated as:

\(F1 = 2 \times \left( \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \right)\)
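To make the formula concrete, here is a toy example with made-up counts (illustrative only, not from the dataset used below):

tp = 40; fp = 10; fn = 20

precision = tp / (tp + fp)   # 40/50 = 0.8
recall = tp / (tp + fn)      # 40/60 = 0.6667
F1 = 2 * (precision * recall) / (precision + recall)
print(F1)
## [1] 0.7272727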

The F1 score can be used to evaluate logistic regression models. For this blog post, I will use the “Adult” dataset from the UCI Machine Learning Repository, which is often used to predict whether income exceeds $50K/yr based on census data.

I will import the dataset, preprocess it, fit a logistic regression model, and then evaluate it using the F1 score.

The dataset features include age, work class, education, marital status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country.

Regression Model

Importing and Processing Data

library(data.table)
library(caret)
library(dplyr)
# Read the Adult dataset directly from the UCI repository; "?" marks missing values
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

adult = fread(url, header = FALSE, na.strings = "?")

colnames(adult) = c("age", "workclass", "fnlwgt", "education", "education_num",
                    "marital_status", "occupation", "relationship", "race", "sex",
                    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")


# Drop rows with any missing values
adult = na.omit(adult)


# Encode the outcome as binary: 1 if income > 50K, 0 otherwise
adult$income_binary = as.integer(adult$income == ">50K")

Here, I split the adult dataset into training and testing subsets. The createDataPartition function from the caret package is used to randomly divide the dataset, allocating 80% of the data to trainData and the remaining 20% to testData, based on the binary outcome variable adult$income_binary.

set.seed(621)

# 80/20 train/test split based on the outcome variable
trainIndex = createDataPartition(adult$income_binary, p = .8, 
                                 list = FALSE, 
                                 times = 1)
trainData = adult[trainIndex,]
testData = adult[-trainIndex,]

Fitting Logistic Regression Model

Here, I fit a logistic regression model and make predictions based on the model. I use the glm function to fit a logistic regression model using income_binary as the response variable and age, education_num, and hours_per_week as predictors, assuming a binomial distribution.

The model is trained using the trainData dataset. After the model is fitted, predictions are made on the testData set using the predict function with the type set to “response”, which returns predicted probabilities. These probabilities are then converted into binary class predictions (1 for probabilities greater than 0.5 indicating income over $50K, and 0 otherwise) using the ifelse function.

# Logistic regression on three predictors from the training set
model = glm(income_binary ~ age + education_num + hours_per_week,
            family = binomial(), data = trainData)


# Predicted probabilities on the test set, thresholded at 0.5
predictions = predict(model, testData, type = "response")
predicted_classes = ifelse(predictions > 0.5, 1, 0)

Evaluating Model using F1 Score

The confusionMatrix variable tabulates actual classes as rows against predicted classes as columns, so it contains the counts of true positives, false positives, true negatives, and false negatives. It is used in the F1 score calculation.

  • Precision is calculated for each class as the diagonal element (true positives for that class) divided by the column sums of the matrix (total predicted positives for each class).

  • Recall is calculated as the diagonal elements divided by the row sums of the matrix (actual instances of each class).

confusionMatrix = table(testData$income_binary, predicted_classes)
confusionMatrix
##    predicted_classes
##        0    1
##   0 4288  296
##   1  943  505
precision = diag(confusionMatrix) / colSums(confusionMatrix)
recall = diag(confusionMatrix) / rowSums(confusionMatrix)
F1 = 2 * (precision * recall) / (precision + recall)


print(F1)
##         0         1 
## 0.8737646 0.4490885
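As a cross-check, caret also provides a confusionMatrix function that reports per-class precision, recall, and F1 directly when mode = "prec_recall" is requested; a minimal sketch (output omitted here):

# Cross-check the hand-computed metrics with caret's built-in report
confusionMatrix(factor(predicted_classes), factor(testData$income_binary),
                positive = "1", mode = "prec_recall")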

Interpretation

true_negatives = 4288
false_positives = 296
false_negatives = 943
true_positives = 505


recall_0 = true_negatives / (true_negatives + false_positives)
recall_1 = true_positives / (true_positives + false_negatives)


precision_0 = true_negatives / (true_negatives + false_negatives)
precision_1 = true_positives / (true_positives + false_positives)


f1_0 = 2 * (precision_0 * recall_0) / (precision_0 + recall_0)
f1_1 = 2 * (precision_1 * recall_1) / (precision_1 + recall_1)

print(recall_0)
## [1] 0.9354276
print(recall_1)
## [1] 0.3487569
print(precision_0)
## [1] 0.8197285
print(precision_1)
## [1] 0.6304619
print(f1_0)
## [1] 0.8737646
print(f1_1)
## [1] 0.4490885

Recall for Class 0 (income <= 50K): 0.9354

Recall for Class 1 (income > 50K): 0.3488

Precision for Class 0: 0.8197

Precision for Class 1: 0.6305

F1 Score for Class 0: 0.8738

F1 Score for Class 1: 0.4491

The model performs well in identifying individuals with an income of <=50K (higher F1 score for class 0) but struggles with accurately identifying individuals with an income >50K, as evidenced by the significantly lower F1 score for class 1. This imbalance might suggest the need for either improving the model (perhaps by including more relevant features or using a different model) or addressing class imbalance directly through techniques like oversampling, undersampling, or advanced ensemble methods.
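As one illustration of the oversampling route, caret's upSample function can balance the training classes before refitting; the sketch below assumes the same three predictors as the original model and is not run here:

# Sketch: oversample the minority class so both classes are equally represented,
# then refit the same logistic regression on the balanced training data
balanced = upSample(x = trainData[, .(age, education_num, hours_per_week)],
                    y = factor(trainData$income_binary), yname = "income_binary")
model_balanced = glm(income_binary ~ age + education_num + hours_per_week,
                     family = binomial(), data = balanced)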

Training New Regression Model

Next, I add more features to the model and include an interaction term to see whether these improve the predictive accuracy for the higher income bracket (>50K), which had the lower F1 score in the previous model.

model_enhanced = glm(income_binary ~ age + education_num + hours_per_week + workclass + marital_status + sex + age:education_num,
             family = binomial(), data = trainData)

predictions_enhanced = predict(model_enhanced, testData, type = "response")
predicted_classes_enhanced = ifelse(predictions_enhanced > 0.5, 1, 0)

Evaluating New Model

confusionMatrix_enhanced = table(testData$income_binary, predicted_classes_enhanced)
confusionMatrix_enhanced
##    predicted_classes_enhanced
##        0    1
##   0 4231  353
##   1  701  747
precision_enhanced = diag(confusionMatrix_enhanced) / colSums(confusionMatrix_enhanced)
recall_enhanced = diag(confusionMatrix_enhanced) / rowSums(confusionMatrix_enhanced)
F1_enhanced = 2 * (precision_enhanced * recall_enhanced) / (precision_enhanced + recall_enhanced)

print(F1_enhanced)
##         0         1 
## 0.8892392 0.5863422

The enhanced model, which includes additional predictors (workclass, marital_status, and sex) and an interaction term between age and education_num, shows different dynamics.

Recall for Class 0 decreased slightly, from 0.9354 to 0.9230, while recall for Class 1 rose substantially, from 0.3488 to 0.5159. The resulting F1 scores of 0.8892 for Class 0 and 0.5863 for Class 1 are both improvements over the original model's 0.8738 and 0.4491, with the gain for Class 1 being especially large.

Comparing Original vs. Enhanced

The enhanced model balances its predictive accuracy between the two classes. Although the recall for Class 0 slightly decreased, the gains in recall and precision for Class 1 suggest a more balanced model. This change indicates that the model is now more capable of distinguishing between individuals above and below the $50K income threshold without being heavily biased toward the lower income group.
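To make the comparison concrete, the per-class F1 scores from both models can be stacked side by side:

# Per-class F1 scores for the original and enhanced models
comparison = rbind(original = F1, enhanced = F1_enhanced)
print(round(comparison, 4))
##               0      1
## original 0.8738 0.4491
## enhanced 0.8892 0.5863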

Conclusion

The original logistic regression model was effective in identifying individuals with an income of ≤ 50K but performed poorly when predicting higher incomes. By enhancing the model with additional predictors and interaction terms, I achieved more balanced performance across both income classes. The enhanced model not only maintained high predictive accuracy for lower incomes but also significantly improved the identification of individuals earning more than 50K.