The F1 score summarizes a classifier's performance by combining precision and recall into a single number, their harmonic mean. Precision is the number of true positives divided by the number of all positive predictions, and recall is the number of true positives divided by the number of actual positive instances in the dataset. The F1 score is especially useful when the classes are very imbalanced, because plain accuracy can look high even when the minority class is rarely identified. It is calculated by:
\(F1 = 2 \times \left( \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \right)\)
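To see why the harmonic mean matters, here is a minimal sketch. The helper name f1_score and the example values are my own illustration and are not part of the analysis below.

```r
# Hypothetical helper: F1 as the harmonic mean of precision and recall
f1_score = function(precision, recall) {
  2 * (precision * recall) / (precision + recall)
}

# Illustrative values only: high precision, very low recall
p = 0.9
r = 0.1
f1_score(p, r)   # 0.18, pulled toward the smaller of the two values
mean(c(p, r))    # 0.5, the arithmetic mean for contrast
```

A model cannot earn a high F1 score by doing well on only one of the two quantities, which is what makes the score informative for a rare class.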
The F1 score can be used to evaluate logistic regression models. For this blog post, I will use the “Adult” dataset from the UCI Machine Learning Repository, which is often used to predict whether income exceeds $50K/yr based on census data.
I will import the dataset, preprocess it, fit a logistic regression model, and then evaluate it using the F1 score.
The dataset features include age, work class, education, marital status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, and native-country.
library(data.table)
library(caret)
library(dplyr)
# Read the raw Adult data; the file has no header row and marks missing values with "?"
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
adult = fread(url, header = FALSE, na.strings = "?")
colnames(adult) = c("age", "workclass", "fnlwgt", "education", "education_num",
"marital_status", "occupation", "relationship", "race", "sex",
"capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
# Drop rows that contain missing values
adult = na.omit(adult)
# Encode the outcome as 1 for income above $50K and 0 otherwise
adult$income_binary = as.integer(adult$income == ">50K")
Here, I split the adult dataset into training and testing subsets. The createDataPartition function from the caret package is used to randomly divide the dataset, allocating 80% of the data to trainData and the remaining 20% to testData, based on the binary outcome variable adult$income_binary.
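To make the idea of an outcome-based split concrete, here is a base-R sketch of a stratified 80/20 split on a toy vector. It only illustrates the concept and does not reproduce what createDataPartition does internally. The toy vector y and the helper stratified_index are assumptions made for this example.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index = function(y, p = 0.8) {
  idx_by_class = split(seq_along(y), y)
  picked = lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(621)
y = c(rep(0, 750), rep(1, 250))   # toy outcome with 25% positives
train_idx = stratified_index(y, p = 0.8)

# Class proportions are preserved in both subsets
prop.table(table(y[train_idx]))
prop.table(table(y[-train_idx]))
```

Sampling within each class keeps the share of positives the same in the training and testing subsets, which matters when one class is much rarer than the other.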
set.seed(621)
trainIndex = createDataPartition(adult$income_binary, p = .8,
list = FALSE,
times = 1)
trainData = adult[trainIndex,]
testData = adult[-trainIndex,]
Here, I fit a logistic regression model and make predictions based on the model. I use the glm function to fit a logistic regression model using income_binary as the response variable and age, education_num, and hours_per_week as predictors, assuming a binomial distribution.
The model is trained using the trainData dataset. After the model is fitted, predictions are made on the testData set using the predict function with the type set to “response”, which returns predicted probabilities. These probabilities are then converted into binary class predictions (1 for probabilities greater than 0.5 indicating income over $50K, and 0 otherwise) using the ifelse function.
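The same fit, predict, and threshold sequence can be tried on simulated data before running it on the census data. This is a sketch under assumptions: the data frame toy, its columns x and y, and the coefficients used to simulate the outcome are all made up for illustration.

```r
set.seed(621)
n = 500
toy = data.frame(x = rnorm(n))
# Simulate a binary outcome whose log-odds are linear in x
toy$y = rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * toy$x))

train = toy[1:400, ]
test = toy[401:500, ]

fit = glm(y ~ x, family = binomial(), data = train)

# type = "response" returns probabilities between 0 and 1, not log-odds
probs = predict(fit, newdata = test, type = "response")
classes = ifelse(probs > 0.5, 1, 0)

table(actual = test$y, predicted = classes)
```

Without type = "response", predict returns values on the log-odds scale, and a 0.5 cutoff applied to those would not correspond to a 50% probability.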
model = glm(income_binary ~ age + education_num + hours_per_week,
family = binomial(), data = trainData)
predictions = predict(model, testData, type = "response")
predicted_classes = ifelse(predictions > 0.5, 1, 0)
The confusionMatrix variable contains the counts of true negatives, false positives, false negatives, and true positives. It is used in the F1 score calculation. Because the table is built with the actual values first and the predictions second, rows are the actual classes and columns are the predicted classes.
Recall is calculated for each class as the diagonal element divided by the row sum of the matrix (all actual instances of that class).
Precision is calculated for each class as the diagonal element divided by the column sum of the matrix (all instances predicted as that class).
confusionMatrix = table(testData$income_binary, predicted_classes)
confusionMatrix
##    predicted_classes
##        0    1
##   0 4288  296
##   1  943  505
recall = diag(confusionMatrix) / rowSums(confusionMatrix)
precision = diag(confusionMatrix) / colSums(confusionMatrix)
F1 = 2 * (precision * recall) / (precision + recall)
print(F1)
##         0         1
## 0.8737646 0.4490885
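true_negatives = 4288
false_positives = 296
false_negatives = 943
true_positives = 505
# Recall: correct predictions for a class / actual instances of that class
recall_0 = true_negatives / (true_negatives + false_positives)
recall_1 = true_positives / (true_positives + false_negatives)
# Precision: correct predictions for a class / all predictions of that class
precision_0 = true_negatives / (true_negatives + false_negatives)
precision_1 = true_positives / (true_positives + false_positives)
f1_0 = 2 * (precision_0 * recall_0) / (precision_0 + recall_0)
f1_1 = 2 * (precision_1 * recall_1) / (precision_1 + recall_1)
print(recall_0)
## [1] 0.9354276
print(recall_1)
## [1] 0.3487569
print(precision_0)
## [1] 0.8197285
print(precision_1)
## [1] 0.6304619
print(f1_0)
## [1] 0.8737646
print(f1_1)
## [1] 0.4490885
Recall for Class 0 (income <= 50K): 0.9354
Recall for Class 1 (income > 50K): 0.3488
Precision for Class 0: 0.8197
Precision for Class 1: 0.6305
F1 Score for Class 0: 0.8738
F1 Score for Class 1: 0.4491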
The model performs well in identifying individuals with an income of <=50K (higher F1 score for class 0) but struggles with accurately identifying individuals with an income >50K, as evidenced by the significantly lower F1 score for class 1. This imbalance might suggest the need for either improving the model (perhaps by including more relevant features or using a different model) or addressing class imbalance directly through techniques like oversampling, undersampling, or advanced ensemble methods.
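As one example of addressing imbalance directly, random oversampling of the minority class can be sketched in base R. The toy data frame and the helper oversample_minority are hypothetical. In practice, resampling of this kind is usually applied to the training set only, so the test set keeps the original class balance.

```r
# Hypothetical helper: resample minority-class rows with replacement
# until both classes have the same number of rows
oversample_minority = function(df, outcome) {
  counts = table(df[[outcome]])
  minority = names(counts)[which.min(counts)]
  n_extra = max(counts) - min(counts)
  minority_rows = which(as.character(df[[outcome]]) == minority)
  extra = minority_rows[sample.int(length(minority_rows), n_extra, replace = TRUE)]
  df[c(seq_len(nrow(df)), extra), , drop = FALSE]
}

set.seed(621)
toy = data.frame(x = rnorm(100), y = c(rep(0, 80), rep(1, 20)))
balanced = oversample_minority(toy, "y")
table(balanced$y)   # both classes now have 80 rows
```

Oversampling duplicates existing minority rows rather than adding new information, so it is worth checking whether it actually improves the F1 score for the minority class on untouched test data.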
I will add more features to the model and include an interaction term to see if these improve the predictive accuracy for the higher income bracket (>50K), which had a lower F1 score in the previous model.
model_enhanced = glm(income_binary ~ age + education_num + hours_per_week + workclass + marital_status + sex + age:education_num,
family = binomial(), data = trainData)
predictions_enhanced = predict(model_enhanced, testData, type = "response")
predicted_classes_enhanced = ifelse(predictions_enhanced > 0.5, 1, 0)
confusionMatrix_enhanced = table(testData$income_binary, predicted_classes_enhanced)
confusionMatrix_enhanced
## predicted_classes_enhanced
## 0 1
## 0 4231 353
## 1 701 747
# Rows of the table are actual classes and columns are predicted classes
recall_enhanced = diag(confusionMatrix_enhanced) / rowSums(confusionMatrix_enhanced)
precision_enhanced = diag(confusionMatrix_enhanced) / colSums(confusionMatrix_enhanced)
F1_enhanced = 2 * (precision_enhanced * recall_enhanced) / (precision_enhanced + recall_enhanced)
print(F1_enhanced)
## 0 1
## 0.8892392 0.5863422
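To compare the two models on precision and recall as well as F1, a small helper can compute all three from a confusion matrix laid out with actual classes in rows and predicted classes in columns. This is a sketch: the helper name class_metrics is hypothetical, and the counts are typed in by hand from the two confusion matrices printed above.

```r
# Hypothetical helper: per-class metrics from a table(actual, predicted)
class_metrics = function(cm) {
  recall = diag(cm) / rowSums(cm)      # rows are actual classes
  precision = diag(cm) / colSums(cm)   # columns are predicted classes
  f1 = 2 * precision * recall / (precision + recall)
  round(cbind(precision, recall, f1), 4)
}

# Counts copied from the two confusion matrices printed above
# (matrix() fills column by column)
cm_base = matrix(c(4288, 943, 296, 505), nrow = 2,
                 dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))
cm_enhanced = matrix(c(4231, 701, 353, 747), nrow = 2,
                     dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

class_metrics(cm_base)
class_metrics(cm_enhanced)
```

Because F1 is symmetric in precision and recall, swapping the two margins leaves the F1 values unchanged, but it does change which number is reported as precision and which as recall.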
The enhanced model, which included additional predictors like workclass, marital_status, sex, and an interaction term between age and education_num, showed different dynamics.
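The values printed by print(F1) and print(F1_enhanced) are F1 scores: for Class 0 the F1 score rose slightly from 0.8738 to 0.8892, and for Class 1 it improved markedly from 0.4491 to 0.5863. Reading the confusion matrices with actual classes in rows, recall for Class 1 rose from 0.3488 (505/1448) to 0.5159 (747/1448) and precision for Class 1 rose from 0.6305 (505/801) to 0.6791 (747/1100), while recall for Class 0 dipped from 0.9354 (4288/4584) to 0.9230 (4231/4584).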
The enhanced model balances its predictive accuracy between the two classes. Although the recall for Class 0 slightly decreased, the gains in recall and precision for Class 1 suggest a more balanced model. This change indicates that the model is now more capable of distinguishing between individuals above and below the $50K income threshold without being heavily biased toward the lower income group.
The original logistic regression model was effective in identifying individuals with an income of ≤ $50K but performed poorly when predicting higher incomes. By enhancing the model with additional predictors and an interaction term, I achieved more balanced performance across both income classes. The enhanced model not only maintained strong performance for lower incomes but also substantially improved the identification of individuals earning more than $50K.