# I chose logistic regression to model the probability of shipment delays in my warehouse shipping data.
# A linear regression model is a poor fit for this task because its predictions are not constrained to the [0, 1] range.
# Logistic regression keeps every prediction inside that range, so the output can be read as a probability.
# Step 1: Generate the dataset
# I created a simulated dataset of shipment volumes and a binary delay indicator (Delayed = 1 means the shipment was delayed).
# Note that Delayed is sampled independently of ShipmentVolume here, so the simulated data contain no true volume effect.
set.seed(123)
warehouse_data <- data.frame(
  ShipmentVolume = runif(500, 100, 1000), # Shipment volumes between 100 and 1000 units
  Delayed = sample(c(0, 1), 500, replace = TRUE, prob = c(0.6, 0.4)) # 40% delayed shipments
)
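# A minimal sketch of an alternative simulation, in case a real volume effect is wanted:
# above, Delayed is sampled independently of ShipmentVolume, so no model can recover a relationship.
# The coefficients (-4 and 0.006) and the names sim_volume, sim_prob and dependent_data are
# illustrative assumptions, not estimates from real warehouse data.
set.seed(456)
sim_volume <- runif(500, 100, 1000)
sim_prob <- plogis(-4 + 0.006 * sim_volume) # assumed true delay probability, rising with volume
dependent_data <- data.frame(
  ShipmentVolume = sim_volume,
  Delayed = rbinom(500, size = 1, prob = sim_prob) # one Bernoulli draw per shipment
)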
# Step 2: Fit a Linear Regression Model
# I began by fitting a linear regression model to predict delays based on shipment volume.
# However, I quickly realized the limitations of this approach.
linear_model <- lm(Delayed ~ ShipmentVolume, data = warehouse_data)
warehouse_data$PredictedProb_LM <- predict(linear_model)
# Linear regression models the probability of delay as a straight line:
# p(X) = β0 + β1 * ShipmentVolume.
# Nothing constrains this line to [0, 1], so it can return values below 0 or above 1,
# which cannot be interpreted as probabilities.
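# A small illustration of the range problem, using a hypothetical toy dataset
# (toy, not the warehouse data): when the outcome really depends on x,
# the fitted straight line can leave [0, 1] even inside the observed range.
toy <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)
toy_lm <- lm(y ~ x, data = toy)
toy_fitted <- predict(toy_lm)
range(toy_fitted) # the minimum is below 0 and the maximum is above 1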
# Step 3: Fit a Logistic Regression Model
# To address the limitations of linear regression, I used logistic regression, which models the probability using the logistic function:
# p(X) = e^(β0 + β1 * ShipmentVolume) / (1 + e^(β0 + β1 * ShipmentVolume)).
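# The formula above written as a function, as a quick sanity check.
# logistic is a hypothetical helper name; base R's plogis() computes the same quantity.
logistic <- function(eta) exp(eta) / (1 + exp(eta))
eta_grid <- c(-5, 0, 5) # example values of β0 + β1 * ShipmentVolume
logistic(eta_grid) # always strictly between 0 and 1; logistic(0) is 0.5
all.equal(logistic(eta_grid), plogis(eta_grid))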
logistic_model <- glm(Delayed ~ ShipmentVolume, data = warehouse_data, family = binomial)
warehouse_data$PredictedProb_Logistic <- predict(logistic_model, type = "response")
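# A hedged sketch of how fitted coefficients map to probabilities, on a hypothetical
# toy dataset (demo) so that it runs on its own. The same calls should apply to logistic_model above.
demo <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)
demo_glm <- glm(y ~ x, data = demo, family = binomial)
b <- coef(demo_glm) # b[1] is the intercept and b[2] the slope, both on the log-odds scale
manual_prob <- plogis(b[1] + b[2] * demo$x) # inverse logit of the linear predictor
all.equal(unname(manual_prob), unname(predict(demo_glm, type = "response")))
exp(b[2]) # odds ratio: multiplicative change in the odds of y = 1 per unit increase in x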
# Step 4: Visualize Predictions
# I created visualizations to compare the predictions from linear regression and logistic regression.
# This helped me understand the strengths and weaknesses of each approach.
library(ggplot2)
library(patchwork)
# Linear Regression Plot
# I plotted the fitted values from the linear regression model against shipment volume.
# Because Delayed was simulated independently of ShipmentVolume, the fitted line is nearly flat around 0.4 and stays inside [0, 1] here.
# With a real volume effect, nothing would stop the line from crossing the dotted bounds at 0 and 1.
plot_linear <- ggplot(warehouse_data, aes(x = ShipmentVolume, y = PredictedProb_LM)) +
  geom_point(aes(color = as.factor(Delayed)), alpha = 0.6) +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", color = "black") +
  geom_line(color = "purple", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_manual(values = c("green", "red"), labels = c("On Time", "Delayed")) +
  labs(
    title = "Linear Regression Predictions",
    x = "Shipment Volume",
    y = "Predicted Probability of Delay",
    color = "Delay Status"
  ) +
  theme_minimal()
# Logistic Regression Plot
# The logistic regression model constrains every predicted probability to lie strictly between 0 and 1.
# With these data the curve is also nearly flat around 0.4. The familiar S shape appears only when the outcome depends on the predictor.
plot_logistic <- ggplot(warehouse_data, aes(x = ShipmentVolume, y = PredictedProb_Logistic)) +
  geom_point(aes(color = as.factor(Delayed)), alpha = 0.6) +
  geom_hline(yintercept = c(0, 1), linetype = "dotted", color = "black") +
  geom_line(color = "blue", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_color_manual(values = c("green", "red"), labels = c("On Time", "Delayed")) +
  labs(
    title = "Logistic Regression Predictions",
    x = "Shipment Volume",
    y = "Predicted Probability of Delay",
    color = "Delay Status"
  ) +
  theme_minimal()
# Combine Plots
# I compared the two models side by side to illustrate the differences in their predictions.
plot_linear + plot_logistic
