Rows: 2800 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): shipping_method, weather_condition, order_priority
dbl (7): order_id, supplier_reliability_score, warehouse_inventory_level, o...
dttm (1): order_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Logistic Regression - Supply Chain Order Delay Risk Analysis
1 Objective
Develop a Logistic Regression classifier to quantify the impact of supply chain variables on order latency and predict the binary outcome of “Delayed” vs. “On-Time.”
2 Exploratory Data Analysis
str(supply_data)spc_tbl_ [2,800 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ order_id : num [1:2800] 1 2 3 4 5 6 7 8 9 10 ...
$ order_date : POSIXct[1:2800], format: "2023-01-01 00:00:00" "2023-01-01 01:00:00" ...
$ supplier_reliability_score: num [1:2800] 0.63 0.91 0.78 0.89 0.99 0.82 0.8 0.63 0.71 0.8 ...
$ warehouse_inventory_level : num [1:2800] 2869 2406 4665 1545 511 ...
$ order_quantity : num [1:2800] 253 370 198 363 454 44 155 66 471 276 ...
$ shipping_distance_km : num [1:2800] 1823 876 134 818 984 ...
$ shipping_method : chr [1:2800] "Sea" "Road" "Rail" "Sea" ...
$ weather_condition : chr [1:2800] "Fog" "Fog" "Fog" "Storm" ...
$ processing_time_hours : num [1:2800] 15.6 53.2 62.2 8.2 46.5 19.1 45.5 47.9 3.3 5.6 ...
$ order_priority : chr [1:2800] "Low" "Low" "Low" "High" ...
$ delayed : num [1:2800] 1 0 0 0 1 1 1 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. order_id = col_double(),
.. order_date = col_datetime(format = ""),
.. supplier_reliability_score = col_double(),
.. warehouse_inventory_level = col_double(),
.. order_quantity = col_double(),
.. shipping_distance_km = col_double(),
.. shipping_method = col_character(),
.. weather_condition = col_character(),
.. processing_time_hours = col_double(),
.. order_priority = col_character(),
.. delayed = col_double()
.. )
- attr(*, "problems")=<externalptr>
After taking a quick peek at our dataset now we can transform character variables into factor variables for us to quantify interactions or possible affects on our desired delayed outcome.
supply_data$shipping_method = as.factor(supply_data$shipping_method)
supply_data$weather_condition = as.factor(supply_data$weather_condition)
supply_data$order_priority = as.factor(supply_data$order_priority)Now lets continue our exploratory process with our dataset using few useful plots for us to discover potential patterns of the data.
library(skimr)
skim(supply_data)| Name | supply_data |
| Number of rows | 2800 |
| Number of columns | 11 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 7 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| shipping_method | 0 | 1 | FALSE | 4 | Roa: 723, Rai: 704, Sea: 687, Air: 686 |
| weather_condition | 0 | 1 | FALSE | 4 | Fog: 746, Cle: 696, Rai: 693, Sto: 665 |
| order_priority | 0 | 1 | FALSE | 3 | Low: 942, Hig: 941, Med: 917 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 0 | 1 | 1400.50 | 808.43 | 1.0 | 700.75 | 1400.50 | 2100.25 | 2800.0 | ▇▇▇▇▇ |
| supplier_reliability_score | 0 | 1 | 0.80 | 0.12 | 0.6 | 0.70 | 0.79 | 0.90 | 1.0 | ▇▇▇▆▇ |
| warehouse_inventory_level | 0 | 1 | 2493.01 | 1408.39 | 52.0 | 1309.00 | 2459.00 | 3688.00 | 4996.0 | ▇▇▇▇▇ |
| order_quantity | 0 | 1 | 251.87 | 145.23 | 1.0 | 128.00 | 253.00 | 377.00 | 499.0 | ▇▇▇▇▇ |
| shipping_distance_km | 0 | 1 | 1014.31 | 570.65 | 5.2 | 532.95 | 1016.10 | 1508.82 | 1998.6 | ▇▇▇▇▇ |
| processing_time_hours | 0 | 1 | 35.84 | 20.62 | 1.1 | 17.98 | 35.40 | 53.40 | 72.0 | ▇▇▇▇▇ |
| delayed | 0 | 1 | 0.29 | 0.45 | 0.0 | 0.00 | 0.00 | 1.00 | 1.0 | ▇▁▁▁▃ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order_date | 0 | 1 | 2023-01-01 | 2023-04-27 15:00:00 | 2023-02-28 07:30:00 | 2800 |
table(supply_data$delayed)
0 1
1984 816
The Ratio: We have about 29% delayed cases and 71% not delayed.
library(ggplot2)
library(tidyverse)
supply_data |>
select(where(is.numeric)) |>
pivot_longer(everything()) |>
ggplot(aes(x = value)) +
geom_density(fill = "steelblue", alpha = 0.5) +
facet_wrap(~name, scales = "free") +
theme_minimal()library(ggcorrplot)
corr_matrix <- cor(supply_data |> select(where(is.numeric)))
ggcorrplot(corr_matrix,
hc.order = TRUE,
type = "lower",
lab = TRUE) There is no distinctive pattern showing our correlation plot. All the correlations between variables seem to be very weak.
3 Model Training
We are now fitting a logistic regression model to the dataset, testing all variables against the delayed outcome. Since the correlation plot provided no clear insights, we will begin by including all variables in the model excluding
order_id. In our datasetorder_idis identifier rather then predictor.
full_model <- glm(delayed ~ . - order_id,
data = supply_data,
family = binomial(link = "logit"))
summary(full_model)
Call:
glm(formula = delayed ~ . - order_id, family = binomial(link = "logit"),
data = supply_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.963e+01 2.413e+01 -1.228 0.21960
order_date 1.734e-08 1.439e-08 1.206 0.22798
supplier_reliability_score -4.657e-01 3.624e-01 -1.285 0.19871
warehouse_inventory_level 7.047e-06 2.971e-05 0.237 0.81248
order_quantity -3.909e-04 2.881e-04 -1.357 0.17487
shipping_distance_km -5.935e-05 7.337e-05 -0.809 0.41857
shipping_methodRail -8.850e-02 1.170e-01 -0.757 0.44927
shipping_methodRoad -2.710e-01 1.186e-01 -2.285 0.02233 *
shipping_methodSea -1.036e-01 1.181e-01 -0.877 0.38025
weather_conditionFog 2.813e-01 1.187e-01 2.369 0.01783 *
weather_conditionRain 2.190e-01 1.213e-01 1.805 0.07104 .
weather_conditionStorm 3.498e-01 1.214e-01 2.881 0.00397 **
processing_time_hours 2.811e-03 2.033e-03 1.383 0.16676
order_priorityLow -6.530e-02 1.019e-01 -0.641 0.52151
order_priorityMedium -7.276e-02 1.026e-01 -0.709 0.47835
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3379.2 on 2799 degrees of freedom
Residual deviance: 3356.5 on 2785 degrees of freedom
AIC: 3386.5
Number of Fisher Scoring iterations: 4
4 Interpretation of our Full Model
Our model shows only a marginal improvement in predictive power, as evidenced by the small reduction in deviance from the Null Deviance (3379.2) to the Residual Deviance (3356.5). While adding predictors does technically improve the fit, the 22.7-point drop suggests that the variables are currently capturing very little of the variation in delays. Essentially, the full model is only slightly better at forecasting outcomes than a simple baseline guess based on average delay rates.
4.1 Statistically Significant Predictors (p < 0.05)
shipping_method: Road (p = 0.022) with Negative coefficient (-0.271). This means that orders shipped by road are less likely to be delayed compared to the reference shipping method.
weather_condition: Fog (p = 0.018) with Positive coefficient (+0.281). This tells us that fog increases the likelihood of delay.
weather_condition: Storm (p = 0.004) with Positive coefficient (+0.350) shows us that storms significantly increase the probability of delay.
4.2 Marginally Significant (p < 0.10)
weather_condition: Rain (p = 0.071) Rain may increase delay risk, but evidence is weaker. We are still going to consider this variable for our next reduced model.
reduced_model <- glm(delayed ~ weather_condition + shipping_method,
family = binomial(link = "logit"),
data = supply_data)
summary(reduced_model)
Call:
glm(formula = delayed ~ weather_condition + shipping_method,
family = binomial(link = "logit"), data = supply_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.98581 0.11198 -8.803 < 2e-16 ***
weather_conditionFog 0.28434 0.11839 2.402 0.01632 *
weather_conditionRain 0.21752 0.12107 1.797 0.07239 .
weather_conditionStorm 0.34257 0.12106 2.830 0.00466 **
shipping_methodRail -0.09043 0.11671 -0.775 0.43844
shipping_methodRoad -0.27382 0.11831 -2.314 0.02064 *
shipping_methodSea -0.10427 0.11759 -0.887 0.37522
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3379.2 on 2799 degrees of freedom
Residual deviance: 3364.8 on 2793 degrees of freedom
AIC: 3378.8
Number of Fisher Scoring iterations: 4
anova(full_model, reduced_model, test="Chisq")Analysis of Deviance Table
Model 1: delayed ~ (order_id + order_date + supplier_reliability_score +
warehouse_inventory_level + order_quantity + shipping_distance_km +
shipping_method + weather_condition + processing_time_hours +
order_priority) - order_id
Model 2: delayed ~ weather_condition + shipping_method
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 2785 3356.5
2 2793 3364.8 -8 -8.2303 0.4113
Our reduced model performs just as well as the full model). Even though I stripped away 8 predictors, the accuracy of my predictions didn’t significantly drop. In data science, we call this parsimony it is usually better to use the simpler model because it’s easier to explain and less likely to over-fit the data.
library(ResourceSelection)Warning: package 'ResourceSelection' was built under R version 4.5.2
ResourceSelection 0.3-6 2023-06-27
hoslem.test(full_model$y, fitted(reduced_model))
Hosmer and Lemeshow goodness of fit (GOF) test
data: full_model$y, fitted(reduced_model)
X-squared = 7.5552, df = 8, p-value = 0.4781
Hoslem test shows us that our model is a good fit for the data. The predicted probabilities from our model match the actual observed outcomes closely. how ever we have to keep continue on our diagnosis.
library(pROC)Warning: package 'pROC' was built under R version 4.5.2
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
prob <- predict(reduced_model, type = "response")
roc_obj <- roc(supply_data$delayed, prob)Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_obj)auc(roc_obj)Area under the curve: 0.5401
An Area Under Curve of 0.5401 means our model is performing only slightly better than a random coin toss.
5 Model Prediction with new data
Now we are going to test our model. First we will get
Predicted Probabilities.
prob <- predict(reduced_model, type = "response")
head(prob) 1 2 3 4 5 6
0.3087976 0.2738264 0.3117599 0.3213640 0.3213640 0.3314857
Next we will convert probabilities to class predictions
pred_class <- ifelse(prob > 0.5, 1, 0)Confusion Matrix (Model Performance)
table(Predicted = pred_class, Actual = supply_data$delayed) Actual
Predicted 0 1
0 1984 816
The code chunk below will give us overall prediction accuracy.
mean(pred_class == supply_data$delayed)[1] 0.7085714
now we can use new data to predict outcome.
new_data <- data.frame(
weather_condition = "Storm",
shipping_method = "Sea"
)
predict(reduced_model, new_data, type="response") 1
0.321364
There is a 32.1% chance that this order will be delayed.
coords(roc_obj, "best", ret = c("threshold","sensitivity","specificity")) threshold sensitivity specificity
1 0.2662421 0.7990196 0.2646169
6 Final Conclusion
Our model is statistically valid, but practically weak for prediction.
It identifies trends,but it is not strong enough for operational decision-making yet.