Logistic Regression - Supply Chain Order Delay Risk Analysis

Author

Emrah Akbas

1 Objective

Develop a Logistic Regression classifier to quantify the impact of supply chain variables on order latency and predict the binary outcome of “Delayed” vs. “On-Time.”

2 Exploratory Data Analysis

Rows: 2800 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): shipping_method, weather_condition, order_priority
dbl  (7): order_id, supplier_reliability_score, warehouse_inventory_level, o...
dttm (1): order_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(supply_data)
spc_tbl_ [2,800 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ order_id                  : num [1:2800] 1 2 3 4 5 6 7 8 9 10 ...
 $ order_date                : POSIXct[1:2800], format: "2023-01-01 00:00:00" "2023-01-01 01:00:00" ...
 $ supplier_reliability_score: num [1:2800] 0.63 0.91 0.78 0.89 0.99 0.82 0.8 0.63 0.71 0.8 ...
 $ warehouse_inventory_level : num [1:2800] 2869 2406 4665 1545 511 ...
 $ order_quantity            : num [1:2800] 253 370 198 363 454 44 155 66 471 276 ...
 $ shipping_distance_km      : num [1:2800] 1823 876 134 818 984 ...
 $ shipping_method           : chr [1:2800] "Sea" "Road" "Rail" "Sea" ...
 $ weather_condition         : chr [1:2800] "Fog" "Fog" "Fog" "Storm" ...
 $ processing_time_hours     : num [1:2800] 15.6 53.2 62.2 8.2 46.5 19.1 45.5 47.9 3.3 5.6 ...
 $ order_priority            : chr [1:2800] "Low" "Low" "Low" "High" ...
 $ delayed                   : num [1:2800] 1 0 0 0 1 1 1 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   order_id = col_double(),
  ..   order_date = col_datetime(format = ""),
  ..   supplier_reliability_score = col_double(),
  ..   warehouse_inventory_level = col_double(),
  ..   order_quantity = col_double(),
  ..   shipping_distance_km = col_double(),
  ..   shipping_method = col_character(),
  ..   weather_condition = col_character(),
  ..   processing_time_hours = col_double(),
  ..   order_priority = col_character(),
  ..   delayed = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

After taking a quick peek at our dataset now we can transform character variables into factor variables for us to quantify interactions or possible affects on our desired delayed outcome.

supply_data$shipping_method = as.factor(supply_data$shipping_method)
supply_data$weather_condition = as.factor(supply_data$weather_condition)
supply_data$order_priority = as.factor(supply_data$order_priority)

Now lets continue our exploratory process with our dataset using few useful plots for us to discover potential patterns of the data.

library(skimr)
skim(supply_data)
Data summary
Name supply_data
Number of rows 2800
Number of columns 11
_______________________
Column type frequency:
factor 3
numeric 7
POSIXct 1
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
shipping_method 0 1 FALSE 4 Roa: 723, Rai: 704, Sea: 687, Air: 686
weather_condition 0 1 FALSE 4 Fog: 746, Cle: 696, Rai: 693, Sto: 665
order_priority 0 1 FALSE 3 Low: 942, Hig: 941, Med: 917

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
order_id 0 1 1400.50 808.43 1.0 700.75 1400.50 2100.25 2800.0 ▇▇▇▇▇
supplier_reliability_score 0 1 0.80 0.12 0.6 0.70 0.79 0.90 1.0 ▇▇▇▆▇
warehouse_inventory_level 0 1 2493.01 1408.39 52.0 1309.00 2459.00 3688.00 4996.0 ▇▇▇▇▇
order_quantity 0 1 251.87 145.23 1.0 128.00 253.00 377.00 499.0 ▇▇▇▇▇
shipping_distance_km 0 1 1014.31 570.65 5.2 532.95 1016.10 1508.82 1998.6 ▇▇▇▇▇
processing_time_hours 0 1 35.84 20.62 1.1 17.98 35.40 53.40 72.0 ▇▇▇▇▇
delayed 0 1 0.29 0.45 0.0 0.00 0.00 1.00 1.0 ▇▁▁▁▃

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
order_date 0 1 2023-01-01 2023-04-27 15:00:00 2023-02-28 07:30:00 2800
table(supply_data$delayed)

   0    1 
1984  816 

The Ratio: We have about 29% delayed cases and 71% not delayed.

library(ggplot2)
library(tidyverse)

supply_data |> 
  select(where(is.numeric)) |> 
  pivot_longer(everything()) |> 
  ggplot(aes(x = value)) +
  geom_density(fill = "steelblue", alpha = 0.5) +
  facet_wrap(~name, scales = "free") +
  theme_minimal()

library(ggcorrplot)
corr_matrix <- cor(supply_data |>  select(where(is.numeric)))
ggcorrplot(corr_matrix, 
           hc.order = TRUE, 
           type = "lower",  
           lab = TRUE)      

There is no distinctive pattern showing our correlation plot. All the correlations between variables seem to be very weak.

3 Model Training

We are now fitting a logistic regression model to the dataset, testing all variables against the delayed outcome. Since the correlation plot provided no clear insights, we will begin by including all variables in the model excluding order_id. In our dataset order_id is identifier rather then predictor.

full_model <- glm(delayed ~ . - order_id, 
                  data = supply_data, 
                  family = binomial(link = "logit"))

summary(full_model)

Call:
glm(formula = delayed ~ . - order_id, family = binomial(link = "logit"), 
    data = supply_data)

Coefficients:
                             Estimate Std. Error z value Pr(>|z|)   
(Intercept)                -2.963e+01  2.413e+01  -1.228  0.21960   
order_date                  1.734e-08  1.439e-08   1.206  0.22798   
supplier_reliability_score -4.657e-01  3.624e-01  -1.285  0.19871   
warehouse_inventory_level   7.047e-06  2.971e-05   0.237  0.81248   
order_quantity             -3.909e-04  2.881e-04  -1.357  0.17487   
shipping_distance_km       -5.935e-05  7.337e-05  -0.809  0.41857   
shipping_methodRail        -8.850e-02  1.170e-01  -0.757  0.44927   
shipping_methodRoad        -2.710e-01  1.186e-01  -2.285  0.02233 * 
shipping_methodSea         -1.036e-01  1.181e-01  -0.877  0.38025   
weather_conditionFog        2.813e-01  1.187e-01   2.369  0.01783 * 
weather_conditionRain       2.190e-01  1.213e-01   1.805  0.07104 . 
weather_conditionStorm      3.498e-01  1.214e-01   2.881  0.00397 **
processing_time_hours       2.811e-03  2.033e-03   1.383  0.16676   
order_priorityLow          -6.530e-02  1.019e-01  -0.641  0.52151   
order_priorityMedium       -7.276e-02  1.026e-01  -0.709  0.47835   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3379.2  on 2799  degrees of freedom
Residual deviance: 3356.5  on 2785  degrees of freedom
AIC: 3386.5

Number of Fisher Scoring iterations: 4

4 Interpretation of our Full Model

Our model shows only a marginal improvement in predictive power, as evidenced by the small reduction in deviance from the Null Deviance (3379.2) to the Residual Deviance (3356.5). While adding predictors does technically improve the fit, the 22.7-point drop suggests that the variables are currently capturing very little of the variation in delays. Essentially, the full model is only slightly better at forecasting outcomes than a simple baseline guess based on average delay rates.

4.1 Statistically Significant Predictors (p < 0.05)

shipping_method: Road (p = 0.022) with Negative coefficient (-0.271). This means that orders shipped by road are less likely to be delayed compared to the reference shipping method.

weather_condition: Fog (p = 0.018) with Positive coefficient (+0.281). This tells us that fog increases the likelihood of delay.

weather_condition: Storm (p = 0.004) with Positive coefficient (+0.350) shows us that storms significantly increase the probability of delay.

4.2 Marginally Significant (p < 0.10)

weather_condition: Rain (p = 0.071) Rain may increase delay risk, but evidence is weaker. We are still going to consider this variable for our next reduced model.

reduced_model <- glm(delayed ~ weather_condition + shipping_method, 
                     family = binomial(link = "logit"), 
                     data = supply_data)
summary(reduced_model)

Call:
glm(formula = delayed ~ weather_condition + shipping_method, 
    family = binomial(link = "logit"), data = supply_data)

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)            -0.98581    0.11198  -8.803  < 2e-16 ***
weather_conditionFog    0.28434    0.11839   2.402  0.01632 *  
weather_conditionRain   0.21752    0.12107   1.797  0.07239 .  
weather_conditionStorm  0.34257    0.12106   2.830  0.00466 ** 
shipping_methodRail    -0.09043    0.11671  -0.775  0.43844    
shipping_methodRoad    -0.27382    0.11831  -2.314  0.02064 *  
shipping_methodSea     -0.10427    0.11759  -0.887  0.37522    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3379.2  on 2799  degrees of freedom
Residual deviance: 3364.8  on 2793  degrees of freedom
AIC: 3378.8

Number of Fisher Scoring iterations: 4
anova(full_model, reduced_model, test="Chisq")
Analysis of Deviance Table

Model 1: delayed ~ (order_id + order_date + supplier_reliability_score + 
    warehouse_inventory_level + order_quantity + shipping_distance_km + 
    shipping_method + weather_condition + processing_time_hours + 
    order_priority) - order_id
Model 2: delayed ~ weather_condition + shipping_method
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1      2785     3356.5                     
2      2793     3364.8 -8  -8.2303   0.4113

Our reduced model performs just as well as the full model). Even though I stripped away 8 predictors, the accuracy of my predictions didn’t significantly drop. In data science, we call this parsimony it is usually better to use the simpler model because it’s easier to explain and less likely to over-fit the data.

library(ResourceSelection)
Warning: package 'ResourceSelection' was built under R version 4.5.2
ResourceSelection 0.3-6      2023-06-27
hoslem.test(full_model$y, fitted(reduced_model))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  full_model$y, fitted(reduced_model)
X-squared = 7.5552, df = 8, p-value = 0.4781

Hoslem test shows us that our model is a good fit for the data. The predicted probabilities from our model match the actual observed outcomes closely. how ever we have to keep continue on our diagnosis.

library(pROC)
Warning: package 'pROC' was built under R version 4.5.2
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
prob <- predict(reduced_model, type = "response")
roc_obj <- roc(supply_data$delayed, prob)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
plot(roc_obj)

auc(roc_obj)
Area under the curve: 0.5401

An Area Under Curve of 0.5401 means our model is performing only slightly better than a random coin toss.

5 Model Prediction with new data

Now we are going to test our model. First we will get Predicted Probabilities.

prob <- predict(reduced_model, type = "response")
head(prob)
        1         2         3         4         5         6 
0.3087976 0.2738264 0.3117599 0.3213640 0.3213640 0.3314857 

Next we will convert probabilities to class predictions

pred_class <- ifelse(prob > 0.5, 1, 0)

Confusion Matrix (Model Performance)

table(Predicted = pred_class, Actual = supply_data$delayed)
         Actual
Predicted    0    1
        0 1984  816

The code chunk below will give us overall prediction accuracy.

mean(pred_class == supply_data$delayed)
[1] 0.7085714

now we can use new data to predict outcome.

new_data <- data.frame(
  weather_condition = "Storm",
  shipping_method = "Sea"
)

predict(reduced_model, new_data, type="response")
       1 
0.321364 

There is a 32.1% chance that this order will be delayed.

coords(roc_obj, "best", ret = c("threshold","sensitivity","specificity"))
  threshold sensitivity specificity
1 0.2662421   0.7990196   0.2646169

6 Final Conclusion

Our model is statistically valid, but practically weak for prediction.

It identifies trends,but it is not strong enough for operational decision-making yet.