Freight Mode Analysis

DataCo Global

David Springhetti

Table of Contents


Datasource


scmD <- read.csv("DataCoGlobal.csv")

Original Data source: https://data.mendeley.com/datasets/8gx2fvg2k6/5

Data source citation: Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS”, Mendeley Data, V5, doi: 10.17632/8gx2fvg2k6.5

Adjusted data: https://drive.google.com/file/d/1yS4hgc7uKzOTEYAXaQtAzIeDuco_EUeH/view?usp=sharing

Description: The data is from Mendeley, a website that provides research data. For this project we wanted to find and apply data to a business case (explained later). The data (180519 observations and 53 columns) comes from an e-commerce company, DataCo Global, selling clothing, sport goods, and electronics. The variables in this dataset can be related to the supply chain around an e-commerce company. In specific, the variables contain Customer, Production, Sales, Stores, and Logistics information. In our analysis we will focus on Logistics, in specific on Shipping Modes and Late Deliveries. In terms of statistical analysis, it is interesting to apply mixed effect models to find relationships between variables. For the analysis we are not using the following variables:

A. Shipping

  1. DaysForShippingReal: Actual shipping days of the purchased product

  2. DaysForShipmentScheduled: Days of scheduled delivery of the purchased product

  3. LateDeliveryRisk:Categorical variable that indicates if sending is late 1, if it is on time 0

  4. ShippingMode: Consists of the following shipping modes: Standard Class, First Class, Second Class, Same Day

  5. Shipping date (DateOrders): Exact date and time of shipment

  6. Type: Type of transaction made, Transfer, Cash, Payment, Debit

B. Orders

  1. OrderDate: Date on which the order is made

  2. Department Id: Department code of store

  3. Department Name: Department name of store

  4. Latitude: Latitude corresponding to location of store

  5. Longitude: Longitude corresponding to location of store

  6. Market: Market to where the order is delivered: Africa, Europe, LATAM, Pacific Asia, USCA

  7. Order Region: Region of the world where the order is delivered: Southeast Asia, South Asia, Oceania, Eastern Asia, West Asia, West of USA, US Center, West Africa, Central Africa, North Africa, Western Europe, Northern, Caribbean, South America, East Africa, Southern Europe, East of USA, Canada, Southern Africa, Central Asia, Europe, Central America, Eastern Europe, South of USA

  8. Order Country: Destination country of the order

  9. Order City: Destination city of the order

  10. Order Customer Id: Customer order code

  11. Order Item Total: Total amount per order

  12. Order Profit Per Order: Order Profit Per Order

  13. Order State: State of the region where the order is delivered

Data Exploration

Overview of all columns


rmarkdown::paged_table(scmD)

Overview of relevant statistical measures


library(psych)
rmarkdown::paged_table(describe(scmD))

Missing Values in percent


sum((is.na(scmD))/prod(dim(scmD)) * 100)

[1] 3.513987

In what columns are the missing values?


na_count <-sapply(scmD, function(y) sum(length(which(is.na(y)))))
na_count

OrderZipcode and ProductDescription have missing values. We won´t need those for our analysis. All other columns have 0 missing values.

Region Overview


unique(scmD$OrderRegion)

 [1] "Eastern Asia"    "South Asia"      "Southeast Asia" 
 [4] "Oceania"         "Southern Africa" "North Africa"   
 [7] "Northern Europe" "Eastern Europe"  "Southern Europe"
[10] "Western Europe"  "Central America" "South America"  
[13] "West Asia"       "West Africa"     "Canada"         
[16] "US Center "      "East of USA"     "West of USA "   
[19] "South of  USA "  "Caribbean"       "Central Africa" 
[22] "East Africa"     "Central Asia"   

Problem

In today´s world, e-commerce companies are established entities. It has become normal for people to order the bulk of their things online. The customer shifts more and more to the center of the company´s priorities, bringing higher expectations on the quality of products, shipping, delivery times, and so on. Online Shopping evolved into an experience for the customer. The competition between e-commerce companies is high, the market is fast paced. Companies that don´t adjust in time will likely not exist for very long.

DataCo Global´s customer satisfaction (this is an assumption) and sales dropped, also because of late deliveries. The company needs to make a change to decrease late deliveries and win the trust of their customers back. DataCo Global hired us to consult them on late deliveries.

Goal

Classify deliveries by mode and late delivery risk. Use mixed effect models to support the initial classification. Recommend next steps that will decrease late deliveries.

Analysis

The below models will look at the relationships between LateDeliveryRisk, ShippingMode, DaysForShippingReal, DaysForShippingScheduled, Transfer Type. We assume that the late delivery risk is real, meaning 0 is a shipment on time, and 1 is a late shipment. LateDeliveryRisk is our binary response variable (0, 1) and used to test our assumptions with various statistical models.

Overview

Shipping Modes * Same Day * First Class * Second Class * Standard Class

Cost of Shipping Modes ($)


SameDayCost <- 10
FirstClassCost <- 7
SecondClassCost <- 4
StandardClassCost <- 3

Overall late vs on time deliveries


library(plotly)
All_d <- table(scmD$LateDeliveryRisk)
All_sum <- sum(All_d)

Ontime_filter <- (sum(scmD$LateDeliveryRisk == 0))
Late_filter <- (sum(scmD$LateDeliveryRisk == 1))

Ontime_p <- (Ontime_filter/All_sum)*100 
Ontime_p %>% round(1)

[1] 59.1

Late_p <- (Late_filter/All_sum)*100 
Late_p %>% round(1)

[1] 40.9

labels = c("Deliveries")

fig <- plot_ly(x = ~labels, y = ~Ontime_p, type = 'bar', 
               name = 'On Time',text = Ontime_p %>% round(1), 
               textposition = 'auto', 
               marker = list(color = 'blue'))
fig <- fig %>% add_trace(y = ~Late_p, name = 'Late', 
                         text = Late_p %>% round(1),
                         textposition = 'auto', 
                         marker = list(color = 'red'))
fig <- fig %>% layout(title = "Total Deliveries",
                      yaxis = list(title = '%'), 
                      barmode = 'group', 
                      xaxis = list(title = ''))
fig

It looks like that about 59% of deliveries are on time, while 41% have a delay.

Delivery Count by Shipping Mode


library(kableExtra)
Shipping_All <- table(scmD$ShippingMode)
Shipping_AP <- Shipping_All 
DeliveryC = as.data.frame(Shipping_AP)

kbl(as.data.frame(DeliveryC), booktabs = T) %>% 
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm") 
Var1 Freq
First Class 27814
Same Day 9737
Second Class 35216
Standard Class 107752

fig <- plot_ly(DeliveryC, x = ~Var1, y = ~Freq, type = 'bar',
               text = ~Freq %>% round(1), 
               textposition = 'auto', 
               marker = list(color = "blue"))
fig <- fig %>% layout(title = "Deliveries by Class",
                      yaxis = list(title = ''), 
                      xaxis = list(title = ''))
fig

Delivery (%) by Shipping Mode


library(dplyr)
library(plotly)
Shipping_All <- table(scmD$ShippingMode)
Shipping_AP <- prop.table(Shipping_All) %>% {. * 100} %>% round(1)
Ovplot = as.data.frame(Shipping_AP)

kbl(as.data.frame(Ovplot), booktabs = T) %>% 
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm") 
Var1 Freq
First Class 15.4
Same Day 5.4
Second Class 19.5
Standard Class 59.7

fig <- plot_ly(Ovplot, x = ~Var1, y = ~Freq, type = 'bar',
               text = ~Freq %>% round(1), 
               textposition = 'auto', 
               marker = list(color = "blue"))
fig <- fig %>% layout(title = "Deliveries by Class",
                      yaxis = list(title = '%'), 
                      xaxis = list(title = ''))
fig

Shipping Mode by Late Delivery Risk (Count)


Shipping_D <- table(scmD$LateDeliveryRisk, scmD$ShippingMode)
ShippingACount <- as.data.frame(Shipping_D)
kbl(as.data.frame(ShippingACount), booktabs = T) %>% 
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm") 
Var1 Var2 Freq
0 First Class 26513
1 First Class 1301
0 Same Day 5283
1 Same Day 4454
0 Second Class 8229
1 Second Class 26987
0 Standard Class 66729
1 Standard Class 41023

Shipping Mode by Late Delivery Risk (%)


library(plotly)
library(ggplot2)
library(kableExtra)
Shipping_D <- table(scmD$LateDeliveryRisk, scmD$ShippingMode)
Shipping_D

   
    First Class Same Day Second Class Standard Class
  0       26513     5283         8229          66729
  1        1301     4454        26987          41023

Shipping_P <- prop.table(Shipping_D, margin = 2) %>% {. * 100} %>% round(1)
Splot = as.data.frame(Shipping_P)

kbl(as.data.frame(Splot), booktabs = T) %>% 
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
Var1 Var2 Freq
0 First Class 95.3
1 First Class 4.7
0 Same Day 54.3
1 Same Day 45.7
0 Second Class 23.4
1 Second Class 76.6
0 Standard Class 61.9
1 Standard Class 38.1

Ontime_filter <- filter(Splot, Var1 == "0")
Late_filter <- filter(Splot, Var1 == "1")

ShippingMode <- c("Same Day", "First Class", "Second Class", "Standard Class")
OnTime <- c(54.3, 95.3, 23.4, 61.9)
Late <- c(45.7, 4.7, 76.6, 38.1)
data <- data.frame(ShippingMode, OnTime, Late)

fig <- plot_ly(data, x = ~ShippingMode, y = ~OnTime, type = 'bar', 
               name = 'On Time', text = ~OnTime, 
               textposition = 'auto', marker = list(color = 'blue'))
fig <- fig %>% add_trace(y = ~Late, name = 'Late', 
                         text = ~Late, textposition = 'auto', 
                         marker = list(color = 'red'))
fig <- fig %>% layout(title = "Late Deliveries by Class",
                      yaxis = list(title = ''), barmode = 'group', 
                      xaxis = list(title = ''))

fig

In the visualization we can assume a difference in reliability between shipping modes. The plot gives an overview but we need to confirm the initial findings with statistically significant numbers.

Distribution Days for Shipping Scheduled vs Real


fig <- plot_ly(alpha = 0.7)
fig <- fig %>% add_histogram(x = ~scmD$DaysForShipmentScheduled, 
                             name = 'Scheduled', 
                             marker = list(color = 'red',
                             line = list(color ='white', width = 1),
                             histnorm = "probability"))
fig <- fig %>% add_histogram(x = ~scmD$DaysForShippingReal + 1, 
                             name = 'Real', 
                             marker = list(color = 'blue',
                             line = list(color ='white', width = 1),
                             histnorm = "probability"))
fig <- fig %>% layout(barmode = "overlay",
                      title = "Scheduled vs Real Deliveries",
                      yaxis = list(title = ''),
                      xaxis = list(title = 'Days'))

fig

The histograms show that there are some disparities between planned and actual shipping time.

Standard Linear Model

GLM Benefits/Order by Shipping Mode


lmtest <- glm(BenefitPerOrder ~ ShippingMode,
             data = scmD)

summary(lmtest)

Call:
glm(formula = BenefitPerOrder ~ ShippingMode, data = scmD)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4296.3    -15.0      9.5     42.8    888.7  

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 23.1222     0.6262  36.925   <2e-16 ***
ShippingModeSame Day        -2.2720     1.2297  -1.848   0.0647 .  
ShippingModeSecond Class    -1.8163     0.8377  -2.168   0.0301 *  
ShippingModeStandard Class  -1.1231     0.7024  -1.599   0.1098    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 10906.18)

    Null deviance: 1968794529  on 180518  degrees of freedom
Residual deviance: 1968729773  on 180515  degrees of freedom
AIC: 2190597

Number of Fisher Scoring iterations: 2

BenefitPerOrder = (-2.2720 (Same Day)) + (-1.8163 (Second Class)) + (-1.1231 (Standard Class))


Shipping_D <- table(scmD$ShippingMode, scmD$LateDeliveryRisk)
Shipping_D

                
                     0     1
  First Class    26513  1301
  Same Day        5283  4454
  Second Class    8229 26987
  Standard Class 66729 41023

GLM Late Delivery


lmTest <- glm(LateDeliveryRisk ~ DaysForShipmentScheduled + 
             DaysForShippingReal + OrderRegion,
             data = scmD, family = binomial)

summary(lmTest)

Call:
glm(formula = LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal + 
    OrderRegion, family = binomial, data = scmD)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.8536  -0.1066  -0.0189   0.1315   3.3237  

Coefficients:
                            Estimate Std. Error  z value Pr(>|z|)    
(Intercept)                -1.789296   0.160162  -11.172   <2e-16 ***
DaysForShipmentScheduled   -3.410749   0.019343 -176.330   <2e-16 ***
DaysForShippingReal         3.351885   0.017376  192.898   <2e-16 ***
OrderRegionCaribbean        0.010826   0.165471    0.065    0.948    
OrderRegionCentral Africa   0.277987   0.191455    1.452    0.147    
OrderRegionCentral America  0.031801   0.160671    0.198    0.843    
OrderRegionCentral Asia    -0.108660   0.235419   -0.462    0.644    
OrderRegionEast Africa      0.034404   0.187241    0.184    0.854    
OrderRegionEast of USA      0.047875   0.166974    0.287    0.774    
OrderRegionEastern Asia     0.066079   0.166905    0.396    0.692    
OrderRegionEastern Europe   0.103393   0.173695    0.595    0.552    
OrderRegionNorth Africa     0.006107   0.176344    0.035    0.972    
OrderRegionNorthern Europe  0.008382   0.164663    0.051    0.959    
OrderRegionOceania          0.208447   0.164103    1.270    0.204    
OrderRegionSouth America    0.121481   0.162525    0.747    0.455    
OrderRegionSouth Asia       0.038480   0.166473    0.231    0.817    
OrderRegionSouth of  USA    0.109999   0.173152    0.635    0.525    
OrderRegionSoutheast Asia   0.081984   0.164903    0.497    0.619    
OrderRegionSouthern Africa -0.260416   0.206244   -1.263    0.207    
OrderRegionSouthern Europe  0.021181   0.164639    0.129    0.898    
OrderRegionUS Center        0.070206   0.168067    0.418    0.676    
OrderRegionWest Africa     -0.033489   0.172620   -0.194    0.846    
OrderRegionWest Asia        0.100183   0.168273    0.595    0.552    
OrderRegionWest of USA      0.103690   0.165928    0.625    0.532    
OrderRegionWestern Europe   0.067176   0.160760    0.418    0.676    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 244190  on 180518  degrees of freedom
Residual deviance:  64178  on 180494  degrees of freedom
AIC: 64228

Number of Fisher Scoring iterations: 7

ggplot(scmD,
       aes(DaysForShippingReal, DaysForShipmentScheduled,
           color = as.factor(LateDeliveryRisk))) +
  geom_point(size = 3) +
  theme_minimal()

The visualization gives a rough overview between the differences in Scheduled vs Real Shipping Days.

Random Intercept Model

Compared to our standard linear model, we are now including ShippingMode, not as a predictor, but as another level into our model, to get more accurate results. We use glmer, a command for binary response variables, which is part of lme4.


library(lme4)

intMod <- lme4::glmer(LateDeliveryRisk ~  1 + (1|ShippingMode), 
               data = scmD, family = binomial)
summary(intMod)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: LateDeliveryRisk ~ 1 + (1 | ShippingMode)
   Data: scmD

      AIC       BIC    logLik  deviance  df.resid 
 205457.4  205477.6 -102726.7  205453.4    180517 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.8108 -0.7841 -0.2216  1.0891  4.5124 

Random effects:
 Groups       Name        Variance Std.Dev.
 ShippingMode (Intercept) 2.305    1.518   
Number of obs: 180519, groups:  ShippingMode, 4

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.6208     0.1201  -5.169 2.35e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A statistically significant p-value indicates that there is some relationship between LateDeliveryRisk and Shipping Modes.

We can add a predictor variable to the model:


riMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal +
              DaysForShipmentScheduled + (1|ShippingMode), 
              data = scmD, 
              family = binomial)

summary(riMod)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +  
    (1 | ShippingMode)
   Data: scmD

     AIC      BIC   logLik deviance df.resid 
 59228.3  59268.7 -29610.2  59220.3   180515 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1030.04    -0.22     0.00     0.05     4.52 

Random effects:
 Groups       Name        Variance Std.Dev.
 ShippingMode (Intercept) 2.633    1.623   
Number of obs: 180519, groups:  ShippingMode, 4

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -1.54855    0.39476  -3.923 8.75e-05 ***
DaysForShippingReal       4.17641    0.02394 174.456  < 2e-16 ***
DaysForShipmentScheduled -4.21985    0.33893 -12.451  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) DysFSR
DysFrShppnR  0.027       
DysFrShpmnS -0.462 -0.071

Random effects: The standard deviation in random effects tells how much the scores move around based on the variable ShippingMode. Following the output, this score can move by 1.6 points based upon ShippingMode. This confirms that there are differences, in this case in terms of delivery time, between shipping methods.

Standard Errors: By integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level. The standard errors changed from the previous model by including information that impact the groups. Thanks to that we get a better understanding of the uncertainty in the model.

Correlation of Fixed Effects: When groups are largely balanced, we would find that our coefficients would be the same (or very close to it). The groups in the correlation of fixed effects look like they are not balanced. There is no similarity in value.

Below, we can get a good sense of the random effect estimates:


MuMIn::r.squaredGLMM(riMod)

                  R2m       R2c
theoretical 0.8797382 0.9331974
delta       0.8640148 0.9165185

We get two values the marginal R2 (R2m) and the conditional R2 (R2c). Marginal values are the standard type of R², the variability explained by fixed effects in the model. The Conditional R² uses both fixed and random effects.

In this case, we would see that we are accounting for about 5.3% of the variation on the theoretical level, while delta gives us a variation of about 5.25%. Because we are a binomial distribution we get both these variance levels.

We can plot simulated random effect ranges for each of the random effect groups, Shipping Modes:


library(merTools)
plotREsim(REsim(riMod), labs = TRUE)

The values are not actual values, but rather the difference between the general intercept or slope value found in the model summary and the estimate for this specific level of random effect.

PlotREsim highlights group levels that have a simulated distribution that does not overlap 0 – these appear darker. The lighter bars represent grouping levels that are not distinguishable from 0 in the data. The values in the above plot are effects for each individual random effect, in this case for each Shipping Mode. We can observe a broad range of Shipping Modes based on Late Delivery Risk. The point for First Class stands out. The plot above confirms our initial visualization where we showed a distribution by Shipping Mode in terms of being on time or late.

We can get exact values shown in the plot:


ranks <- expectedRank(riMod)
head(ranks)

     groupFctr     groupLevel      term   estimate    std.error
2 ShippingMode    First Class Intercept  2.7964902 0.0008067361
3 ShippingMode       Same Day Intercept -0.8006355 0.0010413073
4 ShippingMode   Second Class Intercept -1.1955168 0.0006402459
5 ShippingMode Standard Class Intercept -0.8004332 0.0002263543
        ER pctER
2 4.000000    88
3 2.497734    50
4 1.000000    12
5 2.502266    50

Below we are getting predicted values from our mixed model:


mixedPred <- predict(riMod, type = 'response')

slimPred <- predict(lmTest, type = 'response')

allPred <- cbind(actual = scmD$LateDeliveryRisk, 
      mixed = mixedPred, 
      slim = slimPred)

head(allPred, 10)

   actual      mixed        slim
1       0 0.04673682 0.005525557
2       0 0.04673682 0.005525557
3       0 0.04673682 0.005525557
4       0 0.04673682 0.005375950
5       0 0.04673682 0.005375950
6       0 0.04673682 0.005375950
7       0 0.04673682 0.005375950
8       0 0.04673682 0.005375950
9       0 0.04673682 0.005525557
10      0 0.04673682 0.005525557

While there were cases where the standard linear model did a slightly better job, our mixed model generally did better.

We can plot these to make it clearer:


plot(allPred[, "actual"], allPred[, "slim"])


plot(allPred[, "actual"], allPred[, "mixed"])

We can add additional information to the model with the goal to get more accurate results. In the below model we also have a building variable. We can add this additional grouping into our model:


clusterMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal + 
                     DaysForShipmentScheduled + 
                     (1|ShippingMode) + (1|Type), 
                   data = scmD, family = binomial)

summary(clusterMod)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +  
    (1 | ShippingMode) + (1 | Type)
   Data: scmD

     AIC      BIC   logLik deviance df.resid 
 58180.5  58231.0 -29085.2  58170.5   180514 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-895.06   -0.24    0.00    0.05    6.23 

Random effects:
 Groups       Name        Variance Std.Dev.
 ShippingMode (Intercept) 2.8019   1.6739  
 Type         (Intercept) 0.1634   0.4043  
Number of obs: 180519, groups:  ShippingMode, 4; Type, 4

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -1.55443    0.25697  -6.049 1.46e-09 ***
DaysForShippingReal       4.26883    0.02474 172.568  < 2e-16 ***
DaysForShipmentScheduled -4.31347    0.23642 -18.245  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) DysFSR
DysFrShppnR -0.016       
DysFrShpmnS -0.137 -0.054

Plot the effect ranges again


plotREsim(REsim(clusterMod), labs = TRUE)

The plot confirms, and further adds to the previous model.

We can find out if the predictions improved:


clusterPredict <- predict(clusterMod, type = 'response')

allPred <- cbind(actual = scmD$LateDeliveryRisk, 
           clustered = clusterPredict,
           mixed = mixedPred, 
           slim = slimPred)

head(allPred, 10)

   actual  clustered      mixed        slim
1       0 0.05546389 0.04673682 0.005525557
2       0 0.02512777 0.04673682 0.005525557
3       0 0.05450154 0.04673682 0.005525557
4       0 0.05546389 0.04673682 0.005375950
5       0 0.02512777 0.04673682 0.005375950
6       0 0.05546389 0.04673682 0.005375950
7       0 0.05450154 0.04673682 0.005375950
8       0 0.05546389 0.04673682 0.005375950
9       0 0.05546389 0.04673682 0.005525557
10      0 0.05501780 0.04673682 0.005525557

Thanks to the plots above its easy to recognize that we have slight improvements in our clustered model:


plot(allPred[, "actual"], allPred[, "slim"])


plot(allPred[, "actual"], allPred[, "mixed"])


plot(allPred[, "actual"], allPred[, "clustered"])

Hierarchical Model

Hierarchical models are a little bit different than the previous ones. In the following models we will have groups nested within other groups. We know that we have a Shipping Mode variable and a Type (payment type) variable.

Within the parentheses, we have our intercept as before, but we are also saying that we are looking at the Shipping Modes nested within Type.


hierMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal + 
                DaysForShipmentScheduled + 
                (1|ShippingMode/Type),
                data = scmD, family = binomial)

summary(hierMod)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +  
    (1 | ShippingMode/Type)
   Data: scmD

     AIC      BIC   logLik deviance df.resid 
 51785.9  51836.5 -25888.0  51775.9   180514 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-792.26   -0.03    0.00    0.03    2.24 

Random effects:
 Groups            Name        Variance  Std.Dev. 
 Type:ShippingMode (Intercept) 3.856e+00 1.9636970
 ShippingMode      (Intercept) 9.150e-07 0.0009566
Number of obs: 180519, groups:  
Type:ShippingMode, 16; ShippingMode, 4

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -2.74718    0.45200  -6.078 1.22e-09 ***
DaysForShippingReal       4.53604    0.02709 167.425  < 2e-16 ***
DaysForShipmentScheduled -4.57365    0.22253 -20.553  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) DysFSR
DysFrShppnR -0.030       
DysFrShpmnS -0.552 -0.092

We have our fixed effect for observation and we have individual intercepts for both Type and Shipping Mode nested within Type. Looking at the standard deviation of our random effects, we can see that the scores differ by about 2 from Type to Shipping Mode.

A look at the effect ranges:


plotREsim(REsim(hierMod), labs = TRUE)

For the effect ranges we get again a similar pattern as before, with some slight adjustments.

We can compare the predictions below:


hierModPredict <- predict(hierMod, type = 'response')

allPred <- cbind(actual = scmD$LateDeliveryRisk, 
                 hierMod = hierModPredict,
                 clustered = clusterPredict,
                 mixed = mixedPred, 
                 slim = slimPred)

head(allPred, 10)

   actual      hierMod  clustered      mixed        slim
1       0 5.782315e-05 0.05546389 0.04673682 0.005525557
2       0 1.663266e-01 0.02512777 0.04673682 0.005525557
3       0 1.340084e-04 0.05450154 0.04673682 0.005525557
4       0 5.782315e-05 0.05546389 0.04673682 0.005375950
5       0 1.663266e-01 0.02512777 0.04673682 0.005375950
6       0 5.782315e-05 0.05546389 0.04673682 0.005375950
7       0 1.340084e-04 0.05450154 0.04673682 0.005375950
8       0 5.782315e-05 0.05546389 0.04673682 0.005375950
9       0 5.782315e-05 0.05546389 0.04673682 0.005525557
10      0 8.430066e-05 0.05501780 0.04673682 0.005525557

And plot to look at the differences quickly:


plot(allPred[, "actual"], allPred[, "slim"])


plot(allPred[, "actual"], allPred[, "mixed"])


plot(allPred[, "actual"], allPred[, "clustered"])


plot(allPred[, "actual"], allPred[, "hierMod"])

We can see that the Hierarchical Model is the most accurate in terms of performance.

Random Slopes

At this point we can add random slopes. In the below model, we will define a random intercept (DaysForShippingReal|ShippingMode) and a random slope. We want to see differences in the Shipping Modes in terms of the intercept and slope.


randomSlopes <- lme4::glmer(LateDeliveryRisk ~ DaysForShipmentScheduled +
                       DaysForShippingReal + 
                       (DaysForShippingReal|ShippingMode), 
                     data = scmD, family = binomial)

summary(randomSlopes)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal +  
    (DaysForShippingReal | ShippingMode)
   Data: scmD

     AIC      BIC   logLik deviance df.resid 
 54571.1  54631.7 -27279.5  54559.1   180513 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-52.695  -0.211  -0.001   0.062   4.515 

Random effects:
 Groups       Name                Variance Std.Dev. Corr 
 ShippingMode (Intercept)         52.98    7.279         
              DaysForShippingReal 21.87    4.676    -1.00
Number of obs: 180519, groups:  ShippingMode, 4

Fixed effects:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)               2.05646    0.38099   5.398 6.75e-08 ***
DaysForShipmentScheduled -6.42557    0.09835 -65.334  < 2e-16 ***
DaysForShippingReal       5.30310    0.31566  16.800  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) DysFSS
DysFrShpmnS -0.146       
DysFrShppnR -0.662 -0.604
optimizer (Nelder_Mead) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular

We can observe some slight changes in terms of our fixed effects. We also get additional information for the random effects. The random intercept standard deviation for ShippingMode is telling us the amount that the scores bounce around from place to place and the DaysForShippingReal variance is telling us how much variability there is within the slope between locations.

A plot of the above model:


library(kableExtra)
scmD$ShippingMode <- as.factor(scmD$ShippingMode)

scmd_ship <- scmD %>% 
  group_by(ShippingMode, LateDeliveryRisk) %>%
  summarise(mean_days = mean(DaysForShippingReal, na.rm = TRUE))

kbl(as.data.frame(scmd_ship), booktabs = T) %>% 
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm") 
ShippingMode LateDeliveryRisk mean_days
First Class 0 1.0000000
First Class 1 1.0000000
Same Day 0 0.0384251
Same Day 1 1.0000000
Second Class 0 2.3315105
Second Class 1 4.4967948
Standard Class 0 3.0670323
Standard Class 1 5.5068376

scmD$ShippingMode <- factor(scmD$ShippingMode, levels = c("Standard Class", "Second Class", "First Class", "Same Day"))

ggplot(scmd_ship, aes(LateDeliveryRisk, mean_days,
                 group = ShippingMode, color = ShippingMode)) +
  geom_line(linetype = 'dashed', size = 1) +
  geom_point (size = 4) +
  labs(title = " ",
        x = "Delivery",
        y = "Shipping Days",
        color = " ") + 
  scale_y_continuous(breaks=c(1,2,3,4,5)) +
  scale_x_continuous(breaks=c(0, 0.25, 0.5, 0.75, 1),
        labels=c("On Time", " ", " ", " ", "Late")) +
  #theme_bw() +
  theme_classic() +
  scale_fill_manual(name=" ")

We can see a clear difference in the slopes between the different Shipping Modes. The steeper the slope, the greater the risk in terms of days having a late delivery. It becomes clear that First Class Shipping is the most reliable in terms of risk and late deliveries, followed by Same Day Shipping. A steeper slope indicates a broader range of Late Deliveries.

Let´s compare both Models and see if we got more accuracy


ri1 <- lme4::glmer(LateDeliveryRisk ~ DaysForShipmentScheduled +
            DaysForShippingReal + (1 | ShippingMode), 
            data = scmD, family = binomial)

ri2 <- lme4::glmer(LateDeliveryRisk ~  DaysForShipmentScheduled +
            DaysForShippingReal + (DaysForShippingReal|ShippingMode), 
            data = scmD, family = binomial)

anova(ri1, ri2)

Data: scmD
Models:
ri1: LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal + 
ri1:     (1 | ShippingMode)
ri2: LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal + 
ri2:     (DaysForShippingReal | ShippingMode)
    npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
ri1    4 59228 59269 -29610    59220                         
ri2    6 54571 54632 -27280    54559 4661.2  2  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Finally, we can compare both, Random Intercept and Random intercept with slopes models. From the results we can observe the best performance in the second, hierarchical model.

Key Takeaways

  1. Mixed Effect Models confirm the differences within shipping modes. In detail, this is explained by the variation seen in the models. There is a statistically significant relationship between LateDeliveryRisk and DaysForShippingReal. Adding random effects to the model confirms the differences in delivery times between shipping modes. On average, deliveries move around by 1.6 days. By looking at the correlation of fixed effects we confirm that the different shipping modes are not balanced

  2. First Class is the most reliable shipping mode. Almost half of Same Day deliveries are late by one day. Standard Class and Second Class Deliveries account for most deliveries and face the biggest risk to be delivered late by about 2.5 days

  1. Further analysis on operational efficiency and in-house supply chain: Are late deliveries caused by internal or external factors? Are certain geographies more affected by late deliveries? Are specific products more likely to be delivered late? Are there problems with transportation routes, product sizes, batching etc?

  2. After further analysis of the supply chain, and assuming there are no internally driven problems causing late deliveries: DataCo Global should find a more reliable Shipping Partner and shift most deliveries to First Class and Same Day.

  1. Close a deal with a Shipping Company (FedEx, UPS) for First Class Shipping on all products
  2. Offer a membership package to customers with unlimited First Class Shipping
  1. Once actions are implemented, customer satisfaction and sales are projected to increase. DataCo Global should focus on current and niche markets to sell their products to establish a favorable position within the market

Results

The company DataCo Global is shipping the majority of their goods with Standard and Second Class. Both these Shipping Methods were proved to cause the most late deliveries when compared to First Class Shipping. Further analysis on operational efficiency and in-house supply chain is required to figure out whether late deliveries are caused by internal or external factors. In detail, further research on costs concerning the different shipping modes, batching, product weight, and transportation routes are required. DataCo Global needs to make sure that its internal supply chain is not hindering shipping times and late deliveries.

Overall, First Class Shipping Mode is the most reliable and amounts to least Late Deliveries. To increase customer satisfaction, DataCo Global needs to make a change in Shipping Modes and shift the majority of their deliveries to First Class shipping. This will likely increase costs. DataCo Global should partner with a Shipping Company to get exclusive rates for First Class Shipping.

DataCo Global can then offer a Premium Membership in exchange for a yearly fee from their customer. The Premium Membership should include unlimited First Class and Same Day Deliveries. If DataCo Global incurs highers costs due to offering more expensive shipping rates at a premium for customers, we suggest the following:

  1. Close a deal with a Shipping Company such as UPS or FedEx to get exclusive shipping rates. Company should fit DataCo Global´s needs for deliveries and be a reliable partner.

  2. Similar the way Amazon offers AmazonBasics to its customers, DataCo should introduce its own branded products to customers (potentially named DataCo Essentials). These products would be offered at competitive prices, giving customers the ability to purchase private-label products directly from DataCo, allowing the firm to help offset higher costs it may incur from offering premium shipping at a discounted or more affordable price to its customers.

Some of DataCo Globals most realistic competitors are Amazon or Alibaba. Amazon is present in almost every part of the world, whereas Alibaba dominates a large share of the ecommerce marketplace in China and the Western half of the hemisphere. DataCo Global should identify niches where Amazon and Alibaba are not present and establish an early relationship with customers in such areas. Currently, Amazon only offers Amazon Prime (Same day shipping option) to customers in the U.S., United Kingdom, Japan, Germany and Canada. However, customers in India, France, Italy and Spain do not have the option for Amazon Same Day services. It would be a savvy move for DataCo Global to enter the market in these countries, and establish relationships with these customers by offering its products along with Same Day and First Class shipping.

Premium Memberships are going to be an additional expense for the consumer. However, they will add value to the company and give the consumer a more exclusive feeling when shopping on the website. The Premium Membership will boost customer satisfaction, also thanks to on time deliveries and a reliable service and experience.