DataCo Global
scmD <- read.csv("DataCoGlobal.csv")
Original Data source: https://data.mendeley.com/datasets/8gx2fvg2k6/5
Data source citation: Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS”, Mendeley Data, V5, doi: 10.17632/8gx2fvg2k6.5
Adjusted data: https://drive.google.com/file/d/1yS4hgc7uKzOTEYAXaQtAzIeDuco_EUeH/view?usp=sharing
Description: The data is from Mendeley, a website that provides research data. For this project we wanted to find and apply data to a business case (explained later). The data (180519 observations and 53 columns) comes from an e-commerce company, DataCo Global, selling clothing, sport goods, and electronics. The variables in this dataset can be related to the supply chain around an e-commerce company. In specific, the variables contain Customer, Production, Sales, Stores, and Logistics information. In our analysis we will focus on Logistics, in specific on Shipping Modes and Late Deliveries. In terms of statistical analysis, it is interesting to apply mixed effect models to find relationships between variables. For the analysis we are not using the following variables:
A. Shipping
DaysForShippingReal: Actual shipping days of the purchased product
DaysForShipmentScheduled: Days of scheduled delivery of the purchased product
LateDeliveryRisk:Categorical variable that indicates if sending is late 1, if it is on time 0
ShippingMode: Consists of the following shipping modes: Standard Class, First Class, Second Class, Same Day
Shipping date (DateOrders): Exact date and time of shipment
Type: Type of transaction made, Transfer, Cash, Payment, Debit
B. Orders
OrderDate: Date on which the order is made
Department Id: Department code of store
Department Name: Department name of store
Latitude: Latitude corresponding to location of store
Longitude: Longitude corresponding to location of store
Market: Market to where the order is delivered: Africa, Europe, LATAM, Pacific Asia, USCA
Order Region: Region of the world where the order is delivered: Southeast Asia, South Asia, Oceania, Eastern Asia, West Asia, West of USA, US Center, West Africa, Central Africa, North Africa, Western Europe, Northern, Caribbean, South America, East Africa, Southern Europe, East of USA, Canada, Southern Africa, Central Asia, Europe, Central America, Eastern Europe, South of USA
Order Country: Destination country of the order
Order City: Destination city of the order
Order Customer Id: Customer order code
Order Item Total: Total amount per order
Order Profit Per Order: Order Profit Per Order
Order State: State of the region where the order is delivered
Overview of all columns
rmarkdown::paged_table(scmD)
Overview of relevant statistical measures
library(psych)
rmarkdown::paged_table(describe(scmD))
Missing Values in percent
sum((is.na(scmD))/prod(dim(scmD)) * 100)
[1] 3.513987
In what columns are the missing values?
na_count <-sapply(scmD, function(y) sum(length(which(is.na(y)))))
na_count
OrderZipcode and ProductDescription have missing values. We won´t need those for our analysis. All other columns have 0 missing values.
Region Overview
unique(scmD$OrderRegion)
[1] "Eastern Asia" "South Asia" "Southeast Asia"
[4] "Oceania" "Southern Africa" "North Africa"
[7] "Northern Europe" "Eastern Europe" "Southern Europe"
[10] "Western Europe" "Central America" "South America"
[13] "West Asia" "West Africa" "Canada"
[16] "US Center " "East of USA" "West of USA "
[19] "South of USA " "Caribbean" "Central Africa"
[22] "East Africa" "Central Asia"
In today´s world, e-commerce companies are established entities. It has become normal for people to order the bulk of their things online. The customer shifts more and more to the center of the company´s priorities, bringing higher expectations on the quality of products, shipping, delivery times, and so on. Online Shopping evolved into an experience for the customer. The competition between e-commerce companies is high, the market is fast paced. Companies that don´t adjust in time will likely not exist for very long.
DataCo Global´s customer satisfaction (this is an assumption) and sales dropped, also because of late deliveries. The company needs to make a change to decrease late deliveries and win the trust of their customers back. DataCo Global hired us to consult them on late deliveries.
Classify deliveries by mode and late delivery risk. Use mixed effect models to support the initial classification. Recommend next steps that will decrease late deliveries.
The below models will look at the relationships between LateDeliveryRisk, ShippingMode, DaysForShippingReal, DaysForShippingScheduled, Transfer Type. We assume that the late delivery risk is real, meaning 0 is a shipment on time, and 1 is a late shipment. LateDeliveryRisk is our binary response variable (0, 1) and used to test our assumptions with various statistical models.
Shipping Modes * Same Day * First Class * Second Class * Standard Class
Cost of Shipping Modes ($)
SameDayCost <- 10
FirstClassCost <- 7
SecondClassCost <- 4
StandardClassCost <- 3
Overall late vs on time deliveries
library(plotly)
All_d <- table(scmD$LateDeliveryRisk)
All_sum <- sum(All_d)
Ontime_filter <- (sum(scmD$LateDeliveryRisk == 0))
Late_filter <- (sum(scmD$LateDeliveryRisk == 1))
Ontime_p <- (Ontime_filter/All_sum)*100
Ontime_p %>% round(1)
[1] 59.1
Late_p <- (Late_filter/All_sum)*100
Late_p %>% round(1)
[1] 40.9
labels = c("Deliveries")
fig <- plot_ly(x = ~labels, y = ~Ontime_p, type = 'bar',
name = 'On Time',text = Ontime_p %>% round(1),
textposition = 'auto',
marker = list(color = 'blue'))
fig <- fig %>% add_trace(y = ~Late_p, name = 'Late',
text = Late_p %>% round(1),
textposition = 'auto',
marker = list(color = 'red'))
fig <- fig %>% layout(title = "Total Deliveries",
yaxis = list(title = '%'),
barmode = 'group',
xaxis = list(title = ''))
fig
It looks like that about 59% of deliveries are on time, while 41% have a delay.
Delivery Count by Shipping Mode
library(kableExtra)
Shipping_All <- table(scmD$ShippingMode)
Shipping_AP <- Shipping_All
DeliveryC = as.data.frame(Shipping_AP)
kbl(as.data.frame(DeliveryC), booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
| Var1 | Freq |
|---|---|
| First Class | 27814 |
| Same Day | 9737 |
| Second Class | 35216 |
| Standard Class | 107752 |
fig <- plot_ly(DeliveryC, x = ~Var1, y = ~Freq, type = 'bar',
text = ~Freq %>% round(1),
textposition = 'auto',
marker = list(color = "blue"))
fig <- fig %>% layout(title = "Deliveries by Class",
yaxis = list(title = ''),
xaxis = list(title = ''))
fig
Delivery (%) by Shipping Mode
library(dplyr)
library(plotly)
Shipping_All <- table(scmD$ShippingMode)
Shipping_AP <- prop.table(Shipping_All) %>% {. * 100} %>% round(1)
Ovplot = as.data.frame(Shipping_AP)
kbl(as.data.frame(Ovplot), booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
| Var1 | Freq |
|---|---|
| First Class | 15.4 |
| Same Day | 5.4 |
| Second Class | 19.5 |
| Standard Class | 59.7 |
fig <- plot_ly(Ovplot, x = ~Var1, y = ~Freq, type = 'bar',
text = ~Freq %>% round(1),
textposition = 'auto',
marker = list(color = "blue"))
fig <- fig %>% layout(title = "Deliveries by Class",
yaxis = list(title = '%'),
xaxis = list(title = ''))
fig
Shipping Mode by Late Delivery Risk (Count)
Shipping_D <- table(scmD$LateDeliveryRisk, scmD$ShippingMode)
ShippingACount <- as.data.frame(Shipping_D)
kbl(as.data.frame(ShippingACount), booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
| Var1 | Var2 | Freq |
|---|---|---|
| 0 | First Class | 26513 |
| 1 | First Class | 1301 |
| 0 | Same Day | 5283 |
| 1 | Same Day | 4454 |
| 0 | Second Class | 8229 |
| 1 | Second Class | 26987 |
| 0 | Standard Class | 66729 |
| 1 | Standard Class | 41023 |
Shipping Mode by Late Delivery Risk (%)
library(plotly)
library(ggplot2)
library(kableExtra)
Shipping_D <- table(scmD$LateDeliveryRisk, scmD$ShippingMode)
Shipping_D
First Class Same Day Second Class Standard Class
0 26513 5283 8229 66729
1 1301 4454 26987 41023
Shipping_P <- prop.table(Shipping_D, margin = 2) %>% {. * 100} %>% round(1)
Splot = as.data.frame(Shipping_P)
kbl(as.data.frame(Splot), booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
| Var1 | Var2 | Freq |
|---|---|---|
| 0 | First Class | 95.3 |
| 1 | First Class | 4.7 |
| 0 | Same Day | 54.3 |
| 1 | Same Day | 45.7 |
| 0 | Second Class | 23.4 |
| 1 | Second Class | 76.6 |
| 0 | Standard Class | 61.9 |
| 1 | Standard Class | 38.1 |
Ontime_filter <- filter(Splot, Var1 == "0")
Late_filter <- filter(Splot, Var1 == "1")
ShippingMode <- c("Same Day", "First Class", "Second Class", "Standard Class")
OnTime <- c(54.3, 95.3, 23.4, 61.9)
Late <- c(45.7, 4.7, 76.6, 38.1)
data <- data.frame(ShippingMode, OnTime, Late)
fig <- plot_ly(data, x = ~ShippingMode, y = ~OnTime, type = 'bar',
name = 'On Time', text = ~OnTime,
textposition = 'auto', marker = list(color = 'blue'))
fig <- fig %>% add_trace(y = ~Late, name = 'Late',
text = ~Late, textposition = 'auto',
marker = list(color = 'red'))
fig <- fig %>% layout(title = "Late Deliveries by Class",
yaxis = list(title = ''), barmode = 'group',
xaxis = list(title = ''))
fig
In the visualization we can assume a difference in reliability between shipping modes. The plot gives an overview but we need to confirm the initial findings with statistically significant numbers.
fig <- plot_ly(alpha = 0.7)
fig <- fig %>% add_histogram(x = ~scmD$DaysForShipmentScheduled,
name = 'Scheduled',
marker = list(color = 'red',
line = list(color ='white', width = 1),
histnorm = "probability"))
fig <- fig %>% add_histogram(x = ~scmD$DaysForShippingReal + 1,
name = 'Real',
marker = list(color = 'blue',
line = list(color ='white', width = 1),
histnorm = "probability"))
fig <- fig %>% layout(barmode = "overlay",
title = "Scheduled vs Real Deliveries",
yaxis = list(title = ''),
xaxis = list(title = 'Days'))
fig
The histograms show that there are some disparities between planned and actual shipping time.
GLM Benefits/Order by Shipping Mode
lmtest <- glm(BenefitPerOrder ~ ShippingMode,
data = scmD)
summary(lmtest)
Call:
glm(formula = BenefitPerOrder ~ ShippingMode, data = scmD)
Deviance Residuals:
Min 1Q Median 3Q Max
-4296.3 -15.0 9.5 42.8 888.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.1222 0.6262 36.925 <2e-16 ***
ShippingModeSame Day -2.2720 1.2297 -1.848 0.0647 .
ShippingModeSecond Class -1.8163 0.8377 -2.168 0.0301 *
ShippingModeStandard Class -1.1231 0.7024 -1.599 0.1098
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 10906.18)
Null deviance: 1968794529 on 180518 degrees of freedom
Residual deviance: 1968729773 on 180515 degrees of freedom
AIC: 2190597
Number of Fisher Scoring iterations: 2
BenefitPerOrder = (-2.2720 (Same Day)) + (-1.8163 (Second Class)) + (-1.1231 (Standard Class))
Shipping_D <- table(scmD$ShippingMode, scmD$LateDeliveryRisk)
Shipping_D
0 1
First Class 26513 1301
Same Day 5283 4454
Second Class 8229 26987
Standard Class 66729 41023
GLM Late Delivery
lmTest <- glm(LateDeliveryRisk ~ DaysForShipmentScheduled +
DaysForShippingReal + OrderRegion,
data = scmD, family = binomial)
summary(lmTest)
Call:
glm(formula = LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal +
OrderRegion, family = binomial, data = scmD)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.8536 -0.1066 -0.0189 0.1315 3.3237
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.789296 0.160162 -11.172 <2e-16 ***
DaysForShipmentScheduled -3.410749 0.019343 -176.330 <2e-16 ***
DaysForShippingReal 3.351885 0.017376 192.898 <2e-16 ***
OrderRegionCaribbean 0.010826 0.165471 0.065 0.948
OrderRegionCentral Africa 0.277987 0.191455 1.452 0.147
OrderRegionCentral America 0.031801 0.160671 0.198 0.843
OrderRegionCentral Asia -0.108660 0.235419 -0.462 0.644
OrderRegionEast Africa 0.034404 0.187241 0.184 0.854
OrderRegionEast of USA 0.047875 0.166974 0.287 0.774
OrderRegionEastern Asia 0.066079 0.166905 0.396 0.692
OrderRegionEastern Europe 0.103393 0.173695 0.595 0.552
OrderRegionNorth Africa 0.006107 0.176344 0.035 0.972
OrderRegionNorthern Europe 0.008382 0.164663 0.051 0.959
OrderRegionOceania 0.208447 0.164103 1.270 0.204
OrderRegionSouth America 0.121481 0.162525 0.747 0.455
OrderRegionSouth Asia 0.038480 0.166473 0.231 0.817
OrderRegionSouth of USA 0.109999 0.173152 0.635 0.525
OrderRegionSoutheast Asia 0.081984 0.164903 0.497 0.619
OrderRegionSouthern Africa -0.260416 0.206244 -1.263 0.207
OrderRegionSouthern Europe 0.021181 0.164639 0.129 0.898
OrderRegionUS Center 0.070206 0.168067 0.418 0.676
OrderRegionWest Africa -0.033489 0.172620 -0.194 0.846
OrderRegionWest Asia 0.100183 0.168273 0.595 0.552
OrderRegionWest of USA 0.103690 0.165928 0.625 0.532
OrderRegionWestern Europe 0.067176 0.160760 0.418 0.676
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 244190 on 180518 degrees of freedom
Residual deviance: 64178 on 180494 degrees of freedom
AIC: 64228
Number of Fisher Scoring iterations: 7
ggplot(scmD,
aes(DaysForShippingReal, DaysForShipmentScheduled,
color = as.factor(LateDeliveryRisk))) +
geom_point(size = 3) +
theme_minimal()
The visualization gives a rough overview between the differences in Scheduled vs Real Shipping Days.
Compared to our standard linear model, we are now including ShippingMode, not as a predictor, but as another level into our model, to get more accurate results. We use glmer, a command for binary response variables, which is part of lme4.
library(lme4)
intMod <- lme4::glmer(LateDeliveryRisk ~ 1 + (1|ShippingMode),
data = scmD, family = binomial)
summary(intMod)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula: LateDeliveryRisk ~ 1 + (1 | ShippingMode)
Data: scmD
AIC BIC logLik deviance df.resid
205457.4 205477.6 -102726.7 205453.4 180517
Scaled residuals:
Min 1Q Median 3Q Max
-1.8108 -0.7841 -0.2216 1.0891 4.5124
Random effects:
Groups Name Variance Std.Dev.
ShippingMode (Intercept) 2.305 1.518
Number of obs: 180519, groups: ShippingMode, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6208 0.1201 -5.169 2.35e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A statistically significant p-value indicates that there is some relationship between LateDeliveryRisk and Shipping Modes.
We can add a predictor variable to the model:
riMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal +
DaysForShipmentScheduled + (1|ShippingMode),
data = scmD,
family = binomial)
summary(riMod)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula:
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +
(1 | ShippingMode)
Data: scmD
AIC BIC logLik deviance df.resid
59228.3 59268.7 -29610.2 59220.3 180515
Scaled residuals:
Min 1Q Median 3Q Max
-1030.04 -0.22 0.00 0.05 4.52
Random effects:
Groups Name Variance Std.Dev.
ShippingMode (Intercept) 2.633 1.623
Number of obs: 180519, groups: ShippingMode, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.54855 0.39476 -3.923 8.75e-05 ***
DaysForShippingReal 4.17641 0.02394 174.456 < 2e-16 ***
DaysForShipmentScheduled -4.21985 0.33893 -12.451 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) DysFSR
DysFrShppnR 0.027
DysFrShpmnS -0.462 -0.071
Random effects: The standard deviation in random effects tells how much the scores move around based on the variable ShippingMode. Following the output, this score can move by 1.6 points based upon ShippingMode. This confirms that there are differences, in this case in terms of delivery time, between shipping methods.
Standard Errors: By integrating information about the groups, we are getting a better sense of how much uncertainty our model contains at the global average level. The standard errors changed from the previous model by including information that impact the groups. Thanks to that we get a better understanding of the uncertainty in the model.
Correlation of Fixed Effects: When groups are largely balanced, we would find that our coefficients would be the same (or very close to it). The groups in the correlation of fixed effects look like they are not balanced. There is no similarity in value.
Below, we can get a good sense of the random effect estimates:
MuMIn::r.squaredGLMM(riMod)
R2m R2c
theoretical 0.8797382 0.9331974
delta 0.8640148 0.9165185
We get two values the marginal R2 (R2m) and the conditional R2 (R2c). Marginal values are the standard type of R², the variability explained by fixed effects in the model. The Conditional R² uses both fixed and random effects.
In this case, we would see that we are accounting for about 5.3% of the variation on the theoretical level, while delta gives us a variation of about 5.25%. Because we are a binomial distribution we get both these variance levels.
We can plot simulated random effect ranges for each of the random effect groups, Shipping Modes:
library(merTools)
plotREsim(REsim(riMod), labs = TRUE)
The values are not actual values, but rather the difference between the general intercept or slope value found in the model summary and the estimate for this specific level of random effect.
PlotREsim highlights group levels that have a simulated distribution that does not overlap 0 – these appear darker. The lighter bars represent grouping levels that are not distinguishable from 0 in the data. The values in the above plot are effects for each individual random effect, in this case for each Shipping Mode. We can observe a broad range of Shipping Modes based on Late Delivery Risk. The point for First Class stands out. The plot above confirms our initial visualization where we showed a distribution by Shipping Mode in terms of being on time or late.
We can get exact values shown in the plot:
ranks <- expectedRank(riMod)
head(ranks)
groupFctr groupLevel term estimate std.error
2 ShippingMode First Class Intercept 2.7964902 0.0008067361
3 ShippingMode Same Day Intercept -0.8006355 0.0010413073
4 ShippingMode Second Class Intercept -1.1955168 0.0006402459
5 ShippingMode Standard Class Intercept -0.8004332 0.0002263543
ER pctER
2 4.000000 88
3 2.497734 50
4 1.000000 12
5 2.502266 50
Below we are getting predicted values from our mixed model:
mixedPred <- predict(riMod, type = 'response')
slimPred <- predict(lmTest, type = 'response')
allPred <- cbind(actual = scmD$LateDeliveryRisk,
mixed = mixedPred,
slim = slimPred)
head(allPred, 10)
actual mixed slim
1 0 0.04673682 0.005525557
2 0 0.04673682 0.005525557
3 0 0.04673682 0.005525557
4 0 0.04673682 0.005375950
5 0 0.04673682 0.005375950
6 0 0.04673682 0.005375950
7 0 0.04673682 0.005375950
8 0 0.04673682 0.005375950
9 0 0.04673682 0.005525557
10 0 0.04673682 0.005525557
While there were cases where the standard linear model did a slightly better job, our mixed model generally did better.
We can plot these to make it clearer:
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
We can add additional information to the model with the goal to get more accurate results. In the below model we also have a building variable. We can add this additional grouping into our model:
clusterMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal +
DaysForShipmentScheduled +
(1|ShippingMode) + (1|Type),
data = scmD, family = binomial)
summary(clusterMod)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula:
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +
(1 | ShippingMode) + (1 | Type)
Data: scmD
AIC BIC logLik deviance df.resid
58180.5 58231.0 -29085.2 58170.5 180514
Scaled residuals:
Min 1Q Median 3Q Max
-895.06 -0.24 0.00 0.05 6.23
Random effects:
Groups Name Variance Std.Dev.
ShippingMode (Intercept) 2.8019 1.6739
Type (Intercept) 0.1634 0.4043
Number of obs: 180519, groups: ShippingMode, 4; Type, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.55443 0.25697 -6.049 1.46e-09 ***
DaysForShippingReal 4.26883 0.02474 172.568 < 2e-16 ***
DaysForShipmentScheduled -4.31347 0.23642 -18.245 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) DysFSR
DysFrShppnR -0.016
DysFrShpmnS -0.137 -0.054
Plot the effect ranges again
plotREsim(REsim(clusterMod), labs = TRUE)
The plot confirms, and further adds to the previous model.
We can find out if the predictions improved:
clusterPredict <- predict(clusterMod, type = 'response')
allPred <- cbind(actual = scmD$LateDeliveryRisk,
clustered = clusterPredict,
mixed = mixedPred,
slim = slimPred)
head(allPred, 10)
actual clustered mixed slim
1 0 0.05546389 0.04673682 0.005525557
2 0 0.02512777 0.04673682 0.005525557
3 0 0.05450154 0.04673682 0.005525557
4 0 0.05546389 0.04673682 0.005375950
5 0 0.02512777 0.04673682 0.005375950
6 0 0.05546389 0.04673682 0.005375950
7 0 0.05450154 0.04673682 0.005375950
8 0 0.05546389 0.04673682 0.005375950
9 0 0.05546389 0.04673682 0.005525557
10 0 0.05501780 0.04673682 0.005525557
Thanks to the plots above its easy to recognize that we have slight improvements in our clustered model:
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
plot(allPred[, "actual"], allPred[, "clustered"])
Hierarchical models are a little bit different than the previous ones. In the following models we will have groups nested within other groups. We know that we have a Shipping Mode variable and a Type (payment type) variable.
Within the parentheses, we have our intercept as before, but we are also saying that we are looking at the Shipping Modes nested within Type.
hierMod <- lme4::glmer(LateDeliveryRisk ~ DaysForShippingReal +
DaysForShipmentScheduled +
(1|ShippingMode/Type),
data = scmD, family = binomial)
summary(hierMod)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula:
LateDeliveryRisk ~ DaysForShippingReal + DaysForShipmentScheduled +
(1 | ShippingMode/Type)
Data: scmD
AIC BIC logLik deviance df.resid
51785.9 51836.5 -25888.0 51775.9 180514
Scaled residuals:
Min 1Q Median 3Q Max
-792.26 -0.03 0.00 0.03 2.24
Random effects:
Groups Name Variance Std.Dev.
Type:ShippingMode (Intercept) 3.856e+00 1.9636970
ShippingMode (Intercept) 9.150e-07 0.0009566
Number of obs: 180519, groups:
Type:ShippingMode, 16; ShippingMode, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.74718 0.45200 -6.078 1.22e-09 ***
DaysForShippingReal 4.53604 0.02709 167.425 < 2e-16 ***
DaysForShipmentScheduled -4.57365 0.22253 -20.553 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) DysFSR
DysFrShppnR -0.030
DysFrShpmnS -0.552 -0.092
We have our fixed effect for observation and we have individual intercepts for both Type and Shipping Mode nested within Type. Looking at the standard deviation of our random effects, we can see that the scores differ by about 2 from Type to Shipping Mode.
A look at the effect ranges:
plotREsim(REsim(hierMod), labs = TRUE)
For the effect ranges we get again a similar pattern as before, with some slight adjustments.
We can compare the predictions below:
hierModPredict <- predict(hierMod, type = 'response')
allPred <- cbind(actual = scmD$LateDeliveryRisk,
hierMod = hierModPredict,
clustered = clusterPredict,
mixed = mixedPred,
slim = slimPred)
head(allPred, 10)
actual hierMod clustered mixed slim
1 0 5.782315e-05 0.05546389 0.04673682 0.005525557
2 0 1.663266e-01 0.02512777 0.04673682 0.005525557
3 0 1.340084e-04 0.05450154 0.04673682 0.005525557
4 0 5.782315e-05 0.05546389 0.04673682 0.005375950
5 0 1.663266e-01 0.02512777 0.04673682 0.005375950
6 0 5.782315e-05 0.05546389 0.04673682 0.005375950
7 0 1.340084e-04 0.05450154 0.04673682 0.005375950
8 0 5.782315e-05 0.05546389 0.04673682 0.005375950
9 0 5.782315e-05 0.05546389 0.04673682 0.005525557
10 0 8.430066e-05 0.05501780 0.04673682 0.005525557
And plot to look at the differences quickly:
plot(allPred[, "actual"], allPred[, "slim"])
plot(allPred[, "actual"], allPred[, "mixed"])
plot(allPred[, "actual"], allPred[, "clustered"])
plot(allPred[, "actual"], allPred[, "hierMod"])
We can see that the Hierarchical Model is the most accurate in terms of performance.
At this point we can add random slopes. In the below model, we will define a random intercept (DaysForShippingReal|ShippingMode) and a random slope. We want to see differences in the Shipping Modes in terms of the intercept and slope.
randomSlopes <- lme4::glmer(LateDeliveryRisk ~ DaysForShipmentScheduled +
DaysForShippingReal +
(DaysForShippingReal|ShippingMode),
data = scmD, family = binomial)
summary(randomSlopes)
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: binomial ( logit )
Formula:
LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal +
(DaysForShippingReal | ShippingMode)
Data: scmD
AIC BIC logLik deviance df.resid
54571.1 54631.7 -27279.5 54559.1 180513
Scaled residuals:
Min 1Q Median 3Q Max
-52.695 -0.211 -0.001 0.062 4.515
Random effects:
Groups Name Variance Std.Dev. Corr
ShippingMode (Intercept) 52.98 7.279
DaysForShippingReal 21.87 4.676 -1.00
Number of obs: 180519, groups: ShippingMode, 4
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.05646 0.38099 5.398 6.75e-08 ***
DaysForShipmentScheduled -6.42557 0.09835 -65.334 < 2e-16 ***
DaysForShippingReal 5.30310 0.31566 16.800 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) DysFSS
DysFrShpmnS -0.146
DysFrShppnR -0.662 -0.604
optimizer (Nelder_Mead) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
We can observe some slight changes in terms of our fixed effects. We also get additional information for the random effects. The random intercept standard deviation for ShippingMode is telling us the amount that the scores bounce around from place to place and the DaysForShippingReal variance is telling us how much variability there is within the slope between locations.
A plot of the above model:
library(kableExtra)
scmD$ShippingMode <- as.factor(scmD$ShippingMode)
scmd_ship <- scmD %>%
group_by(ShippingMode, LateDeliveryRisk) %>%
summarise(mean_days = mean(DaysForShippingReal, na.rm = TRUE))
kbl(as.data.frame(scmd_ship), booktabs = T) %>%
kable_styling(full_width = F) %>%
column_spec(1, width = "4cm")
| ShippingMode | LateDeliveryRisk | mean_days |
|---|---|---|
| First Class | 0 | 1.0000000 |
| First Class | 1 | 1.0000000 |
| Same Day | 0 | 0.0384251 |
| Same Day | 1 | 1.0000000 |
| Second Class | 0 | 2.3315105 |
| Second Class | 1 | 4.4967948 |
| Standard Class | 0 | 3.0670323 |
| Standard Class | 1 | 5.5068376 |
scmD$ShippingMode <- factor(scmD$ShippingMode, levels = c("Standard Class", "Second Class", "First Class", "Same Day"))
ggplot(scmd_ship, aes(LateDeliveryRisk, mean_days,
group = ShippingMode, color = ShippingMode)) +
geom_line(linetype = 'dashed', size = 1) +
geom_point (size = 4) +
labs(title = " ",
x = "Delivery",
y = "Shipping Days",
color = " ") +
scale_y_continuous(breaks=c(1,2,3,4,5)) +
scale_x_continuous(breaks=c(0, 0.25, 0.5, 0.75, 1),
labels=c("On Time", " ", " ", " ", "Late")) +
#theme_bw() +
theme_classic() +
scale_fill_manual(name=" ")
We can see a clear difference in the slopes between the different Shipping Modes. The steeper the slope, the greater the risk in terms of days having a late delivery. It becomes clear that First Class Shipping is the most reliable in terms of risk and late deliveries, followed by Same Day Shipping. A steeper slope indicates a broader range of Late Deliveries.
Let´s compare both Models and see if we got more accuracy
ri1 <- lme4::glmer(LateDeliveryRisk ~ DaysForShipmentScheduled +
DaysForShippingReal + (1 | ShippingMode),
data = scmD, family = binomial)
ri2 <- lme4::glmer(LateDeliveryRisk ~ DaysForShipmentScheduled +
DaysForShippingReal + (DaysForShippingReal|ShippingMode),
data = scmD, family = binomial)
anova(ri1, ri2)
Data: scmD
Models:
ri1: LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal +
ri1: (1 | ShippingMode)
ri2: LateDeliveryRisk ~ DaysForShipmentScheduled + DaysForShippingReal +
ri2: (DaysForShippingReal | ShippingMode)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
ri1 4 59228 59269 -29610 59220
ri2 6 54571 54632 -27280 54559 4661.2 2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, we can compare both, Random Intercept and Random intercept with slopes models. From the results we can observe the best performance in the second, hierarchical model.
Mixed Effect Models confirm the differences within shipping modes. In detail, this is explained by the variation seen in the models. There is a statistically significant relationship between LateDeliveryRisk and DaysForShippingReal. Adding random effects to the model confirms the differences in delivery times between shipping modes. On average, deliveries move around by 1.6 days. By looking at the correlation of fixed effects we confirm that the different shipping modes are not balanced
First Class is the most reliable shipping mode. Almost half of Same Day deliveries are late by one day. Standard Class and Second Class Deliveries account for most deliveries and face the biggest risk to be delivered late by about 2.5 days
Same Day: 5.4% of total deliveries are shipped with Same Day. Almost half (46%) of deliveries are late. The average delay is one day
First Class: 15.4% of total deliveries are shipped with First Class. It´s the most reliable Shipping Method (95% on Time)
Second Class: 19.5% of total deliveries are shipped with Second Class. About 76% of all deliveries are late. The average delay is about two days
Standard Class: 59.7% of total deliveries are shipped with Standard Class. Of these deliveries, 39% are late. The average delay is about 2.5 days
Further analysis on operational efficiency and in-house supply chain: Are late deliveries caused by internal or external factors? Are certain geographies more affected by late deliveries? Are specific products more likely to be delivered late? Are there problems with transportation routes, product sizes, batching etc?
After further analysis of the supply chain, and assuming there are no internally driven problems causing late deliveries: DataCo Global should find a more reliable Shipping Partner and shift most deliveries to First Class and Same Day.
The company DataCo Global is shipping the majority of their goods with Standard and Second Class. Both these Shipping Methods were proved to cause the most late deliveries when compared to First Class Shipping. Further analysis on operational efficiency and in-house supply chain is required to figure out whether late deliveries are caused by internal or external factors. In detail, further research on costs concerning the different shipping modes, batching, product weight, and transportation routes are required. DataCo Global needs to make sure that its internal supply chain is not hindering shipping times and late deliveries.
Overall, First Class Shipping Mode is the most reliable and amounts to least Late Deliveries. To increase customer satisfaction, DataCo Global needs to make a change in Shipping Modes and shift the majority of their deliveries to First Class shipping. This will likely increase costs. DataCo Global should partner with a Shipping Company to get exclusive rates for First Class Shipping.
DataCo Global can then offer a Premium Membership in exchange for a yearly fee from their customer. The Premium Membership should include unlimited First Class and Same Day Deliveries. If DataCo Global incurs highers costs due to offering more expensive shipping rates at a premium for customers, we suggest the following:
Close a deal with a Shipping Company such as UPS or FedEx to get exclusive shipping rates. Company should fit DataCo Global´s needs for deliveries and be a reliable partner.
Similar the way Amazon offers AmazonBasics to its customers, DataCo should introduce its own branded products to customers (potentially named DataCo Essentials). These products would be offered at competitive prices, giving customers the ability to purchase private-label products directly from DataCo, allowing the firm to help offset higher costs it may incur from offering premium shipping at a discounted or more affordable price to its customers.
Some of DataCo Globals most realistic competitors are Amazon or Alibaba. Amazon is present in almost every part of the world, whereas Alibaba dominates a large share of the ecommerce marketplace in China and the Western half of the hemisphere. DataCo Global should identify niches where Amazon and Alibaba are not present and establish an early relationship with customers in such areas. Currently, Amazon only offers Amazon Prime (Same day shipping option) to customers in the U.S., United Kingdom, Japan, Germany and Canada. However, customers in India, France, Italy and Spain do not have the option for Amazon Same Day services. It would be a savvy move for DataCo Global to enter the market in these countries, and establish relationships with these customers by offering its products along with Same Day and First Class shipping.
Premium Memberships are going to be an additional expense for the consumer. However, they will add value to the company and give the consumer a more exclusive feeling when shopping on the website. The Premium Membership will boost customer satisfaction, also thanks to on time deliveries and a reliable service and experience.