Research Question: Does late delivery reduce customer satisfaction on the Olist platform — and does that effect hold up once we account for other plausible explanations — and where should Olist target operational improvement first?
This report analyses ~97,000 delivered orders across the Olist Brazilian e-commerce dataset (2016–2018), joining eight relational tables — orders, reviews, customers, order items, products, category translations, sellers, and payments — into a unified analytical frame. The headline finding: late deliveries are associated with a 1.74-point drop in average review score (on a 1–5 scale), concentrated in specific Brazilian states and product categories. We then go a step further and test whether this effect survives controlling for distance, price, freight, item count, weight, listing quality, payment installments, and product category, using both linear regression and random forest models. It does — delay remains the dominant predictor of dissatisfaction in both models, and a random forest reveals the effect is a threshold (a sharp jump right at the on-time/late boundary, then a plateau) rather than a steady slope, which has direct operational implications for where Olist should focus first.
All analysis uses the R tidyverse ecosystem. Charts are produced with
ggplot2. A consistent colour palette (#2980b9
for primary, #e74c3c for alert, #27ae60 for
positive) is used throughout.
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(scales)
library(kableExtra)
library(patchwork)
library(maps)
library(ranger)# Shared visual identity
col_primary <- "#2980b9"
col_late <- "#e74c3c"
col_ontime <- "#27ae60"
col_neutral <- "#95a5a6"
col_accent <- "#f39c12"
theme_olist <- theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 14, colour = "#2c3e50"),
plot.subtitle = element_text(colour = "#7f8c8d", size = 11),
plot.caption = element_text(colour = "#95a5a6", size = 9, hjust = 0),
axis.title = element_text(colour = "#555555", size = 11),
axis.text = element_text(colour = "#555555"),
panel.grid.major = element_line(colour = "#ecf0f1"),
panel.grid.minor = element_blank(),
legend.position = "top",
strip.text = element_text(face = "bold")
)Six CSV files are loaded from the project directory. We perform basic structural checks immediately after import.
data_path <- "data/"
orders <- read.csv(paste0(data_path, "olist_orders_dataset.csv"), stringsAsFactors = FALSE)
reviews <- read.csv(paste0(data_path, "olist_order_reviews_dataset.csv"), stringsAsFactors = FALSE)
customers <- read.csv(paste0(data_path, "olist_customers_dataset.csv"), stringsAsFactors = FALSE)
items <- read.csv(paste0(data_path, "olist_order_items_dataset.csv"), stringsAsFactors = FALSE)
products <- read.csv(paste0(data_path, "olist_products_dataset.csv"), stringsAsFactors = FALSE)
cat_trans <- read.csv(paste0(data_path, "product_category_name_translation.csv"),
stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
# Normalise BOM-affected column name (appears as "X...product_category_name" on some systems)
if (!"product_category_name" %in% names(cat_trans)) {
names(cat_trans)[1] <- "product_category_name"
}
sellers <- read.csv(paste0(data_path, "olist_sellers_dataset.csv"), stringsAsFactors = FALSE)
payments <- read.csv(paste0(data_path, "olist_order_payments_dataset.csv"), stringsAsFactors = FALSE)
cat("Tables loaded successfully.\n")## Tables loaded successfully.
table_info <- data.frame(
Table = c("orders", "reviews", "customers", "items", "products", "cat_trans", "sellers", "payments"),
Rows = c(nrow(orders), nrow(reviews), nrow(customers), nrow(items), nrow(products), nrow(cat_trans), nrow(sellers), nrow(payments)),
Columns = c(ncol(orders), ncol(reviews), ncol(customers), ncol(items), ncol(products), ncol(cat_trans), ncol(sellers), ncol(payments)),
Key_Column = c("order_id", "order_id", "customer_id", "order_id", "product_id", "product_category_name", "seller_id", "order_id")
)
table_info %>%
kbl(caption = "Table 1. Dataset dimensions at import") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(2, bold = TRUE)| Table | Rows | Columns | Key_Column |
|---|---|---|---|
| orders | 99441 | 8 | order_id |
| reviews | 100000 | 7 | order_id |
| customers | 99441 | 5 | customer_id |
| items | 112650 | 7 | order_id |
| products | 32951 | 9 | product_id |
| cat_trans | 71 | 2 | product_category_name |
| sellers | 3095 | 4 | seller_id |
| payments | 103886 | 5 | order_id |
na_summary <- data.frame(
Table = "orders",
Column = names(orders),
NA_Count = colSums(is.na(orders) | orders == "")
)
# Key columns for the analysis
key_cols <- data.frame(
Column = c("order_delivered_customer_date",
"order_estimated_delivery_date",
"order_purchase_timestamp",
"review_score",
"customer_state"),
Source_Table = c("orders","orders","orders","reviews","customers"),
NA_Count = c(
sum(is.na(orders$order_delivered_customer_date) | orders$order_delivered_customer_date == ""),
sum(is.na(orders$order_estimated_delivery_date) | orders$order_estimated_delivery_date == ""),
sum(is.na(orders$order_purchase_timestamp) | orders$order_purchase_timestamp == ""),
sum(is.na(reviews$review_score)),
sum(is.na(customers$customer_state) | customers$customer_state == "")
)
)
key_cols %>%
kbl(caption = "Table 2. Missing values in analytically critical columns") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
column_spec(3, color = ifelse(key_cols$NA_Count > 0, "#e74c3c", "#27ae60"), bold = TRUE)| Column | Source_Table | NA_Count |
|---|---|---|
| order_delivered_customer_date | orders | 2965 |
| order_estimated_delivery_date | orders | 0 |
| order_purchase_timestamp | orders | 0 |
| review_score | reviews | 0 |
| customer_state | customers | 0 |
status_tbl <- as.data.frame(table(orders$order_status))
names(status_tbl) <- c("Status", "Count")
status_tbl$Pct <- round(status_tbl$Count / sum(status_tbl$Count) * 100, 1)
status_tbl <- status_tbl[order(-status_tbl$Count), ]
status_tbl %>%
kbl(caption = "Table 3. Order status breakdown — analysis restricted to 'delivered'") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(status_tbl$Status == "delivered"), bold = TRUE, background = "#eafaf1")| Status | Count | Pct | |
|---|---|---|---|
| 4 | delivered | 96478 | 97.0 |
| 7 | shipped | 1107 | 1.1 |
| 2 | canceled | 625 | 0.6 |
| 8 | unavailable | 609 | 0.6 |
| 5 | invoiced | 314 | 0.3 |
| 6 | processing | 301 | 0.3 |
| 3 | created | 5 | 0.0 |
| 1 | approved | 2 | 0.0 |
Timestamps are parsed, delivery delay is computed as (actual delivery date − estimated delivery date) in days. A positive value means the parcel arrived after the promised date.
# Parse all timestamps
orders$order_purchase_timestamp <- as.POSIXct(orders$order_purchase_timestamp, format="%Y-%m-%d %H:%M:%S")
orders$order_delivered_customer_date <- as.POSIXct(orders$order_delivered_customer_date, format="%Y-%m-%d %H:%M:%S")
orders$order_estimated_delivery_date <- as.POSIXct(orders$order_estimated_delivery_date, format="%Y-%m-%d %H:%M:%S")
orders$order_approved_at <- as.POSIXct(orders$order_approved_at, format="%Y-%m-%d %H:%M:%S")
# Restrict to delivered orders with valid date information
delivered <- orders %>%
filter(order_status == "delivered",
!is.na(order_delivered_customer_date),
!is.na(order_estimated_delivery_date),
!is.na(order_purchase_timestamp)) %>%
mutate(
delay_days = as.numeric(difftime(order_delivered_customer_date,
order_estimated_delivery_date, units = "days")),
actual_days = as.numeric(difftime(order_delivered_customer_date,
order_purchase_timestamp, units = "days")),
late = delay_days > 0,
delay_bucket = case_when(
delay_days <= -7 ~ "7+ days early",
delay_days <= 0 ~ "0–7 days early",
delay_days <= 7 ~ "1–7 days late",
delay_days <= 14 ~ "8–14 days late",
TRUE ~ "15+ days late"
),
purchase_month = floor_date(order_purchase_timestamp, "month")
)
cat(sprintf("Delivered orders retained for analysis: %d\n", nrow(delivered)))## Delivered orders retained for analysis: 96467
cat(sprintf("Orders with positive delay (late): %d (%.1f%%)\n",
sum(delivered$late), mean(delivered$late)*100))## Orders with positive delay (late): 7824 (8.1%)
We join seven tables into a single analytical frame:
orders → reviews → customers → items → products → category translations → sellers.
An eighth table (payments) is incorporated later for the
causal modelling section.
# ── Join 1: orders + reviews ──────────────────────────────────────────────────
df <- delivered %>%
inner_join(
reviews %>% select(order_id, review_score),
by = "order_id"
)
# ── Join 2: + customers ───────────────────────────────────────────────────────
df <- df %>%
inner_join(
customers %>% select(customer_id, customer_state, customer_city),
by = "customer_id"
)
# ── Join 3: items + products + category translation ───────────────────────────
items_cat <- items %>%
select(order_id, product_id) %>%
inner_join(products %>% select(product_id, product_category_name, product_weight_g),
by = "product_id") %>%
inner_join(cat_trans, by = "product_category_name") %>%
# Keep only the first item per order to avoid duplication
distinct(order_id, .keep_all = TRUE) %>%
rename(category_en = product_category_name_english)
# ── Join 4: + category info ───────────────────────────────────────────────────
df <- df %>%
left_join(items_cat %>% select(order_id, category_en, product_weight_g),
by = "order_id")
# ── Join 5: + seller location (first item's seller per order) ──────────────────
items_seller <- items %>%
select(order_id, seller_id) %>%
distinct(order_id, .keep_all = TRUE) %>%
inner_join(sellers %>% select(seller_id, seller_state), by = "seller_id")
df <- df %>%
left_join(items_seller %>% select(order_id, seller_state), by = "order_id")
cat(sprintf("Final analytical frame: %d rows × %d columns\n", nrow(df), ncol(df)))## Final analytical frame: 97004 rows × 19 columns
Join Lineage
orders → (order_id) →
reviews
orders → (customer_id) →
customers
items → (product_id) →
products → (product_category_name) → cat_trans
→ (order_id) → main frame
items → (seller_id) →
sellers → (order_id) → main frame
All joins are
inner joins except the category and seller enrichment steps (left joins,
to preserve orders with no matching item record).
To analyse where delays happen geographically — not just which state — we attach an approximate latitude/longitude centroid (state capital coordinates, a standard convention) to each Brazilian state, then compute the great-circle (haversine) distance between the seller’s state and the customer’s state for every order. This gives a continuous “shipping distance” variable, independent of the late/on-time label.
# Approximate centroids (state capital coordinates) — static, well-known geographic reference data
state_coords <- data.frame(
state = c("AC","AL","AP","AM","BA","CE","DF","ES","GO","MA","MT","MS","MG",
"PA","PB","PR","PE","PI","RJ","RN","RS","RO","RR","SC","SP","SE","TO"),
lat = c(-9.97,-9.65,0.04,-3.10,-12.97,-3.72,-15.78,-20.32,-16.68,-2.53,-15.60,
-20.44,-19.92,-1.46,-7.12,-25.43,-8.05,-5.09,-22.91,-5.79,-30.03,-8.76,
2.82,-27.60,-23.55,-10.91,-10.25),
lon = c(-67.81,-35.70,-51.07,-60.02,-38.51,-38.54,-47.93,-40.34,-49.25,-44.30,
-56.10,-54.65,-43.94,-48.50,-34.86,-49.27,-34.90,-42.80,-43.17,-35.21,
-51.23,-63.90,-60.67,-48.55,-46.63,-37.07,-48.32)
)
haversine_km <- function(lat1, lon1, lat2, lon2) {
R <- 6371 # Earth radius in km
rad <- pi / 180
dlat <- (lat2 - lat1) * rad
dlon <- (lon2 - lon1) * rad
a <- sin(dlat/2)^2 + cos(lat1*rad) * cos(lat2*rad) * sin(dlon/2)^2
2 * R * asin(sqrt(a))
}
df <- df %>%
left_join(state_coords %>% rename(seller_lat = lat, seller_lon = lon),
by = c("seller_state" = "state")) %>%
left_join(state_coords %>% rename(cust_lat = lat, cust_lon = lon),
by = c("customer_state" = "state")) %>%
mutate(
distance_km = haversine_km(seller_lat, seller_lon, cust_lat, cust_lon),
distance_bucket = cut(distance_km,
breaks = c(-1, 0, 500, 1000, 1500, 2000, 3000, 10000),
labels = c("Same state","1–500 km","500–1000 km",
"1000–1500 km","1500–2000 km","2000–3000 km","3000+ km"))
)
cat(sprintf("Orders with computed seller→customer distance: %d (%.1f%% of frame)\n",
sum(!is.na(df$distance_km)), mean(!is.na(df$distance_km))*100))## Orders with computed seller→customer distance: 97004 (100.0% of frame)
global <- df %>%
summarise(
n_orders = n(),
pct_late = round(mean(late, na.rm=TRUE)*100, 1),
avg_delay = round(mean(delay_days, na.rm=TRUE), 1),
median_delay = round(median(delay_days, na.rm=TRUE), 1),
avg_actual_days= round(mean(actual_days, na.rm=TRUE), 1),
avg_score = round(mean(review_score, na.rm=TRUE), 2),
avg_score_late = round(mean(review_score[late], na.rm=TRUE), 2),
avg_score_ontime= round(mean(review_score[!late], na.rm=TRUE), 2),
score_diff = round(mean(review_score[!late], na.rm=TRUE) -
mean(review_score[late], na.rm=TRUE), 2)
)
global %>%
tidyr::pivot_longer(everything(), names_to="Metric", values_to="Value") %>%
mutate(Metric = recode(Metric,
n_orders = "Total delivered orders",
pct_late = "% orders arriving late",
avg_delay = "Avg delay (days, negative = early)",
median_delay = "Median delay (days)",
avg_actual_days= "Avg actual delivery time (days)",
avg_score = "Overall avg review score",
avg_score_late = "Avg review score — late orders",
avg_score_ontime= "Avg review score — on-time orders",
score_diff = "Score gap (on-time minus late)"
)) %>%
kbl(caption = "Table 4. Global KPIs from the merged analytical frame") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(c(8,9), bold = TRUE, background = "#fef9e7")| Metric | Value |
|---|---|
| Total delivered orders | 97004.00 |
| % orders arriving late | 8.10 |
| Avg delay (days, negative = early) | -11.20 |
| Median delay (days) | -11.90 |
| Avg actual delivery time (days) | 12.60 |
| Overall avg review score | 4.14 |
| Avg review score — late orders | 2.55 |
| Avg review score — on-time orders | 4.28 |
| Score gap (on-time minus late) | 1.74 |
score_dist <- df %>%
mutate(
Delivery = ifelse(late, "Late", "On-Time"),
review_score = factor(review_score, levels = 5:1)
) %>%
count(review_score, Delivery)
ggplot(score_dist, aes(x = review_score, y = n, fill = Delivery)) +
geom_col(position = "dodge", width = 0.65, colour = "white") +
scale_fill_manual(values = c("Late" = col_late, "On-Time" = col_ontime),
name = "Delivery status") +
scale_y_continuous(labels = comma_format()) +
labs(
title = "Review Score Distribution: Late vs On-Time Orders",
subtitle = "Late deliveries collapse 5-star ratings and amplify 1-star scores",
x = "Review Score (stars)",
y = "Number of Orders",
caption = "Source: Olist order and review datasets (2016–2018)"
) +
theme_olistFigure 1. Review score volume split by delivery punctuality
We group orders into five delay categories and compute mean review score for each bucket. The gradient is decisive.
bucket_order <- c("7+ days early","0–7 days early","1–7 days late","8–14 days late","15+ days late")
bucket_summary <- df %>%
filter(!is.na(delay_bucket), !is.na(review_score)) %>%
mutate(delay_bucket = factor(delay_bucket, levels = bucket_order)) %>%
group_by(delay_bucket) %>%
summarise(
avg_score = mean(review_score, na.rm=TRUE),
n = n(),
se = sd(review_score, na.rm=TRUE) / sqrt(n)
)
ggplot(bucket_summary, aes(x = delay_bucket, y = avg_score, fill = avg_score)) +
geom_col(width = 0.65, colour = "white") +
geom_errorbar(aes(ymin = avg_score - 1.96*se, ymax = avg_score + 1.96*se),
width = 0.2, colour = "#2c3e50", linewidth = 0.5) +
geom_text(aes(label = sprintf("%.2f\n(n=%s)", avg_score, comma(n))),
vjust = -0.4, size = 3.3, colour = "#2c3e50", fontface = "bold") +
scale_fill_gradient(low = col_late, high = col_ontime, name = "Avg score", guide = "none") +
scale_y_continuous(limits = c(0, 5.4), breaks = 1:5) +
labs(
title = "Customer Satisfaction Degrades Monotonically with Delivery Lateness",
subtitle = "95% confidence intervals shown. Each bucket represents a distinct customer experience tier.",
x = "Delivery Timing Bucket",
y = "Average Review Score (1–5)",
caption = "Source: Olist orders and reviews datasets"
) +
theme_olist +
theme(axis.text.x = element_text(angle = 15, hjust = 1))Figure 2. Average review score by delivery delay category
Key observation: The score drop is not a cliff edge only at extreme lateness — it begins immediately. Even orders arriving 1–7 days late score 0.87 points lower than those arriving 0–7 days early. The pattern is monotonic: every extra day of lateness costs satisfaction.
monthly <- df %>%
filter(!is.na(purchase_month),
purchase_month >= as.POSIXct("2017-01-01"),
purchase_month < as.POSIXct("2019-01-01")) %>%
group_by(purchase_month) %>%
summarise(
pct_ontime = (1 - mean(late, na.rm=TRUE)) * 100,
avg_score = mean(review_score, na.rm=TRUE),
n = n()
) %>%
filter(n >= 50) # drop months with very few orders
p1 <- ggplot(monthly, aes(x = purchase_month)) +
geom_line(aes(y = pct_ontime), colour = col_ontime, linewidth = 1.2) +
geom_point(aes(y = pct_ontime, size = n), colour = col_ontime, alpha = 0.7) +
scale_y_continuous(limits = c(80, 100), labels = function(x) paste0(x, "%")) +
scale_size_continuous(range = c(2, 6), guide = "none") +
scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
labs(title = "On-Time Delivery Rate by Month",
x = NULL, y = "% on-time") +
theme_olist
p2 <- ggplot(monthly, aes(x = purchase_month)) +
geom_line(aes(y = avg_score), colour = col_primary, linewidth = 1.2) +
geom_point(aes(y = avg_score, size = n), colour = col_primary, alpha = 0.7) +
scale_y_continuous(limits = c(3.8, 4.5)) +
scale_size_continuous(range = c(2, 6), guide = "none") +
scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
labs(title = "Average Review Score by Month",
x = "Purchase Month", y = "Avg Review Score",
caption = "Point size ∝ order volume · Source: Olist datasets") +
theme_olist
p1 / p2Figure 3. Monthly delivery on-time rate and customer score (2017–2018)
state_summary <- df %>%
filter(!is.na(customer_state)) %>%
group_by(customer_state) %>%
summarise(
pct_late = mean(late, na.rm=TRUE) * 100,
avg_score = mean(review_score, na.rm=TRUE),
avg_delay = mean(delay_days, na.rm=TRUE),
n = n()
) %>%
filter(n >= 100) %>%
arrange(desc(pct_late)) %>%
mutate(customer_state = reorder(customer_state, pct_late))
ggplot(state_summary, aes(x = customer_state, y = pct_late)) +
geom_segment(aes(xend = customer_state, y = 0, yend = pct_late,
colour = pct_late), linewidth = 1.2) +
geom_point(aes(colour = pct_late, size = n)) +
geom_hline(yintercept = mean(state_summary$pct_late), linetype = "dashed",
colour = col_neutral, linewidth = 0.8) +
annotate("text", x = 3, y = mean(state_summary$pct_late) + 0.4,
label = paste0("National avg: ", round(mean(state_summary$pct_late),1), "%"),
colour = col_neutral, size = 3.2) +
scale_colour_gradient(low = col_ontime, high = col_late,
name = "% late", labels = function(x) paste0(x, "%")) +
scale_size_continuous(range = c(2, 7), name = "Order count") +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
coord_flip() +
labs(
title = "Late Delivery Rate by Brazilian State",
subtitle = "Dashed line = national average. AL, MA, and PI stand out as high-risk regions.",
x = NULL,
y = "% of Orders Arriving Late",
caption = "Source: Olist orders, customers datasets · States with < 100 orders excluded"
) +
theme_olist +
theme(legend.position = "right")Figure 4. Late delivery rate by Brazilian state (min. 100 orders)
ggplot(state_summary, aes(x = pct_late, y = avg_score, label = customer_state)) +
geom_point(aes(size = n, colour = pct_late), alpha = 0.85) +
geom_text(nudge_y = 0.04, size = 3, colour = "#2c3e50", fontface = "bold") +
geom_smooth(method = "lm", se = TRUE, colour = col_late, fill = "#fadbd8",
linetype = "dashed", linewidth = 1) +
scale_colour_gradient(low = col_ontime, high = col_late,
name = "% late", labels = function(x) paste0(x, "%")) +
scale_size_continuous(range = c(3, 10), guide = "none") +
scale_x_continuous(labels = function(x) paste0(x, "%")) +
labs(
title = "Higher Late Rate → Lower Satisfaction (State Level)",
subtitle = "Each bubble is a Brazilian state; size = order volume. Regression line with 95% CI.",
x = "% Orders Arriving Late",
y = "Average Review Score",
caption = "Source: Olist datasets · Pearson r computed at order level"
) +
theme_olistFigure 5. State-level trade-off between late rate and satisfaction score
The scatter plot above strips out spatial context. The map below restores it: each bubble sits at its state’s approximate centroid, sized by order volume and coloured by late delivery rate, overlaid on a Brazil country outline.
# Country outline (base map layer)
brazil_outline <- ggplot2::map_data("world", region = "Brazil")
# Attach centroid coordinates to the state-level summary table built earlier
state_map_data <- state_summary %>%
inner_join(state_coords, by = c("customer_state" = "state"))
ggplot() +
geom_polygon(data = brazil_outline, aes(x = long, y = lat, group = group),
fill = "#f4f6f7", colour = "#bdc3c7", linewidth = 0.3) +
geom_point(data = state_map_data,
aes(x = lon, y = lat, size = n, colour = pct_late),
alpha = 0.85) +
geom_text(data = state_map_data,
aes(x = lon, y = lat, label = customer_state),
size = 3, fontface = "bold", colour = "#2c3e50",
nudge_y = 0.6, check_overlap = TRUE) +
scale_colour_gradient(low = col_ontime, high = col_late,
name = "% Late", labels = function(x) paste0(x, "%")) +
scale_size_continuous(range = c(3, 22), name = "Order Volume",
labels = comma_format()) +
coord_fixed(1.05) +
labs(
title = "Order Volume and Late Delivery Rate by Brazilian State",
subtitle = "Bubble size = number of orders · Bubble colour = % delivered late · Centroids approximate state capitals",
x = NULL, y = NULL,
caption = "Source: Olist orders + customers datasets · State outline: maps package (Natural Earth-derived data)"
) +
theme_olist +
theme(
axis.text = element_blank(),
panel.grid = element_blank(),
legend.position = "right"
)Figure 6. Geographic distribution of order volume and late delivery rate across Brazil
Reading the map: São Paulo (SP) dominates order volume — the largest bubble by far, reflecting Brazil’s commercial concentration in the South-East. Note that the reddest bubbles (highest late %) cluster in the North and North-East (AL, MA, PI, CE, SE), far from the South-East commercial core. This spatial clustering is the first visual hint that distance from the country’s logistics hub may be a structural driver of lateness — explored directly in the next section.
So far we’ve shown that late deliveries correlate with destination state, but state is a coarse proxy. A cleaner test: for each order, compute the great-circle distance between the seller’s state and the customer’s state, then ask whether longer shipping distances are associated with more lateness and lower satisfaction — independent of which specific state is involved.
distance_summary <- df %>%
filter(!is.na(distance_bucket)) %>%
group_by(distance_bucket) %>%
summarise(
pct_late = mean(late, na.rm = TRUE) * 100,
avg_score = mean(review_score, na.rm = TRUE),
avg_delay = mean(delay_days, na.rm = TRUE),
n = n()
)
distance_summary %>%
rename(`Distance Bucket` = distance_bucket, `% Late` = pct_late,
`Avg Score` = avg_score, `Avg Delay (days)` = avg_delay, `N Orders` = n) %>%
mutate(`% Late` = round(`% Late`, 1), `Avg Score` = round(`Avg Score`, 2),
`Avg Delay (days)` = round(`Avg Delay (days)`, 1)) %>%
kbl(caption = "Table 7. Delivery outcomes by seller→customer distance") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
column_spec(2, bold = TRUE, color = col_late)| Distance Bucket | % Late | Avg Score | Avg Delay (days) | N Orders |
|---|---|---|---|---|
| Same state | 6.0 | 4.25 | -9.6 | 34874 |
| 1–500 km | 8.6 | 4.10 | -11.8 | 32599 |
| 500–1000 km | 8.3 | 4.11 | -12.5 | 15709 |
| 1000–1500 km | 10.5 | 4.05 | -12.0 | 5640 |
| 1500–2000 km | 13.2 | 4.00 | -11.7 | 1880 |
| 2000–3000 km | 13.2 | 3.98 | -12.1 | 6160 |
| 3000+ km | 12.0 | 3.96 | -16.9 | 142 |
p_dist1 <- ggplot(distance_summary, aes(x = distance_bucket, y = pct_late, group = 1)) +
geom_col(fill = col_primary, width = 0.6, alpha = 0.85) +
geom_text(aes(label = paste0(round(pct_late,1), "%")), vjust = -0.5, size = 3.2,
colour = "#2c3e50", fontface = "bold") +
geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
colour = col_late, linetype = "dashed", linewidth = 0.9) +
scale_y_continuous(limits = c(0, max(distance_summary$pct_late) * 1.25),
labels = function(x) paste0(x, "%")) +
labs(title = "Late Delivery Rate Rises With Shipping Distance",
x = NULL, y = "% Orders Late") +
theme_olist
p_dist2 <- ggplot(distance_summary, aes(x = distance_bucket, y = avg_score, group = 1)) +
geom_col(fill = col_accent, width = 0.6, alpha = 0.85) +
geom_text(aes(label = round(avg_score, 2)), vjust = -0.5, size = 3.2,
colour = "#2c3e50", fontface = "bold") +
geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
colour = col_late, linetype = "dashed", linewidth = 0.9) +
scale_y_continuous(limits = c(0, 5)) +
labs(title = "Average Review Score Falls With Shipping Distance",
x = "Seller → Customer Distance", y = "Avg Review Score",
caption = "Source: Olist orders, items, sellers, customers, reviews · Distance = haversine between state centroids") +
theme_olist +
theme(axis.text.x = element_text(angle = 15, hjust = 1))
p_dist1 / p_dist2Figure 7. Late delivery rate and average review score as a function of seller-to-customer distance
Key observation: Orders shipped within the same state arrive late only 6.0% of the time; once shipping distance exceeds 1,500 km, the late rate climbs above 13% — more than double. Average review score declines in step, from 4.25 (same state) to roughly 3.96–4.00 beyond 1,500 km. Distance alone is a modest predictor (Pearson r ≈ -0.08 with delay days — a real but small effect), which suggests distance is a contributing factor layered on top of carrier and regional infrastructure issues, not the sole driver.
cat_summary <- df %>%
filter(!is.na(category_en)) %>%
group_by(category_en) %>%
summarise(
pct_late = mean(late, na.rm=TRUE) * 100,
avg_score = mean(review_score, na.rm=TRUE),
n = n()
) %>%
filter(n >= 200) %>%
arrange(desc(pct_late)) %>%
slice(c(1:12, (n()-3):n())) %>% # top 12 worst + 4 best for comparison
mutate(
group = ifelse(row_number() <= 12, "Highest late rate", "Lowest late rate"),
category_en = reorder(category_en, pct_late)
)
ggplot(cat_summary, aes(x = category_en, y = pct_late, fill = group)) +
geom_col(width = 0.7, colour = "white") +
geom_text(aes(label = paste0(round(pct_late,1), "%")),
hjust = -0.15, size = 3, colour = "#2c3e50") +
scale_fill_manual(values = c("Highest late rate" = col_late, "Lowest late rate" = col_ontime),
name = NULL) +
scale_y_continuous(limits = c(0, 22), labels = function(x) paste0(x, "%")) +
coord_flip() +
labs(
title = "Late Delivery Rate by Product Category",
subtitle = "Top 12 worst-performing vs 4 best-performing categories (min. 200 orders each)",
x = NULL,
y = "% Orders Arriving Late",
caption = "Source: Olist order items, products, category translation, orders datasets"
) +
theme_olist +
theme(legend.position = "top")Figure 8. Product categories ranked by late delivery rate (min. 200 orders)
cat_full <- df %>%
filter(!is.na(category_en)) %>%
group_by(category_en) %>%
summarise(
Orders = n(),
Pct_Late = paste0(round(mean(late, na.rm=TRUE)*100, 1), "%"),
Avg_Score = round(mean(review_score, na.rm=TRUE), 2),
Avg_Delay = round(mean(delay_days, na.rm=TRUE), 1)
) %>%
filter(as.numeric(gsub(",","",Orders)) >= 200) %>%
arrange(desc(Pct_Late)) %>%
rename(`Category` = category_en, `N Orders` = Orders,
`% Late` = Pct_Late, `Avg Score` = Avg_Score,
`Avg Delay (days)` = Avg_Delay) %>%
head(15)
cat_full %>%
kbl(caption = "Table 5. Top 15 product categories by late delivery rate (≥200 orders)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(3, bold = TRUE, color = col_late) %>%
column_spec(4, bold = TRUE, color = col_primary)| Category | N Orders | % Late | Avg Score | Avg Delay (days) |
|---|---|---|---|---|
| electronics | 2509 | 9.8% | 4.11 | -10.3 |
| baby | 2771 | 9.3% | 4.11 | -10.7 |
| office_furniture | 1251 | 9.3% | 3.64 | -11.0 |
| construction_tools_construction | 729 | 9.2% | 4.12 | -10.3 |
| construction_tools_lights | 232 | 9.1% | 4.18 | -10.6 |
| health_beauty | 8662 | 9% | 4.22 | -11.3 |
| bed_bath_table | 9287 | 8.9% | 3.99 | -10.6 |
| musical_instruments | 608 | 8.9% | 4.22 | -10.8 |
| furniture_living_room | 412 | 8.7% | 4.07 | -10.9 |
| auto | 3808 | 8.6% | 4.14 | -10.5 |
| watches_gifts | 5480 | 8.6% | 4.11 | -11.1 |
| furniture_decor | 6268 | 8.5% | 4.06 | -11.6 |
| telephony | 4080 | 8.5% | 4.05 | -10.5 |
| industry_commerce_and_business | 229 | 8.3% | 4.21 | -11.5 |
| consoles_games | 1015 | 8.2% | 4.16 | -10.8 |
cor_val <- cor(df$delay_days, df$review_score, use = "complete.obs")
cor_test <- cor.test(df$delay_days, df$review_score, use = "complete.obs")
cat(sprintf("Pearson correlation (delay_days ~ review_score): r = %.4f\n", cor_val))## Pearson correlation (delay_days ~ review_score): r = -0.2688
cat(sprintf("p-value: < 2.2e-16 (sample size n = %d)\n", sum(!is.na(df$delay_days) & !is.na(df$review_score))))## p-value: < 2.2e-16 (sample size n = 97004)
cat(sprintf("Interpretation: A negative correlation of %.2f indicates that as lateness increases,\nreview scores systematically decline.\n", cor_val))## Interpretation: A negative correlation of -0.27 indicates that as lateness increases,
## review scores systematically decline.
comparison <- df %>%
group_by(Delivery_Status = ifelse(late, "Late", "On-Time")) %>%
summarise(
N = n(),
Mean_Score = round(mean(review_score, na.rm=TRUE), 3),
Median_Score = median(review_score, na.rm=TRUE),
SD_Score = round(sd(review_score, na.rm=TRUE), 3),
Pct_1_star = paste0(round(mean(review_score == 1, na.rm=TRUE)*100, 1), "%"),
Pct_5_star = paste0(round(mean(review_score == 5, na.rm=TRUE)*100, 1), "%")
)
comparison %>%
kbl(caption = "Table 6. Review score statistics by delivery punctuality") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(1, background = "#fce4e4") %>%
row_spec(2, background = "#eafaf1")| Delivery_Status | N | Mean_Score | Median_Score | SD_Score | Pct_1_star | Pct_5_star |
|---|---|---|---|---|---|---|
| Late | 7863 | 2.546 | 2 | 1.654 | 46.7% | 21.8% |
| On-Time | 89141 | 4.283 | 5 | 1.159 | 6.8% | 62.2% |
Why this section exists. Everything so far has been a correlation: late orders and low review scores co-occur. But correlation can be confounded — maybe late orders are also disproportionately heavy items, long-distance shipments, low-quality listings, or high-installment purchases, and those traits (not the lateness itself) drive dissatisfaction. To get closer to a defensible “pure effect” of latency, we now hold a set of plausible alternative explanations constant and ask: does delay still matter once we control for them?
Method, and its honest limits. We are not running a randomised experiment, so we cannot claim true causal identification — there is no instrument, no natural experiment, no randomised assignment of delay to orders. What we can do is fit two different statistical models — a linear regression (assumes additive, linear relationships) and a random forest (captures non-linear effects and interactions) — on the same set of controls, and see whether delay’s apparent effect survives, shrinks, or disappears. Agreement between two structurally different models is suggestive evidence the effect is real and not a modelling artefact — though residual confounding from unmeasured factors (product defects, customer service quality, packaging) remains possible.
We construct an order-level feature table with one row per delivered, reviewed order, aggregating item-level details (price, freight, weight, item count, listing photo count) up to the order, and adding payment installments and the seller→customer distance computed earlier.
# Aggregate item-level data to order level
items_prod_model <- items %>%
inner_join(products %>% select(product_id, product_category_name, product_weight_g, product_photos_qty),
by = "product_id") %>%
left_join(cat_trans, by = "product_category_name")
order_item_agg <- items_prod_model %>%
group_by(order_id) %>%
summarise(
total_price = sum(price, na.rm = TRUE),
total_freight = sum(freight_value, na.rm = TRUE),
item_count = n(),
total_weight_kg = sum(product_weight_g, na.rm = TRUE) / 1000,
avg_photos_qty = mean(product_photos_qty, na.rm = TRUE),
primary_category = first(product_category_name_english)
)
# Aggregate payments to order level (orders can have split payments)
payment_agg <- payments %>%
group_by(order_id) %>%
summarise(max_installments = max(payment_installments, na.rm = TRUE))
# Assemble the modelling frame from the already-cleaned `delivered` and `df` objects
model_df <- delivered %>%
inner_join(reviews %>% select(order_id, review_score), by = "order_id") %>%
inner_join(order_item_agg, by = "order_id") %>%
inner_join(payment_agg, by = "order_id") %>%
left_join(df %>% select(order_id, distance_km), by = "order_id") %>%
mutate(dissatisfaction = 5 - review_score) # 0 = most satisfied, 4 = most dissatisfied
# Collapse to top 9 categories + "other" to keep the model parsimonious
top_cats <- model_df %>% count(primary_category, sort = TRUE) %>% slice(1:9) %>% pull(primary_category)
model_df$category_grp <- ifelse(model_df$primary_category %in% top_cats, model_df$primary_category, "other")
model_df$category_grp[is.na(model_df$category_grp)] <- "other"
model_vars <- c("dissatisfaction","delay_days","distance_km","total_price","total_freight",
"item_count","total_weight_kg","avg_photos_qty","max_installments","category_grp")
model_df_clean <- model_df %>% select(all_of(model_vars)) %>% na.omit()
model_df_clean$category_grp <- relevel(factor(model_df_clean$category_grp), ref = "other")
cat(sprintf("Modelling dataset: %d orders × %d features (target = dissatisfaction, 0–4 scale)\n",
nrow(model_df_clean), ncol(model_df_clean) - 1))## Modelling dataset: 96729 orders × 9 features (target = dissatisfaction, 0–4 scale)
cat(sprintf("Dropped %d orders (%.1f%%) due to missing values in one or more model features.\n",
nrow(model_df) - nrow(model_df_clean), (1 - nrow(model_df_clean)/nrow(model_df))*100))## Dropped 1356 orders (1.4%) due to missing values in one or more model features.
feature_doc <- data.frame(
Feature = c("delay_days","distance_km","total_price","total_freight","item_count",
"total_weight_kg","avg_photos_qty","max_installments","category_grp"),
Description = c(
"Days late (negative = early) — the variable of interest",
"Haversine distance, seller state → customer state (km)",
"Sum of item prices in the order (R$)",
"Sum of freight charges in the order (R$)",
"Number of items in the order",
"Total shipped weight (kg)",
"Average product listing photo count (proxy for listing quality)",
"Maximum payment installments chosen",
"Product category (top 9 + 'other'), proxy for product-type-specific quality issues"
),
Role = c("Treatment of interest", rep("Control / alternative explanation", 8))
)
feature_doc %>%
kbl(caption = "Table 8. Features in the dissatisfaction model and their role") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
column_spec(1, monospace = TRUE) %>%
row_spec(1, bold = TRUE, background = "#fef9e7")| Feature | Description | Role |
|---|---|---|
| delay_days | Days late (negative = early) — the variable of interest | Treatment of interest |
| distance_km | Haversine distance, seller state → customer state (km) | Control / alternative explanation |
| total_price | Sum of item prices in the order (R\() </td> <td style="text-align:left;"> Control / alternative explanation </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> total_freight </td> <td style="text-align:left;"> Sum of freight charges in the order (R\)) | Control / alternative explanation |
| item_count | Number of items in the order | Control / alternative explanation |
| total_weight_kg | Total shipped weight (kg) | Control / alternative explanation |
| avg_photos_qty | Average product listing photo count (proxy for listing quality) | Control / alternative explanation |
| max_installments | Maximum payment installments chosen | Control / alternative explanation |
| category_grp | Product category (top 9 + ‘other’), proxy for product-type-specific quality issues | Control / alternative explanation |
lm_fit <- lm(
dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
data = model_df_clean
)
lm_summary <- summary(lm_fit)
cat(sprintf("R² = %.3f | Adjusted R² = %.3f | n = %d\n",
lm_summary$r.squared, lm_summary$adj.r.squared, nrow(model_df_clean)))## R² = 0.100 | Adjusted R² = 0.100 | n = 96729
coef_tbl <- as.data.frame(lm_summary$coefficients)
coef_tbl$term <- rownames(coef_tbl)
names(coef_tbl) <- c("Estimate","Std_Error","t_value","p_value","term")
coef_tbl <- coef_tbl %>%
filter(term != "(Intercept)") %>%
mutate(
Significance = case_when(
p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
p_value < 0.05 ~ "*", TRUE ~ ""
),
Estimate = round(Estimate, 4),
p_value = ifelse(p_value < 0.0001, "<0.0001", as.character(round(p_value, 4)))
) %>%
select(term, Estimate, p_value, Significance) %>%
arrange(desc(abs(as.numeric(Estimate))))
coef_tbl %>%
rename(Term = term, `p-value` = p_value) %>%
kbl(caption = "Table 9. Linear regression coefficients (dissatisfaction = 5 − review_score)") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(coef_tbl$term == "delay_days"), bold = TRUE, background = "#fef9e7")| Term | Estimate | p-value | Significance | |
|---|---|---|---|---|
| item_count | item_count | 0.2900 | <0.0001 | *** |
| category_grpbed_bath_table | category_grpbed_bath_table | 0.1665 | <0.0001 | *** |
| category_grptelephony | category_grptelephony | 0.1236 | <0.0001 | *** |
| category_grpcomputers_accessories | category_grpcomputers_accessories | 0.1046 | <0.0001 | *** |
| category_grpfurniture_decor | category_grpfurniture_decor | 0.0903 | <0.0001 | *** |
| category_grpwatches_gifts | category_grpwatches_gifts | 0.0782 | <0.0001 | *** |
| category_grphealth_beauty | category_grphealth_beauty | -0.0460 | 0.0018 | ** |
| category_grpsports_leisure | category_grpsports_leisure | -0.0360 | 0.0196 |
|
| delay_days | delay_days | 0.0355 | <0.0001 | *** |
| category_grphousewares | category_grphousewares | -0.0190 | 0.2758 | |
| category_grpauto | category_grpauto | 0.0169 | 0.4159 | |
| max_installments | max_installments | 0.0074 | <0.0001 | *** |
| avg_photos_qty | avg_photos_qty | -0.0066 | 0.0046 | ** |
| total_weight_kg | total_weight_kg | 0.0055 | <0.0001 | *** |
| total_freight | total_freight | 0.0007 | 0.016 |
|
| distance_km | distance_km | 0.0002 | <0.0001 | *** |
| total_price | total_price | 0.0000 | 0.9713 |
Reading the delay_days coefficient: holding distance, price, freight, item count, weight, listing quality, installments, and category constant, each additional day of lateness is associated with a 0.0355 point increase in the dissatisfaction index (on a 0–4 scale). That is the model’s best estimate of latency’s direct, controlled effect, net of the other eight factors.
A coefficient’s size depends on its variable’s units (days
vs. kilometres vs. Reais), so coefficients alone aren’t comparable
across factors. Instead we use drop1() — for each
predictor, how much does model fit (R²) degrade if we removed only
that predictor, leaving everything else in? This partial
R² is unit-free and directly comparable.
d1 <- drop1(lm_fit, test = "F")
total_rss <- d1["<none>", "RSS"]
partial_r2_tbl <- as.data.frame(d1) %>%
mutate(term = rownames(d1)) %>%
filter(term != "<none>") %>%
mutate(
partial_r2 = `Sum of Sq` / (`Sum of Sq` + total_rss),
term = recode(term,
delay_days = "Delivery Delay (days)",
distance_km = "Shipping Distance (km)",
total_price = "Order Price",
total_freight = "Freight Cost",
item_count = "Item Count",
total_weight_kg = "Shipped Weight",
avg_photos_qty = "Listing Photo Count",
max_installments = "Payment Installments",
category_grp = "Product Category"
)
) %>%
arrange(desc(partial_r2))
partial_r2_tbl %>%
select(term, partial_r2, `Pr(>F)`) %>%
mutate(partial_r2 = paste0(round(partial_r2*100, 2), "%"),
`Pr(>F)` = ifelse(`Pr(>F)` < 0.0001, "<0.0001", round(`Pr(>F)`,4))) %>%
rename(Factor = term, `Unique Variance Explained` = partial_r2, `p-value` = `Pr(>F)`) %>%
kbl(caption = "Table 10. Partial R² — unique contribution of each factor, controlling for all others") %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#fef9e7")| Factor | Unique Variance Explained | p-value | |
|---|---|---|---|
| delay_days | Delivery Delay (days) | 7.88% | <0.0001 |
| item_count | Item Count | 1.23% | <0.0001 |
| distance_km | Shipping Distance (km) | 0.54% | <0.0001 |
| category_grp | Product Category | 0.28% | <0.0001 |
| max_installments | Payment Installments | 0.02% | <0.0001 |
| total_weight_kg | Shipped Weight | 0.02% | <0.0001 |
| avg_photos_qty | Listing Photo Count | 0.01% | 0.0046 |
| total_freight | Freight Cost | 0.01% | 0.016 |
| total_price | Order Price | 0% | 0.9713 |
ggplot(partial_r2_tbl, aes(x = reorder(term, partial_r2), y = partial_r2*100, fill = term == "Delivery Delay (days)")) +
geom_col(width = 0.65) +
geom_text(aes(label = paste0(round(partial_r2*100,2), "%")), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
scale_y_continuous(limits = c(0, max(partial_r2_tbl$partial_r2*100)*1.25)) +
coord_flip() +
labs(
title = "Delivery Delay Dwarfs Every Other Factor in Explanatory Power",
subtitle = "Partial R² from linear regression — the % of dissatisfaction variance uniquely attributable to each factor",
x = NULL, y = "Unique Variance Explained (%)",
caption = "Source: order-level model with 9 controls, n = 95,666 · Linear regression, drop1() Type II test"
) +
theme_olistFigure 9. Linear model: unique variance in dissatisfaction explained by each factor
Linear regression assumes additive, straight-line effects. A random forest makes no such assumption — it can detect thresholds, plateaus, and interactions. If delay still ranks #1 here, the finding is robust to model choice, not an artefact of assuming linearity.
set.seed(42)
rf_fit <- ranger(
dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
data = model_df_clean,
num.trees = 200,
importance = "permutation",
seed = 42,
num.threads = 1
)
cat(sprintf("Random Forest OOB R² = %.3f (vs. linear model R² = %.3f)\n",
rf_fit$r.squared, lm_summary$r.squared))## Random Forest OOB R² = 0.194 (vs. linear model R² = 0.100)
imp_tbl <- data.frame(
term = names(importance(rf_fit)),
importance = importance(rf_fit)
) %>%
mutate(term = recode(term,
delay_days = "Delivery Delay (days)",
distance_km = "Shipping Distance (km)",
total_price = "Order Price",
total_freight = "Freight Cost",
item_count = "Item Count",
total_weight_kg = "Shipped Weight",
avg_photos_qty = "Listing Photo Count",
max_installments = "Payment Installments",
category_grp = "Product Category"
)) %>%
arrange(desc(importance))
ggplot(imp_tbl, aes(x = reorder(term, importance), y = importance,
fill = term == "Delivery Delay (days)")) +
geom_col(width = 0.65) +
geom_text(aes(label = round(importance, 3)), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
scale_y_continuous(limits = c(0, max(imp_tbl$importance)*1.25)) +
coord_flip() +
labs(
title = "Random Forest Independently Confirms: Delay Is the Top Predictor",
subtitle = "Permutation importance — how much OOB prediction error rises when each variable is shuffled",
x = NULL, y = "Permutation Importance (MSE increase)",
caption = "Source: ranger random forest, 200 trees, n = 95,666"
) +
theme_olistFigure 10. Random forest permutation importance for predicting dissatisfaction
Both models agree. In the linear model, delay’s partial R² (7.9%) is roughly 6.4× larger than the second-ranked factor. In the random forest, delay’s importance score is also ranked #1, well ahead of freight cost and order price. Two structurally different models, same conclusion: delivery delay is not merely correlated with dissatisfaction — it remains the dominant factor after controlling for the eight most plausible alternative explanations available in this dataset.
A regression coefficient implies a constant, linear effect — every extra day of lateness adds the same fixed amount of dissatisfaction. Does that match reality? We use the random forest to trace out the partial dependence of predicted dissatisfaction on delay_days, holding every other factor fixed at its typical (median/most-common) value, and compare it to the linear model’s (necessarily straight-line) prediction.
typical <- model_df_clean %>%
summarise(
distance_km = median(distance_km),
total_price = median(total_price),
total_freight = median(total_freight),
item_count = round(median(item_count)),
total_weight_kg = median(total_weight_kg),
avg_photos_qty = median(avg_photos_qty),
max_installments = round(median(max_installments))
)
mode_cat <- names(sort(table(model_df_clean$category_grp), decreasing = TRUE))[1]
delay_grid <- c(seq(-20, 20, by = 1), seq(24, 60, by = 4))
pd_grid <- data.frame(
delay_days = delay_grid,
distance_km = typical$distance_km,
total_price = typical$total_price,
total_freight = typical$total_freight,
item_count = typical$item_count,
total_weight_kg = typical$total_weight_kg,
avg_photos_qty = typical$avg_photos_qty,
max_installments = typical$max_installments,
category_grp = factor(mode_cat, levels = levels(model_df_clean$category_grp))
)
pd_grid$pred_lm <- predict(lm_fit, newdata = pd_grid)
pd_grid$pred_rf <- predict(rf_fit, data = pd_grid)$predictionspd_long <- pd_grid %>%
select(delay_days, pred_lm, pred_rf) %>%
pivot_longer(cols = c(pred_lm, pred_rf), names_to = "model", values_to = "predicted") %>%
mutate(model = recode(model, pred_lm = "Linear Model (assumes constant slope)",
pred_rf = "Random Forest (data-driven shape)"))
ggplot(pd_long, aes(x = delay_days, y = predicted, colour = model)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = col_neutral, linewidth = 0.7) +
annotate("text", x = 0.5, y = 0.2, label = "On-time boundary", angle = 90,
colour = col_neutral, size = 3, hjust = 0) +
geom_line(linewidth = 1.3) +
geom_point(size = 1.6, alpha = 0.7) +
scale_colour_manual(values = c("Linear Model (assumes constant slope)" = col_neutral,
"Random Forest (data-driven shape)" = col_late),
name = NULL) +
labs(
title = "Dissatisfaction Jumps at the On-Time/Late Boundary, Then Plateaus",
subtitle = "Both models hold distance, price, freight, item count, weight, category, and installments fixed at typical values",
x = "Delivery Delay (days; negative = early)",
y = "Predicted Dissatisfaction (0 = most satisfied, 4 = most dissatisfied)",
caption = "Source: same modelling dataset as Figures 9–10 · Random forest predictions averaged across 200 trees"
) +
theme_olistFigure 11. Partial effect of delivery delay on predicted dissatisfaction, controlling for all other factors
This is the most actionable finding in the report. The random forest reveals a threshold effect, not a smooth slope: predicted dissatisfaction roughly triples between 0 and 4 days late, then largely plateaus — an order 30 days late is not dramatically worse, in this model, than one 5 days late. The linear model, by construction, cannot see this — it forces a straight line through the same data and understates the damage of barely missing the promise while overstating the marginal harm of being very late. The practical implication: the highest-value intervention is not “ship faster on average” but “protect the promised date itself.” Shaving the worst delays from 30 days to 15 may matter less to customer sentiment than converting a 2-day miss into an on-time delivery.
Late orders average a review score of 2.55 vs 4.28 for on-time orders — a gap of 1.73 points on a 5-point scale. Late orders are also 4× more likely to receive a 1-star review. With 8.1% of orders arriving late across ~96,000 deliveries, this represents a structurally embedded drag on platform reputation. Reducing the late rate by even half would meaningfully shift the platform’s aggregate NPS signal.
Caution: The analysis is observational. It is possible that other factors correlated with lateness (e.g. product type, seller reliability, distance) partially drive the score gap rather than lateness alone.
States like AL (23.9% late), MA (19.6%), PI (15.9%), CE (15.4%) and SE (15.2%) significantly exceed the national average of ~8%, as the Brazil map (Figure 6) shows visually — the reddest, highest-late-rate bubbles cluster away from the South-East commercial core where most sellers are based. The distance analysis (Figure 7) confirms this is not coincidental: late rate rises from 6.0% for same-state shipments to over 13% beyond 1,500 km, with review scores declining in step. However, the correlation between raw distance and delay is modest (r ≈ -0.08), meaning distance is a real but partial driver — regional carrier capacity and road infrastructure likely compound the effect specifically in the North/North-East. Targeted carrier partnerships or regional fulfilment hubs closer to these states would address both the distance penalty and the regional infrastructure gap simultaneously.
Caution: These states also have smaller order volumes, so the late-rate estimates carry higher statistical uncertainty. State centroids are an approximation (state capital coordinates), not actual shipment origin/destination points, so the distance metric is directional rather than precise.
The average actual delivery time is 12.6 days, but the median delay relative to the estimate is negative (i.e. most orders arrive early). This means the core issue is not overall speed but reliability of the promised date. Customers who receive a parcel 8 days early are moderately happy; customers who receive it 3 days late leave negative reviews. Olist’s lever here may be better delivery-date promise calibration (setting more conservative estimates that are reliably met) rather than necessarily accelerating logistics operations end-to-end.
Caution: We do not have visibility into whether Olist or the seller sets the delivery estimate. The appropriate corrective action differs substantially between these two scenarios.
When we control for shipping distance, order price, freight cost, item count, shipped weight, listing quality, payment installments, and product category — using both a linear regression and an independent random forest model — delivery delay remains the single strongest predictor of dissatisfaction, explaining roughly 8× more unique variance than the next-best factor (item count) and ranking first in random forest importance by a wide margin. More importantly, the random forest reveals the effect is not a smooth slope: dissatisfaction roughly triples in the first few days after the promised date, then plateaus. This reframes the operational priority from “reduce average delay” to “protect the promised date” — converting narrow misses into on-time deliveries likely matters more to sentiment than shortening already-large delays.
Caution: This is a controlled association across two modelling approaches, not a randomised-experiment-level causal estimate. Both models’ R² remain modest (0.10–0.19), meaning most of the variation in review scores comes from factors not captured here — most plausibly product quality and condition-on-arrival, which this dataset cannot observe directly.
The causal-inference section controls for eight plausible alternative explanations (distance, price, freight, item count, weight, listing quality, installments, category) and the delay effect survives across two different model types — meaningfully stronger evidence than a raw correlation. But this is not a randomised experiment: no order was ever randomly assigned to be late or on-time, so unmeasured confounders could still bias the estimate. The most likely candidates are product condition on arrival, packaging quality, and seller communication/responsiveness — none of which exist as columns in this dataset, and none of which can be ruled out. The modest R² of both models (0.10–0.19) is itself an honest signal: delay is the strongest single factor we can measure, but it is far from the only one driving review scores.
The dataset covers 2016–2018 and represents a specific growth phase of Brazilian e-commerce. Logistical infrastructure, carrier networks, and customer expectations have evolved substantially since then. State-level findings (especially for low-volume states like AL with 401 orders or TO with 274 orders) carry high sampling uncertainty and should not be used to make multi-million dollar capital allocation decisions without refreshed data. Additionally, ~3,500 orders were dropped due to missing delivery dates, and orders with statuses other than “delivered” (cancellations, returns) are excluded — this may introduce survivor bias if late orders are more likely to be cancelled. The seller→customer distance metric uses state-capital centroids as a proxy for actual origin/destination coordinates (zip-code-level latitude/longitude was not available in this dataset), so it should be read as a directional signal of geographic distance rather than a precise shipping-route measurement.
This report was structured and drafted with AI assistance (Claude, Anthropic). The following transparency disclosures apply:
What AI contributed: narrative framing, ggplot2 code structure, section organisation, feature engineering choices for the regression/random forest models, and identification of candidate insights from descriptive statistics and model output.
What AI did not do: modify or simulate the underlying data, cherry-pick which model results to report, tune hyperparameters to produce a more dramatic result, or claim randomised-experiment-level causality.
Verification checklist: - ✅ All figures and models
can be reproduced by re-knitting
olist_delivery_analysis.Rmd with the original eight CSV
files - ✅ No imputation or synthetic data generation was applied; rows
with missing model features are dropped via na.omit(), with
the drop rate explicitly reported - ✅ Correlation is Pearson on raw
observations, not aggregated means (which would inflate r) - ✅ Two
structurally different models (linear regression, random forest) were
fit on identical features and the same target; their agreement is
reported as corroborating evidence, not proof - ✅ Partial R²
(drop1) and permutation importance (not impurity-based,
which is biased toward high-cardinality variables) were used for fair
cross-factor comparison - ⚠️ Causality is explicitly
not claimed — the report uses “controlled association,”
“partial effect,” and “net of measured confounders” rather than “causes”
- ⚠️ Both models’ modest R² (0.10–0.19) is reported prominently rather
than hidden, to avoid overstating how much of dissatisfaction this
analysis actually explains - ⚠️ Human review is recommended before
sharing insights with Olist operations or logistics teams, particularly
for state-level investment recommendations given the small-n concern,
and before treating the delay coefficient as an exact dollar-equivalent
effect size
Recommended next steps for validation: a true natural experiment (e.g. comparing review scores before/after a carrier SLA change in a specific state, holding the order mix constant); sentiment analysis of free-text review comments to separate delivery complaints from product complaints; a seller-level random effect to absorb seller-specific quality that this model cannot currently isolate from delay.
## Report generated on: 2026-06-16 13:00:56 CEST
## R version: R version 4.6.0 (2026-04-24 ucrt)
cat(sprintf("Key packages: ggplot2 %s | dplyr %s | patchwork %s\n",
as.character(packageVersion("ggplot2")),
as.character(packageVersion("dplyr")),
as.character(packageVersion("patchwork"))))## Key packages: ggplot2 4.0.3 | dplyr 1.2.1 | patchwork 1.3.2
End of Report · Olist Delivery Performance Analysis · Reproducible R Markdown