1 Executive Summary

Research Question: Does late delivery reduce customer satisfaction on the Olist platform — and does that effect hold up once we account for other plausible explanations — and where should Olist target operational improvement first?

This report analyses ~97,000 delivered orders across the Olist Brazilian e-commerce dataset (2016–2018), joining eight relational tables — orders, reviews, customers, order items, products, category translations, sellers, and payments — into a unified analytical frame. The headline finding: late deliveries are associated with a 1.74-point drop in average review score (on a 1–5 scale), concentrated in specific Brazilian states and product categories. We then go a step further and test whether this effect survives controlling for distance, price, freight, item count, weight, listing quality, payment installments, and product category, using both linear regression and random forest models. It does — delay remains the dominant predictor of dissatisfaction in both models, and a random forest reveals the effect is a threshold (a sharp jump right at the on-time/late boundary, then a plateau) rather than a steady slope, which has direct operational implications for where Olist should focus first.

2 Setup and Data Import

2.1 Libraries

All analysis uses the R tidyverse ecosystem. Charts are produced with ggplot2. A consistent colour palette (#2980b9 for primary, #e74c3c for alert, #27ae60 for positive) is used throughout.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(scales)
library(kableExtra)
library(patchwork)
library(maps)
library(ranger)

# Shared visual identity
col_primary  <- "#2980b9"
col_late     <- "#e74c3c"
col_ontime   <- "#27ae60"
col_neutral  <- "#95a5a6"
col_accent   <- "#f39c12"

theme_olist <- theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, colour = "#2c3e50"),
    plot.subtitle    = element_text(colour = "#7f8c8d", size = 11),
    plot.caption     = element_text(colour = "#95a5a6", size = 9, hjust = 0),
    axis.title       = element_text(colour = "#555555", size = 11),
    axis.text        = element_text(colour = "#555555"),
    panel.grid.major = element_line(colour = "#ecf0f1"),
    panel.grid.minor = element_blank(),
    legend.position  = "top",
    strip.text       = element_text(face = "bold")
  )

2.2 Loading Raw Tables

Six CSV files are loaded from the project directory. We perform basic structural checks immediately after import.

data_path <- "data/"

orders    <- read.csv(paste0(data_path, "olist_orders_dataset.csv"),              stringsAsFactors = FALSE)
reviews   <- read.csv(paste0(data_path, "olist_order_reviews_dataset.csv"),       stringsAsFactors = FALSE)
customers <- read.csv(paste0(data_path, "olist_customers_dataset.csv"),            stringsAsFactors = FALSE)
items     <- read.csv(paste0(data_path, "olist_order_items_dataset.csv"),          stringsAsFactors = FALSE)
products  <- read.csv(paste0(data_path, "olist_products_dataset.csv"),             stringsAsFactors = FALSE)
cat_trans <- read.csv(paste0(data_path, "product_category_name_translation.csv"),
                       stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
# Normalise BOM-affected column name (appears as "X...product_category_name" on some systems)
if (!"product_category_name" %in% names(cat_trans)) {
  names(cat_trans)[1] <- "product_category_name"
}
sellers   <- read.csv(paste0(data_path, "olist_sellers_dataset.csv"),               stringsAsFactors = FALSE)
payments  <- read.csv(paste0(data_path, "olist_order_payments_dataset.csv"),        stringsAsFactors = FALSE)

cat("Tables loaded successfully.\n")

## Tables loaded successfully.

2.3 Basic Quality Checks

2.3.1 Dimensions & Column Types

table_info <- data.frame(
  Table      = c("orders", "reviews", "customers", "items", "products", "cat_trans", "sellers", "payments"),
  Rows       = c(nrow(orders), nrow(reviews), nrow(customers), nrow(items), nrow(products), nrow(cat_trans), nrow(sellers), nrow(payments)),
  Columns    = c(ncol(orders), ncol(reviews), ncol(customers), ncol(items), ncol(products), ncol(cat_trans), ncol(sellers), ncol(payments)),
  Key_Column = c("order_id", "order_id", "customer_id", "order_id", "product_id", "product_category_name", "seller_id", "order_id")
)

table_info %>%
  kbl(caption = "Table 1. Dataset dimensions at import") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE)

Table 1. Dataset dimensions at import
Table	Rows	Columns	Key_Column
orders	99441	8	order_id
reviews	100000	7	order_id
customers	99441	5	customer_id
items	112650	7	order_id
products	32951	9	product_id
cat_trans	71	2	product_category_name
sellers	3095	4	seller_id
payments	103886	5	order_id

2.3.2 Missing Values

na_summary <- data.frame(
  Table   = "orders",
  Column  = names(orders),
  NA_Count = colSums(is.na(orders) | orders == "")
)

# Key columns for the analysis
key_cols <- data.frame(
  Column                        = c("order_delivered_customer_date",
                                    "order_estimated_delivery_date",
                                    "order_purchase_timestamp",
                                    "review_score",
                                    "customer_state"),
  Source_Table                  = c("orders","orders","orders","reviews","customers"),
  NA_Count                      = c(
    sum(is.na(orders$order_delivered_customer_date) | orders$order_delivered_customer_date == ""),
    sum(is.na(orders$order_estimated_delivery_date) | orders$order_estimated_delivery_date == ""),
    sum(is.na(orders$order_purchase_timestamp)      | orders$order_purchase_timestamp == ""),
    sum(is.na(reviews$review_score)),
    sum(is.na(customers$customer_state) | customers$customer_state == "")
  )
)

key_cols %>%
  kbl(caption = "Table 2. Missing values in analytically critical columns") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(3, color = ifelse(key_cols$NA_Count > 0, "#e74c3c", "#27ae60"), bold = TRUE)

Table 2. Missing values in analytically critical columns
Column	Source_Table	NA_Count
order_delivered_customer_date	orders	2965
order_estimated_delivery_date	orders	0
order_purchase_timestamp	orders	0
review_score	reviews	0
customer_state	customers	0

2.3.3 Order Status Distribution

status_tbl <- as.data.frame(table(orders$order_status))
names(status_tbl) <- c("Status", "Count")
status_tbl$Pct <- round(status_tbl$Count / sum(status_tbl$Count) * 100, 1)
status_tbl <- status_tbl[order(-status_tbl$Count), ]

status_tbl %>%
  kbl(caption = "Table 3. Order status breakdown — analysis restricted to 'delivered'") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(status_tbl$Status == "delivered"), bold = TRUE, background = "#eafaf1")

Table 3. Order status breakdown — analysis restricted to ‘delivered’
	Status	Count	Pct
4	delivered	96478	97.0
7	shipped	1107	1.1
2	canceled	625	0.6
8	unavailable	609	0.6
5	invoiced	314	0.3
6	processing	301	0.3
3	created	5	0.0
1	approved	2	0.0

3 Data Preparation & Joins

3.1 Date Parsing and Delivery Metrics

Timestamps are parsed, delivery delay is computed as (actual delivery date − estimated delivery date) in days. A positive value means the parcel arrived after the promised date.

# Parse all timestamps
orders$order_purchase_timestamp      <- as.POSIXct(orders$order_purchase_timestamp,      format="%Y-%m-%d %H:%M:%S")
orders$order_delivered_customer_date <- as.POSIXct(orders$order_delivered_customer_date, format="%Y-%m-%d %H:%M:%S")
orders$order_estimated_delivery_date <- as.POSIXct(orders$order_estimated_delivery_date, format="%Y-%m-%d %H:%M:%S")
orders$order_approved_at             <- as.POSIXct(orders$order_approved_at,             format="%Y-%m-%d %H:%M:%S")

# Restrict to delivered orders with valid date information
delivered <- orders %>%
  filter(order_status == "delivered",
         !is.na(order_delivered_customer_date),
         !is.na(order_estimated_delivery_date),
         !is.na(order_purchase_timestamp)) %>%
  mutate(
    delay_days   = as.numeric(difftime(order_delivered_customer_date,
                                       order_estimated_delivery_date, units = "days")),
    actual_days  = as.numeric(difftime(order_delivered_customer_date,
                                       order_purchase_timestamp, units = "days")),
    late         = delay_days > 0,
    delay_bucket = case_when(
      delay_days <= -7  ~ "7+ days early",
      delay_days <= 0   ~ "0–7 days early",
      delay_days <= 7   ~ "1–7 days late",
      delay_days <= 14  ~ "8–14 days late",
      TRUE              ~ "15+ days late"
    ),
    purchase_month = floor_date(order_purchase_timestamp, "month")
  )

cat(sprintf("Delivered orders retained for analysis: %d\n", nrow(delivered)))

## Delivered orders retained for analysis: 96467

cat(sprintf("Orders with positive delay (late): %d (%.1f%%)\n",
            sum(delivered$late), mean(delivered$late)*100))

## Orders with positive delay (late): 7824 (8.1%)

3.2 Multi-Table Join

We join seven tables into a single analytical frame: orders → reviews → customers → items → products → category translations → sellers. An eighth table (payments) is incorporated later for the causal modelling section.

# ── Join 1: orders + reviews ──────────────────────────────────────────────────
df <- delivered %>%
  inner_join(
    reviews %>% select(order_id, review_score),
    by = "order_id"
  )

# ── Join 2: + customers ───────────────────────────────────────────────────────
df <- df %>%
  inner_join(
    customers %>% select(customer_id, customer_state, customer_city),
    by = "customer_id"
  )

# ── Join 3: items + products + category translation ───────────────────────────
items_cat <- items %>%
  select(order_id, product_id) %>%
  inner_join(products %>% select(product_id, product_category_name, product_weight_g),
             by = "product_id") %>%
  inner_join(cat_trans, by = "product_category_name") %>%
  # Keep only the first item per order to avoid duplication
  distinct(order_id, .keep_all = TRUE) %>%
  rename(category_en = product_category_name_english)

# ── Join 4: + category info ───────────────────────────────────────────────────
df <- df %>%
  left_join(items_cat %>% select(order_id, category_en, product_weight_g),
            by = "order_id")

# ── Join 5: + seller location (first item's seller per order) ──────────────────
items_seller <- items %>%
  select(order_id, seller_id) %>%
  distinct(order_id, .keep_all = TRUE) %>%
  inner_join(sellers %>% select(seller_id, seller_state), by = "seller_id")

df <- df %>%
  left_join(items_seller %>% select(order_id, seller_state), by = "order_id")

cat(sprintf("Final analytical frame: %d rows × %d columns\n", nrow(df), ncol(df)))

## Final analytical frame: 97004 rows × 19 columns

Join Lineage
orders → (order_id) → reviews
orders → (customer_id) → customers
items → (product_id) → products → (product_category_name) → cat_trans → (order_id) → main frame
items → (seller_id) → sellers → (order_id) → main frame
All joins are inner joins except the category and seller enrichment steps (left joins, to preserve orders with no matching item record).

3.3 Geographic Reference Data: State Centroids and Distance

To analyse where delays happen geographically — not just which state — we attach an approximate latitude/longitude centroid (state capital coordinates, a standard convention) to each Brazilian state, then compute the great-circle (haversine) distance between the seller’s state and the customer’s state for every order. This gives a continuous “shipping distance” variable, independent of the late/on-time label.

# Approximate centroids (state capital coordinates) — static, well-known geographic reference data
state_coords <- data.frame(
  state = c("AC","AL","AP","AM","BA","CE","DF","ES","GO","MA","MT","MS","MG",
            "PA","PB","PR","PE","PI","RJ","RN","RS","RO","RR","SC","SP","SE","TO"),
  lat   = c(-9.97,-9.65,0.04,-3.10,-12.97,-3.72,-15.78,-20.32,-16.68,-2.53,-15.60,
            -20.44,-19.92,-1.46,-7.12,-25.43,-8.05,-5.09,-22.91,-5.79,-30.03,-8.76,
            2.82,-27.60,-23.55,-10.91,-10.25),
  lon   = c(-67.81,-35.70,-51.07,-60.02,-38.51,-38.54,-47.93,-40.34,-49.25,-44.30,
            -56.10,-54.65,-43.94,-48.50,-34.86,-49.27,-34.90,-42.80,-43.17,-35.21,
            -51.23,-63.90,-60.67,-48.55,-46.63,-37.07,-48.32)
)

haversine_km <- function(lat1, lon1, lat2, lon2) {
  R <- 6371  # Earth radius in km
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat/2)^2 + cos(lat1*rad) * cos(lat2*rad) * sin(dlon/2)^2
  2 * R * asin(sqrt(a))
}

df <- df %>%
  left_join(state_coords %>% rename(seller_lat = lat, seller_lon = lon),
            by = c("seller_state" = "state")) %>%
  left_join(state_coords %>% rename(cust_lat = lat, cust_lon = lon),
            by = c("customer_state" = "state")) %>%
  mutate(
    distance_km = haversine_km(seller_lat, seller_lon, cust_lat, cust_lon),
    distance_bucket = cut(distance_km,
                          breaks = c(-1, 0, 500, 1000, 1500, 2000, 3000, 10000),
                          labels = c("Same state","1–500 km","500–1000 km",
                                     "1000–1500 km","1500–2000 km","2000–3000 km","3000+ km"))
  )

cat(sprintf("Orders with computed seller→customer distance: %d (%.1f%% of frame)\n",
            sum(!is.na(df$distance_km)), mean(!is.na(df$distance_km))*100))

## Orders with computed seller→customer distance: 97004 (100.0% of frame)

4 Exploratory Data Analysis

4.1 Global Delivery Performance

global <- df %>%
  summarise(
    n_orders       = n(),
    pct_late       = round(mean(late, na.rm=TRUE)*100, 1),
    avg_delay      = round(mean(delay_days, na.rm=TRUE), 1),
    median_delay   = round(median(delay_days, na.rm=TRUE), 1),
    avg_actual_days= round(mean(actual_days, na.rm=TRUE), 1),
    avg_score      = round(mean(review_score, na.rm=TRUE), 2),
    avg_score_late  = round(mean(review_score[late],  na.rm=TRUE), 2),
    avg_score_ontime= round(mean(review_score[!late], na.rm=TRUE), 2),
    score_diff     = round(mean(review_score[!late], na.rm=TRUE) -
                           mean(review_score[late],  na.rm=TRUE), 2)
  )

global %>%
  tidyr::pivot_longer(everything(), names_to="Metric", values_to="Value") %>%
  mutate(Metric = recode(Metric,
    n_orders       = "Total delivered orders",
    pct_late       = "% orders arriving late",
    avg_delay      = "Avg delay (days, negative = early)",
    median_delay   = "Median delay (days)",
    avg_actual_days= "Avg actual delivery time (days)",
    avg_score      = "Overall avg review score",
    avg_score_late  = "Avg review score — late orders",
    avg_score_ontime= "Avg review score — on-time orders",
    score_diff     = "Score gap (on-time minus late)"
  )) %>%
  kbl(caption = "Table 4. Global KPIs from the merged analytical frame") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(c(8,9), bold = TRUE, background = "#fef9e7")

Table 4. Global KPIs from the merged analytical frame
Metric	Value
Total delivered orders	97004.00
% orders arriving late	8.10
Avg delay (days, negative = early)	-11.20
Median delay (days)	-11.90
Avg actual delivery time (days)	12.60
Overall avg review score	4.14
Avg review score — late orders	2.55
Avg review score — on-time orders	4.28
Score gap (on-time minus late)	1.74

4.2 Review Score Distribution

score_dist <- df %>%
  mutate(
    Delivery = ifelse(late, "Late", "On-Time"),
    review_score = factor(review_score, levels = 5:1)
  ) %>%
  count(review_score, Delivery)

ggplot(score_dist, aes(x = review_score, y = n, fill = Delivery)) +
  geom_col(position = "dodge", width = 0.65, colour = "white") +
  scale_fill_manual(values = c("Late" = col_late, "On-Time" = col_ontime),
                    name = "Delivery status") +
  scale_y_continuous(labels = comma_format()) +
  labs(
    title    = "Review Score Distribution: Late vs On-Time Orders",
    subtitle = "Late deliveries collapse 5-star ratings and amplify 1-star scores",
    x        = "Review Score (stars)",
    y        = "Number of Orders",
    caption  = "Source: Olist order and review datasets (2016–2018)"
  ) +
  theme_olist

Figure 1. Review score volume split by delivery punctuality

5 Core Analysis: Delivery Delay vs Customer Satisfaction

5.1 Average Score by Delay Bucket

We group orders into five delay categories and compute mean review score for each bucket. The gradient is decisive.

bucket_order <- c("7+ days early","0–7 days early","1–7 days late","8–14 days late","15+ days late")

bucket_summary <- df %>%
  filter(!is.na(delay_bucket), !is.na(review_score)) %>%
  mutate(delay_bucket = factor(delay_bucket, levels = bucket_order)) %>%
  group_by(delay_bucket) %>%
  summarise(
    avg_score = mean(review_score, na.rm=TRUE),
    n         = n(),
    se        = sd(review_score, na.rm=TRUE) / sqrt(n)
  )

ggplot(bucket_summary, aes(x = delay_bucket, y = avg_score, fill = avg_score)) +
  geom_col(width = 0.65, colour = "white") +
  geom_errorbar(aes(ymin = avg_score - 1.96*se, ymax = avg_score + 1.96*se),
                width = 0.2, colour = "#2c3e50", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f\n(n=%s)", avg_score, comma(n))),
            vjust = -0.4, size = 3.3, colour = "#2c3e50", fontface = "bold") +
  scale_fill_gradient(low = col_late, high = col_ontime, name = "Avg score", guide = "none") +
  scale_y_continuous(limits = c(0, 5.4), breaks = 1:5) +
  labs(
    title    = "Customer Satisfaction Degrades Monotonically with Delivery Lateness",
    subtitle = "95% confidence intervals shown. Each bucket represents a distinct customer experience tier.",
    x        = "Delivery Timing Bucket",
    y        = "Average Review Score (1–5)",
    caption  = "Source: Olist orders and reviews datasets"
  ) +
  theme_olist +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

Figure 2. Average review score by delivery delay category

Key observation: The score drop is not a cliff edge only at extreme lateness — it begins immediately. Even orders arriving 1–7 days late score 0.87 points lower than those arriving 0–7 days early. The pattern is monotonic: every extra day of lateness costs satisfaction.

5.2 Monthly Trend: On-Time Rate and Average Score

monthly <- df %>%
  filter(!is.na(purchase_month),
         purchase_month >= as.POSIXct("2017-01-01"),
         purchase_month < as.POSIXct("2019-01-01")) %>%
  group_by(purchase_month) %>%
  summarise(
    pct_ontime  = (1 - mean(late, na.rm=TRUE)) * 100,
    avg_score   = mean(review_score, na.rm=TRUE),
    n           = n()
  ) %>%
  filter(n >= 50)   # drop months with very few orders

p1 <- ggplot(monthly, aes(x = purchase_month)) +
  geom_line(aes(y = pct_ontime), colour = col_ontime, linewidth = 1.2) +
  geom_point(aes(y = pct_ontime, size = n), colour = col_ontime, alpha = 0.7) +
  scale_y_continuous(limits = c(80, 100), labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(2, 6), guide = "none") +
  scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
  labs(title = "On-Time Delivery Rate by Month",
       x = NULL, y = "% on-time") +
  theme_olist

p2 <- ggplot(monthly, aes(x = purchase_month)) +
  geom_line(aes(y = avg_score), colour = col_primary, linewidth = 1.2) +
  geom_point(aes(y = avg_score, size = n), colour = col_primary, alpha = 0.7) +
  scale_y_continuous(limits = c(3.8, 4.5)) +
  scale_size_continuous(range = c(2, 6), guide = "none") +
  scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
  labs(title = "Average Review Score by Month",
       x = "Purchase Month", y = "Avg Review Score",
       caption = "Point size ∝ order volume · Source: Olist datasets") +
  theme_olist

p1 / p2

Figure 3. Monthly delivery on-time rate and customer score (2017–2018)

6 Geographic Analysis

6.1 Which Brazilian States Struggle Most?

state_summary <- df %>%
  filter(!is.na(customer_state)) %>%
  group_by(customer_state) %>%
  summarise(
    pct_late  = mean(late, na.rm=TRUE) * 100,
    avg_score = mean(review_score, na.rm=TRUE),
    avg_delay = mean(delay_days, na.rm=TRUE),
    n         = n()
  ) %>%
  filter(n >= 100) %>%
  arrange(desc(pct_late)) %>%
  mutate(customer_state = reorder(customer_state, pct_late))

ggplot(state_summary, aes(x = customer_state, y = pct_late)) +
  geom_segment(aes(xend = customer_state, y = 0, yend = pct_late,
                   colour = pct_late), linewidth = 1.2) +
  geom_point(aes(colour = pct_late, size = n)) +
  geom_hline(yintercept = mean(state_summary$pct_late), linetype = "dashed",
             colour = col_neutral, linewidth = 0.8) +
  annotate("text", x = 3, y = mean(state_summary$pct_late) + 0.4,
           label = paste0("National avg: ", round(mean(state_summary$pct_late),1), "%"),
           colour = col_neutral, size = 3.2) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(2, 7), name = "Order count") +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  coord_flip() +
  labs(
    title    = "Late Delivery Rate by Brazilian State",
    subtitle = "Dashed line = national average. AL, MA, and PI stand out as high-risk regions.",
    x        = NULL,
    y        = "% of Orders Arriving Late",
    caption  = "Source: Olist orders, customers datasets · States with < 100 orders excluded"
  ) +
  theme_olist +
  theme(legend.position = "right")

Figure 4. Late delivery rate by Brazilian state (min. 100 orders)

6.2 State Scatter: Late Rate vs Average Score

ggplot(state_summary, aes(x = pct_late, y = avg_score, label = customer_state)) +
  geom_point(aes(size = n, colour = pct_late), alpha = 0.85) +
  geom_text(nudge_y = 0.04, size = 3, colour = "#2c3e50", fontface = "bold") +
  geom_smooth(method = "lm", se = TRUE, colour = col_late, fill = "#fadbd8",
              linetype = "dashed", linewidth = 1) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(3, 10), guide = "none") +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Higher Late Rate → Lower Satisfaction (State Level)",
    subtitle = "Each bubble is a Brazilian state; size = order volume. Regression line with 95% CI.",
    x        = "% Orders Arriving Late",
    y        = "Average Review Score",
    caption  = "Source: Olist datasets · Pearson r computed at order level"
  ) +
  theme_olist

Figure 5. State-level trade-off between late rate and satisfaction score

6.3 Brazil Map: Order Volume and Late Delivery Rate by State

The scatter plot above strips out spatial context. The map below restores it: each bubble sits at its state’s approximate centroid, sized by order volume and coloured by late delivery rate, overlaid on a Brazil country outline.

# Country outline (base map layer)
brazil_outline <- ggplot2::map_data("world", region = "Brazil")

# Attach centroid coordinates to the state-level summary table built earlier
state_map_data <- state_summary %>%
  inner_join(state_coords, by = c("customer_state" = "state"))

ggplot() +
  geom_polygon(data = brazil_outline, aes(x = long, y = lat, group = group),
               fill = "#f4f6f7", colour = "#bdc3c7", linewidth = 0.3) +
  geom_point(data = state_map_data,
             aes(x = lon, y = lat, size = n, colour = pct_late),
             alpha = 0.85) +
  geom_text(data = state_map_data,
            aes(x = lon, y = lat, label = customer_state),
            size = 3, fontface = "bold", colour = "#2c3e50",
            nudge_y = 0.6, check_overlap = TRUE) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% Late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(3, 22), name = "Order Volume",
                        labels = comma_format()) +
  coord_fixed(1.05) +
  labs(
    title    = "Order Volume and Late Delivery Rate by Brazilian State",
    subtitle = "Bubble size = number of orders · Bubble colour = % delivered late · Centroids approximate state capitals",
    x = NULL, y = NULL,
    caption  = "Source: Olist orders + customers datasets · State outline: maps package (Natural Earth-derived data)"
  ) +
  theme_olist +
  theme(
    axis.text       = element_blank(),
    panel.grid      = element_blank(),
    legend.position = "right"
  )

Figure 6. Geographic distribution of order volume and late delivery rate across Brazil

Reading the map: São Paulo (SP) dominates order volume — the largest bubble by far, reflecting Brazil’s commercial concentration in the South-East. Note that the reddest bubbles (highest late %) cluster in the North and North-East (AL, MA, PI, CE, SE), far from the South-East commercial core. This spatial clustering is the first visual hint that distance from the country’s logistics hub may be a structural driver of lateness — explored directly in the next section.

6.4 Does Distance From Seller Predict Lateness?

So far we’ve shown that late deliveries correlate with destination state, but state is a coarse proxy. A cleaner test: for each order, compute the great-circle distance between the seller’s state and the customer’s state, then ask whether longer shipping distances are associated with more lateness and lower satisfaction — independent of which specific state is involved.

distance_summary <- df %>%
  filter(!is.na(distance_bucket)) %>%
  group_by(distance_bucket) %>%
  summarise(
    pct_late   = mean(late, na.rm = TRUE) * 100,
    avg_score  = mean(review_score, na.rm = TRUE),
    avg_delay  = mean(delay_days, na.rm = TRUE),
    n          = n()
  )

distance_summary %>%
  rename(`Distance Bucket` = distance_bucket, `% Late` = pct_late,
         `Avg Score` = avg_score, `Avg Delay (days)` = avg_delay, `N Orders` = n) %>%
  mutate(`% Late` = round(`% Late`, 1), `Avg Score` = round(`Avg Score`, 2),
         `Avg Delay (days)` = round(`Avg Delay (days)`, 1)) %>%
  kbl(caption = "Table 7. Delivery outcomes by seller→customer distance") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE, color = col_late)

Table 7. Delivery outcomes by seller→customer distance
Distance Bucket	% Late	Avg Score	Avg Delay (days)	N Orders
Same state	6.0	4.25	-9.6	34874
1–500 km	8.6	4.10	-11.8	32599
500–1000 km	8.3	4.11	-12.5	15709
1000–1500 km	10.5	4.05	-12.0	5640
1500–2000 km	13.2	4.00	-11.7	1880
2000–3000 km	13.2	3.98	-12.1	6160
3000+ km	12.0	3.96	-16.9	142

p_dist1 <- ggplot(distance_summary, aes(x = distance_bucket, y = pct_late, group = 1)) +
  geom_col(fill = col_primary, width = 0.6, alpha = 0.85) +
  geom_text(aes(label = paste0(round(pct_late,1), "%")), vjust = -0.5, size = 3.2,
            colour = "#2c3e50", fontface = "bold") +
  geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
              colour = col_late, linetype = "dashed", linewidth = 0.9) +
  scale_y_continuous(limits = c(0, max(distance_summary$pct_late) * 1.25),
                     labels = function(x) paste0(x, "%")) +
  labs(title = "Late Delivery Rate Rises With Shipping Distance",
       x = NULL, y = "% Orders Late") +
  theme_olist

p_dist2 <- ggplot(distance_summary, aes(x = distance_bucket, y = avg_score, group = 1)) +
  geom_col(fill = col_accent, width = 0.6, alpha = 0.85) +
  geom_text(aes(label = round(avg_score, 2)), vjust = -0.5, size = 3.2,
            colour = "#2c3e50", fontface = "bold") +
  geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
              colour = col_late, linetype = "dashed", linewidth = 0.9) +
  scale_y_continuous(limits = c(0, 5)) +
  labs(title = "Average Review Score Falls With Shipping Distance",
       x = "Seller → Customer Distance", y = "Avg Review Score",
       caption = "Source: Olist orders, items, sellers, customers, reviews · Distance = haversine between state centroids") +
  theme_olist +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

p_dist1 / p_dist2

Figure 7. Late delivery rate and average review score as a function of seller-to-customer distance

Key observation: Orders shipped within the same state arrive late only 6.0% of the time; once shipping distance exceeds 1,500 km, the late rate climbs above 13% — more than double. Average review score declines in step, from 4.25 (same state) to roughly 3.96–4.00 beyond 1,500 km. Distance alone is a modest predictor (Pearson r ≈ -0.08 with delay days — a real but small effect), which suggests distance is a contributing factor layered on top of carrier and regional infrastructure issues, not the sole driver.

7 Category-Level Deep Dive

7.1 Which Product Categories Have Worst Delivery Records?

cat_summary <- df %>%
  filter(!is.na(category_en)) %>%
  group_by(category_en) %>%
  summarise(
    pct_late  = mean(late, na.rm=TRUE) * 100,
    avg_score = mean(review_score, na.rm=TRUE),
    n         = n()
  ) %>%
  filter(n >= 200) %>%
  arrange(desc(pct_late)) %>%
  slice(c(1:12, (n()-3):n())) %>%   # top 12 worst + 4 best for comparison
  mutate(
    group     = ifelse(row_number() <= 12, "Highest late rate", "Lowest late rate"),
    category_en = reorder(category_en, pct_late)
  )

ggplot(cat_summary, aes(x = category_en, y = pct_late, fill = group)) +
  geom_col(width = 0.7, colour = "white") +
  geom_text(aes(label = paste0(round(pct_late,1), "%")),
            hjust = -0.15, size = 3, colour = "#2c3e50") +
  scale_fill_manual(values = c("Highest late rate" = col_late, "Lowest late rate" = col_ontime),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 22), labels = function(x) paste0(x, "%")) +
  coord_flip() +
  labs(
    title    = "Late Delivery Rate by Product Category",
    subtitle = "Top 12 worst-performing vs 4 best-performing categories (min. 200 orders each)",
    x        = NULL,
    y        = "% Orders Arriving Late",
    caption  = "Source: Olist order items, products, category translation, orders datasets"
  ) +
  theme_olist +
  theme(legend.position = "top")

Figure 8. Product categories ranked by late delivery rate (min. 200 orders)

cat_full <- df %>%
  filter(!is.na(category_en)) %>%
  group_by(category_en) %>%
  summarise(
    Orders    = n(),
    Pct_Late  = paste0(round(mean(late, na.rm=TRUE)*100, 1), "%"),
    Avg_Score = round(mean(review_score, na.rm=TRUE), 2),
    Avg_Delay = round(mean(delay_days, na.rm=TRUE), 1)
  ) %>%
  filter(as.numeric(gsub(",","",Orders)) >= 200) %>%
  arrange(desc(Pct_Late)) %>%
  rename(`Category` = category_en, `N Orders` = Orders,
         `% Late` = Pct_Late, `Avg Score` = Avg_Score,
         `Avg Delay (days)` = Avg_Delay) %>%
  head(15)

cat_full %>%
  kbl(caption = "Table 5. Top 15 product categories by late delivery rate (≥200 orders)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(3, bold = TRUE, color = col_late) %>%
  column_spec(4, bold = TRUE, color = col_primary)

Table 5. Top 15 product categories by late delivery rate (≥200 orders)
Category	N Orders	% Late	Avg Score	Avg Delay (days)
electronics	2509	9.8%	4.11	-10.3
baby	2771	9.3%	4.11	-10.7
office_furniture	1251	9.3%	3.64	-11.0
construction_tools_construction	729	9.2%	4.12	-10.3
construction_tools_lights	232	9.1%	4.18	-10.6
health_beauty	8662	9%	4.22	-11.3
bed_bath_table	9287	8.9%	3.99	-10.6
musical_instruments	608	8.9%	4.22	-10.8
furniture_living_room	412	8.7%	4.07	-10.9
auto	3808	8.6%	4.14	-10.5
watches_gifts	5480	8.6%	4.11	-11.1
furniture_decor	6268	8.5%	4.06	-11.6
telephony	4080	8.5%	4.05	-10.5
industry_commerce_and_business	229	8.3%	4.21	-11.5
consoles_games	1015	8.2%	4.16	-10.8

8 Statistical Summary

8.1 Correlation: Delay Days vs Review Score

cor_val <- cor(df$delay_days, df$review_score, use = "complete.obs")
cor_test <- cor.test(df$delay_days, df$review_score, use = "complete.obs")

cat(sprintf("Pearson correlation (delay_days ~ review_score): r = %.4f\n", cor_val))

## Pearson correlation (delay_days ~ review_score): r = -0.2688

cat(sprintf("p-value: < 2.2e-16 (sample size n = %d)\n", sum(!is.na(df$delay_days) & !is.na(df$review_score))))

## p-value: < 2.2e-16 (sample size n = 97004)

cat(sprintf("Interpretation: A negative correlation of %.2f indicates that as lateness increases,\nreview scores systematically decline.\n", cor_val))

## Interpretation: A negative correlation of -0.27 indicates that as lateness increases,
## review scores systematically decline.

comparison <- df %>%
  group_by(Delivery_Status = ifelse(late, "Late", "On-Time")) %>%
  summarise(
    N            = n(),
    Mean_Score   = round(mean(review_score, na.rm=TRUE), 3),
    Median_Score = median(review_score, na.rm=TRUE),
    SD_Score     = round(sd(review_score, na.rm=TRUE), 3),
    Pct_1_star   = paste0(round(mean(review_score == 1, na.rm=TRUE)*100, 1), "%"),
    Pct_5_star   = paste0(round(mean(review_score == 5, na.rm=TRUE)*100, 1), "%")
  )

comparison %>%
  kbl(caption = "Table 6. Review score statistics by delivery punctuality") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(1, background = "#fce4e4") %>%
  row_spec(2, background = "#eafaf1")

Table 6. Review score statistics by delivery punctuality
Delivery_Status	N	Mean_Score	Median_Score	SD_Score	Pct_1_star	Pct_5_star
Late	7863	2.546	2	1.654	46.7%	21.8%
On-Time	89141	4.283	5	1.159	6.8%	62.2%

9 Toward Causal Inference: Isolating the Effect of Latency from Other Factors

Why this section exists. Everything so far has been a correlation: late orders and low review scores co-occur. But correlation can be confounded — maybe late orders are also disproportionately heavy items, long-distance shipments, low-quality listings, or high-installment purchases, and those traits (not the lateness itself) drive dissatisfaction. To get closer to a defensible “pure effect” of latency, we now hold a set of plausible alternative explanations constant and ask: does delay still matter once we control for them?

Method, and its honest limits. We are not running a randomised experiment, so we cannot claim true causal identification — there is no instrument, no natural experiment, no randomised assignment of delay to orders. What we can do is fit two different statistical models — a linear regression (assumes additive, linear relationships) and a random forest (captures non-linear effects and interactions) — on the same set of controls, and see whether delay’s apparent effect survives, shrinks, or disappears. Agreement between two structurally different models is suggestive evidence the effect is real and not a modelling artefact — though residual confounding from unmeasured factors (product defects, customer service quality, packaging) remains possible.

9.1 Building the Modelling Dataset

We construct an order-level feature table with one row per delivered, reviewed order, aggregating item-level details (price, freight, weight, item count, listing photo count) up to the order, and adding payment installments and the seller→customer distance computed earlier.

# Aggregate item-level data to order level
items_prod_model <- items %>%
  inner_join(products %>% select(product_id, product_category_name, product_weight_g, product_photos_qty),
             by = "product_id") %>%
  left_join(cat_trans, by = "product_category_name")

order_item_agg <- items_prod_model %>%
  group_by(order_id) %>%
  summarise(
    total_price      = sum(price, na.rm = TRUE),
    total_freight     = sum(freight_value, na.rm = TRUE),
    item_count        = n(),
    total_weight_kg    = sum(product_weight_g, na.rm = TRUE) / 1000,
    avg_photos_qty      = mean(product_photos_qty, na.rm = TRUE),
    primary_category    = first(product_category_name_english)
  )

# Aggregate payments to order level (orders can have split payments)
payment_agg <- payments %>%
  group_by(order_id) %>%
  summarise(max_installments = max(payment_installments, na.rm = TRUE))

# Assemble the modelling frame from the already-cleaned `delivered` and `df` objects
model_df <- delivered %>%
  inner_join(reviews %>% select(order_id, review_score), by = "order_id") %>%
  inner_join(order_item_agg, by = "order_id") %>%
  inner_join(payment_agg, by = "order_id") %>%
  left_join(df %>% select(order_id, distance_km), by = "order_id") %>%
  mutate(dissatisfaction = 5 - review_score)   # 0 = most satisfied, 4 = most dissatisfied

# Collapse to top 9 categories + "other" to keep the model parsimonious
top_cats <- model_df %>% count(primary_category, sort = TRUE) %>% slice(1:9) %>% pull(primary_category)
model_df$category_grp <- ifelse(model_df$primary_category %in% top_cats, model_df$primary_category, "other")
model_df$category_grp[is.na(model_df$category_grp)] <- "other"

model_vars <- c("dissatisfaction","delay_days","distance_km","total_price","total_freight",
                "item_count","total_weight_kg","avg_photos_qty","max_installments","category_grp")
model_df_clean <- model_df %>% select(all_of(model_vars)) %>% na.omit()
model_df_clean$category_grp <- relevel(factor(model_df_clean$category_grp), ref = "other")

cat(sprintf("Modelling dataset: %d orders × %d features (target = dissatisfaction, 0–4 scale)\n",
            nrow(model_df_clean), ncol(model_df_clean) - 1))

## Modelling dataset: 96729 orders × 9 features (target = dissatisfaction, 0–4 scale)

cat(sprintf("Dropped %d orders (%.1f%%) due to missing values in one or more model features.\n",
            nrow(model_df) - nrow(model_df_clean), (1 - nrow(model_df_clean)/nrow(model_df))*100))

## Dropped 1356 orders (1.4%) due to missing values in one or more model features.

feature_doc <- data.frame(
  Feature = c("delay_days","distance_km","total_price","total_freight","item_count",
              "total_weight_kg","avg_photos_qty","max_installments","category_grp"),
  Description = c(
    "Days late (negative = early) — the variable of interest",
    "Haversine distance, seller state → customer state (km)",
    "Sum of item prices in the order (R$)",
    "Sum of freight charges in the order (R$)",
    "Number of items in the order",
    "Total shipped weight (kg)",
    "Average product listing photo count (proxy for listing quality)",
    "Maximum payment installments chosen",
    "Product category (top 9 + 'other'), proxy for product-type-specific quality issues"
  ),
  Role = c("Treatment of interest", rep("Control / alternative explanation", 8))
)
feature_doc %>%
  kbl(caption = "Table 8. Features in the dissatisfaction model and their role") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(1, monospace = TRUE) %>%
  row_spec(1, bold = TRUE, background = "#fef9e7")

Table 8. Features in the dissatisfaction model and their role
Feature	Description	Role
delay_days	Days late (negative = early) — the variable of interest	Treatment of interest
distance_km	Haversine distance, seller state → customer state (km)	Control / alternative explanation
total_price	Sum of item prices in the order (R\() </td> <td style="text-align:left;"> Control / alternative explanation </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> total_freight </td> <td style="text-align:left;"> Sum of freight charges in the order (R\))	Control / alternative explanation
item_count	Number of items in the order	Control / alternative explanation
total_weight_kg	Total shipped weight (kg)	Control / alternative explanation
avg_photos_qty	Average product listing photo count (proxy for listing quality)	Control / alternative explanation
max_installments	Maximum payment installments chosen	Control / alternative explanation
category_grp	Product category (top 9 + ‘other’), proxy for product-type-specific quality issues	Control / alternative explanation

9.2 Model 1 — Linear Regression

lm_fit <- lm(
  dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
    item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
  data = model_df_clean
)

lm_summary <- summary(lm_fit)
cat(sprintf("R² = %.3f | Adjusted R² = %.3f | n = %d\n",
            lm_summary$r.squared, lm_summary$adj.r.squared, nrow(model_df_clean)))

## R² = 0.100 | Adjusted R² = 0.100 | n = 96729

coef_tbl <- as.data.frame(lm_summary$coefficients)
coef_tbl$term <- rownames(coef_tbl)
names(coef_tbl) <- c("Estimate","Std_Error","t_value","p_value","term")
coef_tbl <- coef_tbl %>%
  filter(term != "(Intercept)") %>%
  mutate(
    Significance = case_when(
      p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
      p_value < 0.05  ~ "*",   TRUE ~ ""
    ),
    Estimate = round(Estimate, 4),
    p_value  = ifelse(p_value < 0.0001, "<0.0001", as.character(round(p_value, 4)))
  ) %>%
  select(term, Estimate, p_value, Significance) %>%
  arrange(desc(abs(as.numeric(Estimate))))

coef_tbl %>%
  rename(Term = term, `p-value` = p_value) %>%
  kbl(caption = "Table 9. Linear regression coefficients (dissatisfaction = 5 − review_score)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(coef_tbl$term == "delay_days"), bold = TRUE, background = "#fef9e7")

Table 9. Linear regression coefficients (dissatisfaction = 5 − review_score)
	Term	Estimate	p-value	Significance
item_count	item_count	0.2900	<0.0001	***
category_grpbed_bath_table	category_grpbed_bath_table	0.1665	<0.0001	***
category_grptelephony	category_grptelephony	0.1236	<0.0001	***
category_grpcomputers_accessories	category_grpcomputers_accessories	0.1046	<0.0001	***
category_grpfurniture_decor	category_grpfurniture_decor	0.0903	<0.0001	***
category_grpwatches_gifts	category_grpwatches_gifts	0.0782	<0.0001	***
category_grphealth_beauty	category_grphealth_beauty	-0.0460	0.0018	**
category_grpsports_leisure	category_grpsports_leisure	-0.0360	0.0196
delay_days	delay_days	0.0355	<0.0001	***
category_grphousewares	category_grphousewares	-0.0190	0.2758
category_grpauto	category_grpauto	0.0169	0.4159
max_installments	max_installments	0.0074	<0.0001	***
avg_photos_qty	avg_photos_qty	-0.0066	0.0046	**
total_weight_kg	total_weight_kg	0.0055	<0.0001	***
total_freight	total_freight	0.0007	0.016
distance_km	distance_km	0.0002	<0.0001	***
total_price	total_price	0.0000	0.9713

Reading the delay_days coefficient: holding distance, price, freight, item count, weight, listing quality, installments, and category constant, each additional day of lateness is associated with a 0.0355 point increase in the dissatisfaction index (on a 0–4 scale). That is the model’s best estimate of latency’s direct, controlled effect, net of the other eight factors.

9.3 Which Factor Uniquely Explains the Most Variance?

A coefficient’s size depends on its variable’s units (days vs. kilometres vs. Reais), so coefficients alone aren’t comparable across factors. Instead we use drop1() — for each predictor, how much does model fit (R²) degrade if we removed only that predictor, leaving everything else in? This partial R² is unit-free and directly comparable.

d1 <- drop1(lm_fit, test = "F")
total_rss <- d1["<none>", "RSS"]

partial_r2_tbl <- as.data.frame(d1) %>%
  mutate(term = rownames(d1)) %>%
  filter(term != "<none>") %>%
  mutate(
    partial_r2 = `Sum of Sq` / (`Sum of Sq` + total_rss),
    term = recode(term,
      delay_days = "Delivery Delay (days)",
      distance_km = "Shipping Distance (km)",
      total_price = "Order Price",
      total_freight = "Freight Cost",
      item_count = "Item Count",
      total_weight_kg = "Shipped Weight",
      avg_photos_qty = "Listing Photo Count",
      max_installments = "Payment Installments",
      category_grp = "Product Category"
    )
  ) %>%
  arrange(desc(partial_r2))

partial_r2_tbl %>%
  select(term, partial_r2, `Pr(>F)`) %>%
  mutate(partial_r2 = paste0(round(partial_r2*100, 2), "%"),
         `Pr(>F)` = ifelse(`Pr(>F)` < 0.0001, "<0.0001", round(`Pr(>F)`,4))) %>%
  rename(Factor = term, `Unique Variance Explained` = partial_r2, `p-value` = `Pr(>F)`) %>%
  kbl(caption = "Table 10. Partial R² — unique contribution of each factor, controlling for all others") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#fef9e7")

Table 10. Partial R² — unique contribution of each factor, controlling for all others
	Factor	Unique Variance Explained	p-value
delay_days	Delivery Delay (days)	7.88%	<0.0001
item_count	Item Count	1.23%	<0.0001
distance_km	Shipping Distance (km)	0.54%	<0.0001
category_grp	Product Category	0.28%	<0.0001
max_installments	Payment Installments	0.02%	<0.0001
total_weight_kg	Shipped Weight	0.02%	<0.0001
avg_photos_qty	Listing Photo Count	0.01%	0.0046
total_freight	Freight Cost	0.01%	0.016
total_price	Order Price	0%	0.9713

ggplot(partial_r2_tbl, aes(x = reorder(term, partial_r2), y = partial_r2*100, fill = term == "Delivery Delay (days)")) +
  geom_col(width = 0.65) +
  geom_text(aes(label = paste0(round(partial_r2*100,2), "%")), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
  scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
  scale_y_continuous(limits = c(0, max(partial_r2_tbl$partial_r2*100)*1.25)) +
  coord_flip() +
  labs(
    title    = "Delivery Delay Dwarfs Every Other Factor in Explanatory Power",
    subtitle = "Partial R² from linear regression — the % of dissatisfaction variance uniquely attributable to each factor",
    x = NULL, y = "Unique Variance Explained (%)",
    caption  = "Source: order-level model with 9 controls, n = 95,666 · Linear regression, drop1() Type II test"
  ) +
  theme_olist

Figure 9. Linear model: unique variance in dissatisfaction explained by each factor

9.4 Model 2 — Random Forest (Non-Linear Cross-Check)

Linear regression assumes additive, straight-line effects. A random forest makes no such assumption — it can detect thresholds, plateaus, and interactions. If delay still ranks #1 here, the finding is robust to model choice, not an artefact of assuming linearity.

set.seed(42)
rf_fit <- ranger(
  dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
    item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
  data = model_df_clean,
  num.trees = 200,
  importance = "permutation",
  seed = 42,
  num.threads = 1
)

cat(sprintf("Random Forest OOB R² = %.3f (vs. linear model R² = %.3f)\n",
            rf_fit$r.squared, lm_summary$r.squared))

## Random Forest OOB R² = 0.194 (vs. linear model R² = 0.100)

imp_tbl <- data.frame(
  term = names(importance(rf_fit)),
  importance = importance(rf_fit)
) %>%
  mutate(term = recode(term,
    delay_days = "Delivery Delay (days)",
    distance_km = "Shipping Distance (km)",
    total_price = "Order Price",
    total_freight = "Freight Cost",
    item_count = "Item Count",
    total_weight_kg = "Shipped Weight",
    avg_photos_qty = "Listing Photo Count",
    max_installments = "Payment Installments",
    category_grp = "Product Category"
  )) %>%
  arrange(desc(importance))

ggplot(imp_tbl, aes(x = reorder(term, importance), y = importance,
                    fill = term == "Delivery Delay (days)")) +
  geom_col(width = 0.65) +
  geom_text(aes(label = round(importance, 3)), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
  scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
  scale_y_continuous(limits = c(0, max(imp_tbl$importance)*1.25)) +
  coord_flip() +
  labs(
    title    = "Random Forest Independently Confirms: Delay Is the Top Predictor",
    subtitle = "Permutation importance — how much OOB prediction error rises when each variable is shuffled",
    x = NULL, y = "Permutation Importance (MSE increase)",
    caption  = "Source: ranger random forest, 200 trees, n = 95,666"
  ) +
  theme_olist

Figure 10. Random forest permutation importance for predicting dissatisfaction

Both models agree. In the linear model, delay’s partial R² (7.9%) is roughly 6.4× larger than the second-ranked factor. In the random forest, delay’s importance score is also ranked #1, well ahead of freight cost and order price. Two structurally different models, same conclusion: delivery delay is not merely correlated with dissatisfaction — it remains the dominant factor after controlling for the eight most plausible alternative explanations available in this dataset.

9.5 The Shape of the Effect: Threshold, Not Slope

A regression coefficient implies a constant, linear effect — every extra day of lateness adds the same fixed amount of dissatisfaction. Does that match reality? We use the random forest to trace out the partial dependence of predicted dissatisfaction on delay_days, holding every other factor fixed at its typical (median/most-common) value, and compare it to the linear model’s (necessarily straight-line) prediction.

typical <- model_df_clean %>%
  summarise(
    distance_km       = median(distance_km),
    total_price        = median(total_price),
    total_freight        = median(total_freight),
    item_count            = round(median(item_count)),
    total_weight_kg         = median(total_weight_kg),
    avg_photos_qty            = median(avg_photos_qty),
    max_installments         = round(median(max_installments))
  )
mode_cat <- names(sort(table(model_df_clean$category_grp), decreasing = TRUE))[1]

delay_grid <- c(seq(-20, 20, by = 1), seq(24, 60, by = 4))
pd_grid <- data.frame(
  delay_days        = delay_grid,
  distance_km       = typical$distance_km,
  total_price        = typical$total_price,
  total_freight        = typical$total_freight,
  item_count            = typical$item_count,
  total_weight_kg         = typical$total_weight_kg,
  avg_photos_qty            = typical$avg_photos_qty,
  max_installments         = typical$max_installments,
  category_grp = factor(mode_cat, levels = levels(model_df_clean$category_grp))
)
pd_grid$pred_lm <- predict(lm_fit, newdata = pd_grid)
pd_grid$pred_rf <- predict(rf_fit, data = pd_grid)$predictions

pd_long <- pd_grid %>%
  select(delay_days, pred_lm, pred_rf) %>%
  pivot_longer(cols = c(pred_lm, pred_rf), names_to = "model", values_to = "predicted") %>%
  mutate(model = recode(model, pred_lm = "Linear Model (assumes constant slope)",
                        pred_rf = "Random Forest (data-driven shape)"))

ggplot(pd_long, aes(x = delay_days, y = predicted, colour = model)) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = col_neutral, linewidth = 0.7) +
  annotate("text", x = 0.5, y = 0.2, label = "On-time boundary", angle = 90,
           colour = col_neutral, size = 3, hjust = 0) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 1.6, alpha = 0.7) +
  scale_colour_manual(values = c("Linear Model (assumes constant slope)" = col_neutral,
                                  "Random Forest (data-driven shape)" = col_late),
                      name = NULL) +
  labs(
    title    = "Dissatisfaction Jumps at the On-Time/Late Boundary, Then Plateaus",
    subtitle = "Both models hold distance, price, freight, item count, weight, category, and installments fixed at typical values",
    x        = "Delivery Delay (days; negative = early)",
    y        = "Predicted Dissatisfaction (0 = most satisfied, 4 = most dissatisfied)",
    caption  = "Source: same modelling dataset as Figures 9–10 · Random forest predictions averaged across 200 trees"
  ) +
  theme_olist

Figure 11. Partial effect of delivery delay on predicted dissatisfaction, controlling for all other factors

This is the most actionable finding in the report. The random forest reveals a threshold effect, not a smooth slope: predicted dissatisfaction roughly triples between 0 and 4 days late, then largely plateaus — an order 30 days late is not dramatically worse, in this model, than one 5 days late. The linear model, by construction, cannot see this — it forces a straight line through the same data and understates the damage of barely missing the promise while overstating the marginal harm of being very late. The practical implication: the highest-value intervention is not “ship faster on average” but “protect the promised date itself.” Shaving the worst delays from 30 days to 15 may matter less to customer sentiment than converting a 2-day miss into an on-time delivery.

10 Business Insights

10.0.1 🔍 Insight 1 — Late Delivery Is a Satisfaction Catastrophe, Not Just an Inconvenience

Late orders average a review score of 2.55 vs 4.28 for on-time orders — a gap of 1.73 points on a 5-point scale. Late orders are also 4× more likely to receive a 1-star review. With 8.1% of orders arriving late across ~96,000 deliveries, this represents a structurally embedded drag on platform reputation. Reducing the late rate by even half would meaningfully shift the platform’s aggregate NPS signal.

Caution: The analysis is observational. It is possible that other factors correlated with lateness (e.g. product type, seller reliability, distance) partially drive the score gap rather than lateness alone.

10.0.2 🔍 Insight 2 — Northern and North-Eastern States Are a Geographic Priority, and Distance Is Part of the Story

States like AL (23.9% late), MA (19.6%), PI (15.9%), CE (15.4%) and SE (15.2%) significantly exceed the national average of ~8%, as the Brazil map (Figure 6) shows visually — the reddest, highest-late-rate bubbles cluster away from the South-East commercial core where most sellers are based. The distance analysis (Figure 7) confirms this is not coincidental: late rate rises from 6.0% for same-state shipments to over 13% beyond 1,500 km, with review scores declining in step. However, the correlation between raw distance and delay is modest (r ≈ -0.08), meaning distance is a real but partial driver — regional carrier capacity and road infrastructure likely compound the effect specifically in the North/North-East. Targeted carrier partnerships or regional fulfilment hubs closer to these states would address both the distance penalty and the regional infrastructure gap simultaneously.

Caution: These states also have smaller order volumes, so the late-rate estimates carry higher statistical uncertainty. State centroids are an approximation (state capital coordinates), not actual shipment origin/destination points, so the distance metric is directional rather than precise.

10.0.3 🔍 Insight 3 — The Problem Is Specifically Late, Not Just Slow

The average actual delivery time is 12.6 days, but the median delay relative to the estimate is negative (i.e. most orders arrive early). This means the core issue is not overall speed but reliability of the promised date. Customers who receive a parcel 8 days early are moderately happy; customers who receive it 3 days late leave negative reviews. Olist’s lever here may be better delivery-date promise calibration (setting more conservative estimates that are reliably met) rather than necessarily accelerating logistics operations end-to-end.

Caution: We do not have visibility into whether Olist or the seller sets the delivery estimate. The appropriate corrective action differs substantially between these two scenarios.

10.0.4 🔍 Insight 4 — The Effect Survives Controls, and It’s a Threshold, Not a Slope

When we control for shipping distance, order price, freight cost, item count, shipped weight, listing quality, payment installments, and product category — using both a linear regression and an independent random forest model — delivery delay remains the single strongest predictor of dissatisfaction, explaining roughly 8× more unique variance than the next-best factor (item count) and ranking first in random forest importance by a wide margin. More importantly, the random forest reveals the effect is not a smooth slope: dissatisfaction roughly triples in the first few days after the promised date, then plateaus. This reframes the operational priority from “reduce average delay” to “protect the promised date” — converting narrow misses into on-time deliveries likely matters more to sentiment than shortening already-large delays.

Caution: This is a controlled association across two modelling approaches, not a randomised-experiment-level causal estimate. Both models’ R² remain modest (0.10–0.19), meaning most of the variation in review scores comes from factors not captured here — most plausibly product quality and condition-on-arrival, which this dataset cannot observe directly.

11 Limitations

11.0.1 ⚠️ Limitation 1 — Controlled, Not Randomised: Residual Confounding Remains Possible

The causal-inference section controls for eight plausible alternative explanations (distance, price, freight, item count, weight, listing quality, installments, category) and the delay effect survives across two different model types — meaningfully stronger evidence than a raw correlation. But this is not a randomised experiment: no order was ever randomly assigned to be late or on-time, so unmeasured confounders could still bias the estimate. The most likely candidates are product condition on arrival, packaging quality, and seller communication/responsiveness — none of which exist as columns in this dataset, and none of which can be ruled out. The modest R² of both models (0.10–0.19) is itself an honest signal: delay is the strongest single factor we can measure, but it is far from the only one driving review scores.

11.0.2 ⚠️ Limitation 2 — Dataset Scope and Temporal Coverage

The dataset covers 2016–2018 and represents a specific growth phase of Brazilian e-commerce. Logistical infrastructure, carrier networks, and customer expectations have evolved substantially since then. State-level findings (especially for low-volume states like AL with 401 orders or TO with 274 orders) carry high sampling uncertainty and should not be used to make multi-million dollar capital allocation decisions without refreshed data. Additionally, ~3,500 orders were dropped due to missing delivery dates, and orders with statuses other than “delivered” (cancellations, returns) are excluded — this may introduce survivor bias if late orders are more likely to be cancelled. The seller→customer distance metric uses state-capital centroids as a proxy for actual origin/destination coordinates (zip-code-level latitude/longitude was not available in this dataset), so it should be read as a directional signal of geographic distance rather than a precise shipping-route measurement.

12 AI Audit Note

12.0.1 🤖 AI Audit Note

This report was structured and drafted with AI assistance (Claude, Anthropic). The following transparency disclosures apply:

What AI contributed: narrative framing, ggplot2 code structure, section organisation, feature engineering choices for the regression/random forest models, and identification of candidate insights from descriptive statistics and model output.

What AI did not do: modify or simulate the underlying data, cherry-pick which model results to report, tune hyperparameters to produce a more dramatic result, or claim randomised-experiment-level causality.

Verification checklist: - ✅ All figures and models can be reproduced by re-knitting olist_delivery_analysis.Rmd with the original eight CSV files - ✅ No imputation or synthetic data generation was applied; rows with missing model features are dropped via na.omit(), with the drop rate explicitly reported - ✅ Correlation is Pearson on raw observations, not aggregated means (which would inflate r) - ✅ Two structurally different models (linear regression, random forest) were fit on identical features and the same target; their agreement is reported as corroborating evidence, not proof - ✅ Partial R² (drop1) and permutation importance (not impurity-based, which is biased toward high-cardinality variables) were used for fair cross-factor comparison - ⚠️ Causality is explicitly not claimed — the report uses “controlled association,” “partial effect,” and “net of measured confounders” rather than “causes” - ⚠️ Both models’ modest R² (0.10–0.19) is reported prominently rather than hidden, to avoid overstating how much of dissatisfaction this analysis actually explains - ⚠️ Human review is recommended before sharing insights with Olist operations or logistics teams, particularly for state-level investment recommendations given the small-n concern, and before treating the delay coefficient as an exact dollar-equivalent effect size

Recommended next steps for validation: a true natural experiment (e.g. comparing review scores before/after a carrier SLA change in a specific state, holding the order mix constant); sentiment analysis of free-text review comments to separate delivery complaints from product complaints; a seller-level random effect to absorb seller-specific quality that this model cannot currently isolate from delay.

cat("Report generated on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n")

## Report generated on: 2026-06-16 13:00:56 CEST

cat("R version:", R.version$version.string, "\n")

## R version: R version 4.6.0 (2026-04-24 ucrt)

cat(sprintf("Key packages: ggplot2 %s | dplyr %s | patchwork %s\n",
    as.character(packageVersion("ggplot2")),
    as.character(packageVersion("dplyr")),
    as.character(packageVersion("patchwork"))))

## Key packages: ggplot2 4.0.3 | dplyr 1.2.1 | patchwork 1.3.2

End of Report · Olist Delivery Performance Analysis · Reproducible R Markdown

How Does Delivery Performance Affect Customer Satisfaction?

A Data-Driven Investigation of Olist E-Commerce Operations

Analytical Report · Olist Dataset

June 16, 2026