1 Executive Summary

Research Question: Does late delivery reduce customer satisfaction on the Olist platform — and does that effect hold up once we account for other plausible explanations — and where should Olist target operational improvement first?

This report analyses ~97,000 delivered orders across the Olist Brazilian e-commerce dataset (2016–2018), joining eight relational tables — orders, reviews, customers, order items, products, category translations, sellers, and payments — into a unified analytical frame. The headline finding: late deliveries are associated with a 1.74-point drop in average review score (on a 1–5 scale), concentrated in specific Brazilian states and product categories. We then go a step further and test whether this effect survives controlling for distance, price, freight, item count, weight, listing quality, payment installments, and product category, using both linear regression and random forest models. It does — delay remains the dominant predictor of dissatisfaction in both models, and a random forest reveals the effect is a threshold (a sharp jump right at the on-time/late boundary, then a plateau) rather than a steady slope, which has direct operational implications for where Olist should focus first.


2 Setup and Data Import

2.1 Libraries

All analysis uses the R tidyverse ecosystem. Charts are produced with ggplot2. A consistent colour palette (#2980b9 for primary, #e74c3c for alert, #27ae60 for positive) is used throughout.

library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(scales)
library(kableExtra)
library(patchwork)
library(maps)
library(ranger)
# Shared visual identity
col_primary  <- "#2980b9"
col_late     <- "#e74c3c"
col_ontime   <- "#27ae60"
col_neutral  <- "#95a5a6"
col_accent   <- "#f39c12"

theme_olist <- theme_minimal(base_size = 13) +
  theme(
    plot.title       = element_text(face = "bold", size = 14, colour = "#2c3e50"),
    plot.subtitle    = element_text(colour = "#7f8c8d", size = 11),
    plot.caption     = element_text(colour = "#95a5a6", size = 9, hjust = 0),
    axis.title       = element_text(colour = "#555555", size = 11),
    axis.text        = element_text(colour = "#555555"),
    panel.grid.major = element_line(colour = "#ecf0f1"),
    panel.grid.minor = element_blank(),
    legend.position  = "top",
    strip.text       = element_text(face = "bold")
  )

2.2 Loading Raw Tables

Six CSV files are loaded from the project directory. We perform basic structural checks immediately after import.

data_path <- "data/"

orders    <- read.csv(paste0(data_path, "olist_orders_dataset.csv"),              stringsAsFactors = FALSE)
reviews   <- read.csv(paste0(data_path, "olist_order_reviews_dataset.csv"),       stringsAsFactors = FALSE)
customers <- read.csv(paste0(data_path, "olist_customers_dataset.csv"),            stringsAsFactors = FALSE)
items     <- read.csv(paste0(data_path, "olist_order_items_dataset.csv"),          stringsAsFactors = FALSE)
products  <- read.csv(paste0(data_path, "olist_products_dataset.csv"),             stringsAsFactors = FALSE)
cat_trans <- read.csv(paste0(data_path, "product_category_name_translation.csv"),
                       stringsAsFactors = FALSE, fileEncoding = "UTF-8-BOM")
# Normalise BOM-affected column name (appears as "X...product_category_name" on some systems)
if (!"product_category_name" %in% names(cat_trans)) {
  names(cat_trans)[1] <- "product_category_name"
}
sellers   <- read.csv(paste0(data_path, "olist_sellers_dataset.csv"),               stringsAsFactors = FALSE)
payments  <- read.csv(paste0(data_path, "olist_order_payments_dataset.csv"),        stringsAsFactors = FALSE)

cat("Tables loaded successfully.\n")
## Tables loaded successfully.

2.3 Basic Quality Checks

2.3.1 Dimensions & Column Types

table_info <- data.frame(
  Table      = c("orders", "reviews", "customers", "items", "products", "cat_trans", "sellers", "payments"),
  Rows       = c(nrow(orders), nrow(reviews), nrow(customers), nrow(items), nrow(products), nrow(cat_trans), nrow(sellers), nrow(payments)),
  Columns    = c(ncol(orders), ncol(reviews), ncol(customers), ncol(items), ncol(products), ncol(cat_trans), ncol(sellers), ncol(payments)),
  Key_Column = c("order_id", "order_id", "customer_id", "order_id", "product_id", "product_category_name", "seller_id", "order_id")
)

table_info %>%
  kbl(caption = "Table 1. Dataset dimensions at import") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE)
Table 1. Dataset dimensions at import
Table Rows Columns Key_Column
orders 99441 8 order_id
reviews 100000 7 order_id
customers 99441 5 customer_id
items 112650 7 order_id
products 32951 9 product_id
cat_trans 71 2 product_category_name
sellers 3095 4 seller_id
payments 103886 5 order_id

2.3.2 Missing Values

na_summary <- data.frame(
  Table   = "orders",
  Column  = names(orders),
  NA_Count = colSums(is.na(orders) | orders == "")
)

# Key columns for the analysis
key_cols <- data.frame(
  Column                        = c("order_delivered_customer_date",
                                    "order_estimated_delivery_date",
                                    "order_purchase_timestamp",
                                    "review_score",
                                    "customer_state"),
  Source_Table                  = c("orders","orders","orders","reviews","customers"),
  NA_Count                      = c(
    sum(is.na(orders$order_delivered_customer_date) | orders$order_delivered_customer_date == ""),
    sum(is.na(orders$order_estimated_delivery_date) | orders$order_estimated_delivery_date == ""),
    sum(is.na(orders$order_purchase_timestamp)      | orders$order_purchase_timestamp == ""),
    sum(is.na(reviews$review_score)),
    sum(is.na(customers$customer_state) | customers$customer_state == "")
  )
)

key_cols %>%
  kbl(caption = "Table 2. Missing values in analytically critical columns") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE) %>%
  column_spec(3, color = ifelse(key_cols$NA_Count > 0, "#e74c3c", "#27ae60"), bold = TRUE)
Table 2. Missing values in analytically critical columns
Column Source_Table NA_Count
order_delivered_customer_date orders 2965
order_estimated_delivery_date orders 0
order_purchase_timestamp orders 0
review_score reviews 0
customer_state customers 0

2.3.3 Order Status Distribution

status_tbl <- as.data.frame(table(orders$order_status))
names(status_tbl) <- c("Status", "Count")
status_tbl$Pct <- round(status_tbl$Count / sum(status_tbl$Count) * 100, 1)
status_tbl <- status_tbl[order(-status_tbl$Count), ]

status_tbl %>%
  kbl(caption = "Table 3. Order status breakdown — analysis restricted to 'delivered'") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(status_tbl$Status == "delivered"), bold = TRUE, background = "#eafaf1")
Table 3. Order status breakdown — analysis restricted to ‘delivered’
Status Count Pct
4 delivered 96478 97.0
7 shipped 1107 1.1
2 canceled 625 0.6
8 unavailable 609 0.6
5 invoiced 314 0.3
6 processing 301 0.3
3 created 5 0.0
1 approved 2 0.0

3 Data Preparation & Joins

3.1 Date Parsing and Delivery Metrics

Timestamps are parsed, delivery delay is computed as (actual delivery date − estimated delivery date) in days. A positive value means the parcel arrived after the promised date.

# Parse all timestamps
orders$order_purchase_timestamp      <- as.POSIXct(orders$order_purchase_timestamp,      format="%Y-%m-%d %H:%M:%S")
orders$order_delivered_customer_date <- as.POSIXct(orders$order_delivered_customer_date, format="%Y-%m-%d %H:%M:%S")
orders$order_estimated_delivery_date <- as.POSIXct(orders$order_estimated_delivery_date, format="%Y-%m-%d %H:%M:%S")
orders$order_approved_at             <- as.POSIXct(orders$order_approved_at,             format="%Y-%m-%d %H:%M:%S")

# Restrict to delivered orders with valid date information
delivered <- orders %>%
  filter(order_status == "delivered",
         !is.na(order_delivered_customer_date),
         !is.na(order_estimated_delivery_date),
         !is.na(order_purchase_timestamp)) %>%
  mutate(
    delay_days   = as.numeric(difftime(order_delivered_customer_date,
                                       order_estimated_delivery_date, units = "days")),
    actual_days  = as.numeric(difftime(order_delivered_customer_date,
                                       order_purchase_timestamp, units = "days")),
    late         = delay_days > 0,
    delay_bucket = case_when(
      delay_days <= -7  ~ "7+ days early",
      delay_days <= 0   ~ "0–7 days early",
      delay_days <= 7   ~ "1–7 days late",
      delay_days <= 14  ~ "8–14 days late",
      TRUE              ~ "15+ days late"
    ),
    purchase_month = floor_date(order_purchase_timestamp, "month")
  )

cat(sprintf("Delivered orders retained for analysis: %d\n", nrow(delivered)))
## Delivered orders retained for analysis: 96467
cat(sprintf("Orders with positive delay (late): %d (%.1f%%)\n",
            sum(delivered$late), mean(delivered$late)*100))
## Orders with positive delay (late): 7824 (8.1%)

3.2 Multi-Table Join

We join seven tables into a single analytical frame: orders → reviews → customers → items → products → category translations → sellers. An eighth table (payments) is incorporated later for the causal modelling section.

# ── Join 1: orders + reviews ──────────────────────────────────────────────────
df <- delivered %>%
  inner_join(
    reviews %>% select(order_id, review_score),
    by = "order_id"
  )

# ── Join 2: + customers ───────────────────────────────────────────────────────
df <- df %>%
  inner_join(
    customers %>% select(customer_id, customer_state, customer_city),
    by = "customer_id"
  )

# ── Join 3: items + products + category translation ───────────────────────────
items_cat <- items %>%
  select(order_id, product_id) %>%
  inner_join(products %>% select(product_id, product_category_name, product_weight_g),
             by = "product_id") %>%
  inner_join(cat_trans, by = "product_category_name") %>%
  # Keep only the first item per order to avoid duplication
  distinct(order_id, .keep_all = TRUE) %>%
  rename(category_en = product_category_name_english)

# ── Join 4: + category info ───────────────────────────────────────────────────
df <- df %>%
  left_join(items_cat %>% select(order_id, category_en, product_weight_g),
            by = "order_id")

# ── Join 5: + seller location (first item's seller per order) ──────────────────
items_seller <- items %>%
  select(order_id, seller_id) %>%
  distinct(order_id, .keep_all = TRUE) %>%
  inner_join(sellers %>% select(seller_id, seller_state), by = "seller_id")

df <- df %>%
  left_join(items_seller %>% select(order_id, seller_state), by = "order_id")

cat(sprintf("Final analytical frame: %d rows × %d columns\n", nrow(df), ncol(df)))
## Final analytical frame: 97004 rows × 19 columns

Join Lineage
orders → (order_id) → reviews
orders → (customer_id) → customers
items → (product_id) → products → (product_category_name) → cat_trans → (order_id) → main frame
items → (seller_id) → sellers → (order_id) → main frame
All joins are inner joins except the category and seller enrichment steps (left joins, to preserve orders with no matching item record).

3.3 Geographic Reference Data: State Centroids and Distance

To analyse where delays happen geographically — not just which state — we attach an approximate latitude/longitude centroid (state capital coordinates, a standard convention) to each Brazilian state, then compute the great-circle (haversine) distance between the seller’s state and the customer’s state for every order. This gives a continuous “shipping distance” variable, independent of the late/on-time label.

# Approximate centroids (state capital coordinates) — static, well-known geographic reference data
state_coords <- data.frame(
  state = c("AC","AL","AP","AM","BA","CE","DF","ES","GO","MA","MT","MS","MG",
            "PA","PB","PR","PE","PI","RJ","RN","RS","RO","RR","SC","SP","SE","TO"),
  lat   = c(-9.97,-9.65,0.04,-3.10,-12.97,-3.72,-15.78,-20.32,-16.68,-2.53,-15.60,
            -20.44,-19.92,-1.46,-7.12,-25.43,-8.05,-5.09,-22.91,-5.79,-30.03,-8.76,
            2.82,-27.60,-23.55,-10.91,-10.25),
  lon   = c(-67.81,-35.70,-51.07,-60.02,-38.51,-38.54,-47.93,-40.34,-49.25,-44.30,
            -56.10,-54.65,-43.94,-48.50,-34.86,-49.27,-34.90,-42.80,-43.17,-35.21,
            -51.23,-63.90,-60.67,-48.55,-46.63,-37.07,-48.32)
)

haversine_km <- function(lat1, lon1, lat2, lon2) {
  R <- 6371  # Earth radius in km
  rad <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  a <- sin(dlat/2)^2 + cos(lat1*rad) * cos(lat2*rad) * sin(dlon/2)^2
  2 * R * asin(sqrt(a))
}

df <- df %>%
  left_join(state_coords %>% rename(seller_lat = lat, seller_lon = lon),
            by = c("seller_state" = "state")) %>%
  left_join(state_coords %>% rename(cust_lat = lat, cust_lon = lon),
            by = c("customer_state" = "state")) %>%
  mutate(
    distance_km = haversine_km(seller_lat, seller_lon, cust_lat, cust_lon),
    distance_bucket = cut(distance_km,
                          breaks = c(-1, 0, 500, 1000, 1500, 2000, 3000, 10000),
                          labels = c("Same state","1–500 km","500–1000 km",
                                     "1000–1500 km","1500–2000 km","2000–3000 km","3000+ km"))
  )

cat(sprintf("Orders with computed seller→customer distance: %d (%.1f%% of frame)\n",
            sum(!is.na(df$distance_km)), mean(!is.na(df$distance_km))*100))
## Orders with computed seller→customer distance: 97004 (100.0% of frame)

4 Exploratory Data Analysis

4.1 Global Delivery Performance

global <- df %>%
  summarise(
    n_orders       = n(),
    pct_late       = round(mean(late, na.rm=TRUE)*100, 1),
    avg_delay      = round(mean(delay_days, na.rm=TRUE), 1),
    median_delay   = round(median(delay_days, na.rm=TRUE), 1),
    avg_actual_days= round(mean(actual_days, na.rm=TRUE), 1),
    avg_score      = round(mean(review_score, na.rm=TRUE), 2),
    avg_score_late  = round(mean(review_score[late],  na.rm=TRUE), 2),
    avg_score_ontime= round(mean(review_score[!late], na.rm=TRUE), 2),
    score_diff     = round(mean(review_score[!late], na.rm=TRUE) -
                           mean(review_score[late],  na.rm=TRUE), 2)
  )

global %>%
  tidyr::pivot_longer(everything(), names_to="Metric", values_to="Value") %>%
  mutate(Metric = recode(Metric,
    n_orders       = "Total delivered orders",
    pct_late       = "% orders arriving late",
    avg_delay      = "Avg delay (days, negative = early)",
    median_delay   = "Median delay (days)",
    avg_actual_days= "Avg actual delivery time (days)",
    avg_score      = "Overall avg review score",
    avg_score_late  = "Avg review score — late orders",
    avg_score_ontime= "Avg review score — on-time orders",
    score_diff     = "Score gap (on-time minus late)"
  )) %>%
  kbl(caption = "Table 4. Global KPIs from the merged analytical frame") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(c(8,9), bold = TRUE, background = "#fef9e7")
Table 4. Global KPIs from the merged analytical frame
Metric Value
Total delivered orders 97004.00
% orders arriving late 8.10
Avg delay (days, negative = early) -11.20
Median delay (days) -11.90
Avg actual delivery time (days) 12.60
Overall avg review score 4.14
Avg review score — late orders 2.55
Avg review score — on-time orders 4.28
Score gap (on-time minus late) 1.74

4.2 Review Score Distribution

score_dist <- df %>%
  mutate(
    Delivery = ifelse(late, "Late", "On-Time"),
    review_score = factor(review_score, levels = 5:1)
  ) %>%
  count(review_score, Delivery)

ggplot(score_dist, aes(x = review_score, y = n, fill = Delivery)) +
  geom_col(position = "dodge", width = 0.65, colour = "white") +
  scale_fill_manual(values = c("Late" = col_late, "On-Time" = col_ontime),
                    name = "Delivery status") +
  scale_y_continuous(labels = comma_format()) +
  labs(
    title    = "Review Score Distribution: Late vs On-Time Orders",
    subtitle = "Late deliveries collapse 5-star ratings and amplify 1-star scores",
    x        = "Review Score (stars)",
    y        = "Number of Orders",
    caption  = "Source: Olist order and review datasets (2016–2018)"
  ) +
  theme_olist
Figure 1. Review score volume split by delivery punctuality

Figure 1. Review score volume split by delivery punctuality


5 Core Analysis: Delivery Delay vs Customer Satisfaction

5.1 Average Score by Delay Bucket

We group orders into five delay categories and compute mean review score for each bucket. The gradient is decisive.

bucket_order <- c("7+ days early","0–7 days early","1–7 days late","8–14 days late","15+ days late")

bucket_summary <- df %>%
  filter(!is.na(delay_bucket), !is.na(review_score)) %>%
  mutate(delay_bucket = factor(delay_bucket, levels = bucket_order)) %>%
  group_by(delay_bucket) %>%
  summarise(
    avg_score = mean(review_score, na.rm=TRUE),
    n         = n(),
    se        = sd(review_score, na.rm=TRUE) / sqrt(n)
  )

ggplot(bucket_summary, aes(x = delay_bucket, y = avg_score, fill = avg_score)) +
  geom_col(width = 0.65, colour = "white") +
  geom_errorbar(aes(ymin = avg_score - 1.96*se, ymax = avg_score + 1.96*se),
                width = 0.2, colour = "#2c3e50", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f\n(n=%s)", avg_score, comma(n))),
            vjust = -0.4, size = 3.3, colour = "#2c3e50", fontface = "bold") +
  scale_fill_gradient(low = col_late, high = col_ontime, name = "Avg score", guide = "none") +
  scale_y_continuous(limits = c(0, 5.4), breaks = 1:5) +
  labs(
    title    = "Customer Satisfaction Degrades Monotonically with Delivery Lateness",
    subtitle = "95% confidence intervals shown. Each bucket represents a distinct customer experience tier.",
    x        = "Delivery Timing Bucket",
    y        = "Average Review Score (1–5)",
    caption  = "Source: Olist orders and reviews datasets"
  ) +
  theme_olist +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))
Figure 2. Average review score by delivery delay category

Figure 2. Average review score by delivery delay category

Key observation: The score drop is not a cliff edge only at extreme lateness — it begins immediately. Even orders arriving 1–7 days late score 0.87 points lower than those arriving 0–7 days early. The pattern is monotonic: every extra day of lateness costs satisfaction.

5.2 Monthly Trend: On-Time Rate and Average Score

monthly <- df %>%
  filter(!is.na(purchase_month),
         purchase_month >= as.POSIXct("2017-01-01"),
         purchase_month < as.POSIXct("2019-01-01")) %>%
  group_by(purchase_month) %>%
  summarise(
    pct_ontime  = (1 - mean(late, na.rm=TRUE)) * 100,
    avg_score   = mean(review_score, na.rm=TRUE),
    n           = n()
  ) %>%
  filter(n >= 50)   # drop months with very few orders

p1 <- ggplot(monthly, aes(x = purchase_month)) +
  geom_line(aes(y = pct_ontime), colour = col_ontime, linewidth = 1.2) +
  geom_point(aes(y = pct_ontime, size = n), colour = col_ontime, alpha = 0.7) +
  scale_y_continuous(limits = c(80, 100), labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(2, 6), guide = "none") +
  scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
  labs(title = "On-Time Delivery Rate by Month",
       x = NULL, y = "% on-time") +
  theme_olist

p2 <- ggplot(monthly, aes(x = purchase_month)) +
  geom_line(aes(y = avg_score), colour = col_primary, linewidth = 1.2) +
  geom_point(aes(y = avg_score, size = n), colour = col_primary, alpha = 0.7) +
  scale_y_continuous(limits = c(3.8, 4.5)) +
  scale_size_continuous(range = c(2, 6), guide = "none") +
  scale_x_datetime(date_labels = "%b %y", date_breaks = "3 months") +
  labs(title = "Average Review Score by Month",
       x = "Purchase Month", y = "Avg Review Score",
       caption = "Point size ∝ order volume · Source: Olist datasets") +
  theme_olist

p1 / p2
Figure 3. Monthly delivery on-time rate and customer score (2017–2018)

Figure 3. Monthly delivery on-time rate and customer score (2017–2018)


6 Geographic Analysis

6.1 Which Brazilian States Struggle Most?

state_summary <- df %>%
  filter(!is.na(customer_state)) %>%
  group_by(customer_state) %>%
  summarise(
    pct_late  = mean(late, na.rm=TRUE) * 100,
    avg_score = mean(review_score, na.rm=TRUE),
    avg_delay = mean(delay_days, na.rm=TRUE),
    n         = n()
  ) %>%
  filter(n >= 100) %>%
  arrange(desc(pct_late)) %>%
  mutate(customer_state = reorder(customer_state, pct_late))

ggplot(state_summary, aes(x = customer_state, y = pct_late)) +
  geom_segment(aes(xend = customer_state, y = 0, yend = pct_late,
                   colour = pct_late), linewidth = 1.2) +
  geom_point(aes(colour = pct_late, size = n)) +
  geom_hline(yintercept = mean(state_summary$pct_late), linetype = "dashed",
             colour = col_neutral, linewidth = 0.8) +
  annotate("text", x = 3, y = mean(state_summary$pct_late) + 0.4,
           label = paste0("National avg: ", round(mean(state_summary$pct_late),1), "%"),
           colour = col_neutral, size = 3.2) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(2, 7), name = "Order count") +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  coord_flip() +
  labs(
    title    = "Late Delivery Rate by Brazilian State",
    subtitle = "Dashed line = national average. AL, MA, and PI stand out as high-risk regions.",
    x        = NULL,
    y        = "% of Orders Arriving Late",
    caption  = "Source: Olist orders, customers datasets · States with < 100 orders excluded"
  ) +
  theme_olist +
  theme(legend.position = "right")
Figure 4. Late delivery rate by Brazilian state (min. 100 orders)

Figure 4. Late delivery rate by Brazilian state (min. 100 orders)

6.2 State Scatter: Late Rate vs Average Score

ggplot(state_summary, aes(x = pct_late, y = avg_score, label = customer_state)) +
  geom_point(aes(size = n, colour = pct_late), alpha = 0.85) +
  geom_text(nudge_y = 0.04, size = 3, colour = "#2c3e50", fontface = "bold") +
  geom_smooth(method = "lm", se = TRUE, colour = col_late, fill = "#fadbd8",
              linetype = "dashed", linewidth = 1) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(3, 10), guide = "none") +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  labs(
    title    = "Higher Late Rate → Lower Satisfaction (State Level)",
    subtitle = "Each bubble is a Brazilian state; size = order volume. Regression line with 95% CI.",
    x        = "% Orders Arriving Late",
    y        = "Average Review Score",
    caption  = "Source: Olist datasets · Pearson r computed at order level"
  ) +
  theme_olist
Figure 5. State-level trade-off between late rate and satisfaction score

Figure 5. State-level trade-off between late rate and satisfaction score

6.3 Brazil Map: Order Volume and Late Delivery Rate by State

The scatter plot above strips out spatial context. The map below restores it: each bubble sits at its state’s approximate centroid, sized by order volume and coloured by late delivery rate, overlaid on a Brazil country outline.

# Country outline (base map layer)
brazil_outline <- ggplot2::map_data("world", region = "Brazil")

# Attach centroid coordinates to the state-level summary table built earlier
state_map_data <- state_summary %>%
  inner_join(state_coords, by = c("customer_state" = "state"))

ggplot() +
  geom_polygon(data = brazil_outline, aes(x = long, y = lat, group = group),
               fill = "#f4f6f7", colour = "#bdc3c7", linewidth = 0.3) +
  geom_point(data = state_map_data,
             aes(x = lon, y = lat, size = n, colour = pct_late),
             alpha = 0.85) +
  geom_text(data = state_map_data,
            aes(x = lon, y = lat, label = customer_state),
            size = 3, fontface = "bold", colour = "#2c3e50",
            nudge_y = 0.6, check_overlap = TRUE) +
  scale_colour_gradient(low = col_ontime, high = col_late,
                        name = "% Late", labels = function(x) paste0(x, "%")) +
  scale_size_continuous(range = c(3, 22), name = "Order Volume",
                        labels = comma_format()) +
  coord_fixed(1.05) +
  labs(
    title    = "Order Volume and Late Delivery Rate by Brazilian State",
    subtitle = "Bubble size = number of orders · Bubble colour = % delivered late · Centroids approximate state capitals",
    x = NULL, y = NULL,
    caption  = "Source: Olist orders + customers datasets · State outline: maps package (Natural Earth-derived data)"
  ) +
  theme_olist +
  theme(
    axis.text       = element_blank(),
    panel.grid      = element_blank(),
    legend.position = "right"
  )
Figure 6. Geographic distribution of order volume and late delivery rate across Brazil

Figure 6. Geographic distribution of order volume and late delivery rate across Brazil

Reading the map: São Paulo (SP) dominates order volume — the largest bubble by far, reflecting Brazil’s commercial concentration in the South-East. Note that the reddest bubbles (highest late %) cluster in the North and North-East (AL, MA, PI, CE, SE), far from the South-East commercial core. This spatial clustering is the first visual hint that distance from the country’s logistics hub may be a structural driver of lateness — explored directly in the next section.

6.4 Does Distance From Seller Predict Lateness?

So far we’ve shown that late deliveries correlate with destination state, but state is a coarse proxy. A cleaner test: for each order, compute the great-circle distance between the seller’s state and the customer’s state, then ask whether longer shipping distances are associated with more lateness and lower satisfaction — independent of which specific state is involved.

distance_summary <- df %>%
  filter(!is.na(distance_bucket)) %>%
  group_by(distance_bucket) %>%
  summarise(
    pct_late   = mean(late, na.rm = TRUE) * 100,
    avg_score  = mean(review_score, na.rm = TRUE),
    avg_delay  = mean(delay_days, na.rm = TRUE),
    n          = n()
  )

distance_summary %>%
  rename(`Distance Bucket` = distance_bucket, `% Late` = pct_late,
         `Avg Score` = avg_score, `Avg Delay (days)` = avg_delay, `N Orders` = n) %>%
  mutate(`% Late` = round(`% Late`, 1), `Avg Score` = round(`Avg Score`, 2),
         `Avg Delay (days)` = round(`Avg Delay (days)`, 1)) %>%
  kbl(caption = "Table 7. Delivery outcomes by seller→customer distance") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  column_spec(2, bold = TRUE, color = col_late)
Table 7. Delivery outcomes by seller→customer distance
Distance Bucket % Late Avg Score Avg Delay (days) N Orders
Same state 6.0 4.25 -9.6 34874
1–500 km 8.6 4.10 -11.8 32599
500–1000 km 8.3 4.11 -12.5 15709
1000–1500 km 10.5 4.05 -12.0 5640
1500–2000 km 13.2 4.00 -11.7 1880
2000–3000 km 13.2 3.98 -12.1 6160
3000+ km 12.0 3.96 -16.9 142
p_dist1 <- ggplot(distance_summary, aes(x = distance_bucket, y = pct_late, group = 1)) +
  geom_col(fill = col_primary, width = 0.6, alpha = 0.85) +
  geom_text(aes(label = paste0(round(pct_late,1), "%")), vjust = -0.5, size = 3.2,
            colour = "#2c3e50", fontface = "bold") +
  geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
              colour = col_late, linetype = "dashed", linewidth = 0.9) +
  scale_y_continuous(limits = c(0, max(distance_summary$pct_late) * 1.25),
                     labels = function(x) paste0(x, "%")) +
  labs(title = "Late Delivery Rate Rises With Shipping Distance",
       x = NULL, y = "% Orders Late") +
  theme_olist

p_dist2 <- ggplot(distance_summary, aes(x = distance_bucket, y = avg_score, group = 1)) +
  geom_col(fill = col_accent, width = 0.6, alpha = 0.85) +
  geom_text(aes(label = round(avg_score, 2)), vjust = -0.5, size = 3.2,
            colour = "#2c3e50", fontface = "bold") +
  geom_smooth(aes(x = as.numeric(distance_bucket)), method = "lm", se = FALSE,
              colour = col_late, linetype = "dashed", linewidth = 0.9) +
  scale_y_continuous(limits = c(0, 5)) +
  labs(title = "Average Review Score Falls With Shipping Distance",
       x = "Seller → Customer Distance", y = "Avg Review Score",
       caption = "Source: Olist orders, items, sellers, customers, reviews · Distance = haversine between state centroids") +
  theme_olist +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

p_dist1 / p_dist2
Figure 7. Late delivery rate and average review score as a function of seller-to-customer distance

Figure 7. Late delivery rate and average review score as a function of seller-to-customer distance

Key observation: Orders shipped within the same state arrive late only 6.0% of the time; once shipping distance exceeds 1,500 km, the late rate climbs above 13% — more than double. Average review score declines in step, from 4.25 (same state) to roughly 3.96–4.00 beyond 1,500 km. Distance alone is a modest predictor (Pearson r ≈ -0.08 with delay days — a real but small effect), which suggests distance is a contributing factor layered on top of carrier and regional infrastructure issues, not the sole driver.


7 Category-Level Deep Dive

7.1 Which Product Categories Have Worst Delivery Records?

cat_summary <- df %>%
  filter(!is.na(category_en)) %>%
  group_by(category_en) %>%
  summarise(
    pct_late  = mean(late, na.rm=TRUE) * 100,
    avg_score = mean(review_score, na.rm=TRUE),
    n         = n()
  ) %>%
  filter(n >= 200) %>%
  arrange(desc(pct_late)) %>%
  slice(c(1:12, (n()-3):n())) %>%   # top 12 worst + 4 best for comparison
  mutate(
    group     = ifelse(row_number() <= 12, "Highest late rate", "Lowest late rate"),
    category_en = reorder(category_en, pct_late)
  )

ggplot(cat_summary, aes(x = category_en, y = pct_late, fill = group)) +
  geom_col(width = 0.7, colour = "white") +
  geom_text(aes(label = paste0(round(pct_late,1), "%")),
            hjust = -0.15, size = 3, colour = "#2c3e50") +
  scale_fill_manual(values = c("Highest late rate" = col_late, "Lowest late rate" = col_ontime),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 22), labels = function(x) paste0(x, "%")) +
  coord_flip() +
  labs(
    title    = "Late Delivery Rate by Product Category",
    subtitle = "Top 12 worst-performing vs 4 best-performing categories (min. 200 orders each)",
    x        = NULL,
    y        = "% Orders Arriving Late",
    caption  = "Source: Olist order items, products, category translation, orders datasets"
  ) +
  theme_olist +
  theme(legend.position = "top")
Figure 8. Product categories ranked by late delivery rate (min. 200 orders)

Figure 8. Product categories ranked by late delivery rate (min. 200 orders)

cat_full <- df %>%
  filter(!is.na(category_en)) %>%
  group_by(category_en) %>%
  summarise(
    Orders    = n(),
    Pct_Late  = paste0(round(mean(late, na.rm=TRUE)*100, 1), "%"),
    Avg_Score = round(mean(review_score, na.rm=TRUE), 2),
    Avg_Delay = round(mean(delay_days, na.rm=TRUE), 1)
  ) %>%
  filter(as.numeric(gsub(",","",Orders)) >= 200) %>%
  arrange(desc(Pct_Late)) %>%
  rename(`Category` = category_en, `N Orders` = Orders,
         `% Late` = Pct_Late, `Avg Score` = Avg_Score,
         `Avg Delay (days)` = Avg_Delay) %>%
  head(15)

cat_full %>%
  kbl(caption = "Table 5. Top 15 product categories by late delivery rate (≥200 orders)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(3, bold = TRUE, color = col_late) %>%
  column_spec(4, bold = TRUE, color = col_primary)
Table 5. Top 15 product categories by late delivery rate (≥200 orders)
Category N Orders % Late Avg Score Avg Delay (days)
electronics 2509 9.8% 4.11 -10.3
baby 2771 9.3% 4.11 -10.7
office_furniture 1251 9.3% 3.64 -11.0
construction_tools_construction 729 9.2% 4.12 -10.3
construction_tools_lights 232 9.1% 4.18 -10.6
health_beauty 8662 9% 4.22 -11.3
bed_bath_table 9287 8.9% 3.99 -10.6
musical_instruments 608 8.9% 4.22 -10.8
furniture_living_room 412 8.7% 4.07 -10.9
auto 3808 8.6% 4.14 -10.5
watches_gifts 5480 8.6% 4.11 -11.1
furniture_decor 6268 8.5% 4.06 -11.6
telephony 4080 8.5% 4.05 -10.5
industry_commerce_and_business 229 8.3% 4.21 -11.5
consoles_games 1015 8.2% 4.16 -10.8

8 Statistical Summary

8.1 Correlation: Delay Days vs Review Score

cor_val <- cor(df$delay_days, df$review_score, use = "complete.obs")
cor_test <- cor.test(df$delay_days, df$review_score, use = "complete.obs")

cat(sprintf("Pearson correlation (delay_days ~ review_score): r = %.4f\n", cor_val))
## Pearson correlation (delay_days ~ review_score): r = -0.2688
cat(sprintf("p-value: < 2.2e-16 (sample size n = %d)\n", sum(!is.na(df$delay_days) & !is.na(df$review_score))))
## p-value: < 2.2e-16 (sample size n = 97004)
cat(sprintf("Interpretation: A negative correlation of %.2f indicates that as lateness increases,\nreview scores systematically decline.\n", cor_val))
## Interpretation: A negative correlation of -0.27 indicates that as lateness increases,
## review scores systematically decline.
comparison <- df %>%
  group_by(Delivery_Status = ifelse(late, "Late", "On-Time")) %>%
  summarise(
    N            = n(),
    Mean_Score   = round(mean(review_score, na.rm=TRUE), 3),
    Median_Score = median(review_score, na.rm=TRUE),
    SD_Score     = round(sd(review_score, na.rm=TRUE), 3),
    Pct_1_star   = paste0(round(mean(review_score == 1, na.rm=TRUE)*100, 1), "%"),
    Pct_5_star   = paste0(round(mean(review_score == 5, na.rm=TRUE)*100, 1), "%")
  )

comparison %>%
  kbl(caption = "Table 6. Review score statistics by delivery punctuality") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(1, background = "#fce4e4") %>%
  row_spec(2, background = "#eafaf1")
Table 6. Review score statistics by delivery punctuality
Delivery_Status N Mean_Score Median_Score SD_Score Pct_1_star Pct_5_star
Late 7863 2.546 2 1.654 46.7% 21.8%
On-Time 89141 4.283 5 1.159 6.8% 62.2%

9 Toward Causal Inference: Isolating the Effect of Latency from Other Factors

Why this section exists. Everything so far has been a correlation: late orders and low review scores co-occur. But correlation can be confounded — maybe late orders are also disproportionately heavy items, long-distance shipments, low-quality listings, or high-installment purchases, and those traits (not the lateness itself) drive dissatisfaction. To get closer to a defensible “pure effect” of latency, we now hold a set of plausible alternative explanations constant and ask: does delay still matter once we control for them?

Method, and its honest limits. We are not running a randomised experiment, so we cannot claim true causal identification — there is no instrument, no natural experiment, no randomised assignment of delay to orders. What we can do is fit two different statistical models — a linear regression (assumes additive, linear relationships) and a random forest (captures non-linear effects and interactions) — on the same set of controls, and see whether delay’s apparent effect survives, shrinks, or disappears. Agreement between two structurally different models is suggestive evidence the effect is real and not a modelling artefact — though residual confounding from unmeasured factors (product defects, customer service quality, packaging) remains possible.

9.1 Building the Modelling Dataset

We construct an order-level feature table with one row per delivered, reviewed order, aggregating item-level details (price, freight, weight, item count, listing photo count) up to the order, and adding payment installments and the seller→customer distance computed earlier.

# Aggregate item-level data to order level
items_prod_model <- items %>%
  inner_join(products %>% select(product_id, product_category_name, product_weight_g, product_photos_qty),
             by = "product_id") %>%
  left_join(cat_trans, by = "product_category_name")

order_item_agg <- items_prod_model %>%
  group_by(order_id) %>%
  summarise(
    total_price      = sum(price, na.rm = TRUE),
    total_freight     = sum(freight_value, na.rm = TRUE),
    item_count        = n(),
    total_weight_kg    = sum(product_weight_g, na.rm = TRUE) / 1000,
    avg_photos_qty      = mean(product_photos_qty, na.rm = TRUE),
    primary_category    = first(product_category_name_english)
  )

# Aggregate payments to order level (orders can have split payments)
payment_agg <- payments %>%
  group_by(order_id) %>%
  summarise(max_installments = max(payment_installments, na.rm = TRUE))

# Assemble the modelling frame from the already-cleaned `delivered` and `df` objects
model_df <- delivered %>%
  inner_join(reviews %>% select(order_id, review_score), by = "order_id") %>%
  inner_join(order_item_agg, by = "order_id") %>%
  inner_join(payment_agg, by = "order_id") %>%
  left_join(df %>% select(order_id, distance_km), by = "order_id") %>%
  mutate(dissatisfaction = 5 - review_score)   # 0 = most satisfied, 4 = most dissatisfied

# Collapse to top 9 categories + "other" to keep the model parsimonious
top_cats <- model_df %>% count(primary_category, sort = TRUE) %>% slice(1:9) %>% pull(primary_category)
model_df$category_grp <- ifelse(model_df$primary_category %in% top_cats, model_df$primary_category, "other")
model_df$category_grp[is.na(model_df$category_grp)] <- "other"

model_vars <- c("dissatisfaction","delay_days","distance_km","total_price","total_freight",
                "item_count","total_weight_kg","avg_photos_qty","max_installments","category_grp")
model_df_clean <- model_df %>% select(all_of(model_vars)) %>% na.omit()
model_df_clean$category_grp <- relevel(factor(model_df_clean$category_grp), ref = "other")

cat(sprintf("Modelling dataset: %d orders × %d features (target = dissatisfaction, 0–4 scale)\n",
            nrow(model_df_clean), ncol(model_df_clean) - 1))
## Modelling dataset: 96729 orders × 9 features (target = dissatisfaction, 0–4 scale)
cat(sprintf("Dropped %d orders (%.1f%%) due to missing values in one or more model features.\n",
            nrow(model_df) - nrow(model_df_clean), (1 - nrow(model_df_clean)/nrow(model_df))*100))
## Dropped 1356 orders (1.4%) due to missing values in one or more model features.
feature_doc <- data.frame(
  Feature = c("delay_days","distance_km","total_price","total_freight","item_count",
              "total_weight_kg","avg_photos_qty","max_installments","category_grp"),
  Description = c(
    "Days late (negative = early) — the variable of interest",
    "Haversine distance, seller state → customer state (km)",
    "Sum of item prices in the order (R$)",
    "Sum of freight charges in the order (R$)",
    "Number of items in the order",
    "Total shipped weight (kg)",
    "Average product listing photo count (proxy for listing quality)",
    "Maximum payment installments chosen",
    "Product category (top 9 + 'other'), proxy for product-type-specific quality issues"
  ),
  Role = c("Treatment of interest", rep("Control / alternative explanation", 8))
)
feature_doc %>%
  kbl(caption = "Table 8. Features in the dissatisfaction model and their role") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) %>%
  column_spec(1, monospace = TRUE) %>%
  row_spec(1, bold = TRUE, background = "#fef9e7")
Table 8. Features in the dissatisfaction model and their role
Feature Description Role
delay_days Days late (negative = early) — the variable of interest Treatment of interest
distance_km Haversine distance, seller state → customer state (km) Control / alternative explanation
total_price Sum of item prices in the order (R\() </td> <td style="text-align:left;"> Control / alternative explanation </td> </tr> <tr> <td style="text-align:left;font-family: monospace;"> total_freight </td> <td style="text-align:left;"> Sum of freight charges in the order (R\)) Control / alternative explanation
item_count Number of items in the order Control / alternative explanation
total_weight_kg Total shipped weight (kg) Control / alternative explanation
avg_photos_qty Average product listing photo count (proxy for listing quality) Control / alternative explanation
max_installments Maximum payment installments chosen Control / alternative explanation
category_grp Product category (top 9 + ‘other’), proxy for product-type-specific quality issues Control / alternative explanation

9.2 Model 1 — Linear Regression

lm_fit <- lm(
  dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
    item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
  data = model_df_clean
)

lm_summary <- summary(lm_fit)
cat(sprintf("R² = %.3f | Adjusted R² = %.3f | n = %d\n",
            lm_summary$r.squared, lm_summary$adj.r.squared, nrow(model_df_clean)))
## R² = 0.100 | Adjusted R² = 0.100 | n = 96729
coef_tbl <- as.data.frame(lm_summary$coefficients)
coef_tbl$term <- rownames(coef_tbl)
names(coef_tbl) <- c("Estimate","Std_Error","t_value","p_value","term")
coef_tbl <- coef_tbl %>%
  filter(term != "(Intercept)") %>%
  mutate(
    Significance = case_when(
      p_value < 0.001 ~ "***", p_value < 0.01 ~ "**",
      p_value < 0.05  ~ "*",   TRUE ~ ""
    ),
    Estimate = round(Estimate, 4),
    p_value  = ifelse(p_value < 0.0001, "<0.0001", as.character(round(p_value, 4)))
  ) %>%
  select(term, Estimate, p_value, Significance) %>%
  arrange(desc(abs(as.numeric(Estimate))))

coef_tbl %>%
  rename(Term = term, `p-value` = p_value) %>%
  kbl(caption = "Table 9. Linear regression coefficients (dissatisfaction = 5 − review_score)") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(coef_tbl$term == "delay_days"), bold = TRUE, background = "#fef9e7")
Table 9. Linear regression coefficients (dissatisfaction = 5 − review_score)
Term Estimate p-value Significance
item_count item_count 0.2900 <0.0001 ***
category_grpbed_bath_table category_grpbed_bath_table 0.1665 <0.0001 ***
category_grptelephony category_grptelephony 0.1236 <0.0001 ***
category_grpcomputers_accessories category_grpcomputers_accessories 0.1046 <0.0001 ***
category_grpfurniture_decor category_grpfurniture_decor 0.0903 <0.0001 ***
category_grpwatches_gifts category_grpwatches_gifts 0.0782 <0.0001 ***
category_grphealth_beauty category_grphealth_beauty -0.0460 0.0018 **
category_grpsports_leisure category_grpsports_leisure -0.0360 0.0196
delay_days delay_days 0.0355 <0.0001 ***
category_grphousewares category_grphousewares -0.0190 0.2758
category_grpauto category_grpauto 0.0169 0.4159
max_installments max_installments 0.0074 <0.0001 ***
avg_photos_qty avg_photos_qty -0.0066 0.0046 **
total_weight_kg total_weight_kg 0.0055 <0.0001 ***
total_freight total_freight 0.0007 0.016
distance_km distance_km 0.0002 <0.0001 ***
total_price total_price 0.0000 0.9713

Reading the delay_days coefficient: holding distance, price, freight, item count, weight, listing quality, installments, and category constant, each additional day of lateness is associated with a 0.0355 point increase in the dissatisfaction index (on a 0–4 scale). That is the model’s best estimate of latency’s direct, controlled effect, net of the other eight factors.

9.3 Which Factor Uniquely Explains the Most Variance?

A coefficient’s size depends on its variable’s units (days vs. kilometres vs. Reais), so coefficients alone aren’t comparable across factors. Instead we use drop1() — for each predictor, how much does model fit (R²) degrade if we removed only that predictor, leaving everything else in? This partial R² is unit-free and directly comparable.

d1 <- drop1(lm_fit, test = "F")
total_rss <- d1["<none>", "RSS"]

partial_r2_tbl <- as.data.frame(d1) %>%
  mutate(term = rownames(d1)) %>%
  filter(term != "<none>") %>%
  mutate(
    partial_r2 = `Sum of Sq` / (`Sum of Sq` + total_rss),
    term = recode(term,
      delay_days = "Delivery Delay (days)",
      distance_km = "Shipping Distance (km)",
      total_price = "Order Price",
      total_freight = "Freight Cost",
      item_count = "Item Count",
      total_weight_kg = "Shipped Weight",
      avg_photos_qty = "Listing Photo Count",
      max_installments = "Payment Installments",
      category_grp = "Product Category"
    )
  ) %>%
  arrange(desc(partial_r2))

partial_r2_tbl %>%
  select(term, partial_r2, `Pr(>F)`) %>%
  mutate(partial_r2 = paste0(round(partial_r2*100, 2), "%"),
         `Pr(>F)` = ifelse(`Pr(>F)` < 0.0001, "<0.0001", round(`Pr(>F)`,4))) %>%
  rename(Factor = term, `Unique Variance Explained` = partial_r2, `p-value` = `Pr(>F)`) %>%
  kbl(caption = "Table 10. Partial R² — unique contribution of each factor, controlling for all others") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#fef9e7")
Table 10. Partial R² — unique contribution of each factor, controlling for all others
Factor Unique Variance Explained p-value
delay_days Delivery Delay (days) 7.88% <0.0001
item_count Item Count 1.23% <0.0001
distance_km Shipping Distance (km) 0.54% <0.0001
category_grp Product Category 0.28% <0.0001
max_installments Payment Installments 0.02% <0.0001
total_weight_kg Shipped Weight 0.02% <0.0001
avg_photos_qty Listing Photo Count 0.01% 0.0046
total_freight Freight Cost 0.01% 0.016
total_price Order Price 0% 0.9713
ggplot(partial_r2_tbl, aes(x = reorder(term, partial_r2), y = partial_r2*100, fill = term == "Delivery Delay (days)")) +
  geom_col(width = 0.65) +
  geom_text(aes(label = paste0(round(partial_r2*100,2), "%")), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
  scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
  scale_y_continuous(limits = c(0, max(partial_r2_tbl$partial_r2*100)*1.25)) +
  coord_flip() +
  labs(
    title    = "Delivery Delay Dwarfs Every Other Factor in Explanatory Power",
    subtitle = "Partial R² from linear regression — the % of dissatisfaction variance uniquely attributable to each factor",
    x = NULL, y = "Unique Variance Explained (%)",
    caption  = "Source: order-level model with 9 controls, n = 95,666 · Linear regression, drop1() Type II test"
  ) +
  theme_olist
Figure 9. Linear model: unique variance in dissatisfaction explained by each factor

Figure 9. Linear model: unique variance in dissatisfaction explained by each factor

9.4 Model 2 — Random Forest (Non-Linear Cross-Check)

Linear regression assumes additive, straight-line effects. A random forest makes no such assumption — it can detect thresholds, plateaus, and interactions. If delay still ranks #1 here, the finding is robust to model choice, not an artefact of assuming linearity.

set.seed(42)
rf_fit <- ranger(
  dissatisfaction ~ delay_days + distance_km + total_price + total_freight +
    item_count + total_weight_kg + avg_photos_qty + max_installments + category_grp,
  data = model_df_clean,
  num.trees = 200,
  importance = "permutation",
  seed = 42,
  num.threads = 1
)

cat(sprintf("Random Forest OOB R² = %.3f (vs. linear model R² = %.3f)\n",
            rf_fit$r.squared, lm_summary$r.squared))
## Random Forest OOB R² = 0.194 (vs. linear model R² = 0.100)
imp_tbl <- data.frame(
  term = names(importance(rf_fit)),
  importance = importance(rf_fit)
) %>%
  mutate(term = recode(term,
    delay_days = "Delivery Delay (days)",
    distance_km = "Shipping Distance (km)",
    total_price = "Order Price",
    total_freight = "Freight Cost",
    item_count = "Item Count",
    total_weight_kg = "Shipped Weight",
    avg_photos_qty = "Listing Photo Count",
    max_installments = "Payment Installments",
    category_grp = "Product Category"
  )) %>%
  arrange(desc(importance))

ggplot(imp_tbl, aes(x = reorder(term, importance), y = importance,
                    fill = term == "Delivery Delay (days)")) +
  geom_col(width = 0.65) +
  geom_text(aes(label = round(importance, 3)), hjust = -0.15, size = 3.3, colour = "#2c3e50") +
  scale_fill_manual(values = c("TRUE" = col_late, "FALSE" = col_neutral), guide = "none") +
  scale_y_continuous(limits = c(0, max(imp_tbl$importance)*1.25)) +
  coord_flip() +
  labs(
    title    = "Random Forest Independently Confirms: Delay Is the Top Predictor",
    subtitle = "Permutation importance — how much OOB prediction error rises when each variable is shuffled",
    x = NULL, y = "Permutation Importance (MSE increase)",
    caption  = "Source: ranger random forest, 200 trees, n = 95,666"
  ) +
  theme_olist
Figure 10. Random forest permutation importance for predicting dissatisfaction

Figure 10. Random forest permutation importance for predicting dissatisfaction

Both models agree. In the linear model, delay’s partial R² (7.9%) is roughly 6.4× larger than the second-ranked factor. In the random forest, delay’s importance score is also ranked #1, well ahead of freight cost and order price. Two structurally different models, same conclusion: delivery delay is not merely correlated with dissatisfaction — it remains the dominant factor after controlling for the eight most plausible alternative explanations available in this dataset.

9.5 The Shape of the Effect: Threshold, Not Slope

A regression coefficient implies a constant, linear effect — every extra day of lateness adds the same fixed amount of dissatisfaction. Does that match reality? We use the random forest to trace out the partial dependence of predicted dissatisfaction on delay_days, holding every other factor fixed at its typical (median/most-common) value, and compare it to the linear model’s (necessarily straight-line) prediction.

typical <- model_df_clean %>%
  summarise(
    distance_km       = median(distance_km),
    total_price        = median(total_price),
    total_freight        = median(total_freight),
    item_count            = round(median(item_count)),
    total_weight_kg         = median(total_weight_kg),
    avg_photos_qty            = median(avg_photos_qty),
    max_installments         = round(median(max_installments))
  )
mode_cat <- names(sort(table(model_df_clean$category_grp), decreasing = TRUE))[1]

delay_grid <- c(seq(-20, 20, by = 1), seq(24, 60, by = 4))
pd_grid <- data.frame(
  delay_days        = delay_grid,
  distance_km       = typical$distance_km,
  total_price        = typical$total_price,
  total_freight        = typical$total_freight,
  item_count            = typical$item_count,
  total_weight_kg         = typical$total_weight_kg,
  avg_photos_qty            = typical$avg_photos_qty,
  max_installments         = typical$max_installments,
  category_grp = factor(mode_cat, levels = levels(model_df_clean$category_grp))
)
pd_grid$pred_lm <- predict(lm_fit, newdata = pd_grid)
pd_grid$pred_rf <- predict(rf_fit, data = pd_grid)$predictions
pd_long <- pd_grid %>%
  select(delay_days, pred_lm, pred_rf) %>%
  pivot_longer(cols = c(pred_lm, pred_rf), names_to = "model", values_to = "predicted") %>%
  mutate(model = recode(model, pred_lm = "Linear Model (assumes constant slope)",
                        pred_rf = "Random Forest (data-driven shape)"))

ggplot(pd_long, aes(x = delay_days, y = predicted, colour = model)) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = col_neutral, linewidth = 0.7) +
  annotate("text", x = 0.5, y = 0.2, label = "On-time boundary", angle = 90,
           colour = col_neutral, size = 3, hjust = 0) +
  geom_line(linewidth = 1.3) +
  geom_point(size = 1.6, alpha = 0.7) +
  scale_colour_manual(values = c("Linear Model (assumes constant slope)" = col_neutral,
                                  "Random Forest (data-driven shape)" = col_late),
                      name = NULL) +
  labs(
    title    = "Dissatisfaction Jumps at the On-Time/Late Boundary, Then Plateaus",
    subtitle = "Both models hold distance, price, freight, item count, weight, category, and installments fixed at typical values",
    x        = "Delivery Delay (days; negative = early)",
    y        = "Predicted Dissatisfaction (0 = most satisfied, 4 = most dissatisfied)",
    caption  = "Source: same modelling dataset as Figures 9–10 · Random forest predictions averaged across 200 trees"
  ) +
  theme_olist
Figure 11. Partial effect of delivery delay on predicted dissatisfaction, controlling for all other factors

Figure 11. Partial effect of delivery delay on predicted dissatisfaction, controlling for all other factors

This is the most actionable finding in the report. The random forest reveals a threshold effect, not a smooth slope: predicted dissatisfaction roughly triples between 0 and 4 days late, then largely plateaus — an order 30 days late is not dramatically worse, in this model, than one 5 days late. The linear model, by construction, cannot see this — it forces a straight line through the same data and understates the damage of barely missing the promise while overstating the marginal harm of being very late. The practical implication: the highest-value intervention is not “ship faster on average” but “protect the promised date itself.” Shaving the worst delays from 30 days to 15 may matter less to customer sentiment than converting a 2-day miss into an on-time delivery.


10 Business Insights

10.0.1 🔍 Insight 1 — Late Delivery Is a Satisfaction Catastrophe, Not Just an Inconvenience

Late orders average a review score of 2.55 vs 4.28 for on-time orders — a gap of 1.73 points on a 5-point scale. Late orders are also 4× more likely to receive a 1-star review. With 8.1% of orders arriving late across ~96,000 deliveries, this represents a structurally embedded drag on platform reputation. Reducing the late rate by even half would meaningfully shift the platform’s aggregate NPS signal.

Caution: The analysis is observational. It is possible that other factors correlated with lateness (e.g. product type, seller reliability, distance) partially drive the score gap rather than lateness alone.

10.0.2 🔍 Insight 2 — Northern and North-Eastern States Are a Geographic Priority, and Distance Is Part of the Story

States like AL (23.9% late), MA (19.6%), PI (15.9%), CE (15.4%) and SE (15.2%) significantly exceed the national average of ~8%, as the Brazil map (Figure 6) shows visually — the reddest, highest-late-rate bubbles cluster away from the South-East commercial core where most sellers are based. The distance analysis (Figure 7) confirms this is not coincidental: late rate rises from 6.0% for same-state shipments to over 13% beyond 1,500 km, with review scores declining in step. However, the correlation between raw distance and delay is modest (r ≈ -0.08), meaning distance is a real but partial driver — regional carrier capacity and road infrastructure likely compound the effect specifically in the North/North-East. Targeted carrier partnerships or regional fulfilment hubs closer to these states would address both the distance penalty and the regional infrastructure gap simultaneously.

Caution: These states also have smaller order volumes, so the late-rate estimates carry higher statistical uncertainty. State centroids are an approximation (state capital coordinates), not actual shipment origin/destination points, so the distance metric is directional rather than precise.

10.0.3 🔍 Insight 3 — The Problem Is Specifically Late, Not Just Slow

The average actual delivery time is 12.6 days, but the median delay relative to the estimate is negative (i.e. most orders arrive early). This means the core issue is not overall speed but reliability of the promised date. Customers who receive a parcel 8 days early are moderately happy; customers who receive it 3 days late leave negative reviews. Olist’s lever here may be better delivery-date promise calibration (setting more conservative estimates that are reliably met) rather than necessarily accelerating logistics operations end-to-end.

Caution: We do not have visibility into whether Olist or the seller sets the delivery estimate. The appropriate corrective action differs substantially between these two scenarios.

10.0.4 🔍 Insight 4 — The Effect Survives Controls, and It’s a Threshold, Not a Slope

When we control for shipping distance, order price, freight cost, item count, shipped weight, listing quality, payment installments, and product category — using both a linear regression and an independent random forest model — delivery delay remains the single strongest predictor of dissatisfaction, explaining roughly 8× more unique variance than the next-best factor (item count) and ranking first in random forest importance by a wide margin. More importantly, the random forest reveals the effect is not a smooth slope: dissatisfaction roughly triples in the first few days after the promised date, then plateaus. This reframes the operational priority from “reduce average delay” to “protect the promised date” — converting narrow misses into on-time deliveries likely matters more to sentiment than shortening already-large delays.

Caution: This is a controlled association across two modelling approaches, not a randomised-experiment-level causal estimate. Both models’ R² remain modest (0.10–0.19), meaning most of the variation in review scores comes from factors not captured here — most plausibly product quality and condition-on-arrival, which this dataset cannot observe directly.


11 Limitations

11.0.1 ⚠️ Limitation 1 — Controlled, Not Randomised: Residual Confounding Remains Possible

The causal-inference section controls for eight plausible alternative explanations (distance, price, freight, item count, weight, listing quality, installments, category) and the delay effect survives across two different model types — meaningfully stronger evidence than a raw correlation. But this is not a randomised experiment: no order was ever randomly assigned to be late or on-time, so unmeasured confounders could still bias the estimate. The most likely candidates are product condition on arrival, packaging quality, and seller communication/responsiveness — none of which exist as columns in this dataset, and none of which can be ruled out. The modest R² of both models (0.10–0.19) is itself an honest signal: delay is the strongest single factor we can measure, but it is far from the only one driving review scores.

11.0.2 ⚠️ Limitation 2 — Dataset Scope and Temporal Coverage

The dataset covers 2016–2018 and represents a specific growth phase of Brazilian e-commerce. Logistical infrastructure, carrier networks, and customer expectations have evolved substantially since then. State-level findings (especially for low-volume states like AL with 401 orders or TO with 274 orders) carry high sampling uncertainty and should not be used to make multi-million dollar capital allocation decisions without refreshed data. Additionally, ~3,500 orders were dropped due to missing delivery dates, and orders with statuses other than “delivered” (cancellations, returns) are excluded — this may introduce survivor bias if late orders are more likely to be cancelled. The seller→customer distance metric uses state-capital centroids as a proxy for actual origin/destination coordinates (zip-code-level latitude/longitude was not available in this dataset), so it should be read as a directional signal of geographic distance rather than a precise shipping-route measurement.


12 AI Audit Note

12.0.1 🤖 AI Audit Note

This report was structured and drafted with AI assistance (Claude, Anthropic). The following transparency disclosures apply:

What AI contributed: narrative framing, ggplot2 code structure, section organisation, feature engineering choices for the regression/random forest models, and identification of candidate insights from descriptive statistics and model output.

What AI did not do: modify or simulate the underlying data, cherry-pick which model results to report, tune hyperparameters to produce a more dramatic result, or claim randomised-experiment-level causality.

Verification checklist: - ✅ All figures and models can be reproduced by re-knitting olist_delivery_analysis.Rmd with the original eight CSV files - ✅ No imputation or synthetic data generation was applied; rows with missing model features are dropped via na.omit(), with the drop rate explicitly reported - ✅ Correlation is Pearson on raw observations, not aggregated means (which would inflate r) - ✅ Two structurally different models (linear regression, random forest) were fit on identical features and the same target; their agreement is reported as corroborating evidence, not proof - ✅ Partial R² (drop1) and permutation importance (not impurity-based, which is biased toward high-cardinality variables) were used for fair cross-factor comparison - ⚠️ Causality is explicitly not claimed — the report uses “controlled association,” “partial effect,” and “net of measured confounders” rather than “causes” - ⚠️ Both models’ modest R² (0.10–0.19) is reported prominently rather than hidden, to avoid overstating how much of dissatisfaction this analysis actually explains - ⚠️ Human review is recommended before sharing insights with Olist operations or logistics teams, particularly for state-level investment recommendations given the small-n concern, and before treating the delay coefficient as an exact dollar-equivalent effect size

Recommended next steps for validation: a true natural experiment (e.g. comparing review scores before/after a carrier SLA change in a specific state, holding the order mix constant); sentiment analysis of free-text review comments to separate delivery complaints from product complaints; a seller-level random effect to absorb seller-specific quality that this model cannot currently isolate from delay.


cat("Report generated on:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"), "\n")
## Report generated on: 2026-06-16 13:00:56 CEST
cat("R version:", R.version$version.string, "\n")
## R version: R version 4.6.0 (2026-04-24 ucrt)
cat(sprintf("Key packages: ggplot2 %s | dplyr %s | patchwork %s\n",
    as.character(packageVersion("ggplot2")),
    as.character(packageVersion("dplyr")),
    as.character(packageVersion("patchwork"))))
## Key packages: ggplot2 4.0.3 | dplyr 1.2.1 | patchwork 1.3.2

End of Report · Olist Delivery Performance Analysis · Reproducible R Markdown