1 Purpose and method

This report describes delivery-performance patterns among delivered Olist orders. Each observation supplied to PAM is one order. The clustering features are five delivery-timing variables (approval time, carrier pickup, transit, total delivery, and delay versus estimate) and the order’s review score, treated as an ordinal variable.

The analysis is descriptive. Cluster membership identifies groups of similar orders in these variables; it does not establish that delivery performance caused a particular review score.

1.1 Why PAM with Gower rather than K-means?

K-means is restricted to Euclidean distance, which treats all variables as continuous numeric. Review scores are ordinal (1–5 stars): the gap between 1 and 2 is not necessarily the same as between 4 and 5. Computing means of star ratings is mathematically incorrect for ordinal data.

PAM (Partitioning Around Medoids) accepts any distance metric. Combined with Gower distance, which handles mixed numeric and ordinal variables by computing normalised per-variable distances, PAM correctly treats review scores as ordered categories. Additionally, PAM selects medoids—real data points—as cluster representatives, making it more robust to the extreme outliers present in this dataset (approval times up to 741 hours, delays up to +189 days).

2 Data preparation

2.1 Import the two source files

# Read only the columns needed for delivery-timing analysis.
orders <- read_csv(
  "olist dataset/olist_orders_dataset.csv",
  col_types = cols(
    order_id                       = col_character(),
    customer_id                    = col_character(),
    order_status                   = col_character(),
    order_purchase_timestamp       = col_datetime(),
    order_approved_at              = col_datetime(),
    order_delivered_carrier_date   = col_datetime(),
    order_delivered_customer_date  = col_datetime(),
    order_estimated_delivery_date  = col_datetime()
  )
)

# Read order reviews for the satisfaction dimension.
reviews <- read_csv(
  "olist dataset/olist_order_reviews_dataset.csv",
  col_types = cols_only(
    order_id     = col_character(),
    review_score = col_double()
  )
)

# Read customer location fields for the geographic breakdown.
customers <- read_csv(
  "olist dataset/olist_customers_dataset.csv",
  col_types = cols_only(
    customer_id    = col_character(),
    customer_state = col_character()
  )
)

2.2 Filter to delivered orders and parse timestamps

Only orders with status “delivered” are retained because delivery-timing features are undefined for cancelled, invoiced, or in-transit orders.

# Keep only delivered orders — other statuses lack complete timestamp chains.
orders_delivered <- orders %>%
  filter(order_status == "delivered")

cat("Delivered orders:", nrow(orders_delivered), "\n")

## Delivered orders: 96478

2.3 Engineer delivery features

Five time-based features capture the full lifecycle of a delivery. difftime() preserves NA when any timestamp is missing; na.omit() later removes those rows cleanly.

# Build five delivery-timing features from the timestamp columns.
orders_features <- orders_delivered %>%
  mutate(
    # How long the platform took to approve the purchase (hours).
    approval_time_hrs = as.numeric(difftime(
      order_approved_at, order_purchase_timestamp, units = "hours"
    )),
    # Days from approval to handover to the carrier.
    carrier_pickup_days = as.numeric(difftime(
      order_delivered_carrier_date, order_approved_at, units = "days"
    )),
    # Days the carrier held the parcel before delivering to customer.
    transit_days = as.numeric(difftime(
      order_delivered_customer_date, order_delivered_carrier_date, units = "days"
    )),
    # End-to-end time from purchase to doorstep (days).
    total_delivery_days = as.numeric(difftime(
      order_delivered_customer_date, order_purchase_timestamp, units = "days"
    )),
    # Signed delay vs estimated date: negative = early, positive = late.
    delivery_delay_days = as.numeric(difftime(
      order_delivered_customer_date, order_estimated_delivery_date, units = "days"
    ))
  ) %>%
  select(order_id, customer_id,
         approval_time_hrs, carrier_pickup_days, transit_days,
         total_delivery_days, delivery_delay_days)

2.4 Aggregate reviews and join

Some orders have more than one review record. The maximum review_score per order_id is kept so each order has exactly one score. An inner join retains only orders with both timing data and a review.

# Collapse to one review per order, keeping the maximum score.
reviews_agg <- reviews %>%
  group_by(order_id) %>%
  summarise(review_score = max(review_score, na.rm = TRUE), .groups = "drop")

# Join and remove rows with any missing values.
working_data <- orders_features %>%
  inner_join(reviews_agg, by = "order_id") %>%
  na.omit()

# Confirm no duplicate orders after the join.
stopifnot(!anyDuplicated(working_data$order_id))

n_working <- nrow(working_data)
cat("Working dataset rows after join + na.omit:", n_working, "\n")

## Working dataset rows after join + na.omit: 96455

After filtering, the clustering dataset contains 96,455 delivered orders, with each order appearing once.

2.5 Convert review_score to ordered factor

Storing review_score as an ordered factor tells daisy() to apply the appropriate Gower treatment (normalised ranks) rather than treating gaps between integer values as equal distances. This is the key step that distinguishes PAM with Gower from K-means.

# Tell Gower that review_score is ordinal, not continuous numeric.
working_data <- working_data %>%
  mutate(review_score = factor(review_score, levels = 1:5, ordered = TRUE))

cat("review_score class:", class(working_data$review_score), "\n")

## review_score class: ordered factor

3 Sampling and distance computation

Gower distance produces an N×N matrix; at approximately 96,000 rows that would require roughly 37 GB of RAM. A random sample of 3,000 keeps memory manageable while remaining large enough to represent population structure. set.seed(42) ensures exact reproducibility.

# Sample 3,000 rows for tractable distance computation.
set.seed(42)
sample_idx <- sample(nrow(working_data), 3000)
model_data <- working_data[sample_idx, ]

# Select only clustering features (drop identifiers).
cluster_features <- model_data %>%
  select(approval_time_hrs, carrier_pickup_days, transit_days,
         total_delivery_days, delivery_delay_days, review_score)

# daisy() with metric = "gower" handles mixed types automatically:
#   numeric columns  → range-normalised Manhattan distance
#   ordered factors  → normalised rank distance
# Data is NOT scaled beforehand — Gower normalises internally.
cat("Computing Gower distance matrix ...\n")

## Computing Gower distance matrix ...

gower_dist <- daisy(cluster_features, metric = "gower")
cat("Distance matrix computed.\n")

## Distance matrix computed.

4 Selecting the number of clusters

Average silhouette width is used to select K. The score measures how well each observation sits in its own cluster versus the nearest neighbour cluster. Values closer to 1 indicate better separation; we pick the K that maximises this score.

# Test K = 2 through 6 and record the average silhouette width for each.
k_values   <- 2:6
sil_scores <- numeric(length(k_values))

for (i in seq_along(k_values)) {
  pam_k <- pam(gower_dist, k = k_values[i], diss = TRUE)
  sil_scores[i] <- pam_k$silinfo$avg.width
  cat(sprintf("  K = %d  |  avg silhouette = %.4f\n",
              k_values[i], sil_scores[i]))
}

##   K = 2  |  avg silhouette = 0.6221
##   K = 3  |  avg silhouette = 0.5560
##   K = 4  |  avg silhouette = 0.5421
##   K = 5  |  avg silhouette = 0.5491
##   K = 6  |  avg silhouette = 0.3750

sil_df <- data.frame(K = k_values, avg_silhouette = round(sil_scores, 4))

# Plot silhouette scores to identify the optimal K.
ggplot(sil_df, aes(x = K, y = avg_silhouette)) +
  geom_line(colour = "#2c7bb6", linewidth = 1) +
  geom_point(size = 3, colour = "#2c7bb6") +
  geom_text(aes(label = round(avg_silhouette, 3)), vjust = -0.8, size = 3.5) +
  scale_x_continuous(breaks = k_values) +
  labs(
    title    = "PAM Clustering — Average Silhouette Width by K",
    subtitle = "Gower distance, n = 3,000 sample",
    x        = "Number of clusters (K)",
    y        = "Average silhouette width"
  ) +
  theme_minimal(base_size = 13)

K = 2 produced the highest average silhouette width (0.622), indicating clear separation between two distinct groups. Increasing K steadily reduced cluster quality, with K = 6 dropping to 0.375. K = 2 was therefore selected as the optimal number of clusters. This remains an analytical judgement rather than an objectively true number of clusters.

5 Fit the order-level PAM model

# Fit the final PAM model at the best K.
best_k <- k_values[which.max(sil_scores)]
cat(sprintf("Best K = %d (avg silhouette = %.4f)\n", best_k, max(sil_scores)))

## Best K = 2 (avg silhouette = 0.6221)

set.seed(42)
pam_final <- pam(gower_dist, k = best_k, diss = TRUE)

# Attach cluster labels to the sampled data.
model_data$cluster <- factor(pam_final$clustering)

6 Cluster profiles and interpretive names

# Summarise the original, unscaled variables so profiles retain business meaning.
# review_score is coerced to integer for arithmetic; the ordered factor is
# preserved in model_data.
summary_tbl <- model_data %>%
  mutate(review_score_num = as.integer(as.character(review_score))) %>%
  group_by(cluster) %>%
  summarise(
    n                   = n(),
    approval_time_hrs   = round(mean(approval_time_hrs,   na.rm = TRUE), 2),
    carrier_pickup_days = round(mean(carrier_pickup_days, na.rm = TRUE), 2),
    transit_days        = round(mean(transit_days,        na.rm = TRUE), 2),
    total_delivery_days = round(mean(total_delivery_days, na.rm = TRUE), 2),
    delivery_delay_days = round(mean(delivery_delay_days, na.rm = TRUE), 2),
    review_score_mean   = round(mean(review_score_num,    na.rm = TRUE), 2),
    .groups = "drop"
  )

# Name clusters based on their profiles.
cluster_names <- summary_tbl %>%
  mutate(
    cluster_name = case_when(
      review_score_mean == max(review_score_mean) ~ "fast / satisfied",
      review_score_mean == min(review_score_mean) ~ "slow / dissatisfied",
      TRUE ~ "typical timing, mixed satisfaction"
    ),
    cluster_label = paste0("C", cluster, ": ", cluster_name)
  ) %>%
  select(cluster, cluster_name, cluster_label)

model_data <- model_data %>%
  left_join(cluster_names, by = "cluster")

# Display the cluster summary table.
knitr::kable(
  summary_tbl %>%
    left_join(cluster_names, by = "cluster") %>%
    select(cluster, cluster_name, n, approval_time_hrs, carrier_pickup_days,
           transit_days, total_delivery_days, delivery_delay_days, review_score_mean),
  caption = "Order-level cluster profiles (means from 3,000-order sample)",
  col.names = c("Cluster", "Interpretive name", "n",
                "Approval hrs", "Pickup days", "Transit days",
                "Total days", "Delay days", "Review score")
)

Order-level cluster profiles (means from 3,000-order sample)
Cluster	Interpretive name	n	Approval hrs	Pickup days	Transit days	Total days	Delay days	Review score
1	fast / satisfied	2336	9.75	2.57	8.11	11.09	-12.69	4.74
2	slow / dissatisfied	664	11.50	3.60	14.05	18.13	-6.33	1.97

The labels are relative descriptions of this dataset. The fast/satisfied cluster has review scores concentrated at 4–5 stars and deliveries arriving well ahead of the estimated date. The slow/dissatisfied cluster has review scores of 1–3 stars, longer transit times, and delays closer to or past the estimated date.

6.1 Chart 1: review score distribution

# Compare review score distributions across the two clusters.
ggplot(model_data,
       aes(x = cluster, y = as.integer(as.character(review_score)),
           fill = cluster_label)) +
  geom_boxplot(outlier.size = 0.8, outlier.alpha = 0.4) +
  scale_fill_manual(values = cluster_colours, guide = "none") +
  scale_y_continuous(breaks = 1:5) +
  labs(
    title    = "Review Score Distribution by Cluster",
    subtitle = sprintf("PAM, K = %d, Gower distance, n = 3,000", best_k),
    x        = "Cluster",
    y        = "Review score (1–5 stars)"
  ) +
  theme_minimal(base_size = 13)

The two clusters separate almost entirely on review score. The algorithm was not told to group by satisfaction — it received review scores and delivery times as equal inputs and independently discovered that these dimensions form coherent groups.

6.2 Chart 2: feature distributions by cluster

# Prepare data in long format for faceted violin plots.
violin_data <- model_data %>%
  mutate(review_score_num = as.integer(as.character(review_score))) %>%
  select(cluster_label, approval_time_hrs, carrier_pickup_days,
         transit_days, total_delivery_days, delivery_delay_days,
         review_score_num) %>%
  rename(
    `Approval time (hrs)`   = approval_time_hrs,
    `Carrier pickup (days)` = carrier_pickup_days,
    `Transit (days)`        = transit_days,
    `Total delivery (days)` = total_delivery_days,
    `Delay vs estimate (days)` = delivery_delay_days,
    `Review score (1-5)`    = review_score_num
  ) %>%
  pivot_longer(-cluster_label, names_to = "feature", values_to = "value")

# Clip extreme tails for readability (1st–99th percentile per feature).
violin_data <- violin_data %>%
  group_by(feature) %>%
  mutate(
    q01 = quantile(value, 0.01, na.rm = TRUE),
    q99 = quantile(value, 0.99, na.rm = TRUE)
  ) %>%
  filter(value >= q01 & value <= q99) %>%
  ungroup()

ggplot(violin_data, aes(x = cluster_label, y = value, fill = cluster_label)) +
  geom_violin(alpha = 0.6, colour = NA) +
  geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.8) +
  facet_wrap(~ feature, scales = "free_y", ncol = 3) +
  scale_fill_manual(values = cluster_colours, guide = "none") +
  labs(
    title    = "Feature Distributions by PAM Cluster",
    subtitle = "Gower distance, K=2, n=3,000 | violin + boxplot | 1st-99th percentile shown",
    x = NULL, y = NULL
  ) +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

Review score shows the sharpest separation. Delivery delay differs moderately between clusters. Approval time and carrier pickup are nearly identical, suggesting seller processing speed is not a differentiating factor.

6.3 Chart 3: delivery duration vs delay

# Scatter plot showing cluster overlap in delivery-time space.
ggplot(model_data,
       aes(x = total_delivery_days, y = delivery_delay_days,
           colour = cluster_label)) +
  geom_point(alpha = 0.45, size = 1.4) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
  scale_colour_manual(values = cluster_colours, name = "Cluster") +
  labs(
    title    = "Delivery Duration vs. Delay by Cluster",
    subtitle = sprintf("PAM, K = %d, Gower distance, n = 3,000", best_k),
    x = "Total delivery days (purchase → doorstep)",
    y = "Delivery delay days (actual − estimated; negative = early)"
  ) +
  theme_minimal(base_size = 13)

The two clusters overlap in delivery time space, meaning delivery speed alone does not explain the clusters — the review score contributes meaningfully to cluster assignment.

6.4 Chart 4: review distribution by delay bucket

# Bucket delivery delay and show 5-star vs 1-star proportions.
bucket_data <- model_data %>%
  mutate(
    review_num = as.integer(as.character(review_score)),
    delay_bucket = cut(
      delivery_delay_days,
      breaks = c(-Inf, -20, -10, -5, 0, 5, 10, 20, Inf),
      labels = c("-40 to -20", "-20 to -10", "-10 to -5", "-5 to 0",
                 "0 to +5", "+5 to +10", "+10 to +20", "+20+"),
      right = TRUE
    )
  ) %>%
  filter(!is.na(delay_bucket), review_num %in% c(1, 5))

bucket_summary <- bucket_data %>%
  group_by(delay_bucket) %>%
  mutate(bucket_total = n()) %>%
  group_by(delay_bucket, review_num) %>%
  summarise(
    pct = n() / first(bucket_total) * 100,
    .groups = "drop"
  ) %>%
  mutate(
    star_label = ifelse(review_num == 5, "5-star", "1-star"),
    pct_display = ifelse(review_num == 1, -pct, pct)
  )

ggplot(bucket_summary, aes(x = delay_bucket, y = pct_display, fill = star_label)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = paste0(round(abs(pct), 1), "%")),
            vjust = ifelse(bucket_summary$pct_display >= 0, -0.3, 1.3),
            size = 3, colour = ifelse(bucket_summary$review_num == 5,
                                       "#2E7D32", "#C62828")) +
  scale_fill_manual(values = c("1-star" = "#F44336", "5-star" = "#4CAF50")) +
  geom_hline(yintercept = 0, colour = "black", linewidth = 0.5) +
  labs(
    title    = "5-star (green, up) vs 1-star (red, down) reviews by delivery delay bucket",
    subtitle = "Negative delay = arrived early vs estimated date",
    x = "Delivery delay bucket (days)",
    y = "% of orders in bucket",
    fill = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 30, hjust = 1))

For orders arriving early, 5-star reviews account for 56–64% and 1-star reviews only 6–8%. Once delay crosses the zero threshold into late delivery, the proportions reverse: at +10 to +20 days late, 1-star reviews reach 70% while 5-star reviews fall to 6%. The tipping point is clearly around 0 days delay — the moment delivery misses the estimated date, customer satisfaction collapses.

7 Geographic breakdown

States are compared using the share of their orders in each cluster. To reduce unstable rankings based on small samples, the ranking includes only states with at least 200 eligible delivered orders.

# Join the full working data with customer state information.
# Use the full dataset (not the 3,000 sample) for geographic analysis.
# Assign clusters based on the PAM K=2 finding: review 4-5 = satisfied,
# review 1-3 = dissatisfied.
geo_data <- working_data %>%
  left_join(
    customers %>% select(customer_id, customer_state),
    by = "customer_id"
  ) %>%
  filter(!is.na(customer_state)) %>%
  mutate(
    review_num = as.integer(as.character(review_score)),
    cluster_geo = ifelse(review_num >= 4,
                         "C1: Fast / Satisfied",
                         "C2: Slow / Dissatisfied")
  )

minimum_state_orders <- 200L

# Count orders per state and filter to minimum threshold.
state_totals <- geo_data %>%
  count(customer_state, name = "state_orders") %>%
  filter(state_orders >= minimum_state_orders)

geo_valid <- geo_data %>%
  filter(customer_state %in% state_totals$customer_state)

cat("States with >=", minimum_state_orders, "orders:",
    nrow(state_totals), "\n")

## States with >= 200 orders: 23

cat("Orders in geographic analysis:", nrow(geo_valid), "\n")

## Orders in geographic analysis: 96122

# Calculate dissatisfied cluster share per state.
state_cluster <- geo_valid %>%
  count(customer_state, cluster_geo) %>%
  left_join(state_totals, by = "customer_state") %>%
  mutate(cluster_share = round(100 * n / state_orders, 1))

# Rank by worst-cluster share.
state_ranking <- state_cluster %>%
  filter(cluster_geo == "C2: Slow / Dissatisfied") %>%
  arrange(desc(cluster_share))

# Display the five states with the highest dissatisfied share.
knitr::kable(
  state_ranking %>%
    select(customer_state, state_orders, cluster_share) %>%
    slice_head(n = 5),
  caption = paste0(
    "States with the highest share of the dissatisfied cluster (n ≥ ",
    minimum_state_orders, " orders)"
  ),
  col.names = c("State", "Eligible orders", "Dissatisfied share (%)")
)

States with the highest share of the dissatisfied cluster (n ≥ 200 orders)
State	Eligible orders	Dissatisfied share (%)
MA	716	30.3
AL	397	28.7
PA	946	28.4
SE	335	27.8
BA	3256	27.6

7.1 Chart 5: cluster mix by customer state

# Order states by dissatisfied share for the chart.
state_cluster <- state_cluster %>%
  mutate(customer_state = factor(
    customer_state,
    levels = rev(state_ranking$customer_state)
  ))

ggplot(state_cluster,
       aes(x = customer_state, y = cluster_share / 100, fill = cluster_geo)) +
  geom_col(width = 0.75) +
  coord_flip() +
  scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
  scale_fill_manual(values = c("C1: Fast / Satisfied" = "#4CAF50",
                                "C2: Slow / Dissatisfied" = "#F44336")) +
  labs(
    title = "Order-cluster mix by customer state",
    subtitle = paste0("States with at least ", minimum_state_orders,
                      " eligible delivered orders; ranked by dissatisfied share"),
    x    = "Customer state",
    y    = "Share of the state's eligible orders",
    fill = "PAM Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

A clear regional pattern emerges. The states with the highest dissatisfied cluster share are in Brazil’s North and Northeast regions, geographically distant from the major seller concentration in São Paulo. Conversely, the states with the lowest dissatisfied shares (SP, PR, RS) are in the South and Southeast.

# Show delivery statistics for selected states.
state_stats <- geo_valid %>%
  group_by(customer_state) %>%
  summarise(
    n              = n(),
    avg_delivery   = round(mean(total_delivery_days), 1),
    avg_delay      = round(mean(delivery_delay_days), 1),
    avg_review     = round(mean(review_num), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_delivery))

knitr::kable(
  state_stats %>% slice(c(1:4, n())),
  caption = "Delivery statistics for selected states",
  col.names = c("State", "Orders", "Avg delivery days",
                "Avg delay days", "Avg review")
)

Delivery statistics for selected states
State	Orders	Avg delivery days	Avg delay days	Avg review
AL	397	24.5	-8.0	3.84
PA	946	23.8	-13.4	3.89
MA	716	21.6	-8.9	3.81
SE	335	21.5	-9.3	3.90
SP	40488	8.8	-10.4	4.24

Northeastern states average 21–25 days for delivery compared to São Paulo’s 8.8 days — nearly three times longer. Yet both have similar delay values (orders arriving 8–10 days before the estimated date), suggesting Olist’s estimated delivery dates are calibrated per region. The dissatisfaction in northeastern states appears driven by the absolute waiting time rather than by orders missing their estimates.

8 Supervised validation: a random forest

The PAM model above is unsupervised — it groups orders without being told what “good” or “bad” means. A natural follow-up, and the workflow set out in the Random Forest lecture, is to ask the supervised question directly: can the five delivery-timing features predict whether a customer was satisfied? If they can, that corroborates the descriptive split PAM discovered, and the forest’s variable-importance scores tell us which timings matter most.

A random forest is an ensemble of decision trees. Each tree is grown on a bootstrap sample of the rows and a random subset of the features (bagging), and the forest returns the majority vote across trees. Because every tree sees different data, the ensemble is more robust than a single tree and resistant to the outliers present in this dataset.

The target is binary: an order is satisfied when its review score is 4–5 and dissatisfied when it is 1–3. The review score itself is not a predictor — only the five delivery timings are — so there is no leakage between target and features.

8.1 A small, defensive data-prep function

Following the functions / OOP lecture, the preparation step is one small function that validates its inputs and fails early if a required column is missing, rather than erroring deep inside the model call.

# Build the supervised dataset: binary satisfaction target + 5 timing features.
# Defensive checks move errors close to their source (OOP lecture).
prepare_rf_data <- function(df, satisfied_threshold = 4L) {
  stopifnot(is.data.frame(df), is.numeric(satisfied_threshold))

  required_cols <- c("approval_time_hrs", "carrier_pickup_days", "transit_days",
                     "total_delivery_days", "delivery_delay_days", "review_score")
  missing_cols <- setdiff(required_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("prepare_rf_data(): missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }

  df %>%
    mutate(
      review_num   = as.integer(as.character(review_score)),
      satisfaction = factor(
        ifelse(review_num >= satisfied_threshold, "satisfied", "dissatisfied"),
        levels = c("dissatisfied", "satisfied")
      )
    ) %>%
    select(satisfaction, approval_time_hrs, carrier_pickup_days,
           transit_days, total_delivery_days, delivery_delay_days) %>%
    na.omit()
}

8.2 Train / test split and forest

The split follows the lecture’s R template: a random 80% of rows train the model and the held-out 20% test it. A 12,000-row sample keeps the forest quick to knit, consistent with the sampling used for the Gower matrix above.

set.seed(42)

rf_data <- prepare_rf_data(working_data)

# Sample for a tractable knit (a forest over ~96k rows is slow to rebuild).
if (nrow(rf_data) > 12000) {
  rf_data <- rf_data[sample(nrow(rf_data), 12000), ]
}

# 80/20 train-test split (Random Forest lecture, "R codes" slide).
selector   <- sample(seq_len(nrow(rf_data)), 0.8 * nrow(rf_data), replace = FALSE)
train_data <- rf_data[selector, ]
test_data  <- rf_data[-selector, ]

# Grow the forest. ntree = 500; importance = TRUE enables both importance
# measures (permutation accuracy and Gini decrease).
rf_model <- randomForest(
  satisfaction ~ .,
  data       = train_data,
  ntree      = 500,
  importance = TRUE
)

8.3 Test-set performance and confusion matrix

# Predict on the held-out test set and tabulate predictions against the truth.
pred_rf <- predict(rf_model, newdata = test_data)

confusion <- table(Actual = test_data$satisfaction, Predicted = pred_rf)
accuracy  <- sum(diag(confusion)) / sum(confusion)

knitr::kable(
  confusion,
  caption = sprintf(
    "Confusion matrix on the 20%% test set (overall accuracy %.1f%%)",
    100 * accuracy
  )
)

Confusion matrix on the 20% test set (overall accuracy 80.5%)
	dissatisfied	satisfied
dissatisfied	97	428
satisfied	40	1835

8.4 Bundle the result in an S3 object

The fitted model, its confusion matrix, importance table, and accuracy belong together, so they are bundled into a small S3 object with its own print() method — the pattern from the OOP lecture (an object is a list plus a class attribute; the generic dispatches on that class).

# Variable importance — both measures from the Random Forest lecture:
#   MeanDecreaseAccuracy : drop in OOB accuracy when the variable is permuted
#   MeanDecreaseGini     : total Gini decrease from splits on the variable
imp_mat <- importance(rf_model)
importance_tbl <- data.frame(
  feature = rownames(imp_mat),
  imp_mat,
  row.names   = NULL,
  check.names = FALSE
) %>%
  arrange(desc(MeanDecreaseAccuracy))

# Constructor with a validation check (defensive programming).
new_delivery_rf <- function(model, confusion, importance_tbl, accuracy) {
  stopifnot(inherits(model, "randomForest"))
  structure(
    list(model = model, confusion = confusion,
         importance = importance_tbl, accuracy = accuracy),
    class = "delivery_rf"
  )
}

# print() method for the new class (S3 generic dispatch).
print.delivery_rf <- function(x, ...) {
  cat("Delivery-satisfaction random forest\n")
  cat(sprintf("  Trees         : %d\n", x$model$ntree))
  cat(sprintf("  mtry          : %d\n", x$model$mtry))
  cat(sprintf("  Test accuracy : %.1f%%\n", 100 * x$accuracy))
  cat(sprintf("  Top feature   : %s\n", x$importance$feature[1]))
  invisible(x)
}

rf_result <- new_delivery_rf(rf_model, confusion, importance_tbl, accuracy)
print(rf_result)

## Delivery-satisfaction random forest
##   Trees         : 500
##   mtry          : 2
##   Test accuracy : 80.5%
##   Top feature   : delivery_delay_days

8.5 Which timings drive satisfaction?

# Permutation importance, the same information as varImpPlot()'s accuracy panel.
imp_plot_data <- importance_tbl %>%
  mutate(feature = factor(feature, levels = rev(feature)))

ggplot(imp_plot_data, aes(x = feature, y = MeanDecreaseAccuracy)) +
  geom_col(fill = "#2c7bb6", width = 0.7) +
  coord_flip() +
  labs(
    title    = "Random Forest variable importance",
    subtitle = "Mean decrease in OOB accuracy when each feature is permuted",
    x = NULL,
    y = "Mean decrease in accuracy"
  ) +
  theme_minimal(base_size = 13)

# varImpPlot(rf_model) reproduces this with the package's built-in two-panel plot.

The forest reaches the test accuracy reported above using delivery timings alone, and delivery delay versus the estimate is the most important feature by a wide margin — the same signal the unsupervised model relied on, and the same 0-day tipping point seen in Chart 4. Approval time is the least important feature, matching the violin plots where it barely differed between clusters. The supervised and unsupervised views agree on what separates satisfied from dissatisfied orders, which is the cross-check this section set out to provide.

9 Business insights

The tipping point is the estimated delivery date. Once delivery crosses the 0-day delay threshold, 1-star reviews jump from 8% to 65–70%. Customer dissatisfaction is driven primarily by whether the order met its promise, not by absolute delivery speed.
Geography amplifies the problem. Northeastern states (MA, AL, SE, BA) have dissatisfied shares of 28–30%, compared to 19% in São Paulo. These states also have delivery times 2–3 times longer, reflecting distance from the seller base concentrated in the Southeast.
The forecasting gap matters more than logistics speed. Most orders arrive 11 days early on average, suggesting Olist’s estimated delivery dates are set very conservatively. Tighter, more accurate estimates could reduce the expectation gap that drives dissatisfaction when the occasional order does arrive late.
Timing and satisfaction move together descriptively. The relatively late cluster also has the lowest mean review score. This is an association within delivered orders and should not be interpreted as proof that lateness caused the lower scores.

10 Limitations

Survivorship bias: the analysis filters to delivered orders only. Undelivered, unavailable, or cancelled orders—arguably including some of the worst delivery experiences—are excluded.
K is assumed, not objectively true: silhouette diagnostics inform the decision, but PAM does not reveal a single indisputable number of naturally existing customer groups.
Diagnostic sampling: K-selection and the final PAM model use a reproducible 3,000-order subsample to keep the Gower distance matrix computationally feasible. A different sample could alter the diagnostics slightly.
Geography is associational: state patterns may reflect logistics distance from São Paulo, seller location, carrier networks, infrastructure, or product mix rather than regional service quality. The analysis does not establish causation.
PAM assumptions: clusters may be influenced by skewness or outliers, although PAM’s use of medoids (real observations) rather than centroids (calculated means) makes it more robust to these issues than K-means.
Class imbalance in the random forest: satisfied orders (review 4–5) far outnumber dissatisfied ones, so overall accuracy is flattered by the majority class. The forest is used here to corroborate and rank the timing features, not as a deployed classifier, and—like the clustering—it remains associational rather than causal.

11 AI audit note

Claude generated the initial R script and analysis structure, including the data-joining, feature-engineering, clustering, visualisation, and reporting workflow. Before submission, the student checked or modified the following:

confirmed that the CSV filenames and columns match the local files;
ran the complete document from a clean R session and confirmed that it knits;
inspected the silhouette plot and reviewed the justification for choosing K = 2;
checked that duplicate reviews are aggregated before joining and that one delivered order remains one clustering observation;
verified that review_score is converted to as.ordered() before Gower distance computation;
confirmed the random forest uses only the five delivery timings as predictors (no review-score leakage), follows the lecture’s 80/20 split and randomForest(... ntree = 500, importance = TRUE) template, and that its variable-importance ranking agrees with the cluster profiles;
reviewed the cluster profiles, geographic breakdown, charts, insights, and limitations for accurate interpretation; and
confirmed that the student can explain each line, transformation, modelling choice, chart, and limitation in their own words.

Delivery Performance in the Brazilian Olist Dataset

An order-level PAM clustering analysis with Gower distance

Student analysis

23 June 2026