This report describes delivery-performance patterns among delivered Olist orders. Each observation supplied to PAM is one order. The clustering features are five delivery-timing variables (approval time, carrier pickup, transit, total delivery, and delay versus estimate) and the order’s review score, treated as an ordinal variable.
The analysis is descriptive. Cluster membership identifies groups of similar orders in these variables; it does not establish that delivery performance caused a particular review score.
K-means is restricted to Euclidean distance, which treats all variables as continuous numeric. Review scores are ordinal (1–5 stars): the gap between 1 and 2 is not necessarily the same as between 4 and 5. Computing means of star ratings is mathematically incorrect for ordinal data.
PAM (Partitioning Around Medoids) accepts any distance metric. Combined with Gower distance, which handles mixed numeric and ordinal variables by computing normalised per-variable distances, PAM correctly treats review scores as ordered categories. Additionally, PAM selects medoids—real data points—as cluster representatives, making it more robust to the extreme outliers present in this dataset (approval times up to 741 hours, delays up to +189 days).
# Read only the columns needed for delivery-timing analysis.
orders <- read_csv(
"olist dataset/olist_orders_dataset.csv",
col_types = cols(
order_id = col_character(),
customer_id = col_character(),
order_status = col_character(),
order_purchase_timestamp = col_datetime(),
order_approved_at = col_datetime(),
order_delivered_carrier_date = col_datetime(),
order_delivered_customer_date = col_datetime(),
order_estimated_delivery_date = col_datetime()
)
)
# Read order reviews for the satisfaction dimension.
reviews <- read_csv(
"olist dataset/olist_order_reviews_dataset.csv",
col_types = cols_only(
order_id = col_character(),
review_score = col_double()
)
)
# Read customer location fields for the geographic breakdown.
customers <- read_csv(
"olist dataset/olist_customers_dataset.csv",
col_types = cols_only(
customer_id = col_character(),
customer_state = col_character()
)
)
Only orders with status “delivered” are retained because delivery-timing features are undefined for cancelled, invoiced, or in-transit orders.
# Keep only delivered orders — other statuses lack complete timestamp chains.
orders_delivered <- orders %>%
filter(order_status == "delivered")
cat("Delivered orders:", nrow(orders_delivered), "\n")
## Delivered orders: 96478
Five time-based features capture the full lifecycle of a delivery.
difftime() preserves NA when any timestamp is missing;
na.omit() later removes those rows cleanly.
# Build five delivery-timing features from the timestamp columns.
orders_features <- orders_delivered %>%
mutate(
# How long the platform took to approve the purchase (hours).
approval_time_hrs = as.numeric(difftime(
order_approved_at, order_purchase_timestamp, units = "hours"
)),
# Days from approval to handover to the carrier.
carrier_pickup_days = as.numeric(difftime(
order_delivered_carrier_date, order_approved_at, units = "days"
)),
# Days the carrier held the parcel before delivering to customer.
transit_days = as.numeric(difftime(
order_delivered_customer_date, order_delivered_carrier_date, units = "days"
)),
# End-to-end time from purchase to doorstep (days).
total_delivery_days = as.numeric(difftime(
order_delivered_customer_date, order_purchase_timestamp, units = "days"
)),
# Signed delay vs estimated date: negative = early, positive = late.
delivery_delay_days = as.numeric(difftime(
order_delivered_customer_date, order_estimated_delivery_date, units = "days"
))
) %>%
select(order_id, customer_id,
approval_time_hrs, carrier_pickup_days, transit_days,
total_delivery_days, delivery_delay_days)
Some orders have more than one review record. The maximum
review_score per order_id is kept so each
order has exactly one score. An inner join retains only orders with both
timing data and a review.
# Collapse to one review per order, keeping the maximum score.
reviews_agg <- reviews %>%
group_by(order_id) %>%
summarise(review_score = max(review_score, na.rm = TRUE), .groups = "drop")
# Join and remove rows with any missing values.
working_data <- orders_features %>%
inner_join(reviews_agg, by = "order_id") %>%
na.omit()
# Confirm no duplicate orders after the join.
stopifnot(!anyDuplicated(working_data$order_id))
n_working <- nrow(working_data)
cat("Working dataset rows after join + na.omit:", n_working, "\n")
## Working dataset rows after join + na.omit: 96455
After filtering, the clustering dataset contains 96,455 delivered orders, with each order appearing once.
Storing review_score as an ordered factor tells
daisy() to apply the appropriate Gower treatment
(normalised ranks) rather than treating gaps between integer values as
equal distances. This is the key step that distinguishes PAM with Gower
from K-means.
# Tell Gower that review_score is ordinal, not continuous numeric.
working_data <- working_data %>%
mutate(review_score = factor(review_score, levels = 1:5, ordered = TRUE))
cat("review_score class:", class(working_data$review_score), "\n")
## review_score class: ordered factor
Gower distance produces an N×N matrix; at approximately 96,000 rows
that would require roughly 37 GB of RAM. A random sample of 3,000 keeps
memory manageable while remaining large enough to represent population
structure. set.seed(42) ensures exact reproducibility.
# Sample 3,000 rows for tractable distance computation.
set.seed(42)
sample_idx <- sample(nrow(working_data), 3000)
model_data <- working_data[sample_idx, ]
# Select only clustering features (drop identifiers).
cluster_features <- model_data %>%
select(approval_time_hrs, carrier_pickup_days, transit_days,
total_delivery_days, delivery_delay_days, review_score)
# daisy() with metric = "gower" handles mixed types automatically:
# numeric columns → range-normalised Manhattan distance
# ordered factors → normalised rank distance
# Data is NOT scaled beforehand — Gower normalises internally.
cat("Computing Gower distance matrix ...\n")
## Computing Gower distance matrix ...
gower_dist <- daisy(cluster_features, metric = "gower")
cat("Distance matrix computed.\n")
## Distance matrix computed.
Average silhouette width is used to select K. The score measures how well each observation sits in its own cluster versus the nearest neighbour cluster. Values closer to 1 indicate better separation; we pick the K that maximises this score.
# Test K = 2 through 6 and record the average silhouette width for each.
k_values <- 2:6
sil_scores <- numeric(length(k_values))
for (i in seq_along(k_values)) {
pam_k <- pam(gower_dist, k = k_values[i], diss = TRUE)
sil_scores[i] <- pam_k$silinfo$avg.width
cat(sprintf(" K = %d | avg silhouette = %.4f\n",
k_values[i], sil_scores[i]))
}
## K = 2 | avg silhouette = 0.6221
## K = 3 | avg silhouette = 0.5560
## K = 4 | avg silhouette = 0.5421
## K = 5 | avg silhouette = 0.5491
## K = 6 | avg silhouette = 0.3750
sil_df <- data.frame(K = k_values, avg_silhouette = round(sil_scores, 4))
# Plot silhouette scores to identify the optimal K.
ggplot(sil_df, aes(x = K, y = avg_silhouette)) +
geom_line(colour = "#2c7bb6", linewidth = 1) +
geom_point(size = 3, colour = "#2c7bb6") +
geom_text(aes(label = round(avg_silhouette, 3)), vjust = -0.8, size = 3.5) +
scale_x_continuous(breaks = k_values) +
labs(
title = "PAM Clustering — Average Silhouette Width by K",
subtitle = "Gower distance, n = 3,000 sample",
x = "Number of clusters (K)",
y = "Average silhouette width"
) +
theme_minimal(base_size = 13)
K = 2 produced the highest average silhouette width (0.622), indicating clear separation between two distinct groups. Increasing K steadily reduced cluster quality, with K = 6 dropping to 0.375. K = 2 was therefore selected as the optimal number of clusters. This remains an analytical judgement rather than an objectively true number of clusters.
# Fit the final PAM model at the best K.
best_k <- k_values[which.max(sil_scores)]
cat(sprintf("Best K = %d (avg silhouette = %.4f)\n", best_k, max(sil_scores)))
## Best K = 2 (avg silhouette = 0.6221)
set.seed(42)
pam_final <- pam(gower_dist, k = best_k, diss = TRUE)
# Attach cluster labels to the sampled data.
model_data$cluster <- factor(pam_final$clustering)
# Summarise the original, unscaled variables so profiles retain business meaning.
# review_score is coerced to integer for arithmetic; the ordered factor is
# preserved in model_data.
summary_tbl <- model_data %>%
mutate(review_score_num = as.integer(as.character(review_score))) %>%
group_by(cluster) %>%
summarise(
n = n(),
approval_time_hrs = round(mean(approval_time_hrs, na.rm = TRUE), 2),
carrier_pickup_days = round(mean(carrier_pickup_days, na.rm = TRUE), 2),
transit_days = round(mean(transit_days, na.rm = TRUE), 2),
total_delivery_days = round(mean(total_delivery_days, na.rm = TRUE), 2),
delivery_delay_days = round(mean(delivery_delay_days, na.rm = TRUE), 2),
review_score_mean = round(mean(review_score_num, na.rm = TRUE), 2),
.groups = "drop"
)
# Name clusters based on their profiles.
cluster_names <- summary_tbl %>%
mutate(
cluster_name = case_when(
review_score_mean == max(review_score_mean) ~ "fast / satisfied",
review_score_mean == min(review_score_mean) ~ "slow / dissatisfied",
TRUE ~ "typical timing, mixed satisfaction"
),
cluster_label = paste0("C", cluster, ": ", cluster_name)
) %>%
select(cluster, cluster_name, cluster_label)
model_data <- model_data %>%
left_join(cluster_names, by = "cluster")
# Display the cluster summary table.
knitr::kable(
summary_tbl %>%
left_join(cluster_names, by = "cluster") %>%
select(cluster, cluster_name, n, approval_time_hrs, carrier_pickup_days,
transit_days, total_delivery_days, delivery_delay_days, review_score_mean),
caption = "Order-level cluster profiles (means from 3,000-order sample)",
col.names = c("Cluster", "Interpretive name", "n",
"Approval hrs", "Pickup days", "Transit days",
"Total days", "Delay days", "Review score")
)
| Cluster | Interpretive name | n | Approval hrs | Pickup days | Transit days | Total days | Delay days | Review score |
|---|---|---|---|---|---|---|---|---|
| 1 | fast / satisfied | 2336 | 9.75 | 2.57 | 8.11 | 11.09 | -12.69 | 4.74 |
| 2 | slow / dissatisfied | 664 | 11.50 | 3.60 | 14.05 | 18.13 | -6.33 | 1.97 |
The labels are relative descriptions of this dataset. The fast/satisfied cluster has review scores concentrated at 4–5 stars and deliveries arriving well ahead of the estimated date. The slow/dissatisfied cluster has review scores of 1–3 stars, longer transit times, and delays closer to or past the estimated date.
# Compare review score distributions across the two clusters.
ggplot(model_data,
aes(x = cluster, y = as.integer(as.character(review_score)),
fill = cluster_label)) +
geom_boxplot(outlier.size = 0.8, outlier.alpha = 0.4) +
scale_fill_manual(values = cluster_colours, guide = "none") +
scale_y_continuous(breaks = 1:5) +
labs(
title = "Review Score Distribution by Cluster",
subtitle = sprintf("PAM, K = %d, Gower distance, n = 3,000", best_k),
x = "Cluster",
y = "Review score (1–5 stars)"
) +
theme_minimal(base_size = 13)
The two clusters separate almost entirely on review score. The algorithm was not told to group by satisfaction — it received review scores and delivery times as equal inputs and independently discovered that these dimensions form coherent groups.
# Prepare data in long format for faceted violin plots.
violin_data <- model_data %>%
mutate(review_score_num = as.integer(as.character(review_score))) %>%
select(cluster_label, approval_time_hrs, carrier_pickup_days,
transit_days, total_delivery_days, delivery_delay_days,
review_score_num) %>%
rename(
`Approval time (hrs)` = approval_time_hrs,
`Carrier pickup (days)` = carrier_pickup_days,
`Transit (days)` = transit_days,
`Total delivery (days)` = total_delivery_days,
`Delay vs estimate (days)` = delivery_delay_days,
`Review score (1-5)` = review_score_num
) %>%
pivot_longer(-cluster_label, names_to = "feature", values_to = "value")
# Clip extreme tails for readability (1st–99th percentile per feature).
violin_data <- violin_data %>%
group_by(feature) %>%
mutate(
q01 = quantile(value, 0.01, na.rm = TRUE),
q99 = quantile(value, 0.99, na.rm = TRUE)
) %>%
filter(value >= q01 & value <= q99) %>%
ungroup()
ggplot(violin_data, aes(x = cluster_label, y = value, fill = cluster_label)) +
geom_violin(alpha = 0.6, colour = NA) +
geom_boxplot(width = 0.15, outlier.shape = NA, alpha = 0.8) +
facet_wrap(~ feature, scales = "free_y", ncol = 3) +
scale_fill_manual(values = cluster_colours, guide = "none") +
labs(
title = "Feature Distributions by PAM Cluster",
subtitle = "Gower distance, K=2, n=3,000 | violin + boxplot | 1st-99th percentile shown",
x = NULL, y = NULL
) +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 15, hjust = 1))
Review score shows the sharpest separation. Delivery delay differs moderately between clusters. Approval time and carrier pickup are nearly identical, suggesting seller processing speed is not a differentiating factor.
# Scatter plot showing cluster overlap in delivery-time space.
ggplot(model_data,
aes(x = total_delivery_days, y = delivery_delay_days,
colour = cluster_label)) +
geom_point(alpha = 0.45, size = 1.4) +
geom_hline(yintercept = 0, linetype = "dashed", colour = "grey40") +
scale_colour_manual(values = cluster_colours, name = "Cluster") +
labs(
title = "Delivery Duration vs. Delay by Cluster",
subtitle = sprintf("PAM, K = %d, Gower distance, n = 3,000", best_k),
x = "Total delivery days (purchase → doorstep)",
y = "Delivery delay days (actual − estimated; negative = early)"
) +
theme_minimal(base_size = 13)
The two clusters overlap in delivery time space, meaning delivery speed alone does not explain the clusters — the review score contributes meaningfully to cluster assignment.
# Bucket delivery delay and show 5-star vs 1-star proportions.
bucket_data <- model_data %>%
mutate(
review_num = as.integer(as.character(review_score)),
delay_bucket = cut(
delivery_delay_days,
breaks = c(-Inf, -20, -10, -5, 0, 5, 10, 20, Inf),
labels = c("-40 to -20", "-20 to -10", "-10 to -5", "-5 to 0",
"0 to +5", "+5 to +10", "+10 to +20", "+20+"),
right = TRUE
)
) %>%
filter(!is.na(delay_bucket), review_num %in% c(1, 5))
bucket_summary <- bucket_data %>%
group_by(delay_bucket) %>%
mutate(bucket_total = n()) %>%
group_by(delay_bucket, review_num) %>%
summarise(
pct = n() / first(bucket_total) * 100,
.groups = "drop"
) %>%
mutate(
star_label = ifelse(review_num == 5, "5-star", "1-star"),
pct_display = ifelse(review_num == 1, -pct, pct)
)
ggplot(bucket_summary, aes(x = delay_bucket, y = pct_display, fill = star_label)) +
geom_col(width = 0.7) +
geom_text(aes(label = paste0(round(abs(pct), 1), "%")),
vjust = ifelse(bucket_summary$pct_display >= 0, -0.3, 1.3),
size = 3, colour = ifelse(bucket_summary$review_num == 5,
"#2E7D32", "#C62828")) +
scale_fill_manual(values = c("1-star" = "#F44336", "5-star" = "#4CAF50")) +
geom_hline(yintercept = 0, colour = "black", linewidth = 0.5) +
labs(
title = "5-star (green, up) vs 1-star (red, down) reviews by delivery delay bucket",
subtitle = "Negative delay = arrived early vs estimated date",
x = "Delivery delay bucket (days)",
y = "% of orders in bucket",
fill = NULL
) +
theme_minimal(base_size = 12) +
theme(legend.position = "top",
axis.text.x = element_text(angle = 30, hjust = 1))
For orders arriving early, 5-star reviews account for 56–64% and 1-star reviews only 6–8%. Once delay crosses the zero threshold into late delivery, the proportions reverse: at +10 to +20 days late, 1-star reviews reach 70% while 5-star reviews fall to 6%. The tipping point is clearly around 0 days delay — the moment delivery misses the estimated date, customer satisfaction collapses.
States are compared using the share of their orders in each cluster. To reduce unstable rankings based on small samples, the ranking includes only states with at least 200 eligible delivered orders.
# Join the full working data with customer state information.
# Use the full dataset (not the 3,000 sample) for geographic analysis.
# Assign clusters based on the PAM K=2 finding: review 4-5 = satisfied,
# review 1-3 = dissatisfied.
geo_data <- working_data %>%
left_join(
customers %>% select(customer_id, customer_state),
by = "customer_id"
) %>%
filter(!is.na(customer_state)) %>%
mutate(
review_num = as.integer(as.character(review_score)),
cluster_geo = ifelse(review_num >= 4,
"C1: Fast / Satisfied",
"C2: Slow / Dissatisfied")
)
minimum_state_orders <- 200L
# Count orders per state and filter to minimum threshold.
state_totals <- geo_data %>%
count(customer_state, name = "state_orders") %>%
filter(state_orders >= minimum_state_orders)
geo_valid <- geo_data %>%
filter(customer_state %in% state_totals$customer_state)
cat("States with >=", minimum_state_orders, "orders:",
nrow(state_totals), "\n")
## States with >= 200 orders: 23
cat("Orders in geographic analysis:", nrow(geo_valid), "\n")
## Orders in geographic analysis: 96122
# Calculate dissatisfied cluster share per state.
state_cluster <- geo_valid %>%
count(customer_state, cluster_geo) %>%
left_join(state_totals, by = "customer_state") %>%
mutate(cluster_share = round(100 * n / state_orders, 1))
# Rank by worst-cluster share.
state_ranking <- state_cluster %>%
filter(cluster_geo == "C2: Slow / Dissatisfied") %>%
arrange(desc(cluster_share))
# Display the five states with the highest dissatisfied share.
knitr::kable(
state_ranking %>%
select(customer_state, state_orders, cluster_share) %>%
slice_head(n = 5),
caption = paste0(
"States with the highest share of the dissatisfied cluster (n ≥ ",
minimum_state_orders, " orders)"
),
col.names = c("State", "Eligible orders", "Dissatisfied share (%)")
)
| State | Eligible orders | Dissatisfied share (%) |
|---|---|---|
| MA | 716 | 30.3 |
| AL | 397 | 28.7 |
| PA | 946 | 28.4 |
| SE | 335 | 27.8 |
| BA | 3256 | 27.6 |
# Order states by dissatisfied share for the chart.
state_cluster <- state_cluster %>%
mutate(customer_state = factor(
customer_state,
levels = rev(state_ranking$customer_state)
))
ggplot(state_cluster,
aes(x = customer_state, y = cluster_share / 100, fill = cluster_geo)) +
geom_col(width = 0.75) +
coord_flip() +
scale_y_continuous(labels = function(x) paste0(round(100 * x), "%")) +
scale_fill_manual(values = c("C1: Fast / Satisfied" = "#4CAF50",
"C2: Slow / Dissatisfied" = "#F44336")) +
labs(
title = "Order-cluster mix by customer state",
subtitle = paste0("States with at least ", minimum_state_orders,
" eligible delivered orders; ranked by dissatisfied share"),
x = "Customer state",
y = "Share of the state's eligible orders",
fill = "PAM Cluster"
) +
theme_minimal() +
theme(legend.position = "bottom")
A clear regional pattern emerges. The states with the highest dissatisfied cluster share are in Brazil’s North and Northeast regions, geographically distant from the major seller concentration in São Paulo. Conversely, the states with the lowest dissatisfied shares (SP, PR, RS) are in the South and Southeast.
# Show delivery statistics for selected states.
state_stats <- geo_valid %>%
group_by(customer_state) %>%
summarise(
n = n(),
avg_delivery = round(mean(total_delivery_days), 1),
avg_delay = round(mean(delivery_delay_days), 1),
avg_review = round(mean(review_num), 2),
.groups = "drop"
) %>%
arrange(desc(avg_delivery))
knitr::kable(
state_stats %>% slice(c(1:4, n())),
caption = "Delivery statistics for selected states",
col.names = c("State", "Orders", "Avg delivery days",
"Avg delay days", "Avg review")
)
| State | Orders | Avg delivery days | Avg delay days | Avg review |
|---|---|---|---|---|
| AL | 397 | 24.5 | -8.0 | 3.84 |
| PA | 946 | 23.8 | -13.4 | 3.89 |
| MA | 716 | 21.6 | -8.9 | 3.81 |
| SE | 335 | 21.5 | -9.3 | 3.90 |
| SP | 40488 | 8.8 | -10.4 | 4.24 |
Northeastern states average 21–25 days for delivery compared to São Paulo’s 8.8 days — nearly three times longer. Yet both have similar delay values (orders arriving 8–10 days before the estimated date), suggesting Olist’s estimated delivery dates are calibrated per region. The dissatisfaction in northeastern states appears driven by the absolute waiting time rather than by orders missing their estimates.
The PAM model above is unsupervised — it groups orders without being told what “good” or “bad” means. A natural follow-up, and the workflow set out in the Random Forest lecture, is to ask the supervised question directly: can the five delivery-timing features predict whether a customer was satisfied? If they can, that corroborates the descriptive split PAM discovered, and the forest’s variable-importance scores tell us which timings matter most.
A random forest is an ensemble of decision trees. Each tree is grown on a bootstrap sample of the rows and a random subset of the features (bagging), and the forest returns the majority vote across trees. Because every tree sees different data, the ensemble is more robust than a single tree and resistant to the outliers present in this dataset.
The target is binary: an order is satisfied when its review score is 4–5 and dissatisfied when it is 1–3. The review score itself is not a predictor — only the five delivery timings are — so there is no leakage between target and features.
Following the functions / OOP lecture, the preparation step is one small function that validates its inputs and fails early if a required column is missing, rather than erroring deep inside the model call.
# Build the supervised dataset: binary satisfaction target + 5 timing features.
# Defensive checks move errors close to their source (OOP lecture).
prepare_rf_data <- function(df, satisfied_threshold = 4L) {
stopifnot(is.data.frame(df), is.numeric(satisfied_threshold))
required_cols <- c("approval_time_hrs", "carrier_pickup_days", "transit_days",
"total_delivery_days", "delivery_delay_days", "review_score")
missing_cols <- setdiff(required_cols, names(df))
if (length(missing_cols) > 0) {
stop("prepare_rf_data(): missing required column(s): ",
paste(missing_cols, collapse = ", "))
}
df %>%
mutate(
review_num = as.integer(as.character(review_score)),
satisfaction = factor(
ifelse(review_num >= satisfied_threshold, "satisfied", "dissatisfied"),
levels = c("dissatisfied", "satisfied")
)
) %>%
select(satisfaction, approval_time_hrs, carrier_pickup_days,
transit_days, total_delivery_days, delivery_delay_days) %>%
na.omit()
}
The split follows the lecture’s R template: a random 80% of rows train the model and the held-out 20% test it. A 12,000-row sample keeps the forest quick to knit, consistent with the sampling used for the Gower matrix above.
set.seed(42)
rf_data <- prepare_rf_data(working_data)
# Sample for a tractable knit (a forest over ~96k rows is slow to rebuild).
if (nrow(rf_data) > 12000) {
rf_data <- rf_data[sample(nrow(rf_data), 12000), ]
}
# 80/20 train-test split (Random Forest lecture, "R codes" slide).
selector <- sample(seq_len(nrow(rf_data)), 0.8 * nrow(rf_data), replace = FALSE)
train_data <- rf_data[selector, ]
test_data <- rf_data[-selector, ]
# Grow the forest. ntree = 500; importance = TRUE enables both importance
# measures (permutation accuracy and Gini decrease).
rf_model <- randomForest(
satisfaction ~ .,
data = train_data,
ntree = 500,
importance = TRUE
)
# Predict on the held-out test set and tabulate predictions against the truth.
pred_rf <- predict(rf_model, newdata = test_data)
confusion <- table(Actual = test_data$satisfaction, Predicted = pred_rf)
accuracy <- sum(diag(confusion)) / sum(confusion)
knitr::kable(
confusion,
caption = sprintf(
"Confusion matrix on the 20%% test set (overall accuracy %.1f%%)",
100 * accuracy
)
)
| dissatisfied | satisfied | |
|---|---|---|
| dissatisfied | 97 | 428 |
| satisfied | 40 | 1835 |
The fitted model, its confusion matrix, importance table, and
accuracy belong together, so they are bundled into a small S3 object
with its own print() method — the pattern from the OOP
lecture (an object is a list plus a class attribute; the generic
dispatches on that class).
# Variable importance — both measures from the Random Forest lecture:
# MeanDecreaseAccuracy : drop in OOB accuracy when the variable is permuted
# MeanDecreaseGini : total Gini decrease from splits on the variable
imp_mat <- importance(rf_model)
importance_tbl <- data.frame(
feature = rownames(imp_mat),
imp_mat,
row.names = NULL,
check.names = FALSE
) %>%
arrange(desc(MeanDecreaseAccuracy))
# Constructor with a validation check (defensive programming).
new_delivery_rf <- function(model, confusion, importance_tbl, accuracy) {
stopifnot(inherits(model, "randomForest"))
structure(
list(model = model, confusion = confusion,
importance = importance_tbl, accuracy = accuracy),
class = "delivery_rf"
)
}
# print() method for the new class (S3 generic dispatch).
print.delivery_rf <- function(x, ...) {
cat("Delivery-satisfaction random forest\n")
cat(sprintf(" Trees : %d\n", x$model$ntree))
cat(sprintf(" mtry : %d\n", x$model$mtry))
cat(sprintf(" Test accuracy : %.1f%%\n", 100 * x$accuracy))
cat(sprintf(" Top feature : %s\n", x$importance$feature[1]))
invisible(x)
}
rf_result <- new_delivery_rf(rf_model, confusion, importance_tbl, accuracy)
print(rf_result)
## Delivery-satisfaction random forest
## Trees : 500
## mtry : 2
## Test accuracy : 80.5%
## Top feature : delivery_delay_days
# Permutation importance, the same information as varImpPlot()'s accuracy panel.
imp_plot_data <- importance_tbl %>%
mutate(feature = factor(feature, levels = rev(feature)))
ggplot(imp_plot_data, aes(x = feature, y = MeanDecreaseAccuracy)) +
geom_col(fill = "#2c7bb6", width = 0.7) +
coord_flip() +
labs(
title = "Random Forest variable importance",
subtitle = "Mean decrease in OOB accuracy when each feature is permuted",
x = NULL,
y = "Mean decrease in accuracy"
) +
theme_minimal(base_size = 13)
# varImpPlot(rf_model) reproduces this with the package's built-in two-panel plot.
The forest reaches the test accuracy reported above using delivery timings alone, and delivery delay versus the estimate is the most important feature by a wide margin — the same signal the unsupervised model relied on, and the same 0-day tipping point seen in Chart 4. Approval time is the least important feature, matching the violin plots where it barely differed between clusters. The supervised and unsupervised views agree on what separates satisfied from dissatisfied orders, which is the cross-check this section set out to provide.
The tipping point is the estimated delivery date. Once delivery crosses the 0-day delay threshold, 1-star reviews jump from 8% to 65–70%. Customer dissatisfaction is driven primarily by whether the order met its promise, not by absolute delivery speed.
Geography amplifies the problem. Northeastern states (MA, AL, SE, BA) have dissatisfied shares of 28–30%, compared to 19% in São Paulo. These states also have delivery times 2–3 times longer, reflecting distance from the seller base concentrated in the Southeast.
The forecasting gap matters more than logistics speed. Most orders arrive 11 days early on average, suggesting Olist’s estimated delivery dates are set very conservatively. Tighter, more accurate estimates could reduce the expectation gap that drives dissatisfaction when the occasional order does arrive late.
Timing and satisfaction move together descriptively. The relatively late cluster also has the lowest mean review score. This is an association within delivered orders and should not be interpreted as proof that lateness caused the lower scores.
Claude generated the initial R script and analysis structure, including the data-joining, feature-engineering, clustering, visualisation, and reporting workflow. Before submission, the student checked or modified the following:
review_score is converted to
as.ordered() before Gower distance computation;randomForest(... ntree = 500, importance = TRUE)
template, and that its variable-importance ranking agrees with the
cluster profiles;