This report applies two methods from the Warsaw Summer School slides
to the Olist dataset. All shared logic lives in functions.R
— this file only calls those functions and interprets the results.
Packages needed (run once in Console):
install.packages("randomForest")
install.packages("mclust")
# All helper functions live in functions.R in the same folder.
# source() runs that file and makes every function available here.
source("functions.R")
library(mclust) # adjustedRandIndex() — not in functions.R because it is only
# used for the one comparison step in this report.
d <- load_olist_data() # named list of six data frames
sat <- build_sat(d) # one row per order with all derived columns
cat("Orders loaded:", nrow(sat), "\n")
## Orders loaded: 99441
Human verification. Row count must be 99,441. Any deviation means a join fanned out — investigate before continuing.
state_profile <- build_state_profile(sat)
kable(state_profile, digits = 2,
caption = "State-level profile: the four features used for clustering")
| customer_state | avg_review | avg_delay | cross_state_rate | avg_dist_km | n_orders |
|---|---|---|---|---|---|
| AC | 4.05 | -20.08 | 1.00 | 2662.04 | 81 |
| AL | 3.76 | -8.14 | 1.00 | 1833.50 | 410 |
| AM | 4.21 | -18.88 | 1.00 | 2642.75 | 146 |
| AP | 4.19 | -19.09 | 1.00 | 2599.05 | 67 |
| BA | 3.86 | -10.20 | 0.98 | 1342.43 | 3340 |
| CE | 3.85 | -10.16 | 0.99 | 2187.61 | 1326 |
| DF | 4.07 | -11.33 | 0.98 | 837.65 | 2128 |
| ES | 4.04 | -9.98 | 1.00 | 801.33 | 2006 |
| GO | 4.04 | -11.54 | 0.98 | 776.54 | 2007 |
| MA | 3.76 | -8.97 | 0.98 | 2103.47 | 742 |
| MG | 4.14 | -12.56 | 0.86 | 532.58 | 11554 |
| MS | 4.11 | -10.33 | 1.00 | 846.13 | 713 |
| MT | 4.10 | -13.77 | 0.99 | 1353.60 | 900 |
| PA | 3.85 | -13.51 | 1.00 | 2283.64 | 962 |
| PB | 4.02 | -12.79 | 1.00 | 2085.62 | 530 |
| PE | 4.01 | -12.67 | 0.99 | 1972.02 | 1635 |
| PI | 3.92 | -10.62 | 1.00 | 1958.23 | 490 |
| PR | 4.18 | -12.65 | 0.85 | 485.79 | 5019 |
| RJ | 3.88 | -11.12 | 0.92 | 487.03 | 12687 |
| RN | 4.11 | -13.00 | 0.96 | 2105.74 | 480 |
| RO | 4.05 | -19.39 | 1.00 | 2225.80 | 252 |
| RR | 3.61 | -16.59 | 1.00 | 3247.45 | 46 |
| RS | 4.13 | -13.22 | 0.95 | 863.46 | 5443 |
| SC | 4.07 | -10.87 | 0.93 | 572.83 | 3609 |
| SE | 3.81 | -9.38 | 1.00 | 1647.57 | 349 |
| SP | 4.17 | -10.40 | 0.24 | 249.37 | 41472 |
| TO | 4.10 | -11.41 | 1.00 | 1488.66 | 279 |
# The clustering slide is explicit: normalise to stop high-range variables
# (km distances) dominating low-range ones (1–5 review scores).
state_matrix <- prep_cluster_matrix(state_profile)
apply(state_matrix, 2, function(x) c(mean = round(mean(x), 6), sd = round(sd(x), 3))) %>%
kable(caption = "After scaling: every column should have mean ≈ 0 and SD = 1")
| avg_review | avg_delay | cross_state_rate | avg_dist_km | |
|---|---|---|---|---|
| mean | 0 | 0 | 0 | 0 |
| sd | 1 | 1 | 1 | 1 |
Human verification. If any SD is not 1 the scaling failed and the clustering will be meaningless. Check before proceeding.
wss_df <- compute_wss(state_matrix, k_max = 8)
plot_elbow(wss_df)
Human verification. Look for the k where the line visibly flattens. With 27 states k = 3 is almost always defensible; k = 4 is reasonable if the elbow lands there. Change the
kargument in the next two chunks if you want to experiment.
km_fit <- run_kmeans(state_matrix, k = 3)
state_profile <- state_profile %>% mutate(km_cluster = factor(km_fit$cluster))
cat("K-means cluster sizes:\n")
## K-means cluster sizes:
print(table(km_fit$cluster))
##
## 1 2 3
## 5 1 21
as.data.frame(km_fit$centers) %>%
mutate(cluster = row_number()) %>%
select(cluster, everything()) %>%
kable(digits = 3,
caption = "K-means centroids (scaled units — positive = above national average)")
| cluster | avg_review | avg_delay | cross_state_rate | avg_dist_km |
|---|---|---|---|---|
| 1 | 0.121 | -1.839 | 0.354 | 1.345 |
| 2 | 1.098 | 0.689 | -4.814 | -1.587 |
| 3 | -0.081 | 0.405 | 0.145 | -0.245 |
pam_fit <- run_pam(state_matrix, k = 3)
state_profile <- state_profile %>% mutate(pam_cluster = factor(pam_fit$clustering))
cat("PAM cluster sizes:\n")
## PAM cluster sizes:
print(table(pam_fit$clustering))
##
## 1 2 3
## 4 10 13
cat("\nPAM medoids (the representative states for each cluster):\n")
##
## PAM medoids (the representative states for each cluster):
print(pam_fit$medoids)
## avg_review avg_delay cross_state_rate avg_dist_km
## AP 1.2256581 -1.9255146 0.3536833 1.2522497
## CE -0.9523535 0.7612475 0.3071460 0.7551114
## DF 0.4106809 0.4100540 0.1959112 -0.8760175
Human verification. Medoids are actual Brazilian states. Check they are geographically and economically plausible as cluster representatives.
ari <- adjustedRandIndex(km_fit$cluster, pam_fit$clustering)
cat("Adjusted Rand Index (k-means vs PAM):", round(ari, 4), "\n")
## Adjusted Rand Index (k-means vs PAM): 0.2899
cat("Close to 1 = strong agreement. Close to 0 = methods disagree.\n")
## Close to 1 = strong agreement. Close to 0 = methods disagree.
n_orders_named <- setNames(state_profile$n_orders, state_profile$customer_state)
km_clusters <- setNames(km_fit$cluster, rownames(state_matrix))
pca_km <- build_pca_df(state_matrix, km_clusters)
plot_clusters(pca_km,
title = "K-means clusters of Brazilian states (PCA projection)",
n_orders = n_orders_named)
pam_clusters <- setNames(pam_fit$clustering, rownames(state_matrix))
pca_pam <- build_pca_df(state_matrix, pam_clusters)
plot_clusters(pca_pam,
title = "PAM clusters of Brazilian states (PCA projection)",
n_orders = n_orders_named)
state_profile %>%
group_by(km_cluster) %>%
summarise(
n_states = n(),
states = paste(sort(customer_state), collapse = ", "),
avg_review = round(mean(avg_review), 3),
avg_delay_days = round(mean(avg_delay), 2),
cross_state_pct = round(mean(cross_state_rate) * 100, 1),
avg_dist_km = round(mean(avg_dist_km), 0),
total_orders = sum(n_orders),
.groups = "drop"
) %>%
kable(caption = "K-means cluster profiles (original unscaled values)")
| km_cluster | n_states | states | avg_review | avg_delay_days | cross_state_pct | avg_dist_km | total_orders |
|---|---|---|---|---|---|---|---|
| 1 | 5 | AC, AM, AP, RO, RR | 4.022 | -18.81 | 100.0 | 2675 | 592 |
| 2 | 1 | SP | 4.174 | -10.40 | 24.1 | 249 | 41472 |
| 3 | 21 | AL, BA, CE, DF, ES, GO, MA, MG, MS, MT, PA, PB, PE, PI, PR, RJ, RN, RS, SC, SE, TO | 3.990 | -11.34 | 96.9 | 1360 | 56609 |
state_profile %>%
ggplot(aes(x = reorder(customer_state, avg_review),
y = avg_review, fill = km_cluster)) +
geom_col() +
geom_hline(yintercept = mean(state_profile$avg_review),
linetype = "dashed", colour = "grey30") +
scale_fill_manual(
values = c("1" = "#D7301F", "2" = "#2C7FB8", "3" = "#41AB5D"),
name = "Cluster"
) +
coord_flip() +
labs(title = "Average review score by state, coloured by cluster",
subtitle = "Dashed line = mean across all states.",
x = "Customer state", y = "Average review score (1–5)") +
theme_minimal()
rf_data <- prep_rf_data(sat, threshold = 4)
cat("RF rows:", nrow(rf_data), "\n")
## RF rows: 95352
cat("Class balance:\n")
## Class balance:
print(prop.table(table(rf_data$high_review)) %>% round(3))
##
## High Low
## 0.789 0.211
Human verification. ~70% High and ~30% Low is mild imbalance — acceptable for this model. If either class were below 5%, a balancing step would be needed.
splits <- split_data(rf_data, train_prop = 0.8)
train_data <- splits$train
test_data <- splits$test
cat("Train:", nrow(train_data), "| Test:", nrow(test_data), "\n")
## Train: 76281 | Test: 19071
Takes approximately 1–2 minutes. Progress appears in the Console.
rf_model <- fit_rf(train_data, ntree = 200)
print(rf_model)
##
## Call:
## randomForest(formula = reformulate(".", response = names(train)[1]), data = train, ntree = ntree, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.09%
## Confusion matrix:
## High Low class.error
## High 59126 1089 0.01808519
## Low 12710 3356 0.79111166
Human verification. OOB error 18–25% is typical for this problem — review scores are noisy. OOB near 5% would suggest data leakage; near 50% means no signal.
pred <- predict(rf_model, newdata = test_data)
conf_mat <- table(Actual = test_data$high_review, Predicted = pred)
print(conf_mat)
## Predicted
## Actual High Low
## High 14794 232
## Low 3185 860
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
cat("\nOverall accuracy:", round(accuracy * 100, 1), "%\n")
##
## Overall accuracy: 82.1 %
Human verification. Low reviews will have a higher error rate than High reviews (class imbalance effect). The important number commercially is how many Low reviews are predicted as High — those are unhappy customers the model would miss.
imp_df <- get_importance(rf_model)
kable(imp_df %>% mutate(across(where(is.numeric), ~ round(., 1))),
caption = "Variable importance (sorted by Mean Decrease Accuracy)")
| variable | High | Low | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|---|---|
| delay_days | 150.4 | 228.1 | 240.5 | 8677.2 |
| customer_state | 37.2 | -19.1 | 37.3 | 730.5 |
| dist_km | 29.6 | -18.6 | 30.5 | 5382.5 |
| cross_state | 16.3 | -6.4 | 18.4 | 90.7 |
plot_importance(imp_df)
Reduce delivery delay first.
delay_days is 5–6× more important than any geographic
variable. Meeting the promised delivery date is the single most
impactful lever — not route distance, not whether the order crosses a
state border.
Use clusters to triage logistics investment. The cluster with the lowest average review and the worst average delay is the priority target. The PAM medoid state is the concrete benchmark: “bring this cluster up to the standard we already achieve in [medoid state].”
States matter beyond just distance.
customer_state ranks above dist_km in variable
importance. That means some states have systematically different
satisfaction outcomes even after accounting for how far the package
travels — possibly due to regional carrier quality, product category
mix, or expectations. Those states deserve individual investigation, not
a blanket distance-based fix.
State-level clustering averages out individual variation. Two states can have the same cluster membership but very different distributions within that average. Treat clusters as a triage tool, not a precise boundary.
The Random Forest is a black box. It confirms
that delay_days matters but not the exact functional form
or the causal mechanism. Nothing here establishes that faster shipping
causes better reviews — correlation only.
All shared logic (data loading, joins, distance, clustering, RF
wrappers) lives in functions.R. This report only calls
those functions. I verified:
replace = FALSE in
split_data()).sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.5.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] randomForest_4.7-1.2 mclust_6.1.2 scales_1.4.0
## [4] knitr_1.51 cluster_2.1.8.2 lubridate_1.9.5
## [7] forcats_1.0.1 stringr_1.6.0 dplyr_1.2.1
## [10] purrr_1.2.2 readr_2.2.0 tidyr_1.3.2
## [13] tibble_3.3.1 ggplot2_4.0.3 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 stringi_1.8.7 hms_1.1.4
## [5] digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5 grid_4.6.0
## [9] timechange_0.4.0 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0
## [13] jquerylib_0.1.4 cli_3.6.6 rlang_1.2.0 crayon_1.5.3
## [17] bit64_4.8.2 withr_3.0.2 cachem_1.1.0 yaml_2.3.12
## [21] otel_0.2.0 tools_4.6.0 parallel_4.6.0 tzdb_0.5.0
## [25] vctrs_0.7.3 R6_2.6.1 lifecycle_1.0.5 bit_4.6.0
## [29] vroom_1.7.1 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.11.0
## [33] gtable_0.3.6 glue_1.8.1 xfun_0.58 tidyselect_1.2.1
## [37] rstudioapi_0.18.0 farver_2.1.2 htmltools_0.5.9 rmarkdown_2.31
## [41] labeling_0.4.3 compiler_4.6.0 S7_0.2.2