Overview

This report applies two methods from the Warsaw Summer School slides to the Olist dataset. All shared logic lives in functions.R — this file only calls those functions and interprets the results.

  • Part A — Clustering: group all 27 Brazilian states by their satisfaction and delivery profile (k-means + PAM, compared via Adjusted Rand Index).
  • Part B — Random Forest: predict high vs low review scores and rank which variables matter most.

Packages needed (run once in Console):

install.packages("randomForest")
install.packages("mclust")
# All helper functions live in functions.R in the same folder.
# source() runs that file and makes every function available here.
source("functions.R")
library(mclust)   # adjustedRandIndex() — not in functions.R because it is only
                  # used for the one comparison step in this report.

Data loading

d   <- load_olist_data()          # named list of six data frames
sat <- build_sat(d)               # one row per order with all derived columns

cat("Orders loaded:", nrow(sat), "\n")
## Orders loaded: 99441

Human verification. Row count must be 99,441. Any deviation means a join fanned out — investigate before continuing.


Part A — Clustering state profiles

A1 — Build and inspect the input features

state_profile <- build_state_profile(sat)
kable(state_profile, digits = 2,
      caption = "State-level profile: the four features used for clustering")
State-level profile: the four features used for clustering
customer_state avg_review avg_delay cross_state_rate avg_dist_km n_orders
AC 4.05 -20.08 1.00 2662.04 81
AL 3.76 -8.14 1.00 1833.50 410
AM 4.21 -18.88 1.00 2642.75 146
AP 4.19 -19.09 1.00 2599.05 67
BA 3.86 -10.20 0.98 1342.43 3340
CE 3.85 -10.16 0.99 2187.61 1326
DF 4.07 -11.33 0.98 837.65 2128
ES 4.04 -9.98 1.00 801.33 2006
GO 4.04 -11.54 0.98 776.54 2007
MA 3.76 -8.97 0.98 2103.47 742
MG 4.14 -12.56 0.86 532.58 11554
MS 4.11 -10.33 1.00 846.13 713
MT 4.10 -13.77 0.99 1353.60 900
PA 3.85 -13.51 1.00 2283.64 962
PB 4.02 -12.79 1.00 2085.62 530
PE 4.01 -12.67 0.99 1972.02 1635
PI 3.92 -10.62 1.00 1958.23 490
PR 4.18 -12.65 0.85 485.79 5019
RJ 3.88 -11.12 0.92 487.03 12687
RN 4.11 -13.00 0.96 2105.74 480
RO 4.05 -19.39 1.00 2225.80 252
RR 3.61 -16.59 1.00 3247.45 46
RS 4.13 -13.22 0.95 863.46 5443
SC 4.07 -10.87 0.93 572.83 3609
SE 3.81 -9.38 1.00 1647.57 349
SP 4.17 -10.40 0.24 249.37 41472
TO 4.10 -11.41 1.00 1488.66 279

A2 — Scale (mandatory before k-means)

# The clustering slide is explicit: normalise to stop high-range variables
# (km distances) dominating low-range ones (1–5 review scores).
state_matrix <- prep_cluster_matrix(state_profile)

apply(state_matrix, 2, function(x) c(mean = round(mean(x), 6), sd = round(sd(x), 3))) %>%
  kable(caption = "After scaling: every column should have mean ≈ 0 and SD = 1")
After scaling: every column should have mean ≈ 0 and SD = 1
avg_review avg_delay cross_state_rate avg_dist_km
mean 0 0 0 0
sd 1 1 1 1

Human verification. If any SD is not 1 the scaling failed and the clustering will be meaningless. Check before proceeding.

A3 — Elbow plot: choosing k

wss_df <- compute_wss(state_matrix, k_max = 8)
plot_elbow(wss_df)

Human verification. Look for the k where the line visibly flattens. With 27 states k = 3 is almost always defensible; k = 4 is reasonable if the elbow lands there. Change the k argument in the next two chunks if you want to experiment.

A4 — K-means

km_fit <- run_kmeans(state_matrix, k = 3)
state_profile <- state_profile %>% mutate(km_cluster = factor(km_fit$cluster))

cat("K-means cluster sizes:\n")
## K-means cluster sizes:
print(table(km_fit$cluster))
## 
##  1  2  3 
##  5  1 21
as.data.frame(km_fit$centers) %>%
  mutate(cluster = row_number()) %>%
  select(cluster, everything()) %>%
  kable(digits = 3,
        caption = "K-means centroids (scaled units — positive = above national average)")
K-means centroids (scaled units — positive = above national average)
cluster avg_review avg_delay cross_state_rate avg_dist_km
1 0.121 -1.839 0.354 1.345
2 1.098 0.689 -4.814 -1.587
3 -0.081 0.405 0.145 -0.245

A5 — PAM

pam_fit <- run_pam(state_matrix, k = 3)
state_profile <- state_profile %>% mutate(pam_cluster = factor(pam_fit$clustering))

cat("PAM cluster sizes:\n")
## PAM cluster sizes:
print(table(pam_fit$clustering))
## 
##  1  2  3 
##  4 10 13
cat("\nPAM medoids (the representative states for each cluster):\n")
## 
## PAM medoids (the representative states for each cluster):
print(pam_fit$medoids)
##    avg_review  avg_delay cross_state_rate avg_dist_km
## AP  1.2256581 -1.9255146        0.3536833   1.2522497
## CE -0.9523535  0.7612475        0.3071460   0.7551114
## DF  0.4106809  0.4100540        0.1959112  -0.8760175

Human verification. Medoids are actual Brazilian states. Check they are geographically and economically plausible as cluster representatives.

A6 — Compare: Adjusted Rand Index

ari <- adjustedRandIndex(km_fit$cluster, pam_fit$clustering)
cat("Adjusted Rand Index (k-means vs PAM):", round(ari, 4), "\n")
## Adjusted Rand Index (k-means vs PAM): 0.2899
cat("Close to 1 = strong agreement. Close to 0 = methods disagree.\n")
## Close to 1 = strong agreement. Close to 0 = methods disagree.

A7 — Visualise

n_orders_named <- setNames(state_profile$n_orders, state_profile$customer_state)

km_clusters  <- setNames(km_fit$cluster,        rownames(state_matrix))
pca_km       <- build_pca_df(state_matrix, km_clusters)

plot_clusters(pca_km,
              title    = "K-means clusters of Brazilian states (PCA projection)",
              n_orders = n_orders_named)

pam_clusters <- setNames(pam_fit$clustering, rownames(state_matrix))
pca_pam      <- build_pca_df(state_matrix, pam_clusters)

plot_clusters(pca_pam,
              title    = "PAM clusters of Brazilian states (PCA projection)",
              n_orders = n_orders_named)

A8 — Profile the clusters

state_profile %>%
  group_by(km_cluster) %>%
  summarise(
    n_states        = n(),
    states          = paste(sort(customer_state), collapse = ", "),
    avg_review      = round(mean(avg_review),            3),
    avg_delay_days  = round(mean(avg_delay),             2),
    cross_state_pct = round(mean(cross_state_rate) * 100, 1),
    avg_dist_km     = round(mean(avg_dist_km),           0),
    total_orders    = sum(n_orders),
    .groups = "drop"
  ) %>%
  kable(caption = "K-means cluster profiles (original unscaled values)")
K-means cluster profiles (original unscaled values)
km_cluster n_states states avg_review avg_delay_days cross_state_pct avg_dist_km total_orders
1 5 AC, AM, AP, RO, RR 4.022 -18.81 100.0 2675 592
2 1 SP 4.174 -10.40 24.1 249 41472
3 21 AL, BA, CE, DF, ES, GO, MA, MG, MS, MT, PA, PB, PE, PI, PR, RJ, RN, RS, SC, SE, TO 3.990 -11.34 96.9 1360 56609
state_profile %>%
  ggplot(aes(x = reorder(customer_state, avg_review),
             y = avg_review, fill = km_cluster)) +
  geom_col() +
  geom_hline(yintercept = mean(state_profile$avg_review),
             linetype = "dashed", colour = "grey30") +
  scale_fill_manual(
    values = c("1" = "#D7301F", "2" = "#2C7FB8", "3" = "#41AB5D"),
    name   = "Cluster"
  ) +
  coord_flip() +
  labs(title    = "Average review score by state, coloured by cluster",
       subtitle = "Dashed line = mean across all states.",
       x = "Customer state", y = "Average review score (1–5)") +
  theme_minimal()


Part B — Random Forest: predicting satisfaction

B1 — Prepare features

rf_data <- prep_rf_data(sat, threshold = 4)

cat("RF rows:", nrow(rf_data), "\n")
## RF rows: 95352
cat("Class balance:\n")
## Class balance:
print(prop.table(table(rf_data$high_review)) %>% round(3))
## 
##  High   Low 
## 0.789 0.211

Human verification. ~70% High and ~30% Low is mild imbalance — acceptable for this model. If either class were below 5%, a balancing step would be needed.

B2 — Train / test split

splits     <- split_data(rf_data, train_prop = 0.8)
train_data <- splits$train
test_data  <- splits$test

cat("Train:", nrow(train_data), "| Test:", nrow(test_data), "\n")
## Train: 76281 | Test: 19071

B3 — Fit the model

Takes approximately 1–2 minutes. Progress appears in the Console.

rf_model <- fit_rf(train_data, ntree = 200)
print(rf_model)
## 
## Call:
##  randomForest(formula = reformulate(".", response = names(train)[1]),      data = train, ntree = ntree, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.09%
## Confusion matrix:
##       High  Low class.error
## High 59126 1089  0.01808519
## Low  12710 3356  0.79111166

Human verification. OOB error 18–25% is typical for this problem — review scores are noisy. OOB near 5% would suggest data leakage; near 50% means no signal.

B4 — Confusion matrix

pred       <- predict(rf_model, newdata = test_data)
conf_mat   <- table(Actual = test_data$high_review, Predicted = pred)
print(conf_mat)
##       Predicted
## Actual  High   Low
##   High 14794   232
##   Low   3185   860
accuracy   <- sum(diag(conf_mat)) / sum(conf_mat)
cat("\nOverall accuracy:", round(accuracy * 100, 1), "%\n")
## 
## Overall accuracy: 82.1 %

Human verification. Low reviews will have a higher error rate than High reviews (class imbalance effect). The important number commercially is how many Low reviews are predicted as High — those are unhappy customers the model would miss.

B5 — Variable importance

imp_df <- get_importance(rf_model)
kable(imp_df %>% mutate(across(where(is.numeric), ~ round(., 1))),
      caption = "Variable importance (sorted by Mean Decrease Accuracy)")
Variable importance (sorted by Mean Decrease Accuracy)
variable High Low MeanDecreaseAccuracy MeanDecreaseGini
delay_days 150.4 228.1 240.5 8677.2
customer_state 37.2 -19.1 37.3 730.5
dist_km 29.6 -18.6 30.5 5382.5
cross_state 16.3 -6.4 18.4 90.7
plot_importance(imp_df)


Actionable outcomes

  1. Reduce delivery delay first. delay_days is 5–6× more important than any geographic variable. Meeting the promised delivery date is the single most impactful lever — not route distance, not whether the order crosses a state border.

  2. Use clusters to triage logistics investment. The cluster with the lowest average review and the worst average delay is the priority target. The PAM medoid state is the concrete benchmark: “bring this cluster up to the standard we already achieve in [medoid state].”

  3. States matter beyond just distance. customer_state ranks above dist_km in variable importance. That means some states have systematically different satisfaction outcomes even after accounting for how far the package travels — possibly due to regional carrier quality, product category mix, or expectations. Those states deserve individual investigation, not a blanket distance-based fix.


Limitations and AI audit note

Limitations

  1. State-level clustering averages out individual variation. Two states can have the same cluster membership but very different distributions within that average. Treat clusters as a triage tool, not a precise boundary.

  2. The Random Forest is a black box. It confirms that delay_days matters but not the exact functional form or the causal mechanism. Nothing here establishes that faster shipping causes better reviews — correlation only.

AI audit note

All shared logic (data loading, joins, distance, clustering, RF wrappers) lives in functions.R. This report only calls those functions. I verified:

  • Scaling produced mean ≈ 0 and SD = 1 for all four features.
  • PAM medoids are plausible Brazilian states.
  • The ARI was computed after, not before, inspecting cluster assignments (no reverse engineering).
  • Train and test sets do not overlap (replace = FALSE in split_data()).
  • OOB error rate is in the expected range for noisy survey data.
  • All association language — no causal claims.
sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.5.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] randomForest_4.7-1.2 mclust_6.1.2         scales_1.4.0        
##  [4] knitr_1.51           cluster_2.1.8.2      lubridate_1.9.5     
##  [7] forcats_1.0.1        stringr_1.6.0        dplyr_1.2.1         
## [10] purrr_1.2.2          readr_2.2.0          tidyr_1.3.2         
## [13] tibble_3.3.1         ggplot2_4.0.3        tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.4         
##  [5] digest_0.6.39      magrittr_2.0.5     evaluate_1.0.5     grid_4.6.0        
##  [9] timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] jquerylib_0.1.4    cli_3.6.6          rlang_1.2.0        crayon_1.5.3      
## [17] bit64_4.8.2        withr_3.0.2        cachem_1.1.0       yaml_2.3.12       
## [21] otel_0.2.0         tools_4.6.0        parallel_4.6.0     tzdb_0.5.0        
## [25] vctrs_0.7.3        R6_2.6.1           lifecycle_1.0.5    bit_4.6.0         
## [29] vroom_1.7.1        pkgconfig_2.0.3    pillar_1.11.1      bslib_0.11.0      
## [33] gtable_0.3.6       glue_1.8.1         xfun_0.58          tidyselect_1.2.1  
## [37] rstudioapi_0.18.0  farver_2.1.2       htmltools_0.5.9    rmarkdown_2.31    
## [41] labeling_0.4.3     compiler_4.6.0     S7_0.2.2