This report is the group submission for the Codex + ggplot2 for Responsible Analytics summer school. The overall topic is customer satisfaction on Olist, Brazil’s largest e-commerce platform.
The dataset covers ~99,000 orders placed between 2016 and 2018, with matched delivery records, payment information, product data, and customer review scores (1–5 stars).
Each group member independently analysed a different dimension of satisfaction:
| Section | Sub-topic | Analyst |
|---|---|---|
| A | Geography — where customers and sellers are | Malakai Shaffer |
| B | Products & categories — what is being bought | Edvinas Kučys |
| C | Payment behaviour — how customers pay | [Group Member] |
| D | Delivery timing — statistical correlation | Michal |
All code was AI-assisted (Codex / Claude) with human verification at every step. Findings are framed as associations, not causal claims.
Analyst: Malakai Shaffer
Question: Does where a customer lives, where the seller is based, or the distance between them appear to be associated with the review score they leave?
# build_sat() joins orders + reviews + customers + sellers and computes
# delay_days, cross_state flag, and straight-line distance (Haversine).
sat <- build_sat(d)
cat("Rows in sat (one per order):", nrow(sat), "\n")
## Rows in sat (one per order): 99441
state_reviews <- sat %>%
filter(!is.na(customer_state), !is.na(review_score)) %>%
group_by(customer_state) %>%
summarise(avg_review = mean(review_score), n_orders = n(), .groups = "drop")
ggplot(state_reviews,
aes(x = reorder(customer_state, avg_review), y = avg_review)) +
geom_col(fill = "#2C7FB8") +
geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
linetype = "dashed", colour = "grey30") +
coord_flip() +
labs(
title = "Average review score by customer state",
subtitle = "Dashed line = national average (4.09). All 27 states.",
x = NULL, y = "Average review score (1–5)"
) +
theme_minimal()
dist_bins <- sat %>%
filter(!is.na(dist_km), !is.na(review_score)) %>%
mutate(distance_band = cut(
dist_km,
breaks = c(0, 100, 500, 1500, Inf),
labels = c("0–100 km", "100–500 km", "500–1,500 km", "> 1,500 km"),
right = TRUE, include.lowest = TRUE
)) %>%
group_by(distance_band) %>%
summarise(
n_orders = n(),
avg_review = mean(review_score),
se_review = sd(review_score) / sqrt(n()),
.groups = "drop"
)
ggplot(dist_bins, aes(x = distance_band, y = avg_review)) +
geom_col(fill = "#6A51A3", width = 0.6) +
geom_errorbar(aes(ymin = avg_review - se_review,
ymax = avg_review + se_review),
width = 0.2, colour = "grey30") +
geom_text(aes(label = comma(n_orders), y = 3.55),
colour = "white", fontface = "bold", size = 3.8) +
geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
linetype = "dashed", colour = "grey30") +
labs(
title = "Average review score by customer–seller distance",
subtitle = "Error bars = ±1 SE. White = order count per band. Dashed = national average.",
x = "Straight-line distance (km)", y = "Average review score (1–5)"
) +
theme_minimal()
rf_data_geo <- prep_rf_data(sat, threshold = 4)
splits_geo <- split_data(rf_data_geo)
rf_geo <- fit_rf(splits_geo$train, ntree = 200)
imp_geo <- get_importance(rf_geo)
plot_importance(imp_geo) +
labs(title = "RF variable importance — geographic features",
subtitle = "delay_days is 5–6× more important than any geographic variable.")
state_centroids <- geolocation %>%
filter(geolocation_lat > -34, geolocation_lat < 6,
geolocation_lng > -74, geolocation_lng < -34) %>%
group_by(geolocation_state) %>%
summarise(lat = mean(geolocation_lat), lng = mean(geolocation_lng), .groups = "drop")
map_data <- state_centroids %>%
inner_join(state_reviews, by = c("geolocation_state" = "customer_state")) %>%
mutate(popup_text = paste0(
"<b>", geolocation_state, "</b><br>",
"Avg review: <b>", round(avg_review, 3), "</b><br>",
"Orders: ", comma(n_orders)
))
pal <- colorNumeric("RdYlGn", domain = map_data$avg_review)
leaflet(map_data) %>%
addTiles() %>%
setView(lng = -52, lat = -14, zoom = 4) %>%
addCircleMarkers(
lng = ~lng, lat = ~lat,
radius = ~sqrt(n_orders / 200),
fillColor = ~pal(avg_review), color = ~pal(avg_review),
fillOpacity = 0.85, stroke = FALSE,
popup = ~popup_text, label = ~geolocation_state
) %>%
addLegend("bottomright", pal = pal, values = ~avg_review,
title = "Avg review", opacity = 1)
Section A findings:
delay_days is 5–6× more
predictive than any geographic variable — geography matters, but
primarily as a proxy for delivery timing.Analyst: Edvinas Kučys
Question: Which product categories are associated with low satisfaction, and do products cluster into meaningful groups by price, delivery time, and rating?
Note: This section analyses orders where customers left a numeric score but no written comment — the “silent rating” signal. Orders with written reviews are excluded to focus on typical, unprompted satisfaction.
library(NbClust)
# Identify orders that have ANY written review text
edv_written_ids <- unique(reviews$order_id[
(!is.na(reviews$review_comment_title) & trimws(reviews$review_comment_title) != "") |
(!is.na(reviews$review_comment_message) & trimws(reviews$review_comment_message) != "")
])
# Restrict to orders WITHOUT written text
edv_reviews_clean <- reviews[
!reviews$order_id %in% edv_written_ids, ]
# Review scores for those clean orders
edv_product_reviews <- merge(
order_items[, c("order_id", "product_id")],
edv_reviews_clean[, c("order_id", "review_score")],
by = "order_id"
)
edv_category_reviews <- merge(
edv_product_reviews,
products[, c("product_id", "product_category_name")],
by = "product_id", all.x = TRUE
)
edv_category_reviews <- edv_category_reviews[
!is.na(edv_category_reviews$product_category_name) &
trimws(edv_category_reviews$product_category_name) != "", ]
edv_cat_summary <- aggregate(
review_score ~ product_category_name, data = edv_category_reviews,
FUN = function(x) c(avg = mean(x), n = length(x))
)
edv_cat_summary <- data.frame(
product_category_name = edv_cat_summary$product_category_name,
average_score = edv_cat_summary$review_score[, "avg"],
review_count = edv_cat_summary$review_score[, "n"]
)
edv_cat_summary <- edv_cat_summary[
edv_cat_summary$review_count >= 10 & edv_cat_summary$average_score <= 4.5, ]
edv_cat_summary <- edv_cat_summary[
order(edv_cat_summary$average_score, -edv_cat_summary$review_count), ]
edv_worst_cats <- head(edv_cat_summary, 15)
edv_worst_cats$label <- paste0(
edv_worst_cats$product_category_name, " (n = ", edv_worst_cats$review_count, ")"
)
ggplot(edv_worst_cats,
aes(x = reorder(label, average_score), y = average_score)) +
geom_col(fill = "darkorange") +
scale_y_continuous(limits = c(0, 5), breaks = 1:5) +
coord_flip() +
labs(
title = "Worst-rated product categories (silent raters only)",
subtitle = "Minimum 10 reviews; average score ≤ 4.5.",
x = NULL, y = "Average review score (1–5)"
) +
theme_minimal()
# Prepare product-level features: avg review, price, freight, delivery time
edv_orders_delivery <- orders %>%
mutate(
delivery_days = as.numeric(
as.POSIXct(order_delivered_customer_date) -
as.POSIXct(order_purchase_timestamp), units = "days"
)
) %>%
select(order_id, delivery_days)
edv_purchase_data <- edv_product_reviews %>%
merge(order_items[, c("order_id","product_id","price","freight_value")],
by = c("order_id","product_id")) %>%
merge(products[, c("product_id","product_category_name")],
by = "product_id", all.x = TRUE) %>%
merge(edv_orders_delivery, by = "order_id", all.x = TRUE)
edv_product_summary <- aggregate(
cbind(review_score, price, freight_value, delivery_days) ~ product_id,
data = edv_purchase_data, FUN = mean, na.rm = TRUE
)
edv_product_summary$review_count <- as.numeric(
table(edv_purchase_data$product_id)[edv_product_summary$product_id]
)
edv_product_summary <- edv_product_summary[edv_product_summary$review_count >= 10, ]
edv_feature_cols <- c("review_score","review_count","price","freight_value","delivery_days")
edv_complete <- complete.cases(edv_product_summary[, edv_feature_cols])
edv_features_sc <- scale(edv_product_summary[edv_complete, edv_feature_cols])
set.seed(42)
# k = 7 was selected using CCC and Pseudo-F statistics (run offline via NbClust).
edv_km <- kmeans(edv_features_sc, centers = 7, nstart = 25)
edv_pca <- prcomp(edv_features_sc)
edv_var <- summary(edv_pca)$importance[2, 1:2]
edv_plot_df <- as.data.frame(edv_pca$x[, 1:2]) %>%
mutate(cluster = factor(edv_km$cluster))
ggplot(edv_plot_df, aes(x = PC1, y = PC2, colour = cluster)) +
geom_point(size = 2, alpha = 0.7) +
scale_colour_brewer(palette = "Set1", name = "Cluster") +
labs(
title = paste0("K-means product clustering (k = 7)"),
subtitle = paste0("PCA projection — PC1 + PC2 explain ",
round(sum(edv_var) * 100, 1), "% of variance."),
x = paste0("PC1 (", round(edv_var[1] * 100, 1), "%)"),
y = paste0("PC2 (", round(edv_var[2] * 100, 1), "%)")
) +
theme_minimal()
edv_summary_df <- data.frame(
cluster = factor(edv_km$cluster),
edv_product_summary[edv_complete, edv_feature_cols]
)
edv_cluster_profile <- aggregate(. ~ cluster, data = edv_summary_df, FUN = mean)
edv_cluster_profile$product_count <- as.numeric(table(edv_km$cluster))
edv_cluster_profile[, -1] <- round(edv_cluster_profile[, -1], 2)
kable(edv_cluster_profile[, c("cluster","product_count","review_score","price",
"freight_value","delivery_days")],
caption = "Product cluster profiles — mean values per cluster")
| cluster | product_count | review_score | price | freight_value | delivery_days |
|---|---|---|---|---|---|
| 1 | 603 | 4.58 | 55.43 | 14.07 | 8.46 |
| 2 | 28 | 4.36 | 673.72 | 31.97 | 14.09 |
| 3 | 92 | 4.42 | 153.14 | 40.56 | 14.24 |
| 4 | 549 | 4.42 | 96.76 | 18.20 | 14.06 |
| 5 | 23 | 2.84 | 108.75 | 24.25 | 31.81 |
| 6 | 77 | 2.81 | 54.74 | 15.33 | 12.40 |
| 7 | 23 | 4.20 | 91.05 | 16.91 | 13.01 |
Section B findings:
Analyst: [Group Member]
Question: Does how a customer pays — the payment method, total value, or number of instalments — appear to be associated with their satisfaction score?
This section also uses silent-rater orders (numeric score, no written comment) to isolate payment behaviour from text-expressed grievances.
# Build the payment feature dataset
pay_features <- payments %>%
group_by(order_id) %>%
summarise(
payment_value = sum(payment_value, na.rm = TRUE),
max_installments = max(payment_installments, na.rm = TRUE),
n_payment_types = n_distinct(payment_type),
main_payment_type = names(sort(table(payment_type), decreasing = TRUE))[1],
.groups = "drop"
)
# Silent-rater reviews joined to payment features
pay_review_ids <- reviews %>%
filter(!is.na(review_score),
is.na(review_comment_message),
is.na(review_comment_title)) %>%
distinct(order_id, .keep_all = TRUE)
pay_model_data <- pay_review_ids %>%
left_join(orders, by = "order_id") %>%
left_join(customers, by = "customer_id") %>%
left_join(pay_features, by = "order_id") %>%
mutate(
purchase_date = as.POSIXct(order_purchase_timestamp),
delivered_date = as.POSIXct(order_delivered_customer_date),
estimated_date = as.POSIXct(order_estimated_delivery_date),
delivery_days = as.numeric(difftime(delivered_date, purchase_date, units = "days")),
delay_days = as.numeric(difftime(delivered_date, estimated_date, units = "days")),
satisfaction_group = case_when(
review_score >= 4 ~ "happy",
review_score <= 3 ~ "mad",
TRUE ~ "neutral"
)
)
pay_no_comment <- pay_model_data %>%
filter(!is.na(payment_value), !is.na(max_installments),
!is.na(main_payment_type), payment_value > 0)
pay_no_comment <- pay_no_comment %>% mutate(score_f = factor(review_score))
ggplot(pay_no_comment, aes(x = score_f, y = payment_value)) +
geom_boxplot(fill = "lightblue", outlier.alpha = 0.1) +
scale_y_log10(labels = comma) +
labs(
title = "Payment value by review score (silent raters)",
subtitle = "Log scale. Median payment value is broadly similar across all scores.",
x = "Review Score", y = "Payment value (R$, log scale)"
) +
theme_minimal()
pay_no_comment %>%
group_by(review_score) %>%
summarise(avg_inst = mean(max_installments, na.rm = TRUE), .groups = "drop") %>%
mutate(score_f = factor(review_score)) %>%
ggplot(aes(x = score_f, y = avg_inst, fill = score_f)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c(
"1" = "#d73027", "2" = "#f46d43", "3" = "#fee08b",
"4" = "#a6d96a", "5" = "#1a9850"
)) +
labs(
title = "Average number of instalments by review score",
subtitle = "Lower-rated orders are associated with slightly more payment instalments.",
x = "Review Score", y = "Average number of instalments"
) +
theme_minimal()
pay_rf_data <- pay_no_comment %>%
select(satisfaction_group, payment_value, max_installments,
delivery_days, delay_days) %>%
drop_na() %>%
filter(if_all(where(is.numeric), is.finite)) %>%
mutate(satisfaction_group = as.factor(satisfaction_group))
set.seed(456)
pay_rf_model <- randomForest(
satisfaction_group ~ ., data = pay_rf_data,
ntree = 200, importance = TRUE
)
as.data.frame(importance(pay_rf_model)) %>%
rownames_to_column("Feature") %>%
arrange(desc(MeanDecreaseGini)) %>%
ggplot(aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "RF variable importance — payment & delivery features",
subtitle = "Higher = more important in separating happy from mad customers.",
x = NULL, y = "Mean Decrease Gini"
) +
theme_minimal()
Section C findings:
delay_days and delivery_days dominate. Payment
variables contribute much less signal.Analyst: Michal
Question: Which observable variables correlate most strongly with review score? Is delivery delay the dominant signal?
mic_payments_clean <- payments %>%
group_by(order_id) %>%
summarise(payment_value = sum(payment_value), .groups = "drop")
mic_items_clean <- order_items %>%
group_by(order_id) %>%
summarise(price = sum(price), freight_value = sum(freight_value), .groups = "drop")
mic_data <- orders %>%
select(order_id, order_delivered_customer_date, order_estimated_delivery_date) %>%
inner_join(reviews %>% select(order_id, review_score), by = "order_id") %>%
inner_join(mic_payments_clean, by = "order_id") %>%
inner_join(mic_items_clean, by = "order_id") %>%
na.omit() %>%
mutate(order_delay = as.numeric(
order_delivered_customer_date - order_estimated_delivery_date,
units = "days"
))
mic_correlations <- data.frame(
variable = c("order_delay", "freight_value", "payment_value", "price"),
correlation = c(
cor(mic_data$order_delay, mic_data$review_score, method = "spearman"),
cor(mic_data$freight_value, mic_data$review_score, method = "spearman"),
cor(mic_data$payment_value, mic_data$review_score, method = "spearman"),
cor(mic_data$price, mic_data$review_score, method = "spearman")
)
)
ggplot(mic_correlations,
aes(x = reorder(variable, correlation), y = correlation,
fill = correlation)) +
geom_col() +
geom_text(aes(label = round(correlation, 3)), hjust = -0.2, size = 4.5) +
scale_fill_gradient2(low = "#d73027", mid = "white", high = "#1a9850",
midpoint = 0, guide = "none") +
coord_flip() +
ylim(-0.25, 0.05) +
labs(
title = "Spearman correlation with review score",
subtitle = "Negative = variable increases as satisfaction falls.",
x = NULL, y = "Spearman ρ"
) +
theme_minimal()
mic_data <- mic_data %>% mutate(score_f = factor(review_score))
ggplot(mic_data, aes(x = score_f, y = order_delay, fill = score_f)) +
geom_boxplot(outlier.alpha = 0.15) +
scale_fill_manual(values = c(
"1" = "#d73027","2" = "#f46d43","3" = "#fee08b",
"4" = "#a6d96a","5" = "#1a9850"
), guide = "none") +
labs(
title = "Delivery delay (vs. estimated date) by review score",
subtitle = "Negative = arrived early. 1-star orders are noticeably later than 5-star orders.",
x = "Review Score", y = "Delay (days)"
) +
theme_minimal()
mic_data %>%
group_by(review_score) %>%
summarise(avg_delay = mean(order_delay), .groups = "drop") %>%
mutate(score_f = factor(review_score)) %>%
ggplot(aes(x = score_f, y = avg_delay, fill = score_f)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c(
"1" = "#d73027","2" = "#f46d43","3" = "#fee08b",
"4" = "#a6d96a","5" = "#1a9850"
)) +
labs(
title = "Average delivery delay by review score",
subtitle = "1-star orders arrive ~10 days late on average; 5-star orders arrive early.",
x = "Review Score", y = "Average delay (days)"
) +
theme_minimal()
Section D findings:
order_delay has the strongest negative Spearman
correlation with review score (ρ ≈ −0.22). All other variables (freight,
payment value, item price) have correlations close to zero.The four sub-analyses converge on a consistent set of findings.
| Evidence source | Finding |
|---|---|
| Section A (RF importance) | delay_days is 5–6× more predictive than any geographic
variable |
| Section B (product clusters) | Worst-rated categories are bulk/heavy goods — associated with longer delivery |
| Section C (payment RF) | Delivery timing dominates even in a payment-focused model |
| Section D (Spearman) | Order delay has the strongest statistical correlation with review score |
All four independent analyses point to the same root cause: delivery timing relative to the promised date is the primary driver of customer dissatisfaction — more than geography, product category, or payment method.
1. Extend estimated delivery dates in North and North-East states. The North-East cluster (AL, MA, SE) has both the worst average review scores and the thinnest early-delivery margin (Section A). Olist doesn’t need to make logistics faster — it needs to make promises more honest. Under-promising by 1–2 days costs nothing and directly improves how deliveries feel to the customer.
2. Audit bulky-goods logistics separately. Section B identifies furniture, appliances, and home goods as the consistently worst-rated categories. These items are harder to ship and more likely to arrive damaged or late. A dedicated logistics track for these categories — with different delivery estimates, better packaging standards, and proactive tracking updates — would address the source of the low scores rather than masking them.
3. Recruit sellers in underserved states to reduce cross-state shipping. Section A shows that same-state orders score higher than cross-state orders (4.21 vs 4.05). 64% of all orders cross a state border because sellers are concentrated in São Paulo. Seller recruitment programmes in the North-East and Centre-West would shorten supply chains, reduce delays, and improve satisfaction at the source.
Observational data — no causal claims. All findings are associations. Delivery delay correlates with bad reviews but correlation does not mean that faster delivery would necessarily improve scores — confounders such as product quality, seller responsiveness, and customer expectations are not captured in the dataset.
Sections B and C use only silent-rater orders. Filtering to orders without written reviews makes the payment and product analyses cleaner but also means the findings do not generalise to the ~30% of orders where customers wrote a comment. Those customers may behave differently.
São Paulo concentration. SP accounts for ~42% of all orders. Any state-level average is heavily influenced by SP’s size, and findings about “typical” performance may reflect SP’s logistics infrastructure more than a nationwide pattern.
All four sections of this report were produced with AI assistance (Codex / Claude) under the Responsible Analytics framework of the course. The following verification steps were applied across all sections:
sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.5.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Warsaw
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] NbClust_3.0.1 randomForest_4.7-1.2 leaflet_2.2.3
## [4] scales_1.4.0 knitr_1.51 cluster_2.1.8.2
## [7] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
## [10] dplyr_1.2.1 purrr_1.2.2 readr_2.2.0
## [13] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3
## [16] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 stringi_1.8.7 hms_1.1.4
## [5] digest_0.6.39 magrittr_2.0.5 evaluate_1.0.5 grid_4.6.0
## [9] timechange_0.4.0 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0
## [13] crosstalk_1.2.2 jquerylib_0.1.4 cli_3.6.6 crayon_1.5.3
## [17] rlang_1.2.0 bit64_4.8.2 withr_3.0.2 cachem_1.1.0
## [21] yaml_2.3.12 otel_0.2.0 parallel_4.6.0 tools_4.6.0
## [25] tzdb_0.5.0 vctrs_0.7.3 R6_2.6.1 lifecycle_1.0.5
## [29] htmlwidgets_1.6.4 bit_4.6.0 vroom_1.7.1 pkgconfig_2.0.3
## [33] pillar_1.11.1 bslib_0.11.0 gtable_0.3.6 glue_1.8.1
## [37] xfun_0.58 tidyselect_1.2.1 rstudioapi_0.18.0 farver_2.1.2
## [41] htmltools_0.5.9 labeling_0.4.3 rmarkdown_2.31 compiler_4.6.0
## [45] S7_0.2.2