Introduction

This report is the group submission for the Codex + ggplot2 for Responsible Analytics summer school. The overall topic is customer satisfaction on Olist, Brazil’s largest e-commerce platform.

The dataset covers ~99,000 orders placed between 2016 and 2018, with matched delivery records, payment information, product data, and customer review scores (1–5 stars).

Each group member independently analysed a different dimension of satisfaction:

Section Sub-topic Analyst
A Geography — where customers and sellers are Malakai Shaffer
B Products & categories — what is being bought Edvinas Kučys
C Payment behaviour — how customers pay [Group Member]
D Delivery timing — statistical correlation Michal

All code was AI-assisted (Codex / Claude) with human verification at every step. Findings are framed as associations, not causal claims.


Shared data loading

All four analyses draw from the same nine CSV files. Loading once here avoids duplication and ensures every section is working from identical source data.

# Shared helpers (haversine, build_sat, clustering functions, etc.)
source("functions.R")
library(lubridate)
library(leaflet)
library(randomForest)
library(cluster)
library(NbClust)
# Core tables used across all sections
d           <- load_olist_data()            # orders, reviews, customers, sellers, order_items, geolocation
payments    <- read_csv("olist_order_payments_dataset.csv", show_col_types = FALSE)
products    <- read_csv("olist_products_dataset.csv",       show_col_types = FALSE)
cat_trans   <- read_csv("product_category_name_translation.csv", show_col_types = FALSE)

# Convenience aliases used by multiple sections
orders      <- d$orders
reviews     <- d$reviews
customers   <- d$customers
sellers     <- d$sellers
order_items <- d$order_items
geolocation <- d$geolocation

cat("Orders:", nrow(orders), "| Reviews:", nrow(reviews),
    "| Customers:", nrow(customers), "\n")
## Orders: 99441 | Reviews: 99224 | Customers: 99441
# Overall distribution of review scores — the starting point for the whole group
reviews %>%
  dplyr::group_by(review_score) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  mutate(
    pct     = round(n / sum(n) * 100, 1),
    label   = paste0(pct, "%"),
    score_f = factor(review_score)
  ) %>%
  ggplot(aes(x = score_f, y = n, fill = score_f)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = label), vjust = -0.4, size = 4.5) +
  scale_fill_manual(values = c(
    "1" = "#d73027", "2" = "#f46d43", "3" = "#fee08b",
    "4" = "#a6d96a", "5" = "#1a9850"
  )) +
  labs(
    title    = "Distribution of review scores — all 99k orders",
    subtitle = "57.8 % of customers leave a 5-star review; 11.5 % leave 1 star.",
    x = "Review Score", y = "Number of Reviews"
  ) +
  theme_minimal()

Reading it. The distribution is strongly left-skewed: most customers are satisfied (or at least don’t leave negative reviews). The 11.5 % who give 1 star represent ~11,400 orders — still a large absolute number worth investigating.


Section A — Geography of Satisfaction

Analyst: Malakai Shaffer

Question: Does where a customer lives, where the seller is based, or the distance between them appear to be associated with the review score they leave?

# build_sat() joins orders + reviews + customers + sellers and computes
# delay_days, cross_state flag, and straight-line distance (Haversine).
sat <- build_sat(d)
cat("Rows in sat (one per order):", nrow(sat), "\n")
## Rows in sat (one per order): 99441

A1 — Average review score by customer state

state_reviews <- sat %>%
  filter(!is.na(customer_state), !is.na(review_score)) %>%
  group_by(customer_state) %>%
  summarise(avg_review = mean(review_score), n_orders = n(), .groups = "drop")

ggplot(state_reviews,
       aes(x = reorder(customer_state, avg_review), y = avg_review)) +
  geom_col(fill = "#2C7FB8") +
  geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
             linetype = "dashed", colour = "grey30") +
  coord_flip() +
  labs(
    title    = "Average review score by customer state",
    subtitle = "Dashed line = national average (4.09). All 27 states.",
    x = NULL, y = "Average review score (1–5)"
  ) +
  theme_minimal()

A2 — Satisfaction falls with distance from seller

dist_bins <- sat %>%
  filter(!is.na(dist_km), !is.na(review_score)) %>%
  mutate(distance_band = cut(
    dist_km,
    breaks = c(0, 100, 500, 1500, Inf),
    labels = c("0–100 km", "100–500 km", "500–1,500 km", "> 1,500 km"),
    right = TRUE, include.lowest = TRUE
  )) %>%
  group_by(distance_band) %>%
  summarise(
    n_orders   = n(),
    avg_review = mean(review_score),
    se_review  = sd(review_score) / sqrt(n()),
    .groups = "drop"
  )

ggplot(dist_bins, aes(x = distance_band, y = avg_review)) +
  geom_col(fill = "#6A51A3", width = 0.6) +
  geom_errorbar(aes(ymin = avg_review - se_review,
                    ymax = avg_review + se_review),
                width = 0.2, colour = "grey30") +
  geom_text(aes(label = comma(n_orders), y = 3.55),
            colour = "white", fontface = "bold", size = 3.8) +
  geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
             linetype = "dashed", colour = "grey30") +
  labs(
    title    = "Average review score by customer–seller distance",
    subtitle = "Error bars = ±1 SE. White = order count per band. Dashed = national average.",
    x = "Straight-line distance (km)", y = "Average review score (1–5)"
  ) +
  theme_minimal()

A3 — What predicts a low review? (Random Forest)

rf_data_geo <- prep_rf_data(sat, threshold = 4)
splits_geo  <- split_data(rf_data_geo)
rf_geo      <- fit_rf(splits_geo$train, ntree = 200)
imp_geo     <- get_importance(rf_geo)
plot_importance(imp_geo) +
  labs(title = "RF variable importance — geographic features",
       subtitle = "delay_days is 5–6× more important than any geographic variable.")

A4 — Interactive map of Brazil

state_centroids <- geolocation %>%
  filter(geolocation_lat > -34, geolocation_lat < 6,
         geolocation_lng > -74, geolocation_lng < -34) %>%
  group_by(geolocation_state) %>%
  summarise(lat = mean(geolocation_lat), lng = mean(geolocation_lng), .groups = "drop")

map_data <- state_centroids %>%
  inner_join(state_reviews, by = c("geolocation_state" = "customer_state")) %>%
  mutate(popup_text = paste0(
    "<b>", geolocation_state, "</b><br>",
    "Avg review: <b>", round(avg_review, 3), "</b><br>",
    "Orders: ", comma(n_orders)
  ))

pal <- colorNumeric("RdYlGn", domain = map_data$avg_review)

leaflet(map_data) %>%
  addTiles() %>%
  setView(lng = -52, lat = -14, zoom = 4) %>%
  addCircleMarkers(
    lng = ~lng, lat = ~lat,
    radius = ~sqrt(n_orders / 200),
    fillColor = ~pal(avg_review), color = ~pal(avg_review),
    fillOpacity = 0.85, stroke = FALSE,
    popup = ~popup_text, label = ~geolocation_state
  ) %>%
  addLegend("bottomright", pal = pal, values = ~avg_review,
            title = "Avg review", opacity = 1)

Section A findings:

  • North and North-East states (AL, MA, SE) score below the national average; South-East and South (SP, PR) score above it.
  • Satisfaction declines monotonically with distance: orders under 100 km average 4.21 vs 3.94 for orders over 1,500 km.
  • The Random Forest shows that delay_days is 5–6× more predictive than any geographic variable — geography matters, but primarily as a proxy for delivery timing.

Section B — Product & Category Analysis

Analyst: Edvinas Kučys

Question: Which product categories are associated with low satisfaction, and do products cluster into meaningful groups by price, delivery time, and rating?

Note: This section analyses orders where customers left a numeric score but no written comment — the “silent rating” signal. Orders with written reviews are excluded to focus on typical, unprompted satisfaction.

library(NbClust)

# Identify orders that have ANY written review text
edv_written_ids <- unique(reviews$order_id[
  (!is.na(reviews$review_comment_title)   & trimws(reviews$review_comment_title)   != "") |
  (!is.na(reviews$review_comment_message) & trimws(reviews$review_comment_message) != "")
])

# Restrict to orders WITHOUT written text
edv_reviews_clean <- reviews[
  !reviews$order_id %in% edv_written_ids, ]

# Review scores for those clean orders
edv_product_reviews <- merge(
  order_items[, c("order_id", "product_id")],
  edv_reviews_clean[,   c("order_id", "review_score")],
  by = "order_id"
)

B1 — Worst-rated product categories

edv_category_reviews <- merge(
  edv_product_reviews,
  products[, c("product_id", "product_category_name")],
  by = "product_id", all.x = TRUE
)
edv_category_reviews <- edv_category_reviews[
  !is.na(edv_category_reviews$product_category_name) &
    trimws(edv_category_reviews$product_category_name) != "", ]

edv_cat_summary <- aggregate(
  review_score ~ product_category_name, data = edv_category_reviews,
  FUN = function(x) c(avg = mean(x), n = length(x))
)
edv_cat_summary <- data.frame(
  product_category_name = edv_cat_summary$product_category_name,
  average_score  = edv_cat_summary$review_score[, "avg"],
  review_count   = edv_cat_summary$review_score[, "n"]
)
edv_cat_summary <- edv_cat_summary[
  edv_cat_summary$review_count >= 10 & edv_cat_summary$average_score <= 4.5, ]
edv_cat_summary <- edv_cat_summary[
  order(edv_cat_summary$average_score, -edv_cat_summary$review_count), ]
edv_worst_cats  <- head(edv_cat_summary, 15)
edv_worst_cats$label <- paste0(
  edv_worst_cats$product_category_name, " (n = ", edv_worst_cats$review_count, ")"
)

ggplot(edv_worst_cats,
       aes(x = reorder(label, average_score), y = average_score)) +
  geom_col(fill = "darkorange") +
  scale_y_continuous(limits = c(0, 5), breaks = 1:5) +
  coord_flip() +
  labs(
    title    = "Worst-rated product categories (silent raters only)",
    subtitle = "Minimum 10 reviews; average score ≤ 4.5.",
    x = NULL, y = "Average review score (1–5)"
  ) +
  theme_minimal()

B2 — K-means product clustering (k = 7)

# Prepare product-level features: avg review, price, freight, delivery time
edv_orders_delivery <- orders %>%
  mutate(
    delivery_days = as.numeric(
      as.POSIXct(order_delivered_customer_date) -
      as.POSIXct(order_purchase_timestamp),   units = "days"
    )
  ) %>%
  select(order_id, delivery_days)

edv_purchase_data <- edv_product_reviews %>%
  merge(order_items[, c("order_id","product_id","price","freight_value")],
        by = c("order_id","product_id")) %>%
  merge(products[, c("product_id","product_category_name")],
        by = "product_id", all.x = TRUE) %>%
  merge(edv_orders_delivery, by = "order_id", all.x = TRUE)

edv_product_summary <- aggregate(
  cbind(review_score, price, freight_value, delivery_days) ~ product_id,
  data = edv_purchase_data, FUN = mean, na.rm = TRUE
)
edv_product_summary$review_count <- as.numeric(
  table(edv_purchase_data$product_id)[edv_product_summary$product_id]
)
edv_product_summary <- edv_product_summary[edv_product_summary$review_count >= 10, ]
edv_feature_cols <- c("review_score","review_count","price","freight_value","delivery_days")
edv_complete     <- complete.cases(edv_product_summary[, edv_feature_cols])
edv_features_sc  <- scale(edv_product_summary[edv_complete, edv_feature_cols])

set.seed(42)
# k = 7 was selected using CCC and Pseudo-F statistics (run offline via NbClust).
edv_km  <- kmeans(edv_features_sc, centers = 7, nstart = 25)
edv_pca <- prcomp(edv_features_sc)
edv_var <- summary(edv_pca)$importance[2, 1:2]

edv_plot_df <- as.data.frame(edv_pca$x[, 1:2]) %>%
  mutate(cluster = factor(edv_km$cluster))

ggplot(edv_plot_df, aes(x = PC1, y = PC2, colour = cluster)) +
  geom_point(size = 2, alpha = 0.7) +
  scale_colour_brewer(palette = "Set1", name = "Cluster") +
  labs(
    title    = paste0("K-means product clustering (k = 7)"),
    subtitle = paste0("PCA projection — PC1 + PC2 explain ",
                      round(sum(edv_var) * 100, 1), "% of variance."),
    x = paste0("PC1 (", round(edv_var[1] * 100, 1), "%)"),
    y = paste0("PC2 (", round(edv_var[2] * 100, 1), "%)")
  ) +
  theme_minimal()

edv_summary_df <- data.frame(
  cluster        = factor(edv_km$cluster),
  edv_product_summary[edv_complete, edv_feature_cols]
)
edv_cluster_profile <- aggregate(. ~ cluster, data = edv_summary_df, FUN = mean)
edv_cluster_profile$product_count <- as.numeric(table(edv_km$cluster))
edv_cluster_profile[, -1] <- round(edv_cluster_profile[, -1], 2)
kable(edv_cluster_profile[, c("cluster","product_count","review_score","price",
                               "freight_value","delivery_days")],
      caption = "Product cluster profiles — mean values per cluster")
Product cluster profiles — mean values per cluster
cluster product_count review_score price freight_value delivery_days
1 603 4.58 55.43 14.07 8.46
2 28 4.36 673.72 31.97 14.09
3 92 4.42 153.14 40.56 14.24
4 549 4.42 96.76 18.20 14.06
5 23 2.84 108.75 24.25 31.81
6 77 2.81 54.74 15.33 12.40
7 23 4.20 91.05 16.91 13.01

Section B findings:

  • Categories with the lowest silent ratings tend to be large/heavy items — furniture, appliances, and home goods. These are also the categories with the highest freight cost and longest delivery times.
  • The 7-cluster product segmentation separates high-price/slow-delivery items from low-price/fast-delivery items with a clear satisfaction gradient between them.
  • Insight: the worst-rated categories are not inherently bad products — their low ratings are associated with the logistics challenges of shipping bulky goods.

Section C — Payment Behaviour Analysis

Analyst: [Group Member]

Question: Does how a customer pays — the payment method, total value, or number of instalments — appear to be associated with their satisfaction score?

This section also uses silent-rater orders (numeric score, no written comment) to isolate payment behaviour from text-expressed grievances.

# Build the payment feature dataset
pay_features <- payments %>%
  group_by(order_id) %>%
  summarise(
    payment_value      = sum(payment_value,          na.rm = TRUE),
    max_installments   = max(payment_installments,   na.rm = TRUE),
    n_payment_types    = n_distinct(payment_type),
    main_payment_type  = names(sort(table(payment_type), decreasing = TRUE))[1],
    .groups = "drop"
  )

# Silent-rater reviews joined to payment features
pay_review_ids <- reviews %>%
  filter(!is.na(review_score),
         is.na(review_comment_message),
         is.na(review_comment_title)) %>%
  distinct(order_id, .keep_all = TRUE)

pay_model_data <- pay_review_ids %>%
  left_join(orders,       by = "order_id") %>%
  left_join(customers,    by = "customer_id") %>%
  left_join(pay_features, by = "order_id") %>%
  mutate(
    purchase_date  = as.POSIXct(order_purchase_timestamp),
    delivered_date = as.POSIXct(order_delivered_customer_date),
    estimated_date = as.POSIXct(order_estimated_delivery_date),
    delivery_days  = as.numeric(difftime(delivered_date, purchase_date,  units = "days")),
    delay_days     = as.numeric(difftime(delivered_date, estimated_date, units = "days")),
    satisfaction_group = case_when(
      review_score >= 4 ~ "happy",
      review_score <= 3 ~ "mad",
      TRUE              ~ "neutral"
    )
  )

pay_no_comment <- pay_model_data %>%
  filter(!is.na(payment_value), !is.na(max_installments),
         !is.na(main_payment_type), payment_value > 0)

C1 — Payment value vs rating (log scale)

pay_no_comment <- pay_no_comment %>% mutate(score_f = factor(review_score))
ggplot(pay_no_comment, aes(x = score_f, y = payment_value)) +
  geom_boxplot(fill = "lightblue", outlier.alpha = 0.1) +
  scale_y_log10(labels = comma) +
  labs(
    title    = "Payment value by review score (silent raters)",
    subtitle = "Log scale. Median payment value is broadly similar across all scores.",
    x = "Review Score", y = "Payment value (R$, log scale)"
  ) +
  theme_minimal()

C2 — Average instalments by rating

pay_no_comment %>%
  group_by(review_score) %>%
  summarise(avg_inst = mean(max_installments, na.rm = TRUE), .groups = "drop") %>%
  mutate(score_f = factor(review_score)) %>%
  ggplot(aes(x = score_f, y = avg_inst, fill = score_f)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(
    "1" = "#d73027", "2" = "#f46d43", "3" = "#fee08b",
    "4" = "#a6d96a", "5" = "#1a9850"
  )) +
  labs(
    title    = "Average number of instalments by review score",
    subtitle = "Lower-rated orders are associated with slightly more payment instalments.",
    x = "Review Score", y = "Average number of instalments"
  ) +
  theme_minimal()

C3 — What payment features predict satisfaction? (Random Forest)

pay_rf_data <- pay_no_comment %>%
  select(satisfaction_group, payment_value, max_installments,
         delivery_days, delay_days) %>%
  drop_na() %>%
  filter(if_all(where(is.numeric), is.finite)) %>%
  mutate(satisfaction_group = as.factor(satisfaction_group))

set.seed(456)
pay_rf_model <- randomForest(
  satisfaction_group ~ ., data = pay_rf_data,
  ntree = 200, importance = TRUE
)

as.data.frame(importance(pay_rf_model)) %>%
  rownames_to_column("Feature") %>%
  arrange(desc(MeanDecreaseGini)) %>%
  ggplot(aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title    = "RF variable importance — payment & delivery features",
    subtitle = "Higher = more important in separating happy from mad customers.",
    x = NULL, y = "Mean Decrease Gini"
  ) +
  theme_minimal()

Section C findings:

  • Payment value and number of instalments show only a weak association with review score across the distributions — the box plots are broadly overlapping.
  • When delivery features are included in the RF, delay_days and delivery_days dominate. Payment variables contribute much less signal.
  • Conclusion: how customers pay is not a strong predictor of dissatisfaction. Delivery experience appears to be more important even within this payment-focused analysis.

Section D — Delivery Timing & Statistical Correlation

Analyst: Michal

Question: Which observable variables correlate most strongly with review score? Is delivery delay the dominant signal?

mic_payments_clean <- payments %>%
  group_by(order_id) %>%
  summarise(payment_value = sum(payment_value), .groups = "drop")

mic_items_clean <- order_items %>%
  group_by(order_id) %>%
  summarise(price = sum(price), freight_value = sum(freight_value), .groups = "drop")

mic_data <- orders %>%
  select(order_id, order_delivered_customer_date, order_estimated_delivery_date) %>%
  inner_join(reviews %>% select(order_id, review_score), by = "order_id") %>%
  inner_join(mic_payments_clean, by = "order_id") %>%
  inner_join(mic_items_clean,    by = "order_id") %>%
  na.omit() %>%
  mutate(order_delay = as.numeric(
    order_delivered_customer_date - order_estimated_delivery_date,
    units = "days"
  ))

D1 — Spearman correlations with review score

mic_correlations <- data.frame(
  variable    = c("order_delay", "freight_value", "payment_value", "price"),
  correlation = c(
    cor(mic_data$order_delay,   mic_data$review_score, method = "spearman"),
    cor(mic_data$freight_value, mic_data$review_score, method = "spearman"),
    cor(mic_data$payment_value, mic_data$review_score, method = "spearman"),
    cor(mic_data$price,         mic_data$review_score, method = "spearman")
  )
)

ggplot(mic_correlations,
       aes(x = reorder(variable, correlation), y = correlation,
           fill = correlation)) +
  geom_col() +
  geom_text(aes(label = round(correlation, 3)), hjust = -0.2, size = 4.5) +
  scale_fill_gradient2(low = "#d73027", mid = "white", high = "#1a9850",
                       midpoint = 0, guide = "none") +
  coord_flip() +
  ylim(-0.25, 0.05) +
  labs(
    title    = "Spearman correlation with review score",
    subtitle = "Negative = variable increases as satisfaction falls.",
    x = NULL, y = "Spearman ρ"
  ) +
  theme_minimal()

D2 — Delivery delay by review score

mic_data <- mic_data %>% mutate(score_f = factor(review_score))
ggplot(mic_data, aes(x = score_f, y = order_delay, fill = score_f)) +
  geom_boxplot(outlier.alpha = 0.15) +
  scale_fill_manual(values = c(
    "1" = "#d73027","2" = "#f46d43","3" = "#fee08b",
    "4" = "#a6d96a","5" = "#1a9850"
  ), guide = "none") +
  labs(
    title    = "Delivery delay (vs. estimated date) by review score",
    subtitle = "Negative = arrived early. 1-star orders are noticeably later than 5-star orders.",
    x = "Review Score", y = "Delay (days)"
  ) +
  theme_minimal()

mic_data %>%
  group_by(review_score) %>%
  summarise(avg_delay = mean(order_delay), .groups = "drop") %>%
  mutate(score_f = factor(review_score)) %>%
  ggplot(aes(x = score_f, y = avg_delay, fill = score_f)) +
  geom_col(show.legend = FALSE) +
  scale_fill_manual(values = c(
    "1" = "#d73027","2" = "#f46d43","3" = "#fee08b",
    "4" = "#a6d96a","5" = "#1a9850"
  )) +
  labs(
    title    = "Average delivery delay by review score",
    subtitle = "1-star orders arrive ~10 days late on average; 5-star orders arrive early.",
    x = "Review Score", y = "Average delay (days)"
  ) +
  theme_minimal()

Section D findings:

  • order_delay has the strongest negative Spearman correlation with review score (ρ ≈ −0.22). All other variables (freight, payment value, item price) have correlations close to zero.
  • 1-star orders arrive roughly 10 days late on average; 5-star orders arrive early. This is the clearest statistical signal in the whole dataset.
  • The correlation is Spearman (rank-based), so it is robust to the non-normal distribution of review scores and delay values.

Group Findings & Actionable Outcomes

The four sub-analyses converge on a consistent set of findings.

What drives satisfaction?

Evidence source Finding
Section A (RF importance) delay_days is 5–6× more predictive than any geographic variable
Section B (product clusters) Worst-rated categories are bulk/heavy goods — associated with longer delivery
Section C (payment RF) Delivery timing dominates even in a payment-focused model
Section D (Spearman) Order delay has the strongest statistical correlation with review score

All four independent analyses point to the same root cause: delivery timing relative to the promised date is the primary driver of customer dissatisfaction — more than geography, product category, or payment method.

Three actionable recommendations

1. Extend estimated delivery dates in North and North-East states. The North-East cluster (AL, MA, SE) has both the worst average review scores and the thinnest early-delivery margin (Section A). Olist doesn’t need to make logistics faster — it needs to make promises more honest. Under-promising by 1–2 days costs nothing and directly improves how deliveries feel to the customer.

2. Audit bulky-goods logistics separately. Section B identifies furniture, appliances, and home goods as the consistently worst-rated categories. These items are harder to ship and more likely to arrive damaged or late. A dedicated logistics track for these categories — with different delivery estimates, better packaging standards, and proactive tracking updates — would address the source of the low scores rather than masking them.

3. Recruit sellers in underserved states to reduce cross-state shipping. Section A shows that same-state orders score higher than cross-state orders (4.21 vs 4.05). 64% of all orders cross a state border because sellers are concentrated in São Paulo. Seller recruitment programmes in the North-East and Centre-West would shorten supply chains, reduce delays, and improve satisfaction at the source.


Limitations and AI Audit Note

Limitations

  1. Observational data — no causal claims. All findings are associations. Delivery delay correlates with bad reviews but correlation does not mean that faster delivery would necessarily improve scores — confounders such as product quality, seller responsiveness, and customer expectations are not captured in the dataset.

  2. Sections B and C use only silent-rater orders. Filtering to orders without written reviews makes the payment and product analyses cleaner but also means the findings do not generalise to the ~30% of orders where customers wrote a comment. Those customers may behave differently.

  3. São Paulo concentration. SP accounts for ~42% of all orders. Any state-level average is heavily influenced by SP’s size, and findings about “typical” performance may reflect SP’s logistics infrastructure more than a nationwide pattern.

AI Audit Note

All four sections of this report were produced with AI assistance (Codex / Claude) under the Responsible Analytics framework of the course. The following verification steps were applied across all sections:

  • Row counts were checked after every join to confirm no fan-out from many-to-many relationships.
  • Variable distributions were spot-checked before modelling to confirm that scaling, filtering, and feature engineering produced plausible values.
  • Random Forest seeds are set and documented for reproducibility; OOB error rates were inspected and fell in the expected range for noisy survey data.
  • Association language was used throughout; any causal phrasing introduced by AI drafts was rewritten before inclusion.
  • No external citations are used — all numbers come from the dataset directly. There are no [UNVERIFIED] claims.
sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.5.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] NbClust_3.0.1        randomForest_4.7-1.2 leaflet_2.2.3       
##  [4] scales_1.4.0         knitr_1.51           cluster_2.1.8.2     
##  [7] lubridate_1.9.5      forcats_1.0.1        stringr_1.6.0       
## [10] dplyr_1.2.1          purrr_1.2.2          readr_2.2.0         
## [13] tidyr_1.3.2          tibble_3.3.1         ggplot2_4.0.3       
## [16] tidyverse_2.0.0     
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.4         
##  [5] digest_0.6.39      magrittr_2.0.5     evaluate_1.0.5     grid_4.6.0        
##  [9] timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] crosstalk_1.2.2    jquerylib_0.1.4    cli_3.6.6          crayon_1.5.3      
## [17] rlang_1.2.0        bit64_4.8.2        withr_3.0.2        cachem_1.1.0      
## [21] yaml_2.3.12        otel_0.2.0         parallel_4.6.0     tools_4.6.0       
## [25] tzdb_0.5.0         vctrs_0.7.3        R6_2.6.1           lifecycle_1.0.5   
## [29] htmlwidgets_1.6.4  bit_4.6.0          vroom_1.7.1        pkgconfig_2.0.3   
## [33] pillar_1.11.1      bslib_0.11.0       gtable_0.3.6       glue_1.8.1        
## [37] xfun_0.58          tidyselect_1.2.1   rstudioapi_0.18.0  farver_2.1.2      
## [41] htmltools_0.5.9    labeling_0.4.3     rmarkdown_2.31     compiler_4.6.0    
## [45] S7_0.2.2