Introduction

This report examines how geography relates to customer satisfaction on Olist, a Brazilian e-commerce platform. The group topic is customer satisfaction; my individual angle is the geographic dimension — does where a customer lives, where the seller is based, and the distance between them appear to be associated with the review score a customer leaves and with how late their order arrives?

Every claim below is framed as an association, not a cause. Observational data of this kind cannot, on its own, establish causation, and that caution is carried through the whole report.

AI-assisted workflow note. This analysis was written with AI assistance (Claude), but every join, row count, and figure was checked by hand against the raw data before being trusted. The verification notes printed after each step are the record of that checking. A fuller AI audit note appears at the end.

# functions.R contains all shared helpers (data loading, joins, Haversine distance,
# zip centroids). Sourcing it here means we use the same definitions as the
# clustering report — no duplication, no drift.
source("functions.R")
library(lubridate)   # date arithmetic — not in functions.R as it is report-specific
# Run install.packages("leaflet") ONCE in your R Console before knitting.
library(leaflet)

Step 1 — Load the data and sanity-check it

I only load the tables this sub-topic actually needs: orders, reviews, customers, sellers, the order–items bridge (which is how an order is linked to a seller), and the geolocation file (used only for the optional map at the end).

# read_csv() is from readr (part of the tidyverse). It is faster than the base
# read.csv() and does not silently convert text columns into factors.
# The CSVs sit in the SAME folder as this .Rmd, so plain file names work when knitting.

orders      <- read_csv("olist_orders_dataset.csv")
reviews     <- read_csv("olist_order_reviews_dataset.csv")
customers   <- read_csv("olist_customers_dataset.csv")
sellers     <- read_csv("olist_sellers_dataset.csv")
order_items <- read_csv("olist_order_items_dataset.csv")
geolocation <- read_csv("olist_geolocation_dataset.csv")
# A quick, honest health-check of each table BEFORE doing any analysis.
# If these numbers look wrong, everything downstream is wrong, so we look first.

# Row counts for every table we loaded.
tibble(
  table = c("orders", "reviews", "customers", "sellers", "order_items", "geolocation"),
  rows  = c(nrow(orders), nrow(reviews), nrow(customers),
            nrow(sellers), nrow(order_items), nrow(geolocation))
) %>%
  kable(caption = "Row counts per loaded table")
Row counts per loaded table
table rows
orders 99441
reviews 99224
customers 99441
sellers 3095
order_items 112650
geolocation 1000163
# How complete is the key date column we will rely on for delivery delay?
# colSums(is.na(...)) counts missing values column by column.
orders %>%
  summarise(
    total_orders                 = n(),
    missing_delivered_date       = sum(is.na(order_delivered_customer_date)),
    missing_estimated_date       = sum(is.na(order_estimated_delivery_date))
  ) %>%
  kable(caption = "Missingness in the delivery dates we depend on")
Missingness in the delivery dates we depend on
total_orders missing_delivered_date missing_estimated_date
99441 2965 0
# What time span does the data cover? Sanity-checks that dates parsed correctly.
orders %>%
  summarise(
    earliest_purchase = min(order_purchase_timestamp, na.rm = TRUE),
    latest_purchase   = max(order_purchase_timestamp, na.rm = TRUE)
  ) %>%
  kable(caption = "Date range of orders")
Date range of orders
earliest_purchase latest_purchase
2016-09-04 21:15:19 2018-10-17 17:30:18

Human verification — Step 1. Orders should be ~99,441 rows and reviews ~99,224. The geolocation table is by far the largest (≈1M rows of postcode coordinates). A non-trivial number of orders have a missing order_delivered_customer_date (orders that were cancelled or never delivered) — those will correctly drop out of the delay analysis later. The purchase dates should fall in 2016–2018. I checked these against the figures I had already confirmed in Python, and they match.

Step 2 — Join three tables into one satisfaction dataset

The goal is one row per order, carrying: its review score, the customer’s state, and the seller’s state. That lets me compare geography against satisfaction.

The joins, in order:

  1. orders + reviews → attach a review score to each order.
  2. + customers → attach customer_state, customer_city, customer_zip_code_prefix.
  3. + sellers (via the order_items bridge) → attach seller_state and seller_zip_code_prefix (the zip is also kept so the distance calculation later can find the seller’s geographic centroid).

A couple of honest data-shape problems have to be handled first, or the joins will silently inflate the row count:

  • A few orders carry more than one review. I collapse reviews to one mean score per order so the join cannot duplicate orders.
  • An order can contain items from more than one seller. I take one seller per order (the first seller listed) to keep one row per order. This is a deliberate simplification and is listed as a limitation later.
# Collapse reviews to ONE row per order (mean score) so the join can't fan out.
# Most orders have exactly one review; this only affects the handful with several.
reviews_per_order <- reviews %>%
  group_by(order_id) %>%
  summarise(review_score = mean(review_score, na.rm = TRUE), .groups = "drop")
# Reduce the order_items bridge to ONE seller per order.
# distinct(order_id, .keep_all = TRUE) keeps the first row for each order_id,
# which gives us a single seller_id per order.
order_primary_seller <- order_items %>%
  distinct(order_id, .keep_all = TRUE) %>%   # one row per order
  select(order_id, seller_id) %>%            # we only need these two columns
  left_join(sellers, by = "seller_id") %>%   # attach that seller's state
  select(order_id, seller_state, seller_zip_code_prefix)  # keep zip too — needed for distance calc
# JOIN 1: orders + reviews (left_join keeps every order, even if it has no review)
sat <- orders %>%
  left_join(reviews_per_order, by = "order_id")

# Verification: row count should stay equal to the number of orders.
cat("Rows after join 1 (orders + reviews):", nrow(sat),
    "| original orders:", nrow(orders), "\n")
## Rows after join 1 (orders + reviews): 99441 | original orders: 99441

Human verification — Join 1. Row count must equal the original order count (~99,441). If it had grown, a duplicate order_id in reviews would be fanning out the data — which is exactly why I averaged reviews to one row per order first.

# JOIN 2: + customers, to bring in the customer's location.
# Orders link to customers via customer_id (one customer_id per order in this data).
sat <- sat %>%
  left_join(
    customers %>% select(customer_id, customer_state,
                         customer_city, customer_zip_code_prefix),
    by = "customer_id"
  )

cat("Rows after join 2 (+ customers):", nrow(sat),
    "| missing customer_state:", sum(is.na(sat$customer_state)), "\n")
## Rows after join 2 (+ customers): 99441 | missing customer_state: 0

Human verification — Join 2. Row count must still match the order count, and customer_state should have essentially no missing values — every order in Olist is tied to a customer record. I confirmed both.

# JOIN 3: + seller state (via the one-seller-per-order table built above).
sat <- sat %>%
  left_join(order_primary_seller, by = "order_id")

cat("Rows after join 3 (+ seller state):", nrow(sat),
    "| missing seller_state:", sum(is.na(sat$seller_state)), "\n")
## Rows after join 3 (+ seller state): 99441 | missing seller_state: 775

Human verification — Join 3. Row count unchanged again. Some orders have a missing seller_state — these are orders with no matching item/seller row (a known gap in the data). They drop out of the cross-state comparison and that is correct.

# --- Derived columns ------------------------------------------------------

sat <- sat %>%
  mutate(
    # delay_days = actual delivery date minus the date Olist promised.
    # as.numeric(..., units = "days") turns the time difference into a plain number.
    #   negative  = arrived EARLY (before the estimate)
    #   zero      = arrived bang on the estimated date
    #   positive  = arrived LATE
    delay_days = as.numeric(
      order_delivered_customer_date - order_estimated_delivery_date,
      units = "days"
    ),

    # cross_state flag: TRUE when the customer and seller are in DIFFERENT states.
    # We only mark it when BOTH states are known, otherwise it stays NA.
    cross_state = if_else(
      is.na(customer_state) | is.na(seller_state),
      NA,
      customer_state != seller_state
    )
  )

# Quick look at the new columns on a few delivered orders.
sat %>%
  filter(!is.na(delay_days)) %>%
  select(order_id, customer_state, seller_state, cross_state,
         review_score, delay_days) %>%
  head(5) %>%
  kable(caption = "Sample of the joined dataset with derived columns")
Sample of the joined dataset with derived columns
order_id customer_state seller_state cross_state review_score delay_days
e481f51cbdc54678b7cc49136f2d6af7 SP SP FALSE 4 -7.107488
53cdb2fc8bc7dce0b6741e2150273451 BA SP TRUE 4 -5.355729
47770eb9100c2d0c44946d9cf07ec65d GO SP TRUE 5 -17.245498
949d5b44dbf5de918fe9c16f97b45f8a RN MG TRUE 5 -12.980069
ad21c59c0840e6cb83a9ceb5573f8159 SP SP FALSE 5 -9.238171

Human verification — derived columns. I spot-checked a handful of rows by hand: where customer_state and seller_state differ, cross_state is TRUE; a negative delay_days lines up with a delivery date earlier than the estimate. Both behave as intended.

Derived column — straight-line distance between customer and seller (km)

The geolocation file contains a latitude and longitude for every zip code prefix in Brazil. By finding the geographic centre of each prefix and computing the great-circle distance between the customer’s zip and the seller’s zip, I can assign a rough straight-line distance in km to each order.

This is not a road distance — it ignores motorways, rivers, and mountain terrain — but it is an honest first approximation of physical separation. haversine_km() and build_zip_centroids() are defined once in functions.R and reused here.

# build_zip_centroids() filters out impossible coordinates and returns mean lat/lng
# per zip prefix — defined in functions.R.
zip_centroids <- build_zip_centroids(geolocation)

# Attach the customer's centroid, then the seller's, then compute distance.
# haversine_km() is also defined in functions.R.
sat <- sat %>%
  left_join(
    zip_centroids %>% rename(cust_lat = lat, cust_lng = lng),
    by = c("customer_zip_code_prefix" = "geolocation_zip_code_prefix")
  ) %>%
  left_join(
    zip_centroids %>% rename(sell_lat = lat, sell_lng = lng),
    by = c("seller_zip_code_prefix" = "geolocation_zip_code_prefix")
  ) %>%
  mutate(dist_km = haversine_km(cust_lng, cust_lat, sell_lng, sell_lat))

# Quick summary — how many orders have a valid distance, and what's the spread?
sat %>%
  filter(!is.na(dist_km)) %>%
  summarise(
    n_with_distance = n(),
    min_km          = round(min(dist_km),    1),
    median_km       = round(median(dist_km), 1),
    mean_km         = round(mean(dist_km),   1),
    max_km          = round(max(dist_km),    1)
  ) %>%
  kable(caption = "Distribution of straight-line customer–seller distance (km)")
Distribution of straight-line customer–seller distance (km)
n_with_distance min_km median_km mean_km max_km
98174 0 433.7 601.3 3579.9

Human verification — distance column. The median should be a few hundred km (most orders are same-city or adjacent-state). The maximum should not exceed ~4,300 km (the breadth of Brazil). I checked: the numbers fall in the right range, and a handful of NAs appear where a zip code prefix has no geolocation record, which is expected and harmless — those rows simply have no distance and drop out of Chart D.

Step 3 — Visualising the geography of satisfaction

Chart A — Average review score by customer state

# Build a per-state summary: mean review score and how many reviewed orders sit
# behind it (n matters — a state with few orders is a noisier estimate).
state_reviews <- sat %>%
  filter(!is.na(customer_state), !is.na(review_score)) %>%
  group_by(customer_state) %>%
  summarise(
    avg_review = mean(review_score),
    n_orders   = n(),
    .groups = "drop"
  )

ggplot(state_reviews,
       # reorder() sorts the bars by score so the ranking is readable at a glance.
       aes(x = reorder(customer_state, avg_review), y = avg_review)) +
  geom_col(fill = "#2C7FB8") +
  # A reference line at the national average puts each state in context.
  geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
             linetype = "dashed", colour = "grey30") +
  coord_flip() +                                  # horizontal bars: 27 labels fit cleanly
  labs(
    title = "Average review score by customer state",
    subtitle = "Dashed line = national average. All 27 states.",
    x = "Customer state",
    y = "Average review score (1–5)"
  ) +
  theme_minimal()

Reading it cautiously: scores sit in a fairly narrow band (roughly 3.7–4.2), so the state-to-state differences are real but modest. States in the North and North-East tend to sit below the national average; the South-East and South tend to sit above it.

Chart B — Average delivery delay by customer state

# Mean delay per state, using only orders that were actually delivered
# (delay_days exists). Positive = late on average, negative = early on average.
state_delay <- sat %>%
  filter(!is.na(customer_state), !is.na(delay_days)) %>%
  group_by(customer_state) %>%
  summarise(
    avg_delay = mean(delay_days),
    n_orders  = n(),
    .groups = "drop"
  )

ggplot(state_delay,
       aes(x = reorder(customer_state, avg_delay), y = avg_delay,
           # colour the bars by whether the state is, on average, early or late
           fill = avg_delay > 0)) +
  geom_col() +
  scale_fill_manual(values = c("FALSE" = "#41AB5D", "TRUE" = "#D7301F"),
                    labels = c("FALSE" = "Early on average",
                               "TRUE" = "Late on average"),
                    name = NULL) +
  coord_flip() +
  labs(
    title = "Average delivery delay vs. the promised date, by customer state",
    subtitle = "Negative = arrived early; positive = arrived late",
    x = "Customer state",
    y = "Average delay (days)"
  ) +
  theme_minimal()

Reading it cautiously: almost every state arrives early on average — Olist’s estimated dates are conservative. But the size of that early margin varies a lot by geography, and the states with the smallest early margin (closest to late) overlap with the lower-satisfaction states in Chart A.

Chart C — Cross-state vs same-state orders

# Summarise the two groups: same-state vs cross-state shipments.
cross_summary <- sat %>%
  filter(!is.na(cross_state), !is.na(review_score)) %>%
  group_by(cross_state) %>%
  summarise(
    n_orders   = n(),
    avg_review = mean(review_score),
    .groups = "drop"
  ) %>%
  # Turn the TRUE/FALSE flag into readable labels for the chart.
  mutate(group = if_else(cross_state, "Cross-state", "Same-state"))

# We want order VOLUME and average REVIEW side by side. pivot_longer() stacks the two
# different measures into one column so we can facet them into two panels.
cross_long <- cross_summary %>%
  select(group, n_orders, avg_review) %>%
  pivot_longer(c(n_orders, avg_review),
               names_to = "metric", values_to = "value") %>%
  mutate(metric = recode(metric,
                         n_orders   = "Order volume",
                         avg_review = "Average review score"))

ggplot(cross_long, aes(x = group, y = value, fill = group)) +
  geom_col(show.legend = FALSE) +
  # free_y lets each panel use its own scale (counts vs a 1–5 score).
  facet_wrap(~ metric, scales = "free_y") +
  scale_fill_manual(values = c("Same-state" = "#41AB5D",
                               "Cross-state" = "#D7301F")) +
  labs(
    title = "Same-state vs cross-state orders: volume and satisfaction",
    x = NULL, y = NULL
  ) +
  theme_minimal()

# The exact numbers behind the chart, for the audit trail.
cross_summary %>%
  select(group, n_orders, avg_review) %>%
  kable(digits = 3,
        caption = "Cross-state vs same-state: counts and mean review score")
Cross-state vs same-state: counts and mean review score
group n_orders avg_review
Same-state 35235 4.210
Cross-state 62682 4.046

Reading it cautiously: most orders cross a state line (sellers are concentrated in São Paulo), and cross-state orders carry a slightly lower average review score than same-state orders. The gap is small but consistent with the delivery pattern in Chart B.

Chart D — Average review score by shipping distance

# Cut the continuous dist_km column into four ordered bands.
# cut() divides a numeric variable into labelled intervals.
# Inf as the final upper bound means "everything above 1,500 km".
dist_bins <- sat %>%
  filter(!is.na(dist_km), !is.na(review_score)) %>%
  mutate(
    distance_band = cut(
      dist_km,
      breaks = c(0, 100, 500, 1500, Inf),
      labels = c("0–100 km", "100–500 km", "500–1,500 km", "> 1,500 km"),
      right           = TRUE,
      include.lowest  = TRUE
    )
  ) %>%
  group_by(distance_band) %>%
  summarise(
    n_orders   = n(),
    avg_review = mean(review_score),
    # Standard error of the mean = SD / sqrt(n). SE is appropriate here because each
    # band contains thousands of orders — the SE of the MEAN is tiny, which is what
    # matters for comparing band averages.
    se_review  = sd(review_score) / sqrt(n()),
    .groups = "drop"
  )

ggplot(dist_bins, aes(x = distance_band, y = avg_review)) +
  geom_col(fill = "#6A51A3", width = 0.6) +
  geom_errorbar(aes(ymin = avg_review - se_review,
                    ymax = avg_review + se_review),
                width = 0.2, colour = "grey30") +
  # White order-count labels sit inside the bars at y = 3.5.
  geom_text(aes(label = comma(n_orders), y = 3.5),
            colour = "white", fontface = "bold", size = 3.8) +
  geom_hline(yintercept = mean(sat$review_score, na.rm = TRUE),
             linetype = "dashed", colour = "grey30") +
  labs(
    title    = "Average review score by straight-line distance between customer and seller",
    subtitle = "Error bars = ±1 SE of the mean. White labels = orders per band. Dashed = national average.",
    x = "Straight-line distance (km)",
    y = "Average review score (1–5)"
  ) +
  theme_minimal()

Reading it cautiously: there is a visible step-down in satisfaction as distance grows — orders shipped under 100 km score noticeably above the national average, while those over 1,500 km sit below it. The error bars are tight (each band contains thousands of orders), so the pattern is unlikely to be random noise. What causes it is not settled by this chart alone: distance is almost certainly a proxy for delivery time and logistics quality in remote areas, not something that independently drives dissatisfaction.

dist_bins %>%
  select(distance_band, n_orders, avg_review, se_review) %>%
  kable(digits = 3,
        caption = "Satisfaction by distance band — underlying figures")
Satisfaction by distance band — underlying figures
distance_band n_orders avg_review se_review
0–100 km 18214 4.213 0.009
100–500 km 37532 4.132 0.007
500–1,500 km 32741 4.058 0.007
> 1,500 km 8940 3.942 0.015

Bonus — A point map of Brazil coloured by satisfaction

This uses the geolocation file’s latitude/longitude. It needs no extra mapping packages — it simply plots each state’s geographic centre and colours it by that state’s average review score. The result reads as a rough map of Brazil.

# Each state's geographic centre = the mean lat/long of its postcodes.
# We filter to sensible Brazil bounds first, because the raw file contains a few
# impossible coordinates (data-entry errors) that would distort the averages.
state_centroids <- geolocation %>%
  filter(geolocation_lat > -34, geolocation_lat < 6,     # Brazil's lat range
         geolocation_lng > -74, geolocation_lng < -34) %>% # Brazil's lng range
  group_by(geolocation_state) %>%
  summarise(
    lat = mean(geolocation_lat),
    lng = mean(geolocation_lng),
    .groups = "drop"
  )

# Join each state's average review score onto its centroid.
map_data <- state_centroids %>%
  inner_join(state_reviews,
             by = c("geolocation_state" = "customer_state"))

ggplot(map_data, aes(x = lng, y = lat)) +
  geom_point(aes(colour = avg_review, size = n_orders)) +
  geom_text(aes(label = geolocation_state),
            vjust = -1, size = 3, colour = "grey20") +
  scale_colour_viridis_c(option = "plasma", name = "Avg review") +
  scale_size_continuous(name = "Orders", labels = comma) +
  coord_quickmap() +   # keeps the lat/long aspect ratio looking map-like
  labs(
    title = "Average satisfaction by state, placed on Brazil's geography",
    subtitle = "Point = state centroid (mean of its postcodes). Colour = avg review, size = order volume.",
    x = "Longitude", y = "Latitude"
  ) +
  theme_minimal()

Human verification — map. Before plotting I filtered out coordinates outside Brazil’s real latitude/longitude bounds. The brightest (most satisfied, highest-volume) point sits in the South-East where São Paulo is — which matches the per-state ranking in Chart A, so the map and the bar chart tell the same story.

Interactive map — explore satisfaction by state

The static chart above is a snapshot. The leaflet map below is interactive: zoom, pan, click any circle to see the state’s name, average score, and order count.

Circle size scales with order volume (square-root scaled so São Paulo doesn’t visually dwarf the rest of Brazil). Colour runs from red (lower satisfaction) to green (higher).

# map_data was built in the previous chunk (state centroids + avg review + n_orders).
# We add an HTML popup string for each state — leaflet renders the HTML directly.
map_data <- map_data %>%
  mutate(
    popup_text = paste0(
      "<b>", geolocation_state, "</b><br>",
      "Avg review: <b>", round(avg_review, 3), "</b><br>",
      "Orders: ",        comma(n_orders)
    )
  )

# colorNumeric() maps a continuous numeric range onto a colour palette.
# "RdYlGn" = red (low) → yellow (mid) → green (high) — intuitive for a satisfaction score.
pal <- colorNumeric(palette = "RdYlGn", domain = map_data$avg_review)

leaflet(map_data) %>%
  addTiles() %>%                               # standard OpenStreetMap base layer
  setView(lng = -52, lat = -14, zoom = 4) %>% # centred on Brazil
  addCircleMarkers(
    lng         = ~lng,
    lat         = ~lat,
    # sqrt scaling keeps São Paulo (>40k orders) from being absurdly large.
    radius      = ~sqrt(n_orders / 200),
    color       = ~pal(avg_review),
    fillColor   = ~pal(avg_review),
    fillOpacity = 0.85,
    stroke      = FALSE,
    popup       = ~popup_text,
    label       = ~geolocation_state
  ) %>%
  addLegend(
    position = "bottomright",
    pal      = pal,
    values   = ~avg_review,
    title    = "Avg review<br>score",
    opacity  = 1
  )

Human verification — leaflet map. Confirmed that the greener (higher-satisfaction) states in the South-East and South correspond to the same states at the top of Chart A, and the redder states in the North-East match the bottom of Chart A. The map and the bar chart agree, which means the centroid join and the review join are both correct.

Step 4 — Three cautious business insights

  1. Same-state orders are associated with slightly higher satisfaction than cross-state orders. Mean review for same-state shipments is around 4.21 versus roughly 4.05 for cross-state (see Chart C). The gap is modest, and because sellers are concentrated in São Paulo, “cross-state” is partly a proxy for “shipped a long way” — so this is an association, not evidence that crossing a border causes dissatisfaction.

  2. Lower-satisfaction states cluster in the North and North-East. States such as AL, MA and SE sit below the national average in Chart A, while SP, PR and the South-East sit above it. These same regions show the smallest early-delivery margin in Chart B. The geography of satisfaction and the geography of delivery performance appear to move together, which points to logistics as a plausible — but unproven — driver.

  3. Greater distance is associated with lower satisfaction, but distance is likely a proxy for delivery quality, not a cause in itself. Chart D shows a clear step-down across the four distance bands: orders under 100 km average above the national mean, while orders over 1,500 km average below it. The error bars are tight (each band contains thousands of orders), so this is a robust pattern, not noise. The most plausible explanation is that long distances correlate with slower or less reliable logistics in remote regions — which also explains why the same pattern appears in the state-level delivery delay data in Chart B. Distance and delivery quality are entangled here, and this data cannot separate them.

Step 5 — Limitations and AI audit note

Limitations

  1. One seller per order is a simplification. Orders containing items from several sellers were reduced to their first listed seller when assigning seller_state. For multi-seller orders the cross-state flag is therefore approximate, and the true origin mix is lost. Most orders are single-seller, so the direction of the findings is unlikely to flip — but the exact cross-state figures carry this caveat.

  2. Observational data and uneven sample sizes. This is not an experiment, so nothing here establishes causation — distance, delivery delay, product category and price are all entangled. São Paulo alone is ~42% of orders, while several northern states have comparatively few reviews, so their state averages are noisier and should be read as indicative rather than precise.

AI audit note

This report was produced with AI assistance (Claude / Codex) under continuous human verification. The AI drafted the R code, the join logic and the prose; I checked every step against the raw data myself.

Join integrity: I confirmed row counts stayed constant across all three joins (99,441 throughout), so no join silently duplicated rows. I deliberately collapsed multiple reviews per order to one mean score before joining to prevent fan-out, and I verified customer_state had no meaningful missingness.

Derived columns: I hand-spot-checked cross_state, delay_days, and dist_km on sample rows. For distance: I verified the median falls in the expected few-hundred-km range (consistent with sellers being concentrated in São Paulo) and the maximum does not exceed Brazil’s actual breadth (~4,300 km).

Charts and maps: For the static map I confirmed the coordinate filter removed only out-of-bounds points and the highest-satisfaction point fell on São Paulo. For the leaflet map I verified that the red/green colouring matched the bar chart rankings. For Chart D I confirmed the national-average reference line sits between the closest and furthest bands, which is where the pattern would predict it.

Numbers: The headline figures (national average ≈ 4.09; same-state ≈ 4.21 vs cross-state ≈ 4.05; step-down pattern in Chart D) match figures independently confirmed in Python before writing this report.

No external citations are used, so there is nothing left [UNVERIFIED]. Where the AI initially leaned toward causal phrasing, I rewrote those sentences as associations.

# Recorded for reproducibility — exact R and package versions used to knit this report.
sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: aarch64-apple-darwin23
## Running under: macOS Tahoe 26.5.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.6/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_GB/en_GB/en_GB/C/en_GB/en_GB
## 
## time zone: Europe/Warsaw
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] leaflet_2.2.3   scales_1.4.0    knitr_1.51      cluster_2.1.8.2
##  [5] lubridate_1.9.5 forcats_1.0.1   stringr_1.6.0   dplyr_1.2.1    
##  [9] purrr_1.2.2     readr_2.2.0     tidyr_1.3.2     tibble_3.3.1   
## [13] ggplot2_4.0.3   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.4         
##  [5] digest_0.6.39      magrittr_2.0.5     evaluate_1.0.5     grid_4.6.0        
##  [9] timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] viridisLite_0.4.3  crosstalk_1.2.2    jquerylib_0.1.4    cli_3.6.6         
## [17] rlang_1.2.0        crayon_1.5.3       bit64_4.8.2        withr_3.0.2       
## [21] cachem_1.1.0       yaml_2.3.12        otel_0.2.0         tools_4.6.0       
## [25] parallel_4.6.0     tzdb_0.5.0         vctrs_0.7.3        R6_2.6.1          
## [29] lifecycle_1.0.5    htmlwidgets_1.6.4  bit_4.6.0          vroom_1.7.1       
## [33] pkgconfig_2.0.3    pillar_1.11.1      bslib_0.11.0       gtable_0.3.6      
## [37] glue_1.8.1         xfun_0.58          tidyselect_1.2.1   farver_2.1.2      
## [41] htmltools_0.5.9    rmarkdown_2.31     labeling_0.4.3     compiler_4.6.0    
## [45] S7_0.2.2