1 Executive Summary

This report uses unsupervised learning to identify order experience clusters in the Olist marketplace. The clusters are created from delivery, freight, order complexity, payment, and customer geography variables. Review outcomes are deliberately excluded from clustering and are used only after clustering to understand how each experience profile relates to customer satisfaction.

The analysis answers a practical customer satisfaction question: which types of order experiences look risky or positive even when customers do not leave a review?

2 Data Sources

The report loads the required Olist CSV files from the project folder and builds one row per order. One-to-many tables, such as order items and payments, are aggregated before joining to avoid duplicated orders and inflated totals.

data_dir <- params$data_dir

files <- c(
  orders = "olist_orders_dataset.csv",
  reviews = "olist_order_reviews_dataset.csv",
  customers = "olist_customers_dataset.csv",
  items = "olist_order_items_dataset.csv",
  products = "olist_products_dataset.csv",
  sellers = "olist_sellers_dataset.csv",
  payments = "olist_order_payments_dataset.csv",
  translation = "product_category_name_translation.csv"
)

missing_files <- files[!file.exists(file.path(data_dir, files))]
if (length(missing_files) > 0) {
  stop("Missing required file(s): ", paste(missing_files, collapse = ", "))
}

orders <- read_csv(file.path(data_dir, files[["orders"]]), show_col_types = FALSE)
reviews <- read_csv(file.path(data_dir, files[["reviews"]]), show_col_types = FALSE)
customers <- read_csv(file.path(data_dir, files[["customers"]]), show_col_types = FALSE)
items <- read_csv(file.path(data_dir, files[["items"]]), show_col_types = FALSE)
products <- read_csv(file.path(data_dir, files[["products"]]), show_col_types = FALSE)
sellers <- read_csv(file.path(data_dir, files[["sellers"]]), show_col_types = FALSE)
payments <- read_csv(file.path(data_dir, files[["payments"]]), show_col_types = FALSE)
translation <- read_csv(file.path(data_dir, files[["translation"]]), show_col_types = FALSE)

The raw tables contain order status and timestamps, review scores, customer state, product and seller information, freight charges, and payment behavior. These sources are sufficient to describe the order experience without using review data as a clustering input.

3 Analytical Dataset

3.1 Customer Region

Customer state is mapped into a Brazilian macro-region. This reduces the geographic feature from 27 states to a smaller and more interpretable set of regional indicators.

state_region_lookup <- tibble(
  customer_state = c(
    "AC", "AP", "AM", "PA", "RO", "RR", "TO",
    "AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE",
    "DF", "GO", "MT", "MS",
    "ES", "MG", "RJ", "SP",
    "PR", "RS", "SC"
  ),
  customer_region = c(
    rep("North", 7),
    rep("Northeast", 9),
    rep("Central-West", 4),
    rep("Southeast", 4),
    rep("South", 3)
  )
)

Using region rather than raw state avoids giving k-means too many sparse location indicators while still retaining broad geographic context.

3.2 Order-Level Aggregation

timestamp_fields <- c(
  "order_purchase_timestamp",
  "order_approved_at",
  "order_delivered_carrier_date",
  "order_delivered_customer_date",
  "order_estimated_delivery_date"
)

orders_clean <- orders %>%
  mutate(across(all_of(timestamp_fields), parse_olist_datetime)) %>%
  filter(order_status == "delivered")

reviews_order <- reviews %>%
  mutate(
    review_creation_date = parse_olist_datetime(review_creation_date),
    review_answer_timestamp = parse_olist_datetime(review_answer_timestamp),
    has_written_review = if_else(
      !is.na(review_comment_title) | !is.na(review_comment_message),
      TRUE,
      FALSE
    )
  ) %>%
  group_by(order_id) %>%
  arrange(desc(review_answer_timestamp), desc(review_creation_date), .by_group = TRUE) %>%
  summarise(
    review_score = first(review_score),
    review_count = n(),
    has_written_review = any(has_written_review),
    .groups = "drop"
  )

products_enriched <- products %>%
  left_join(translation, by = "product_category_name") %>%
  mutate(
    product_category = coalesce(product_category_name_english, product_category_name, "unknown")
  )

items_enriched <- items %>%
  left_join(products_enriched %>% select(product_id, product_category), by = "product_id") %>%
  left_join(sellers %>% select(seller_id, seller_state), by = "seller_id")

items_order <- items_enriched %>%
  group_by(order_id) %>%
  summarise(
    number_of_items = n(),
    number_of_sellers = n_distinct(seller_id),
    number_of_categories = n_distinct(product_category),
    order_value = sum(price, na.rm = TRUE),
    freight_value = sum(freight_value, na.rm = TRUE),
    dominant_seller_state = mode_value(seller_state),
    .groups = "drop"
  )

payments_order <- payments %>%
  group_by(order_id) %>%
  summarise(
    payment_installments = max(payment_installments, na.rm = TRUE),
    payment_count = n(),
    payment_value = sum(payment_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    payment_installments = if_else(is.infinite(payment_installments), NA_real_, payment_installments)
  )

order_level <- orders_clean %>%
  left_join(customers %>% select(customer_id, customer_state), by = "customer_id") %>%
  left_join(state_region_lookup, by = "customer_state") %>%
  left_join(items_order, by = "order_id") %>%
  left_join(payments_order, by = "order_id") %>%
  left_join(reviews_order, by = "order_id") %>%
  mutate(
    delivery_time_days = as.numeric(as.Date(order_delivered_customer_date) -
                                     as.Date(order_purchase_timestamp)),
    delay_days = as.numeric(as.Date(order_delivered_customer_date) -
                              as.Date(order_estimated_delivery_date)),
    is_late = if_else(!is.na(delay_days), as.integer(delay_days > 0), NA_integer_),
    has_review = !is.na(review_score),
    has_written_review = coalesce(has_written_review, FALSE),
    payment_installments = coalesce(payment_installments, 0),
    negative_review = if_else(has_review, as.integer(review_score <= 2), NA_integer_)
  )

glimpse(order_level)
## Rows: 96,478
## Columns: 27
## $ order_id                      <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
## $ customer_id                   <chr> "9ef432eb6251297304e76186b10a928d", "b08…
## $ order_status                  <chr> "delivered", "delivered", "delivered", "…
## $ order_purchase_timestamp      <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
## $ order_approved_at             <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
## $ order_delivered_carrier_date  <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
## $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
## $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…
## $ customer_state                <chr> "SP", "BA", "GO", "RN", "SP", "PR", "RJ"…
## $ customer_region               <chr> "Southeast", "Northeast", "Central-West"…
## $ number_of_items               <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1…
## $ number_of_sellers             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ number_of_categories          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ order_value                   <dbl> 29.99, 118.70, 159.90, 45.00, 19.90, 147…
## $ freight_value                 <dbl> 8.72, 22.76, 19.22, 27.20, 8.72, 27.36, …
## $ dominant_seller_state         <chr> "SP", "SP", "SP", "MG", "SP", "SP", "SP"…
## $ payment_installments          <dbl> 1, 1, 3, 1, 1, 6, 3, 1, 1, 1, 1, 1, 3, 1…
## $ payment_count                 <int> 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ payment_value                 <dbl> 38.71, 141.46, 179.12, 72.20, 28.62, 175…
## $ review_score                  <dbl> 4, 4, 5, 5, 5, 4, 5, 1, 5, 1, 4, 5, 5, 4…
## $ review_count                  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ has_written_review            <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
## $ delivery_time_days            <dbl> 8, 14, 9, 14, 3, 17, 10, 10, 18, 13, 6, …
## $ delay_days                    <dbl> -8, -6, -18, -13, -10, -6, -12, -32, -7,…
## $ is_late                       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ has_review                    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ negative_review               <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…

This dataset has one row per delivered order. order_value is based on item prices, while freight_value is kept separately so clusters can distinguish product cost from shipping cost. Review fields are joined with a left_join, so orders are not removed because of missing review information.

review_audit <- tibble(
  metric = c(
    "Delivered orders in analytical dataset",
    "Delivered orders without review_score",
    "Delivered orders without written title/message",
    "Review-score participation rate",
    "Written-review participation rate"
  ),
  value = c(
    nrow(order_level),
    sum(!order_level$has_review),
    sum(!order_level$has_written_review),
    mean(order_level$has_review),
    mean(order_level$has_written_review)
  )
)

kable(
  review_audit %>%
    mutate(
      value = if_else(
        str_detect(metric, "rate"),
        percent(as.numeric(value), accuracy = 0.1),
        comma(as.numeric(value))
      )
    ),
  caption = "Audit of review availability among delivered orders"
)
Audit of review availability among delivered orders
metric value
Delivered orders in analytical dataset 96,478.00
Delivered orders without review_score 0.00
Delivered orders without written title/message 55,178.00
Review-score participation rate 100.0%
Written-review participation rate 42.8%

The Olist CSVs used here contain a review_score for every delivered order. Therefore, review_participation_rate based on score availability is 100%. The more useful measure for the project question is written_review_participation_rate, which checks whether the customer wrote a review title or comment.

4 Missing Values and Distributions

candidate_variables <- c(
  "delivery_time_days",
  "delay_days",
  "is_late",
  "freight_value",
  "order_value",
  "number_of_items",
  "number_of_sellers",
  "number_of_categories",
  "payment_installments",
  "customer_region"
)

missing_summary <- order_level %>%
  summarise(across(all_of(candidate_variables), ~ mean(is.na(.x)) * 100)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "missing_percent") %>%
  arrange(desc(missing_percent))

kable(
  missing_summary %>% mutate(missing_percent = round(missing_percent, 2)),
  caption = "Missing values in candidate clustering variables"
)
Missing values in candidate clustering variables
variable missing_percent
delivery_time_days 0.01
delay_days 0.01
is_late 0.01
freight_value 0.00
order_value 0.00
number_of_items 0.00
number_of_sellers 0.00
number_of_categories 0.00
payment_installments 0.00
customer_region 0.00

The clustering dataset will remove records with missing clustering variables. This mainly removes orders without usable delivery dates or item/payment information. Review fields are not part of this filtering step because reviews are not clustering inputs.

numeric_candidates <- c(
  "delivery_time_days",
  "delay_days",
  "is_late",
  "freight_value",
  "order_value",
  "number_of_items",
  "number_of_sellers",
  "number_of_categories",
  "payment_installments"
)

distribution_summary <- order_level %>%
  summarise(
    across(
      all_of(numeric_candidates),
      list(
        mean = ~ safe_mean(.x),
        sd = ~ sd(.x, na.rm = TRUE),
        min = ~ min(.x, na.rm = TRUE),
        p25 = ~ quantile(.x, 0.25, na.rm = TRUE),
        median = ~ median(.x, na.rm = TRUE),
        p75 = ~ quantile(.x, 0.75, na.rm = TRUE),
        p95 = ~ quantile(.x, 0.95, na.rm = TRUE),
        p99 = ~ quantile(.x, 0.99, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  ) %>%
  pivot_longer(everything()) %>%
  separate(name, into = c("variable", "statistic"), sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = statistic, values_from = value)

kable(
  distribution_summary %>%
    mutate(across(where(is.numeric), ~ round(.x, 2))),
  caption = "Distribution summary for numerical clustering variables"
)
Distribution summary for numerical clustering variables
variable mean sd min p25 median p75 p95 p99 max
delivery_time_days 12.50 9.56 0.00 7.00 10.00 16.00 29.00 46.00 210.00
delay_days -11.88 10.18 -147.00 -17.00 -12.00 -7.00 3.00 18.00 188.00
is_late 0.07 0.25 0.00 0.00 0.00 0.00 1.00 1.00 1.00
freight_value 22.79 21.56 0.00 13.85 17.17 24.02 54.78 104.24 1794.96
order_value 137.04 209.05 0.85 45.90 86.57 149.90 399.00 990.00 13440.00
number_of_items 1.14 0.54 1.00 1.00 1.00 1.00 2.00 3.00 21.00
number_of_sellers 1.01 0.12 1.00 1.00 1.00 1.00 1.00 2.00 5.00
number_of_categories 1.01 0.09 1.00 1.00 1.00 1.00 1.00 1.00 3.00
payment_installments 2.93 2.71 0.00 1.00 2.00 4.00 10.00 10.00 24.00

The monetary and count variables are typically right-skewed in marketplace data: most orders are small, while a minority contain expensive products, high freight, or multiple items. These variables are transformed with log1p before standardization. delay_days can be negative when orders arrive early, so it is kept in its original scale.

order_level %>%
  select(all_of(numeric_candidates)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 40, fill = blue, color = "white", na.rm = TRUE) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  labs(
    title = "Distributions of Candidate Numerical Variables",
    x = NULL,
    y = "Number of orders"
  )

The histograms confirm that raw scale differences and skew would distort distance-based clustering. Transforming and standardizing the variables ensures that the k-means algorithm is not dominated by high-value orders or freight outliers.

5 Preprocessing for Clustering

cluster_base <- order_level %>%
  select(
    order_id,
    review_score,
    has_review,
    has_written_review,
    negative_review,
    all_of(candidate_variables)
  ) %>%
  drop_na(all_of(candidate_variables)) %>%
  mutate(
    log_delivery_time_days = log1p(delivery_time_days),
    log_freight_value = log1p(freight_value),
    log_order_value = log1p(order_value),
    log_number_of_items = log1p(number_of_items),
    log_number_of_sellers = log1p(number_of_sellers),
    log_number_of_categories = log1p(number_of_categories),
    log_payment_installments = log1p(payment_installments),
    customer_region = factor(customer_region)
  )

region_dummies <- model.matrix(~ customer_region - 1, data = cluster_base) %>%
  as_tibble()

cluster_features <- cluster_base %>%
  select(
    log_delivery_time_days,
    delay_days,
    is_late,
    log_freight_value,
    log_order_value,
    log_number_of_items,
    log_number_of_sellers,
    log_number_of_categories,
    log_payment_installments
  ) %>%
  bind_cols(region_dummies)

cluster_matrix <- scale(cluster_features)

preprocessing_summary <- tibble(
  step = c(
    "Original orders",
    "Delivered orders retained in analytical dataset",
    "Delivered orders retained after removing missing clustering variables",
    "Delivered orders removed before clustering because of missing clustering variables",
    "Variables used by k-means after customer-region one-hot encoding"
  ),
  value = c(
    nrow(orders),
    nrow(order_level),
    nrow(cluster_base),
    nrow(order_level) - nrow(cluster_base),
    ncol(cluster_matrix)
  )
)

kable(preprocessing_summary, caption = "Preprocessing summary")
Preprocessing summary
step value
Original orders 99441
Delivered orders retained in analytical dataset 96478
Delivered orders retained after removing missing clustering variables 96470
Delivered orders removed before clustering because of missing clustering variables 8
Variables used by k-means after customer-region one-hot encoding 14

The final clustering matrix includes transformed numerical variables and one-hot encoded customer regions. Review score, review availability, and negative review indicators remain outside the clustering matrix.

6 Selecting the Number of Clusters

The report evaluates cluster counts from 2 to 10 using two complementary diagnostics:

  • Elbow method: looks for a point where adding more clusters gives diminishing improvement in within-cluster sum of squares.
  • Silhouette score: measures how well-separated clusters are, where higher values indicate cleaner separation.
k_values <- 2:10

kmeans_results <- map(
  k_values,
  ~ kmeans(cluster_matrix, centers = .x, nstart = 25, iter.max = 500, algorithm = "Lloyd")
)

elbow_results <- tibble(
  k = k_values,
  total_withinss = map_dbl(kmeans_results, "tot.withinss")
)

silhouette_sample_size <- min(params$max_silhouette_sample, nrow(cluster_matrix))
silhouette_index <- sample(seq_len(nrow(cluster_matrix)), silhouette_sample_size)
silhouette_dist <- dist(cluster_matrix[silhouette_index, ])

silhouette_results <- map2_dfr(
  k_values,
  kmeans_results,
  ~ {
    sampled_clusters <- .y$cluster[silhouette_index]
    sil <- silhouette(sampled_clusters, silhouette_dist)
    tibble(
      k = .x,
      avg_silhouette = mean(sil[, "sil_width"])
    )
  }
)

k_diagnostics <- elbow_results %>%
  left_join(silhouette_results, by = "k")

kable(
  k_diagnostics %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
  caption = "Cluster-count diagnostics"
)
Cluster-count diagnostics
k total_withinss avg_silhouette
2 1159179.0 0.2822
3 1014057.2 0.2943
4 877397.6 0.3236
5 775890.5 0.3364
6 680835.9 0.3725
7 610959.6 0.2731
8 516260.8 0.3120
9 466867.8 0.2979
10 442077.8 0.2553

The table should be read by balancing fit and interpretability. The elbow method usually favors a compact solution where within-cluster variation stops falling sharply, while silhouette may prefer fewer, more separated clusters.

ggplot(elbow_results, aes(x = k, y = total_withinss)) +
  geom_line(color = blue, linewidth = 1) +
  geom_point(color = blue, size = 2.5) +
  scale_x_continuous(breaks = k_values) +
  labs(
    title = "Elbow Method",
    subtitle = "Lower within-cluster sum of squares indicates tighter clusters",
    x = "Number of clusters",
    y = "Total within-cluster sum of squares"
  )

The elbow plot helps identify a practical number of clusters where additional complexity has diminishing returns. A very large number of clusters may fit the data better mechanically but can be harder to explain operationally.

ggplot(silhouette_results, aes(x = k, y = avg_silhouette)) +
  geom_line(color = orange, linewidth = 1) +
  geom_point(color = orange, size = 2.5) +
  scale_x_continuous(breaks = k_values) +
  labs(
    title = "Average Silhouette Score",
    subtitle = "Higher values indicate better-separated clusters",
    x = "Number of clusters",
    y = "Average silhouette score"
  )

The silhouette plot provides a separation-based check on the elbow result. If the highest silhouette solution is too coarse for business interpretation, the final choice can favor a nearby value that still performs well and produces actionable profiles.

best_silhouette_k <- silhouette_results %>%
  slice_max(avg_silhouette, n = 1, with_ties = FALSE) %>%
  pull(k)

selected_k <- best_silhouette_k

selected_k
## [1] 6

For reproducibility, this report selects the k with the highest average silhouette score. If business stakeholders need more granular segments, the diagnostic table can support choosing a nearby alternative.

7 K-Means Clustering

final_kmeans <- kmeans(
  cluster_matrix,
  centers = selected_k,
  nstart = 50,
  iter.max = 500,
  algorithm = "Lloyd"
)

clustered_orders <- cluster_base %>%
  mutate(cluster_raw = final_kmeans$cluster)

cluster_order <- clustered_orders %>%
  group_by(cluster_raw) %>%
  summarise(
    mean_delay = mean(delay_days, na.rm = TRUE),
    mean_delivery_time = mean(delivery_time_days, na.rm = TRUE),
    late_rate = mean(is_late, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(sort_score = mean_delay + 5 * late_rate + 0.1 * mean_delivery_time) %>%
  arrange(sort_score) %>%
  mutate(cluster = row_number())

clustered_orders <- clustered_orders %>%
  left_join(cluster_order %>% select(cluster_raw, cluster), by = "cluster_raw") %>%
  mutate(cluster = factor(cluster))

cluster_size <- clustered_orders %>%
  count(cluster, name = "orders") %>%
  mutate(share = orders / sum(orders))

kable(
  cluster_size %>% mutate(share = percent(share, accuracy = 0.1)),
  caption = "Cluster sizes"
)
Cluster sizes
cluster orders share
1 1445 1.5%
2 1784 1.8%
3 12785 13.3%
4 12973 13.4%
5 61124 63.4%
6 6359 6.6%

Clusters are relabeled after training so lower-numbered clusters generally represent smoother delivery experiences and higher-numbered clusters represent more delayed or operationally difficult experiences. This relabeling does not change the clustering result; it only makes interpretation easier.

8 PCA Cluster Visualization

pca_fit <- prcomp(cluster_matrix, center = FALSE, scale. = FALSE)

pca_scores <- as_tibble(pca_fit$x[, 1:2]) %>%
  rename(PC1 = 1, PC2 = 2) %>%
  bind_cols(clustered_orders %>% select(order_id, cluster))

pca_variance <- summary(pca_fit)$importance[2, 1:2]

ggplot(pca_scores, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.25, size = 0.7) +
  labs(
    title = "Order Experience Clusters Projected with PCA",
    subtitle = paste0(
      "PC1 explains ", percent(pca_variance[1], accuracy = 0.1),
      "; PC2 explains ", percent(pca_variance[2], accuracy = 0.1),
      " of standardized feature variance"
    ),
    x = "Principal component 1",
    y = "Principal component 2",
    color = "Cluster"
  )

The PCA plot compresses the clustering variables into two dimensions for visualization. Because k-means was fit in the full standardized feature space, some separation may not be fully visible in two dimensions.

9 Cluster Profiles

cluster_profile <- clustered_orders %>%
  group_by(cluster) %>%
  summarise(
    orders = n(),
    share = n() / nrow(clustered_orders),
    delivery_time_days = mean(delivery_time_days, na.rm = TRUE),
    delay_days = mean(delay_days, na.rm = TRUE),
    late_rate = mean(is_late, na.rm = TRUE),
    freight_value = mean(freight_value, na.rm = TRUE),
    order_value = mean(order_value, na.rm = TRUE),
    number_of_items = mean(number_of_items, na.rm = TRUE),
    number_of_sellers = mean(number_of_sellers, na.rm = TRUE),
    number_of_categories = mean(number_of_categories, na.rm = TRUE),
    payment_installments = mean(payment_installments, na.rm = TRUE),
    .groups = "drop"
  )

kable(
  cluster_profile %>%
    mutate(
      share = percent(share, accuracy = 0.1),
      across(where(is.numeric), ~ round(.x, 2))
    ),
  caption = "Cluster profile based on original feature scales"
)
Cluster profile based on original feature scales
cluster orders share delivery_time_days delay_days late_rate freight_value order_value number_of_items number_of_sellers number_of_categories payment_installments
1 1445 1.5% 9.31 -16.49 0.02 45.82 210.05 2.42 1.92 1.55 4.28
2 1784 1.8% 22.21 -15.92 0.08 41.08 181.04 1.11 1.01 1.00 3.30
3 12785 13.3% 12.70 -14.39 0.00 23.91 136.41 1.13 1.00 1.00 2.95
4 12973 13.4% 15.63 -14.37 0.00 31.76 157.57 1.12 1.00 1.00 3.24
5 61124 63.4% 9.38 -12.94 0.00 19.36 128.47 1.12 1.00 1.00 2.80
6 6359 6.6% 33.67 10.65 1.00 24.77 149.87 1.11 1.00 1.00 3.04

This table translates the cluster centroids back into business-readable order characteristics. The most important comparisons are delivery speed, lateness, freight cost, order value, and order complexity.

region_profile <- clustered_orders %>%
  count(cluster, customer_region) %>%
  group_by(cluster) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

ggplot(region_profile, aes(x = cluster, y = share, fill = customer_region)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = percent) +
  labs(
    title = "Customer Region Mix by Cluster",
    x = "Cluster",
    y = "Share of cluster",
    fill = "Customer region"
  )

The region mix shows whether some clusters are partly geographic. If a delayed or expensive-freight cluster is concentrated in a region, the business response may involve logistics routing, fulfillment coverage, or customer communication by region.

cluster_feature_means <- clustered_orders %>%
  group_by(cluster) %>%
  summarise(across(all_of(candidate_variables[-10]), ~ mean(.x, na.rm = TRUE)), .groups = "drop")

cluster_feature_scaled <- cluster_feature_means %>%
  column_to_rownames("cluster") %>%
  scale() %>%
  as_tibble(rownames = "cluster") %>%
  pivot_longer(-cluster, names_to = "variable", values_to = "standardized_mean")

ggplot(cluster_feature_scaled, aes(x = variable, y = cluster, fill = standardized_mean)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = red, mid = "white", high = blue, midpoint = 0) +
  labs(
    title = "Standardized Cluster Profile Heatmap",
    subtitle = "Blue means above-average for that variable; red means below-average",
    x = NULL,
    y = "Cluster",
    fill = "Standardized mean"
  ) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))

The heatmap makes the relative pattern of each cluster easier to see. It is especially useful for identifying whether a cluster is defined mostly by lateness, high freight, high order value, many sellers/categories, or regional composition.

10 Satisfaction Outcomes by Cluster

Review outcomes are added only after clusters have been created. This preserves the unsupervised design and allows the clusters to be interpreted against observed satisfaction.

review_profile <- clustered_orders %>%
  group_by(cluster) %>%
  summarise(
    orders = n(),
    review_participation_rate = mean(has_review, na.rm = TRUE),
    written_review_participation_rate = mean(has_written_review, na.rm = TRUE),
    avg_review_score = mean(review_score, na.rm = TRUE),
    negative_review_rate = mean(negative_review, na.rm = TRUE),
    .groups = "drop"
  )

kable(
  review_profile %>%
    mutate(
      review_participation_rate = percent(review_participation_rate, accuracy = 0.1),
      written_review_participation_rate = percent(written_review_participation_rate, accuracy = 0.1),
      negative_review_rate = percent(negative_review_rate, accuracy = 0.1),
      avg_review_score = round(avg_review_score, 2)
    ),
  caption = "Review outcomes by cluster, not used for clustering"
)
Review outcomes by cluster, not used for clustering
cluster orders review_participation_rate written_review_participation_rate avg_review_score negative_review_rate
1 1445 100.0% 65.7% 2.95 44.9%
2 1784 100.0% 48.0% 4.02 15.2%
3 12785 100.0% 37.2% 4.32 8.5%
4 12973 100.0% 45.2% 4.25 9.5%
5 61124 100.0% 40.9% 4.31 8.9%
6 6359 100.0% 60.8% 2.26 62.7%

This table shows whether operationally different order experiences correspond to different satisfaction outcomes. In this dataset, score-based review participation is 100% for delivered orders, so the written-review participation rate is the better measure of customers who actively wrote feedback.

review_profile %>%
  select(cluster, review_participation_rate, written_review_participation_rate, negative_review_rate, avg_review_score) %>%
  pivot_longer(-cluster, names_to = "metric", values_to = "value") %>%
  mutate(
    metric = recode(
      metric,
      review_participation_rate = "Review participation rate",
      written_review_participation_rate = "Written-review participation rate",
      negative_review_rate = "Negative review rate",
      avg_review_score = "Average review score"
    )
  ) %>%
  ggplot(aes(x = cluster, y = value, fill = metric)) +
  geom_col(position = "dodge") +
  facet_wrap(~ metric, scales = "free_y", ncol = 1) +
  guides(fill = "none") +
  labs(
    title = "Satisfaction Metrics by Order Experience Cluster",
    x = "Cluster",
    y = NULL
  )

The plot compares satisfaction outcomes across clusters. If a cluster has many non-reviewers but resembles low-review-score clusters operationally, those silent customers may still represent hidden dissatisfaction risk.

11 Business Interpretation

overall_profile <- clustered_orders %>%
  summarise(
    delivery_time_days = mean(delivery_time_days, na.rm = TRUE),
    delay_days = mean(delay_days, na.rm = TRUE),
    late_rate = mean(is_late, na.rm = TRUE),
    freight_value = mean(freight_value, na.rm = TRUE),
    order_value = mean(order_value, na.rm = TRUE),
    number_of_items = mean(number_of_items, na.rm = TRUE),
    number_of_sellers = mean(number_of_sellers, na.rm = TRUE),
    number_of_categories = mean(number_of_categories, na.rm = TRUE),
    payment_installments = mean(payment_installments, na.rm = TRUE)
  )

interpretation_table <- cluster_profile %>%
  left_join(review_profile, by = c("cluster", "orders")) %>%
  mutate(
    delivery_signal = case_when(
      delay_days > overall_profile$delay_days + 2 | late_rate > overall_profile$late_rate + 0.05 ~ "late / delayed experience",
      delivery_time_days < overall_profile$delivery_time_days - 2 & late_rate < overall_profile$late_rate ~ "fast / reliable delivery",
      TRUE ~ "typical delivery"
    ),
    cost_signal = case_when(
      freight_value > overall_profile$freight_value & order_value > overall_profile$order_value ~ "high-value and high-freight orders",
      freight_value > overall_profile$freight_value ~ "freight-heavy orders",
      order_value > overall_profile$order_value ~ "high-value orders",
      TRUE ~ "lower-cost orders"
    ),
    complexity_signal = case_when(
      number_of_sellers > overall_profile$number_of_sellers |
        number_of_categories > overall_profile$number_of_categories |
        number_of_items > overall_profile$number_of_items ~ "more complex baskets",
      TRUE ~ "simple baskets"
    ),
    satisfaction_signal = case_when(
      negative_review_rate > mean(review_profile$negative_review_rate, na.rm = TRUE) ~ "higher dissatisfaction risk",
      avg_review_score > mean(review_profile$avg_review_score, na.rm = TRUE) ~ "stronger observed satisfaction",
      TRUE ~ "average observed satisfaction"
    ),
    business_interpretation = paste(
      str_to_sentence(delivery_signal),
      cost_signal,
      complexity_signal,
      satisfaction_signal,
      sep = "; "
    )
  ) %>%
  select(cluster, business_interpretation)

kable(interpretation_table, caption = "Business interpretation of each cluster")
Business interpretation of each cluster
cluster business_interpretation
1 Fast / reliable delivery; high-value and high-freight orders; more complex baskets; higher dissatisfaction risk
2 Typical delivery; high-value and high-freight orders; simple baskets; stronger observed satisfaction
3 Typical delivery; freight-heavy orders; simple baskets; stronger observed satisfaction
4 Typical delivery; high-value and high-freight orders; simple baskets; stronger observed satisfaction
5 Fast / reliable delivery; lower-cost orders; simple baskets; stronger observed satisfaction
6 Late / delayed experience; high-value and high-freight orders; simple baskets; higher dissatisfaction risk

The interpretation table converts the statistical cluster profiles into business language. These labels should be reviewed with the profile tables above because the exact meaning of each cluster depends on the fitted solution and the selected number of clusters.

12 Final Business Insights

The clusters can support satisfaction management in three ways:

  1. Identify hidden dissatisfaction among non-reviewers. Orders without reviews can be assigned to the same experience clusters as reviewed orders. If a non-reviewed order belongs to a cluster with high negative-review rates, it may represent silent dissatisfaction.
  2. Separate delivery problems from order complexity. A low-satisfaction cluster driven by lateness needs a different business response than a cluster driven by high freight, large baskets, or multiple sellers.
  3. Improve targeting. Customer support, delivery communication, seller monitoring, and logistics improvements can be prioritized for clusters with the weakest satisfaction outcomes.
  4. Preserve interpretability. The features are directly understandable by marketplace teams: delivery time, delay, lateness, freight, order size, seller/category complexity, installments, and region.

13 Limitations

This clustering analysis is descriptive and unsupervised. It does not prove that a feature causes dissatisfaction. K-means also assumes roughly spherical clusters in standardized numeric space, so it may simplify complex marketplace behavior. Customer region is included through one-hot indicators, but the analysis does not directly model geographic distance. Finally, review outcomes may be biased because customers who leave reviews can differ systematically from customers who remain silent.

The strongest next step is to use these clusters as interpretable experience segments, then compare them with review text, repeat purchase behavior, delivery distance, and seller performance in a follow-up supervised or causal analysis.