This report uses unsupervised learning to identify order experience clusters in the Olist marketplace. The clusters are created from delivery, freight, order complexity, payment, and customer geography variables. Review outcomes are deliberately excluded from clustering and are used only after clustering to understand how each experience profile relates to customer satisfaction.
The analysis answers a practical customer satisfaction question: which types of order experiences look risky or positive even when customers do not leave a review?
The report loads the required Olist CSV files from the project folder and builds one row per order. One-to-many tables, such as order items and payments, are aggregated before joining to avoid duplicated orders and inflated totals.
data_dir <- params$data_dir
files <- c(
orders = "olist_orders_dataset.csv",
reviews = "olist_order_reviews_dataset.csv",
customers = "olist_customers_dataset.csv",
items = "olist_order_items_dataset.csv",
products = "olist_products_dataset.csv",
sellers = "olist_sellers_dataset.csv",
payments = "olist_order_payments_dataset.csv",
translation = "product_category_name_translation.csv"
)
missing_files <- files[!file.exists(file.path(data_dir, files))]
if (length(missing_files) > 0) {
stop("Missing required file(s): ", paste(missing_files, collapse = ", "))
}
orders <- read_csv(file.path(data_dir, files[["orders"]]), show_col_types = FALSE)
reviews <- read_csv(file.path(data_dir, files[["reviews"]]), show_col_types = FALSE)
customers <- read_csv(file.path(data_dir, files[["customers"]]), show_col_types = FALSE)
items <- read_csv(file.path(data_dir, files[["items"]]), show_col_types = FALSE)
products <- read_csv(file.path(data_dir, files[["products"]]), show_col_types = FALSE)
sellers <- read_csv(file.path(data_dir, files[["sellers"]]), show_col_types = FALSE)
payments <- read_csv(file.path(data_dir, files[["payments"]]), show_col_types = FALSE)
translation <- read_csv(file.path(data_dir, files[["translation"]]), show_col_types = FALSE)
The raw tables contain order status and timestamps, review scores, customer state, product and seller information, freight charges, and payment behavior. These sources are sufficient to describe the order experience without using review data as a clustering input.
Customer state is mapped into a Brazilian macro-region. This reduces the geographic feature from 27 states to a smaller and more interpretable set of regional indicators.
state_region_lookup <- tibble(
customer_state = c(
"AC", "AP", "AM", "PA", "RO", "RR", "TO",
"AL", "BA", "CE", "MA", "PB", "PE", "PI", "RN", "SE",
"DF", "GO", "MT", "MS",
"ES", "MG", "RJ", "SP",
"PR", "RS", "SC"
),
customer_region = c(
rep("North", 7),
rep("Northeast", 9),
rep("Central-West", 4),
rep("Southeast", 4),
rep("South", 3)
)
)
Using region rather than raw state avoids giving k-means too many sparse location indicators while still retaining broad geographic context.
timestamp_fields <- c(
"order_purchase_timestamp",
"order_approved_at",
"order_delivered_carrier_date",
"order_delivered_customer_date",
"order_estimated_delivery_date"
)
orders_clean <- orders %>%
mutate(across(all_of(timestamp_fields), parse_olist_datetime)) %>%
filter(order_status == "delivered")
reviews_order <- reviews %>%
mutate(
review_creation_date = parse_olist_datetime(review_creation_date),
review_answer_timestamp = parse_olist_datetime(review_answer_timestamp),
has_written_review = if_else(
!is.na(review_comment_title) | !is.na(review_comment_message),
TRUE,
FALSE
)
) %>%
group_by(order_id) %>%
arrange(desc(review_answer_timestamp), desc(review_creation_date), .by_group = TRUE) %>%
summarise(
review_score = first(review_score),
review_count = n(),
has_written_review = any(has_written_review),
.groups = "drop"
)
products_enriched <- products %>%
left_join(translation, by = "product_category_name") %>%
mutate(
product_category = coalesce(product_category_name_english, product_category_name, "unknown")
)
items_enriched <- items %>%
left_join(products_enriched %>% select(product_id, product_category), by = "product_id") %>%
left_join(sellers %>% select(seller_id, seller_state), by = "seller_id")
items_order <- items_enriched %>%
group_by(order_id) %>%
summarise(
number_of_items = n(),
number_of_sellers = n_distinct(seller_id),
number_of_categories = n_distinct(product_category),
order_value = sum(price, na.rm = TRUE),
freight_value = sum(freight_value, na.rm = TRUE),
dominant_seller_state = mode_value(seller_state),
.groups = "drop"
)
payments_order <- payments %>%
group_by(order_id) %>%
summarise(
payment_installments = max(payment_installments, na.rm = TRUE),
payment_count = n(),
payment_value = sum(payment_value, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
payment_installments = if_else(is.infinite(payment_installments), NA_real_, payment_installments)
)
order_level <- orders_clean %>%
left_join(customers %>% select(customer_id, customer_state), by = "customer_id") %>%
left_join(state_region_lookup, by = "customer_state") %>%
left_join(items_order, by = "order_id") %>%
left_join(payments_order, by = "order_id") %>%
left_join(reviews_order, by = "order_id") %>%
mutate(
delivery_time_days = as.numeric(as.Date(order_delivered_customer_date) -
as.Date(order_purchase_timestamp)),
delay_days = as.numeric(as.Date(order_delivered_customer_date) -
as.Date(order_estimated_delivery_date)),
is_late = if_else(!is.na(delay_days), as.integer(delay_days > 0), NA_integer_),
has_review = !is.na(review_score),
has_written_review = coalesce(has_written_review, FALSE),
payment_installments = coalesce(payment_installments, 0),
negative_review = if_else(has_review, as.integer(review_score <= 2), NA_integer_)
)
glimpse(order_level)
## Rows: 96,478
## Columns: 27
## $ order_id <chr> "e481f51cbdc54678b7cc49136f2d6af7", "53c…
## $ customer_id <chr> "9ef432eb6251297304e76186b10a928d", "b08…
## $ order_status <chr> "delivered", "delivered", "delivered", "…
## $ order_purchase_timestamp <dttm> 2017-10-02 10:56:33, 2018-07-24 20:41:3…
## $ order_approved_at <dttm> 2017-10-02 11:07:15, 2018-07-26 03:24:2…
## $ order_delivered_carrier_date <dttm> 2017-10-04 19:55:00, 2018-07-26 14:31:0…
## $ order_delivered_customer_date <dttm> 2017-10-10 21:25:13, 2018-08-07 15:27:4…
## $ order_estimated_delivery_date <dttm> 2017-10-18, 2018-08-13, 2018-09-04, 201…
## $ customer_state <chr> "SP", "BA", "GO", "RN", "SP", "PR", "RJ"…
## $ customer_region <chr> "Southeast", "Northeast", "Central-West"…
## $ number_of_items <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1…
## $ number_of_sellers <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ number_of_categories <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ order_value <dbl> 29.99, 118.70, 159.90, 45.00, 19.90, 147…
## $ freight_value <dbl> 8.72, 22.76, 19.22, 27.20, 8.72, 27.36, …
## $ dominant_seller_state <chr> "SP", "SP", "SP", "MG", "SP", "SP", "SP"…
## $ payment_installments <dbl> 1, 1, 3, 1, 1, 6, 3, 1, 1, 1, 1, 1, 3, 1…
## $ payment_count <int> 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1…
## $ payment_value <dbl> 38.71, 141.46, 179.12, 72.20, 28.62, 175…
## $ review_score <dbl> 4, 4, 5, 5, 5, 4, 5, 1, 5, 1, 4, 5, 5, 4…
## $ review_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ has_written_review <lgl> TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, F…
## $ delivery_time_days <dbl> 8, 14, 9, 14, 3, 17, 10, 10, 18, 13, 6, …
## $ delay_days <dbl> -8, -6, -18, -13, -10, -6, -12, -32, -7,…
## $ is_late <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ has_review <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
## $ negative_review <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0…
This dataset has one row per delivered order.
order_value is based on item prices, while
freight_value is kept separately so clusters can
distinguish product cost from shipping cost. Review fields are joined
with a left_join, so orders are not removed because of
missing review information.
review_audit <- tibble(
metric = c(
"Delivered orders in analytical dataset",
"Delivered orders without review_score",
"Delivered orders without written title/message",
"Review-score participation rate",
"Written-review participation rate"
),
value = c(
nrow(order_level),
sum(!order_level$has_review),
sum(!order_level$has_written_review),
mean(order_level$has_review),
mean(order_level$has_written_review)
)
)
kable(
review_audit %>%
mutate(
value = if_else(
str_detect(metric, "rate"),
percent(as.numeric(value), accuracy = 0.1),
comma(as.numeric(value))
)
),
caption = "Audit of review availability among delivered orders"
)
| metric | value |
|---|---|
| Delivered orders in analytical dataset | 96,478.00 |
| Delivered orders without review_score | 0.00 |
| Delivered orders without written title/message | 55,178.00 |
| Review-score participation rate | 100.0% |
| Written-review participation rate | 42.8% |
The Olist CSVs used here contain a review_score for
every delivered order. Therefore, review_participation_rate
based on score availability is 100%. The more useful measure for the
project question is written_review_participation_rate,
which checks whether the customer wrote a review title or comment.
candidate_variables <- c(
"delivery_time_days",
"delay_days",
"is_late",
"freight_value",
"order_value",
"number_of_items",
"number_of_sellers",
"number_of_categories",
"payment_installments",
"customer_region"
)
missing_summary <- order_level %>%
summarise(across(all_of(candidate_variables), ~ mean(is.na(.x)) * 100)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "missing_percent") %>%
arrange(desc(missing_percent))
kable(
missing_summary %>% mutate(missing_percent = round(missing_percent, 2)),
caption = "Missing values in candidate clustering variables"
)
| variable | missing_percent |
|---|---|
| delivery_time_days | 0.01 |
| delay_days | 0.01 |
| is_late | 0.01 |
| freight_value | 0.00 |
| order_value | 0.00 |
| number_of_items | 0.00 |
| number_of_sellers | 0.00 |
| number_of_categories | 0.00 |
| payment_installments | 0.00 |
| customer_region | 0.00 |
The clustering dataset will remove records with missing clustering variables. This mainly removes orders without usable delivery dates or item/payment information. Review fields are not part of this filtering step because reviews are not clustering inputs.
numeric_candidates <- c(
"delivery_time_days",
"delay_days",
"is_late",
"freight_value",
"order_value",
"number_of_items",
"number_of_sellers",
"number_of_categories",
"payment_installments"
)
distribution_summary <- order_level %>%
summarise(
across(
all_of(numeric_candidates),
list(
mean = ~ safe_mean(.x),
sd = ~ sd(.x, na.rm = TRUE),
min = ~ min(.x, na.rm = TRUE),
p25 = ~ quantile(.x, 0.25, na.rm = TRUE),
median = ~ median(.x, na.rm = TRUE),
p75 = ~ quantile(.x, 0.75, na.rm = TRUE),
p95 = ~ quantile(.x, 0.95, na.rm = TRUE),
p99 = ~ quantile(.x, 0.99, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)
) %>%
pivot_longer(everything()) %>%
separate(name, into = c("variable", "statistic"), sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = statistic, values_from = value)
kable(
distribution_summary %>%
mutate(across(where(is.numeric), ~ round(.x, 2))),
caption = "Distribution summary for numerical clustering variables"
)
| variable | mean | sd | min | p25 | median | p75 | p95 | p99 | max |
|---|---|---|---|---|---|---|---|---|---|
| delivery_time_days | 12.50 | 9.56 | 0.00 | 7.00 | 10.00 | 16.00 | 29.00 | 46.00 | 210.00 |
| delay_days | -11.88 | 10.18 | -147.00 | -17.00 | -12.00 | -7.00 | 3.00 | 18.00 | 188.00 |
| is_late | 0.07 | 0.25 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| freight_value | 22.79 | 21.56 | 0.00 | 13.85 | 17.17 | 24.02 | 54.78 | 104.24 | 1794.96 |
| order_value | 137.04 | 209.05 | 0.85 | 45.90 | 86.57 | 149.90 | 399.00 | 990.00 | 13440.00 |
| number_of_items | 1.14 | 0.54 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | 3.00 | 21.00 |
| number_of_sellers | 1.01 | 0.12 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 2.00 | 5.00 |
| number_of_categories | 1.01 | 0.09 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 3.00 |
| payment_installments | 2.93 | 2.71 | 0.00 | 1.00 | 2.00 | 4.00 | 10.00 | 10.00 | 24.00 |
The monetary and count variables are typically right-skewed in
marketplace data: most orders are small, while a minority contain
expensive products, high freight, or multiple items. These variables are
transformed with log1p before standardization.
delay_days can be negative when orders arrive early, so it
is kept in its original scale.
order_level %>%
select(all_of(numeric_candidates)) %>%
pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 40, fill = blue, color = "white", na.rm = TRUE) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
labs(
title = "Distributions of Candidate Numerical Variables",
x = NULL,
y = "Number of orders"
)
The histograms confirm that raw scale differences and skew would distort distance-based clustering. Transforming and standardizing the variables ensures that the k-means algorithm is not dominated by high-value orders or freight outliers.
cluster_base <- order_level %>%
select(
order_id,
review_score,
has_review,
has_written_review,
negative_review,
all_of(candidate_variables)
) %>%
drop_na(all_of(candidate_variables)) %>%
mutate(
log_delivery_time_days = log1p(delivery_time_days),
log_freight_value = log1p(freight_value),
log_order_value = log1p(order_value),
log_number_of_items = log1p(number_of_items),
log_number_of_sellers = log1p(number_of_sellers),
log_number_of_categories = log1p(number_of_categories),
log_payment_installments = log1p(payment_installments),
customer_region = factor(customer_region)
)
region_dummies <- model.matrix(~ customer_region - 1, data = cluster_base) %>%
as_tibble()
cluster_features <- cluster_base %>%
select(
log_delivery_time_days,
delay_days,
is_late,
log_freight_value,
log_order_value,
log_number_of_items,
log_number_of_sellers,
log_number_of_categories,
log_payment_installments
) %>%
bind_cols(region_dummies)
cluster_matrix <- scale(cluster_features)
preprocessing_summary <- tibble(
step = c(
"Original orders",
"Delivered orders retained in analytical dataset",
"Delivered orders retained after removing missing clustering variables",
"Delivered orders removed before clustering because of missing clustering variables",
"Variables used by k-means after customer-region one-hot encoding"
),
value = c(
nrow(orders),
nrow(order_level),
nrow(cluster_base),
nrow(order_level) - nrow(cluster_base),
ncol(cluster_matrix)
)
)
kable(preprocessing_summary, caption = "Preprocessing summary")
| step | value |
|---|---|
| Original orders | 99441 |
| Delivered orders retained in analytical dataset | 96478 |
| Delivered orders retained after removing missing clustering variables | 96470 |
| Delivered orders removed before clustering because of missing clustering variables | 8 |
| Variables used by k-means after customer-region one-hot encoding | 14 |
The final clustering matrix includes transformed numerical variables and one-hot encoded customer regions. Review score, review availability, and negative review indicators remain outside the clustering matrix.
The report evaluates cluster counts from 2 to 10 using two complementary diagnostics:
k_values <- 2:10
kmeans_results <- map(
k_values,
~ kmeans(cluster_matrix, centers = .x, nstart = 25, iter.max = 500, algorithm = "Lloyd")
)
elbow_results <- tibble(
k = k_values,
total_withinss = map_dbl(kmeans_results, "tot.withinss")
)
silhouette_sample_size <- min(params$max_silhouette_sample, nrow(cluster_matrix))
silhouette_index <- sample(seq_len(nrow(cluster_matrix)), silhouette_sample_size)
silhouette_dist <- dist(cluster_matrix[silhouette_index, ])
silhouette_results <- map2_dfr(
k_values,
kmeans_results,
~ {
sampled_clusters <- .y$cluster[silhouette_index]
sil <- silhouette(sampled_clusters, silhouette_dist)
tibble(
k = .x,
avg_silhouette = mean(sil[, "sil_width"])
)
}
)
k_diagnostics <- elbow_results %>%
left_join(silhouette_results, by = "k")
kable(
k_diagnostics %>% mutate(across(where(is.numeric), ~ round(.x, 4))),
caption = "Cluster-count diagnostics"
)
| k | total_withinss | avg_silhouette |
|---|---|---|
| 2 | 1159179.0 | 0.2822 |
| 3 | 1014057.2 | 0.2943 |
| 4 | 877397.6 | 0.3236 |
| 5 | 775890.5 | 0.3364 |
| 6 | 680835.9 | 0.3725 |
| 7 | 610959.6 | 0.2731 |
| 8 | 516260.8 | 0.3120 |
| 9 | 466867.8 | 0.2979 |
| 10 | 442077.8 | 0.2553 |
The table should be read by balancing fit and interpretability. The elbow method usually favors a compact solution where within-cluster variation stops falling sharply, while silhouette may prefer fewer, more separated clusters.
ggplot(elbow_results, aes(x = k, y = total_withinss)) +
geom_line(color = blue, linewidth = 1) +
geom_point(color = blue, size = 2.5) +
scale_x_continuous(breaks = k_values) +
labs(
title = "Elbow Method",
subtitle = "Lower within-cluster sum of squares indicates tighter clusters",
x = "Number of clusters",
y = "Total within-cluster sum of squares"
)
The elbow plot helps identify a practical number of clusters where additional complexity has diminishing returns. A very large number of clusters may fit the data better mechanically but can be harder to explain operationally.
ggplot(silhouette_results, aes(x = k, y = avg_silhouette)) +
geom_line(color = orange, linewidth = 1) +
geom_point(color = orange, size = 2.5) +
scale_x_continuous(breaks = k_values) +
labs(
title = "Average Silhouette Score",
subtitle = "Higher values indicate better-separated clusters",
x = "Number of clusters",
y = "Average silhouette score"
)
The silhouette plot provides a separation-based check on the elbow result. If the highest silhouette solution is too coarse for business interpretation, the final choice can favor a nearby value that still performs well and produces actionable profiles.
best_silhouette_k <- silhouette_results %>%
slice_max(avg_silhouette, n = 1, with_ties = FALSE) %>%
pull(k)
selected_k <- best_silhouette_k
selected_k
## [1] 6
For reproducibility, this report selects the k with the
highest average silhouette score. If business stakeholders need more
granular segments, the diagnostic table can support choosing a nearby
alternative.
final_kmeans <- kmeans(
cluster_matrix,
centers = selected_k,
nstart = 50,
iter.max = 500,
algorithm = "Lloyd"
)
clustered_orders <- cluster_base %>%
mutate(cluster_raw = final_kmeans$cluster)
cluster_order <- clustered_orders %>%
group_by(cluster_raw) %>%
summarise(
mean_delay = mean(delay_days, na.rm = TRUE),
mean_delivery_time = mean(delivery_time_days, na.rm = TRUE),
late_rate = mean(is_late, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(sort_score = mean_delay + 5 * late_rate + 0.1 * mean_delivery_time) %>%
arrange(sort_score) %>%
mutate(cluster = row_number())
clustered_orders <- clustered_orders %>%
left_join(cluster_order %>% select(cluster_raw, cluster), by = "cluster_raw") %>%
mutate(cluster = factor(cluster))
cluster_size <- clustered_orders %>%
count(cluster, name = "orders") %>%
mutate(share = orders / sum(orders))
kable(
cluster_size %>% mutate(share = percent(share, accuracy = 0.1)),
caption = "Cluster sizes"
)
| cluster | orders | share |
|---|---|---|
| 1 | 1445 | 1.5% |
| 2 | 1784 | 1.8% |
| 3 | 12785 | 13.3% |
| 4 | 12973 | 13.4% |
| 5 | 61124 | 63.4% |
| 6 | 6359 | 6.6% |
Clusters are relabeled after training so lower-numbered clusters generally represent smoother delivery experiences and higher-numbered clusters represent more delayed or operationally difficult experiences. This relabeling does not change the clustering result; it only makes interpretation easier.
pca_fit <- prcomp(cluster_matrix, center = FALSE, scale. = FALSE)
pca_scores <- as_tibble(pca_fit$x[, 1:2]) %>%
rename(PC1 = 1, PC2 = 2) %>%
bind_cols(clustered_orders %>% select(order_id, cluster))
pca_variance <- summary(pca_fit)$importance[2, 1:2]
ggplot(pca_scores, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.25, size = 0.7) +
labs(
title = "Order Experience Clusters Projected with PCA",
subtitle = paste0(
"PC1 explains ", percent(pca_variance[1], accuracy = 0.1),
"; PC2 explains ", percent(pca_variance[2], accuracy = 0.1),
" of standardized feature variance"
),
x = "Principal component 1",
y = "Principal component 2",
color = "Cluster"
)
The PCA plot compresses the clustering variables into two dimensions for visualization. Because k-means was fit in the full standardized feature space, some separation may not be fully visible in two dimensions.
cluster_profile <- clustered_orders %>%
group_by(cluster) %>%
summarise(
orders = n(),
share = n() / nrow(clustered_orders),
delivery_time_days = mean(delivery_time_days, na.rm = TRUE),
delay_days = mean(delay_days, na.rm = TRUE),
late_rate = mean(is_late, na.rm = TRUE),
freight_value = mean(freight_value, na.rm = TRUE),
order_value = mean(order_value, na.rm = TRUE),
number_of_items = mean(number_of_items, na.rm = TRUE),
number_of_sellers = mean(number_of_sellers, na.rm = TRUE),
number_of_categories = mean(number_of_categories, na.rm = TRUE),
payment_installments = mean(payment_installments, na.rm = TRUE),
.groups = "drop"
)
kable(
cluster_profile %>%
mutate(
share = percent(share, accuracy = 0.1),
across(where(is.numeric), ~ round(.x, 2))
),
caption = "Cluster profile based on original feature scales"
)
| cluster | orders | share | delivery_time_days | delay_days | late_rate | freight_value | order_value | number_of_items | number_of_sellers | number_of_categories | payment_installments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1445 | 1.5% | 9.31 | -16.49 | 0.02 | 45.82 | 210.05 | 2.42 | 1.92 | 1.55 | 4.28 |
| 2 | 1784 | 1.8% | 22.21 | -15.92 | 0.08 | 41.08 | 181.04 | 1.11 | 1.01 | 1.00 | 3.30 |
| 3 | 12785 | 13.3% | 12.70 | -14.39 | 0.00 | 23.91 | 136.41 | 1.13 | 1.00 | 1.00 | 2.95 |
| 4 | 12973 | 13.4% | 15.63 | -14.37 | 0.00 | 31.76 | 157.57 | 1.12 | 1.00 | 1.00 | 3.24 |
| 5 | 61124 | 63.4% | 9.38 | -12.94 | 0.00 | 19.36 | 128.47 | 1.12 | 1.00 | 1.00 | 2.80 |
| 6 | 6359 | 6.6% | 33.67 | 10.65 | 1.00 | 24.77 | 149.87 | 1.11 | 1.00 | 1.00 | 3.04 |
This table translates the cluster centroids back into business-readable order characteristics. The most important comparisons are delivery speed, lateness, freight cost, order value, and order complexity.
region_profile <- clustered_orders %>%
count(cluster, customer_region) %>%
group_by(cluster) %>%
mutate(share = n / sum(n)) %>%
ungroup()
ggplot(region_profile, aes(x = cluster, y = share, fill = customer_region)) +
geom_col(position = "fill") +
scale_y_continuous(labels = percent) +
labs(
title = "Customer Region Mix by Cluster",
x = "Cluster",
y = "Share of cluster",
fill = "Customer region"
)
The region mix shows whether some clusters are partly geographic. If a delayed or expensive-freight cluster is concentrated in a region, the business response may involve logistics routing, fulfillment coverage, or customer communication by region.
cluster_feature_means <- clustered_orders %>%
group_by(cluster) %>%
summarise(across(all_of(candidate_variables[-10]), ~ mean(.x, na.rm = TRUE)), .groups = "drop")
cluster_feature_scaled <- cluster_feature_means %>%
column_to_rownames("cluster") %>%
scale() %>%
as_tibble(rownames = "cluster") %>%
pivot_longer(-cluster, names_to = "variable", values_to = "standardized_mean")
ggplot(cluster_feature_scaled, aes(x = variable, y = cluster, fill = standardized_mean)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = red, mid = "white", high = blue, midpoint = 0) +
labs(
title = "Standardized Cluster Profile Heatmap",
subtitle = "Blue means above-average for that variable; red means below-average",
x = NULL,
y = "Cluster",
fill = "Standardized mean"
) +
theme(axis.text.x = element_text(angle = 35, hjust = 1))
The heatmap makes the relative pattern of each cluster easier to see. It is especially useful for identifying whether a cluster is defined mostly by lateness, high freight, high order value, many sellers/categories, or regional composition.
Review outcomes are added only after clusters have been created. This preserves the unsupervised design and allows the clusters to be interpreted against observed satisfaction.
review_profile <- clustered_orders %>%
group_by(cluster) %>%
summarise(
orders = n(),
review_participation_rate = mean(has_review, na.rm = TRUE),
written_review_participation_rate = mean(has_written_review, na.rm = TRUE),
avg_review_score = mean(review_score, na.rm = TRUE),
negative_review_rate = mean(negative_review, na.rm = TRUE),
.groups = "drop"
)
kable(
review_profile %>%
mutate(
review_participation_rate = percent(review_participation_rate, accuracy = 0.1),
written_review_participation_rate = percent(written_review_participation_rate, accuracy = 0.1),
negative_review_rate = percent(negative_review_rate, accuracy = 0.1),
avg_review_score = round(avg_review_score, 2)
),
caption = "Review outcomes by cluster, not used for clustering"
)
| cluster | orders | review_participation_rate | written_review_participation_rate | avg_review_score | negative_review_rate |
|---|---|---|---|---|---|
| 1 | 1445 | 100.0% | 65.7% | 2.95 | 44.9% |
| 2 | 1784 | 100.0% | 48.0% | 4.02 | 15.2% |
| 3 | 12785 | 100.0% | 37.2% | 4.32 | 8.5% |
| 4 | 12973 | 100.0% | 45.2% | 4.25 | 9.5% |
| 5 | 61124 | 100.0% | 40.9% | 4.31 | 8.9% |
| 6 | 6359 | 100.0% | 60.8% | 2.26 | 62.7% |
This table shows whether operationally different order experiences correspond to different satisfaction outcomes. In this dataset, score-based review participation is 100% for delivered orders, so the written-review participation rate is the better measure of customers who actively wrote feedback.
review_profile %>%
select(cluster, review_participation_rate, written_review_participation_rate, negative_review_rate, avg_review_score) %>%
pivot_longer(-cluster, names_to = "metric", values_to = "value") %>%
mutate(
metric = recode(
metric,
review_participation_rate = "Review participation rate",
written_review_participation_rate = "Written-review participation rate",
negative_review_rate = "Negative review rate",
avg_review_score = "Average review score"
)
) %>%
ggplot(aes(x = cluster, y = value, fill = metric)) +
geom_col(position = "dodge") +
facet_wrap(~ metric, scales = "free_y", ncol = 1) +
guides(fill = "none") +
labs(
title = "Satisfaction Metrics by Order Experience Cluster",
x = "Cluster",
y = NULL
)
The plot compares satisfaction outcomes across clusters. If a cluster has many non-reviewers but resembles low-review-score clusters operationally, those silent customers may still represent hidden dissatisfaction risk.
overall_profile <- clustered_orders %>%
summarise(
delivery_time_days = mean(delivery_time_days, na.rm = TRUE),
delay_days = mean(delay_days, na.rm = TRUE),
late_rate = mean(is_late, na.rm = TRUE),
freight_value = mean(freight_value, na.rm = TRUE),
order_value = mean(order_value, na.rm = TRUE),
number_of_items = mean(number_of_items, na.rm = TRUE),
number_of_sellers = mean(number_of_sellers, na.rm = TRUE),
number_of_categories = mean(number_of_categories, na.rm = TRUE),
payment_installments = mean(payment_installments, na.rm = TRUE)
)
interpretation_table <- cluster_profile %>%
left_join(review_profile, by = c("cluster", "orders")) %>%
mutate(
delivery_signal = case_when(
delay_days > overall_profile$delay_days + 2 | late_rate > overall_profile$late_rate + 0.05 ~ "late / delayed experience",
delivery_time_days < overall_profile$delivery_time_days - 2 & late_rate < overall_profile$late_rate ~ "fast / reliable delivery",
TRUE ~ "typical delivery"
),
cost_signal = case_when(
freight_value > overall_profile$freight_value & order_value > overall_profile$order_value ~ "high-value and high-freight orders",
freight_value > overall_profile$freight_value ~ "freight-heavy orders",
order_value > overall_profile$order_value ~ "high-value orders",
TRUE ~ "lower-cost orders"
),
complexity_signal = case_when(
number_of_sellers > overall_profile$number_of_sellers |
number_of_categories > overall_profile$number_of_categories |
number_of_items > overall_profile$number_of_items ~ "more complex baskets",
TRUE ~ "simple baskets"
),
satisfaction_signal = case_when(
negative_review_rate > mean(review_profile$negative_review_rate, na.rm = TRUE) ~ "higher dissatisfaction risk",
avg_review_score > mean(review_profile$avg_review_score, na.rm = TRUE) ~ "stronger observed satisfaction",
TRUE ~ "average observed satisfaction"
),
business_interpretation = paste(
str_to_sentence(delivery_signal),
cost_signal,
complexity_signal,
satisfaction_signal,
sep = "; "
)
) %>%
select(cluster, business_interpretation)
kable(interpretation_table, caption = "Business interpretation of each cluster")
| cluster | business_interpretation |
|---|---|
| 1 | Fast / reliable delivery; high-value and high-freight orders; more complex baskets; higher dissatisfaction risk |
| 2 | Typical delivery; high-value and high-freight orders; simple baskets; stronger observed satisfaction |
| 3 | Typical delivery; freight-heavy orders; simple baskets; stronger observed satisfaction |
| 4 | Typical delivery; high-value and high-freight orders; simple baskets; stronger observed satisfaction |
| 5 | Fast / reliable delivery; lower-cost orders; simple baskets; stronger observed satisfaction |
| 6 | Late / delayed experience; high-value and high-freight orders; simple baskets; higher dissatisfaction risk |
The interpretation table converts the statistical cluster profiles into business language. These labels should be reviewed with the profile tables above because the exact meaning of each cluster depends on the fitted solution and the selected number of clusters.
The clusters can support satisfaction management in three ways:
This clustering analysis is descriptive and unsupervised. It does not prove that a feature causes dissatisfaction. K-means also assumes roughly spherical clusters in standardized numeric space, so it may simplify complex marketplace behavior. Customer region is included through one-hot indicators, but the analysis does not directly model geographic distance. Finally, review outcomes may be biased because customers who leave reviews can differ systematically from customers who remain silent.
The strongest next step is to use these clusters as interpretable experience segments, then compare them with review text, repeat purchase behavior, delivery distance, and seller performance in a follow-up supervised or causal analysis.