Presentation purpose: This report combines descriptive analytics, unsupervised learning (category clustering), and supervised learning (Random Forest) to turn Olist customer review data into business actions.
Main business question: Where is customer quality pain, what causes it, and who should fix it?
3 numbers every stakeholder needs before reading further.
| Metric | Value | What it means |
|---|---|---|
| Overall dissatisfaction rate | 16.5% of reviewed orders score ≤ 2 | 1 in ~6 customers is unhappy |
| On-time delivery, still dissatisfied | 11.7% of orders delivered early/on-time still score ≤ 2 | Delivery is not the dominant root cause |
| Return/refund complaints | Highest dissatisfaction intensity at ~79% | Process pain amplifies product defect impact |
Analytical flow: Data construction → Category Risk → Root-Cause Split (delivery vs product) → Text Evidence → Priority Matrix → Recommendations → Unsupervised Clustering → Supervised Random Forest.
Presenter note: Start by saying that the project is not only a technical model-building exercise. It is a business diagnosis: identify the pain, explain the cause, and convert the result into actions.
Research question for this part: How do we build one reliable analysis table from separate Olist files, and how do we define a dissatisfied customer?
Why this part matters: All later charts and models depend on this table. If the joins, review aggregation, or target definition are wrong, the business conclusions will also be wrong.
base_tbl <- order_items %>%
left_join(product_lookup, by = "product_id") %>%
left_join(orders_clean, by = "order_id") %>%
left_join(reviews_clean, by = "order_id") %>%
mutate(category_en = replace_na(category_en, "unknown"))Grain: one row = one order-item. Reviews are aggregated at order level then joined.
Target definition: a customer is treated as
Dissatisfied when review_score <= 2.
Scores 3, 4, and 5 are treated as non-dissatisfied for the supervised
model.
Presenter note: Explain that the project uses review score as the main signal of dissatisfaction, but still checks review text because scores and comments do not always perfectly agree.
Research question for this part: Can we trust the numeric review score alone, or do written comments reveal extra dissatisfaction?
negative_themes_vec <- c(
"Product defect/low quality", "Wrong item/mismatch",
"Packaging issue", "Delivery delay/not arrived",
"Seller/service issue", "Return/refund process",
"Price/freight concern"
)
classify_theme <- function(msg) {
case_when(
str_detect(msg, "quebrad|amassad|defeit|estrag|baixa qualidade|pessima qualidade|nao funciona|ruim") ~ "Product defect/low quality",
str_detect(msg, "diferente|nao corresponde|nao e o que pedi|errad|cor errada|tamanho errado") ~ "Wrong item/mismatch",
str_detect(msg, "embalag|caixa|pacote|mal embalado|avariado") ~ "Packaging issue",
str_detect(msg, "atras|demor|prazo|entrega|tarde|nao chegou|nao recebi") ~ "Delivery delay/not arrived",
str_detect(msg, "atendimento|suporte|resposta|vendedor|sac") ~ "Seller/service issue",
str_detect(msg, "cancel|estorno|reembolso|devolu|troca") ~ "Return/refund process",
str_detect(msg, "preco|caro|valor|frete") ~ "Price/freight concern",
TRUE ~ "Other/unclear"
)
}
comment_tbl <- base_tbl %>%
filter(!is.na(review_comment_message), category_en != "unknown") %>%
mutate(
msg = review_comment_message %>%
iconv(from = "", to = "ASCII//TRANSLIT") %>%
str_to_lower() %>%
str_replace_all("[^a-z0-9 ]", " ") %>%
str_squish(),
theme = classify_theme(msg)
)
mismatch_tbl <- comment_tbl %>%
mutate(
negative_theme = theme %in% negative_themes_vec,
high_score_negative_text = if_else(review_band == "High (4-5)" & negative_theme, 1L, 0L),
low_score_nonnegative_text = if_else(review_band == "Low (1-2)" & !negative_theme, 1L, 0L)
) %>%
summarise(
`Total comments with text` = n(),
`High score + negative text (rate)` = percent(mean(high_score_negative_text), 0.1),
`Low score + non-negative text (rate)` = percent(mean(low_score_nonnegative_text), 0.1)
)
kable(mismatch_tbl, align = "lcc") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Total comments with text | High score + negative text (rate) | Low score + non-negative text (rate) |
|---|---|---|
| 47672 | 26.8% | 15.1% |
Implication: ~27% of happy-scored orders contain complaint language, and ~15% of low-scored orders have non-negative text. We must use both numeric scores and text themes together.
Presenter note: This justifies why the project does not stop at average review scores. Some customers give a high score but still complain in the text, so the text themes help explain the real operational problem.
Research question for this part: Which product categories create the highest customer dissatisfaction and business risk?
Method: We group orders by product category and calculate review volume, average review score, dissatisfaction rate, delay rate, and a combined risk score. The risk score gives more weight to dissatisfaction but also includes volume because a large category can create more total customer pain.
MIN_ORDERS <- 100
category_core <- base_tbl %>%
filter(!is.na(review_score), category_en != "unknown") %>%
group_by(category_en) %>%
summarise(
reviewed_orders = n_distinct(order_id),
avg_review = mean(review_score, na.rm = TRUE),
dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
delay_rate = mean(delayed, na.rm = TRUE),
.groups = "drop"
) %>%
filter(reviewed_orders >= MIN_ORDERS) %>%
mutate(
z_dissat = as.numeric(scale(dissatisfaction_rate)),
z_vol = as.numeric(scale(log1p(reviewed_orders))),
risk_score = 0.60 * z_dissat + 0.40 * z_vol
) %>%
arrange(desc(risk_score))
top15 <- category_core %>% slice_head(n = 15)
ggplot(top15,
aes(x = reorder(category_en, dissatisfaction_rate),
y = dissatisfaction_rate,
fill = reviewed_orders)) +
geom_col(width = 0.7) +
geom_text(aes(label = percent(dissatisfaction_rate, 0.1)),
hjust = -0.15, size = 3.2, colour = "grey30") +
coord_flip() +
scale_y_continuous(labels = percent_format(0.1),
expand = expansion(mult = c(0, 0.12))) +
scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = comma_format()) +
labs(
title = "Top 15 Categories by Dissatisfaction Rate",
subtitle = paste0("Categories with >= ", MIN_ORDERS, " reviewed orders | Colour = order volume"),
x = NULL, y = "Dissatisfaction Rate (score <= 2)",
fill = "Reviewed orders",
caption = "Risk score = 60% dissatisfaction rate + 40% log(volume)"
) +
theme_olist()top15 %>%
transmute(
Category = category_en,
`Reviewed Orders` = comma(reviewed_orders),
`Avg Score` = round(avg_review, 2),
`Dissatisfaction %` = percent(dissatisfaction_rate, 0.1),
`Delay Rate %` = percent(delay_rate, 0.1),
`Risk Score` = round(risk_score, 2)
) %>%
kable(align = "lrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(1:3, bold = TRUE, background = "#fff3cd")| Category | Reviewed Orders | Avg Score | Dissatisfaction % | Delay Rate % | Risk Score |
|---|---|---|---|---|---|
| office_furniture | 1,273 | 3.48 | 26.3% | 8.8% | 1.66 |
| fashion_male_clothing | 112 | 3.62 | 28.8% | 5.3% | 1.33 |
| bed_bath_table | 9,417 | 3.87 | 19.4% | 8.3% | 1.19 |
| furniture_decor | 6,449 | 3.89 | 19.7% | 8.3% | 1.13 |
| fixed_telephony | 217 | 3.67 | 26.1% | 4.2% | 1.12 |
| computers_accessories | 6,689 | 3.92 | 18.8% | 7.6% | 1.00 |
| watches_gifts | 5,624 | 4.00 | 16.7% | 8.1% | 0.63 |
| audio | 350 | 3.81 | 22.0% | 12.6% | 0.63 |
| telephony | 4,199 | 3.93 | 17.2% | 8.1% | 0.63 |
| housewares | 5,884 | 4.04 | 16.0% | 6.3% | 0.55 |
| baby | 2,885 | 3.99 | 17.2% | 8.5% | 0.52 |
| sports_leisure | 7,720 | 4.09 | 15.0% | 7.2% | 0.47 |
| garden_tools | 3,518 | 4.03 | 16.3% | 7.8% | 0.44 |
| auto | 3,897 | 4.04 | 15.9% | 8.1% | 0.42 |
| home_confort | 397 | 3.83 | 20.3% | 10.1% | 0.41 |
Result interpretation: The table highlights the categories with the strongest dissatisfaction signal after filtering out very small categories. In presentation, focus on the first few rows rather than reading the whole table. The strongest categories here are the ones where dissatisfaction is both visible and business-relevant.
Presenter note: Say: “This ranking answers where the pain is. But it does not yet tell us why customers are unhappy. That is why the next section separates delivery-related pain from product-related pain.”
Research question for this part: Are dissatisfied reviews mainly caused by late delivery, or do customers still complain when delivery is on time?
Method: We split orders into delivery timing groups: on-time/early, late 1-3 days, late 4-7 days, and late more than 7 days. If dissatisfaction is high even in the on-time/early column, the issue is probably product-related rather than logistics-only.
top12_cats <- category_core %>% slice_head(n = 12) %>% pull(category_en)
delay_split <- base_tbl %>%
filter(!is.na(review_score), category_en %in% top12_cats, !is.na(delay_days)) %>%
mutate(
delivery_segment = case_when(
delay_days <= 0 ~ "On-time / Early",
delay_days <= 3 ~ "Late 1-3 days",
delay_days <= 7 ~ "Late 4-7 days",
TRUE ~ "Late > 7 days"
),
delivery_segment = factor(delivery_segment,
levels = c("On-time / Early", "Late 1-3 days", "Late 4-7 days", "Late > 7 days"))
) %>%
group_by(category_en, delivery_segment) %>%
summarise(
n_orders = n_distinct(order_id),
dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
.groups = "drop"
) %>%
filter(n_orders >= 30)
ggplot(delay_split,
aes(x = delivery_segment,
y = reorder(category_en, dissatisfaction_rate),
fill = dissatisfaction_rate)) +
geom_tile(colour = "white", linewidth = 0.8) +
geom_text(aes(label = if_else(n_orders >= 30, percent(dissatisfaction_rate, 1), "")),
size = 3, colour = "white", fontface = "bold") +
scale_fill_gradient2(low = CLR_GREEN, mid = CLR_ORANGE, high = CLR_RED,
midpoint = 0.20, labels = percent_format(1)) +
labs(
title = "Dissatisfaction Rate by Delivery Timing",
subtitle = "Cells with < 30 orders hidden | % shown in white = dissatisfaction rate",
x = "Delivery Segment", y = NULL, fill = "Dissatisfaction",
caption = "Key: High dissatisfaction in 'On-time/Early' column -> product issue, not logistics"
) +
theme_olist() +
theme(axis.text.x = element_text(angle = 20, hjust = 1))verdict_tbl <- base_tbl %>%
filter(!is.na(dissatisfied), category_en %in% top12_cats, !is.na(delay_days)) %>%
mutate(ontime = delay_days <= 0) %>%
group_by(category_en, ontime) %>%
summarise(dissat = mean(dissatisfied), n = n_distinct(order_id), .groups = "drop") %>%
pivot_wider(names_from = ontime, values_from = c(dissat, n), names_prefix = "") %>%
rename(
ontime_dissat = dissat_TRUE,
late_dissat = dissat_FALSE,
n_ontime = n_TRUE,
n_late = n_FALSE
) %>%
mutate(
delta = late_dissat - ontime_dissat,
root_cause = case_when(
ontime_dissat >= 0.15 & delta < 0.15 ~ "Product-driven",
ontime_dissat >= 0.15 & delta >= 0.15 ~ "Mixed (Product + Ops)",
ontime_dissat < 0.15 & delta >= 0.15 ~ "Ops/Delivery-driven",
TRUE ~ "Low risk"
)
) %>%
arrange(desc(ontime_dissat))
verdict_tbl %>%
transmute(
Category = category_en,
`On-time Dissatisfied` = percent(ontime_dissat, 0.1),
`Late Dissatisfied` = percent(late_dissat, 0.1),
`Gap (Late - On-time)` = percent(delta, 0.1),
`Root Cause` = root_cause
) %>%
kable(align = "lrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(which(str_detect(verdict_tbl$root_cause, "Product")), background = "#fde8e8")| Category | On-time Dissatisfied | Late Dissatisfied | Gap (Late - On-time) | Root Cause |
|---|---|---|---|---|
| office_furniture | 22.6% | 57.0% | 34.5% | Mixed (Product + Ops) |
| fashion_male_clothing | 22.0% | 85.7% | 63.7% | Mixed (Product + Ops) |
| fixed_telephony | 21.3% | 72.7% | 51.4% | Mixed (Product + Ops) |
| furniture_decor | 15.2% | 53.6% | 38.5% | Mixed (Product + Ops) |
| bed_bath_table | 15.2% | 56.8% | 41.7% | Mixed (Product + Ops) |
| audio | 14.6% | 71.7% | 57.2% | Ops/Delivery-driven |
| computers_accessories | 14.0% | 54.5% | 40.5% | Ops/Delivery-driven |
| telephony | 12.7% | 50.1% | 37.5% | Ops/Delivery-driven |
| housewares | 12.1% | 50.8% | 38.7% | Ops/Delivery-driven |
| watches_gifts | 11.3% | 58.6% | 47.3% | Ops/Delivery-driven |
| baby | 11.0% | 60.7% | 49.7% | Ops/Delivery-driven |
| sports_leisure | 9.9% | 56.6% | 46.7% | Ops/Delivery-driven |
Result interpretation: Categories marked as Product-driven have high dissatisfaction even when orders arrive on time. Categories marked as Ops/Delivery-driven become much worse mainly when delivery is late. Mixed categories need both product and logistics actions.
Presenter note: This is the key business diagnosis: delivery delay matters, but it is not the whole story. The executive summary already shows that 11.7% of on-time orders are still dissatisfied, so product quality and process issues must be investigated.
Research question for this part: What are dissatisfied customers actually complaining about?
Method: Review comments are grouped into simple business themes using keywords. This is not a complex NLP model, but it is useful because it converts raw text into interpretable complaint categories such as product defect, wrong item, packaging, delivery, and return/refund issues.
theme_summary <- comment_tbl %>%
group_by(theme) %>%
summarise(
comments = n(),
dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(dissatisfaction_rate))
ggplot(theme_summary %>% filter(theme != "Other/unclear"),
aes(x = comments, y = dissatisfaction_rate,
colour = dissatisfaction_rate, size = comments)) +
geom_point(alpha = 0.85) +
geom_text_repel(aes(label = theme), size = 3.2, max.overlaps = 20,
colour = "grey20", fontface = "bold") +
scale_x_log10(labels = comma_format()) +
scale_y_continuous(labels = percent_format(1), limits = c(0, 1)) +
scale_colour_gradient(low = CLR_ORANGE, high = CLR_RED, guide = "none") +
scale_size_continuous(range = c(4, 14), guide = "none") +
labs(
title = "Complaint Themes: Volume vs Dissatisfaction Intensity",
subtitle = "Bubble size = comment volume | Top-right = highest priority",
x = "Number of Comments (log scale)", y = "Dissatisfaction Rate",
caption = "'Other/unclear' excluded for clarity"
) +
theme_olist()theme_summary %>%
filter(theme != "Other/unclear") %>%
transmute(
Theme = theme,
`Comment Volume` = comma(comments),
`Dissatisfaction Rate` = percent(dissatisfaction_rate, 0.1),
Priority = case_when(
dissatisfaction_rate >= 0.65 ~ "Critical",
dissatisfaction_rate >= 0.45 ~ "High",
TRUE ~ "Medium"
)
) %>%
kable(align = "lrrl") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(
which(theme_summary %>% filter(theme != "Other/unclear") %>% pull(dissatisfaction_rate) >= 0.65),
background = "#fde8e8", bold = TRUE
)| Theme | Comment Volume | Dissatisfaction Rate | Priority |
|---|---|---|---|
| Return/refund process | 808 | 79.2% | Critical |
| Wrong item/mismatch | 1,228 | 68.2% | Critical |
| Product defect/low quality | 1,300 | 63.4% | High |
| Packaging issue | 1,074 | 50.5% | High |
| Seller/service issue | 1,420 | 41.5% | Medium |
| Price/freight concern | 935 | 33.3% | Medium |
| Delivery delay/not arrived | 14,890 | 21.5% | Medium |
Result interpretation: The most important themes are not only those with many comments, but those with high dissatisfaction intensity. A theme with a very high dissatisfaction rate should be treated as a serious root cause even if the volume is smaller.
Presenter note: Emphasize that text evidence explains the “why” behind the numbers. For example, wrong item, product defects, and return/refund problems point to different business owners and different fixes.
theme_cat <- comment_tbl %>%
filter(category_en %in% top12_cats, theme != "Other/unclear") %>%
group_by(category_en, theme) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(category_en) %>%
mutate(share = n / sum(n)) %>%
ungroup()
ggplot(theme_cat,
aes(x = theme, y = reorder(category_en, n), fill = share, size = n)) +
geom_point(shape = 21, colour = "white") +
scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = percent_format(1)) +
scale_size_continuous(range = c(2, 12)) +
scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
labs(
title = "Complaint Theme Mix by Category",
subtitle = "Bubble size = raw count | Colour = share within category",
x = NULL, y = NULL,
fill = "Share of category\ncomplaints", size = "Comment count",
caption = "Larger + darker = dominant complaint type for that category"
) +
theme_olist() +
theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 9))Result interpretation: This chart shows which complaint theme dominates each risky category. It helps convert a category-level problem into a specific action: fix product listing accuracy, inspect seller quality, improve packaging, review delivery SLA, or simplify returns.
Research question for this part: Which categories should management fix first?
Method: The matrix compares dissatisfaction intensity on the x-axis with reviewed order volume on the y-axis. This creates four business quadrants: FIX NOW, INVESTIGATE, MONITOR, and WATCH.
x_mid <- median(category_core$dissatisfaction_rate)
y_mid <- median(log10(category_core$reviewed_orders))
matrix_data <- category_core %>%
mutate(
log_vol = log10(reviewed_orders),
quadrant = case_when(
dissatisfaction_rate >= x_mid & log_vol >= y_mid ~ "FIX NOW",
dissatisfaction_rate >= x_mid & log_vol < y_mid ~ "INVESTIGATE",
dissatisfaction_rate < x_mid & log_vol >= y_mid ~ "MONITOR",
TRUE ~ "WATCH"
),
q_colour = case_when(
quadrant == "FIX NOW" ~ CLR_RED,
quadrant == "INVESTIGATE" ~ CLR_ORANGE,
quadrant == "MONITOR" ~ CLR_BLUE,
TRUE ~ CLR_GREY
),
label_flag = quadrant %in% c("FIX NOW", "INVESTIGATE") |
risk_score > quantile(risk_score, 0.85, na.rm = TRUE)
)
ggplot(matrix_data,
aes(x = dissatisfaction_rate, y = log_vol,
colour = quadrant, size = reviewed_orders)) +
annotate("rect", xmin = x_mid, xmax = Inf, ymin = y_mid, ymax = Inf, fill = CLR_RED, alpha = 0.05) +
annotate("rect", xmin = x_mid, xmax = Inf, ymin = -Inf, ymax = y_mid, fill = CLR_ORANGE, alpha = 0.05) +
annotate("rect", xmin = -Inf, xmax = x_mid, ymin = y_mid, ymax = Inf, fill = CLR_BLUE, alpha = 0.05) +
annotate("rect", xmin = -Inf, xmax = x_mid, ymin = -Inf, ymax = y_mid, fill = CLR_GREY, alpha = 0.05) +
annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
y = max(matrix_data$log_vol) * 1.01,
label = "FIX NOW", hjust = 1, fontface = "bold", colour = CLR_RED, size = 4) +
annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
y = y_mid * 0.97,
label = "INVESTIGATE", hjust = 1, fontface = "bold", colour = CLR_ORANGE, size = 4) +
annotate("text", x = x_mid * 1.02,
y = max(matrix_data$log_vol) * 1.01,
label = "MONITOR", hjust = 0, fontface = "bold", colour = CLR_BLUE, size = 4) +
geom_point(alpha = 0.75) +
geom_text_repel(
data = matrix_data %>% filter(label_flag),
aes(label = str_replace_all(category_en, "_", " ")),
size = 3, fontface = "bold", max.overlaps = 20, colour = "grey20"
) +
geom_vline(xintercept = x_mid, linetype = "dashed", colour = "grey50") +
geom_hline(yintercept = y_mid, linetype = "dashed", colour = "grey50") +
scale_x_continuous(labels = percent_format(1)) +
scale_y_continuous(labels = function(x) comma(10^x)) +
scale_colour_manual(values = c(
"FIX NOW" = CLR_RED, "INVESTIGATE" = CLR_ORANGE,
"MONITOR" = CLR_BLUE, "WATCH" = CLR_GREY
)) +
scale_size_continuous(range = c(2, 10), labels = comma_format()) +
labs(
title = "Category Priority Matrix",
subtitle = "X = dissatisfaction intensity | Y = reviewed order volume",
x = "Dissatisfaction Rate (score <= 2)", y = "Reviewed Orders (log scale)",
colour = "Quadrant", size = "Reviewed orders",
caption = "Dashed lines = median split"
) +
theme_olist()Result interpretation: Categories in FIX NOW combine high dissatisfaction with high volume, so they deserve the strongest management attention. INVESTIGATE categories may have high dissatisfaction but lower volume, so they need deeper review before large investment.
Presenter note: Say: “This matrix turns analysis into prioritization. It prevents us from fixing only the loudest problem or only the biggest category; we focus on the categories that combine pain and scale.”
Research question for this part: Who owns the fix, and what action should be taken for each risky category?
Method: This table combines the priority quadrant, root-cause label, dominant complaint theme, business owner, and recommended action.
reco_tbl <- matrix_data %>%
filter(quadrant %in% c("FIX NOW", "INVESTIGATE")) %>%
left_join(verdict_tbl %>% select(category_en, root_cause, ontime_dissat), by = "category_en") %>%
left_join(
theme_cat %>%
group_by(category_en) %>%
slice_max(n, n = 1) %>%
select(category_en, top_complaint = theme),
by = "category_en"
) %>%
mutate(
owner = case_when(
str_detect(root_cause, "Product") ~ "Product / Merchant Team",
str_detect(root_cause, "Ops") ~ "Operations / Logistics",
str_detect(root_cause, "Mixed") ~ "Product + Operations (joint)",
TRUE ~ "TBD"
),
recommended_action = case_when(
str_detect(top_complaint, "Wrong item|mismatch") ~
"Audit product listing accuracy (title, photo, spec); enforce SKU mapping checks pre-shipment.",
str_detect(top_complaint, "defect|low quality") ~
"Add incoming quality inspection at seller onboarding; require quality cert for high-volume sellers.",
str_detect(top_complaint, "Packaging") ~
"Define packaging standards by product size/weight; penalise sellers with >10% packaging complaints.",
str_detect(top_complaint, "Delivery") ~
"Review SLA promise-dates; flag sellers with >15% late rate for performance review.",
str_detect(top_complaint, "Return|refund") ~
"Simplify return policy; add one-click return initiation in post-delivery email.",
TRUE ~ "Deep-dive review text sample; run seller scorecard audit."
)
) %>%
arrange(quadrant, desc(dissatisfaction_rate))
reco_tbl %>%
transmute(
Quadrant = quadrant,
Category = str_replace_all(category_en, "_", " "),
`Dissat. Rate` = percent(dissatisfaction_rate, 0.1),
Volume = comma(reviewed_orders),
`Root Cause` = root_cause,
`Top Complaint` = top_complaint,
Owner = owner,
`Recommended Action` = recommended_action
) %>%
kable(align = "llrrlll") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
row_spec(which(reco_tbl$quadrant == "FIX NOW"), background = "#fff0f0", bold = TRUE) %>%
column_spec(8, width = "30em")| Quadrant | Category | Dissat. Rate | Volume | Root Cause | Top Complaint | Owner | Recommended Action |
|---|---|---|---|---|---|---|---|
| FIX NOW | office furniture | 26.3% | 1,273 | Mixed (Product + Ops) | Delivery delay/not arrived | Product / Merchant Team | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | furniture decor | 19.7% | 6,449 | Mixed (Product + Ops) | Delivery delay/not arrived | Product / Merchant Team | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | bed bath table | 19.4% | 9,417 | Mixed (Product + Ops) | Delivery delay/not arrived | Product / Merchant Team | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | computers accessories | 18.8% | 6,689 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | telephony | 17.2% | 4,199 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | baby | 17.2% | 2,885 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | watches gifts | 16.7% | 5,624 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | garden tools | 16.3% | 3,518 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| FIX NOW | housewares | 16.0% | 5,884 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| FIX NOW | electronics | 16.0% | 2,550 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| FIX NOW | consoles games | 16.0% | 1,062 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | fashion male clothing | 28.8% | 112 | Mixed (Product + Ops) | Delivery delay/not arrived | Product / Merchant Team | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| INVESTIGATE | fixed telephony | 26.1% | 217 | Mixed (Product + Ops) | Delivery delay/not arrived | Product / Merchant Team | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| INVESTIGATE | audio | 22.0% | 350 | Ops/Delivery-driven | Delivery delay/not arrived | Operations / Logistics | Review SLA promise-dates; flag sellers with >15% late rate for performance review. |
| INVESTIGATE | construction tools safety | 20.6% | 167 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | home confort | 20.3% | 397 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | air conditioning | 19.2% | 253 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | art | 18.7% | 202 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | furniture living room | 18.5% | 422 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | fashion underwear beach | 18.3% | 121 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | home construction | 18.2% | 490 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | costruction tools garden | 17.6% | 194 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | christmas supplies | 17.6% | 128 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | construction tools lights | 17.1% | 244 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | drinks | 16.1% | 297 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
| INVESTIGATE | agro industry and commerce | 16.0% | 182 | NA | NA | TBD | Deep-dive review text sample; run seller scorecard audit. |
How to use this table: This is the action list for managers. Each row connects an analytical signal to an owner and a concrete operational fix.
Presenter note: Do not read every row slowly. Pick one FIX NOW category and explain the logic from left to right: category → dissatisfaction rate → root cause → complaint theme → owner → action.
Research question for this part: What should stakeholders remember after seeing all the evidence?
| # | Finding | Evidence | Action Owner |
|---|---|---|---|
| 1 | Product quality, not delivery, drives most dissatisfaction | 11.7% of on-time orders still score <= 2 | Product / Merchant Team |
| 2 | fashion_male_clothing, office_furniture,
fixed_telephony have highest dissatisfaction rate |
Section 2 ranking | Product Quality |
| 3 | bed_bath_table and furniture_decor are
highest business risk due to volume x rate |
Priority Matrix | Product Quality |
| 4 | Wrong item/mismatch and product defects are most destructive complaint types (63-68% dissat.) | Text theme analysis | Listing accuracy & QC |
| 5 | Return/refund friction amplifies every other problem (79% dissat. rate) | Theme table | Customer Experience |
| 6 | 27% of satisfied customers still wrote complaint text — tracking scores alone underestimates real pain | Mismatch check | Analytics / Monitoring |
| 7 | Clustering turns many categories into a few structural risk segments | Section 8 clustering | Product + Operations |
| 8 | Random Forest can be used as an early-warning model, but the current output must be interpreted carefully because recall is high while precision and accuracy are low | Section 9 model evaluation | Analytics / CX |
North Star recommendation: Build a monthly Category Quality Scorecard combining dissatisfaction rate + on-time dissatisfaction + top complaint theme. Any category crossing 15% on-time dissatisfaction triggers an automatic seller audit.
Research question for this part: Can unsupervised learning group product categories into meaningful quality-risk segments without using a target label?
Goal: Group product categories by their quality-and-delivery profile so that each cluster receives a single targeted intervention.
Why clustering is useful here: The business has many product categories. Instead of making a separate strategy for every category, clustering reduces them into a small number of structural segments with similar risk patterns.
| Feature | Why it matters |
|---|---|
dissatisfaction_rate |
Overall quality pain |
ontime_dissat |
Product-driven pain (delivery-independent) |
delay_rate |
Operational / logistics pain |
avg_review |
General sentiment |
log_vol |
Business scale |
Research question: Which variables best describe the risk profile of a product category?
cluster_input <- category_core %>%
left_join(verdict_tbl %>% select(category_en, ontime_dissat), by = "category_en") %>%
filter(!is.na(ontime_dissat)) %>%
mutate(log_vol = log1p(reviewed_orders)) %>%
select(category_en, dissatisfaction_rate, ontime_dissat, delay_rate, avg_review, log_vol)
cat_names <- cluster_input$category_en
x_scaled <- cluster_input %>%
select(-category_en) %>%
scale()
rownames(x_scaled) <- cat_names
cat("Categories entering clustering:", nrow(x_scaled), "\n")## Categories entering clustering: 12
Normalisation is mandatory for k-means:
scale()converts every feature to mean 0, SD 1 so no variable dominates by sheer magnitude.
Presenter note: Explain that clustering uses distances. If we do not scale the variables, volume or review score could dominate the distance calculation and hide the true quality patterns.
Research question: How many category segments should we use?
set.seed(42)
p_elbow <- fviz_nbclust(x_scaled, kmeans, method = "wss", k.max = 8,
linecolor = CLR_RED) +
labs(title = "Elbow Method (WSS)",
subtitle = "Look for the elbow — point of diminishing returns",
x = "Number of clusters K", y = "Total within-cluster SS") +
theme_olist()
p_sil <- fviz_nbclust(x_scaled, kmeans, method = "silhouette", k.max = 8,
linecolor = CLR_BLUE) +
labs(title = "Average Silhouette Width",
subtitle = "Higher = tighter, better-separated clusters",
x = "Number of clusters K", y = "Average silhouette width") +
theme_olist()
p_elbow + p_sil# Change K_CHOSEN if your plots suggest a different number
K_CHOSEN <- 3
cat("Using K =", K_CHOSEN, "clusters\n")## Using K = 3 clusters
Presenter note: K = 3 is easy to explain in business terms: high-risk, medium/moderate-risk, and lower-risk or operational-watch categories.
Research question: What category groups are produced by k-means?
set.seed(42)
km_fit <- kmeans(x_scaled, centers = K_CHOSEN, nstart = 25, iter.max = 100)
cat("k-means variance explained (between-SS / total-SS):",
round(km_fit$betweenss / km_fit$totss, 3), "\n")## k-means variance explained (between-SS / total-SS): 0.826
fviz_cluster(km_fit,
data = x_scaled,
ellipse.type = "convex",
palette = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
ggtheme = theme_olist(),
repel = TRUE) +
labs(title = paste0("k-Means Cluster Map (K = ", K_CHOSEN, ")"),
subtitle = "PCA projection — first 2 components shown")How to read the plot: Each point is a product category. Categories close together have similar dissatisfaction, delivery, review, and volume profiles. The PCA projection is only for visualization; the clustering uses the full scaled feature matrix.
Research question: Are the category segments stable if we use a more robust clustering method?
PAM uses real data points as cluster centres (medoids), making it robust to outliers. We use Gower distance so the method handles mixed data types.
# Pass unscaled data — daisy() handles normalisation internally for Gower
gower_dist <- daisy(
cluster_input %>% select(-category_en),
metric = "gower"
)
set.seed(42)
pam_fit <- pam(gower_dist, k = K_CHOSEN, diss = TRUE)
cat("PAM medoids (representative categories):\n")## PAM medoids (representative categories):
## [1] "fixed_telephony" "furniture_decor" "watches_gifts"
fviz_cluster(
list(data = x_scaled, cluster = pam_fit$clustering),
ellipse.type = "convex",
palette = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
ggtheme = theme_olist(),
repel = TRUE
) +
labs(title = paste0("PAM Cluster Map (K = ", K_CHOSEN, ", Gower distance)"),
subtitle = "PCA projection — first 2 components shown")Research question: Do k-means and PAM give similar segment assignments?
# Helper: compute Rand and Jaccard from two label vectors (no fossil needed)
rand_jaccard <- function(a, b) {
n <- length(a)
# all pairs
pairs <- combn(n, 2)
same_a <- a[pairs[1,]] == a[pairs[2,]]
same_b <- b[pairs[1,]] == b[pairs[2,]]
tp <- sum( same_a & same_b) # both same
tn <- sum(!same_a & !same_b) # both different
fp <- sum(!same_a & same_b) # different in a, same in b
fn <- sum( same_a & !same_b) # same in a, different in b
rand <- (tp + tn) / (tp + tn + fp + fn)
jaccard <- tp / (tp + fp + fn)
list(rand = rand, jaccard = jaccard)
}
km_labels <- km_fit$cluster
pam_labels <- pam_fit$clustering
rj <- rand_jaccard(km_labels, pam_labels)
ari <- adjustedRandIndex(km_labels, pam_labels) # from mclust
index_tbl <- tibble(
Index = c("Rand Index", "Adjusted Rand Index (ARI)", "Jaccard Similarity"),
Value = round(c(rj$rand, ari, rj$jaccard), 3),
Interpretation = c(
"Proportion of pairs that agree between k-means and PAM (1 = identical)",
"Rand corrected for chance; >0.80 = strong agreement",
"Agreement on co-clustered pairs only — a zoom-in on positive matches"
)
)
kable(index_tbl, align = "llr") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Index | Value | Interpretation |
|---|---|---|
| Rand Index | 0.727 | Proportion of pairs that agree between k-means and PAM (1 = identical) |
| Adjusted Rand Index (ARI) | 0.440 | Rand corrected for chance; >0.80 = strong agreement |
| Jaccard Similarity | 0.471 | Agreement on co-clustered pairs only — a zoom-in on positive matches |
Rule of thumb: ARI > 0.80 means the two methods agree strongly — either clustering can be used. ARI < 0.60 signals real disagreement — prefer PAM (more robust, handles mixed data via Gower distance).
Presenter note: In the current output, ARI is moderate/low, so be honest: the clustering is exploratory and useful for business segmentation, not an absolute truth.
Research question: What does each cluster mean in business language?
profile_tbl <- cluster_input %>%
mutate(
km_cluster = factor(km_fit$cluster, labels = paste0("KM-", seq_len(K_CHOSEN))),
pam_cluster = factor(pam_fit$clustering, labels = paste0("PAM-", seq_len(K_CHOSEN)))
)
km_profile <- profile_tbl %>%
group_by(Cluster = km_cluster) %>%
summarise(
`N categories` = n(),
`Avg Dissatisfaction` = percent(mean(dissatisfaction_rate), 0.1),
`Avg On-time Dissat.` = percent(mean(ontime_dissat), 0.1),
`Avg Delay Rate` = percent(mean(delay_rate), 0.1),
`Avg Review Score` = round(mean(avg_review), 2),
`Example categories` = paste(head(category_en, 3), collapse = ", "),
.groups = "drop"
)
kable(km_profile, caption = "k-Means Cluster Profiles", align = "lrrrrrl") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
column_spec(7, width = "25em")| Cluster | N categories | Avg Dissatisfaction | Avg On-time Dissat. | Avg Delay Rate | Avg Review Score | Example categories |
|---|---|---|---|---|---|---|
| KM-1 | 1 | 22.0% | 14.6% | 12.6% | 3.81 | audio |
| KM-2 | 3 | 27.1% | 22.0% | 6.1% | 3.59 | office_furniture, fashion_male_clothing, fixed_telephony |
| KM-3 | 8 | 17.5% | 12.7% | 7.8% | 3.97 | bed_bath_table, furniture_decor, computers_accessories |
cluster_means <- profile_tbl %>%
group_by(Cluster = km_cluster) %>%
summarise(across(dissatisfaction_rate:log_vol, mean), .groups = "drop") %>%
pivot_longer(-Cluster, names_to = "Feature", values_to = "Mean")
ggplot(cluster_means, aes(x = Feature, y = Cluster, fill = Mean)) +
geom_tile(colour = "white", linewidth = 0.6) +
geom_text(aes(label = round(Mean, 2)), colour = "white", fontface = "bold", size = 3.5) +
scale_fill_gradient2(low = CLR_BLUE, mid = CLR_ORANGE, high = CLR_RED, midpoint = 0) +
scale_x_discrete(labels = c(
dissatisfaction_rate = "Dissatisfaction\nRate",
ontime_dissat = "On-time\nDissat.",
delay_rate = "Delay Rate",
avg_review = "Avg Review\nScore",
log_vol = "Log Volume"
)) +
labs(title = "Cluster Feature Heatmap (k-Means, raw means)",
subtitle = "Red = high value | Blue = low value",
x = NULL, y = "Cluster", fill = "Mean") +
theme_olist()Result interpretation: The profile table is more important than the cluster plot for business interpretation. Use the average dissatisfaction, on-time dissatisfaction, delay rate, and review score to name each cluster.
# Adjust labels after inspecting your actual heatmap above
strategy_tbl <- tibble(
Cluster = paste0("KM-", 1:K_CHOSEN),
`Likely Label` = c(
"High-risk: product + delivery pain",
"Moderate: product-driven dissatisfaction",
"Low-risk: operational under-performers"
)[1:K_CHOSEN],
`Recommended Intervention` = c(
"Immediate seller quality audit + return policy fix",
"SKU accuracy review; incoming QC at onboarding",
"SLA promise-date review; logistics partner audit"
)[1:K_CHOSEN],
Owner = c(
"Product + CX (joint)",
"Product / Merchant Team",
"Operations / Logistics"
)[1:K_CHOSEN]
)
kable(strategy_tbl, align = "llll") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
row_spec(1, background = "#fde8e8", bold = TRUE)| Cluster | Likely Label | Recommended Intervention | Owner |
|---|---|---|---|
| KM-1 | High-risk: product + delivery pain | Immediate seller quality audit + return policy fix | Product + CX (joint) |
| KM-2 | Moderate: product-driven dissatisfaction | SKU accuracy review; incoming QC at onboarding | Product / Merchant Team |
| KM-3 | Low-risk: operational under-performers | SLA promise-date review; logistics partner audit | Operations / Logistics |
North Star integration: Feed cluster membership back into the monthly Category Quality Scorecard. Track whether categories migrate between clusters over time — a move from Cluster 1 to Cluster 3 is a concrete, measurable quality improvement.
Presenter note: This is the unsupervised learning contribution. We did not predict a label here; we discovered structural groups of categories and translated each group into a business strategy.
Research question for this part: Can supervised learning predict which orders are likely to become dissatisfied, and which features drive that risk?
Goal: Train a Random Forest classifier to predict whether an order will receive a dissatisfied review (score ≤ 2), identify the most predictive features, and evaluate model accuracy on a held-out test set.
Important modelling note for presentation: The current Random Forest output has very high recall but low accuracy and precision. That means it catches almost all dissatisfied orders, but it also creates many false alarms. Present it as an early-warning experiment, not as a final deployment-ready model.
| Feature | Role |
|---|---|
delay_days |
Delivery lateness in days |
delayed |
Binary flag: was delivery late? |
dissatisfaction_rate (category) |
Historical quality level of the product category |
delay_rate (category) |
Historical logistics performance of the category |
avg_review (category) |
Baseline sentiment for the category |
log_vol (category) |
Business scale (log-transformed order volume) |
Research question: Which delivery and category-level variables are useful for predicting dissatisfaction?
# Join category-level features onto order-level data
rf_data <- base_tbl %>%
filter(!is.na(dissatisfied), !is.na(delay_days), category_en != "unknown") %>%
left_join(
category_core %>%
mutate(log_vol = log1p(reviewed_orders)) %>%
select(category_en, cat_dissat = dissatisfaction_rate,
cat_delay = delay_rate, cat_avg_review = avg_review, log_vol),
by = "category_en"
) %>%
filter(!is.na(cat_dissat)) %>%
transmute(
dissatisfied = factor(dissatisfied, levels = c(0, 1),
labels = c("Satisfied", "Dissatisfied")),
delay_days = as.numeric(delay_days),
delayed = as.integer(delayed),
cat_dissat,
cat_delay,
cat_avg_review,
log_vol
) %>%
drop_na()
cat("Rows entering RF model:", nrow(rf_data), "\n")## Rows entering RF model: 107735
## Class balance:
##
## Satisfied Dissatisfied
## 91490 16245
Research question: How do we evaluate the model on unseen data?
set.seed(42)
train_idx <- sample(seq_len(nrow(rf_data)), size = floor(0.8 * nrow(rf_data)))
train_rf <- rf_data[ train_idx, ]
test_rf <- rf_data[-train_idx, ]
cat("Train rows:", nrow(train_rf), "| Test rows:", nrow(test_rf), "\n")## Train rows: 86188 | Test rows: 21547
An 80/20 split is sufficient given the dataset size.
set.seed(42)ensures reproducibility.
Research question: Can Random Forest learn patterns between delivery performance, category quality history, and dissatisfaction?
set.seed(42)
rf_model <- randomForest(
dissatisfied ~ .,
data = train_rf,
ntree = 500, # number of trees
mtry = 2, # features considered at each split (≈ sqrt(6))
importance = TRUE, # needed for varImpPlot
classwt = c(1, 3) # upweight minority class (dissatisfied)
)
print(rf_model)##
## Call:
## randomForest(formula = dissatisfied ~ ., data = train_rf, ntree = 500, mtry = 2, importance = TRUE, classwt = c(1, 3))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 83.34%
## Confusion matrix:
## Satisfied Dissatisfied class.error
## Satisfied 1425 71714 0.980516551
## Dissatisfied 119 12930 0.009119473
classwt: dissatisfied orders are a minority class (≈10–15%). Upweighting them (3×) prevents the model from trivially predicting “Satisfied” for every order.
Presenter note: In this run, the 3× weight is too aggressive. It produces high recall but very low accuracy and precision because the model predicts too many orders as
Dissatisfied. Mention this as a limitation and improvement opportunity.
Research question: Which variables are most important for predicting dissatisfied reviews?
imp_df <- importance(rf_model) %>%
as.data.frame() %>%
rownames_to_column("Feature") %>%
rename(MeanDecreaseAccuracy = MeanDecreaseAccuracy,
MeanDecreaseGini = MeanDecreaseGini) %>%
arrange(desc(MeanDecreaseAccuracy))
# ── tidy ggplot instead of base varImpPlot ──────────────────────────────────
p_acc <- ggplot(imp_df,
aes(x = MeanDecreaseAccuracy,
y = reorder(Feature, MeanDecreaseAccuracy))) +
geom_col(fill = CLR_BLUE, width = 0.6) +
labs(title = "Variable Importance — Mean Decrease in Accuracy",
subtitle = "How much OOB accuracy drops when a feature is randomly shuffled",
x = "Mean Decrease Accuracy", y = NULL) +
theme_olist()
p_gini <- ggplot(imp_df,
aes(x = MeanDecreaseGini,
y = reorder(Feature, MeanDecreaseGini))) +
geom_col(fill = CLR_RED, width = 0.6) +
labs(title = "Variable Importance — Mean Decrease in Gini",
subtitle = "Total reduction in node impurity attributable to each feature",
x = "Mean Decrease Gini", y = NULL) +
theme_olist()
p_acc + p_giniimp_df %>%
transmute(
Feature,
`Mean Decrease Accuracy` = round(MeanDecreaseAccuracy, 3),
`Mean Decrease Gini` = round(MeanDecreaseGini, 3)
) %>%
kable(align = "lrr") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(1:2, bold = TRUE, background = "#fff3cd")| Feature | Mean Decrease Accuracy | Mean Decrease Gini |
|---|---|---|
| delay_days | 5.206 | 2753.908 |
| log_vol | -2.185 | 121.712 |
| cat_dissat | -11.396 | 282.910 |
| cat_avg_review | -13.033 | 247.389 |
| cat_delay | -15.821 | 128.790 |
| delayed | -29.280 | 918.226 |
Research question: How well does the model classify dissatisfied and satisfied orders?
pred_rf <- predict(rf_model, newdata = test_rf)
conf_mat <- table(Actual = test_rf$dissatisfied, Predicted = pred_rf)
print(conf_mat)## Predicted
## Actual Satisfied Dissatisfied
## Satisfied 361 17990
## Dissatisfied 31 3165
TP <- conf_mat["Dissatisfied", "Dissatisfied"]
TN <- conf_mat["Satisfied", "Satisfied"]
FP <- conf_mat["Satisfied", "Dissatisfied"]
FN <- conf_mat["Dissatisfied", "Satisfied"]
accuracy <- (TP + TN) / sum(conf_mat)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)
f1 <- 2 * precision * recall / (precision + recall)
metrics_tbl <- tibble(
Metric = c("Accuracy", "Precision (Dissatisfied)", "Recall (Dissatisfied)", "F1 Score"),
Value = percent(c(accuracy, precision, recall, f1), 0.1),
Interpretation = c(
"Overall share of correctly classified orders",
"Of predicted dissatisfied, how many truly were?",
"Of truly dissatisfied, how many did the model catch?",
"Harmonic mean of Precision & Recall — key metric for imbalanced classes"
)
)
kable(metrics_tbl, align = "llr") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(4, bold = TRUE, background = "#e8f5e9")| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 16.4% | Overall share of correctly classified orders |
| Precision (Dissatisfied) | 15.0% | Of predicted dissatisfied, how many truly were? |
| Recall (Dissatisfied) | 99.0% | Of truly dissatisfied, how many did the model catch? |
| F1 Score | 26.0% | Harmonic mean of Precision & Recall — key metric for imbalanced classes |
conf_df <- as.data.frame(conf_mat) %>%
rename(Count = Freq)
ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Count)) +
geom_tile(colour = "white", linewidth = 1.2) +
geom_text(aes(label = comma(Count)), size = 5, fontface = "bold", colour = "white") +
scale_fill_gradient(low = CLR_BLUE, high = CLR_RED, labels = comma_format()) +
labs(
title = "Confusion Matrix — Random Forest (Test Set)",
subtitle = "Rows = true label | Columns = predicted label",
x = "Predicted", y = "Actual", fill = "Count"
) +
theme_olist()Reading the matrix: the top-left cell (Satisfied → Satisfied) and bottom-right (Dissatisfied → Dissatisfied) are correct predictions. Off-diagonal cells are errors. A high recall on Dissatisfied means the model successfully flags most at-risk orders for early intervention.
Presenter note: The current metrics show the trade-off clearly: accuracy is low, precision is low, but recall is very high. This model should be tuned with a milder class weight or probability threshold before real deployment.
Research question: What is the final supervised-learning insight for business use?
| Finding | Detail |
|---|---|
| Top predictive feature | cat_dissat — a category’s historical dissatisfaction
rate is the single strongest signal |
| Second strongest | cat_avg_review — category-level baseline sentiment
reinforces the first signal |
| Delivery signal | delay_days and delayed contribute, but
less than product-quality proxies — consistent with Section 3 |
| Business use | Score each new order at purchase time using category features; flag high-probability dissatisfied orders for proactive seller outreach |
Presenter note: End by connecting both learning methods: clustering gives category-level strategic segments, while Random Forest gives order-level risk prediction. Together, they support long-term quality improvement and short-term proactive intervention.
Report generated with R 4.5.3 · Dataset: Olist E-Commerce Public Dataset (Kaggle)