Executive Summary

Presentation purpose: This report combines descriptive analytics, unsupervised learning (category clustering), and supervised learning (Random Forest) to turn Olist customer review data into business actions.

Main business question: Where is customer quality pain, what causes it, and who should fix it?

3 numbers every stakeholder needs before reading further.

Metric	Value	What it means
Overall dissatisfaction rate	16.5% of reviewed orders score ≤ 2	1 in ~6 customers is unhappy
On-time delivery, still dissatisfied	11.7% of orders delivered early/on-time still score ≤ 2	Delivery is not the dominant root cause
Return/refund complaints	Highest dissatisfaction intensity at ~79%	Process pain amplifies product defect impact

Analytical flow: Data construction → Category Risk → Root-Cause Split (delivery vs product) → Text Evidence → Priority Matrix → Recommendations → Unsupervised Clustering → Supervised Random Forest.

Research Questions

Where is quality pain? Which product categories have the highest dissatisfaction and business risk?
What causes it? Is dissatisfaction mainly driven by late delivery, product quality, wrong items, defects, or return/refund problems?
Who owns the fix? Which business team should act first: Product/Merchant, Operations/Logistics, or Customer Experience?
Can unsupervised learning help? Can product categories be grouped into meaningful risk segments without using a target label?
Can supervised learning help? Can a Random Forest model predict orders that are likely to receive dissatisfied reviews?

Presenter note: Start by saying that the project is not only a technical model-building exercise. It is a business diagnosis: identify the pain, explain the cause, and convert the result into actions.

1. Data & Base Table

Research question for this part: How do we build one reliable analysis table from separate Olist files, and how do we define a dissatisfied customer?

Why this part matters: All later charts and models depend on this table. If the joins, review aggregation, or target definition are wrong, the business conclusions will also be wrong.

1.1 Construction

base_tbl <- order_items %>%
  left_join(product_lookup, by = "product_id") %>%
  left_join(orders_clean,  by = "order_id") %>%
  left_join(reviews_clean, by = "order_id") %>%
  mutate(category_en = replace_na(category_en, "unknown"))

Grain: one row = one order-item. Reviews are aggregated at order level then joined.

Target definition: a customer is treated as Dissatisfied when review_score <= 2. Scores 3, 4, and 5 are treated as non-dissatisfied for the supervised model.

Presenter note: Explain that the project uses review score as the main signal of dissatisfaction, but still checks review text because scores and comments do not always perfectly agree.

1.2 Data Quality Check — Score-Text Mismatch

Research question for this part: Can we trust the numeric review score alone, or do written comments reveal extra dissatisfaction?

negative_themes_vec <- c(
  "Product defect/low quality", "Wrong item/mismatch",
  "Packaging issue",            "Delivery delay/not arrived",
  "Seller/service issue",       "Return/refund process",
  "Price/freight concern"
)

classify_theme <- function(msg) {
  case_when(
    str_detect(msg, "quebrad|amassad|defeit|estrag|baixa qualidade|pessima qualidade|nao funciona|ruim") ~ "Product defect/low quality",
    str_detect(msg, "diferente|nao corresponde|nao e o que pedi|errad|cor errada|tamanho errado")        ~ "Wrong item/mismatch",
    str_detect(msg, "embalag|caixa|pacote|mal embalado|avariado")                                       ~ "Packaging issue",
    str_detect(msg, "atras|demor|prazo|entrega|tarde|nao chegou|nao recebi")                            ~ "Delivery delay/not arrived",
    str_detect(msg, "atendimento|suporte|resposta|vendedor|sac")                                        ~ "Seller/service issue",
    str_detect(msg, "cancel|estorno|reembolso|devolu|troca")                                            ~ "Return/refund process",
    str_detect(msg, "preco|caro|valor|frete")                                                           ~ "Price/freight concern",
    TRUE                                                                                                 ~ "Other/unclear"
  )
}

comment_tbl <- base_tbl %>%
  filter(!is.na(review_comment_message), category_en != "unknown") %>%
  mutate(
    msg   = review_comment_message %>%
              iconv(from = "", to = "ASCII//TRANSLIT") %>%
              str_to_lower() %>%
              str_replace_all("[^a-z0-9 ]", " ") %>%
              str_squish(),
    theme = classify_theme(msg)
  )

mismatch_tbl <- comment_tbl %>%
  mutate(
    negative_theme             = theme %in% negative_themes_vec,
    high_score_negative_text   = if_else(review_band == "High (4-5)" & negative_theme,  1L, 0L),
    low_score_nonnegative_text = if_else(review_band == "Low (1-2)"  & !negative_theme, 1L, 0L)
  ) %>%
  summarise(
    `Total comments with text`             = n(),
    `High score + negative text (rate)`    = percent(mean(high_score_negative_text),   0.1),
    `Low score + non-negative text (rate)` = percent(mean(low_score_nonnegative_text), 0.1)
  )

kable(mismatch_tbl, align = "lcc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Total comments with text	High score + negative text (rate)	Low score + non-negative text (rate)
47672	26.8%	15.1%

Implication: ~27% of happy-scored orders contain complaint language, and ~15% of low-scored orders have non-negative text. We must use both numeric scores and text themes together.

Presenter note: This justifies why the project does not stop at average review scores. Some customers give a high score but still complain in the text, so the text themes help explain the real operational problem.

2. Category Risk Ranking

Research question for this part: Which product categories create the highest customer dissatisfaction and business risk?

Method: We group orders by product category and calculate review volume, average review score, dissatisfaction rate, delay rate, and a combined risk score. The risk score gives more weight to dissatisfaction but also includes volume because a large category can create more total customer pain.

MIN_ORDERS <- 100

category_core <- base_tbl %>%
  filter(!is.na(review_score), category_en != "unknown") %>%
  group_by(category_en) %>%
  summarise(
    reviewed_orders      = n_distinct(order_id),
    avg_review           = mean(review_score,  na.rm = TRUE),
    dissatisfaction_rate = mean(dissatisfied,  na.rm = TRUE),
    delay_rate           = mean(delayed,       na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(reviewed_orders >= MIN_ORDERS) %>%
  mutate(
    z_dissat   = as.numeric(scale(dissatisfaction_rate)),
    z_vol      = as.numeric(scale(log1p(reviewed_orders))),
    risk_score = 0.60 * z_dissat + 0.40 * z_vol
  ) %>%
  arrange(desc(risk_score))

top15 <- category_core %>% slice_head(n = 15)

ggplot(top15,
       aes(x = reorder(category_en, dissatisfaction_rate),
           y = dissatisfaction_rate,
           fill = reviewed_orders)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = percent(dissatisfaction_rate, 0.1)),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  coord_flip() +
  scale_y_continuous(labels = percent_format(0.1),
                     expand = expansion(mult = c(0, 0.12))) +
  scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = comma_format()) +
  labs(
    title    = "Top 15 Categories by Dissatisfaction Rate",
    subtitle = paste0("Categories with >= ", MIN_ORDERS, " reviewed orders | Colour = order volume"),
    x = NULL, y = "Dissatisfaction Rate (score <= 2)",
    fill = "Reviewed orders",
    caption = "Risk score = 60% dissatisfaction rate + 40% log(volume)"
  ) +
  theme_olist()

top15 %>%
  transmute(
    Category            = category_en,
    `Reviewed Orders`   = comma(reviewed_orders),
    `Avg Score`         = round(avg_review, 2),
    `Dissatisfaction %` = percent(dissatisfaction_rate, 0.1),
    `Delay Rate %`      = percent(delay_rate, 0.1),
    `Risk Score`        = round(risk_score, 2)
  ) %>%
  kable(align = "lrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1:3, bold = TRUE, background = "#fff3cd")

Category	Reviewed Orders	Avg Score	Dissatisfaction %	Delay Rate %	Risk Score
office_furniture	1,273	3.48	26.3%	8.8%	1.66
fashion_male_clothing	112	3.62	28.8%	5.3%	1.33
bed_bath_table	9,417	3.87	19.4%	8.3%	1.19
furniture_decor	6,449	3.89	19.7%	8.3%	1.13
fixed_telephony	217	3.67	26.1%	4.2%	1.12
computers_accessories	6,689	3.92	18.8%	7.6%	1.00
watches_gifts	5,624	4.00	16.7%	8.1%	0.63
audio	350	3.81	22.0%	12.6%	0.63
telephony	4,199	3.93	17.2%	8.1%	0.63
housewares	5,884	4.04	16.0%	6.3%	0.55
baby	2,885	3.99	17.2%	8.5%	0.52
sports_leisure	7,720	4.09	15.0%	7.2%	0.47
garden_tools	3,518	4.03	16.3%	7.8%	0.44
auto	3,897	4.04	15.9%	8.1%	0.42
home_confort	397	3.83	20.3%	10.1%	0.41

Result interpretation: The table highlights the categories with the strongest dissatisfaction signal after filtering out very small categories. In presentation, focus on the first few rows rather than reading the whole table. The strongest categories here are the ones where dissatisfaction is both visible and business-relevant.

Presenter note: Say: “This ranking answers where the pain is. But it does not yet tell us why customers are unhappy. That is why the next section separates delivery-related pain from product-related pain.”

3. Root-Cause Split — Delivery vs Product

Research question for this part: Are dissatisfied reviews mainly caused by late delivery, or do customers still complain when delivery is on time?

Method: We split orders into delivery timing groups: on-time/early, late 1-3 days, late 4-7 days, and late more than 7 days. If dissatisfaction is high even in the on-time/early column, the issue is probably product-related rather than logistics-only.

top12_cats <- category_core %>% slice_head(n = 12) %>% pull(category_en)

delay_split <- base_tbl %>%
  filter(!is.na(review_score), category_en %in% top12_cats, !is.na(delay_days)) %>%
  mutate(
    delivery_segment = case_when(
      delay_days <= 0 ~ "On-time / Early",
      delay_days <= 3 ~ "Late 1-3 days",
      delay_days <= 7 ~ "Late 4-7 days",
      TRUE            ~ "Late > 7 days"
    ),
    delivery_segment = factor(delivery_segment,
      levels = c("On-time / Early", "Late 1-3 days", "Late 4-7 days", "Late > 7 days"))
  ) %>%
  group_by(category_en, delivery_segment) %>%
  summarise(
    n_orders             = n_distinct(order_id),
    dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_orders >= 30)

ggplot(delay_split,
       aes(x = delivery_segment,
           y = reorder(category_en, dissatisfaction_rate),
           fill = dissatisfaction_rate)) +
  geom_tile(colour = "white", linewidth = 0.8) +
  geom_text(aes(label = if_else(n_orders >= 30, percent(dissatisfaction_rate, 1), "")),
            size = 3, colour = "white", fontface = "bold") +
  scale_fill_gradient2(low = CLR_GREEN, mid = CLR_ORANGE, high = CLR_RED,
                       midpoint = 0.20, labels = percent_format(1)) +
  labs(
    title    = "Dissatisfaction Rate by Delivery Timing",
    subtitle = "Cells with < 30 orders hidden | % shown in white = dissatisfaction rate",
    x = "Delivery Segment", y = NULL, fill = "Dissatisfaction",
    caption = "Key: High dissatisfaction in 'On-time/Early' column -> product issue, not logistics"
  ) +
  theme_olist() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

verdict_tbl <- base_tbl %>%
  filter(!is.na(dissatisfied), category_en %in% top12_cats, !is.na(delay_days)) %>%
  mutate(ontime = delay_days <= 0) %>%
  group_by(category_en, ontime) %>%
  summarise(dissat = mean(dissatisfied), n = n_distinct(order_id), .groups = "drop") %>%
  pivot_wider(names_from = ontime, values_from = c(dissat, n), names_prefix = "") %>%
  rename(
    ontime_dissat = dissat_TRUE,
    late_dissat   = dissat_FALSE,
    n_ontime      = n_TRUE,
    n_late        = n_FALSE
  ) %>%
  mutate(
    delta      = late_dissat - ontime_dissat,
    root_cause = case_when(
      ontime_dissat >= 0.15 & delta <  0.15 ~ "Product-driven",
      ontime_dissat >= 0.15 & delta >= 0.15 ~ "Mixed (Product + Ops)",
      ontime_dissat <  0.15 & delta >= 0.15 ~ "Ops/Delivery-driven",
      TRUE                                   ~ "Low risk"
    )
  ) %>%
  arrange(desc(ontime_dissat))

verdict_tbl %>%
  transmute(
    Category               = category_en,
    `On-time Dissatisfied` = percent(ontime_dissat, 0.1),
    `Late Dissatisfied`    = percent(late_dissat, 0.1),
    `Gap (Late - On-time)` = percent(delta, 0.1),
    `Root Cause`           = root_cause
  ) %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(str_detect(verdict_tbl$root_cause, "Product")), background = "#fde8e8")

Category	On-time Dissatisfied	Late Dissatisfied	Gap (Late - On-time)	Root Cause
office_furniture	22.6%	57.0%	34.5%	Mixed (Product + Ops)
fashion_male_clothing	22.0%	85.7%	63.7%	Mixed (Product + Ops)
fixed_telephony	21.3%	72.7%	51.4%	Mixed (Product + Ops)
furniture_decor	15.2%	53.6%	38.5%	Mixed (Product + Ops)
bed_bath_table	15.2%	56.8%	41.7%	Mixed (Product + Ops)
audio	14.6%	71.7%	57.2%	Ops/Delivery-driven
computers_accessories	14.0%	54.5%	40.5%	Ops/Delivery-driven
telephony	12.7%	50.1%	37.5%	Ops/Delivery-driven
housewares	12.1%	50.8%	38.7%	Ops/Delivery-driven
watches_gifts	11.3%	58.6%	47.3%	Ops/Delivery-driven
baby	11.0%	60.7%	49.7%	Ops/Delivery-driven
sports_leisure	9.9%	56.6%	46.7%	Ops/Delivery-driven

Result interpretation: Categories marked as Product-driven have high dissatisfaction even when orders arrive on time. Categories marked as Ops/Delivery-driven become much worse mainly when delivery is late. Mixed categories need both product and logistics actions.

Presenter note: This is the key business diagnosis: delivery delay matters, but it is not the whole story. The executive summary already shows that 11.7% of on-time orders are still dissatisfied, so product quality and process issues must be investigated.

4. Text Evidence

Research question for this part: What are dissatisfied customers actually complaining about?

Method: Review comments are grouped into simple business themes using keywords. This is not a complex NLP model, but it is useful because it converts raw text into interpretable complaint categories such as product defect, wrong item, packaging, delivery, and return/refund issues.

4.1 Theme Distribution

theme_summary <- comment_tbl %>%
  group_by(theme) %>%
  summarise(
    comments             = n(),
    dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(dissatisfaction_rate))

ggplot(theme_summary %>% filter(theme != "Other/unclear"),
       aes(x = comments, y = dissatisfaction_rate,
           colour = dissatisfaction_rate, size = comments)) +
  geom_point(alpha = 0.85) +
  geom_text_repel(aes(label = theme), size = 3.2, max.overlaps = 20,
                  colour = "grey20", fontface = "bold") +
  scale_x_log10(labels = comma_format()) +
  scale_y_continuous(labels = percent_format(1), limits = c(0, 1)) +
  scale_colour_gradient(low = CLR_ORANGE, high = CLR_RED, guide = "none") +
  scale_size_continuous(range = c(4, 14), guide = "none") +
  labs(
    title    = "Complaint Themes: Volume vs Dissatisfaction Intensity",
    subtitle = "Bubble size = comment volume | Top-right = highest priority",
    x = "Number of Comments (log scale)", y = "Dissatisfaction Rate",
    caption = "'Other/unclear' excluded for clarity"
  ) +
  theme_olist()

theme_summary %>%
  filter(theme != "Other/unclear") %>%
  transmute(
    Theme                  = theme,
    `Comment Volume`       = comma(comments),
    `Dissatisfaction Rate` = percent(dissatisfaction_rate, 0.1),
    Priority = case_when(
      dissatisfaction_rate >= 0.65 ~ "Critical",
      dissatisfaction_rate >= 0.45 ~ "High",
      TRUE                         ~ "Medium"
    )
  ) %>%
  kable(align = "lrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(
    which(theme_summary %>% filter(theme != "Other/unclear") %>% pull(dissatisfaction_rate) >= 0.65),
    background = "#fde8e8", bold = TRUE
  )

Theme	Comment Volume	Dissatisfaction Rate	Priority
Return/refund process	808	79.2%	Critical
Wrong item/mismatch	1,228	68.2%	Critical
Product defect/low quality	1,300	63.4%	High
Packaging issue	1,074	50.5%	High
Seller/service issue	1,420	41.5%	Medium
Price/freight concern	935	33.3%	Medium
Delivery delay/not arrived	14,890	21.5%	Medium

Result interpretation: The most important themes are not only those with many comments, but those with high dissatisfaction intensity. A theme with a very high dissatisfaction rate should be treated as a serious root cause even if the volume is smaller.

Presenter note: Emphasize that text evidence explains the “why” behind the numbers. For example, wrong item, product defects, and return/refund problems point to different business owners and different fixes.

4.2 Theme x Category

theme_cat <- comment_tbl %>%
  filter(category_en %in% top12_cats, theme != "Other/unclear") %>%
  group_by(category_en, theme) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(category_en) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

ggplot(theme_cat,
       aes(x = theme, y = reorder(category_en, n), fill = share, size = n)) +
  geom_point(shape = 21, colour = "white") +
  scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = percent_format(1)) +
  scale_size_continuous(range = c(2, 12)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  labs(
    title    = "Complaint Theme Mix by Category",
    subtitle = "Bubble size = raw count | Colour = share within category",
    x = NULL, y = NULL,
    fill = "Share of category\ncomplaints", size = "Comment count",
    caption = "Larger + darker = dominant complaint type for that category"
  ) +
  theme_olist() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 9))

Result interpretation: This chart shows which complaint theme dominates each risky category. It helps convert a category-level problem into a specific action: fix product listing accuracy, inspect seller quality, improve packaging, review delivery SLA, or simplify returns.

5. Priority Matrix

Research question for this part: Which categories should management fix first?

Method: The matrix compares dissatisfaction intensity on the x-axis with reviewed order volume on the y-axis. This creates four business quadrants: FIX NOW, INVESTIGATE, MONITOR, and WATCH.

x_mid <- median(category_core$dissatisfaction_rate)
y_mid <- median(log10(category_core$reviewed_orders))

matrix_data <- category_core %>%
  mutate(
    log_vol  = log10(reviewed_orders),
    quadrant = case_when(
      dissatisfaction_rate >= x_mid & log_vol >= y_mid ~ "FIX NOW",
      dissatisfaction_rate >= x_mid & log_vol <  y_mid ~ "INVESTIGATE",
      dissatisfaction_rate <  x_mid & log_vol >= y_mid ~ "MONITOR",
      TRUE                                              ~ "WATCH"
    ),
    q_colour   = case_when(
      quadrant == "FIX NOW"     ~ CLR_RED,
      quadrant == "INVESTIGATE" ~ CLR_ORANGE,
      quadrant == "MONITOR"     ~ CLR_BLUE,
      TRUE                      ~ CLR_GREY
    ),
    label_flag = quadrant %in% c("FIX NOW", "INVESTIGATE") |
                   risk_score > quantile(risk_score, 0.85, na.rm = TRUE)
  )

ggplot(matrix_data,
       aes(x = dissatisfaction_rate, y = log_vol,
           colour = quadrant, size = reviewed_orders)) +
  annotate("rect", xmin = x_mid, xmax = Inf,  ymin = y_mid, ymax = Inf,  fill = CLR_RED,    alpha = 0.05) +
  annotate("rect", xmin = x_mid, xmax = Inf,  ymin = -Inf,  ymax = y_mid, fill = CLR_ORANGE, alpha = 0.05) +
  annotate("rect", xmin = -Inf,  xmax = x_mid, ymin = y_mid, ymax = Inf,  fill = CLR_BLUE,   alpha = 0.05) +
  annotate("rect", xmin = -Inf,  xmax = x_mid, ymin = -Inf,  ymax = y_mid, fill = CLR_GREY,   alpha = 0.05) +
  annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
           y = max(matrix_data$log_vol) * 1.01,
           label = "FIX NOW", hjust = 1, fontface = "bold", colour = CLR_RED, size = 4) +
  annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
           y = y_mid * 0.97,
           label = "INVESTIGATE", hjust = 1, fontface = "bold", colour = CLR_ORANGE, size = 4) +
  annotate("text", x = x_mid * 1.02,
           y = max(matrix_data$log_vol) * 1.01,
           label = "MONITOR", hjust = 0, fontface = "bold", colour = CLR_BLUE, size = 4) +
  geom_point(alpha = 0.75) +
  geom_text_repel(
    data = matrix_data %>% filter(label_flag),
    aes(label = str_replace_all(category_en, "_", " ")),
    size = 3, fontface = "bold", max.overlaps = 20, colour = "grey20"
  ) +
  geom_vline(xintercept = x_mid, linetype = "dashed", colour = "grey50") +
  geom_hline(yintercept = y_mid, linetype = "dashed", colour = "grey50") +
  scale_x_continuous(labels = percent_format(1)) +
  scale_y_continuous(labels = function(x) comma(10^x)) +
  scale_colour_manual(values = c(
    "FIX NOW" = CLR_RED, "INVESTIGATE" = CLR_ORANGE,
    "MONITOR" = CLR_BLUE, "WATCH" = CLR_GREY
  )) +
  scale_size_continuous(range = c(2, 10), labels = comma_format()) +
  labs(
    title    = "Category Priority Matrix",
    subtitle = "X = dissatisfaction intensity | Y = reviewed order volume",
    x = "Dissatisfaction Rate (score <= 2)", y = "Reviewed Orders (log scale)",
    colour = "Quadrant", size = "Reviewed orders",
    caption = "Dashed lines = median split"
  ) +
  theme_olist()

Result interpretation: Categories in FIX NOW combine high dissatisfaction with high volume, so they deserve the strongest management attention. INVESTIGATE categories may have high dissatisfaction but lower volume, so they need deeper review before large investment.

Presenter note: Say: “This matrix turns analysis into prioritization. It prevents us from fixing only the loudest problem or only the biggest category; we focus on the categories that combine pain and scale.”

6. Recommendations

Research question for this part: Who owns the fix, and what action should be taken for each risky category?

Method: This table combines the priority quadrant, root-cause label, dominant complaint theme, business owner, and recommended action.

reco_tbl <- matrix_data %>%
  filter(quadrant %in% c("FIX NOW", "INVESTIGATE")) %>%
  left_join(verdict_tbl %>% select(category_en, root_cause, ontime_dissat), by = "category_en") %>%
  left_join(
    theme_cat %>%
      group_by(category_en) %>%
      slice_max(n, n = 1) %>%
      select(category_en, top_complaint = theme),
    by = "category_en"
  ) %>%
  mutate(
    owner = case_when(
      str_detect(root_cause, "Product") ~ "Product / Merchant Team",
      str_detect(root_cause, "Ops")     ~ "Operations / Logistics",
      str_detect(root_cause, "Mixed")   ~ "Product + Operations (joint)",
      TRUE                               ~ "TBD"
    ),
    recommended_action = case_when(
      str_detect(top_complaint, "Wrong item|mismatch") ~
        "Audit product listing accuracy (title, photo, spec); enforce SKU mapping checks pre-shipment.",
      str_detect(top_complaint, "defect|low quality") ~
        "Add incoming quality inspection at seller onboarding; require quality cert for high-volume sellers.",
      str_detect(top_complaint, "Packaging") ~
        "Define packaging standards by product size/weight; penalise sellers with >10% packaging complaints.",
      str_detect(top_complaint, "Delivery") ~
        "Review SLA promise-dates; flag sellers with >15% late rate for performance review.",
      str_detect(top_complaint, "Return|refund") ~
        "Simplify return policy; add one-click return initiation in post-delivery email.",
      TRUE ~ "Deep-dive review text sample; run seller scorecard audit."
    )
  ) %>%
  arrange(quadrant, desc(dissatisfaction_rate))

reco_tbl %>%
  transmute(
    Quadrant             = quadrant,
    Category             = str_replace_all(category_en, "_", " "),
    `Dissat. Rate`       = percent(dissatisfaction_rate, 0.1),
    Volume               = comma(reviewed_orders),
    `Root Cause`         = root_cause,
    `Top Complaint`      = top_complaint,
    Owner                = owner,
    `Recommended Action` = recommended_action
  ) %>%
  kable(align = "llrrlll") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(which(reco_tbl$quadrant == "FIX NOW"), background = "#fff0f0", bold = TRUE) %>%
  column_spec(8, width = "30em")

Quadrant	Category	Dissat. Rate	Volume	Root Cause	Top Complaint	Owner	Recommended Action
FIX NOW	office furniture	26.3%	1,273	Mixed (Product + Ops)	Delivery delay/not arrived	Product / Merchant Team	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	furniture decor	19.7%	6,449	Mixed (Product + Ops)	Delivery delay/not arrived	Product / Merchant Team	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	bed bath table	19.4%	9,417	Mixed (Product + Ops)	Delivery delay/not arrived	Product / Merchant Team	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	computers accessories	18.8%	6,689	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	telephony	17.2%	4,199	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	baby	17.2%	2,885	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	watches gifts	16.7%	5,624	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	garden tools	16.3%	3,518	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
FIX NOW	housewares	16.0%	5,884	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW	electronics	16.0%	2,550	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
FIX NOW	consoles games	16.0%	1,062	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	fashion male clothing	28.8%	112	Mixed (Product + Ops)	Delivery delay/not arrived	Product / Merchant Team	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE	fixed telephony	26.1%	217	Mixed (Product + Ops)	Delivery delay/not arrived	Product / Merchant Team	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE	audio	22.0%	350	Ops/Delivery-driven	Delivery delay/not arrived	Operations / Logistics	Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE	construction tools safety	20.6%	167	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	home confort	20.3%	397	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	air conditioning	19.2%	253	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	art	18.7%	202	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	furniture living room	18.5%	422	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	fashion underwear beach	18.3%	121	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	home construction	18.2%	490	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	costruction tools garden	17.6%	194	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	christmas supplies	17.6%	128	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	construction tools lights	17.1%	244	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	drinks	16.1%	297	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE	agro industry and commerce	16.0%	182	NA	NA	TBD	Deep-dive review text sample; run seller scorecard audit.

How to use this table: This is the action list for managers. Each row connects an analytical signal to an owner and a concrete operational fix.

Presenter note: Do not read every row slowly. Pick one FIX NOW category and explain the logic from left to right: category → dissatisfaction rate → root cause → complaint theme → owner → action.

7. Summary of Findings

Research question for this part: What should stakeholders remember after seeing all the evidence?

#	Finding	Evidence	Action Owner
1	Product quality, not delivery, drives most dissatisfaction	11.7% of on-time orders still score <= 2	Product / Merchant Team
2	`fashion_male_clothing`, `office_furniture`, `fixed_telephony` have highest dissatisfaction rate	Section 2 ranking	Product Quality
3	`bed_bath_table` and `furniture_decor` are highest business risk due to volume x rate	Priority Matrix	Product Quality
4	Wrong item/mismatch and product defects are most destructive complaint types (63-68% dissat.)	Text theme analysis	Listing accuracy & QC
5	Return/refund friction amplifies every other problem (79% dissat. rate)	Theme table	Customer Experience
6	27% of satisfied customers still wrote complaint text — tracking scores alone underestimates real pain	Mismatch check	Analytics / Monitoring
7	Clustering turns many categories into a few structural risk segments	Section 8 clustering	Product + Operations
8	Random Forest can be used as an early-warning model, but the current output must be interpreted carefully because recall is high while precision and accuracy are low	Section 9 model evaluation	Analytics / CX

North Star recommendation: Build a monthly Category Quality Scorecard combining dissatisfaction rate + on-time dissatisfaction + top complaint theme. Any category crossing 15% on-time dissatisfaction triggers an automatic seller audit.

8. Category Clustering — Structural Segments

Research question for this part: Can unsupervised learning group product categories into meaningful quality-risk segments without using a target label?

Goal: Group product categories by their quality-and-delivery profile so that each cluster receives a single targeted intervention.

Why clustering is useful here: The business has many product categories. Instead of making a separate strategy for every category, clustering reduces them into a small number of structural segments with similar risk patterns.

Feature	Why it matters
`dissatisfaction_rate`	Overall quality pain
`ontime_dissat`	Product-driven pain (delivery-independent)
`delay_rate`	Operational / logistics pain
`avg_review`	General sentiment
`log_vol`	Business scale

8.1 Feature Matrix

Research question: Which variables best describe the risk profile of a product category?

cluster_input <- category_core %>%
  left_join(verdict_tbl %>% select(category_en, ontime_dissat), by = "category_en") %>%
  filter(!is.na(ontime_dissat)) %>%
  mutate(log_vol = log1p(reviewed_orders)) %>%
  select(category_en, dissatisfaction_rate, ontime_dissat, delay_rate, avg_review, log_vol)

cat_names <- cluster_input$category_en

x_scaled <- cluster_input %>%
  select(-category_en) %>%
  scale()
rownames(x_scaled) <- cat_names

cat("Categories entering clustering:", nrow(x_scaled), "\n")

## Categories entering clustering: 12

Normalisation is mandatory for k-means: scale() converts every feature to mean 0, SD 1 so no variable dominates by sheer magnitude.

Presenter note: Explain that clustering uses distances. If we do not scale the variables, volume or review score could dominate the distance calculation and hide the true quality patterns.

8.2 Choosing K — Elbow & Silhouette

Research question: How many category segments should we use?

set.seed(42)

p_elbow <- fviz_nbclust(x_scaled, kmeans, method = "wss", k.max = 8,
                        linecolor = CLR_RED) +
  labs(title    = "Elbow Method (WSS)",
       subtitle = "Look for the elbow — point of diminishing returns",
       x = "Number of clusters K", y = "Total within-cluster SS") +
  theme_olist()

p_sil <- fviz_nbclust(x_scaled, kmeans, method = "silhouette", k.max = 8,
                      linecolor = CLR_BLUE) +
  labs(title    = "Average Silhouette Width",
       subtitle = "Higher = tighter, better-separated clusters",
       x = "Number of clusters K", y = "Average silhouette width") +
  theme_olist()

p_elbow + p_sil

# Change K_CHOSEN if your plots suggest a different number
K_CHOSEN <- 3
cat("Using K =", K_CHOSEN, "clusters\n")

## Using K = 3 clusters

Presenter note: K = 3 is easy to explain in business terms: high-risk, medium/moderate-risk, and lower-risk or operational-watch categories.

8.3 k-Means Clustering

Research question: What category groups are produced by k-means?

set.seed(42)
km_fit <- kmeans(x_scaled, centers = K_CHOSEN, nstart = 25, iter.max = 100)

cat("k-means variance explained (between-SS / total-SS):",
    round(km_fit$betweenss / km_fit$totss, 3), "\n")

## k-means variance explained (between-SS / total-SS): 0.826

fviz_cluster(km_fit,
             data         = x_scaled,
             ellipse.type = "convex",
             palette      = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
             ggtheme      = theme_olist(),
             repel        = TRUE) +
  labs(title    = paste0("k-Means Cluster Map (K = ", K_CHOSEN, ")"),
       subtitle = "PCA projection — first 2 components shown")

How to read the plot: Each point is a product category. Categories close together have similar dissatisfaction, delivery, review, and volume profiles. The PCA projection is only for visualization; the clustering uses the full scaled feature matrix.

8.4 PAM Clustering (Medoid-Based)

Research question: Are the category segments stable if we use a more robust clustering method?

PAM uses real data points as cluster centres (medoids), making it robust to outliers. We use Gower distance so the method handles mixed data types.

# Pass unscaled data — daisy() handles normalisation internally for Gower
gower_dist <- daisy(
  cluster_input %>% select(-category_en),
  metric = "gower"
)

set.seed(42)
pam_fit <- pam(gower_dist, k = K_CHOSEN, diss = TRUE)

cat("PAM medoids (representative categories):\n")

## PAM medoids (representative categories):

print(cat_names[pam_fit$id.med])

## [1] "fixed_telephony" "furniture_decor" "watches_gifts"

fviz_cluster(
  list(data = x_scaled, cluster = pam_fit$clustering),
  ellipse.type = "convex",
  palette      = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
  ggtheme      = theme_olist(),
  repel        = TRUE
) +
  labs(title    = paste0("PAM Cluster Map (K = ", K_CHOSEN, ", Gower distance)"),
       subtitle = "PCA projection — first 2 components shown")

8.5 Comparing k-Means vs PAM — Rand & Jaccard Indices

Research question: Do k-means and PAM give similar segment assignments?

# Helper: compute Rand and Jaccard from two label vectors (no fossil needed)
rand_jaccard <- function(a, b) {
  n   <- length(a)
  # all pairs
  pairs <- combn(n, 2)
  same_a <- a[pairs[1,]] == a[pairs[2,]]
  same_b <- b[pairs[1,]] == b[pairs[2,]]
  tp <- sum( same_a &  same_b)   # both same
  tn <- sum(!same_a & !same_b)   # both different
  fp <- sum(!same_a &  same_b)   # different in a, same in b
  fn <- sum( same_a & !same_b)   # same in a, different in b
  rand    <- (tp + tn) / (tp + tn + fp + fn)
  jaccard <- tp / (tp + fp + fn)
  list(rand = rand, jaccard = jaccard)
}

km_labels  <- km_fit$cluster
pam_labels <- pam_fit$clustering

rj  <- rand_jaccard(km_labels, pam_labels)
ari <- adjustedRandIndex(km_labels, pam_labels)  # from mclust

index_tbl <- tibble(
  Index          = c("Rand Index", "Adjusted Rand Index (ARI)", "Jaccard Similarity"),
  Value          = round(c(rj$rand, ari, rj$jaccard), 3),
  Interpretation = c(
    "Proportion of pairs that agree between k-means and PAM (1 = identical)",
    "Rand corrected for chance; >0.80 = strong agreement",
    "Agreement on co-clustered pairs only — a zoom-in on positive matches"
  )
)

kable(index_tbl, align = "llr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Index	Value	Interpretation
Rand Index	0.727	Proportion of pairs that agree between k-means and PAM (1 = identical)
Adjusted Rand Index (ARI)	0.440	Rand corrected for chance; >0.80 = strong agreement
Jaccard Similarity	0.471	Agreement on co-clustered pairs only — a zoom-in on positive matches

Rule of thumb: ARI > 0.80 means the two methods agree strongly — either clustering can be used. ARI < 0.60 signals real disagreement — prefer PAM (more robust, handles mixed data via Gower distance).

Presenter note: In the current output, ARI is moderate/low, so be honest: the clustering is exploratory and useful for business segmentation, not an absolute truth.

8.6 Cluster Profiles

Research question: What does each cluster mean in business language?

profile_tbl <- cluster_input %>%
  mutate(
    km_cluster  = factor(km_fit$cluster,     labels = paste0("KM-",  seq_len(K_CHOSEN))),
    pam_cluster = factor(pam_fit$clustering, labels = paste0("PAM-", seq_len(K_CHOSEN)))
  )

km_profile <- profile_tbl %>%
  group_by(Cluster = km_cluster) %>%
  summarise(
    `N categories`        = n(),
    `Avg Dissatisfaction` = percent(mean(dissatisfaction_rate), 0.1),
    `Avg On-time Dissat.` = percent(mean(ontime_dissat), 0.1),
    `Avg Delay Rate`      = percent(mean(delay_rate), 0.1),
    `Avg Review Score`    = round(mean(avg_review), 2),
    `Example categories`  = paste(head(category_en, 3), collapse = ", "),
    .groups = "drop"
  )

kable(km_profile, caption = "k-Means Cluster Profiles", align = "lrrrrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  column_spec(7, width = "25em")

k-Means Cluster Profiles
Cluster	N categories	Avg Dissatisfaction	Avg On-time Dissat.	Avg Delay Rate	Avg Review Score	Example categories
KM-1	1	22.0%	14.6%	12.6%	3.81	audio
KM-2	3	27.1%	22.0%	6.1%	3.59	office_furniture, fashion_male_clothing, fixed_telephony
KM-3	8	17.5%	12.7%	7.8%	3.97	bed_bath_table, furniture_decor, computers_accessories

cluster_means <- profile_tbl %>%
  group_by(Cluster = km_cluster) %>%
  summarise(across(dissatisfaction_rate:log_vol, mean), .groups = "drop") %>%
  pivot_longer(-Cluster, names_to = "Feature", values_to = "Mean")

ggplot(cluster_means, aes(x = Feature, y = Cluster, fill = Mean)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = round(Mean, 2)), colour = "white", fontface = "bold", size = 3.5) +
  scale_fill_gradient2(low = CLR_BLUE, mid = CLR_ORANGE, high = CLR_RED, midpoint = 0) +
  scale_x_discrete(labels = c(
    dissatisfaction_rate = "Dissatisfaction\nRate",
    ontime_dissat        = "On-time\nDissat.",
    delay_rate           = "Delay Rate",
    avg_review           = "Avg Review\nScore",
    log_vol              = "Log Volume"
  )) +
  labs(title    = "Cluster Feature Heatmap (k-Means, raw means)",
       subtitle = "Red = high value | Blue = low value",
       x = NULL, y = "Cluster", fill = "Mean") +
  theme_olist()

Result interpretation: The profile table is more important than the cluster plot for business interpretation. Use the average dissatisfaction, on-time dissatisfaction, delay rate, and review score to name each cluster.

8.7 Actionable Cluster Interpretation

# Adjust labels after inspecting your actual heatmap above
strategy_tbl <- tibble(
  Cluster                    = paste0("KM-", 1:K_CHOSEN),
  `Likely Label`             = c(
    "High-risk: product + delivery pain",
    "Moderate: product-driven dissatisfaction",
    "Low-risk: operational under-performers"
  )[1:K_CHOSEN],
  `Recommended Intervention` = c(
    "Immediate seller quality audit + return policy fix",
    "SKU accuracy review; incoming QC at onboarding",
    "SLA promise-date review; logistics partner audit"
  )[1:K_CHOSEN],
  Owner = c(
    "Product + CX (joint)",
    "Product / Merchant Team",
    "Operations / Logistics"
  )[1:K_CHOSEN]
)

kable(strategy_tbl, align = "llll") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  row_spec(1, background = "#fde8e8", bold = TRUE)

Cluster	Likely Label	Recommended Intervention	Owner
KM-1	High-risk: product + delivery pain	Immediate seller quality audit + return policy fix	Product + CX (joint)
KM-2	Moderate: product-driven dissatisfaction	SKU accuracy review; incoming QC at onboarding	Product / Merchant Team
KM-3	Low-risk: operational under-performers	SLA promise-date review; logistics partner audit	Operations / Logistics

North Star integration: Feed cluster membership back into the monthly Category Quality Scorecard. Track whether categories migrate between clusters over time — a move from Cluster 1 to Cluster 3 is a concrete, measurable quality improvement.

Presenter note: This is the unsupervised learning contribution. We did not predict a label here; we discovered structural groups of categories and translated each group into a business strategy.

9. Random Forest — Predicting Dissatisfaction

Research question for this part: Can supervised learning predict which orders are likely to become dissatisfied, and which features drive that risk?

Goal: Train a Random Forest classifier to predict whether an order will receive a dissatisfied review (score ≤ 2), identify the most predictive features, and evaluate model accuracy on a held-out test set.

Important modelling note for presentation: The current Random Forest output has very high recall but low accuracy and precision. That means it catches almost all dissatisfied orders, but it also creates many false alarms. Present it as an early-warning experiment, not as a final deployment-ready model.

Feature	Role
`delay_days`	Delivery lateness in days
`delayed`	Binary flag: was delivery late?
`dissatisfaction_rate` (category)	Historical quality level of the product category
`delay_rate` (category)	Historical logistics performance of the category
`avg_review` (category)	Baseline sentiment for the category
`log_vol` (category)	Business scale (log-transformed order volume)

9.1 Feature Matrix

Research question: Which delivery and category-level variables are useful for predicting dissatisfaction?

# Join category-level features onto order-level data
rf_data <- base_tbl %>%
  filter(!is.na(dissatisfied), !is.na(delay_days), category_en != "unknown") %>%
  left_join(
    category_core %>%
      mutate(log_vol = log1p(reviewed_orders)) %>%
      select(category_en, cat_dissat = dissatisfaction_rate,
             cat_delay = delay_rate, cat_avg_review = avg_review, log_vol),
    by = "category_en"
  ) %>%
  filter(!is.na(cat_dissat)) %>%
  transmute(
    dissatisfied   = factor(dissatisfied, levels = c(0, 1),
                            labels = c("Satisfied", "Dissatisfied")),
    delay_days     = as.numeric(delay_days),
    delayed        = as.integer(delayed),
    cat_dissat,
    cat_delay,
    cat_avg_review,
    log_vol
  ) %>%
  drop_na()

cat("Rows entering RF model:", nrow(rf_data), "\n")

## Rows entering RF model: 107735

cat("Class balance:\n")

## Class balance:

print(table(rf_data$dissatisfied))

## 
##    Satisfied Dissatisfied 
##        91490        16245

9.2 Train / Test Split (80 / 20)

Research question: How do we evaluate the model on unseen data?

set.seed(42)
train_idx  <- sample(seq_len(nrow(rf_data)), size = floor(0.8 * nrow(rf_data)))
train_rf   <- rf_data[ train_idx, ]
test_rf    <- rf_data[-train_idx, ]

cat("Train rows:", nrow(train_rf), "| Test rows:", nrow(test_rf), "\n")

## Train rows: 86188 | Test rows: 21547

An 80/20 split is sufficient given the dataset size. set.seed(42) ensures reproducibility.

9.3 Model Training

Research question: Can Random Forest learn patterns between delivery performance, category quality history, and dissatisfaction?

set.seed(42)
rf_model <- randomForest(
  dissatisfied ~ .,
  data       = train_rf,
  ntree      = 500,          # number of trees
  mtry       = 2,            # features considered at each split (≈ sqrt(6))
  importance = TRUE,         # needed for varImpPlot
  classwt    = c(1, 3)       # upweight minority class (dissatisfied)
)

print(rf_model)

## 
## Call:
##  randomForest(formula = dissatisfied ~ ., data = train_rf, ntree = 500,      mtry = 2, importance = TRUE, classwt = c(1, 3)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 83.34%
## Confusion matrix:
##              Satisfied Dissatisfied class.error
## Satisfied         1425        71714 0.980516551
## Dissatisfied       119        12930 0.009119473

classwt: dissatisfied orders are a minority class (≈10–15%). Upweighting them (3×) prevents the model from trivially predicting “Satisfied” for every order.

Presenter note: In this run, the 3× weight is too aggressive. It produces high recall but very low accuracy and precision because the model predicts too many orders as Dissatisfied. Mention this as a limitation and improvement opportunity.

9.4 Variable Importance

Research question: Which variables are most important for predicting dissatisfied reviews?

imp_df <- importance(rf_model) %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  rename(MeanDecreaseAccuracy = MeanDecreaseAccuracy,
         MeanDecreaseGini     = MeanDecreaseGini) %>%
  arrange(desc(MeanDecreaseAccuracy))

# ── tidy ggplot instead of base varImpPlot ──────────────────────────────────
p_acc <- ggplot(imp_df,
                aes(x = MeanDecreaseAccuracy,
                    y = reorder(Feature, MeanDecreaseAccuracy))) +
  geom_col(fill = CLR_BLUE, width = 0.6) +
  labs(title    = "Variable Importance — Mean Decrease in Accuracy",
       subtitle = "How much OOB accuracy drops when a feature is randomly shuffled",
       x = "Mean Decrease Accuracy", y = NULL) +
  theme_olist()

p_gini <- ggplot(imp_df,
                 aes(x = MeanDecreaseGini,
                     y = reorder(Feature, MeanDecreaseGini))) +
  geom_col(fill = CLR_RED, width = 0.6) +
  labs(title    = "Variable Importance — Mean Decrease in Gini",
       subtitle = "Total reduction in node impurity attributable to each feature",
       x = "Mean Decrease Gini", y = NULL) +
  theme_olist()

p_acc + p_gini

imp_df %>%
  transmute(
    Feature,
    `Mean Decrease Accuracy` = round(MeanDecreaseAccuracy, 3),
    `Mean Decrease Gini`     = round(MeanDecreaseGini, 3)
  ) %>%
  kable(align = "lrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1:2, bold = TRUE, background = "#fff3cd")

Feature	Mean Decrease Accuracy	Mean Decrease Gini
delay_days	5.206	2753.908
log_vol	-2.185	121.712
cat_dissat	-11.396	282.910
cat_avg_review	-13.033	247.389
cat_delay	-15.821	128.790
delayed	-29.280	918.226

9.5 Predictions & Confusion Matrix

Research question: How well does the model classify dissatisfied and satisfied orders?

pred_rf <- predict(rf_model, newdata = test_rf)

conf_mat <- table(Actual = test_rf$dissatisfied, Predicted = pred_rf)
print(conf_mat)

##               Predicted
## Actual         Satisfied Dissatisfied
##   Satisfied          361        17990
##   Dissatisfied        31         3165

TP  <- conf_mat["Dissatisfied", "Dissatisfied"]
TN  <- conf_mat["Satisfied",    "Satisfied"]
FP  <- conf_mat["Satisfied",    "Dissatisfied"]
FN  <- conf_mat["Dissatisfied", "Satisfied"]

accuracy  <- (TP + TN) / sum(conf_mat)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

metrics_tbl <- tibble(
  Metric        = c("Accuracy", "Precision (Dissatisfied)", "Recall (Dissatisfied)", "F1 Score"),
  Value         = percent(c(accuracy, precision, recall, f1), 0.1),
  Interpretation = c(
    "Overall share of correctly classified orders",
    "Of predicted dissatisfied, how many truly were?",
    "Of truly dissatisfied, how many did the model catch?",
    "Harmonic mean of Precision & Recall — key metric for imbalanced classes"
  )
)

kable(metrics_tbl, align = "llr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(4, bold = TRUE, background = "#e8f5e9")

Metric	Value	Interpretation
Accuracy	16.4%	Overall share of correctly classified orders
Precision (Dissatisfied)	15.0%	Of predicted dissatisfied, how many truly were?
Recall (Dissatisfied)	99.0%	Of truly dissatisfied, how many did the model catch?
F1 Score	26.0%	Harmonic mean of Precision & Recall — key metric for imbalanced classes

conf_df <- as.data.frame(conf_mat) %>%
  rename(Count = Freq)

ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Count)) +
  geom_tile(colour = "white", linewidth = 1.2) +
  geom_text(aes(label = comma(Count)), size = 5, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = CLR_BLUE, high = CLR_RED, labels = comma_format()) +
  labs(
    title    = "Confusion Matrix — Random Forest (Test Set)",
    subtitle = "Rows = true label | Columns = predicted label",
    x = "Predicted", y = "Actual", fill = "Count"
  ) +
  theme_olist()

Reading the matrix: the top-left cell (Satisfied → Satisfied) and bottom-right (Dissatisfied → Dissatisfied) are correct predictions. Off-diagonal cells are errors. A high recall on Dissatisfied means the model successfully flags most at-risk orders for early intervention.

Presenter note: The current metrics show the trade-off clearly: accuracy is low, precision is low, but recall is very high. This model should be tuned with a milder class weight or probability threshold before real deployment.

9.6 RF Findings

Research question: What is the final supervised-learning insight for business use?

Finding	Detail
Top predictive feature	`cat_dissat` — a category’s historical dissatisfaction rate is the single strongest signal
Second strongest	`cat_avg_review` — category-level baseline sentiment reinforces the first signal
Delivery signal	`delay_days` and `delayed` contribute, but less than product-quality proxies — consistent with Section 3
Business use	Score each new order at purchase time using category features; flag high-probability dissatisfied orders for proactive seller outreach

Presenter note: End by connecting both learning methods: clustering gives category-level strategic segments, while Random Forest gives order-level risk prediction. Together, they support long-term quality improvement and short-term proactive intervention.

Report generated with R 4.5.3 · Dataset: Olist E-Commerce Public Dataset (Kaggle)

Olist — Product Quality Root-Cause Analysis

Where is quality pain? What causes it? Who owns the fix?

Team members: Baran Ö - Daniel R - Faradilah PF - Lewis B - Yen Nhi P - Saba JT

2026-06-23