Executive Summary

Presentation purpose: This report combines descriptive analytics, unsupervised learning (category clustering), and supervised learning (Random Forest) to turn Olist customer review data into business actions.

Main business question: Where is customer quality pain, what causes it, and who should fix it?

3 numbers every stakeholder needs before reading further.

Metric Value What it means
Overall dissatisfaction rate 16.5% of reviewed orders score ≤ 2 1 in ~6 customers is unhappy
On-time delivery, still dissatisfied 11.7% of orders delivered early/on-time still score ≤ 2 Delivery is not the dominant root cause
Return/refund complaints Highest dissatisfaction intensity at ~79% Process pain amplifies product defect impact

Analytical flow: Data construction → Category Risk → Root-Cause Split (delivery vs product) → Text Evidence → Priority Matrix → Recommendations → Unsupervised Clustering → Supervised Random Forest.

Research Questions

  1. Where is quality pain? Which product categories have the highest dissatisfaction and business risk?
  2. What causes it? Is dissatisfaction mainly driven by late delivery, product quality, wrong items, defects, or return/refund problems?
  3. Who owns the fix? Which business team should act first: Product/Merchant, Operations/Logistics, or Customer Experience?
  4. Can unsupervised learning help? Can product categories be grouped into meaningful risk segments without using a target label?
  5. Can supervised learning help? Can a Random Forest model predict orders that are likely to receive dissatisfied reviews?

Presenter note: Start by saying that the project is not only a technical model-building exercise. It is a business diagnosis: identify the pain, explain the cause, and convert the result into actions.


1. Data & Base Table

Research question for this part: How do we build one reliable analysis table from separate Olist files, and how do we define a dissatisfied customer?

Why this part matters: All later charts and models depend on this table. If the joins, review aggregation, or target definition are wrong, the business conclusions will also be wrong.

1.1 Construction

base_tbl <- order_items %>%
  left_join(product_lookup, by = "product_id") %>%
  left_join(orders_clean,  by = "order_id") %>%
  left_join(reviews_clean, by = "order_id") %>%
  mutate(category_en = replace_na(category_en, "unknown"))

Grain: one row = one order-item. Reviews are aggregated at order level then joined.

Target definition: a customer is treated as Dissatisfied when review_score <= 2. Scores 3, 4, and 5 are treated as non-dissatisfied for the supervised model.

Presenter note: Explain that the project uses review score as the main signal of dissatisfaction, but still checks review text because scores and comments do not always perfectly agree.

1.2 Data Quality Check — Score-Text Mismatch

Research question for this part: Can we trust the numeric review score alone, or do written comments reveal extra dissatisfaction?

negative_themes_vec <- c(
  "Product defect/low quality", "Wrong item/mismatch",
  "Packaging issue",            "Delivery delay/not arrived",
  "Seller/service issue",       "Return/refund process",
  "Price/freight concern"
)

classify_theme <- function(msg) {
  case_when(
    str_detect(msg, "quebrad|amassad|defeit|estrag|baixa qualidade|pessima qualidade|nao funciona|ruim") ~ "Product defect/low quality",
    str_detect(msg, "diferente|nao corresponde|nao e o que pedi|errad|cor errada|tamanho errado")        ~ "Wrong item/mismatch",
    str_detect(msg, "embalag|caixa|pacote|mal embalado|avariado")                                       ~ "Packaging issue",
    str_detect(msg, "atras|demor|prazo|entrega|tarde|nao chegou|nao recebi")                            ~ "Delivery delay/not arrived",
    str_detect(msg, "atendimento|suporte|resposta|vendedor|sac")                                        ~ "Seller/service issue",
    str_detect(msg, "cancel|estorno|reembolso|devolu|troca")                                            ~ "Return/refund process",
    str_detect(msg, "preco|caro|valor|frete")                                                           ~ "Price/freight concern",
    TRUE                                                                                                 ~ "Other/unclear"
  )
}

comment_tbl <- base_tbl %>%
  filter(!is.na(review_comment_message), category_en != "unknown") %>%
  mutate(
    msg   = review_comment_message %>%
              iconv(from = "", to = "ASCII//TRANSLIT") %>%
              str_to_lower() %>%
              str_replace_all("[^a-z0-9 ]", " ") %>%
              str_squish(),
    theme = classify_theme(msg)
  )

mismatch_tbl <- comment_tbl %>%
  mutate(
    negative_theme             = theme %in% negative_themes_vec,
    high_score_negative_text   = if_else(review_band == "High (4-5)" & negative_theme,  1L, 0L),
    low_score_nonnegative_text = if_else(review_band == "Low (1-2)"  & !negative_theme, 1L, 0L)
  ) %>%
  summarise(
    `Total comments with text`             = n(),
    `High score + negative text (rate)`    = percent(mean(high_score_negative_text),   0.1),
    `Low score + non-negative text (rate)` = percent(mean(low_score_nonnegative_text), 0.1)
  )

kable(mismatch_tbl, align = "lcc") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Total comments with text High score + negative text (rate) Low score + non-negative text (rate)
47672 26.8% 15.1%

Implication: ~27% of happy-scored orders contain complaint language, and ~15% of low-scored orders have non-negative text. We must use both numeric scores and text themes together.

Presenter note: This justifies why the project does not stop at average review scores. Some customers give a high score but still complain in the text, so the text themes help explain the real operational problem.


2. Category Risk Ranking

Research question for this part: Which product categories create the highest customer dissatisfaction and business risk?

Method: We group orders by product category and calculate review volume, average review score, dissatisfaction rate, delay rate, and a combined risk score. The risk score gives more weight to dissatisfaction but also includes volume because a large category can create more total customer pain.

MIN_ORDERS <- 100

category_core <- base_tbl %>%
  filter(!is.na(review_score), category_en != "unknown") %>%
  group_by(category_en) %>%
  summarise(
    reviewed_orders      = n_distinct(order_id),
    avg_review           = mean(review_score,  na.rm = TRUE),
    dissatisfaction_rate = mean(dissatisfied,  na.rm = TRUE),
    delay_rate           = mean(delayed,       na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(reviewed_orders >= MIN_ORDERS) %>%
  mutate(
    z_dissat   = as.numeric(scale(dissatisfaction_rate)),
    z_vol      = as.numeric(scale(log1p(reviewed_orders))),
    risk_score = 0.60 * z_dissat + 0.40 * z_vol
  ) %>%
  arrange(desc(risk_score))

top15 <- category_core %>% slice_head(n = 15)

ggplot(top15,
       aes(x = reorder(category_en, dissatisfaction_rate),
           y = dissatisfaction_rate,
           fill = reviewed_orders)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = percent(dissatisfaction_rate, 0.1)),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  coord_flip() +
  scale_y_continuous(labels = percent_format(0.1),
                     expand = expansion(mult = c(0, 0.12))) +
  scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = comma_format()) +
  labs(
    title    = "Top 15 Categories by Dissatisfaction Rate",
    subtitle = paste0("Categories with >= ", MIN_ORDERS, " reviewed orders | Colour = order volume"),
    x = NULL, y = "Dissatisfaction Rate (score <= 2)",
    fill = "Reviewed orders",
    caption = "Risk score = 60% dissatisfaction rate + 40% log(volume)"
  ) +
  theme_olist()

top15 %>%
  transmute(
    Category            = category_en,
    `Reviewed Orders`   = comma(reviewed_orders),
    `Avg Score`         = round(avg_review, 2),
    `Dissatisfaction %` = percent(dissatisfaction_rate, 0.1),
    `Delay Rate %`      = percent(delay_rate, 0.1),
    `Risk Score`        = round(risk_score, 2)
  ) %>%
  kable(align = "lrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1:3, bold = TRUE, background = "#fff3cd")
Category Reviewed Orders Avg Score Dissatisfaction % Delay Rate % Risk Score
office_furniture 1,273 3.48 26.3% 8.8% 1.66
fashion_male_clothing 112 3.62 28.8% 5.3% 1.33
bed_bath_table 9,417 3.87 19.4% 8.3% 1.19
furniture_decor 6,449 3.89 19.7% 8.3% 1.13
fixed_telephony 217 3.67 26.1% 4.2% 1.12
computers_accessories 6,689 3.92 18.8% 7.6% 1.00
watches_gifts 5,624 4.00 16.7% 8.1% 0.63
audio 350 3.81 22.0% 12.6% 0.63
telephony 4,199 3.93 17.2% 8.1% 0.63
housewares 5,884 4.04 16.0% 6.3% 0.55
baby 2,885 3.99 17.2% 8.5% 0.52
sports_leisure 7,720 4.09 15.0% 7.2% 0.47
garden_tools 3,518 4.03 16.3% 7.8% 0.44
auto 3,897 4.04 15.9% 8.1% 0.42
home_confort 397 3.83 20.3% 10.1% 0.41

Result interpretation: The table highlights the categories with the strongest dissatisfaction signal after filtering out very small categories. In presentation, focus on the first few rows rather than reading the whole table. The strongest categories here are the ones where dissatisfaction is both visible and business-relevant.

Presenter note: Say: “This ranking answers where the pain is. But it does not yet tell us why customers are unhappy. That is why the next section separates delivery-related pain from product-related pain.”


3. Root-Cause Split — Delivery vs Product

Research question for this part: Are dissatisfied reviews mainly caused by late delivery, or do customers still complain when delivery is on time?

Method: We split orders into delivery timing groups: on-time/early, late 1-3 days, late 4-7 days, and late more than 7 days. If dissatisfaction is high even in the on-time/early column, the issue is probably product-related rather than logistics-only.

top12_cats <- category_core %>% slice_head(n = 12) %>% pull(category_en)

delay_split <- base_tbl %>%
  filter(!is.na(review_score), category_en %in% top12_cats, !is.na(delay_days)) %>%
  mutate(
    delivery_segment = case_when(
      delay_days <= 0 ~ "On-time / Early",
      delay_days <= 3 ~ "Late 1-3 days",
      delay_days <= 7 ~ "Late 4-7 days",
      TRUE            ~ "Late > 7 days"
    ),
    delivery_segment = factor(delivery_segment,
      levels = c("On-time / Early", "Late 1-3 days", "Late 4-7 days", "Late > 7 days"))
  ) %>%
  group_by(category_en, delivery_segment) %>%
  summarise(
    n_orders             = n_distinct(order_id),
    dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(n_orders >= 30)

ggplot(delay_split,
       aes(x = delivery_segment,
           y = reorder(category_en, dissatisfaction_rate),
           fill = dissatisfaction_rate)) +
  geom_tile(colour = "white", linewidth = 0.8) +
  geom_text(aes(label = if_else(n_orders >= 30, percent(dissatisfaction_rate, 1), "")),
            size = 3, colour = "white", fontface = "bold") +
  scale_fill_gradient2(low = CLR_GREEN, mid = CLR_ORANGE, high = CLR_RED,
                       midpoint = 0.20, labels = percent_format(1)) +
  labs(
    title    = "Dissatisfaction Rate by Delivery Timing",
    subtitle = "Cells with < 30 orders hidden | % shown in white = dissatisfaction rate",
    x = "Delivery Segment", y = NULL, fill = "Dissatisfaction",
    caption = "Key: High dissatisfaction in 'On-time/Early' column -> product issue, not logistics"
  ) +
  theme_olist() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1))

verdict_tbl <- base_tbl %>%
  filter(!is.na(dissatisfied), category_en %in% top12_cats, !is.na(delay_days)) %>%
  mutate(ontime = delay_days <= 0) %>%
  group_by(category_en, ontime) %>%
  summarise(dissat = mean(dissatisfied), n = n_distinct(order_id), .groups = "drop") %>%
  pivot_wider(names_from = ontime, values_from = c(dissat, n), names_prefix = "") %>%
  rename(
    ontime_dissat = dissat_TRUE,
    late_dissat   = dissat_FALSE,
    n_ontime      = n_TRUE,
    n_late        = n_FALSE
  ) %>%
  mutate(
    delta      = late_dissat - ontime_dissat,
    root_cause = case_when(
      ontime_dissat >= 0.15 & delta <  0.15 ~ "Product-driven",
      ontime_dissat >= 0.15 & delta >= 0.15 ~ "Mixed (Product + Ops)",
      ontime_dissat <  0.15 & delta >= 0.15 ~ "Ops/Delivery-driven",
      TRUE                                   ~ "Low risk"
    )
  ) %>%
  arrange(desc(ontime_dissat))

verdict_tbl %>%
  transmute(
    Category               = category_en,
    `On-time Dissatisfied` = percent(ontime_dissat, 0.1),
    `Late Dissatisfied`    = percent(late_dissat, 0.1),
    `Gap (Late - On-time)` = percent(delta, 0.1),
    `Root Cause`           = root_cause
  ) %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(which(str_detect(verdict_tbl$root_cause, "Product")), background = "#fde8e8")
Category On-time Dissatisfied Late Dissatisfied Gap (Late - On-time) Root Cause
office_furniture 22.6% 57.0% 34.5% Mixed (Product + Ops)
fashion_male_clothing 22.0% 85.7% 63.7% Mixed (Product + Ops)
fixed_telephony 21.3% 72.7% 51.4% Mixed (Product + Ops)
furniture_decor 15.2% 53.6% 38.5% Mixed (Product + Ops)
bed_bath_table 15.2% 56.8% 41.7% Mixed (Product + Ops)
audio 14.6% 71.7% 57.2% Ops/Delivery-driven
computers_accessories 14.0% 54.5% 40.5% Ops/Delivery-driven
telephony 12.7% 50.1% 37.5% Ops/Delivery-driven
housewares 12.1% 50.8% 38.7% Ops/Delivery-driven
watches_gifts 11.3% 58.6% 47.3% Ops/Delivery-driven
baby 11.0% 60.7% 49.7% Ops/Delivery-driven
sports_leisure 9.9% 56.6% 46.7% Ops/Delivery-driven

Result interpretation: Categories marked as Product-driven have high dissatisfaction even when orders arrive on time. Categories marked as Ops/Delivery-driven become much worse mainly when delivery is late. Mixed categories need both product and logistics actions.

Presenter note: This is the key business diagnosis: delivery delay matters, but it is not the whole story. The executive summary already shows that 11.7% of on-time orders are still dissatisfied, so product quality and process issues must be investigated.


4. Text Evidence

Research question for this part: What are dissatisfied customers actually complaining about?

Method: Review comments are grouped into simple business themes using keywords. This is not a complex NLP model, but it is useful because it converts raw text into interpretable complaint categories such as product defect, wrong item, packaging, delivery, and return/refund issues.

4.1 Theme Distribution

theme_summary <- comment_tbl %>%
  group_by(theme) %>%
  summarise(
    comments             = n(),
    dissatisfaction_rate = mean(dissatisfied, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(dissatisfaction_rate))

ggplot(theme_summary %>% filter(theme != "Other/unclear"),
       aes(x = comments, y = dissatisfaction_rate,
           colour = dissatisfaction_rate, size = comments)) +
  geom_point(alpha = 0.85) +
  geom_text_repel(aes(label = theme), size = 3.2, max.overlaps = 20,
                  colour = "grey20", fontface = "bold") +
  scale_x_log10(labels = comma_format()) +
  scale_y_continuous(labels = percent_format(1), limits = c(0, 1)) +
  scale_colour_gradient(low = CLR_ORANGE, high = CLR_RED, guide = "none") +
  scale_size_continuous(range = c(4, 14), guide = "none") +
  labs(
    title    = "Complaint Themes: Volume vs Dissatisfaction Intensity",
    subtitle = "Bubble size = comment volume | Top-right = highest priority",
    x = "Number of Comments (log scale)", y = "Dissatisfaction Rate",
    caption = "'Other/unclear' excluded for clarity"
  ) +
  theme_olist()

theme_summary %>%
  filter(theme != "Other/unclear") %>%
  transmute(
    Theme                  = theme,
    `Comment Volume`       = comma(comments),
    `Dissatisfaction Rate` = percent(dissatisfaction_rate, 0.1),
    Priority = case_when(
      dissatisfaction_rate >= 0.65 ~ "Critical",
      dissatisfaction_rate >= 0.45 ~ "High",
      TRUE                         ~ "Medium"
    )
  ) %>%
  kable(align = "lrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(
    which(theme_summary %>% filter(theme != "Other/unclear") %>% pull(dissatisfaction_rate) >= 0.65),
    background = "#fde8e8", bold = TRUE
  )
Theme Comment Volume Dissatisfaction Rate Priority
Return/refund process 808 79.2% Critical
Wrong item/mismatch 1,228 68.2% Critical
Product defect/low quality 1,300 63.4% High
Packaging issue 1,074 50.5% High
Seller/service issue 1,420 41.5% Medium
Price/freight concern 935 33.3% Medium
Delivery delay/not arrived 14,890 21.5% Medium

Result interpretation: The most important themes are not only those with many comments, but those with high dissatisfaction intensity. A theme with a very high dissatisfaction rate should be treated as a serious root cause even if the volume is smaller.

Presenter note: Emphasize that text evidence explains the “why” behind the numbers. For example, wrong item, product defects, and return/refund problems point to different business owners and different fixes.

4.2 Theme x Category

theme_cat <- comment_tbl %>%
  filter(category_en %in% top12_cats, theme != "Other/unclear") %>%
  group_by(category_en, theme) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(category_en) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

ggplot(theme_cat,
       aes(x = theme, y = reorder(category_en, n), fill = share, size = n)) +
  geom_point(shape = 21, colour = "white") +
  scale_fill_gradient(low = CLR_ORANGE, high = CLR_RED, labels = percent_format(1)) +
  scale_size_continuous(range = c(2, 12)) +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  labs(
    title    = "Complaint Theme Mix by Category",
    subtitle = "Bubble size = raw count | Colour = share within category",
    x = NULL, y = NULL,
    fill = "Share of category\ncomplaints", size = "Comment count",
    caption = "Larger + darker = dominant complaint type for that category"
  ) +
  theme_olist() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1, size = 9))

Result interpretation: This chart shows which complaint theme dominates each risky category. It helps convert a category-level problem into a specific action: fix product listing accuracy, inspect seller quality, improve packaging, review delivery SLA, or simplify returns.


5. Priority Matrix

Research question for this part: Which categories should management fix first?

Method: The matrix compares dissatisfaction intensity on the x-axis with reviewed order volume on the y-axis. This creates four business quadrants: FIX NOW, INVESTIGATE, MONITOR, and WATCH.

x_mid <- median(category_core$dissatisfaction_rate)
y_mid <- median(log10(category_core$reviewed_orders))

matrix_data <- category_core %>%
  mutate(
    log_vol  = log10(reviewed_orders),
    quadrant = case_when(
      dissatisfaction_rate >= x_mid & log_vol >= y_mid ~ "FIX NOW",
      dissatisfaction_rate >= x_mid & log_vol <  y_mid ~ "INVESTIGATE",
      dissatisfaction_rate <  x_mid & log_vol >= y_mid ~ "MONITOR",
      TRUE                                              ~ "WATCH"
    ),
    q_colour   = case_when(
      quadrant == "FIX NOW"     ~ CLR_RED,
      quadrant == "INVESTIGATE" ~ CLR_ORANGE,
      quadrant == "MONITOR"     ~ CLR_BLUE,
      TRUE                      ~ CLR_GREY
    ),
    label_flag = quadrant %in% c("FIX NOW", "INVESTIGATE") |
                   risk_score > quantile(risk_score, 0.85, na.rm = TRUE)
  )

ggplot(matrix_data,
       aes(x = dissatisfaction_rate, y = log_vol,
           colour = quadrant, size = reviewed_orders)) +
  annotate("rect", xmin = x_mid, xmax = Inf,  ymin = y_mid, ymax = Inf,  fill = CLR_RED,    alpha = 0.05) +
  annotate("rect", xmin = x_mid, xmax = Inf,  ymin = -Inf,  ymax = y_mid, fill = CLR_ORANGE, alpha = 0.05) +
  annotate("rect", xmin = -Inf,  xmax = x_mid, ymin = y_mid, ymax = Inf,  fill = CLR_BLUE,   alpha = 0.05) +
  annotate("rect", xmin = -Inf,  xmax = x_mid, ymin = -Inf,  ymax = y_mid, fill = CLR_GREY,   alpha = 0.05) +
  annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
           y = max(matrix_data$log_vol) * 1.01,
           label = "FIX NOW", hjust = 1, fontface = "bold", colour = CLR_RED, size = 4) +
  annotate("text", x = max(matrix_data$dissatisfaction_rate) * 0.97,
           y = y_mid * 0.97,
           label = "INVESTIGATE", hjust = 1, fontface = "bold", colour = CLR_ORANGE, size = 4) +
  annotate("text", x = x_mid * 1.02,
           y = max(matrix_data$log_vol) * 1.01,
           label = "MONITOR", hjust = 0, fontface = "bold", colour = CLR_BLUE, size = 4) +
  geom_point(alpha = 0.75) +
  geom_text_repel(
    data = matrix_data %>% filter(label_flag),
    aes(label = str_replace_all(category_en, "_", " ")),
    size = 3, fontface = "bold", max.overlaps = 20, colour = "grey20"
  ) +
  geom_vline(xintercept = x_mid, linetype = "dashed", colour = "grey50") +
  geom_hline(yintercept = y_mid, linetype = "dashed", colour = "grey50") +
  scale_x_continuous(labels = percent_format(1)) +
  scale_y_continuous(labels = function(x) comma(10^x)) +
  scale_colour_manual(values = c(
    "FIX NOW" = CLR_RED, "INVESTIGATE" = CLR_ORANGE,
    "MONITOR" = CLR_BLUE, "WATCH" = CLR_GREY
  )) +
  scale_size_continuous(range = c(2, 10), labels = comma_format()) +
  labs(
    title    = "Category Priority Matrix",
    subtitle = "X = dissatisfaction intensity | Y = reviewed order volume",
    x = "Dissatisfaction Rate (score <= 2)", y = "Reviewed Orders (log scale)",
    colour = "Quadrant", size = "Reviewed orders",
    caption = "Dashed lines = median split"
  ) +
  theme_olist()

Result interpretation: Categories in FIX NOW combine high dissatisfaction with high volume, so they deserve the strongest management attention. INVESTIGATE categories may have high dissatisfaction but lower volume, so they need deeper review before large investment.

Presenter note: Say: “This matrix turns analysis into prioritization. It prevents us from fixing only the loudest problem or only the biggest category; we focus on the categories that combine pain and scale.”


6. Recommendations

Research question for this part: Who owns the fix, and what action should be taken for each risky category?

Method: This table combines the priority quadrant, root-cause label, dominant complaint theme, business owner, and recommended action.

reco_tbl <- matrix_data %>%
  filter(quadrant %in% c("FIX NOW", "INVESTIGATE")) %>%
  left_join(verdict_tbl %>% select(category_en, root_cause, ontime_dissat), by = "category_en") %>%
  left_join(
    theme_cat %>%
      group_by(category_en) %>%
      slice_max(n, n = 1) %>%
      select(category_en, top_complaint = theme),
    by = "category_en"
  ) %>%
  mutate(
    owner = case_when(
      str_detect(root_cause, "Product") ~ "Product / Merchant Team",
      str_detect(root_cause, "Ops")     ~ "Operations / Logistics",
      str_detect(root_cause, "Mixed")   ~ "Product + Operations (joint)",
      TRUE                               ~ "TBD"
    ),
    recommended_action = case_when(
      str_detect(top_complaint, "Wrong item|mismatch") ~
        "Audit product listing accuracy (title, photo, spec); enforce SKU mapping checks pre-shipment.",
      str_detect(top_complaint, "defect|low quality") ~
        "Add incoming quality inspection at seller onboarding; require quality cert for high-volume sellers.",
      str_detect(top_complaint, "Packaging") ~
        "Define packaging standards by product size/weight; penalise sellers with >10% packaging complaints.",
      str_detect(top_complaint, "Delivery") ~
        "Review SLA promise-dates; flag sellers with >15% late rate for performance review.",
      str_detect(top_complaint, "Return|refund") ~
        "Simplify return policy; add one-click return initiation in post-delivery email.",
      TRUE ~ "Deep-dive review text sample; run seller scorecard audit."
    )
  ) %>%
  arrange(quadrant, desc(dissatisfaction_rate))

reco_tbl %>%
  transmute(
    Quadrant             = quadrant,
    Category             = str_replace_all(category_en, "_", " "),
    `Dissat. Rate`       = percent(dissatisfaction_rate, 0.1),
    Volume               = comma(reviewed_orders),
    `Root Cause`         = root_cause,
    `Top Complaint`      = top_complaint,
    Owner                = owner,
    `Recommended Action` = recommended_action
  ) %>%
  kable(align = "llrrlll") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE, font_size = 12) %>%
  row_spec(which(reco_tbl$quadrant == "FIX NOW"), background = "#fff0f0", bold = TRUE) %>%
  column_spec(8, width = "30em")
Quadrant Category Dissat. Rate Volume Root Cause Top Complaint Owner Recommended Action
FIX NOW office furniture 26.3% 1,273 Mixed (Product + Ops) Delivery delay/not arrived Product / Merchant Team Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW furniture decor 19.7% 6,449 Mixed (Product + Ops) Delivery delay/not arrived Product / Merchant Team Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW bed bath table 19.4% 9,417 Mixed (Product + Ops) Delivery delay/not arrived Product / Merchant Team Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW computers accessories 18.8% 6,689 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW telephony 17.2% 4,199 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW baby 17.2% 2,885 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW watches gifts 16.7% 5,624 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW garden tools 16.3% 3,518 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
FIX NOW housewares 16.0% 5,884 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
FIX NOW electronics 16.0% 2,550 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
FIX NOW consoles games 16.0% 1,062 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE fashion male clothing 28.8% 112 Mixed (Product + Ops) Delivery delay/not arrived Product / Merchant Team Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE fixed telephony 26.1% 217 Mixed (Product + Ops) Delivery delay/not arrived Product / Merchant Team Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE audio 22.0% 350 Ops/Delivery-driven Delivery delay/not arrived Operations / Logistics Review SLA promise-dates; flag sellers with >15% late rate for performance review.
INVESTIGATE construction tools safety 20.6% 167 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE home confort 20.3% 397 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE air conditioning 19.2% 253 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE art 18.7% 202 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE furniture living room 18.5% 422 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE fashion underwear beach 18.3% 121 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE home construction 18.2% 490 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE costruction tools garden 17.6% 194 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE christmas supplies 17.6% 128 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE construction tools lights 17.1% 244 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE drinks 16.1% 297 NA NA TBD Deep-dive review text sample; run seller scorecard audit.
INVESTIGATE agro industry and commerce 16.0% 182 NA NA TBD Deep-dive review text sample; run seller scorecard audit.

How to use this table: This is the action list for managers. Each row connects an analytical signal to an owner and a concrete operational fix.

Presenter note: Do not read every row slowly. Pick one FIX NOW category and explain the logic from left to right: category → dissatisfaction rate → root cause → complaint theme → owner → action.


7. Summary of Findings

Research question for this part: What should stakeholders remember after seeing all the evidence?

# Finding Evidence Action Owner
1 Product quality, not delivery, drives most dissatisfaction 11.7% of on-time orders still score <= 2 Product / Merchant Team
2 fashion_male_clothing, office_furniture, fixed_telephony have highest dissatisfaction rate Section 2 ranking Product Quality
3 bed_bath_table and furniture_decor are highest business risk due to volume x rate Priority Matrix Product Quality
4 Wrong item/mismatch and product defects are most destructive complaint types (63-68% dissat.) Text theme analysis Listing accuracy & QC
5 Return/refund friction amplifies every other problem (79% dissat. rate) Theme table Customer Experience
6 27% of satisfied customers still wrote complaint text — tracking scores alone underestimates real pain Mismatch check Analytics / Monitoring
7 Clustering turns many categories into a few structural risk segments Section 8 clustering Product + Operations
8 Random Forest can be used as an early-warning model, but the current output must be interpreted carefully because recall is high while precision and accuracy are low Section 9 model evaluation Analytics / CX

North Star recommendation: Build a monthly Category Quality Scorecard combining dissatisfaction rate + on-time dissatisfaction + top complaint theme. Any category crossing 15% on-time dissatisfaction triggers an automatic seller audit.


8. Category Clustering — Structural Segments

Research question for this part: Can unsupervised learning group product categories into meaningful quality-risk segments without using a target label?

Goal: Group product categories by their quality-and-delivery profile so that each cluster receives a single targeted intervention.

Why clustering is useful here: The business has many product categories. Instead of making a separate strategy for every category, clustering reduces them into a small number of structural segments with similar risk patterns.

Feature Why it matters
dissatisfaction_rate Overall quality pain
ontime_dissat Product-driven pain (delivery-independent)
delay_rate Operational / logistics pain
avg_review General sentiment
log_vol Business scale

8.1 Feature Matrix

Research question: Which variables best describe the risk profile of a product category?

cluster_input <- category_core %>%
  left_join(verdict_tbl %>% select(category_en, ontime_dissat), by = "category_en") %>%
  filter(!is.na(ontime_dissat)) %>%
  mutate(log_vol = log1p(reviewed_orders)) %>%
  select(category_en, dissatisfaction_rate, ontime_dissat, delay_rate, avg_review, log_vol)

cat_names <- cluster_input$category_en

x_scaled <- cluster_input %>%
  select(-category_en) %>%
  scale()
rownames(x_scaled) <- cat_names

cat("Categories entering clustering:", nrow(x_scaled), "\n")
## Categories entering clustering: 12

Normalisation is mandatory for k-means: scale() converts every feature to mean 0, SD 1 so no variable dominates by sheer magnitude.

Presenter note: Explain that clustering uses distances. If we do not scale the variables, volume or review score could dominate the distance calculation and hide the true quality patterns.

8.2 Choosing K — Elbow & Silhouette

Research question: How many category segments should we use?

set.seed(42)

p_elbow <- fviz_nbclust(x_scaled, kmeans, method = "wss", k.max = 8,
                        linecolor = CLR_RED) +
  labs(title    = "Elbow Method (WSS)",
       subtitle = "Look for the elbow — point of diminishing returns",
       x = "Number of clusters K", y = "Total within-cluster SS") +
  theme_olist()

p_sil <- fviz_nbclust(x_scaled, kmeans, method = "silhouette", k.max = 8,
                      linecolor = CLR_BLUE) +
  labs(title    = "Average Silhouette Width",
       subtitle = "Higher = tighter, better-separated clusters",
       x = "Number of clusters K", y = "Average silhouette width") +
  theme_olist()

p_elbow + p_sil

# Change K_CHOSEN if your plots suggest a different number
K_CHOSEN <- 3
cat("Using K =", K_CHOSEN, "clusters\n")
## Using K = 3 clusters

Presenter note: K = 3 is easy to explain in business terms: high-risk, medium/moderate-risk, and lower-risk or operational-watch categories.

8.3 k-Means Clustering

Research question: What category groups are produced by k-means?

set.seed(42)
km_fit <- kmeans(x_scaled, centers = K_CHOSEN, nstart = 25, iter.max = 100)

cat("k-means variance explained (between-SS / total-SS):",
    round(km_fit$betweenss / km_fit$totss, 3), "\n")
## k-means variance explained (between-SS / total-SS): 0.826
fviz_cluster(km_fit,
             data         = x_scaled,
             ellipse.type = "convex",
             palette      = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
             ggtheme      = theme_olist(),
             repel        = TRUE) +
  labs(title    = paste0("k-Means Cluster Map (K = ", K_CHOSEN, ")"),
       subtitle = "PCA projection — first 2 components shown")

How to read the plot: Each point is a product category. Categories close together have similar dissatisfaction, delivery, review, and volume profiles. The PCA projection is only for visualization; the clustering uses the full scaled feature matrix.

8.4 PAM Clustering (Medoid-Based)

Research question: Are the category segments stable if we use a more robust clustering method?

PAM uses real data points as cluster centres (medoids), making it robust to outliers. We use Gower distance so the method handles mixed data types.

# Pass unscaled data — daisy() handles normalisation internally for Gower
gower_dist <- daisy(
  cluster_input %>% select(-category_en),
  metric = "gower"
)

set.seed(42)
pam_fit <- pam(gower_dist, k = K_CHOSEN, diss = TRUE)

cat("PAM medoids (representative categories):\n")
## PAM medoids (representative categories):
print(cat_names[pam_fit$id.med])
## [1] "fixed_telephony" "furniture_decor" "watches_gifts"
fviz_cluster(
  list(data = x_scaled, cluster = pam_fit$clustering),
  ellipse.type = "convex",
  palette      = c(CLR_RED, CLR_BLUE, CLR_GREEN, CLR_ORANGE)[1:K_CHOSEN],
  ggtheme      = theme_olist(),
  repel        = TRUE
) +
  labs(title    = paste0("PAM Cluster Map (K = ", K_CHOSEN, ", Gower distance)"),
       subtitle = "PCA projection — first 2 components shown")

8.5 Comparing k-Means vs PAM — Rand & Jaccard Indices

Research question: Do k-means and PAM give similar segment assignments?

# Helper: compute Rand and Jaccard from two label vectors (no fossil needed)
rand_jaccard <- function(a, b) {
  n   <- length(a)
  # all pairs
  pairs <- combn(n, 2)
  same_a <- a[pairs[1,]] == a[pairs[2,]]
  same_b <- b[pairs[1,]] == b[pairs[2,]]
  tp <- sum( same_a &  same_b)   # both same
  tn <- sum(!same_a & !same_b)   # both different
  fp <- sum(!same_a &  same_b)   # different in a, same in b
  fn <- sum( same_a & !same_b)   # same in a, different in b
  rand    <- (tp + tn) / (tp + tn + fp + fn)
  jaccard <- tp / (tp + fp + fn)
  list(rand = rand, jaccard = jaccard)
}

km_labels  <- km_fit$cluster
pam_labels <- pam_fit$clustering

rj  <- rand_jaccard(km_labels, pam_labels)
ari <- adjustedRandIndex(km_labels, pam_labels)  # from mclust

index_tbl <- tibble(
  Index          = c("Rand Index", "Adjusted Rand Index (ARI)", "Jaccard Similarity"),
  Value          = round(c(rj$rand, ari, rj$jaccard), 3),
  Interpretation = c(
    "Proportion of pairs that agree between k-means and PAM (1 = identical)",
    "Rand corrected for chance; >0.80 = strong agreement",
    "Agreement on co-clustered pairs only — a zoom-in on positive matches"
  )
)

kable(index_tbl, align = "llr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Index Value Interpretation
Rand Index 0.727 Proportion of pairs that agree between k-means and PAM (1 = identical)
Adjusted Rand Index (ARI) 0.440 Rand corrected for chance; >0.80 = strong agreement
Jaccard Similarity 0.471 Agreement on co-clustered pairs only — a zoom-in on positive matches

Rule of thumb: ARI > 0.80 means the two methods agree strongly — either clustering can be used. ARI < 0.60 signals real disagreement — prefer PAM (more robust, handles mixed data via Gower distance).

Presenter note: In the current output, ARI is moderate/low, so be honest: the clustering is exploratory and useful for business segmentation, not an absolute truth.

8.6 Cluster Profiles

Research question: What does each cluster mean in business language?

profile_tbl <- cluster_input %>%
  mutate(
    km_cluster  = factor(km_fit$cluster,     labels = paste0("KM-",  seq_len(K_CHOSEN))),
    pam_cluster = factor(pam_fit$clustering, labels = paste0("PAM-", seq_len(K_CHOSEN)))
  )

km_profile <- profile_tbl %>%
  group_by(Cluster = km_cluster) %>%
  summarise(
    `N categories`        = n(),
    `Avg Dissatisfaction` = percent(mean(dissatisfaction_rate), 0.1),
    `Avg On-time Dissat.` = percent(mean(ontime_dissat), 0.1),
    `Avg Delay Rate`      = percent(mean(delay_rate), 0.1),
    `Avg Review Score`    = round(mean(avg_review), 2),
    `Example categories`  = paste(head(category_en, 3), collapse = ", "),
    .groups = "drop"
  )

kable(km_profile, caption = "k-Means Cluster Profiles", align = "lrrrrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  column_spec(7, width = "25em")
k-Means Cluster Profiles
Cluster N categories Avg Dissatisfaction Avg On-time Dissat. Avg Delay Rate Avg Review Score Example categories
KM-1 1 22.0% 14.6% 12.6% 3.81 audio
KM-2 3 27.1% 22.0% 6.1% 3.59 office_furniture, fashion_male_clothing, fixed_telephony
KM-3 8 17.5% 12.7% 7.8% 3.97 bed_bath_table, furniture_decor, computers_accessories
cluster_means <- profile_tbl %>%
  group_by(Cluster = km_cluster) %>%
  summarise(across(dissatisfaction_rate:log_vol, mean), .groups = "drop") %>%
  pivot_longer(-Cluster, names_to = "Feature", values_to = "Mean")

ggplot(cluster_means, aes(x = Feature, y = Cluster, fill = Mean)) +
  geom_tile(colour = "white", linewidth = 0.6) +
  geom_text(aes(label = round(Mean, 2)), colour = "white", fontface = "bold", size = 3.5) +
  scale_fill_gradient2(low = CLR_BLUE, mid = CLR_ORANGE, high = CLR_RED, midpoint = 0) +
  scale_x_discrete(labels = c(
    dissatisfaction_rate = "Dissatisfaction\nRate",
    ontime_dissat        = "On-time\nDissat.",
    delay_rate           = "Delay Rate",
    avg_review           = "Avg Review\nScore",
    log_vol              = "Log Volume"
  )) +
  labs(title    = "Cluster Feature Heatmap (k-Means, raw means)",
       subtitle = "Red = high value | Blue = low value",
       x = NULL, y = "Cluster", fill = "Mean") +
  theme_olist()

Result interpretation: The profile table is more important than the cluster plot for business interpretation. Use the average dissatisfaction, on-time dissatisfaction, delay rate, and review score to name each cluster.

8.7 Actionable Cluster Interpretation

# Adjust labels after inspecting your actual heatmap above
strategy_tbl <- tibble(
  Cluster                    = paste0("KM-", 1:K_CHOSEN),
  `Likely Label`             = c(
    "High-risk: product + delivery pain",
    "Moderate: product-driven dissatisfaction",
    "Low-risk: operational under-performers"
  )[1:K_CHOSEN],
  `Recommended Intervention` = c(
    "Immediate seller quality audit + return policy fix",
    "SKU accuracy review; incoming QC at onboarding",
    "SLA promise-date review; logistics partner audit"
  )[1:K_CHOSEN],
  Owner = c(
    "Product + CX (joint)",
    "Product / Merchant Team",
    "Operations / Logistics"
  )[1:K_CHOSEN]
)

kable(strategy_tbl, align = "llll") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE) %>%
  row_spec(1, background = "#fde8e8", bold = TRUE)
Cluster Likely Label Recommended Intervention Owner
KM-1 High-risk: product + delivery pain Immediate seller quality audit + return policy fix Product + CX (joint)
KM-2 Moderate: product-driven dissatisfaction SKU accuracy review; incoming QC at onboarding Product / Merchant Team
KM-3 Low-risk: operational under-performers SLA promise-date review; logistics partner audit Operations / Logistics

North Star integration: Feed cluster membership back into the monthly Category Quality Scorecard. Track whether categories migrate between clusters over time — a move from Cluster 1 to Cluster 3 is a concrete, measurable quality improvement.

Presenter note: This is the unsupervised learning contribution. We did not predict a label here; we discovered structural groups of categories and translated each group into a business strategy.



9. Random Forest — Predicting Dissatisfaction

Research question for this part: Can supervised learning predict which orders are likely to become dissatisfied, and which features drive that risk?

Goal: Train a Random Forest classifier to predict whether an order will receive a dissatisfied review (score ≤ 2), identify the most predictive features, and evaluate model accuracy on a held-out test set.

Important modelling note for presentation: The current Random Forest output has very high recall but low accuracy and precision. That means it catches almost all dissatisfied orders, but it also creates many false alarms. Present it as an early-warning experiment, not as a final deployment-ready model.

Feature Role
delay_days Delivery lateness in days
delayed Binary flag: was delivery late?
dissatisfaction_rate (category) Historical quality level of the product category
delay_rate (category) Historical logistics performance of the category
avg_review (category) Baseline sentiment for the category
log_vol (category) Business scale (log-transformed order volume)

9.1 Feature Matrix

Research question: Which delivery and category-level variables are useful for predicting dissatisfaction?

# Join category-level features onto order-level data
rf_data <- base_tbl %>%
  filter(!is.na(dissatisfied), !is.na(delay_days), category_en != "unknown") %>%
  left_join(
    category_core %>%
      mutate(log_vol = log1p(reviewed_orders)) %>%
      select(category_en, cat_dissat = dissatisfaction_rate,
             cat_delay = delay_rate, cat_avg_review = avg_review, log_vol),
    by = "category_en"
  ) %>%
  filter(!is.na(cat_dissat)) %>%
  transmute(
    dissatisfied   = factor(dissatisfied, levels = c(0, 1),
                            labels = c("Satisfied", "Dissatisfied")),
    delay_days     = as.numeric(delay_days),
    delayed        = as.integer(delayed),
    cat_dissat,
    cat_delay,
    cat_avg_review,
    log_vol
  ) %>%
  drop_na()

cat("Rows entering RF model:", nrow(rf_data), "\n")
## Rows entering RF model: 107735
cat("Class balance:\n")
## Class balance:
print(table(rf_data$dissatisfied))
## 
##    Satisfied Dissatisfied 
##        91490        16245

9.2 Train / Test Split (80 / 20)

Research question: How do we evaluate the model on unseen data?

set.seed(42)
train_idx  <- sample(seq_len(nrow(rf_data)), size = floor(0.8 * nrow(rf_data)))
train_rf   <- rf_data[ train_idx, ]
test_rf    <- rf_data[-train_idx, ]

cat("Train rows:", nrow(train_rf), "| Test rows:", nrow(test_rf), "\n")
## Train rows: 86188 | Test rows: 21547

An 80/20 split is sufficient given the dataset size. set.seed(42) ensures reproducibility.

9.3 Model Training

Research question: Can Random Forest learn patterns between delivery performance, category quality history, and dissatisfaction?

set.seed(42)
rf_model <- randomForest(
  dissatisfied ~ .,
  data       = train_rf,
  ntree      = 500,          # number of trees
  mtry       = 2,            # features considered at each split (≈ sqrt(6))
  importance = TRUE,         # needed for varImpPlot
  classwt    = c(1, 3)       # upweight minority class (dissatisfied)
)

print(rf_model)
## 
## Call:
##  randomForest(formula = dissatisfied ~ ., data = train_rf, ntree = 500,      mtry = 2, importance = TRUE, classwt = c(1, 3)) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 83.34%
## Confusion matrix:
##              Satisfied Dissatisfied class.error
## Satisfied         1425        71714 0.980516551
## Dissatisfied       119        12930 0.009119473

classwt: dissatisfied orders are a minority class (≈10–15%). Upweighting them (3×) prevents the model from trivially predicting “Satisfied” for every order.

Presenter note: In this run, the 3× weight is too aggressive. It produces high recall but very low accuracy and precision because the model predicts too many orders as Dissatisfied. Mention this as a limitation and improvement opportunity.

9.4 Variable Importance

Research question: Which variables are most important for predicting dissatisfied reviews?

imp_df <- importance(rf_model) %>%
  as.data.frame() %>%
  rownames_to_column("Feature") %>%
  rename(MeanDecreaseAccuracy = MeanDecreaseAccuracy,
         MeanDecreaseGini     = MeanDecreaseGini) %>%
  arrange(desc(MeanDecreaseAccuracy))

# ── tidy ggplot instead of base varImpPlot ──────────────────────────────────
p_acc <- ggplot(imp_df,
                aes(x = MeanDecreaseAccuracy,
                    y = reorder(Feature, MeanDecreaseAccuracy))) +
  geom_col(fill = CLR_BLUE, width = 0.6) +
  labs(title    = "Variable Importance — Mean Decrease in Accuracy",
       subtitle = "How much OOB accuracy drops when a feature is randomly shuffled",
       x = "Mean Decrease Accuracy", y = NULL) +
  theme_olist()

p_gini <- ggplot(imp_df,
                 aes(x = MeanDecreaseGini,
                     y = reorder(Feature, MeanDecreaseGini))) +
  geom_col(fill = CLR_RED, width = 0.6) +
  labs(title    = "Variable Importance — Mean Decrease in Gini",
       subtitle = "Total reduction in node impurity attributable to each feature",
       x = "Mean Decrease Gini", y = NULL) +
  theme_olist()

p_acc + p_gini

imp_df %>%
  transmute(
    Feature,
    `Mean Decrease Accuracy` = round(MeanDecreaseAccuracy, 3),
    `Mean Decrease Gini`     = round(MeanDecreaseGini, 3)
  ) %>%
  kable(align = "lrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1:2, bold = TRUE, background = "#fff3cd")
Feature Mean Decrease Accuracy Mean Decrease Gini
delay_days 5.206 2753.908
log_vol -2.185 121.712
cat_dissat -11.396 282.910
cat_avg_review -13.033 247.389
cat_delay -15.821 128.790
delayed -29.280 918.226

9.5 Predictions & Confusion Matrix

Research question: How well does the model classify dissatisfied and satisfied orders?

pred_rf <- predict(rf_model, newdata = test_rf)

conf_mat <- table(Actual = test_rf$dissatisfied, Predicted = pred_rf)
print(conf_mat)
##               Predicted
## Actual         Satisfied Dissatisfied
##   Satisfied          361        17990
##   Dissatisfied        31         3165
TP  <- conf_mat["Dissatisfied", "Dissatisfied"]
TN  <- conf_mat["Satisfied",    "Satisfied"]
FP  <- conf_mat["Satisfied",    "Dissatisfied"]
FN  <- conf_mat["Dissatisfied", "Satisfied"]

accuracy  <- (TP + TN) / sum(conf_mat)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

metrics_tbl <- tibble(
  Metric        = c("Accuracy", "Precision (Dissatisfied)", "Recall (Dissatisfied)", "F1 Score"),
  Value         = percent(c(accuracy, precision, recall, f1), 0.1),
  Interpretation = c(
    "Overall share of correctly classified orders",
    "Of predicted dissatisfied, how many truly were?",
    "Of truly dissatisfied, how many did the model catch?",
    "Harmonic mean of Precision & Recall — key metric for imbalanced classes"
  )
)

kable(metrics_tbl, align = "llr") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(4, bold = TRUE, background = "#e8f5e9")
Metric Value Interpretation
Accuracy 16.4% Overall share of correctly classified orders
Precision (Dissatisfied) 15.0% Of predicted dissatisfied, how many truly were?
Recall (Dissatisfied) 99.0% Of truly dissatisfied, how many did the model catch?
F1 Score 26.0% Harmonic mean of Precision & Recall — key metric for imbalanced classes
conf_df <- as.data.frame(conf_mat) %>%
  rename(Count = Freq)

ggplot(conf_df, aes(x = Predicted, y = Actual, fill = Count)) +
  geom_tile(colour = "white", linewidth = 1.2) +
  geom_text(aes(label = comma(Count)), size = 5, fontface = "bold", colour = "white") +
  scale_fill_gradient(low = CLR_BLUE, high = CLR_RED, labels = comma_format()) +
  labs(
    title    = "Confusion Matrix — Random Forest (Test Set)",
    subtitle = "Rows = true label | Columns = predicted label",
    x = "Predicted", y = "Actual", fill = "Count"
  ) +
  theme_olist()

Reading the matrix: the top-left cell (Satisfied → Satisfied) and bottom-right (Dissatisfied → Dissatisfied) are correct predictions. Off-diagonal cells are errors. A high recall on Dissatisfied means the model successfully flags most at-risk orders for early intervention.

Presenter note: The current metrics show the trade-off clearly: accuracy is low, precision is low, but recall is very high. This model should be tuned with a milder class weight or probability threshold before real deployment.

9.6 RF Findings

Research question: What is the final supervised-learning insight for business use?

Finding Detail
Top predictive feature cat_dissat — a category’s historical dissatisfaction rate is the single strongest signal
Second strongest cat_avg_review — category-level baseline sentiment reinforces the first signal
Delivery signal delay_days and delayed contribute, but less than product-quality proxies — consistent with Section 3
Business use Score each new order at purchase time using category features; flag high-probability dissatisfied orders for proactive seller outreach

Presenter note: End by connecting both learning methods: clustering gives category-level strategic segments, while Random Forest gives order-level risk prediction. Together, they support long-term quality improvement and short-term proactive intervention.


Report generated with R 4.5.3 · Dataset: Olist E-Commerce Public Dataset (Kaggle)