1. Introduction

Online marketplaces usually generate revenue from a wide range of product categories, but not all categories contribute in the same way. Some categories may drive revenue through very high order volume, while others may generate revenue through higher average prices or specialized demand. At the same time, operational factors such as freight cost, product size, seller availability, and customer satisfaction can affect how commercially attractive each category is.

The objective of this project is to understand the sales structure of product categories in the marketplace and identify which category characteristics are associated with high revenue.

The analysis has two main parts:

Unsupervised learning: group product categories into similar commercial profiles using clustering.
Supervised learning: predict whether a product category belongs to the high-revenue group using category-level features.

Business Question

The main business question is:

Which types of product categories drive marketplace revenue, and what makes a category commercially successful?

To answer this question, we will analyze each product category using indicators such as total item revenue, number of orders, average item price, freight ratio, number of sellers, review score, and product weight or size.

Research Questions

This project will focus on the following research questions:

Which product categories generate the highest total revenue?
Is category revenue mainly driven by order volume, average price, or seller availability?
Are there clear groups of product categories with different sales structures?
Which categories appear freight-heavy or operationally more complex?
Can we predict whether a category is high revenue based on its commercial and operational characteristics?
What business actions could marketplace managers take based on these category profiles?

Expected Business Value

The results can help marketplace managers understand where revenue is concentrated and which category types deserve strategic attention. For example, high-volume low-price categories may be important for customer traffic, while high-price niche categories may offer strong revenue potential with fewer orders. Freight-heavy categories may require logistics optimization, and low-performing categories may need better seller coverage, pricing adjustments, or customer experience improvements.

By combining clustering and prediction, the project will provide both a descriptive view of category structure and a practical model for identifying high-revenue category profiles.

2. Load Required Libraries

This analysis uses R packages for data cleaning, visualization, clustering, classification, and model evaluation. The same libraries will be used throughout the project so that the workflow remains reproducible.

library(tidyverse)
library(readr)
library(janitor)
library(cluster)
library(factoextra)
library(rpart)
library(rpart.plot)
library(randomForest)
library(caret)
library(R6)
library(xgboost)
library(pROC)

OlistCategoryAnalyzer <- R6Class(
  "OlistCategoryAnalyzer",
  public = list(
    data_path = NULL,
    sample_fraction = NULL,
    seed = NULL,
    order_items = NULL,
    products = NULL,
    orders = NULL,
    reviews = NULL,
    sellers = NULL,
    category_translation = NULL,
    reviews_by_order = NULL,
    sales_data = NULL,
    category_data = NULL,
    cluster_data = NULL,
    cluster_scaled = NULL,
    kmeans_model = NULL,
    supervised_data = NULL,
    rf_model = NULL,
    xgb_model = NULL,
    xgb_cv = NULL,
    xgb_feature_names = NULL,
    model_comparison = NULL,

    initialize = function(data_path, sample_fraction = 0.10, seed = 123) {
      self$data_path <- data_path
      self$sample_fraction <- sample_fraction
      self$seed <- seed
    },

    load_data = function() {
      self$order_items <- read_csv(
        file.path(self$data_path, "olist_order_items_dataset.csv")
      )
      self$products <- read_csv(
        file.path(self$data_path, "olist_products_dataset.csv")
      )
      self$orders <- read_csv(
        file.path(self$data_path, "olist_orders_dataset.csv")
      )
      self$reviews <- read_csv(
        file.path(self$data_path, "olist_order_reviews_dataset.csv")
      )
      self$sellers <- read_csv(
        file.path(self$data_path, "olist_sellers_dataset.csv")
      )
      self$category_translation <- read_csv(
        file.path(self$data_path, "product_category_name_translation.csv")
      )

      invisible(self)
    },

    get_data_overview = function() {
      tibble(
        dataset = c(
          "order_items",
          "products",
          "orders",
          "reviews",
          "sellers",
          "category_translation"
        ),
        rows = c(
          nrow(self$order_items),
          nrow(self$products),
          nrow(self$orders),
          nrow(self$reviews),
          nrow(self$sellers),
          nrow(self$category_translation)
        ),
        columns = c(
          ncol(self$order_items),
          ncol(self$products),
          ncol(self$orders),
          ncol(self$reviews),
          ncol(self$sellers),
          ncol(self$category_translation)
        )
      )
    },

    clean_data = function() {
      self$order_items <- self$order_items %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$products <- self$products %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$orders <- self$orders %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$reviews <- self$reviews %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$sellers <- self$sellers %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$category_translation <- self$category_translation %>%
        clean_names() %>%
        private$convert_unknown_to_na()

      self$reviews_by_order <- self$reviews %>%
        group_by(order_id) %>%
        summarise(
          average_review_score_order = mean(review_score, na.rm = TRUE),
          number_of_reviews = n(),
          .groups = "drop"
        )

      invisible(self)
    },

    build_sales_data = function() {
      set.seed(self$seed)

      self$sales_data <- self$order_items %>%
        left_join(self$products, by = "product_id") %>%
        left_join(self$orders, by = "order_id") %>%
        left_join(self$reviews_by_order, by = "order_id") %>%
        left_join(self$sellers, by = "seller_id") %>%
        left_join(self$category_translation, by = "product_category_name") %>%
        filter(order_status == "delivered") %>%
        slice_sample(prop = self$sample_fraction) %>%
        mutate(
          product_category_name_english = if_else(
            is.na(product_category_name_english),
            product_category_name,
            product_category_name_english
          ),
          item_revenue = price,
          freight_ratio = freight_value / price,
          product_volume_cm3 = product_length_cm *
            product_height_cm *
            product_width_cm
        )

      invisible(self)
    },

    get_sales_data_overview = function() {
      tibble(
        metric = c(
          "Delivered order items",
          "Delivered orders",
          "Product categories",
          "Sellers"
        ),
        value = c(
          nrow(self$sales_data),
          n_distinct(self$sales_data$order_id),
          n_distinct(self$sales_data$product_category_name_english),
          n_distinct(self$sales_data$seller_id)
        )
      )
    },

    build_category_data = function() {
      self$category_data <- self$sales_data %>%
        filter(
          !is.na(product_category_name_english),
          !is.na(price),
          price > 0
        ) %>%
        mutate(
          freight_ratio = if_else(
            is.finite(freight_ratio),
            freight_ratio,
            NA_real_
          )
        ) %>%
        group_by(product_category_name_english) %>%
        summarise(
          total_item_revenue = sum(item_revenue, na.rm = TRUE),
          number_of_orders = n_distinct(order_id),
          number_of_items = n(),
          average_item_price = mean(price, na.rm = TRUE),
          total_freight = sum(freight_value, na.rm = TRUE),
          average_freight = mean(freight_value, na.rm = TRUE),
          average_freight_ratio = mean(freight_ratio, na.rm = TRUE),
          number_of_sellers = n_distinct(seller_id),
          average_review_score = mean(
            average_review_score_order,
            na.rm = TRUE
          ),
          average_product_weight_g = mean(product_weight_g, na.rm = TRUE),
          average_product_volume_cm3 = mean(product_volume_cm3, na.rm = TRUE),
          .groups = "drop"
        ) %>%
        filter(
          !is.na(average_freight_ratio),
          !is.na(average_review_score),
          !is.na(average_product_weight_g),
          !is.na(average_product_volume_cm3)
        ) %>%
        arrange(desc(total_item_revenue))

      invisible(self)
    },

    get_category_data_overview = function() {
      tibble(
        metric = c(
          "Product categories",
          "Total item revenue",
          "Median category revenue",
          "Average category revenue",
          "Average orders per category"
        ),
        value = c(
          nrow(self$category_data),
          sum(self$category_data$total_item_revenue),
          median(self$category_data$total_item_revenue),
          mean(self$category_data$total_item_revenue),
          mean(self$category_data$number_of_orders)
        )
      )
    },

    prepare_cluster_data = function() {
      self$cluster_data <- self$category_data %>%
        transmute(
          revenue_log = log(total_item_revenue),
          orders_log = log(number_of_orders),
          average_item_price_log = log(average_item_price),
          average_freight_ratio = average_freight_ratio,
          sellers_log = log(number_of_sellers),
          average_review_score = average_review_score,
          product_weight_log = log(average_product_weight_g),
          product_volume_log = log(average_product_volume_cm3)
        )

      self$cluster_scaled <- scale(self$cluster_data)

      invisible(self)
    },

    fit_kmeans = function(centers = 4, nstart = 25) {
      set.seed(self$seed)

      self$kmeans_model <- kmeans(
        self$cluster_scaled,
        centers = centers,
        nstart = nstart
      )

      self$category_data <- self$category_data %>%
        mutate(cluster = factor(self$kmeans_model$cluster))

      invisible(self)
    },

    get_cluster_summary = function() {
      self$category_data %>%
        group_by(cluster) %>%
        summarise(
          number_of_categories = n(),
          total_cluster_revenue = sum(total_item_revenue),
          average_category_revenue = mean(total_item_revenue),
          average_orders = mean(number_of_orders),
          average_item_price = mean(average_item_price),
          average_freight_ratio = mean(average_freight_ratio),
          average_sellers = mean(number_of_sellers),
          average_review_score = mean(average_review_score),
          average_product_weight_g = mean(average_product_weight_g),
          average_product_volume_cm3 = mean(average_product_volume_cm3),
          .groups = "drop"
        ) %>%
        arrange(desc(average_category_revenue))
    },

    create_high_revenue_target = function(method = "median") {
      if (method != "median") {
        stop("Only the median target method is currently implemented.")
      }

      revenue_cutoff <- median(
        self$category_data$total_item_revenue,
        na.rm = TRUE
      )

      self$category_data <- self$category_data %>%
        mutate(
          high_revenue_category = if_else(
            total_item_revenue > revenue_cutoff,
            "high",
            "low"
          ),
          high_revenue_category = factor(
            high_revenue_category,
            levels = c("high", "low")
          )
        )

      invisible(self)
    },

    prepare_supervised_data = function() {
      selected_columns <- c(
        "high_revenue_category",
        "number_of_orders",
        "average_item_price",
        "average_freight_ratio",
        "number_of_sellers",
        "average_review_score",
        "average_product_weight_g",
        "average_product_volume_cm3"
      )

      self$supervised_data <- self$category_data %>%
        select(all_of(selected_columns)) %>%
        drop_na() %>%
        mutate(
          high_revenue_category = factor(
            high_revenue_category,
            levels = c("high", "low")
          )
        )

      invisible(self)
    },

    train_random_forest = function() {
      set.seed(self$seed)

      self$rf_model <- train(
        high_revenue_category ~ .,
        data = self$supervised_data,
        method = "rf",
        metric = "ROC",
        trControl = private$get_train_control(),
        tuneLength = 3,
        importance = TRUE,
        ntree = 500
      )

      invisible(self)
    },

    train_xgboost = function() {
      set.seed(self$seed)

      x <- model.matrix(
        high_revenue_category ~ .,
        data = self$supervised_data
      )[, -1]
      y <- if_else(
        self$supervised_data$high_revenue_category == "high",
        1,
        0
      )

      self$xgb_feature_names <- colnames(x)

      dtrain <- xgb.DMatrix(data = x, label = y)

      xgb_params <- list(
        objective = "binary:logistic",
        eval_metric = "auc",
        max_depth = 2,
        eta = 0.10,
        subsample = 0.8,
        colsample_bytree = 0.8
      )

      self$xgb_cv <- xgb.cv(
        params = xgb_params,
        data = dtrain,
        nrounds = 100,
        nfold = 5,
        stratified = TRUE,
        early_stopping_rounds = 10,
        verbose = 0
      )

      best_iteration <- self$xgb_cv$best_iteration
      if (length(best_iteration) == 0 || is.null(best_iteration)) {
        best_iteration <- which.max(self$xgb_cv$evaluation_log$test_auc_mean)
      }

      self$xgb_model <- xgb.train(
        params = xgb_params,
        data = dtrain,
        nrounds = best_iteration,
        verbose = 0
      )

      invisible(self)
    },

    compare_models = function() {
      self$model_comparison <- bind_rows(
        private$get_best_model_metrics(self$rf_model, "Random Forest"),
        private$get_xgb_metrics()
      ) %>%
        arrange(desc(ROC))

      self$model_comparison
    },

    get_variable_importance = function(model) {
      if (inherits(model, "xgb.Booster")) {
        return(
          xgb.importance(model = model) %>%
            as_tibble() %>%
            transmute(feature = Feature, Overall = Gain) %>%
            arrange(desc(Overall))
        )
      }

      importance <- varImp(model)$importance %>%
        rownames_to_column("feature")

      if (!"Overall" %in% names(importance)) {
        numeric_columns <- importance %>%
          select(where(is.numeric)) %>%
          names()

        importance <- importance %>%
          mutate(Overall = rowMeans(across(all_of(numeric_columns))))
      }

      importance %>%
        arrange(desc(Overall))
    }
  ),

  private = list(
    convert_unknown_to_na = function(data) {
      data %>%
        mutate(
          across(
            where(is.character),
            ~ if_else(
              str_to_lower(str_trim(.x)) == "unknown",
              NA_character_,
              .x
            )
          )
        )
    },

    get_train_control = function() {
      trainControl(
        method = "repeatedcv",
        number = 5,
        repeats = 5,
        classProbs = TRUE,
        summaryFunction = twoClassSummary,
        savePredictions = "final"
      )
    },

    get_best_model_metrics = function(model, model_name) {
      best_tune <- model$bestTune

      model$results %>%
        semi_join(best_tune, by = names(best_tune)) %>%
        transmute(
          model = model_name,
          ROC = ROC,
          sensitivity = Sens,
          specificity = Spec
        )
    },

    get_xgb_metrics = function() {
      best_iteration <- self$xgb_cv$best_iteration
      if (length(best_iteration) == 0 || is.null(best_iteration)) {
        best_iteration <- which.max(self$xgb_cv$evaluation_log$test_auc_mean)
      }

      self$xgb_cv$evaluation_log %>%
        slice(best_iteration) %>%
        transmute(
          model = "XGBoost",
          ROC = test_auc_mean,
          sensitivity = NA_real_,
          specificity = NA_real_
        )
    }
  )
)

analyzer <- OlistCategoryAnalyzer$new(
  data_path = "olist dataset",
  sample_fraction = 0.10,  # 10% sample of delivered order items
  seed = 123
)

The tidyverse, readr, and janitor packages support data import, cleaning, and transformation. The cluster and factoextra packages are used for the unsupervised learning part of the project. The rpart, rpart.plot, randomForest, xgboost, pROC, and caret packages are used for classification models and performance evaluation. The R6 package is used to organize the workflow in an object-oriented way.

From a business point of view, this setup allows us to move from raw marketplace transactions to category-level insights, category segments, and predictive models in one consistent analytical workflow.

3. Import the Olist Datasets

The analysis uses the Olist Brazilian e-commerce dataset. For this project, the most important files are the order items, products, orders, reviews, sellers, and product category translation tables. Together, these files allow us to connect product categories with revenue, freight cost, seller activity, order status, and customer satisfaction.

analyzer$load_data()

order_items <- analyzer$order_items
products <- analyzer$products
orders <- analyzer$orders
reviews <- analyzer$reviews
sellers <- analyzer$sellers
category_translation <- analyzer$category_translation

The order_items dataset is the core sales table because it contains item prices and freight values. The products table adds product category and physical product characteristics. The orders table allows us to focus on completed sales, specifically delivered orders. The reviews table adds customer satisfaction, and the sellers table allows us to measure how many sellers operate in each category.

The category names in the original dataset are in Portuguese, so the translation table will be used to make the final business interpretation clearer.
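The translation step can be illustrated with a minimal sketch on toy data. The second category name below is hypothetical and stands for any category missing from the translation table. The sketch uses coalesce(), which should behave the same as the if_else(is.na(...)) fallback in the class: the Portuguese name is kept when no English name exists.

```r
library(dplyr)

# Toy products: one category with a translation, one without (hypothetical name).
toy_products <- tibble(
  product_category_name = c("beleza_saude", "categoria_sem_traducao")
)

toy_translation <- tibble(
  product_category_name = "beleza_saude",
  product_category_name_english = "health_beauty"
)

toy_translated <- toy_products %>%
  left_join(toy_translation, by = "product_category_name") %>%
  mutate(
    # Fall back to the original name when the translation is missing.
    product_category_name_english = coalesce(
      product_category_name_english,
      product_category_name
    )
  )

toy_translated
```

With this fallback, untranslated categories stay in the analysis instead of being dropped as missing values.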

data_overview <- analyzer$get_data_overview()

data_overview

## # A tibble: 6 × 3
##   dataset                rows columns
##   <chr>                 <int>   <int>
## 1 order_items          112650       7
## 2 products              32951       9
## 3 orders                99441       8
## 4 reviews              100000       7
## 5 sellers                3095       4
## 6 category_translation     71       2

This overview confirms that all six tables were loaded and shows their dimensions. From a business perspective, the imported tables give us the minimum information needed to connect category-level revenue with demand, price positioning, logistics, seller supply, and customer satisfaction.

4. Clean and Join the Data

The next step is to combine the imported datasets into one analysis table. The main unit of observation is an order item, because revenue and freight values are recorded at the item level. Product, order, review, seller, and category translation information are then added to each item.

Before joining, we clean the column names and aggregate reviews at the order level. This is important because some orders can have more than one review record, and joining reviews directly could duplicate item revenue.

analyzer$clean_data()

order_items <- analyzer$order_items
products <- analyzer$products
orders <- analyzer$orders
reviews <- analyzer$reviews
sellers <- analyzer$sellers
category_translation <- analyzer$category_translation
reviews_by_order <- analyzer$reviews_by_order
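The duplication risk can be shown with a small sketch on toy data. All values are hypothetical: one order with two items and two review records. It compares a direct join with a join on reviews aggregated per order, which is the approach used in clean_data().

```r
library(dplyr)

# Toy data: one order, two items, two review records for the same order.
toy_items <- tibble(
  order_id = c("o1", "o1"),
  price = c(100, 50)
)
toy_reviews <- tibble(
  order_id = c("o1", "o1"),
  review_score = c(5, 3)
)

# Direct join: each item is repeated once per review record.
# Recent dplyr versions warn about this many-to-many match,
# so the warning is suppressed here only to keep the output clean.
naive_join <- suppressWarnings(
  left_join(toy_items, toy_reviews, by = "order_id")
)

# Aggregate reviews to one row per order before joining.
toy_reviews_by_order <- toy_reviews %>%
  group_by(order_id) %>%
  summarise(
    average_review_score_order = mean(review_score, na.rm = TRUE),
    .groups = "drop"
  )
safe_join <- left_join(toy_items, toy_reviews_by_order, by = "order_id")

sum(naive_join$price)  # inflated revenue
sum(safe_join$price)   # true revenue
```

In this toy case the direct join doubles the revenue of the order, while the aggregated join keeps one row per item.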

The joined dataset keeps delivered orders only. This makes the revenue analysis more reliable because delivered orders represent completed marketplace transactions. To keep the analysis lighter and faster, the project uses a reproducible 10% sample of delivered order items.

analyzer$build_sales_data()

sales_data <- analyzer$sales_data
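As a minimal sketch of why the sample is reproducible, the example below draws a 10% sample twice with the same seed. The helper draw_sample() and the toy table are hypothetical and only mirror the set.seed() plus slice_sample() pattern used in build_sales_data().

```r
library(dplyr)

# Toy table standing in for the delivered order items.
toy_items <- tibble(item_id = 1:1000)

# Hypothetical helper: fix the seed, then draw a proportional sample.
draw_sample <- function(data, prop, seed) {
  set.seed(seed)
  slice_sample(data, prop = prop)
}

sample_a <- draw_sample(toy_items, prop = 0.10, seed = 123)
sample_b <- draw_sample(toy_items, prop = 0.10, seed = 123)

nrow(sample_a)
identical(sample_a, sample_b)
```

Changing the seed would give a different sample of the same size, so results based on the sample can vary slightly between seeds.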

Three derived variables support the category-level analysis. item_revenue equals the item price and therefore excludes freight; freight_ratio is freight_value divided by price; product_volume_cm3 is the product of length, height, and width. The freight ratio measures the relative logistics burden of a category, while product volume helps identify categories that may be operationally more difficult to handle.
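The following sketch applies the same formulas as build_sales_data() to two toy items. The numbers are hypothetical and chosen only to make the arithmetic easy to check.

```r
library(dplyr)

# Two toy items: a cheap bulky product and an expensive compact product.
toy_sales <- tibble(
  price = c(20, 200),
  freight_value = c(10, 20),
  product_length_cm = c(50, 10),
  product_height_cm = c(40, 5),
  product_width_cm = c(30, 10)
)

toy_features <- toy_sales %>%
  mutate(
    item_revenue = price,                   # revenue excludes freight
    freight_ratio = freight_value / price,  # freight as a share of price
    product_volume_cm3 = product_length_cm *
      product_height_cm *
      product_width_cm
  )

toy_features %>%
  select(item_revenue, freight_ratio, product_volume_cm3)
```

The bulky item pays half of its price in freight, while the compact item pays a tenth, which is the kind of contrast the freight ratio is meant to capture.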

sales_data_overview <- analyzer$get_sales_data_overview()

sales_data_overview

## # A tibble: 4 × 2
##   metric                value
##   <chr>                 <int>
## 1 Delivered order items 11019
## 2 Delivered orders      10791
## 3 Product categories       74
## 4 Sellers                1656

From a business perspective, this joined table connects the key commercial dimensions of the marketplace: what was sold, how much revenue it generated, how expensive it was to ship, which sellers supplied it, and how customers reviewed the order. This table will be the foundation for building category-level features in the next step. Because the analysis uses a sample, the revenue figures should be interpreted as sample-based estimates of category structure rather than exact full-marketplace totals.

5. Category-Level Feature Engineering

The project focuses on product categories, so the joined item-level data must be aggregated into one row per category. Each row will describe the commercial and operational profile of a category.

The main category-level indicators are:

total item revenue
number of orders
number of sold items
average item price
average freight cost
average freight ratio
number of sellers
average review score
average product weight
average product volume

analyzer$build_category_data()

category_data <- analyzer$category_data

The resulting dataset is the main analytical table for the rest of the project. It transforms raw transactions into category-level business indicators.
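To make the aggregation concrete, here is a reduced sketch on toy item-level data with hypothetical values. It computes only a subset of the indicators and uses a shortened column name (category) for readability; the full version is in build_category_data().

```r
library(dplyr)

# Toy item-level data: order o1 contains two items from the same category.
toy_items <- tibble(
  category = c("toys", "toys", "toys", "watches"),
  order_id = c("o1", "o1", "o2", "o3"),
  seller_id = c("s1", "s1", "s2", "s3"),
  price = c(10, 20, 30, 300),
  freight_value = c(5, 5, 6, 15)
)

toy_categories <- toy_items %>%
  group_by(category) %>%
  summarise(
    total_item_revenue = sum(price),
    number_of_orders = n_distinct(order_id),  # distinct orders
    number_of_items = n(),                    # item rows
    average_item_price = mean(price),
    average_freight_ratio = mean(freight_value / price),
    number_of_sellers = n_distinct(seller_id),
    .groups = "drop"
  ) %>%
  arrange(desc(total_item_revenue))

toy_categories
```

Note that number_of_orders and number_of_items differ when one order contains several items from the same category, which is why both are kept.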

category_data_overview <- analyzer$get_category_data_overview()

category_data_overview

## # A tibble: 5 × 2
##   metric                         value
##   <chr>                          <dbl>
## 1 Product categories               73 
## 2 Total item revenue          1304787.
## 3 Median category revenue        3680.
## 4 Average category revenue      17874.
## 5 Average orders per category     146.

category_data %>%
  select(
    product_category_name_english,
    total_item_revenue,
    number_of_orders,
    average_item_price,
    average_freight_ratio,
    number_of_sellers,
    average_review_score,
    average_product_weight_g,
    average_product_volume_cm3
  ) %>%
  slice_head(n = 10)

## # A tibble: 10 × 9
##    product_category_nam…¹ total_item_revenue number_of_orders average_item_price
##    <chr>                               <dbl>            <int>              <dbl>
##  1 watches_gifts                     119060.              583              201. 
##  2 health_beauty                     119060.              915              128. 
##  3 bed_bath_table                    109440.             1153               93.1
##  4 sports_leisure                    100199.              848              116. 
##  5 computers_accessories              82499.              701              115. 
##  6 furniture_decor                    69618.              777               85.6
##  7 housewares                         64569.              705               88.8
##  8 cool_stuff                         59775.              364              164. 
##  9 garden_tools                       55220.              394              134. 
## 10 auto                               52479.              410              127. 
## # ℹ abbreviated name: ¹product_category_name_english
## # ℹ 5 more variables: average_freight_ratio <dbl>, number_of_sellers <int>,
## #   average_review_score <dbl>, average_product_weight_g <dbl>,
## #   average_product_volume_cm3 <dbl>

From a business perspective, this table allows categories to be compared using the same set of indicators. A category with high revenue and many orders may be a broad demand driver, while a category with high average price but fewer orders may be a niche premium category. A high freight ratio or high product volume may indicate operational complexity, which is important when interpreting whether a category is commercially attractive.

6. Exploratory Data Analysis

Before applying clustering or prediction models, we first explore the category-level data. The goal of this section is to understand which categories generate the most revenue and how revenue relates to order volume, price, freight burden, seller availability, and customer satisfaction.

Top Categories by Revenue

top_revenue_categories <- category_data %>%
  slice_max(total_item_revenue, n = 10)

ggplot(
  top_revenue_categories,
  aes(
    x = reorder(product_category_name_english, total_item_revenue),
    y = total_item_revenue
  )
) +
  geom_col(fill = "#2f6f73") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Top 10 Product Categories by Total Item Revenue",
    x = "Product category",
    y = "Total item revenue"
  ) +
  theme_minimal()

This chart identifies the categories that contribute the most revenue to the marketplace. These categories are strategically important because changes in their demand, pricing, seller availability, or logistics can have a large effect on total marketplace performance.

From a marketing point of view, the leading categories should be treated as the main commercial pillars of the marketplace. Categories such as watches and gifts, health and beauty, bed and bath, sports and leisure, and computer accessories are not only selling categories; they likely also act as traffic generators, although this dataset does not measure traffic directly. These categories are strong candidates for homepage placement, seasonal campaigns, cross-selling actions, loyalty promotions, and seller acquisition programs.

Revenue and Order Volume

ggplot(
  category_data,
  aes(x = number_of_orders, y = total_item_revenue)
) +
  geom_point(color = "#2f6f73", alpha = 0.75, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#b24c38") +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Order Volume",
    x = "Number of orders",
    y = "Total item revenue"
  ) +
  theme_minimal()

This relationship shows whether revenue is mainly driven by volume. If categories with more orders almost always have higher revenue, marketplace growth depends strongly on broad demand categories. Categories above the trend line may be especially attractive because they generate more revenue than their order volume alone would suggest.
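One way to identify categories above the trend line is to look at the residuals of the same linear fit that geom_smooth(method = "lm") draws. The sketch below uses hypothetical toy values; the column revenue_vs_trend is an illustrative name and is not part of the analyzer class.

```r
# Toy category data: category "c" earns more than its order volume suggests.
toy_categories <- data.frame(
  category = c("a", "b", "c", "d", "e"),
  number_of_orders = c(10, 20, 30, 40, 50),
  total_item_revenue = c(1000, 2000, 4500, 4000, 5000)
)

# Same model as the trend line in the plot: revenue as a linear function of orders.
trend <- lm(total_item_revenue ~ number_of_orders, data = toy_categories)

# Positive residual = revenue above the trend line.
toy_categories$revenue_vs_trend <- residuals(trend)

toy_categories[order(-toy_categories$revenue_vs_trend), ]
```

A positive residual means the category generates more revenue than the linear trend predicts from order volume alone, which usually points to a higher price level.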

For business decisions, a strong positive relationship would mean the marketplace should not only search for expensive products, because revenue is then closely tied to repeatable demand. In that case, the most commercially important categories are those that attract many customers and many transactions, and marketing investment should prioritize categories that combine demand frequency with a price level high enough to generate meaningful revenue.

Average Price and Order Volume

ggplot(
  category_data,
  aes(x = average_item_price, y = number_of_orders)
) +
  geom_point(color = "#5b6770", alpha = 0.75, size = 2.5) +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Average Item Price Compared with Order Volume",
    x = "Average item price",
    y = "Number of orders"
  ) +
  theme_minimal()

This chart helps distinguish high-volume low-price categories from higher-price niche categories. From a business point of view, both can be useful: high-volume categories can drive traffic and repeat purchases, while higher-price categories can generate meaningful revenue with fewer transactions.

This creates two different marketing strategies. Low-price, high-volume categories are useful for acquisition campaigns, discounts, bundles, and retention because customers can buy them frequently. Higher-price niche categories require a different approach: stronger product information, trust signals, reviews, warranty communication, and targeted advertising to customers with specific purchase intent.

Freight Ratio by Category

freight_heavy_categories <- category_data %>%
  slice_max(average_freight_ratio, n = 10)

ggplot(
  freight_heavy_categories,
  aes(
    x = reorder(product_category_name_english, average_freight_ratio),
    y = average_freight_ratio
  )
) +
  geom_col(fill = "#b24c38") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
  labs(
    title = "Top 10 Categories by Average Freight Ratio",
    x = "Product category",
    y = "Average freight as a share of item price"
  ) +
  theme_minimal()

Categories with a high freight ratio may be less attractive operationally, especially if shipping cost represents a large share of the item price. These categories may require logistics optimization, better shipping agreements, or pricing strategies that account for delivery cost.

From a business perspective, freight-heavy categories are risky because shipping can reduce the attractiveness of the offer. Even if demand exists, customers may abandon purchases when delivery cost feels too high compared with the product price. For these categories, the marketplace should consider minimum basket thresholds, bundle offers, negotiated freight conditions, regional delivery strategies, or clearer communication of total delivered price.

Sellers and Revenue

ggplot(
  category_data,
  aes(x = number_of_sellers, y = total_item_revenue)
) +
  geom_point(color = "#2f6f73", alpha = 0.75, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#b24c38") +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Seller Availability",
    x = "Number of sellers",
    y = "Total item revenue"
  ) +
  theme_minimal()

This plot shows whether categories with more sellers also generate more revenue. A larger seller base can increase product variety, availability, and price competition. However, if revenue remains low despite many sellers, the category may have weak demand or poor positioning.

For marketplace management, seller availability is a commercial asset. A larger seller base usually means more assortment, better stock coverage, and more competitive offers. Categories with high demand but fewer sellers are good targets for seller recruitment. Categories with many sellers but weak revenue need marketing review: the problem may be poor visibility, weak customer demand, or an undifferentiated assortment.
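As a rough sketch of how seller recruitment targets could be screened, the example below ranks toy categories by orders per seller. The ratio orders_per_seller and all values are assumptions for illustration; the analysis itself does not compute this indicator.

```r
library(dplyr)

# Toy category data with hypothetical order and seller counts.
toy_categories <- tibble(
  category = c("a", "b", "c"),
  number_of_orders = c(900, 300, 120),
  number_of_sellers = c(30, 60, 40)
)

seller_pressure <- toy_categories %>%
  mutate(orders_per_seller = number_of_orders / number_of_sellers) %>%
  arrange(desc(orders_per_seller))

seller_pressure
```

Categories at the top of this ranking have high demand relative to their seller base, while those at the bottom have many sellers competing for few orders.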

Review Score and Revenue

ggplot(
  category_data,
  aes(x = average_review_score, y = total_item_revenue)
) +
  geom_point(color = "#5b6770", alpha = 0.75, size = 2.5) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Average Review Score",
    x = "Average review score",
    y = "Total item revenue"
  ) +
  theme_minimal()

Customer satisfaction is important when interpreting category quality. A category with strong revenue but relatively low review scores may represent a business risk, because poor customer experience can reduce repeat purchases and damage marketplace trust.

In marketing terms, review score protects long-term revenue. A category can perform well in the short term, but if customers are not satisfied, promotion may amplify a bad experience. High-revenue categories with weaker reviews should be monitored carefully before receiving additional campaign investment. Improving seller quality, delivery reliability, and product descriptions may be more valuable than simply increasing advertising spend.
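A simple watchlist of high-revenue categories with weaker reviews could be built as in the sketch below. The median cutoffs are an assumption chosen for illustration, and the toy values are hypothetical; a manager may prefer fixed thresholds.

```r
library(dplyr)

# Toy category data with hypothetical revenue and review scores.
toy_categories <- tibble(
  category = c("a", "b", "c", "d"),
  total_item_revenue = c(50000, 40000, 3000, 2000),
  average_review_score = c(4.3, 3.6, 4.5, 3.5)
)

# Assumed cutoffs: the median of each indicator.
revenue_cutoff <- median(toy_categories$total_item_revenue)
review_cutoff <- median(toy_categories$average_review_score)

# Keep categories with above-median revenue and below-median reviews.
watchlist <- toy_categories %>%
  filter(
    total_item_revenue > revenue_cutoff,
    average_review_score < review_cutoff
  )

watchlist
```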

Overall, the exploratory results suggest that the marketplace should manage categories as a portfolio. Some categories are revenue engines, some are traffic builders, some are premium niches, and some are operationally expensive. A good marketing strategy should not treat all categories equally; each category type needs a different mix of promotion, pricing, logistics support, and seller development.

This exploratory analysis gives an initial view of the sales structure of product categories. The next step uses clustering to group categories with similar commercial and operational profiles.

7. Prepare Data for Clustering

The unsupervised part of the project groups product categories with similar commercial and operational profiles. Clustering uses only numeric category-level features, so the category name is excluded from the clustering matrix.

Several variables, such as revenue, order count, seller count, weight, and volume, are highly skewed. All variables selected for log transformation are strictly positive in the category-level dataset, so we use the natural logarithm log(x) without an offset. This reduces the influence of extremely large categories, so that distances between categories reflect all dimensions of category structure rather than mainly the largest revenue and volume figures.

analyzer$prepare_cluster_data()

cluster_data <- analyzer$cluster_data
cluster_scaled <- analyzer$cluster_scaled

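Scaling is necessary because the variables are measured in different units and K-means groups observations by distance. Without scaling, variables with larger numeric ranges, such as revenue or product volume, would dominate the clustering result, while narrow-range variables such as the freight ratio would barely count.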
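
The internals of prepare_cluster_data() are not shown in this report. As a rough sketch of the two preparation steps it describes, the code below log-transforms the skewed variables and then standardizes every column, using base R only. The toy values and the name cluster_toy are invented for illustration; the column names follow the summary output in this section.

```r
# Toy category-level data with one very large category (values are invented).
toy <- data.frame(
  total_item_revenue = c(120000, 9000, 2500, 800, 60),
  number_of_orders = c(1100, 90, 40, 12, 3),
  average_freight_ratio = c(0.20, 0.28, 0.35, 0.40, 0.90)
)

# Log-transform the skewed, strictly positive variables; leave the ratio as is.
cluster_toy <- data.frame(
  revenue_log = log(toy$total_item_revenue),
  orders_log = log(toy$number_of_orders),
  average_freight_ratio = toy$average_freight_ratio
)

# Standardize each column to mean 0 and standard deviation 1.
cluster_toy_scaled <- scale(cluster_toy)

round(colMeans(cluster_toy_scaled), 10)
apply(cluster_toy_scaled, 2, sd)
```

After these steps every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculation simply because of its unit.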

summary(cluster_data)

##   revenue_log       orders_log    average_item_price_log average_freight_ratio
##  Min.   : 3.218   Min.   :0.000   Min.   :2.561          Min.   :0.04425      
##  1st Qu.: 6.750   1st Qu.:2.197   1st Qu.:4.304          1st Qu.:0.25714      
##  Median : 8.211   Median :3.219   Median :4.743          Median :0.32414      
##  Mean   : 8.163   Mean   :3.449   Mean   :4.701          Mean   :0.35104      
##  3rd Qu.: 9.736   3rd Qu.:5.112   3rd Qu.:4.992          3rd Qu.:0.37361      
##  Max.   :11.687   Max.   :7.050   Max.   :6.915          Max.   :1.16370      
##   sellers_log    average_review_score product_weight_log product_volume_log
##  Min.   :0.000   Min.   :1.000        Min.   :4.605      Min.   : 7.270    
##  1st Qu.:1.386   1st Qu.:3.946        1st Qu.:6.455      1st Qu.: 8.646    
##  Median :2.398   Median :4.064        Median :7.381      Median : 9.514    
##  Mean   :2.496   Mean   :4.016        Mean   :7.293      Mean   : 9.352    
##  3rd Qu.:3.689   3rd Qu.:4.290        3rd Qu.:7.925      3rd Qu.: 9.977    
##  Max.   :5.421   Max.   :5.000        Max.   :9.386      Max.   :11.175

Choosing the Number of Clusters

Two common methods are used to support the choice of the number of clusters:

the elbow method, which shows how much within-cluster variation decreases as more clusters are added
the silhouette method, which measures how well categories fit within their assigned cluster compared with other clusters

set.seed(123)

fviz_nbclust(
  cluster_scaled,
  kmeans,
  method = "wss",
  k.max = 10
) +
  labs(title = "Elbow Method for Choosing Number of Clusters") +
  theme_minimal()

set.seed(123)

fviz_nbclust(
  cluster_scaled,
  kmeans,
  method = "silhouette",
  k.max = 10
) +
  labs(title = "Silhouette Method for Choosing Number of Clusters") +
  theme_minimal()
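
To show what the two diagnostics measure, here is a minimal sketch that computes them directly with kmeans() and cluster::silhouette(). The data is a made-up example with three well-separated groups, not the category dataset, and the names toy_scaled, wss, and avg_silhouette are illustrative.

```r
library(cluster)  # recommended package shipped with R; provides silhouette()

set.seed(123)

# Toy data: three well-separated groups in two dimensions (invented values).
toy_scaled <- scale(rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
))

k_values <- 2:6
distances <- dist(toy_scaled)

# Elbow input: total within-cluster sum of squares for each k.
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# Silhouette input: average silhouette width for each k.
avg_silhouette <- sapply(k_values, function(k) {
  fit <- kmeans(toy_scaled, centers = k, nstart = 25)
  mean(silhouette(fit$cluster, distances)[, "sil_width"])
})

data.frame(k = k_values, wss = wss, avg_silhouette = avg_silhouette)
```

On data this clean the average silhouette width peaks at the true number of groups. Real category data rarely gives such a clear answer, which is why the statistical evidence is weighed against interpretability.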

The final number of clusters should balance statistical evidence and business interpretability. A very small number of clusters may hide important differences between category types, while too many clusters may create groups that are hard to explain or act on.

For this project, we will use four clusters as a practical starting point. This provides a simpler and more interpretable segmentation while still allowing us to compare broad marketplace categories, high-price niche categories, freight-heavy categories, and lower-performing or underdeveloped categories.

8. K-Means Clustering and Cluster Interpretation

After preparing and scaling the category-level variables, we apply K-means clustering. The model assigns each product category to one of four groups based on similarities in revenue, order volume, price, freight ratio, seller base, review score, and product size.

analyzer$fit_kmeans(centers = 4, nstart = 25)

kmeans_model <- analyzer$kmeans_model
category_data <- analyzer$category_data

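The nstart = 25 argument runs K-means 25 times with different random starting centers and keeps the solution with the lowest total within-cluster sum of squares. This makes the clustering result more stable, because a single run can settle on a poor local solution.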
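
The effect of nstart can be illustrated with base R's kmeans() on invented data. This sketch assumes fit_kmeans() passes nstart through to kmeans(), which the report does not show explicitly.

```r
set.seed(123)

# Toy data: four groups in two dimensions (invented values).
centers_true <- rbind(c(0, 0), c(0, 4), c(4, 0), c(4, 4))
toy_points <- do.call(rbind, lapply(seq_len(nrow(centers_true)), function(i) {
  cbind(
    rnorm(25, mean = centers_true[i, 1], sd = 0.5),
    rnorm(25, mean = centers_true[i, 2], sd = 0.5)
  )
}))

# Ten single-start runs: each may stop in a different local optimum.
single_start_wss <- replicate(
  10,
  kmeans(toy_points, centers = 4, nstart = 1)$tot.withinss
)

# One run with 25 random starts keeps the lowest-WSS solution among them.
multi_start_wss <- kmeans(toy_points, centers = 4, nstart = 25)$tot.withinss

range(single_start_wss)
multi_start_wss
```

If the single-start runs disagree with each other, the spread in their within-cluster sum of squares shows how much the result depends on the starting point. The multi-start fit avoids relying on one lucky or unlucky start.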

Visualizing the Clusters

fviz_cluster(
  kmeans_model,
  data = cluster_scaled,
  geom = "point",
  ellipse.type = "convex",
  show.clust.cent = TRUE
) +
  labs(title = "K-Means Clusters of Product Categories") +
  theme_minimal()

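This visualization shows how categories separate into different groups after dimension reduction: the eight scaled features are projected onto the first two principal components, which form the plot axes. The chart is useful for checking whether the clusters have some separation, but overlap in two dimensions does not necessarily mean overlap in the full feature space, and the final business meaning must come from the original category-level metrics.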
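
When the data has more than two columns, fviz_cluster() plots observations on principal components. A minimal sketch of that projection with base R's prcomp() is shown below, using invented data; it also reports how much of the total variance the two plotted dimensions retain, which indicates how much detail the chart hides.

```r
set.seed(123)

# Toy scaled matrix with five features, three of them correlated (invented values).
base_signal <- rnorm(60)
toy_scaled <- scale(cbind(
  f1 = base_signal + rnorm(60, sd = 0.3),
  f2 = base_signal + rnorm(60, sd = 0.3),
  f3 = -base_signal + rnorm(60, sd = 0.3),
  f4 = rnorm(60),
  f5 = rnorm(60)
))

fit <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Project onto principal components; the first two give the 2D plot coordinates.
# The matrix is already centered and scaled, so prcomp() does not repeat it.
pca <- prcomp(toy_scaled, center = FALSE, scale. = FALSE)
plot_coords <- data.frame(pca$x[, 1:2], cluster = factor(fit$cluster))

# Share of total variance retained by the two plotted dimensions.
variance_share <- pca$sdev^2 / sum(pca$sdev^2)
sum(variance_share[1:2])
```

A low retained share would mean that clusters overlapping in the plot may still be well separated in the dimensions that are not drawn.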

Cluster Summary

cluster_summary <- analyzer$get_cluster_summary()

cluster_summary

## # A tibble: 4 × 11
##   cluster number_of_categories total_cluster_revenue average_category_revenue
##   <fct>                  <int>                 <dbl>                    <dbl>
## 1 4                         20              1080685.                   54034.
## 2 1                         24               192631.                    8026.
## 3 3                         25                30183.                    1207.
## 4 2                          4                 1287.                     322.
## # ℹ 7 more variables: average_orders <dbl>, average_item_price <dbl>,
## #   average_freight_ratio <dbl>, average_sellers <dbl>,
## #   average_review_score <dbl>, average_product_weight_g <dbl>,
## #   average_product_volume_cm3 <dbl>

The cluster summary translates the statistical output into business terms. It shows whether each cluster is characterized by high revenue, high volume, high prices, high freight burden, many sellers, stronger reviews, or larger product dimensions.
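
The body of get_cluster_summary() is not included in this report. A summary of this shape could be produced roughly as follows; this is a sketch under that assumption, written in base R with an invented two-cluster table and only a subset of the columns.

```r
# Toy category table with an assigned cluster (values are invented).
toy_categories <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 2)),
  total_item_revenue = c(100, 300, 10, 20, 30),
  number_of_orders = c(5, 15, 1, 2, 3)
)

# Per-cluster count, total revenue, and averages.
cluster_summary_toy <- do.call(rbind, lapply(
  split(toy_categories, toy_categories$cluster),
  function(d) {
    data.frame(
      cluster = d$cluster[1],
      number_of_categories = nrow(d),
      total_cluster_revenue = sum(d$total_item_revenue),
      average_category_revenue = mean(d$total_item_revenue),
      average_orders = mean(d$number_of_orders)
    )
  }
))

# Sort by total revenue, largest first.
cluster_summary_toy <- cluster_summary_toy[
  order(-cluster_summary_toy$total_cluster_revenue),
]
cluster_summary_toy
```

Averages are taken over categories within a cluster, so a cluster with few categories can show extreme averages even when its total revenue is small.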

Cluster Segmentation Profile

The following plot compares clusters across the most important business characteristics. Each characteristic is standardized across the four cluster averages, so a positive bar means the cluster is above the mean of the clusters for that characteristic, and a negative bar means it is below.

cluster_profile_plot_data <- cluster_summary %>%
  select(
    cluster,
    average_category_revenue,
    average_orders,
    average_item_price,
    average_freight_ratio,
    average_sellers,
    average_review_score,
    average_product_weight_g,
    average_product_volume_cm3
  ) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "characteristic",
    values_to = "value"
  ) %>%
  group_by(characteristic) %>%
  mutate(standardized_value = as.numeric(scale(value))) %>%
  ungroup() %>%
  mutate(
    characteristic = recode(
      characteristic,
      average_category_revenue = "Revenue",
      average_orders = "Orders",
      average_item_price = "Price",
      average_freight_ratio = "Freight ratio",
      average_sellers = "Sellers",
      average_review_score = "Review score",
      average_product_weight_g = "Weight",
      average_product_volume_cm3 = "Volume"
    ),
    characteristic = factor(
      characteristic,
      levels = c(
        "Revenue",
        "Orders",
        "Price",
        "Freight ratio",
        "Sellers",
        "Review score",
        "Weight",
        "Volume"
      )
    )
  )

ggplot(
  cluster_profile_plot_data,
  aes(
    x = characteristic,
    y = standardized_value,
    fill = cluster
  )
) +
  geom_col(position = position_dodge(width = 0.75), width = 0.65) +
  geom_hline(yintercept = 0, color = "#5b6770", linewidth = 0.4) +
  labs(
    title = "Product Category Segmentation by Cluster Characteristics",
    x = "Business characteristic",
    y = "Standardized value",
    fill = "Cluster"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1)
  )

This segmentation view makes the cluster profiles easier to compare. For example, a cluster with positive bars for revenue, orders, and sellers represents a broad marketplace group, while a cluster with a positive freight ratio bar and a negative price bar may represent categories with weaker logistics economics.

The segmentation plot should be read as a category management map. Clusters with strong revenue, orders, and sellers are the marketplace core and should receive consistent visibility and seller relationship management. Clusters with higher prices and larger products are more specialized; they need trust-building content, detailed product pages, and targeted campaigns rather than broad discounting. Clusters with high freight pressure require operational attention before aggressive marketing, because promotions can increase volume without improving profitability or customer experience.

Example Categories in Each Cluster

category_data %>%
  group_by(cluster) %>%
  arrange(desc(total_item_revenue), .by_group = TRUE) %>%
  slice_head(n = 5) %>%
  ungroup() %>%
  select(
    cluster,
    product_category_name_english,
    total_item_revenue,
    number_of_orders,
    average_item_price,
    average_freight_ratio,
    number_of_sellers,
    average_review_score
  )

## # A tibble: 19 × 8
##    cluster product_category_name_english   total_item_revenue number_of_orders
##    <fct>   <chr>                                        <dbl>            <int>
##  1 1       office_furniture                           26117.               166
##  2 1       computers                                  22161.                22
##  3 1       musical_instruments                        16669.                63
##  4 1       construction_tools_construction            14332.                86
##  5 1       luggage_accessories                        14187.               114
##  6 2       fashion_underwear_beach                     1176.                14
##  7 2       home_comfort_2                                60.8                3
##  8 2       cine_photo                                    25.9                2
##  9 2       fashion_sport                                 25.0                1
## 10 3       fixed_telephony                             6235.                29
## 11 3       audio                                       3239.                39
## 12 3       food                                        2867.                48
## 13 3       drinks                                      2409.                33
## 14 3       books_technical                             2200.                23
## 15 4       watches_gifts                             119060.               583
## 16 4       health_beauty                             119060.               915
## 17 4       bed_bath_table                            109440.              1153
## 18 4       sports_leisure                            100199.               848
## 19 4       computers_accessories                      82499.               701
## # ℹ 4 more variables: average_item_price <dbl>, average_freight_ratio <dbl>,
## #   number_of_sellers <int>, average_review_score <dbl>

Looking at example categories helps validate whether the clusters make practical sense. A cluster should not only be statistically different, but also meaningful for marketplace decision-making.

Business Interpretation of the Clusters

The four clusters can be interpreted by comparing their average metrics:

clusters with high average orders and many sellers can be understood as broad marketplace categories
clusters with high average prices and lower order volume can be interpreted as higher-price niche categories
clusters with high freight ratios, high weight, or high volume may represent freight-heavy or operationally complex categories
clusters with low revenue, low order volume, and fewer sellers may represent low-performing or underdeveloped categories

From a business point of view, the clustering result helps marketplace managers move beyond individual category names and think in terms of category types. This is useful because different category types require different strategies. For example, high-volume categories may need supply reliability and competitive pricing, while freight-heavy categories may need logistics optimization and careful margin management.

The four clusters suggest the following marketing actions:

Broad marketplace categories: protect visibility, maintain seller supply, and use these categories in traffic-driving campaigns.
High-price niche categories: focus on product trust, high-quality content, reviews, and targeted advertising rather than mass promotions.
Low-performing or underdeveloped categories: test demand with small campaigns before investing heavily; improve assortment only if there is evidence of customer interest.
Freight-heavy low-price categories: avoid aggressive discount campaigns until shipping economics are improved, because freight can weaken conversion and margin.

9. Create the Supervised Learning Target

The supervised part of the project predicts whether a category is a high-revenue category. Because the analysis is performed at the category level and the number of categories is limited, we use the median category revenue as the cutoff. This creates a more balanced classification problem than using only the top 10% or top 25% of categories.

The target variable is:

high if category revenue is above the median
low if category revenue is at or below the median
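
The rule above can be sketched in a few lines of base R. This is an illustration with invented revenues, not the code behind create_high_revenue_target(); the names toy_revenue and revenue_cutoff are assumptions.

```r
# Toy revenues for 7 categories (invented values).
toy_revenue <- c(50, 1200, 300, 75000, 9000, 640, 21000)

revenue_cutoff <- median(toy_revenue)

# "high" strictly above the median, "low" at or below it.
high_revenue_category <- factor(
  ifelse(toy_revenue > revenue_cutoff, "high", "low"),
  levels = c("high", "low")
)

table(high_revenue_category)
```

With an odd number of distinct revenues, the median category itself is labeled low, so the low class ends up one category larger than the high class. This is the same pattern as the split of 36 high and 37 low categories reported in this section.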

analyzer$create_high_revenue_target(method = "median")

category_data <- analyzer$category_data

category_data %>%
  count(high_revenue_category)

## # A tibble: 2 × 2
##   high_revenue_category     n
##   <fct>                 <int>
## 1 high                     36
## 2 low                      37

The target is based on total_item_revenue, but total_item_revenue will not be used as a predictor. This avoids data leakage, because the model should not receive the same variable that defines the outcome.

10. Prepare the Supervised Learning Dataset

The supervised models use the category-level business indicators created earlier. The cluster label from the unsupervised analysis is not included as a predictor because the clustering step used total_item_revenue, and the supervised target is also defined from total_item_revenue. Including the cluster would create indirect data leakage and inflate the cross-validated ROC beyond what the model could achieve on genuinely new categories.

analyzer$prepare_supervised_data()

supervised_data <- analyzer$supervised_data

supervised_data %>%
  slice_head(n = 10)

## # A tibble: 10 × 8
##    high_revenue_category number_of_orders average_item_price
##    <fct>                            <int>              <dbl>
##  1 high                               583              201. 
##  2 high                               915              128. 
##  3 high                              1153               93.1
##  4 high                               848              116. 
##  5 high                               701              115. 
##  6 high                               777               85.6
##  7 high                               705               88.8
##  8 high                               364              164. 
##  9 high                               394              134. 
## 10 high                               410              127. 
## # ℹ 5 more variables: average_freight_ratio <dbl>, number_of_sellers <int>,
## #   average_review_score <dbl>, average_product_weight_g <dbl>,
## #   average_product_volume_cm3 <dbl>

The supervised dataset includes demand, price, logistics, seller, review, and product-size indicators. total_item_revenue is excluded because it defines the target, and cluster is excluded because the current cluster definition already used revenue.

11. Random Forest Model

Random Forest is useful for this project because it can capture nonlinear relationships and interactions between category characteristics. It also provides variable importance, which helps explain which business factors are most associated with high revenue.

Because the number of product categories is small, the model is evaluated with repeated 5-fold cross-validation instead of a single train-test split.
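
To make the resampling scheme concrete, the sketch below builds the fold structure of 5-fold cross-validation repeated 5 times for 73 rows, using base R only. It is a simplified illustration: the folds here are purely random, whereas caret additionally balances the class proportions within each fold.

```r
set.seed(123)

n_categories <- 73  # number of categories in the supervised dataset
n_folds <- 5
n_repeats <- 5

# For each repeat, shuffle the rows and deal them into 5 folds of near-equal size.
fold_assignments <- lapply(seq_len(n_repeats), function(r) {
  sample(rep(seq_len(n_folds), length.out = n_categories))
})

# Training-set size for each of the 25 resamples (rows outside the held-out fold).
training_sizes <- unlist(lapply(fold_assignments, function(folds) {
  sapply(seq_len(n_folds), function(k) sum(folds != k))
}))

length(training_sizes)
table(training_sizes)
```

Each model is therefore fitted 25 times on 58 or 59 categories and evaluated on the remaining 14 or 15. Averaging over the 25 held-out results gives a steadier estimate than one train-test split could provide with so few rows.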

analyzer$train_random_forest()

rf_model <- analyzer$rf_model
rf_model

## Random Forest 
## 
## 73 samples
##  7 predictor
##  2 classes: 'high', 'low' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times) 
## Summary of sample sizes: 59, 59, 58, 58, 58, 59, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   2     0.9893176  0.9514286  0.9514286
##   4     0.9854401  0.9285714  0.9621429
##   7     0.9781505  0.9121429  0.9407143
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

rf_importance <- analyzer$get_variable_importance(rf_model)

ggplot(
  rf_importance,
  aes(x = reorder(feature, Overall), y = Overall)
) +
  geom_col(fill = "#2f6f73") +
  coord_flip() +
  labs(
    title = "Random Forest Variable Importance",
    x = "Feature",
    y = "Importance"
  ) +
  theme_minimal()

From a business perspective, the most important variables indicate which category characteristics are most useful for identifying high-revenue categories. If order volume or seller count appears as important, this suggests that category scale and marketplace supply depth are central to revenue structure.

In commercial terms, Random Forest is useful because it shows which category levers matter most for identifying strong categories. If the leading variables are order volume and seller count, the message is clear: winning categories are not only expensive categories, but categories with active demand and enough supply. Marketing teams should coordinate with seller acquisition teams, because campaigns work better when the category has enough sellers, product variety, and stock availability.

12. XGBoost Model

XGBoost is a gradient boosting model that can capture complex nonlinear patterns. It is included as a stronger predictive benchmark against Random Forest. Because the dataset is small, the number of boosting rounds is chosen with native 5-fold cross-validation and early stopping: boosting stops once held-out performance no longer improves, which keeps the model from becoming more complex than the data supports.

analyzer$train_xgboost()

xgb_model <- analyzer$xgb_model
xgb_model

## ##### xgb.Booster
## call:
##   xgb.train(params = xgb_params, data = dtrain, nrounds = best_iteration, 
##     verbose = 0)
## # of features: 7 
## # of rounds:  12

xgb_importance <- analyzer$get_variable_importance(xgb_model)

ggplot(
  xgb_importance,
  aes(x = reorder(feature, Overall), y = Overall)
) +
  geom_col(fill = "#b24c38") +
  coord_flip() +
  labs(
    title = "XGBoost Variable Importance",
    x = "Feature",
    y = "Importance"
  ) +
  theme_minimal()

XGBoost can be useful if the relationship between category characteristics and high revenue is not simple. However, its results should be interpreted with care: the supervised dataset contains one row per product category (73 rows), not one row per order, so performance estimates and importance rankings rest on very little data.

From a business perspective, XGBoost acts as a second opinion. If it highlights the same drivers as Random Forest, the conclusion is stronger: high-revenue categories are mainly connected to demand scale and seller ecosystem depth. If XGBoost gives more weight to freight or price, that suggests category economics also matter and should be considered before deciding where to invest marketing budget.

13. Model Comparison

The models are compared using cross-validation results. The main metric is the area under the ROC curve (labeled ROC in the model output), because it evaluates how well the models separate high-revenue categories from low-revenue categories across all classification thresholds. Random Forest reports ROC, sensitivity, and specificity from repeated cross-validation; XGBoost reports cross-validated ROC from native XGBoost cross-validation.
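
As a small worked example of these metrics, the sketch below computes the area under the ROC curve, sensitivity, and specificity by hand for eight invented predictions. The probabilities and the 0.5 threshold are assumptions for illustration; "high" is treated as the positive class, matching the first factor level in the model output.

```r
# Toy held-out predictions (invented values): predicted probability of "high".
actual <- factor(
  c("high", "high", "high", "high", "low", "low", "low", "low"),
  levels = c("high", "low")
)
prob_high <- c(0.95, 0.80, 0.70, 0.40, 0.60, 0.30, 0.20, 0.10)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a randomly
# chosen "high" category receives a higher score than a randomly chosen "low" one.
n_high <- sum(actual == "high")
n_low <- sum(actual == "low")
score_ranks <- rank(prob_high)
auc <- (sum(score_ranks[actual == "high"]) - n_high * (n_high + 1) / 2) /
  (n_high * n_low)

# Sensitivity and specificity at a 0.5 threshold, with "high" as the positive class.
predicted <- ifelse(prob_high >= 0.5, "high", "low")
sensitivity <- sum(predicted == "high" & actual == "high") / n_high
specificity <- sum(predicted == "low" & actual == "low") / n_low

c(auc = auc, sensitivity = sensitivity, specificity = specificity)
```

Here 15 of the 16 possible high-low pairs are ranked correctly, giving an AUC of 0.9375, while sensitivity and specificity are both 0.75 at the chosen threshold. The AUC does not depend on the threshold, which is why it is the main comparison metric.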

model_comparison <- analyzer$compare_models()

model_comparison

##           model       ROC sensitivity specificity
## 1       XGBoost 0.9962963          NA          NA
## 2 Random Forest 0.9893176   0.9514286   0.9514286

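The better model is the one with the higher cross-validated ROC value. Here XGBoost (0.996) is marginally ahead of Random Forest (0.989), but the two estimates come from different resampling schemes, so a gap this small should not be read as a clear win. Business interpretability also matters: because the two models perform similarly, Random Forest may be the better choice if it is easier to explain to non-technical stakeholders.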

For decision-making, the exact model ranking is less important than the consistency of the business message. Both models should be used to identify which category characteristics are repeatedly associated with revenue strength. The models should not be used as automatic decision systems; they should support category prioritization, campaign planning, and seller development decisions.

14. Business Interpretation of Supervised Results

The supervised models connect category characteristics to revenue classification. The most important question is not only which model performs best, but also which variables explain the difference between high-revenue and low-revenue categories.

Important interpretation points:

if number_of_orders is highly important, high revenue is mainly volume-driven
if number_of_sellers is important, seller ecosystem depth matters
if average_item_price is important, price positioning helps distinguish category performance
if average_freight_ratio is important, logistics burden affects commercial attractiveness

From a business point of view, this means the marketplace can use category profiles to prioritize investment. Broad, high-volume categories may need seller coverage and stock reliability, high-price niche categories may need careful positioning, and freight-heavy categories may need logistics optimization before they can become more commercially attractive.

The main managerial conclusion is that revenue is driven by category momentum: customer demand, seller availability, and an offer structure that makes the category easy to buy. Marketing actions should therefore be connected to category readiness. A category with many orders and many sellers is ready for larger campaigns. A category with high freight burden may need operational fixes first. A category with low review quality may need seller or product quality improvements before more traffic is sent to it.

15. Business Recommendations

Based on the exploratory analysis, segmentation, and supervised models, the marketplace should manage product categories with differentiated strategies.

Protect and promote core revenue categories. Categories such as health and beauty, bed and bath, watches and gifts, sports and leisure, and computer accessories should receive regular marketing visibility because they combine demand volume with meaningful revenue contribution.
Use seller development as a growth lever. Seller count appears strongly connected with high-revenue categories, so growth is not only a customer-side marketing issue. The marketplace should recruit and retain sellers in categories where demand exists but assortment is still limited.
Treat high-price categories as trust-based purchases. Premium or niche categories need strong product information, reviews, guarantees, and targeted communication. Broad discounting is less important than reducing perceived purchase risk.
Fix freight-heavy categories before scaling promotion. Categories with high freight ratios may convert poorly or create margin pressure. Bundling, regional logistics, freight thresholds, and clearer total-price communication should be tested before large campaigns.
Monitor customer satisfaction in high-revenue categories. If a major category has weaker reviews, marketing may increase short-term sales while damaging customer trust. Quality improvement should come before additional traffic investment.

The final business message is that product category performance is not explained by one factor alone. Successful categories combine demand, supply, price positioning, operational feasibility, and customer trust. The marketplace should therefore prioritize categories where marketing investment is supported by strong seller coverage, manageable logistics, and a good customer experience.

Product Category Sales Structure and Category Revenue Prediction

Data Science Group

2026-06-09