Online marketplaces usually generate revenue from a wide range of product categories, but not all categories contribute in the same way. Some categories may drive revenue through very high order volume, while others may generate revenue through higher average prices or specialized demand. At the same time, operational factors such as freight cost, product size, seller availability, and customer satisfaction can affect how commercially attractive each category is.
The objective of this project is to understand the sales structure of product categories in the marketplace and identify which category characteristics are associated with high revenue.
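The analysis has two main parts: an unsupervised part that clusters product categories with similar profiles, and a supervised part that predicts which categories are high-revenue.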
The main business question is:
Which types of product categories drive marketplace revenue, and what makes a category commercially successful?
To answer this question, we will analyze each product category using indicators such as total item revenue, number of orders, average item price, freight ratio, number of sellers, review score, and product weight or size.
This project will focus on the following research questions:
The results can help marketplace managers understand where revenue is concentrated and which category types deserve strategic attention. For example, high-volume low-price categories may be important for customer traffic, while high-price niche categories may offer strong revenue potential with fewer orders. Freight-heavy categories may require logistics optimization, and low-performing categories may need better seller coverage, pricing adjustments, or customer experience improvements.
By combining clustering and prediction, the project will provide both a descriptive view of category structure and a practical model for identifying high-revenue category profiles.
This analysis uses R packages for data cleaning, visualization, clustering, classification, and model evaluation. The same libraries will be used throughout the project so that the workflow remains reproducible.
library(tidyverse)
library(readr)
library(janitor)
library(cluster)
library(factoextra)
library(rpart)
library(rpart.plot)
library(randomForest)
library(caret)
library(R6)
library(xgboost)
library(pROC)
OlistCategoryAnalyzer <- R6Class(
  "OlistCategoryAnalyzer",
  public = list(
    # Configuration
    data_path = NULL,
    sample_fraction = NULL,
    seed = NULL,
    # Raw tables
    order_items = NULL,
    products = NULL,
    orders = NULL,
    reviews = NULL,
    sellers = NULL,
    category_translation = NULL,
    # Derived tables
    reviews_by_order = NULL,
    sales_data = NULL,
    category_data = NULL,
    # Clustering objects
    cluster_data = NULL,
    cluster_scaled = NULL,
    kmeans_model = NULL,
    # Supervised learning objects
    supervised_data = NULL,
    rf_model = NULL,
    xgb_model = NULL,
    xgb_cv = NULL,
    xgb_feature_names = NULL,
    model_comparison = NULL,
    # Store the settings; no data is read until load_data() is called.
    initialize = function(data_path, sample_fraction = 0.10, seed = 123) {
      self$data_path <- data_path
      self$sample_fraction <- sample_fraction
      self$seed <- seed
    },
    # Read the six Olist CSV files from data_path.
    load_data = function() {
      self$order_items <- read_csv(
        file.path(self$data_path, "olist_order_items_dataset.csv")
      )
      self$products <- read_csv(
        file.path(self$data_path, "olist_products_dataset.csv")
      )
      self$orders <- read_csv(
        file.path(self$data_path, "olist_orders_dataset.csv")
      )
      self$reviews <- read_csv(
        file.path(self$data_path, "olist_order_reviews_dataset.csv")
      )
      self$sellers <- read_csv(
        file.path(self$data_path, "olist_sellers_dataset.csv")
      )
      self$category_translation <- read_csv(
        file.path(self$data_path, "product_category_name_translation.csv")
      )
      invisible(self)
    },
    # Row and column counts of the imported tables.
    get_data_overview = function() {
      tibble(
        dataset = c(
          "order_items",
          "products",
          "orders",
          "reviews",
          "sellers",
          "category_translation"
        ),
        rows = c(
          nrow(self$order_items),
          nrow(self$products),
          nrow(self$orders),
          nrow(self$reviews),
          nrow(self$sellers),
          nrow(self$category_translation)
        ),
        columns = c(
          ncol(self$order_items),
          ncol(self$products),
          ncol(self$orders),
          ncol(self$reviews),
          ncol(self$sellers),
          ncol(self$category_translation)
        )
      )
    },
    # Standardize column names, turn "unknown" strings into NA,
    # and collapse reviews to one row per order.
    clean_data = function() {
      self$order_items <- self$order_items %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$products <- self$products %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$orders <- self$orders %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$reviews <- self$reviews %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$sellers <- self$sellers %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      self$category_translation <- self$category_translation %>%
        clean_names() %>%
        private$convert_unknown_to_na()
      # One row per order avoids duplicating items when reviews are joined.
      self$reviews_by_order <- self$reviews %>%
        group_by(order_id) %>%
        summarise(
          average_review_score_order = mean(review_score, na.rm = TRUE),
          number_of_reviews = n(),
          .groups = "drop"
        )
      invisible(self)
    },
    # Join all tables at the order-item level, keep delivered orders,
    # draw the reproducible sample, and derive item-level variables.
    build_sales_data = function() {
      set.seed(self$seed)
      self$sales_data <- self$order_items %>%
        left_join(self$products, by = "product_id") %>%
        left_join(self$orders, by = "order_id") %>%
        left_join(self$reviews_by_order, by = "order_id") %>%
        left_join(self$sellers, by = "seller_id") %>%
        left_join(self$category_translation, by = "product_category_name") %>%
        filter(order_status == "delivered") %>%
        slice_sample(prop = self$sample_fraction) %>%
        mutate(
          # Fall back to the Portuguese name when no translation exists.
          product_category_name_english = if_else(
            is.na(product_category_name_english),
            product_category_name,
            product_category_name_english
          ),
          item_revenue = price,
          freight_ratio = freight_value / price,
          product_volume_cm3 = product_length_cm *
            product_height_cm *
            product_width_cm
        )
      invisible(self)
    },
    # Size of the sampled item-level table.
    get_sales_data_overview = function() {
      tibble(
        metric = c(
          "Delivered order items",
          "Delivered orders",
          "Product categories",
          "Sellers"
        ),
        value = c(
          nrow(self$sales_data),
          n_distinct(self$sales_data$order_id),
          n_distinct(self$sales_data$product_category_name_english),
          n_distinct(self$sales_data$seller_id)
        )
      )
    },
    # Aggregate order items to one row per product category.
    build_category_data = function() {
      self$category_data <- self$sales_data %>%
        filter(
          !is.na(product_category_name_english),
          !is.na(price),
          price > 0
        ) %>%
        mutate(
          # Guard against non-finite ratios before averaging.
          freight_ratio = if_else(
            is.finite(freight_ratio),
            freight_ratio,
            NA_real_
          )
        ) %>%
        group_by(product_category_name_english) %>%
        summarise(
          total_item_revenue = sum(item_revenue, na.rm = TRUE),
          number_of_orders = n_distinct(order_id),
          number_of_items = n(),
          average_item_price = mean(price, na.rm = TRUE),
          total_freight = sum(freight_value, na.rm = TRUE),
          average_freight = mean(freight_value, na.rm = TRUE),
          average_freight_ratio = mean(freight_ratio, na.rm = TRUE),
          number_of_sellers = n_distinct(seller_id),
          average_review_score = mean(
            average_review_score_order,
            na.rm = TRUE
          ),
          average_product_weight_g = mean(product_weight_g, na.rm = TRUE),
          average_product_volume_cm3 = mean(product_volume_cm3, na.rm = TRUE),
          .groups = "drop"
        ) %>%
        # Drop categories where a feature could not be computed.
        filter(
          !is.na(average_freight_ratio),
          !is.na(average_review_score),
          !is.na(average_product_weight_g),
          !is.na(average_product_volume_cm3)
        ) %>%
        arrange(desc(total_item_revenue))
      invisible(self)
    },
    # Headline figures of the category-level table.
    get_category_data_overview = function() {
      tibble(
        metric = c(
          "Product categories",
          "Total item revenue",
          "Median category revenue",
          "Average category revenue",
          "Average orders per category"
        ),
        value = c(
          nrow(self$category_data),
          sum(self$category_data$total_item_revenue),
          median(self$category_data$total_item_revenue),
          mean(self$category_data$total_item_revenue),
          mean(self$category_data$number_of_orders)
        )
      )
    },
    # Log-transform the skewed features and standardize all clustering inputs.
    prepare_cluster_data = function() {
      self$cluster_data <- self$category_data %>%
        transmute(
          revenue_log = log(total_item_revenue),
          orders_log = log(number_of_orders),
          average_item_price_log = log(average_item_price),
          average_freight_ratio = average_freight_ratio,
          sellers_log = log(number_of_sellers),
          average_review_score = average_review_score,
          product_weight_log = log(average_product_weight_g),
          product_volume_log = log(average_product_volume_cm3)
        )
      self$cluster_scaled <- scale(self$cluster_data)
      invisible(self)
    },
    # Fit K-means on the scaled matrix and attach the cluster labels.
    fit_kmeans = function(centers = 4, nstart = 25) {
      set.seed(self$seed)
      self$kmeans_model <- kmeans(
        self$cluster_scaled,
        centers = centers,
        nstart = nstart
      )
      self$category_data <- self$category_data %>%
        mutate(cluster = factor(self$kmeans_model$cluster))
      invisible(self)
    },
    # Average profile of each cluster on the original (unscaled) metrics.
    get_cluster_summary = function() {
      self$category_data %>%
        group_by(cluster) %>%
        summarise(
          number_of_categories = n(),
          total_cluster_revenue = sum(total_item_revenue),
          average_category_revenue = mean(total_item_revenue),
          average_orders = mean(number_of_orders),
          average_item_price = mean(average_item_price),
          average_freight_ratio = mean(average_freight_ratio),
          average_sellers = mean(number_of_sellers),
          average_review_score = mean(average_review_score),
          average_product_weight_g = mean(average_product_weight_g),
          average_product_volume_cm3 = mean(average_product_volume_cm3),
          .groups = "drop"
        ) %>%
        arrange(desc(average_category_revenue))
    },
    # Label categories above the median revenue as "high", the rest as "low".
    create_high_revenue_target = function(method = "median") {
      if (method != "median") {
        stop("Only the median target method is currently implemented.")
      }
      revenue_cutoff <- median(
        self$category_data$total_item_revenue,
        na.rm = TRUE
      )
      self$category_data <- self$category_data %>%
        mutate(
          high_revenue_category = if_else(
            total_item_revenue > revenue_cutoff,
            "high",
            "low"
          ),
          high_revenue_category = factor(
            high_revenue_category,
            levels = c("high", "low")
          )
        )
      invisible(self)
    },
    # Keep the target and the predictors; total revenue itself is excluded.
    prepare_supervised_data = function() {
      selected_columns <- c(
        "high_revenue_category",
        "number_of_orders",
        "average_item_price",
        "average_freight_ratio",
        "number_of_sellers",
        "average_review_score",
        "average_product_weight_g",
        "average_product_volume_cm3"
      )
      self$supervised_data <- self$category_data %>%
        select(all_of(selected_columns)) %>%
        drop_na() %>%
        mutate(
          high_revenue_category = factor(
            high_revenue_category,
            levels = c("high", "low")
          )
        )
      invisible(self)
    },
    # Random forest tuned with caret using repeated cross-validation.
    train_random_forest = function() {
      set.seed(self$seed)
      self$rf_model <- train(
        high_revenue_category ~ .,
        data = self$supervised_data,
        method = "rf",
        metric = "ROC",
        trControl = private$get_train_control(),
        tuneLength = 3,
        importance = TRUE,
        ntree = 500
      )
      invisible(self)
    },
    # XGBoost with the number of rounds chosen by 5-fold cross-validation.
    train_xgboost = function() {
      set.seed(self$seed)
      # model.matrix() adds an intercept column; [, -1] drops it.
      x <- model.matrix(
        high_revenue_category ~ .,
        data = self$supervised_data
      )[, -1]
      y <- if_else(
        self$supervised_data$high_revenue_category == "high",
        1,
        0
      )
      self$xgb_feature_names <- colnames(x)
      dtrain <- xgb.DMatrix(data = x, label = y)
      xgb_params <- list(
        objective = "binary:logistic",
        eval_metric = "auc",
        max_depth = 2,
        eta = 0.10,
        subsample = 0.8,
        colsample_bytree = 0.8
      )
      self$xgb_cv <- xgb.cv(
        params = xgb_params,
        data = dtrain,
        nrounds = 100,
        nfold = 5,
        stratified = TRUE,
        early_stopping_rounds = 10,
        verbose = 0
      )
      best_iteration <- self$xgb_cv$best_iteration
      # Fall back to the round with the highest mean test AUC.
      if (length(best_iteration) == 0 || is.null(best_iteration)) {
        best_iteration <- which.max(self$xgb_cv$evaluation_log$test_auc_mean)
      }
      self$xgb_model <- xgb.train(
        params = xgb_params,
        data = dtrain,
        nrounds = best_iteration,
        verbose = 0
      )
      invisible(self)
    },
    # Cross-validated metrics of both models, best ROC first.
    compare_models = function() {
      self$model_comparison <- bind_rows(
        private$get_best_model_metrics(self$rf_model, "Random Forest"),
        private$get_xgb_metrics()
      ) %>%
        arrange(desc(ROC))
      self$model_comparison
    },
    # Variable importance for an xgb.Booster (gain) or a caret model.
    get_variable_importance = function(model) {
      if (inherits(model, "xgb.Booster")) {
        return(
          xgb.importance(model = model) %>%
            as_tibble() %>%
            transmute(feature = Feature, Overall = Gain) %>%
            arrange(desc(Overall))
        )
      }
      importance <- varImp(model)$importance %>%
        rownames_to_column("feature")
      # Class-specific importance columns are averaged into Overall.
      if (!"Overall" %in% names(importance)) {
        numeric_columns <- importance %>%
          select(where(is.numeric)) %>%
          names()
        importance <- importance %>%
          mutate(Overall = rowMeans(across(all_of(numeric_columns))))
      }
      importance %>%
        arrange(desc(Overall))
    }
  ),
  private = list(
    # Replace the literal string "unknown" (any case, trimmed) with NA.
    convert_unknown_to_na = function(data) {
      data %>%
        mutate(
          across(
            where(is.character),
            ~ if_else(
              str_to_lower(str_trim(.x)) == "unknown",
              NA_character_,
              .x
            )
          )
        )
    },
    # 5-fold cross-validation repeated 5 times, scored with ROC, Sens, Spec.
    get_train_control = function() {
      trainControl(
        method = "repeatedcv",
        number = 5,
        repeats = 5,
        classProbs = TRUE,
        summaryFunction = twoClassSummary,
        savePredictions = "final"
      )
    },
    # Resampled metrics for the best tuning parameters of a caret model.
    get_best_model_metrics = function(model, model_name) {
      best_tune <- model$bestTune
      model$results %>%
        semi_join(best_tune, by = names(best_tune)) %>%
        transmute(
          model = model_name,
          ROC = ROC,
          sensitivity = Sens,
          specificity = Spec
        )
    },
    # Mean test AUC at the selected round; xgb.cv does not report Sens or Spec.
    get_xgb_metrics = function() {
      best_iteration <- self$xgb_cv$best_iteration
      if (length(best_iteration) == 0 || is.null(best_iteration)) {
        best_iteration <- which.max(self$xgb_cv$evaluation_log$test_auc_mean)
      }
      self$xgb_cv$evaluation_log %>%
        slice(best_iteration) %>%
        transmute(
          model = "XGBoost",
          ROC = test_auc_mean,
          sensitivity = NA_real_,
          specificity = NA_real_
        )
    }
  )
)
analyzer <- OlistCategoryAnalyzer$new(
  data_path = "olist dataset",
  # 10% sample, as described in the text; the original passed an
  # undefined `sample_fraction` variable here.
  sample_fraction = 0.10,
  seed = 123
)
The tidyverse, readr, and janitor packages support data import, cleaning, and transformation. The cluster and factoextra packages are used for the unsupervised learning part of the project. The rpart, rpart.plot, randomForest, xgboost, pROC, and caret packages are used for classification models and performance evaluation. The R6 package is used to organize the workflow in an object-oriented way.
From a business point of view, this setup allows us to move from raw marketplace transactions to category-level insights, category segments, and predictive models in one consistent analytical workflow.
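Every workflow method of the class ends with invisible(self), which is what allows the steps to be chained. The following minimal sketch illustrates the pattern with a hypothetical toy class, StepCounter, which is not part of the project:

```r
library(R6)

# Hypothetical toy class, used only to illustrate the invisible(self) pattern.
StepCounter <- R6Class(
  "StepCounter",
  public = list(
    steps = character(),
    add_step = function(name) {
      self$steps <- c(self$steps, name)
      invisible(self)  # returning the object makes the call chainable
    }
  )
)

counter <- StepCounter$new()
counter$add_step("load")$add_step("clean")$add_step("join")
counter$steps
```

With the same pattern, analyzer$load_data()$clean_data()$build_sales_data() would run the first three steps in one expression. The report calls them one at a time so that the intermediate tables can be inspected.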
The analysis uses the Olist Brazilian e-commerce dataset. For this project, the most important files are the order items, products, orders, reviews, sellers, and product category translation tables. Together, these files allow us to connect product categories with revenue, freight cost, seller activity, order status, and customer satisfaction.
analyzer$load_data()
order_items <- analyzer$order_items
products <- analyzer$products
orders <- analyzer$orders
reviews <- analyzer$reviews
sellers <- analyzer$sellers
category_translation <- analyzer$category_translation
The order_items dataset is the core sales table because it contains item prices and freight values. The products table adds product category and physical product characteristics. The orders table allows us to restrict the analysis to completed sales, that is, delivered orders. The reviews table adds customer satisfaction, and the sellers table allows us to measure how many sellers operate in each category.
The category names in the original dataset are in Portuguese, so the translation table will be used to make the final business interpretation clearer.
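As a small illustration of the translation step, the sketch below joins a toy product table to a toy translation table and falls back to the Portuguese name when no translation exists. The table names ending in _toy and the category "categoria_nova" are invented for this example. coalesce() is used here as a compact equivalent of the if_else(is.na(...)) expression in build_sales_data().

```r
library(dplyr)

# Toy data: the second category has no entry in the translation table.
products_toy <- tibble(
  product_id = c("p1", "p2"),
  product_category_name = c("beleza_saude", "categoria_nova")
)
translation_toy <- tibble(
  product_category_name = "beleza_saude",
  product_category_name_english = "health_beauty"
)

translated <- products_toy %>%
  left_join(translation_toy, by = "product_category_name") %>%
  mutate(
    # Keep the Portuguese name when the English one is missing.
    product_category_name_english = coalesce(
      product_category_name_english,
      product_category_name
    )
  )
translated
```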
data_overview <- analyzer$get_data_overview()
data_overview
## # A tibble: 6 × 3
## dataset rows columns
## <chr> <int> <int>
## 1 order_items 112650 7
## 2 products 32951 9
## 3 orders 99441 8
## 4 reviews 100000 7
## 5 sellers 3095 4
## 6 category_translation 71 2
This overview confirms that the necessary data sources were imported correctly. From a business perspective, the imported tables give us the minimum information needed to connect category-level revenue with demand, price positioning, logistics, seller supply, and customer satisfaction.
The next step is to combine the imported datasets into one analysis table. The main unit of observation is an order item, because revenue and freight values are recorded at the item level. Product, order, review, seller, and category translation information are then added to each item.
Before joining, we clean the column names and aggregate reviews at the order level. This is important because some orders can have more than one review record, and joining reviews directly could duplicate item revenue.
analyzer$clean_data()
order_items <- analyzer$order_items
products <- analyzer$products
orders <- analyzer$orders
reviews <- analyzer$reviews
sellers <- analyzer$sellers
category_translation <- analyzer$category_translation
reviews_by_order <- analyzer$reviews_by_order
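The reason for aggregating reviews before joining can be shown with invented toy data. In this sketch, order "o1" has one item but two review records, so a direct join repeats the item row and overstates revenue:

```r
library(dplyr)

# Toy data: order "o1" has one item but two review records.
items_toy <- tibble(order_id = c("o1", "o2"), price = c(100, 50))
reviews_toy <- tibble(
  order_id = c("o1", "o1", "o2"),
  review_score = c(5, 3, 4)
)

# Direct join: the item from "o1" appears twice.
direct <- left_join(items_toy, reviews_toy, by = "order_id")
sum(direct$price)  # 250 instead of the true 150

# Aggregating first gives one review row per order, so items are not repeated.
reviews_by_order_toy <- reviews_toy %>%
  group_by(order_id) %>%
  summarise(
    average_review_score_order = mean(review_score),
    .groups = "drop"
  )
safe <- left_join(items_toy, reviews_by_order_toy, by = "order_id")
sum(safe$price)  # 150
```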
The joined dataset keeps delivered orders only. This makes the revenue analysis more reliable because delivered orders represent completed marketplace transactions. To keep the analysis lighter and faster, the project uses a reproducible 10% sample of delivered order items.
analyzer$build_sales_data()
sales_data <- analyzer$sales_data
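The sample is reproducible because the seed is set immediately before slice_sample() is called. A minimal sketch on an invented 100-row table:

```r
library(dplyr)

toy <- tibble(item_id = 1:100)

set.seed(123)
first_draw <- slice_sample(toy, prop = 0.10)

set.seed(123)
second_draw <- slice_sample(toy, prop = 0.10)

nrow(first_draw)                    # 10 rows, i.e. 10% of 100
identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

Note that in build_sales_data() the filter on delivered orders runs before the sampling step, so the 10% refers to delivered order items rather than to all order items.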
The variables item_revenue, freight_ratio, and product_volume_cm3 are created because they will be useful for category-level analysis. Item revenue is the item price. The freight ratio is the freight value divided by the item price and measures the relative logistics burden of a category, while product volume (length × height × width) helps identify categories that may be operationally more difficult to handle.
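A small worked example of these derived variables, using invented values. The third row has a zero price to show why non-finite ratios are later replaced with NA in build_category_data():

```r
library(dplyr)

# Toy items; all values are invented for illustration.
items_toy <- tibble(
  price = c(100, 20, 0),
  freight_value = c(10, 15, 5),
  product_length_cm = c(30, 10, 10),
  product_height_cm = c(10, 5, 5),
  product_width_cm = c(20, 10, 10)
)

derived <- items_toy %>%
  mutate(
    freight_ratio = freight_value / price,  # 0.10, 0.75, Inf
    # Division by a zero price gives Inf, which is not a usable ratio.
    freight_ratio = if_else(is.finite(freight_ratio), freight_ratio, NA_real_),
    product_volume_cm3 = product_length_cm *
      product_height_cm *
      product_width_cm
  )
derived
```

In the second row, freight equals 75% of the item price, which is the kind of item that pulls a category's average freight ratio upward.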
sales_data_overview <- analyzer$get_sales_data_overview()
sales_data_overview
## # A tibble: 4 × 2
## metric value
## <chr> <int>
## 1 Delivered order items 11019
## 2 Delivered orders 10791
## 3 Product categories 74
## 4 Sellers 1656
From a business perspective, this joined table connects the key commercial dimensions of the marketplace: what was sold, how much revenue it generated, how expensive it was to ship, which sellers supplied it, and how customers reviewed the order. This table will be the foundation for building category-level features in the next step. Because the analysis uses a sample, the revenue figures should be interpreted as sample-based estimates of category structure rather than exact full-marketplace totals.
The project focuses on product categories, so the joined item-level data must be aggregated into one row per category. Each row will describe the commercial and operational profile of a category.
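The main category-level indicators are total item revenue, number of orders and items, average item price, total and average freight, average freight ratio, number of sellers, average review score, and average product weight and volume.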
analyzer$build_category_data()
category_data <- analyzer$category_data
The resulting dataset is the main analytical table for the rest of the project. It transforms raw transactions into category-level business indicators.
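The aggregation logic can be sketched on a toy item table. The data and the short column name category are invented. The sketch shows in particular that orders are counted with n_distinct(), so an order with two items counts once:

```r
library(dplyr)

# Toy item-level data: order "o1" contains two items from the same seller.
items_toy <- tibble(
  order_id = c("o1", "o1", "o2", "o3"),
  seller_id = c("s1", "s1", "s2", "s1"),
  category = c("toys", "toys", "toys", "auto"),
  price = c(10, 30, 20, 200)
)

category_toy <- items_toy %>%
  group_by(category) %>%
  summarise(
    total_item_revenue = sum(price),
    number_of_orders = n_distinct(order_id),  # orders, not item rows
    number_of_items = n(),
    average_item_price = mean(price),
    number_of_sellers = n_distinct(seller_id),
    .groups = "drop"
  ) %>%
  arrange(desc(total_item_revenue))
category_toy
```

Here "auto" ranks first with a single 200-unit item, while "toys" has three items in two orders and only 60 in revenue, a small-scale version of the price-versus-volume contrast examined later.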
category_data_overview <- analyzer$get_category_data_overview()
category_data_overview
## # A tibble: 5 × 2
## metric value
## <chr> <dbl>
## 1 Product categories 73
## 2 Total item revenue 1304787.
## 3 Median category revenue 3680.
## 4 Average category revenue 17874.
## 5 Average orders per category 146.
category_data %>%
  select(
    product_category_name_english,
    total_item_revenue,
    number_of_orders,
    average_item_price,
    average_freight_ratio,
    number_of_sellers,
    average_review_score,
    average_product_weight_g,
    average_product_volume_cm3
  ) %>%
  slice_head(n = 10)
## # A tibble: 10 × 9
## product_category_nam…¹ total_item_revenue number_of_orders average_item_price
## <chr> <dbl> <int> <dbl>
## 1 watches_gifts 119060. 583 201.
## 2 health_beauty 119060. 915 128.
## 3 bed_bath_table 109440. 1153 93.1
## 4 sports_leisure 100199. 848 116.
## 5 computers_accessories 82499. 701 115.
## 6 furniture_decor 69618. 777 85.6
## 7 housewares 64569. 705 88.8
## 8 cool_stuff 59775. 364 164.
## 9 garden_tools 55220. 394 134.
## 10 auto 52479. 410 127.
## # ℹ abbreviated name: ¹​product_category_name_english
## # ℹ 5 more variables: average_freight_ratio <dbl>, number_of_sellers <int>,
## # average_review_score <dbl>, average_product_weight_g <dbl>,
## # average_product_volume_cm3 <dbl>
From a business perspective, this table allows categories to be compared using the same set of indicators. A category with high revenue and many orders may be a broad demand driver, while a category with high average price but fewer orders may be a niche premium category. A high freight ratio or high product volume may indicate operational complexity, which is important when interpreting whether a category is commercially attractive.
Before applying clustering or prediction models, we first explore the category-level data. The goal of this section is to understand which categories generate the most revenue and how revenue relates to order volume, price, freight burden, seller availability, and customer satisfaction.
top_revenue_categories <- category_data %>%
  slice_max(total_item_revenue, n = 10)
ggplot(
  top_revenue_categories,
  aes(
    x = reorder(product_category_name_english, total_item_revenue),
    y = total_item_revenue
  )
) +
  geom_col(fill = "#2f6f73") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Top 10 Product Categories by Total Item Revenue",
    x = "Product category",
    y = "Total item revenue"
  ) +
  theme_minimal()
This chart identifies the categories that contribute the most revenue to the marketplace. These categories are strategically important because changes in their demand, pricing, seller availability, or logistics can have a large effect on total marketplace performance.
From a marketing point of view, the leading categories should be treated as the main commercial pillars of the marketplace. Categories such as watches and gifts, health and beauty, bed, bath and table, sports and leisure, and computer accessories are not only strong sellers; their high order counts suggest that they also act as traffic generators. These categories are strong candidates for homepage placement, seasonal campaigns, cross-selling actions, loyalty promotions, and seller acquisition programs.
ggplot(
  category_data,
  aes(x = number_of_orders, y = total_item_revenue)
) +
  geom_point(color = "#2f6f73", alpha = 0.75, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#b24c38") +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Order Volume",
    x = "Number of orders",
    y = "Total item revenue"
  ) +
  theme_minimal()
This relationship shows whether revenue is mainly driven by volume. If categories with more orders almost always have higher revenue, marketplace growth depends strongly on broad demand categories. Categories above the trend line may be especially attractive because they generate more revenue than their order volume alone would suggest.
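One way to make "above the trend line" concrete is to fit the same linear model that geom_smooth(method = "lm") draws and inspect the residuals. This is a sketch on invented numbers, and the column revenue_above_trend is a hypothetical name introduced for this example:

```r
# Toy category table; the numbers are invented for illustration.
category_toy <- data.frame(
  category = c("a", "b", "c", "d", "e"),
  number_of_orders = c(10, 20, 30, 40, 50),
  total_item_revenue = c(1000, 2500, 2900, 4100, 5000)
)

trend <- lm(total_item_revenue ~ number_of_orders, data = category_toy)

# A positive residual means more revenue than the fitted line predicts
# for that order volume.
category_toy$revenue_above_trend <- residuals(trend)
above <- category_toy[category_toy$revenue_above_trend > 0, ]
above[, c("category", "revenue_above_trend")]
```

In this toy table the fitted line is revenue = 220 + 96 × orders, and categories "b" and "d" lie above it.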
For business decisions, this means the marketplace should not only search for expensive products. Revenue is strongly connected to repeatable demand. The most commercially important categories are those that can attract many customers and many transactions. Marketing investment should therefore prioritize categories that combine demand frequency with enough price level to generate meaningful revenue.
ggplot(
  category_data,
  aes(x = average_item_price, y = number_of_orders)
) +
  geom_point(color = "#5b6770", alpha = 0.75, size = 2.5) +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Average Item Price Compared with Order Volume",
    x = "Average item price",
    y = "Number of orders"
  ) +
  theme_minimal()
This chart helps distinguish high-volume low-price categories from higher-price niche categories. From a business point of view, both can be useful: high-volume categories can drive traffic and repeat purchases, while higher-price categories can generate meaningful revenue with fewer transactions.
This creates two different marketing strategies. Low-price, high-volume categories are useful for acquisition campaigns, discounts, bundles, and retention because customers can buy them frequently. Higher-price niche categories require a different approach: stronger product information, trust signals, reviews, warranty communication, and targeted advertising to customers with specific purchase intent.
freight_heavy_categories <- category_data %>%
  slice_max(average_freight_ratio, n = 10)
ggplot(
  freight_heavy_categories,
  aes(
    x = reorder(product_category_name_english, average_freight_ratio),
    y = average_freight_ratio
  )
) +
  geom_col(fill = "#b24c38") +
  coord_flip() +
  scale_y_continuous(labels = scales::label_percent(accuracy = 1)) +
  labs(
    title = "Top 10 Categories by Average Freight Ratio",
    x = "Product category",
    y = "Average freight as a share of item price"
  ) +
  theme_minimal()
Categories with a high freight ratio may be less attractive operationally, especially if shipping cost represents a large share of the item price. These categories may require logistics optimization, better shipping agreements, or pricing strategies that account for delivery cost.
From a business perspective, freight-heavy categories are risky because shipping can reduce the attractiveness of the offer. Even if demand exists, customers may abandon purchases when delivery cost feels too high compared with the product price. For these categories, the marketplace should consider minimum basket thresholds, bundle offers, negotiated freight conditions, regional delivery strategies, or clearer communication of total delivered price.
ggplot(
  category_data,
  aes(x = number_of_sellers, y = total_item_revenue)
) +
  geom_point(color = "#2f6f73", alpha = 0.75, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, color = "#b24c38") +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Seller Availability",
    x = "Number of sellers",
    y = "Total item revenue"
  ) +
  theme_minimal()
This plot shows whether categories with more sellers also generate more revenue. A larger seller base can increase product variety, availability, and price competition. However, if revenue remains low despite many sellers, the category may have weak demand or poor positioning.
For marketplace management, seller availability is a commercial asset. A larger seller base usually means more assortment, better stock coverage, and more competitive offers. Categories with high demand but fewer sellers are good targets for seller recruitment. Categories with many sellers but weak revenue need marketing review: the problem may be poor visibility, weak customer demand, or an undifferentiated assortment.
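A simple screening metric for "high demand but fewer sellers" is the number of orders per seller. The sketch below uses invented values, and orders_per_seller is a hypothetical indicator that the report itself does not compute:

```r
library(dplyr)

# Toy category table; the numbers are invented for illustration.
category_toy <- tibble(
  category = c("a", "b", "c"),
  number_of_orders = c(900, 300, 40),
  number_of_sellers = c(30, 60, 20)
)

seller_coverage <- category_toy %>%
  mutate(orders_per_seller = number_of_orders / number_of_sellers) %>%
  arrange(desc(orders_per_seller))
seller_coverage
```

Under this reading, category "a" (30 orders per seller) would be a candidate for seller recruitment, while "c" (2 orders per seller) has many sellers relative to its demand.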
ggplot(
  category_data,
  aes(x = average_review_score, y = total_item_revenue)
) +
  geom_point(color = "#5b6770", alpha = 0.75, size = 2.5) +
  scale_y_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(
    title = "Category Revenue Compared with Average Review Score",
    x = "Average review score",
    y = "Total item revenue"
  ) +
  theme_minimal()
Customer satisfaction is important when interpreting category quality. A category with strong revenue but relatively low review scores may represent a business risk, because poor customer experience can reduce repeat purchases and damage marketplace trust.
In marketing terms, review score protects long-term revenue. A category can perform well in the short term, but if customers are not satisfied, promotion may amplify a bad experience. High-revenue categories with weaker reviews should be monitored carefully before receiving additional campaign investment. Improving seller quality, delivery reliability, and product descriptions may be more valuable than simply increasing advertising spend.
Overall, the exploratory results suggest that the marketplace should manage categories as a portfolio. Some categories are revenue engines, some are traffic builders, some are premium niches, and some are operationally expensive. A good marketing strategy should not treat all categories equally; each category type needs a different mix of promotion, pricing, logistics support, and seller development.
This exploratory analysis gives only an initial view of the sales structure of product categories. The next step will use clustering to group categories with similar commercial and operational profiles.
The unsupervised part of the project groups product categories with similar commercial and operational profiles. Clustering uses only numeric category-level features, so the category name is excluded from the clustering matrix.
Several variables, such as revenue, order count, average price, seller count, weight, and volume, are highly skewed. All variables selected for log transformation are strictly positive in the category-level dataset, so we use the natural logarithm log(x). This reduces the influence of extremely large categories, so that no single dimension of category structure dominates the distances used by the clustering algorithm.
analyzer$prepare_cluster_data()
cluster_data <- analyzer$cluster_data
cluster_scaled <- analyzer$cluster_scaled
Scaling is necessary because the variables are measured in different units. Without scaling, variables with large numeric ranges, such as revenue or product volume, would dominate the clustering result.
summary(cluster_data)
## revenue_log orders_log average_item_price_log average_freight_ratio
## Min. : 3.218 Min. :0.000 Min. :2.561 Min. :0.04425
## 1st Qu.: 6.750 1st Qu.:2.197 1st Qu.:4.304 1st Qu.:0.25714
## Median : 8.211 Median :3.219 Median :4.743 Median :0.32414
## Mean : 8.163 Mean :3.449 Mean :4.701 Mean :0.35104
## 3rd Qu.: 9.736 3rd Qu.:5.112 3rd Qu.:4.992 3rd Qu.:0.37361
## Max. :11.687 Max. :7.050 Max. :6.915 Max. :1.16370
## sellers_log average_review_score product_weight_log product_volume_log
## Min. :0.000 Min. :1.000 Min. :4.605 Min. : 7.270
## 1st Qu.:1.386 1st Qu.:3.946 1st Qu.:6.455 1st Qu.: 8.646
## Median :2.398 Median :4.064 Median :7.381 Median : 9.514
## Mean :2.496 Mean :4.016 Mean :7.293 Mean : 9.352
## 3rd Qu.:3.689 3rd Qu.:4.290 3rd Qu.:7.925 3rd Qu.: 9.977
## Max. :5.421 Max. :5.000 Max. :9.386 Max. :11.175
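The effect of the two preparation steps, log transformation followed by standardization, can be sketched on a single invented, right-skewed variable:

```r
# Invented, right-skewed revenue values for five categories.
revenue <- c(25, 400, 3500, 18000, 119000)

# Natural logarithm; defined here because all values are strictly positive.
revenue_log <- log(revenue)

# scale() subtracts the mean and divides by the standard deviation.
revenue_scaled <- scale(revenue_log)

max(revenue) / median(revenue)          # ratio of largest to median, raw scale
max(revenue_log) / median(revenue_log)  # the same ratio on the log scale
c(mean = mean(revenue_scaled), sd = sd(revenue_scaled))  # about 0 and 1
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the Euclidean distances that K-means uses.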
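Two common methods are used to support the choice of the number of clusters: the elbow method, based on the total within-cluster sum of squares, and the silhouette method, based on the average silhouette width.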
set.seed(123)
fviz_nbclust(
  cluster_scaled,
  kmeans,
  method = "wss",
  k.max = 10
) +
  labs(title = "Elbow Method for Choosing Number of Clusters") +
  theme_minimal()
set.seed(123)
fviz_nbclust(
  cluster_scaled,
  kmeans,
  method = "silhouette",
  k.max = 10
) +
  labs(title = "Silhouette Method for Choosing Number of Clusters") +
  theme_minimal()
The final number of clusters should balance statistical evidence and business interpretability. A very small number of clusters may hide important differences between category types, while too many clusters may create groups that are hard to explain or act on.
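For readers who want to see the quantities behind the two diagnostic plots, the sketch below computes them by hand on invented data with three well-separated groups. It is an illustration of the two criteria, not a reproduction of the internals of fviz_nbclust(), and the names toy and diagnostics are introduced only for this example:

```r
library(cluster)

set.seed(123)
# Toy data: three well-separated groups of 20 points in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
toy_dist <- dist(toy)

diagnostics <- do.call(rbind, lapply(2:6, function(k) {
  fit <- kmeans(toy, centers = k, nstart = 25)
  sil <- silhouette(fit$cluster, toy_dist)
  data.frame(
    k = k,
    # Elbow criterion: look for the k after which the decrease flattens.
    total_withinss = fit$tot.withinss,
    # Silhouette criterion: higher average width means better separation.
    avg_silhouette = mean(sil[, "sil_width"])
  )
}))
diagnostics
```

On this toy data the average silhouette width peaks at k = 3, the number of groups that were simulated. On real category data the two criteria are usually less clear-cut, which is why interpretability also enters the decision.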
For this project, we will use four clusters as a practical starting point. This provides a simpler and more interpretable segmentation while still allowing us to compare broad marketplace categories, high-price niche categories, freight-heavy categories, and lower-performing or underdeveloped categories.
After preparing and scaling the category-level variables, we apply K-means clustering. The model assigns each product category to one of four groups based on similarities in revenue, order volume, price, freight ratio, seller base, review score, and product size.
analyzer$fit_kmeans(centers = 4, nstart = 25)
kmeans_model <- analyzer$kmeans_model
category_data <- analyzer$category_data
The nstart = 25 argument runs K-means 25 times with different random starting centers and keeps the solution with the lowest total within-cluster sum of squares. This makes the clustering result less sensitive to a poor random initialization.
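The effect of multiple random starts can be checked on invented data by comparing the total within-cluster sum of squares of a single start with that of 25 starts:

```r
set.seed(123)
toy <- matrix(rnorm(400), ncol = 4)  # 100 invented observations, 4 features

set.seed(123)
single_start <- kmeans(toy, centers = 4, nstart = 1)

set.seed(123)
many_starts <- kmeans(toy, centers = 4, nstart = 25)

# kmeans() keeps the run with the lowest total within-cluster sum of squares.
c(single = single_start$tot.withinss, many = many_starts$tot.withinss)
```

Because both calls start from the same seed, the first of the 25 starts matches the single start, so the multi-start result can only be equal or better.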
fviz_cluster(
  kmeans_model,
  data = cluster_scaled,
  geom = "point",
  ellipse.type = "convex",
  show.clust.cent = TRUE
) +
  labs(title = "K-Means Clusters of Product Categories") +
  theme_minimal()
This visualization shows how categories separate into different groups after dimension reduction. The chart is useful for checking whether the clusters have some separation, but the final business meaning must come from the original category-level metrics.
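When the data has more than two columns, fviz_cluster() draws the points on the first two principal components. How much of the total variance such a two-dimensional view retains can be checked with prcomp(). The sketch below uses an invented matrix, toy_scaled, as a stand-in for the scaled category matrix:

```r
set.seed(123)
# Invented stand-in for the scaled category matrix: 30 rows, 8 features.
toy_scaled <- scale(matrix(rnorm(240), ncol = 8))

pca <- prcomp(toy_scaled)
variance_share <- pca$sdev^2 / sum(pca$sdev^2)

# Share of total variance visible in a plot of the first two components.
sum(variance_share[1:2])
```

If this share is low, clusters that overlap in the plot may still be well separated in the full eight-dimensional space, which is one more reason to interpret the clusters through the original metrics.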
cluster_summary <- analyzer$get_cluster_summary()
cluster_summary
## # A tibble: 4 × 11
## cluster number_of_categories total_cluster_revenue average_category_revenue
## <fct> <int> <dbl> <dbl>
## 1 4 20 1080685. 54034.
## 2 1 24 192631. 8026.
## 3 3 25 30183. 1207.
## 4 2 4 1287. 322.
## # ℹ 7 more variables: average_orders <dbl>, average_item_price <dbl>,
## # average_freight_ratio <dbl>, average_sellers <dbl>,
## # average_review_score <dbl>, average_product_weight_g <dbl>,
## # average_product_volume_cm3 <dbl>
The cluster summary translates the statistical output into business terms. It shows whether each cluster is characterized by high revenue, high volume, high prices, high freight burden, many sellers, stronger reviews, or larger product dimensions.
The following plot compares clusters across the most important business characteristics. Each characteristic is standardized across the four cluster averages, so a positive bar indicates that a cluster is above the mean of the four clusters for that characteristic, while a negative bar indicates that it is below that mean.
cluster_profile_plot_data <- cluster_summary %>%
  select(
    cluster,
    average_category_revenue,
    average_orders,
    average_item_price,
    average_freight_ratio,
    average_sellers,
    average_review_score,
    average_product_weight_g,
    average_product_volume_cm3
  ) %>%
  pivot_longer(
    cols = -cluster,
    names_to = "characteristic",
    values_to = "value"
  ) %>%
  group_by(characteristic) %>%
  mutate(standardized_value = as.numeric(scale(value))) %>%
  ungroup() %>%
  mutate(
    characteristic = recode(
      characteristic,
      average_category_revenue = "Revenue",
      average_orders = "Orders",
      average_item_price = "Price",
      average_freight_ratio = "Freight ratio",
      average_sellers = "Sellers",
      average_review_score = "Review score",
      average_product_weight_g = "Weight",
      average_product_volume_cm3 = "Volume"
    ),
    characteristic = factor(
      characteristic,
      levels = c(
        "Revenue",
        "Orders",
        "Price",
        "Freight ratio",
        "Sellers",
        "Review score",
        "Weight",
        "Volume"
      )
    )
  )
ggplot(
  cluster_profile_plot_data,
  aes(
    x = characteristic,
    y = standardized_value,
    fill = cluster
  )
) +
  geom_col(position = position_dodge(width = 0.75), width = 0.65) +
  geom_hline(yintercept = 0, color = "#5b6770", linewidth = 0.4) +
  labs(
    title = "Product Category Segmentation by Cluster Characteristics",
    x = "Business characteristic",
    y = "Standardized value",
    fill = "Cluster"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 35, hjust = 1)
  )
This segmentation view makes the cluster profiles easier to compare. For example, a cluster with positive bars for revenue, orders, and sellers represents a broad marketplace group, while a cluster with a positive freight ratio bar and a negative price bar may represent categories with weaker logistics economics.
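The standardization behind the bars can be traced for a single characteristic. The sketch below uses the average category revenue per cluster, rounded from the cluster summary table above; the variable names are illustrative.

```r
# Average category revenue per cluster, rounded from the summary table
# (clusters listed in the order they appear there: 4, 1, 3, 2).
avg_revenue <- c(`4` = 54034, `1` = 8026, `3` = 1207, `2` = 322)

# scale() subtracts the mean of the four values and divides by their
# standard deviation, giving one z-score per cluster.
z_revenue <- as.numeric(scale(avg_revenue))
names(z_revenue) <- names(avg_revenue)

round(z_revenue, 2)
```

The mean of the four averages is pulled up by cluster 4, so only cluster 4 gets a positive revenue bar and the other three clusters fall below zero. A negative bar therefore means "below the mean of the four clusters", not "poor in absolute terms".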
The segmentation plot should be read as a category management map. Clusters with strong revenue, orders, and sellers are the marketplace core and should receive consistent visibility and seller relationship management. Clusters with higher prices and larger products are more specialized; they need trust-building content, detailed product pages, and targeted campaigns rather than broad discounting. Clusters with high freight pressure require operational attention before aggressive marketing, because promotions can increase volume without improving profitability or customer experience.
category_data %>%
group_by(cluster) %>%
arrange(desc(total_item_revenue), .by_group = TRUE) %>%
slice_head(n = 5) %>%
ungroup() %>%
select(
cluster,
product_category_name_english,
total_item_revenue,
number_of_orders,
average_item_price,
average_freight_ratio,
number_of_sellers,
average_review_score
)
## # A tibble: 19 × 8
## cluster product_category_name_english total_item_revenue number_of_orders
## <fct> <chr> <dbl> <int>
## 1 1 office_furniture 26117. 166
## 2 1 computers 22161. 22
## 3 1 musical_instruments 16669. 63
## 4 1 construction_tools_construction 14332. 86
## 5 1 luggage_accessories 14187. 114
## 6 2 fashion_underwear_beach 1176. 14
## 7 2 home_comfort_2 60.8 3
## 8 2 cine_photo 25.9 2
## 9 2 fashion_sport 25.0 1
## 10 3 fixed_telephony 6235. 29
## 11 3 audio 3239. 39
## 12 3 food 2867. 48
## 13 3 drinks 2409. 33
## 14 3 books_technical 2200. 23
## 15 4 watches_gifts 119060. 583
## 16 4 health_beauty 119060. 915
## 17 4 bed_bath_table 109440. 1153
## 18 4 sports_leisure 100199. 848
## 19 4 computers_accessories 82499. 701
## # ℹ 4 more variables: average_item_price <dbl>, average_freight_ratio <dbl>,
## # number_of_sellers <int>, average_review_score <dbl>
Looking at example categories helps validate whether the clusters make practical sense. A cluster should not only be statistically different, but also meaningful for marketplace decision-making.
The four clusters can be interpreted by comparing their average metrics:
From a business point of view, the clustering result helps marketplace managers move beyond individual category names and think in terms of category types. This is useful because different category types require different strategies. For example, high-volume categories may need supply reliability and competitive pricing, while freight-heavy categories may need logistics optimization and careful margin management.
The four clusters suggest the following marketing actions:
The supervised part of the project predicts whether a category is a high-revenue category. Because the analysis is performed at the category level and the number of categories is limited, we use the median category revenue as the cutoff. This creates a more balanced classification problem than using only the top 10% or top 25% of categories.
The target variable is:
high if category revenue is above the median
low if category revenue is at or below the median
analyzer$create_high_revenue_target(method = "median")
category_data <- analyzer$category_data
category_data %>%
count(high_revenue_category)
## # A tibble: 2 × 2
## high_revenue_category n
## <fct> <int>
## 1 high 36
## 2 low 37
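The 36 versus 37 split follows directly from the median rule with 73 categories. A minimal sketch, assuming hypothetical revenue values (revenue_demo is simulated and is not the project's data):

```r
set.seed(7)
# Hypothetical revenues for 73 categories (right-skewed, as revenue often is).
revenue_demo <- rlnorm(73, meanlog = 7, sdlog = 1.5)

# The median is the cutoff; values equal to the median count as "low".
cutoff <- median(revenue_demo)
high_revenue_demo <- factor(
  ifelse(revenue_demo > cutoff, "high", "low"),
  levels = c("high", "low")
)

table(high_revenue_demo)
```

With an odd number of distinct values, the median is itself one of the observations, so it falls into the low class and the classes differ in size by exactly one.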
The target is based on total_item_revenue, but
total_item_revenue will not be used as a predictor. This
avoids data leakage, because the model should not receive the same
variable that defines the outcome.
The supervised models use the category-level business indicators
created earlier. The cluster label from the unsupervised analysis is not
included as a predictor because the clustering step used
total_item_revenue, and the supervised target is also
defined from total_item_revenue. Including the cluster
would create indirect data leakage and make the ROC unrealistically
high.
analyzer$prepare_supervised_data()
supervised_data <- analyzer$supervised_data
supervised_data %>%
slice_head(n = 10)
## # A tibble: 10 × 8
## high_revenue_category number_of_orders average_item_price
## <fct> <int> <dbl>
## 1 high 583 201.
## 2 high 915 128.
## 3 high 1153 93.1
## 4 high 848 116.
## 5 high 701 115.
## 6 high 777 85.6
## 7 high 705 88.8
## 8 high 364 164.
## 9 high 394 134.
## 10 high 410 127.
## # ℹ 5 more variables: average_freight_ratio <dbl>, number_of_sellers <int>,
## # average_review_score <dbl>, average_product_weight_g <dbl>,
## # average_product_volume_cm3 <dbl>
The supervised dataset includes demand, price, logistics, seller,
review, and product-size indicators. total_item_revenue is
excluded because it defines the target, and cluster is
excluded because the current cluster definition already used
revenue.
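The exclusion rule can be made explicit in code. The sketch below is an assumption about how such a step might look, not the implementation of prepare_supervised_data(); the data frame category_demo and its values are hypothetical, while the column names follow those used in this report.

```r
# Hypothetical category-level frame with three categories.
category_demo <- data.frame(
  product_category_name_english = c("a", "b", "c"),
  total_item_revenue = c(500, 50, 5),
  cluster = factor(c(4, 1, 2)),
  high_revenue_category = factor(
    c("high", "low", "low"),
    levels = c("high", "low")
  ),
  number_of_orders = c(40L, 6L, 1L),
  average_item_price = c(12.5, 8.3, 5.0)
)

# Columns that either define the target, were derived from it, or are
# identifiers rather than predictors.
leaky_or_id_columns <- c(
  "product_category_name_english", "total_item_revenue", "cluster"
)
supervised_demo <- category_demo[
  , setdiff(names(category_demo), leaky_or_id_columns)
]

# Guard against leakage creeping back in after later changes.
stopifnot(
  !any(c("total_item_revenue", "cluster") %in% names(supervised_demo))
)
names(supervised_demo)
```

Keeping the exclusion list in one named vector makes the leakage decision visible and easy to review.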
Random Forest is useful for this project because it can capture nonlinear relationships and interactions between category characteristics. It also provides variable importance, which helps explain which business factors are most associated with high revenue.
Because the number of product categories is small, the model is evaluated with repeated 5-fold cross-validation instead of a single train-test split.
analyzer$train_random_forest()
rf_model <- analyzer$rf_model
rf_model
## Random Forest
##
## 73 samples
## 7 predictor
## 2 classes: 'high', 'low'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 5 times)
## Summary of sample sizes: 59, 59, 58, 58, 58, 59, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.9893176 0.9514286 0.9514286
## 4 0.9854401 0.9285714 0.9621429
## 7 0.9781505 0.9121429 0.9407143
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
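The sample sizes of 58 and 59 in the output above follow from splitting 73 categories into 5 folds. The following sketch reproduces that arithmetic in base R. It is a simplified assumption: the folds here are drawn without stratifying by class, so it illustrates the resampling scheme, not the exact folds used by the model.

```r
set.seed(2024)
n_categories <- 73
n_folds <- 5
n_repeats <- 5

train_sizes <- unlist(lapply(seq_len(n_repeats), function(r) {
  # Shuffle fold labels so every category lands in exactly one fold per repeat.
  fold_id <- sample(rep(seq_len(n_folds), length.out = n_categories))
  # The training set for fold k is every category outside fold k.
  vapply(seq_len(n_folds), function(k) sum(fold_id != k), integer(1))
}))

length(train_sizes)  # number of resamples: 5 folds x 5 repeats
table(train_sizes)   # training-set sizes across all resamples
```

Each held-out fold contains only 14 or 15 categories, so a single fold gives a coarse estimate; repeating the split and averaging over 25 resamples smooths this out.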
rf_importance <- analyzer$get_variable_importance(rf_model)
ggplot(
rf_importance,
aes(x = reorder(feature, Overall), y = Overall)
) +
geom_col(fill = "#2f6f73") +
coord_flip() +
labs(
title = "Random Forest Variable Importance",
x = "Feature",
y = "Importance"
) +
theme_minimal()
From a business perspective, the most important variables indicate which category characteristics are most useful for identifying high-revenue categories. If order volume or seller count appears as important, this suggests that category scale and marketplace supply depth are central to revenue structure.
In commercial terms, Random Forest is useful because it shows which category levers matter most for identifying strong categories. If the leading variables are order volume and seller count, the message is clear: winning categories are not only expensive categories, but categories with active demand and enough supply. Marketing teams should coordinate with seller acquisition teams, because campaigns work better when the category has enough sellers, product variety, and stock availability.
XGBoost is a gradient boosting model that can capture complex nonlinear patterns. It is included as a second predictive benchmark alongside Random Forest. Since the dataset is small, the model uses native 5-fold cross-validation with early stopping: boosting stops once the cross-validated score no longer improves, and the final model is refit with that number of rounds, which limits overfitting.
analyzer$train_xgboost()
xgb_model <- analyzer$xgb_model
xgb_model
## ##### xgb.Booster
## call:
## xgb.train(params = xgb_params, data = dtrain, nrounds = best_iteration,
## verbose = 0)
## # of features: 7
## # of rounds: 12
xgb_importance <- analyzer$get_variable_importance(xgb_model)
ggplot(
xgb_importance,
aes(x = reorder(feature, Overall), y = Overall)
) +
geom_col(fill = "#b24c38") +
coord_flip() +
labs(
title = "XGBoost Variable Importance",
x = "Feature",
y = "Importance"
) +
theme_minimal()
XGBoost can be useful if the relationship between category characteristics and high revenue is not simple. However, its results should be interpreted with care because the supervised dataset contains only 73 rows, one per product category and not one per order, so importance scores and cross-validated metrics can shift noticeably with small changes in the data.
From a business perspective, XGBoost acts as a second opinion. If it highlights the same drivers as Random Forest, the conclusion is stronger: high-revenue categories are mainly connected to demand scale and seller ecosystem depth. If XGBoost gives more weight to freight or price, that suggests category economics also matter and should be considered before deciding where to invest marketing budget.
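Whether the two models highlight the same drivers can be checked numerically by comparing their importance rankings. The sketch below assumes importance tables with the columns feature and Overall, as used in the plots above. The scores themselves are hypothetical placeholders, NOT results from this report.

```r
# Hypothetical importance scores for illustration only.
rf_importance_demo <- data.frame(
  feature = c("number_of_orders", "number_of_sellers",
              "average_item_price", "average_freight_ratio"),
  Overall = c(100, 60, 25, 10)
)
xgb_importance_demo <- data.frame(
  feature = c("number_of_orders", "average_item_price",
              "number_of_sellers", "average_freight_ratio"),
  Overall = c(100, 30, 20, 5)
)

# Align the two tables by feature name.
both <- merge(
  rf_importance_demo, xgb_importance_demo,
  by = "feature", suffixes = c("_rf", "_xgb")
)

# Spearman correlation compares rankings, not raw importance scales,
# which matters because the two models measure importance differently.
rank_agreement <- cor(both$Overall_rf, both$Overall_xgb, method = "spearman")
rank_agreement
```

A value near 1 means both models rank the features in nearly the same order; in this made-up example the two models swap the second and third features and the correlation is 0.8.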
The models are compared using cross-validation results. ROC, meaning the area under the ROC curve (AUC), is the main metric because it evaluates how well the models separate high-revenue categories from low-revenue categories across classification thresholds. Random Forest reports ROC, sensitivity, and specificity from repeated cross-validation; XGBoost reports only cross-validated ROC from native XGBoost cross-validation, which is why its sensitivity and specificity are NA in the table below.
model_comparison <- analyzer$compare_models()
model_comparison
## model ROC sensitivity specificity
## 1 XGBoost 0.9962963 NA NA
## 2 Random Forest 0.9893176 0.9514286 0.9514286
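The better model is the one with the higher cross-validated ROC value. Here XGBoost (0.996) is slightly ahead of Random Forest (0.989), but the gap is small and the two values come from different resampling schemes, so it should not be over-interpreted. Business interpretability also matters: when the two models perform this similarly, Random Forest may be easier to explain to non-technical stakeholders.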
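To make the ROC metric concrete, the area under the ROC curve can be computed by hand from predicted scores. The sketch below uses the rank-based (Mann-Whitney) formulation; the function name auc_from_scores and the six example scores are illustrative assumptions, not output from either model.

```r
# AUC equals the probability that a randomly chosen positive case receives
# a higher score than a randomly chosen negative case.
auc_from_scores <- function(scores, labels, positive = "high") {
  is_pos <- labels == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  ranks <- rank(scores)  # average ranks are used for tied scores
  (sum(ranks[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted probabilities of the "high" class.
labels_demo <- c("high", "high", "high", "low", "low", "low")
scores_demo <- c(0.9, 0.8, 0.4, 0.6, 0.3, 0.2)

auc_from_scores(scores_demo, labels_demo)
```

In this example 8 of the 9 high/low pairs are ordered correctly, so the AUC is 8/9, about 0.889. A value of 1 means every high-revenue category is scored above every low-revenue category, and 0.5 corresponds to random ordering.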
For decision-making, the exact model ranking is less important than the consistency of the business message. Both models should be used to identify which category characteristics are repeatedly associated with revenue strength. The models should not be used as automatic decision systems; they should support category prioritization, campaign planning, and seller development decisions.
The supervised models connect category characteristics to revenue classification. The most important question is not only which model performs best, but also which variables explain the difference between high-revenue and low-revenue categories.
Important interpretation points:
If number_of_orders is highly important, high revenue is mainly volume-driven.
If number_of_sellers is important, seller ecosystem depth matters.
If average_item_price is important, price positioning helps distinguish category performance.
If average_freight_ratio is important, logistics burden affects commercial attractiveness.
From a business point of view, this means the marketplace can use category profiles to prioritize investment. Broad, high-volume categories may need seller coverage and stock reliability, high-price niche categories may need careful positioning, and freight-heavy categories may need logistics optimization before they can become more commercially attractive.
The main managerial conclusion is that revenue is driven by category momentum: customer demand, seller availability, and an offer structure that makes the category easy to buy. Marketing actions should therefore be connected to category readiness. A category with many orders and many sellers is ready for larger campaigns. A category with high freight burden may need operational fixes first. A category with low review quality may need seller or product quality improvements before more traffic is sent to it.
Based on the exploratory analysis, segmentation, and supervised models, the marketplace should manage product categories with differentiated strategies.
The final business message is that product category performance is not explained by one factor alone. Successful categories combine demand, supply, price positioning, operational feasibility, and customer trust. The marketplace should therefore prioritize categories where marketing investment is supported by strong seller coverage, manageable logistics, and a good customer experience.