The rapid growth of e-commerce has created many opportunities for online businesses, but it has also brought challenges such as product returns and unpredictable sales demand. Product returns can increase operational costs, affect customer satisfaction, and reduce business efficiency. At the same time, accurate sales quantity prediction is important for inventory planning, stock management, and revenue growth.
This project analyzes an e-commerce transaction dataset to study return behavior and sales quantity prediction. The project focuses on two main machine learning tasks: classification and regression. The classification task aims to predict whether an order will be returned, while the regression task aims to predict the quantity of products sold. These analyses can help e-commerce companies identify return drivers, improve decision-making, reduce unnecessary costs, and manage stock more effectively.
The dataset used in this project is a synthetic e-commerce sales dataset, stored in the file synthetic_ecommerce_sales_2025.csv. It contains 100,000 transaction records and 13 variables, and its order dates run from 2023-01-01 to 2025-12-31. The dataset includes customer, product, order, payment, delivery, return, rating, discount, and revenue information. The variables are a mix of numerical, categorical, and date-based data, which makes the dataset suitable for both classification and regression analysis.
# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.3 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
# Create dataset table
dataset_table <- data.frame(
Variable = c("order_id", "customer_id", "product_category", "product_price",
"quantity", "order_date", "region", "payment_method",
"delivery_days", "is_returned", "customer_rating",
"discount_percent", "revenue"),
Description = c("Unique order identification number",
"Unique customer identification code",
"Category of the product purchased",
"Price of the product",
"Number of units purchased",
"Date of the order",
"Customer region",
"Payment method used",
"Number of delivery days",
"Return status: 1 = returned, 0 = not returned",
"Customer rating score",
"Discount percentage applied",
"Total revenue generated from the order")
)
# Display table
kable(dataset_table, caption = "Dataset Variable Description") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE
)
| Variable | Description |
|---|---|
| order_id | Unique order identification number |
| customer_id | Unique customer identification code |
| product_category | Category of the product purchased |
| product_price | Price of the product |
| quantity | Number of units purchased |
| order_date | Date of the order |
| region | Customer region |
| payment_method | Payment method used |
| delivery_days | Number of delivery days |
| is_returned | Return status: 1 = returned, 0 = not returned |
| customer_rating | Customer rating score |
| discount_percent | Discount percentage applied |
| revenue | Total revenue generated from the order |
This project has two main research objectives. The first objective is to predict product return behavior using classification techniques. The second objective is to predict product sales quantity using regression techniques.
Can we predict whether an e-commerce order will be returned based on customer, product, payment, delivery, rating, and discount features?
The objective of this question is to identify return behavior and understand the main factors that may influence product returns. This can help e-commerce companies detect high-risk orders earlier, reduce return-related costs, and improve customer satisfaction.
Can we predict the quantity of products sold based on product, customer, price, discount, region, payment, and delivery features?
The objective of this question is to predict sales quantity and identify possible growth drivers. This can help businesses improve inventory planning, manage stock more efficiently, and support better sales strategies.
The purpose of the data cleaning stage is to prepare a reliable processed dataset for further exploratory analysis and modelling. In this stage, we handled missing values, corrected data types, identified abnormal records, checked revenue consistency, and removed duplicate records.
This cleaned dataset supports two main objectives: a classification
problem for return behavior warning using is_returned as
the target variable, and a regression problem for sales quantity
prediction using quantity as the target variable.
library(tidyverse)
library(lubridate)
library(skimr)
library(corrplot)
library(ggplot2)
library(dplyr)
file_path <- "synthetic_ecommerce_sales_2025.csv"
df_raw <- read.csv(file_path, stringsAsFactors = FALSE)
glimpse(df_raw)
## Rows: 100,000
## Columns: 13
## $ order_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <chr> "Beauty", "Fashion", "Beauty", "Electronics", "Fashio…
## $ product_price <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date <chr> "2023-02-21", "2023-10-13", "2023-06-28", "2023-07-11…
## $ region <chr> "Europe", "North America", "Oceania", "Europe", "Afri…
## $ payment_method <chr> "BankTransfer", "CreditCard", "Cash", "PayPal", "PayP…
## $ delivery_days <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <int> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
skim(df_raw)
| Name | df_raw |
| Number of rows | 100000 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| customer_id | 0 | 1 | 36 | 36 | 0 | 100000 | 0 |
| product_category | 0 | 1 | 4 | 11 | 0 | 7 | 0 |
| order_date | 0 | 1 | 10 | 10 | 0 | 1096 | 0 |
| region | 0 | 1 | 4 | 13 | 0 | 6 | 0 |
| payment_method | 0 | 1 | 4 | 12 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 0 | 1 | 50000.50 | 28867.66 | 1.00 | 25000.75 | 50000.50 | 75000.25 | 100000.00 | ▇▇▇▇▇ |
| product_price | 0 | 1 | 250.96 | 141.74 | 4.52 | 128.39 | 251.43 | 372.27 | 500.00 | ▇▇▇▇▇ |
| quantity | 0 | 1 | 3.09 | 1.44 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▅▅▁ |
| delivery_days | 0 | 1 | 4.98 | 2.58 | 1.00 | 3.00 | 5.00 | 7.00 | 9.00 | ▇▇▃▇▇ |
| is_returned | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| customer_rating | 0 | 1 | 3.50 | 0.87 | 2.00 | 2.80 | 3.50 | 4.20 | 5.00 | ▇▇▇▇▇ |
| discount_percent | 0 | 1 | 5.01 | 6.14 | 0.00 | 0.00 | 0.00 | 10.00 | 20.00 | ▇▃▂▂▁ |
| revenue | 0 | 1 | 734.15 | 571.19 | 4.26 | 274.11 | 585.55 | 1089.86 | 2699.14 | ▇▅▂▁▁ |
missing_summary <- colSums(is.na(df_raw))
missing_summary
## order_id customer_id product_category product_price
## 0 0 0 0
## quantity order_date region payment_method
## 0 0 0 0
## delivery_days is_returned customer_rating discount_percent
## 0 0 0 0
## revenue
## 0
This step counts the missing values in each column and identifies
columns in which more than 50% of the values are missing. Columns with
a high proportion of missing values may contain incomplete information
and could negatively affect subsequent analysis and model development.
Any such columns are stored in high_na and removed from the dataset.
If no such columns are found, the original dataset is retained.
high_na <- names(missing_summary[missing_summary > 0.5 * nrow(df_raw)])
if(length(high_na) > 0) {
df_clean <- df_raw %>% select(-all_of(high_na))
} else {
df_clean <- df_raw
}
high_na
## character(0)
The following columns are converted into suitable data types:

- order_date as Date
- is_returned, product_category, region, and payment_method as factor variables
- quantity and delivery_days as integer variables
- product_price, discount_percent, customer_rating, and revenue (if available) as numeric variables

df_clean <- df_clean %>%
mutate(
order_date = as.Date(order_date, format = "%Y-%m-%d"),
is_returned = as.factor(is_returned),
product_category = as.factor(product_category),
region = as.factor(region),
payment_method = as.factor(payment_method),
quantity = as.integer(quantity),
product_price = as.numeric(product_price),
discount_percent = as.numeric(discount_percent),
delivery_days = as.integer(delivery_days),
customer_rating = as.numeric(customer_rating)
)
if("revenue" %in% names(df_clean)) {
df_clean$revenue <- as.numeric(df_clean$revenue)
}
glimpse(df_clean)
## Rows: 100,000
## Columns: 13
## $ order_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <fct> Beauty, Fashion, Beauty, Electronics, Fashion, Beauty…
## $ product_price <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date <date> 2023-02-21, 2023-10-13, 2023-06-28, 2023-07-11, 2023…
## $ region <fct> Europe, North America, Oceania, Europe, Africa, Ocean…
## $ payment_method <fct> BankTransfer, CreditCard, Cash, PayPal, PayPal, PayPa…
## $ delivery_days <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <dbl> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
For numerical variables, missing values are replaced with the median. For factor variables, missing values are replaced with the mode.
# Fill numerical missing values using median
numeric_cols <- df_clean %>% select(where(is.numeric)) %>% names()
for(col in numeric_cols) {
if(any(is.na(df_clean[[col]]))) {
df_clean[[col]][is.na(df_clean[[col]])] <- median(df_clean[[col]], na.rm = TRUE)
}
}
# Fill factor missing values using mode
factor_cols <- df_clean %>% select(where(is.factor)) %>% names()
for(col in factor_cols) {
if(any(is.na(df_clean[[col]]))) {
mode_val <- names(sort(table(df_clean[[col]]), decreasing = TRUE))[1]
df_clean[[col]][is.na(df_clean[[col]])] <- mode_val
}
}
colSums(is.na(df_clean))
## order_id customer_id product_category product_price
## 0 0 0 0
## quantity order_date region payment_method
## 0 0 0 0
## delivery_days is_returned customer_rating discount_percent
## 0 0 0 0
## revenue
## 0
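Because the project dataset has no missing values, the loops above never change any data. As a small illustration of what the two imputation rules do when gaps exist, the sketch below applies the same median and mode logic to two made-up vectors. The names toy_price and toy_region are hypothetical and are not part of the dataset.

```r
# Hypothetical toy vectors; the real dataset has no missing values.
toy_price  <- c(2, NA, 4, 10)
toy_region <- factor(c("Asia", "Europe", NA, "Europe"))

# Numeric column: replace NA with the median of the observed values (4 here).
toy_price[is.na(toy_price)] <- median(toy_price, na.rm = TRUE)

# Factor column: replace NA with the most frequent level ("Europe" here).
mode_val <- names(sort(table(toy_region), decreasing = TRUE))[1]
toy_region[is.na(toy_region)] <- mode_val

toy_price
toy_region
```

The mode is taken from table(), which ignores NA by default, so the missing entry does not compete with the observed levels.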
Outliers are flagged instead of immediately removed, because abnormal values may still provide useful business information. The following rules are used:
- quantity > 100 or quantity <= 0
- product_price > 10000 or product_price <= 0
- delivery_days > 30 or delivery_days < 0
- discount_percent > 100 or discount_percent < 0
- customer_rating > 5 or customer_rating < 1

df_clean <- df_clean %>%
mutate(
is_quantity_outlier = if_else(quantity > 100 | quantity <= 0, 1, 0),
is_price_outlier = if_else(product_price > 10000 | product_price <= 0, 1, 0),
is_delivery_outlier = if_else(delivery_days > 30 | delivery_days < 0, 1, 0),
is_discount_outlier = if_else(discount_percent > 100 | discount_percent < 0, 1, 0),
is_rating_outlier = if_else(customer_rating > 5 | customer_rating < 1, 1, 0),
is_any_outlier = pmax(
is_quantity_outlier,
is_price_outlier,
is_delivery_outlier,
is_discount_outlier,
is_rating_outlier
)
)
outlier_summary <- df_clean %>%
summarise(
quantity_outliers = sum(is_quantity_outlier),
price_outliers = sum(is_price_outlier),
delivery_outliers = sum(is_delivery_outlier),
discount_outliers = sum(is_discount_outlier),
rating_outliers = sum(is_rating_outlier),
any_outliers = sum(is_any_outlier)
)
outlier_summary
## quantity_outliers price_outliers delivery_outliers discount_outliers
## 1 0 0 0 0
## rating_outliers any_outliers
## 1 0 0
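Since no record in the dataset triggers any rule, the flag logic is easier to see on a tiny made-up example. The sketch below uses a hypothetical data frame called toy with two of the rules and shows how pmax() combines the individual flags into is_any_outlier. It uses base ifelse() instead of if_else() only to stay free of package dependencies.

```r
# Hypothetical orders: the second has quantity 0, the third a rating above 5.
toy <- data.frame(
  quantity        = c(3, 0, 2),
  customer_rating = c(4.5, 3.0, 5.5)
)

# Same rule style as above: 1 = flagged, 0 = within the accepted range.
toy$is_quantity_outlier <- ifelse(toy$quantity > 100 | toy$quantity <= 0, 1, 0)
toy$is_rating_outlier   <- ifelse(toy$customer_rating > 5 | toy$customer_rating < 1, 1, 0)

# pmax() takes the element-wise maximum, so a row is flagged if any rule fires.
toy$is_any_outlier <- pmax(toy$is_quantity_outlier, toy$is_rating_outlier)
toy
```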
If the dataset contains a revenue column, revenue is
recalculated using:
\[ \text{Revenue} = \text{Product Price} \times \text{Quantity} \times \left(1 - \frac{\text{Discount Percent}}{100}\right) \]
Rows are kept when the difference between original revenue and calculated revenue is less than 0.01.
if("revenue" %in% names(df_clean)) {
rows_before_revenue_check <- nrow(df_clean)
df_clean <- df_clean %>%
mutate(revenue_calc = product_price * quantity * (1 - discount_percent / 100)) %>%
filter(abs(revenue - revenue_calc) < 0.01) %>%
select(-revenue_calc)
rows_after_revenue_check <- nrow(df_clean)
revenue_removed <- rows_before_revenue_check - rows_after_revenue_check
revenue_removed
}
## [1] 0
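The revenue check can be reproduced by hand on the first five orders shown in the glimpse(df_raw) output. The sketch below is only an illustration on those five rows. The data frame name first_rows is made up for this example.

```r
# First five rows as shown in the glimpse(df_raw) output.
first_rows <- data.frame(
  product_price    = c(190.40, 82.22, 15.19, 310.65, 74.05),
  quantity         = c(5, 3, 2, 2, 4),
  discount_percent = c(0, 0, 10, 5, 20),
  revenue          = c(952.00, 246.66, 27.34, 590.23, 236.96)
)

# Recalculate revenue with the formula above and compare with the recorded value.
first_rows$revenue_calc <- with(
  first_rows,
  product_price * quantity * (1 - discount_percent / 100)
)
first_rows$abs_diff <- abs(first_rows$revenue - first_rows$revenue_calc)

# A row passes the check when the difference is below the 0.01 tolerance.
first_rows$kept <- first_rows$abs_diff < 0.01
first_rows
```

For the third order, for instance, the formula gives 15.19 × 2 × 0.90 = 27.342, while the recorded revenue is 27.34. The two values differ only in the third decimal place, which is why the comparison uses a tolerance instead of exact equality.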
Check and remove completely duplicated records in the dataset to ensure data quality.
rows_before_duplicates <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
rows_after_duplicates <- nrow(df_clean)
duplicate_rows_removed <- rows_before_duplicates - rows_after_duplicates
duplicate_rows_removed
## [1] 0
output_path <- "ecommerce_cleaned.csv"
write.csv(df_clean, output_path, row.names = FALSE)
cat("Cleaned data saved to:", output_path, "\n")
## Cleaned data saved to: ecommerce_cleaned.csv
cat("Cleaned dataset dimensions:", nrow(df_clean), "rows and", ncol(df_clean), "columns\n")
## Cleaned dataset dimensions: 100000 rows and 19 columns
dim(df_clean)
## [1] 100000 19
summary(df_clean)
## order_id customer_id product_category product_price
## Min. : 1 Length :100000 Automotive :14239 Min. : 4.518
## 1st Qu.: 25001 N.unique :100000 Beauty :14234 1st Qu.:128.387
## Median : 50000 N.blank : 0 Electronics:14375 Median :251.430
## Mean : 50000 Min.nchar: 36 Fashion :14327 Mean :250.963
## 3rd Qu.: 75000 Max.nchar: 36 Home :14182 3rd Qu.:372.270
## Max. :100000 Sports :14354 Max. :500.000
## Toys :14289
## quantity order_date region
## Min. :1.000 Min. :2023-01-01 Africa :16450
## 1st Qu.:2.000 1st Qu.:2023-10-02 Asia :16763
## Median :3.000 Median :2024-07-05 Europe :16513
## Mean :3.085 Mean :2024-07-02 North America:16749
## 3rd Qu.:4.000 3rd Qu.:2025-04-04 Oceania :16965
## Max. :6.000 Max. :2025-12-31 South America:16560
##
## payment_method delivery_days is_returned customer_rating
## BankTransfer:25083 Min. :1.000 0:93940 Min. :2.0
## Cash :24710 1st Qu.:3.000 1: 6060 1st Qu.:2.8
## CreditCard :25222 Median :5.000 Median :3.5
## PayPal :24985 Mean :4.985 Mean :3.5
## 3rd Qu.:7.000 3rd Qu.:4.2
## Max. :9.000 Max. :5.0
##
## discount_percent revenue is_quantity_outlier is_price_outlier
## Min. : 0.000 Min. : 4.26 Min. :0 Min. :0
## 1st Qu.: 0.000 1st Qu.: 274.11 1st Qu.:0 1st Qu.:0
## Median : 0.000 Median : 585.55 Median :0 Median :0
## Mean : 5.015 Mean : 734.15 Mean :0 Mean :0
## 3rd Qu.:10.000 3rd Qu.:1089.86 3rd Qu.:0 3rd Qu.:0
## Max. :20.000 Max. :2699.14 Max. :0 Max. :0
##
## is_delivery_outlier is_discount_outlier is_rating_outlier is_any_outlier
## Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
skim(df_clean)
| Name | df_clean |
| Number of rows | 100000 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| factor | 4 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| customer_id | 0 | 1 | 36 | 36 | 0 | 1e+05 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order_date | 0 | 1 | 2023-01-01 | 2025-12-31 | 2024-07-05 | 1096 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| product_category | 0 | 1 | FALSE | 7 | Ele: 14375, Spo: 14354, Fas: 14327, Toy: 14289 |
| region | 0 | 1 | FALSE | 6 | Oce: 16965, Asi: 16763, Nor: 16749, Sou: 16560 |
| payment_method | 0 | 1 | FALSE | 4 | Cre: 25222, Ban: 25083, Pay: 24985, Cas: 24710 |
| is_returned | 0 | 1 | FALSE | 2 | 0: 93940, 1: 6060 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 0 | 1 | 50000.50 | 28867.66 | 1.00 | 25000.75 | 50000.50 | 75000.25 | 100000.00 | ▇▇▇▇▇ |
| product_price | 0 | 1 | 250.96 | 141.74 | 4.52 | 128.39 | 251.43 | 372.27 | 500.00 | ▇▇▇▇▇ |
| quantity | 0 | 1 | 3.09 | 1.44 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▅▅▁ |
| delivery_days | 0 | 1 | 4.98 | 2.58 | 1.00 | 3.00 | 5.00 | 7.00 | 9.00 | ▇▇▃▇▇ |
| customer_rating | 0 | 1 | 3.50 | 0.87 | 2.00 | 2.80 | 3.50 | 4.20 | 5.00 | ▇▇▇▇▇ |
| discount_percent | 0 | 1 | 5.01 | 6.14 | 0.00 | 0.00 | 0.00 | 10.00 | 20.00 | ▇▃▂▂▁ |
| revenue | 0 | 1 | 734.15 | 571.19 | 4.26 | 274.11 | 585.55 | 1089.86 | 2699.14 | ▇▅▂▁▁ |
| is_quantity_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_price_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_delivery_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_discount_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_rating_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_any_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
The cleaned dataset summary provides an overview of the dataset after the data cleaning process. It shows the final number of rows and columns, the basic statistical summary of each variable, and the overall data structure. This step is important because it confirms that the dataset is ready for further exploratory analysis and future model development.
ggplot(df_clean, aes(x = is_returned, fill = is_returned)) +
geom_bar() +
labs(
title = "Distribution of Returns",
x = "Returned",
y = "Count"
) +
theme_minimal()
The return distribution shows the frequency of returned and
non-returned orders. This is important for the classification objective
because is_returned is the target variable for the return
behavior warning model. In this dataset 6,060 of the 100,000 orders
(6.06%) are returned, so the classes are clearly imbalanced. This should
be considered during model development because an imbalanced dataset may
cause the model to predict the majority class more often and perform
poorly in detecting returned orders.
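To make the imbalance concrete, the sketch below computes the class shares from the counts reported in the summary(df_clean) output (93,940 non-returned and 6,060 returned orders) and the accuracy of a trivial model that always predicts "not returned". The variable names are illustrative only.

```r
# Class counts taken from the summary(df_clean) output above.
class_counts <- c(not_returned = 93940, returned = 6060)
class_share  <- class_counts / sum(class_counts)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy <- max(class_share)

round(class_share, 4)
baseline_accuracy
```

Such a model would be correct for about 93.9% of orders while detecting no returned order at all, so accuracy on its own is a weak yardstick here. Measures that focus on the returned class, such as recall or precision, are likely to be more informative.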
ggplot(df_clean, aes(x = quantity)) +
# quantity takes only integer values from 1 to 6, so count each value directly
geom_bar(fill = "steelblue", alpha = 0.7) +
labs(
title = "Distribution of Sales Quantity",
x = "Quantity",
y = "Frequency"
) +
theme_minimal()
This chart shows the distribution of the regression target variable,
quantity. It helps identify the common range of sales
quantity per order and whether the quantity values are concentrated,
skewed, or contain unusual patterns. Understanding this distribution is
important for the regression objective because the shape of the target
variable can affect the performance of sales quantity prediction
models.
ggplot(df_clean, aes(x = product_price)) +
geom_histogram(bins = 40, fill = "darkgreen", alpha = 0.7) +
scale_x_log10() +
labs(
title = "Product Price Distribution (Log Scale)",
x = "Product Price (Log Scale)",
y = "Frequency"
) +
theme_minimal()
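A log scale is often used for prices because they tend to be right-skewed, with many low-to-medium priced products and fewer high-priced products. In this dataset, however, the summary statistics show a mean of 250.96, a median of 251.43, and a range of 4.52 to 500.00, which points to a roughly symmetric distribution, so the log axis mainly stretches the low-price end and the plot should be read with that in mind. Since product price may influence both sales quantity and return behavior, understanding its distribution is useful for later regression and classification analysis. If a strongly skewed predictor were found, a transformation or a model that is robust to skew could be considered in future modelling.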
ggplot(df_clean, aes(x = is_returned, y = delivery_days, fill = is_returned)) +
geom_boxplot() +
labs(
title = "Delivery Days by Return Status",
x = "Returned",
y = "Delivery Days"
) +
theme_minimal()
This boxplot compares delivery days between returned and non-returned
orders. It helps evaluate whether longer delivery time is associated
with a higher probability of return. If returned orders tend to have
longer delivery days, this may suggest that delivery delays increase
customer dissatisfaction and return risk. Therefore,
delivery_days may be a useful predictor for the return
behavior warning classification model.
df_clean %>%
mutate(
discount_group = cut(
discount_percent,
breaks = c(-Inf, 5, 15, 30, Inf),
labels = c("0-5%", "5-15%", "15-30%", "30%+"),
right = TRUE
)
) %>%
ggplot(aes(x = discount_group, fill = is_returned)) +
geom_bar(position = "fill") +
labs(
title = "Return Proportion by Discount Level",
x = "Discount Level",
y = "Proportion"
) +
theme_minimal()
This visualisation compares the return proportion across different discount levels. It helps determine whether high-discount orders have a different return pattern compared with low-discount orders. If high-discount orders show a higher return proportion, this may indicate that aggressive discounts encourage impulse purchases or uncertain buying decisions, which could increase return risk. This insight is useful for both return prediction and promotion strategy planning.
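The grouping depends on how cut() treats the break points. With right = TRUE each interval includes its upper bound, so a 5% discount falls into the "0-5%" group. The sketch below applies the same breaks to the discount values visible in the glimpse() preview (0, 5, 10, and 20). Whether other values occur in the full data is not shown in the output above.

```r
# Discount values visible in the glimpse() preview of the dataset.
discount_values <- c(0, 5, 10, 20)

# Same breaks and labels as in the plot code; intervals are closed on the right.
discount_group <- cut(
  discount_values,
  breaks = c(-Inf, 5, 15, 30, Inf),
  labels = c("0-5%", "5-15%", "15-30%", "30%+"),
  right = TRUE
)

data.frame(discount_values, discount_group)
table(discount_group)
```

Because the maximum discount in the dataset is 20%, the "30%+" level stays empty and no bar for it is expected in the chart.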
ggplot(df_clean, aes(x = customer_rating, fill = is_returned)) +
geom_density(alpha = 0.5) +
labs(
title = "Customer Rating Density by Return Status",
x = "Customer Rating",
y = "Density"
) +
theme_minimal()
This density plot compares customer rating patterns between returned
and non-returned orders. Customer rating can reflect customer
satisfaction with product quality, delivery experience, or expectation
fulfilment. If returned orders are more concentrated at lower rating
values, this suggests that low customer satisfaction may be related to
return behavior. Therefore, customer_rating may be an
important feature for identifying return risk.
ggplot(df_clean, aes(x = product_price, y = quantity)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Product Price vs Sales Quantity",
x = "Product Price",
y = "Quantity"
) +
theme_minimal()
This scatter plot examines whether product price has a linear
relationship with sales quantity. A negative trend would suggest that
higher-priced products tend to be purchased in smaller quantities, while
a weak relationship would indicate that product price alone may not
strongly explain sales quantity. This analysis is useful for the
regression objective because product_price may be
considered as one of the predictors for sales quantity prediction.
region_qty <- df_clean %>%
group_by(region) %>%
summarise(avg_quantity = mean(quantity), .groups = "drop")
region_qty
## # A tibble: 6 × 2
## region avg_quantity
## <fct> <dbl>
## 1 Africa 3.10
## 2 Asia 3.10
## 3 Europe 3.08
## 4 North America 3.08
## 5 Oceania 3.08
## 6 South America 3.08
ggplot(region_qty, aes(x = reorder(region, avg_quantity), y = avg_quantity, fill = region)) +
geom_col() +
coord_flip() +
labs(
title = "Average Sales Quantity by Region",
x = "Region",
y = "Average Quantity"
) +
theme_minimal()
This analysis compares average sales quantity across regions. In this
dataset the averages are almost identical, ranging from 3.08 to 3.10
units per order, so region on its own appears to explain very little of
the variation in quantity. It is still kept as a candidate predictor for
the regression objective because it may matter in combination with other
features. A regional breakdown of this kind remains useful for stock
planning and marketing strategies when real differences exist.
ggplot(df_clean, aes(x = discount_percent, y = quantity, color = is_returned)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE, method = "lm") +
labs(
title = "Discount vs Quantity by Return Status",
x = "Discount Percent",
y = "Quantity"
) +
theme_minimal()
This plot explores whether discount percentage is associated with higher sales quantity and whether the relationship differs between returned and non-returned orders. If higher discounts increase quantity but also appear more often among returned orders, the business needs to balance sales growth with return risk. This analysis connects both project objectives because discount percentage may influence sales quantity prediction and return behavior classification.
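# Keep informative numeric columns only: order_id is an identifier, and the
# outlier flags are all zero here, so their correlations would be undefined (NA).
numeric_vars <- df_clean %>%
  select(where(is.numeric)) %>%
  select(-order_id, -ends_with("_outlier"))
cor_matrix <- cor(numeric_vars, use = "complete.obs")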
corrplot(
cor_matrix,
method = "number",
type = "upper",
tl.cex = 0.8,
number.cex = 0.7
)
The correlation matrix provides an overview of linear relationships
among numerical variables. It is useful for identifying highly related
variables and possible predictors for regression modelling. For example,
variables that have stronger relationships with quantity
may be useful for sales quantity prediction. However, correlation only
measures linear relationships and does not imply causation, so further
modelling and evaluation are still required.
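The limitation to linear relationships can be shown with a small constructed example, which does not use the project data. Here y depends entirely on x, yet the Pearson correlation is essentially zero because the relationship is symmetric and not linear.

```r
# Constructed example: y is fully determined by x, but not linearly.
x <- seq(-3, 3, by = 0.5)
y <- x^2

# Pearson correlation, the default method of cor().
r <- cor(x, y)
r
```

A near-zero entry in the correlation matrix therefore does not rule out a relationship. It only indicates that no linear trend is visible.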
monthly_trend <- df_clean %>%
group_by(month = floor_date(order_date, "month")) %>%
summarise(
total_quantity = sum(quantity),
return_rate = mean(as.numeric(is_returned) - 1, na.rm = TRUE),
.groups = "drop"
)
monthly_trend
## # A tibble: 36 × 3
## month total_quantity return_rate
## <date> <int> <dbl>
## 1 2023-01-01 8615 0.0548
## 2 2023-02-01 7410 0.0604
## 3 2023-03-01 8398 0.0561
## 4 2023-04-01 8013 0.0636
## 5 2023-05-01 8532 0.0610
## 6 2023-06-01 8395 0.0618
## 7 2023-07-01 8480 0.0623
## 8 2023-08-01 8694 0.0585
## 9 2023-09-01 8190 0.0545
## 10 2023-10-01 8617 0.0600
## # ℹ 26 more rows
monthly_trend %>%
pivot_longer(
cols = c(total_quantity, return_rate),
names_to = "metric",
values_to = "value"
) %>%
ggplot(aes(x = month, y = value)) +
geom_line(color = "steelblue") +
facet_wrap(~metric, scales = "free_y") +
labs(
title = "Monthly Trends: Total Quantity and Return Rate",
x = "Month",
y = "Value"
) +
theme_minimal()
The time trend analysis helps identify seasonal patterns in total sales quantity and whether return rates fluctuate across months. This is useful for detecting peak sales periods and monitoring whether return rates increase during high-demand months. For the regression objective, month-based trends may support sales quantity prediction. For the classification objective, monthly return rate patterns may help identify periods with higher return risk.
category_return <- df_clean %>%
group_by(product_category) %>%
summarise(
return_rate = mean(as.integer(is_returned) - 1, na.rm = TRUE),
total_sales = sum(quantity * product_price * (1 - discount_percent / 100), na.rm = TRUE),
avg_quantity = mean(quantity, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(return_rate))
category_return
## # A tibble: 7 × 4
## product_category return_rate total_sales avg_quantity
## <fct> <dbl> <dbl> <dbl>
## 1 Fashion 0.122 10497301. 3.07
## 2 Automotive 0.0527 10575483. 3.11
## 3 Electronics 0.0513 10488699. 3.09
## 4 Home 0.0511 10369461. 3.09
## 5 Sports 0.0505 10557716. 3.07
## 6 Toys 0.0490 10553990. 3.09
## 7 Beauty 0.0478 10372151. 3.08
ggplot(category_return, aes(x = reorder(product_category, return_rate), y = return_rate, fill = product_category)) +
geom_col() +
coord_flip() +
labs(
title = "Return Rate by Product Category",
x = "Product Category",
y = "Return Rate"
) +
theme_minimal()
This category-level analysis compares return rate, total sales, and
average quantity across product categories. Fashion stands out with a
return rate of about 12.2%, while the other six categories fall between
roughly 4.8% and 5.3%. A high-return category like this may require
further investigation, such as reviewing product quality, product
descriptions, sizing information, or customer expectations. Average
quantity, by contrast, varies only from 3.07 to 3.11 across categories.
product_category is therefore a promising feature for return behavior
classification, while its value for sales quantity prediction is less
clear from these summaries.
overall_return_rate <- mean(as.numeric(df_clean$is_returned) - 1, na.rm = TRUE) * 100
avg_quantity <- mean(df_clean$quantity, na.rm = TRUE)
top_3_high_return_categories <- head(category_return %>% arrange(desc(return_rate)), 3)
best_region <- region_qty$region[which.max(region_qty$avg_quantity)]
delivery_return_correlation <- cor(
as.numeric(df_clean$delivery_days),
as.numeric(df_clean$is_returned) - 1,
use = "complete.obs"
)
cat("Overall return rate:", round(overall_return_rate, 2), "%\n")
## Overall return rate: 6.06 %
cat("Average quantity per order:", round(avg_quantity, 2), "\n")
## Average quantity per order: 3.09
cat("Correlation between delivery days and return:", round(delivery_return_correlation, 4), "\n")
## Correlation between delivery days and return: -0.0058
cat("Best region for average sales quantity:", as.character(best_region), "\n")
## Best region for average sales quantity: Africa
cat("\nTop 3 high-return categories:\n")
##
## Top 3 high-return categories:
top_3_high_return_categories
## # A tibble: 3 × 4
## product_category return_rate total_sales avg_quantity
## <fct> <dbl> <dbl> <dbl>
## 1 Fashion 0.122 10497301. 3.07
## 2 Automotive 0.0527 10575483. 3.11
## 3 Electronics 0.0513 10488699. 3.09
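Based on the EDA, some variables look more useful than others for the two prediction objectives. For return behavior classification, product category shows the clearest signal, with Fashion returned at about 12.2% against roughly 5% for the other categories. Delivery days show almost no linear association with returns (correlation of -0.0058), although delivery days, discount level, and customer rating are retained as candidate features because non-linear or combined effects are still possible. For sales quantity prediction, product price, discount percentage, region, month, and product category are candidate predictors, but the regional and category averages of quantity are nearly identical (about 3.1 units per order), so strong predictive performance should not be assumed. These findings provide a starting point for feature selection and model development in the next stage.
Overall, the data cleaning and exploratory analysis process turned the raw e-commerce sales data into a checked dataset for further analysis. The cleaning checks found no missing values, no records flagged by the outlier rules, no revenue inconsistencies, and no duplicate records, so all 100,000 rows were retained. The main changes were the corrected data types and the six added outlier flag columns, which bring the dataset to 19 columns.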
The EDA results helped us better understand the main patterns in the
dataset, especially in relation to return behavior and sales quantity.
These findings support the two project objectives: using
is_returned for return behavior classification and using
quantity for sales quantity prediction. Therefore, the
cleaned dataset is ready to be used in the next stage for model
training, evaluation, and business interpretation.
Research on E-commerce Sales Quantity Prediction based on Multi-dimensional Features
This model focuses on predicting sales quantity to find growth drivers that can help businesses boost sales and manage stock efficiently. In e-commerce operations, accurate quantity prediction is important because it supports inventory planning, demand forecasting, promotion strategy, and regional stock allocation. By using multi-dimensional features such as product category, product price, discount percentage, region, payment method, delivery days, customer rating, and order time, the model can estimate expected order quantity and identify factors that may influence sales growth.
The regression task uses quantity as the target
variable. The objective is to predict how many units are likely to be
sold for each order based on product, customer, pricing, delivery,
rating, payment, regional, and time-related features.
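Unlike the classification model, which predicts return status, this regression model predicts a numerical value: the number of units in an order, which ranges from 1 to 6 in this dataset and is treated as continuous. The business value of this model is to help e-commerce companies with inventory planning, demand forecasting, promotion strategy, and regional stock allocation.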
The target variable is quantity. The predictors are
selected from available multi-dimensional transaction features.
Variables such as order_id, customer_id, and
revenue are excluded because they are either identifiers or
directly related to the target variable. Excluding revenue
is important because revenue is calculated using quantity, so including
it would cause data leakage.
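To illustrate why revenue is excluded, here is a minimal sketch on simulated data. It assumes, hypothetically, that revenue equals quantity x price x (1 - discount/100); the dataset's actual formula may differ, and the `toy` data frame is made up for illustration. Under that assumption, quantity can be back-calculated exactly from revenue, price, and discount, so a model given revenue would look accurate without learning anything useful.

```r
# Simulated orders (hypothetical values, not the project data)
set.seed(1)
n <- 1000
toy <- data.frame(
  product_price    = runif(n, 5, 500),
  discount_percent = sample(c(0, 5, 10, 20), n, replace = TRUE),
  quantity         = sample(1:6, n, replace = TRUE)
)

# ASSUMPTION: revenue formula chosen for illustration only
toy$revenue <- toy$quantity * toy$product_price * (1 - toy$discount_percent / 100)

# Without revenue, price and discount explain almost none of the variation
fit_without <- lm(quantity ~ product_price + discount_percent, data = toy)
r2_without  <- summary(fit_without)$r.squared

# With revenue, quantity can be back-calculated directly: this is the leak
recovered  <- round(toy$revenue / (toy$product_price * (1 - toy$discount_percent / 100)))
leak_exact <- all(recovered == toy$quantity)

cat("R-squared without revenue:", round(r2_without, 4), "\n")
cat("Quantity recovered exactly from revenue:", leak_exact, "\n")
```

The same reasoning applies to any column derived from the target: it should be dropped before modelling.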
library(tidyverse)
library(lubridate)
library(caret)
library(randomForest)
df_reg <- df_clean %>%
select(
product_category,
product_price,
order_date,
region,
payment_method,
delivery_days,
customer_rating,
discount_percent,
quantity
) %>%
mutate(
order_date = as.Date(order_date),
order_month = factor(month(order_date)),
order_weekday = factor(weekdays(order_date)),
product_category = factor(product_category),
region = factor(region),
payment_method = factor(payment_method),
quantity = as.numeric(quantity)
) %>%
select(-order_date)
str(df_reg)
## 'data.frame': 100000 obs. of 10 variables:
## $ product_category: Factor w/ 7 levels "Automotive","Beauty",..: 2 4 2 3 4 2 6 1 5 3 ...
## $ product_price : num 190.4 82.2 15.2 310.6 74 ...
## $ region : Factor w/ 6 levels "Africa","Asia",..: 3 4 5 3 1 5 1 4 4 3 ...
## $ payment_method : Factor w/ 4 levels "BankTransfer",..: 1 3 2 4 4 4 3 1 1 3 ...
## $ delivery_days : int 8 5 6 9 3 5 5 3 6 4 ...
## $ customer_rating : num 3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
## $ discount_percent: num 0 0 10 5 20 5 0 0 0 0 ...
## $ quantity : num 5 3 2 2 4 5 2 4 2 2 ...
## $ order_month : Factor w/ 12 levels "1","2","3","4",..: 2 10 6 7 2 5 7 5 3 2 ...
## $ order_weekday : Factor w/ 7 levels "Friday","Monday",..: 6 1 7 6 1 1 6 5 3 2 ...
summary(df_reg$quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 3.085 4.000 6.000
The dataset is divided into a training set and a testing set. The training set is used to build the models, while the testing set is used to evaluate how well the models predict unseen data.
set.seed(123)
reg_train_index <- createDataPartition(df_reg$quantity, p = 0.8, list = FALSE)
reg_train <- df_reg[reg_train_index, ]
reg_test <- df_reg[-reg_train_index, ]
dim(reg_train)
## [1] 80002 10
dim(reg_test)
## [1] 19998 10
Multiple Linear Regression is used as the baseline regression model. It is easy to interpret and helps explain the direction and strength of relationships between predictors and sales quantity.
lm_model <- lm(quantity ~ ., data = reg_train)
summary(lm_model)
##
## Call:
## lm(formula = quantity ~ ., data = reg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08544 -1.01899 -0.00285 1.01288 2.08151
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.058e+00 3.720e-02 82.199 <2e-16 ***
## product_categoryBeauty -1.966e-02 1.871e-02 -1.051 0.2933
## product_categoryElectronics -2.493e-02 1.866e-02 -1.336 0.1816
## product_categoryFashion -3.915e-02 1.865e-02 -2.100 0.0358 *
## product_categoryHome -1.038e-02 1.866e-02 -0.556 0.5782
## product_categorySports -3.254e-02 1.863e-02 -1.746 0.0807 .
## product_categoryToys -3.291e-02 1.869e-02 -1.760 0.0784 .
## product_price 3.056e-05 3.524e-05 0.867 0.3858
## regionAsia 5.502e-03 1.730e-02 0.318 0.7504
## regionEurope -1.392e-02 1.736e-02 -0.802 0.4224
## regionNorth America -6.886e-03 1.729e-02 -0.398 0.6905
## regionOceania -1.979e-03 1.728e-02 -0.115 0.9088
## regionSouth America -1.447e-02 1.737e-02 -0.833 0.4049
## payment_methodCash 2.144e-02 1.412e-02 1.518 0.1290
## payment_methodCreditCard 6.722e-03 1.405e-02 0.478 0.6323
## payment_methodPayPal 4.163e-03 1.409e-02 0.295 0.7676
## delivery_days -1.002e-03 1.935e-03 -0.518 0.6046
## customer_rating -5.075e-03 5.747e-03 -0.883 0.3772
## discount_percent -1.044e-03 8.110e-04 -1.288 0.1978
## order_month2 -1.801e-02 2.489e-02 -0.724 0.4693
## order_month3 -1.379e-02 2.417e-02 -0.570 0.5684
## order_month4 -6.175e-03 2.446e-02 -0.252 0.8007
## order_month5 -8.263e-03 2.410e-02 -0.343 0.7316
## order_month6 -1.627e-02 2.434e-02 -0.668 0.5039
## order_month7 -1.700e-02 2.428e-02 -0.700 0.4838
## order_month8 -2.404e-03 2.417e-02 -0.099 0.9208
## order_month9 -3.633e-02 2.438e-02 -1.490 0.1362
## order_month10 -3.590e-02 2.420e-02 -1.484 0.1379
## order_month11 1.001e+00 2.438e-02 41.044 <2e-16 ***
## order_month12 -6.896e-03 2.421e-02 -0.285 0.7757
## order_weekdayMonday -1.518e-02 1.871e-02 -0.811 0.4173
## order_weekdaySaturday 1.384e-02 1.869e-02 0.741 0.4590
## order_weekdaySunday 7.496e-03 1.876e-02 0.400 0.6895
## order_weekdayThursday 5.666e-03 1.866e-02 0.304 0.7614
## order_weekdayTuesday -4.613e-03 1.866e-02 -0.247 0.8047
## order_weekdayWednesday -1.257e-02 1.867e-02 -0.673 0.5009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.41 on 79966 degrees of freedom
## Multiple R-squared: 0.03802, Adjusted R-squared: 0.0376
## F-statistic: 90.31 on 35 and 79966 DF, p-value: < 2.2e-16
lm_pred <- predict(lm_model, newdata = reg_test)
lm_results <- data.frame(
Actual = reg_test$quantity,
Predicted = lm_pred
)
lm_rmse <- RMSE(lm_results$Predicted, lm_results$Actual)
lm_mae <- MAE(lm_results$Predicted, lm_results$Actual)
lm_r2 <- R2(lm_results$Predicted, lm_results$Actual)
cat("Linear Regression RMSE:", round(lm_rmse, 4), "\n")
## Linear Regression RMSE: 1.4171
cat("Linear Regression MAE:", round(lm_mae, 4), "\n")
## Linear Regression MAE: 1.2062
cat("Linear Regression R-squared:", round(lm_r2, 4), "\n")
## Linear Regression R-squared: 0.0356
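Random Forest Regression is used as a more flexible machine learning model. It can capture non-linear relationships and interactions between predictors, which may be useful because e-commerce sales quantity can be influenced by complex combinations of price, discount, product category, region, and customer experience factors. To keep training time manageable, the forest is fitted with 100 trees on a random subsample of at most 20,000 training rows, so it sees less data than the linear model, which uses all 80,002 training rows.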
set.seed(123)
rf_reg_train <- reg_train %>%
slice_sample(n = min(20000, nrow(reg_train)))
rf_reg_model <- randomForest(
quantity ~ .,
data = rf_reg_train,
ntree = 100,
importance = TRUE
)
rf_reg_model
##
## Call:
## randomForest(formula = quantity ~ ., data = rf_reg_train, ntree = 100, importance = TRUE)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 2.067228
## % Var explained: 0.53
rf_reg_pred <- predict(rf_reg_model, newdata = reg_test)
rf_reg_results <- data.frame(
Actual = reg_test$quantity,
Predicted = rf_reg_pred
)
rf_reg_rmse <- RMSE(rf_reg_results$Predicted, rf_reg_results$Actual)
rf_reg_mae <- MAE(rf_reg_results$Predicted, rf_reg_results$Actual)
rf_reg_r2 <- R2(rf_reg_results$Predicted, rf_reg_results$Actual)
cat("Random Forest Regression RMSE:", round(rf_reg_rmse, 4), "\n")
## Random Forest Regression RMSE: 1.4317
cat("Random Forest Regression MAE:", round(rf_reg_mae, 4), "\n")
## Random Forest Regression MAE: 1.2346
cat("Random Forest Regression R-squared:", round(rf_reg_r2, 4), "\n")
## Random Forest Regression R-squared: 0.0236
The regression models are evaluated using RMSE, MAE, and R-squared. RMSE and MAE measure prediction error, where smaller values indicate better prediction accuracy. R-squared measures how much variation in sales quantity can be explained by the model, where a higher value indicates stronger explanatory power.
regression_results <- data.frame(
Model = c("Multiple Linear Regression", "Random Forest Regression"),
RMSE = c(lm_rmse, rf_reg_rmse),
MAE = c(lm_mae, rf_reg_mae),
R_squared = c(lm_r2, rf_reg_r2)
)
knitr::kable(
regression_results,
digits = 4,
caption = "Performance Comparison of Regression Models"
)
| Model | RMSE | MAE | R_squared |
|---|---|---|---|
| Multiple Linear Regression | 1.4171 | 1.2062 | 0.0356 |
| Random Forest Regression | 1.4317 | 1.2346 | 0.0236 |
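As a rough sketch of how the three metrics in the table are defined, the helper functions below (hypothetical names, base R only) compute them by hand on simulated data. Two points are worth keeping in mind: `caret::R2()` by default reports the squared correlation between predictions and observations, which can differ from the traditional 1 - SSE/SST definition, and a model is only useful if it beats the naive baseline of always predicting the mean.

```r
# Hand-rolled versions of the evaluation metrics (base R only)
rmse    <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae     <- function(pred, obs) mean(abs(pred - obs))
r2_corr <- function(pred, obs) cor(pred, obs)^2   # squared correlation
r2_trad <- function(pred, obs) {                  # 1 - SSE / SST
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Simulated integer quantities from 1 to 6 (not the project data)
set.seed(42)
obs        <- sample(1:6, 2000, replace = TRUE)
pred_model <- 3.5 + rnorm(2000, sd = 0.1)  # a "model" that carries no real signal
pred_mean  <- rep(mean(obs), 2000)         # naive baseline: always predict the mean

metrics <- data.frame(
  Predictor      = c("Weak model", "Mean baseline"),
  RMSE           = c(rmse(pred_model, obs), rmse(pred_mean, obs)),
  MAE            = c(mae(pred_model, obs), mae(pred_mean, obs)),
  R2_correlation = c(r2_corr(pred_model, obs), NA),  # undefined for a constant prediction
  R2_traditional = c(r2_trad(pred_model, obs), r2_trad(pred_mean, obs))
)
print(metrics)
```

In this sketch the mean baseline has a traditional R-squared of zero by construction, which gives a reference point for judging values such as those reported in the table.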
The following plots compare actual sales quantity with predicted sales quantity. Points closer to the diagonal line indicate more accurate predictions.
lm_plot_data <- lm_results %>%
mutate(Model = "Multiple Linear Regression")
rf_plot_data <- rf_reg_results %>%
mutate(Model = "Random Forest Regression")
bind_rows(lm_plot_data, rf_plot_data) %>%
ggplot(aes(x = Actual, y = Predicted)) +
geom_point(alpha = 0.25, color = "steelblue") +
geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
facet_wrap(~Model) +
labs(
title = "Actual vs Predicted Sales Quantity",
x = "Actual Quantity",
y = "Predicted Quantity"
) +
theme_minimal()
Feature importance from the Random Forest Regression model is used to identify the most influential growth drivers for sales quantity prediction. Variables with higher importance contribute more to reducing prediction error in the model.
varImpPlot(
rf_reg_model,
main = "Variable Importance for Sales Quantity Prediction"
)
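The idea behind permutation-based importance (the %IncMSE measure that `randomForest` reports when `importance = TRUE`) can be sketched in base R: shuffle one predictor at a time and see how much the prediction error grows. The example below is an illustration under simplifying assumptions: it uses simulated data and a linear model for brevity, and it measures error in-sample, whereas `randomForest` uses out-of-bag samples. The variables `is_november` and `noise` are made up.

```r
# Simulated data: one informative predictor and one pure-noise predictor
set.seed(7)
n <- 2000
toy <- data.frame(
  is_november = rbinom(n, 1, 1 / 12),
  noise       = rnorm(n)
)
toy$quantity <- 3 + 1 * toy$is_november + rnorm(n, sd = 1.4)

fit <- lm(quantity ~ ., data = toy)

# Mean squared error of a fitted model on a data frame
mse <- function(model, data) {
  mean((data$quantity - predict(model, newdata = data))^2)
}
base_mse <- mse(fit, toy)

# Permutation importance: increase in MSE after shuffling one column
perm_importance <- sapply(c("is_november", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - base_mse
})
print(round(perm_importance, 4))
```

A predictor whose shuffling barely changes the error contributes little to the model, whatever its position in a ranking plot.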
The regression models provide a data-driven approach for predicting e-commerce sales quantity using multi-dimensional features. Multiple Linear Regression acts as an interpretable baseline model, while Random Forest Regression captures more complex relationships among predictors.
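A Random Forest that achieved clearly lower RMSE and MAE than the linear model would suggest that sales quantity is influenced by non-linear relationships and feature interactions, for example a discount effect that differs across product categories or regions. That is not what the results show: on the test set the Random Forest was slightly worse than the linear model (RMSE 1.4317 vs 1.4171, MAE 1.2346 vs 1.2062). With the available features there is therefore no evidence that non-linear structure improves quantity prediction, although the forest was trained on a 20,000-row subsample, so the comparison is not fully like-for-like.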
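The feature importance results help identify potential sales growth drivers, but they should be read alongside the weak overall fit. In the linear model, the only strongly significant predictor is order month: November orders (order_month11, estimate 1.001, p < 2e-16) contain about one more unit on average than January orders, holding the other predictors fixed. Product category, product price, discount percentage, and region show little or no measurable effect, so decisions about which products to promote, where stock should be allocated, and how pricing or discount strategies may be adjusted would need additional evidence beyond these models.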
This section developed sales quantity prediction models using
regression techniques. The target variable was quantity,
and the predictors included product, price, discount, region, payment,
delivery, rating, and time-based features. The models were evaluated
using RMSE, MAE, and R-squared.
The regression analysis supports the research goal of predicting sales quantity to identify growth drivers and improve stock management. By understanding which features are most important for quantity prediction, e-commerce businesses can make better decisions about inventory planning, sales promotion, category strategy, and regional demand management.
This section focuses on the classification task of predicting whether an order will be returned in an e-commerce environment.
The target variable is is_returned, where “Yes”
indicates a returned order and “No” indicates a non-returned order.
# The required packages were installed before running the analysis.
library(tidyverse)
library(lubridate)
library(caret)
library(pROC)
library(randomForest)
# Build classification dataset from the cleaned dataset created in earlier sections
df_cls <- df_clean %>%
select(
product_category,
product_price,
quantity,
order_date,
region,
payment_method,
delivery_days,
customer_rating,
discount_percent,
is_returned
) %>%
mutate(
order_date = as.Date(order_date),
order_month = factor(month(order_date)),
order_weekday = factor(weekdays(order_date)),
is_returned = factor(ifelse(is_returned == 1, "Yes", "No"),
levels = c("No", "Yes")),
product_category = factor(product_category),
region = factor(region),
payment_method = factor(payment_method)
) %>%
select(-order_date)
str(df_cls)
## 'data.frame': 100000 obs. of 11 variables:
## $ product_category: Factor w/ 7 levels "Automotive","Beauty",..: 2 4 2 3 4 2 6 1 5 3 ...
## $ product_price : num 190.4 82.2 15.2 310.6 74 ...
## $ quantity : int 5 3 2 2 4 5 2 4 2 2 ...
## $ region : Factor w/ 6 levels "Africa","Asia",..: 3 4 5 3 1 5 1 4 4 3 ...
## $ payment_method : Factor w/ 4 levels "BankTransfer",..: 1 3 2 4 4 4 3 1 1 3 ...
## $ delivery_days : int 8 5 6 9 3 5 5 3 6 4 ...
## $ customer_rating : num 3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
## $ discount_percent: num 0 0 10 5 20 5 0 0 0 0 ...
## $ is_returned : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 1 1 1 1 ...
## $ order_month : Factor w/ 12 levels "1","2","3","4",..: 2 10 6 7 2 5 7 5 3 2 ...
## $ order_weekday : Factor w/ 7 levels "Friday","Monday",..: 6 1 7 6 1 1 6 5 3 2 ...
Before model training, the class distribution is checked to understand whether return cases are balanced with non-return cases. This step is important because highly imbalanced data may lead to misleading accuracy and weak detection of returned orders.
table(df_cls$is_returned)
##
## No Yes
## 93940 6060
prop.table(table(df_cls$is_returned))
##
## No Yes
## 0.9394 0.0606
The dataset is divided into a training set and a testing set using stratified sampling. To improve the model’s ability to identify return cases, down-sampling is applied to the training set so that the two classes become more balanced.
set.seed(123)
train_index <- createDataPartition(df_cls$is_returned, p = 0.8, list = FALSE)
train_data <- df_cls[train_index, ]
test_data <- df_cls[-train_index, ]
# Check class proportions before balancing
prop.table(table(train_data$is_returned))
##
## No Yes
## 0.9394 0.0606
prop.table(table(test_data$is_returned))
##
## No Yes
## 0.9394 0.0606
# Down-sampling on training data only
set.seed(123)
train_bal <- downSample(
x = train_data %>% select(-is_returned),
y = train_data$is_returned,
yname = "is_returned"
)
table(train_bal$is_returned)
##
## No Yes
## 4848 4848
prop.table(table(train_bal$is_returned))
##
## No Yes
## 0.5 0.5
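What down-sampling does can be sketched in base R without caret: keep every minority-class row and draw an equally sized random sample from the majority class. The data frame below is simulated and the object names are hypothetical; `caret::downSample()` remains the function used in this analysis.

```r
# Simulated orders with roughly 6% returns (not the project data)
set.seed(123)
n <- 5000
toy <- data.frame(
  delivery_days = sample(1:10, n, replace = TRUE),
  is_returned   = factor(
    sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.94, 0.06)),
    levels = c("No", "Yes")
  )
)

# Size of the smallest class
n_min <- min(table(toy$is_returned))

# For each class, draw n_min row indices without replacement
rows_by_class <- split(seq_len(nrow(toy)), toy$is_returned)
keep <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), n_min)]
}))

toy_bal <- toy[keep, ]
print(table(toy_bal$is_returned))
```

Only the training data should be balanced this way; the test set keeps its original class proportions so that the evaluation reflects real conditions.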
Logistic Regression is used as the baseline classification model because it is widely used, easy to interpret, and suitable for binary outcome prediction. It provides a useful benchmark for comparing with more advanced models.
log_model <- glm(
is_returned ~ .,
data = train_bal,
family = binomial
)
summary(log_model)
##
## Call:
## glm(formula = is_returned ~ ., family = binomial, data = train_bal)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.2204184 0.1628072 -1.354 0.1758
## product_categoryBeauty -0.1235981 0.0801757 -1.542 0.1232
## product_categoryElectronics 0.0305900 0.0791179 0.387 0.6990
## product_categoryFashion 0.9060763 0.0735131 12.325 <2e-16 ***
## product_categoryHome -0.1015565 0.0795870 -1.276 0.2019
## product_categorySports -0.0156095 0.0798005 -0.196 0.8449
## product_categoryToys -0.1079902 0.0795688 -1.357 0.1747
## product_price -0.0000821 0.0001457 -0.563 0.5732
## quantity -0.0236625 0.0147006 -1.610 0.1075
## regionAsia 0.0547191 0.0732093 0.747 0.4548
## regionEurope 0.1091548 0.0736951 1.481 0.1386
## regionNorth America 0.0680017 0.0726946 0.935 0.3496
## regionOceania 0.1172048 0.0731439 1.602 0.1091
## regionSouth America 0.0343883 0.0727422 0.473 0.6364
## payment_methodCash 0.0981473 0.0586821 1.673 0.0944 .
## payment_methodCreditCard 0.0146061 0.0584686 0.250 0.8027
## payment_methodPayPal 0.1016816 0.0587932 1.729 0.0837 .
## delivery_days 0.0013527 0.0080947 0.167 0.8673
## customer_rating 0.0164227 0.0238292 0.689 0.4907
## discount_percent -0.0002545 0.0034008 -0.075 0.9403
## order_month2 0.0972665 0.1036660 0.938 0.3481
## order_month3 -0.0751276 0.1017743 -0.738 0.4604
## order_month4 -0.0084290 0.1007660 -0.084 0.9333
## order_month5 0.0223852 0.1005536 0.223 0.8238
## order_month6 0.1085186 0.1014507 1.070 0.2848
## order_month7 -0.0211145 0.1018124 -0.207 0.8357
## order_month8 0.0792844 0.1009125 0.786 0.4321
## order_month9 -0.0210754 0.1019059 -0.207 0.8362
## order_month10 0.1060466 0.1015553 1.044 0.2964
## order_month11 0.0726548 0.1029465 0.706 0.4803
## order_month12 0.1772374 0.1004519 1.764 0.0777 .
## order_weekdayMonday -0.0777023 0.0774448 -1.003 0.3157
## order_weekdaySaturday -0.0614101 0.0769369 -0.798 0.4248
## order_weekdaySunday -0.1363081 0.0773793 -1.762 0.0781 .
## order_weekdayThursday -0.0585028 0.0770337 -0.759 0.4476
## order_weekdayTuesday 0.0679177 0.0776318 0.875 0.3816
## order_weekdayWednesday -0.1172153 0.0777133 -1.508 0.1315
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13442 on 9695 degrees of freedom
## Residual deviance: 13052 on 9659 degrees of freedom
## AIC: 13126
##
## Number of Fisher Scoring iterations: 4
# Predict return probability
log_prob <- predict(log_model, newdata = test_data, type = "response")
# Convert probability to class label using 0.5 threshold
log_pred <- ifelse(log_prob >= 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = c("No", "Yes"))
# Confusion matrix
log_cm <- confusionMatrix(
data = log_pred,
reference = test_data$is_returned,
positive = "Yes"
)
log_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 15119 810
## Yes 3669 402
##
## Accuracy : 0.776
## 95% CI : (0.7702, 0.7818)
## No Information Rate : 0.9394
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0648
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.33168
## Specificity : 0.80472
## Pos Pred Value : 0.09875
## Neg Pred Value : 0.94915
## Prevalence : 0.06060
## Detection Rate : 0.02010
## Detection Prevalence : 0.20355
## Balanced Accuracy : 0.56820
##
## 'Positive' Class : Yes
##
log_roc <- roc(
response = test_data$is_returned,
predictor = log_prob,
levels = c("No", "Yes"),
# Fix the comparison direction so pROC does not choose it automatically
direction = "<"
)
log_auc <- as.numeric(auc(log_roc))
log_auc
## [1] 0.5714915
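One consequence of training on the down-sampled set is that the predicted probabilities are calibrated to a 50/50 class mix rather than to the roughly 6% return rate in the full data, which helps explain why the 0.5 threshold flags about 20% of test orders as returns. A common adjustment, sketched below, rescales a probability p from the balanced model using the fraction beta of majority-class rows that were kept: p_adj = beta * p / (beta * p - p + 1). The function name is hypothetical, and beta is computed from the class counts implied by the output above (4,848 returned and 75,152 non-returned orders in the 80% training split). This is an illustration, not part of the original analysis.

```r
# Rescale probabilities from a model trained on down-sampled data
# back to the original class prior.
# p_bal: probability of "Yes" from the balanced model
# beta : share of majority-class ("No") training rows kept by down-sampling
correct_downsampled_prob <- function(p_bal, beta) {
  beta * p_bal / (beta * p_bal - p_bal + 1)
}

beta <- 4848 / 75152  # kept "No" rows / all "No" rows in the training split

p_bal <- c(0.3, 0.5, 0.7)
p_adj <- correct_downsampled_prob(p_bal, beta)
print(round(p_adj, 4))
```

A balanced-model score of 0.5 maps back to 4848 / 80000 = 0.0606, the base return rate. The rescaling is monotone, so the ranking of orders and the AUC are unchanged; only the interpretation of the scores and the choice of threshold are affected.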
Random Forest is used as a second classification model because it can capture nonlinear relationships and interactions among variables. It is also useful for identifying the relative importance of predictors in return classification.
set.seed(123)
rf_model <- randomForest(
is_returned ~ .,
data = train_bal,
ntree = 200,
importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = is_returned ~ ., data = train_bal, ntree = 200, importance = TRUE)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 44.62%
## Confusion matrix:
## No Yes class.error
## No 2780 2068 0.4265677
## Yes 2258 2590 0.4657591
# Predict return class
rf_pred <- predict(rf_model, newdata = test_data, type = "class")
# Confusion matrix
rf_cm <- confusionMatrix(
data = rf_pred,
reference = test_data$is_returned,
positive = "Yes"
)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 10906 598
## Yes 7882 614
##
## Accuracy : 0.576
## 95% CI : (0.5691, 0.5829)
## No Information Rate : 0.9394
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0228
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.50660
## Specificity : 0.58048
## Pos Pred Value : 0.07227
## Neg Pred Value : 0.94802
## Prevalence : 0.06060
## Detection Rate : 0.03070
## Detection Prevalence : 0.42480
## Balanced Accuracy : 0.54354
##
## 'Positive' Class : Yes
##
# Predict return probability
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "Yes"]
rf_roc <- roc(
response = test_data$is_returned,
predictor = rf_prob,
levels = c("No", "Yes"),
# Fix the comparison direction so pROC does not choose it automatically
direction = "<"
)
rf_auc <- as.numeric(auc(rf_roc))
rf_auc
## [1] 0.5707515
To compare the two models fairly, several evaluation metrics are used, including Accuracy, Precision, Recall, F1-score, and AUC. Since the practical goal is to identify returned orders effectively, Recall and F1-score are especially important.
results_table <- data.frame(
Model = c("Logistic Regression", "Random Forest"),
Accuracy = c(log_cm$overall["Accuracy"], rf_cm$overall["Accuracy"]),
Precision = c(log_cm$byClass["Pos Pred Value"], rf_cm$byClass["Pos Pred Value"]),
Recall = c(log_cm$byClass["Sensitivity"], rf_cm$byClass["Sensitivity"]),
F1 = c(log_cm$byClass["F1"], rf_cm$byClass["F1"]),
AUC = c(log_auc, rf_auc)
)
knitr::kable(results_table, digits = 4, caption = "Performance Comparison of Classification Models")
| Model | Accuracy | Precision | Recall | F1 | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.776 | 0.0987 | 0.3317 | 0.1522 | 0.5715 |
| Random Forest | 0.576 | 0.0723 | 0.5066 | 0.1265 | 0.5708 |
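The table values can be checked by hand from the confusion-matrix counts printed above. The short sketch below (helper name hypothetical) recomputes Accuracy, Precision, Recall, and F1 from the true/false positive and negative counts, with "Yes" as the positive class.

```r
# Metrics from confusion-matrix counts, with "Yes" (returned) as the positive class
cls_metrics <- function(tp, fp, fn, tn) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(
    Accuracy  = (tp + tn) / (tp + fp + fn + tn),
    Precision = precision,
    Recall    = recall,
    F1        = 2 * precision * recall / (precision + recall)
  )
}

# Counts copied from the two confusion matrices above
log_metrics <- cls_metrics(tp = 402, fp = 3669, fn = 810, tn = 15119)
rf_metrics  <- cls_metrics(tp = 614, fp = 7882, fn = 598, tn = 10906)

print(round(rbind(Logistic_Regression = log_metrics, Random_Forest = rf_metrics), 4))
```

Seen this way, the trade-off is explicit: the Random Forest finds more of the 1,212 returned test orders (614 vs 402) but raises more than twice as many false alarms (7,882 vs 3,669).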
The ROC curve is plotted to compare the overall discrimination ability of the two models. A model with a larger AUC is considered to have stronger classification performance.
plot(log_roc, col = "blue", main = "ROC Curve Comparison")
plot(rf_roc, col = "red", add = TRUE)
legend(
"bottomright",
legend = c("Logistic Regression", "Random Forest"),
col = c("blue", "red"),
lwd = 2
)
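AUC has a direct interpretation: it is the probability that a randomly chosen returned order receives a higher score than a randomly chosen non-returned order. The sketch below computes it from ranks (the Mann-Whitney formulation) in base R on a four-observation toy example. The function name is hypothetical, and `pROC::auc()` remains the function used in this analysis.

```r
# AUC via the rank-sum (Mann-Whitney) formula.
# scores: predicted scores; is_pos: logical vector, TRUE for the positive class
auc_rank <- function(scores, is_pos) {
  r     <- rank(scores)  # average ranks handle tied scores
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

scores <- c(0.10, 0.40, 0.35, 0.80)
is_pos <- c(FALSE, FALSE, TRUE, TRUE)

# 0.75: three of the four positive/negative pairs are ordered correctly
print(auc_rank(scores, is_pos))
```

An AUC of about 0.57, as obtained here, means a returned order outranks a non-returned one only slightly more often than a coin flip would.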
Variable importance is examined based on the Random Forest model. This helps explain which factors contribute most to return behavior and provides useful business insights for return risk management.
varImpPlot(rf_model)
The classification results show that both Logistic Regression and Random Forest provided only limited predictive performance for return behavior. The two models were practically tied on AUC (0.5715 for Logistic Regression, 0.5708 for Random Forest), and both values were only slightly above 0.5, the level expected from random guessing. This indicates that the selected predictors have limited ability to distinguish returned from non-returned orders.
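The weak model performance may be related to the limited set of available features and to class imbalance. Returned orders make up only about 6% of the data, and down-sampling, while it balances the two classes, leaves just 9,696 rows for training. Return behavior may also depend on factors that are not included in the dataset, such as product quality, customer expectations, item fit, and after-sales experience.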
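Even so, the fitted models still provide some business insight. In the logistic regression, the Fashion category is the one clearly significant predictor (estimate 0.906, p < 2e-16), which corresponds to roughly 2.5 times the odds of return relative to the Automotive baseline category. Delivery days, discount percentage, and customer rating are not statistically significant in that model, so any role they play in the Random Forest variable importance ranking should be interpreted cautiously. These findings can support return-risk monitoring, particularly category management for Fashion products.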
This section developed two classification models to predict whether an order would be returned. Logistic Regression was used as a baseline model, while Random Forest was used as a more flexible machine learning model. The models were evaluated using Accuracy, Balanced Accuracy, Precision, Recall, F1-score, and AUC.
Although both models provided some useful insights into return behavior, their predictive performance was relatively weak. This suggests that the available variables explain only part of return behavior, and additional features or further model tuning may be needed to improve classification performance. Even so, the analysis still offers practical value for understanding return-related patterns in e-commerce.
classification_summary <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1", "AUC"),
Logistic_Regression = c(
round(log_cm$overall["Accuracy"], 4),
round(log_cm$byClass["Pos Pred Value"], 4),
round(log_cm$byClass["Sensitivity"], 4),
round(log_cm$byClass["F1"], 4),
round(log_auc, 4)
),
Random_Forest = c(
round(rf_cm$overall["Accuracy"], 4),
round(rf_cm$byClass["Pos Pred Value"], 4),
round(rf_cm$byClass["Sensitivity"], 4),
round(rf_cm$byClass["F1"], 4),
round(rf_auc, 4)
)
)
kable(classification_summary, row.names = FALSE, caption = "Table 1: Classification Model Performance Comparison")
| Metric | Logistic_Regression | Random_Forest |
|---|---|---|
| Accuracy | 0.7760 | 0.5760 |
| Precision | 0.0987 | 0.0723 |
| Recall | 0.3317 | 0.5066 |
| F1 | 0.1522 | 0.1265 |
| AUC | 0.5715 | 0.5708 |
regression_summary <- data.frame(
Model = c("Linear Regression", "Random Forest"),
RMSE = c(round(lm_rmse, 4), round(rf_reg_rmse, 4)),
MAE = c(round(lm_mae, 4), round(rf_reg_mae, 4)),
R_squared = c(round(lm_r2, 4), round(rf_reg_r2, 4))
)
kable(regression_summary, caption = "Table 2: Regression Model Performance Comparison")
| Model | RMSE | MAE | R_squared |
|---|---|---|---|
| Linear Regression | 1.4171 | 1.2062 | 0.0356 |
| Random Forest | 1.4317 | 1.2346 | 0.0236 |
The classification results indicate that both models have limited predictive ability for return behavior. Although Logistic Regression achieved an accuracy of 77.6%, this is below the 93.9% that would be obtained by always predicting “non-return” (the No Information Rate), so accuracy alone does not reliably reflect model performance on this imbalanced dataset. In this setting, Precision, Recall, F1, and AUC are more meaningful evaluation metrics.
Logistic Regression slightly outperformed Random Forest in Precision, F1, and AUC, indicating more balanced overall performance. Random Forest, however, achieved higher Recall, meaning it can identify more actual returned orders, but at the cost of more false positives. Overall, Logistic Regression can be considered the better classification model in this project, but the AUC values of both models are only slightly above 0.5, indicating weak ability to distinguish between returned and non-returned orders.
The limited model performance may be attributed to two main factors. First, returned orders make up only about 6% of the data. Down-sampling the training set counteracts the tendency to predict “non-return”, but at the cost of many false positives: at the 0.5 threshold, fewer than 10% of the orders flagged as returns by either model were actually returned. Second, return behavior is typically influenced by factors such as product quality, size fit, customer expectations, and after-sales experience, which are not included in the current dataset. Nevertheless, product category (in particular Fashion, the one clearly significant predictor in the logistic regression) shows an association with return behavior, while delivery days, discount percentage, and customer rating show at most weak associations. These patterns can still inform return management.
The regression results similarly show that both models have limited predictive ability for order quantity. Linear Regression slightly outperformed Random Forest in RMSE, MAE, and R², and can therefore be considered the better regression model in this project, although the Random Forest was trained on a 20,000-row subsample rather than the full training set. The R² values of both models are close to zero (0.036 and 0.024), indicating that the available features can hardly explain the variation in order quantity. Additionally, an RMSE of about 1.42 units is large for a target that only ranges from 1 to 6 with a mean of about 3.1, which makes the predictions of limited practical use.
An important reason for this outcome is that the current data lacks key features closely related to purchase quantity, such as marketing activities, customer loyalty, inventory status, and competitor pricing. Seasonality is captured only through order month and weekday, where November stands out in the linear model. Therefore, the existing variables cannot adequately capture the true drivers of order quantity. Despite the limited predictive performance, variables such as order month, product category, discount percentage, price, and region can still be monitored to support inventory management and promotion strategy optimization.
This project completed two machine learning tasks based on an e-commerce dataset: return behavior prediction (classification) and order quantity prediction (regression).
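In both tasks, predictive performance was weak: the best classification AUC was about 0.57 and the best regression R² was about 0.04. This indicates that the features in the current dataset are insufficient for accurate prediction of return behavior and order quantity.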
Nevertheless, this project still holds analytical value. The results of data cleaning, exploratory analysis, and model comparison point to delivery days, discount percentage, customer rating, product category, price, and region as candidate drivers of business outcomes, although in the fitted models only product category (Fashion, for returns) and order month (November, for quantity) showed clear effects.
Based on these findings, businesses may consider optimizing delivery efficiency, adjusting promotion strategies, strengthening quality management for high-return categories, and collecting richer customer behavior and product information in the future. Overall, even with limited predictive performance, a complete data science analytical process can still provide meaningful support for business decision-making.