1 Introduction

The rapid growth of e-commerce has created many opportunities for online businesses, but it has also brought challenges such as product returns and unpredictable sales demand. Product returns can increase operational costs, affect customer satisfaction, and reduce business efficiency. At the same time, accurate sales quantity prediction is important for inventory planning, stock management, and revenue growth.

This project analyzes an e-commerce transaction dataset to study return behavior and sales quantity prediction. The project focuses on two main machine learning tasks: classification and regression. The classification task aims to predict whether an order will be returned, while the regression task aims to predict the quantity of products sold. These analyses can help e-commerce companies identify return drivers, improve decision-making, reduce unnecessary costs, and manage stock more effectively.

2 Dataset Description

The dataset used in this project is a synthetic e-commerce sales dataset (synthetic_ecommerce_sales_2025.csv). It contains 100,000 transaction records and 13 variables, and the order dates run from 1 January 2023 to 31 December 2025. The dataset includes customer, product, order, payment, delivery, return, rating, discount, and revenue information. The variables are a mix of numerical, categorical, and date-based data, which makes the dataset suitable for both classification and regression analysis.

# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.3     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Create dataset table
dataset_table <- data.frame(
  Variable = c("order_id", "customer_id", "product_category", "product_price",
               "quantity", "order_date", "region", "payment_method",
               "delivery_days", "is_returned", "customer_rating",
               "discount_percent", "revenue"),
  Description = c("Unique order identification number",
                  "Unique customer identification code",
                  "Category of the product purchased",
                  "Price of the product",
                  "Number of units purchased",
                  "Date of the order",
                  "Customer region",
                  "Payment method used",
                  "Number of delivery days",
                  "Return status: 1 = returned, 0 = not returned",
                  "Customer rating score",
                  "Discount percentage applied",
                  "Total revenue generated from the order")
)

# Display table
kable(dataset_table, caption = "Dataset Variable Description") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = FALSE
  )
Table: Dataset Variable Description

Variable            Description
order_id            Unique order identification number
customer_id         Unique customer identification code
product_category    Category of the product purchased
product_price       Price of the product
quantity            Number of units purchased
order_date          Date of the order
region              Customer region
payment_method      Payment method used
delivery_days       Number of delivery days
is_returned         Return status: 1 = returned, 0 = not returned
customer_rating     Customer rating score
discount_percent    Discount percentage applied
revenue             Total revenue generated from the order

3 Research Objectives and Questions

This project has two main research objectives. The first objective is to predict product return behavior using classification techniques. The second objective is to predict product sales quantity using regression techniques.

3.1 Research Question 1: Classification

Can we predict whether an e-commerce order will be returned based on customer, product, payment, delivery, rating, and discount features?

The objective of this question is to identify return behavior and understand the main factors that may influence product returns. This can help e-commerce companies detect high-risk orders earlier, reduce return-related costs, and improve customer satisfaction.

3.2 Research Question 2: Regression

Can we predict the quantity of products sold based on product, customer, price, discount, region, payment, and delivery features?

The objective of this question is to predict sales quantity and identify possible growth drivers. This can help businesses improve inventory planning, manage stock more efficiently, and support better sales strategies.

4 Data Cleaning

The purpose of the data cleaning stage is to prepare a reliable processed dataset for further exploratory analysis and modelling. In this stage, we check for and handle missing values, correct data types, flag abnormal records, check revenue consistency, and remove duplicate records.

This cleaned dataset supports two main objectives: a classification problem for return behavior warning using is_returned as the target variable, and a regression problem for sales quantity prediction using quantity as the target variable.

4.1 Load Packages and Data

# tidyverse already attaches dplyr, ggplot2, and lubridate (see the startup message above)
library(tidyverse)
library(skimr)
library(corrplot)
file_path <- "synthetic_ecommerce_sales_2025.csv"
df_raw <- read.csv(file_path, stringsAsFactors = FALSE)

glimpse(df_raw)
## Rows: 100,000
## Columns: 13
## $ order_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id      <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <chr> "Beauty", "Fashion", "Beauty", "Electronics", "Fashio…
## $ product_price    <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity         <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date       <chr> "2023-02-21", "2023-10-13", "2023-06-28", "2023-07-11…
## $ region           <chr> "Europe", "North America", "Oceania", "Europe", "Afri…
## $ payment_method   <chr> "BankTransfer", "CreditCard", "Cash", "PayPal", "PayP…
## $ delivery_days    <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned      <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating  <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <int> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue          <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
skim(df_raw)
Data summary
Name df_raw
Number of rows 100000
Number of columns 13
_______________________
Column type frequency:
character 5
numeric 8
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
customer_id 0 1 36 36 0 100000 0
product_category 0 1 4 11 0 7 0
order_date 0 1 10 10 0 1096 0
region 0 1 4 13 0 6 0
payment_method 0 1 4 12 0 4 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
order_id 0 1 50000.50 28867.66 1.00 25000.75 50000.50 75000.25 100000.00 ▇▇▇▇▇
product_price 0 1 250.96 141.74 4.52 128.39 251.43 372.27 500.00 ▇▇▇▇▇
quantity 0 1 3.09 1.44 1.00 2.00 3.00 4.00 6.00 ▇▅▅▅▁
delivery_days 0 1 4.98 2.58 1.00 3.00 5.00 7.00 9.00 ▇▇▃▇▇
is_returned 0 1 0.06 0.24 0.00 0.00 0.00 0.00 1.00 ▇▁▁▁▁
customer_rating 0 1 3.50 0.87 2.00 2.80 3.50 4.20 5.00 ▇▇▇▇▇
discount_percent 0 1 5.01 6.14 0.00 0.00 0.00 10.00 20.00 ▇▃▂▂▁
revenue 0 1 734.15 571.19 4.26 274.11 585.55 1089.86 2699.14 ▇▅▂▁▁

4.2 Check Missing Values

missing_summary <- colSums(is.na(df_raw))
missing_summary
##         order_id      customer_id product_category    product_price 
##                0                0                0                0 
##         quantity       order_date           region   payment_method 
##                0                0                0                0 
##    delivery_days      is_returned  customer_rating discount_percent 
##                0                0                0                0 
##          revenue 
##                0

4.3 Remove Columns with More Than 50% Missing Values

This step uses the missing-value counts from Section 4.2 to identify columns with more than 50% missing values. Columns with a high proportion of missing values may contain incomplete information and could negatively affect subsequent analysis and model development. Therefore, the names of these columns are stored in high_na and the columns are removed from the dataset if any exist. If no such columns are found, the original dataset is retained.

high_na <- names(missing_summary[missing_summary > 0.5 * nrow(df_raw)])

if(length(high_na) > 0) {
  df_clean <- df_raw %>% select(-all_of(high_na))
} else {
  df_clean <- df_raw
}

high_na
## character(0)
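
The threshold rule described above can also be written as a small reusable function. The following is a minimal base-R sketch under stated assumptions: drop_high_na, its threshold argument, and the toy data frame are hypothetical names and values introduced for illustration, not part of the project code.

```r
# Hypothetical helper: drop columns whose share of missing values exceeds `threshold`.
drop_high_na <- function(df, threshold = 0.5) {
  na_share <- colMeans(is.na(df))   # proportion of NA values per column
  keep <- na_share <= threshold
  df[, keep, drop = FALSE]
}

# Toy data (made up for illustration): column b is 75% missing, column c is 25% missing.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 1),
  c = c("x", NA, "y", "z")
)

result <- drop_high_na(toy)
names(result)
```

Working with proportions through colMeans() keeps the threshold independent of the number of rows, which may be convenient if the same check is reused on a subset of the data.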

4.4 Convert Data Types

The following columns are converted into suitable data types:

  • order_date as Date
  • is_returned, product_category, region, and payment_method as factor variables
  • quantity and delivery_days as integer variables
  • product_price, discount_percent, and customer_rating as numeric variables
  • revenue as a numeric variable, if the column is present

df_clean <- df_clean %>%
  mutate(
    order_date = as.Date(order_date, format = "%Y-%m-%d"),
    is_returned = as.factor(is_returned),
    product_category = as.factor(product_category),
    region = as.factor(region),
    payment_method = as.factor(payment_method),
    quantity = as.integer(quantity),
    product_price = as.numeric(product_price),
    discount_percent = as.numeric(discount_percent),
    delivery_days = as.integer(delivery_days),
    customer_rating = as.numeric(customer_rating)
  )

if("revenue" %in% names(df_clean)) {
  df_clean$revenue <- as.numeric(df_clean$revenue)
}

glimpse(df_clean)
## Rows: 100,000
## Columns: 13
## $ order_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id      <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <fct> Beauty, Fashion, Beauty, Electronics, Fashion, Beauty…
## $ product_price    <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity         <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date       <date> 2023-02-21, 2023-10-13, 2023-06-28, 2023-07-11, 2023…
## $ region           <fct> Europe, North America, Oceania, Europe, Africa, Ocean…
## $ payment_method   <fct> BankTransfer, CreditCard, Cash, PayPal, PayPal, PayPa…
## $ delivery_days    <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned      <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating  <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <dbl> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue          <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
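
Type conversion can silently introduce missing values, because as.Date() returns NA for strings that do not match the format or are not valid calendar dates. The sketch below shows one possible way to detect such cases; the date strings are made up for illustration and do not come from the dataset.

```r
# Made-up date strings; the second one is not a valid calendar date.
raw_dates <- c("2023-02-21", "2023-02-30", "2023-10-13")
parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Values that were present before conversion but are NA afterwards failed to parse.
failed <- !is.na(raw_dates) & is.na(parsed)
raw_dates[failed]
```

Comparing the missing-value counts before and after conversion would serve the same purpose on the full dataset.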

4.5 Handle Missing Values

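For numerical variables, missing values are replaced with the median. For factor variables, missing values are replaced with the mode. Because Section 4.2 found no missing values, this step leaves the data unchanged here and acts as a safeguard.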

# Fill numerical missing values using median
numeric_cols <- df_clean %>% select(where(is.numeric)) %>% names()

for(col in numeric_cols) {
  if(any(is.na(df_clean[[col]]))) {
    df_clean[[col]][is.na(df_clean[[col]])] <- median(df_clean[[col]], na.rm = TRUE)
  }
}

# Fill factor missing values using mode
factor_cols <- df_clean %>% select(where(is.factor)) %>% names()

for(col in factor_cols) {
  if(any(is.na(df_clean[[col]]))) {
    mode_val <- names(sort(table(df_clean[[col]]), decreasing = TRUE))[1]
    df_clean[[col]][is.na(df_clean[[col]])] <- mode_val
  }
}

colSums(is.na(df_clean))
##         order_id      customer_id product_category    product_price 
##                0                0                0                0 
##         quantity       order_date           region   payment_method 
##                0                0                0                0 
##    delivery_days      is_returned  customer_rating discount_percent 
##                0                0                0                0 
##          revenue 
##                0
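
The median of an integer column can be fractional, in which case assigning it would turn the column into a double. As a hedged variant of the imputation above, the sketch below rounds the median for integer columns so that the type set in Section 4.4 is preserved. The helper impute_column and the toy data are hypothetical and are shown only to illustrate the idea.

```r
# Hypothetical helper: impute one column with the median (numeric) or the mode (factor).
impute_column <- function(x) {
  if (!anyNA(x)) return(x)
  if (is.numeric(x)) {
    fill <- median(x, na.rm = TRUE)
    # Keep integer columns integer by rounding the median before assignment.
    if (is.integer(x)) fill <- as.integer(round(fill))
    x[is.na(x)] <- fill
  } else if (is.factor(x)) {
    counts <- table(x)   # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
  }
  x
}

# Toy data (made up for illustration).
toy <- data.frame(
  qty = c(1L, 2L, NA, 4L),
  region = factor(c("Asia", NA, "Asia", "Europe"))
)
toy[] <- lapply(toy, impute_column)
toy
```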

4.6 Flag Outliers Based on Business Rules

Outliers are flagged instead of immediately removed, because abnormal values may still provide useful business information. The following rules are used:

  • quantity > 100 or quantity <= 0
  • product_price > 10000 or product_price <= 0
  • delivery_days > 30 or delivery_days < 0
  • discount_percent > 100 or discount_percent < 0
  • customer_rating > 5 or customer_rating < 1
df_clean <- df_clean %>%
  mutate(
    is_quantity_outlier = if_else(quantity > 100 | quantity <= 0, 1, 0),
    is_price_outlier = if_else(product_price > 10000 | product_price <= 0, 1, 0),
    is_delivery_outlier = if_else(delivery_days > 30 | delivery_days < 0, 1, 0),
    is_discount_outlier = if_else(discount_percent > 100 | discount_percent < 0, 1, 0),
    is_rating_outlier = if_else(customer_rating > 5 | customer_rating < 1, 1, 0),
    is_any_outlier = pmax(
      is_quantity_outlier,
      is_price_outlier,
      is_delivery_outlier,
      is_discount_outlier,
      is_rating_outlier
    )
  )

outlier_summary <- df_clean %>%
  summarise(
    quantity_outliers = sum(is_quantity_outlier),
    price_outliers = sum(is_price_outlier),
    delivery_outliers = sum(is_delivery_outlier),
    discount_outliers = sum(is_discount_outlier),
    rating_outliers = sum(is_rating_outlier),
    any_outliers = sum(is_any_outlier)
  )

outlier_summary
##   quantity_outliers price_outliers delivery_outliers discount_outliers
## 1                 0              0                 0                 0
##   rating_outliers any_outliers
## 1               0            0
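
The business rules listed above can also be kept in a small lookup table, which may make them easier to review and change. This is a base-R sketch, not the project's implementation: the rules table, the flag_outliers helper, and the toy rows are hypothetical, and the generated flag names (for example is_product_price_outlier) differ slightly from the names used in the code above.

```r
# Bounds copied from the rules listed above; a value is flagged when it falls outside them.
# lower_open = TRUE means the lower bound itself is also flagged (the "<= 0" rules).
rules <- data.frame(
  column     = c("quantity", "product_price", "delivery_days",
                 "discount_percent", "customer_rating"),
  lower      = c(0, 0, 0, 0, 1),
  upper      = c(100, 10000, 30, 100, 5),
  lower_open = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return one 0/1 flag column per rule.
flag_outliers <- function(df, rules) {
  flags <- lapply(seq_len(nrow(rules)), function(i) {
    x <- df[[rules$column[i]]]
    below <- if (rules$lower_open[i]) x <= rules$lower[i] else x < rules$lower[i]
    as.integer(below | x > rules$upper[i])
  })
  names(flags) <- paste0("is_", rules$column, "_outlier")
  as.data.frame(flags)
}

# Toy rows (made up): the second row violates the quantity and rating rules.
toy <- data.frame(
  quantity = c(2, 0), product_price = c(19.9, 50), delivery_days = c(3, 0),
  discount_percent = c(10, 0), customer_rating = c(4.5, 0.5)
)
flags <- flag_outliers(toy, rules)
flags$is_any_outlier <- as.integer(rowSums(flags) > 0)
flags
```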

4.7 Check Revenue Consistency

If the dataset contains a revenue column, revenue is recalculated using:

\[ Revenue = Product\ Price \times Quantity \times (1 - Discount\ Percent / 100) \]

Rows are kept when the absolute difference between the original revenue and the calculated revenue is less than 0.01.

if("revenue" %in% names(df_clean)) {
  rows_before_revenue_check <- nrow(df_clean)
  
  df_clean <- df_clean %>%
    mutate(revenue_calc = product_price * quantity * (1 - discount_percent / 100)) %>%
    filter(abs(revenue - revenue_calc) < 0.01) %>%
    select(-revenue_calc)
  
  rows_after_revenue_check <- nrow(df_clean)
  revenue_removed <- rows_before_revenue_check - rows_after_revenue_check
  
  revenue_removed
}
## [1] 0
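
As a worked example of the consistency rule, the sketch below applies the same formula to the first six orders shown in the glimpse() output in Section 4.1. The stored revenue appears to be rounded to two decimal places, so small differences of a few thousandths are expected and stay under the 0.01 tolerance. The column names revenue_calc, abs_diff, and consistent are chosen here for illustration.

```r
# First six orders as shown in the glimpse() output earlier in this report.
orders <- data.frame(
  product_price    = c(190.40, 82.22, 15.19, 310.65, 74.05, 236.05),
  quantity         = c(5, 3, 2, 2, 4, 5),
  discount_percent = c(0, 0, 10, 5, 20, 5),
  revenue          = c(952.00, 246.66, 27.34, 590.23, 236.96, 1121.24)
)

# Recalculate revenue and compare it with the stored value.
orders$revenue_calc <- with(orders, product_price * quantity * (1 - discount_percent / 100))
orders$abs_diff     <- abs(orders$revenue - orders$revenue_calc)
orders$consistent   <- orders$abs_diff < 0.01

orders[, c("revenue", "revenue_calc", "abs_diff", "consistent")]
```

For the third order, for instance, 15.19 × 2 × 0.90 = 27.342, which differs from the stored 27.34 by 0.002.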

4.8 Remove Duplicate Rows

This step checks for and removes fully duplicated records to ensure data quality.

rows_before_duplicates <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
rows_after_duplicates <- nrow(df_clean)
duplicate_rows_removed <- rows_before_duplicates - rows_after_duplicates

duplicate_rows_removed
## [1] 0
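
A full-row check cannot find repeated orders when every row carries its own order_id. As an optional complementary check, duplicates can be counted on the remaining columns only. The sketch below uses base R and made-up toy rows; which columns should count as identifiers is an assumption that depends on the dataset.

```r
# Toy data (made up): orders 2 and 3 are identical except for order_id.
toy <- data.frame(
  order_id    = 1:3,
  customer_id = c("c1", "c2", "c2"),
  quantity    = c(1, 2, 2),
  order_date  = as.Date(c("2023-01-05", "2023-01-06", "2023-01-06"))
)

# Full-row check: no duplicates, because order_id differs on every row.
full_dups <- sum(duplicated(toy))

# Check on the remaining columns only, ignoring the identifier.
key_cols <- setdiff(names(toy), "order_id")
content_dups <- sum(duplicated(toy[, key_cols]))

c(full_row = full_dups, ignoring_order_id = content_dups)
```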

4.9 Save Cleaned Dataset

output_path <- "ecommerce_cleaned.csv"
write.csv(df_clean, output_path, row.names = FALSE)

cat("Cleaned data saved to:", output_path, "\n")
## Cleaned data saved to: ecommerce_cleaned.csv
cat("Cleaned dataset dimensions:", nrow(df_clean), "rows and", ncol(df_clean), "columns\n")
## Cleaned dataset dimensions: 100000 rows and 19 columns

5 Exploratory Data Analysis

5.1 Dataset Summary After Cleaning

dim(df_clean)
## [1] 100000     19
summary(df_clean)
##     order_id         customer_id        product_category product_price    
##  Min.   :     1   Length   :100000   Automotive :14239   Min.   :  4.518  
##  1st Qu.: 25001   N.unique :100000   Beauty     :14234   1st Qu.:128.387  
##  Median : 50000   N.blank  :     0   Electronics:14375   Median :251.430  
##  Mean   : 50000   Min.nchar:    36   Fashion    :14327   Mean   :250.963  
##  3rd Qu.: 75000   Max.nchar:    36   Home       :14182   3rd Qu.:372.270  
##  Max.   :100000                      Sports     :14354   Max.   :500.000  
##                                      Toys       :14289                    
##     quantity       order_date                   region     
##  Min.   :1.000   Min.   :2023-01-01   Africa       :16450  
##  1st Qu.:2.000   1st Qu.:2023-10-02   Asia         :16763  
##  Median :3.000   Median :2024-07-05   Europe       :16513  
##  Mean   :3.085   Mean   :2024-07-02   North America:16749  
##  3rd Qu.:4.000   3rd Qu.:2025-04-04   Oceania      :16965  
##  Max.   :6.000   Max.   :2025-12-31   South America:16560  
##                                                            
##       payment_method  delivery_days   is_returned customer_rating
##  BankTransfer:25083   Min.   :1.000   0:93940     Min.   :2.0    
##  Cash        :24710   1st Qu.:3.000   1: 6060     1st Qu.:2.8    
##  CreditCard  :25222   Median :5.000               Median :3.5    
##  PayPal      :24985   Mean   :4.985               Mean   :3.5    
##                       3rd Qu.:7.000               3rd Qu.:4.2    
##                       Max.   :9.000               Max.   :5.0    
##                                                                  
##  discount_percent    revenue        is_quantity_outlier is_price_outlier
##  Min.   : 0.000   Min.   :   4.26   Min.   :0           Min.   :0       
##  1st Qu.: 0.000   1st Qu.: 274.11   1st Qu.:0           1st Qu.:0       
##  Median : 0.000   Median : 585.55   Median :0           Median :0       
##  Mean   : 5.015   Mean   : 734.15   Mean   :0           Mean   :0       
##  3rd Qu.:10.000   3rd Qu.:1089.86   3rd Qu.:0           3rd Qu.:0       
##  Max.   :20.000   Max.   :2699.14   Max.   :0           Max.   :0       
##                                                                         
##  is_delivery_outlier is_discount_outlier is_rating_outlier is_any_outlier
##  Min.   :0           Min.   :0           Min.   :0         Min.   :0     
##  1st Qu.:0           1st Qu.:0           1st Qu.:0         1st Qu.:0     
##  Median :0           Median :0           Median :0         Median :0     
##  Mean   :0           Mean   :0           Mean   :0         Mean   :0     
##  3rd Qu.:0           3rd Qu.:0           3rd Qu.:0         3rd Qu.:0     
##  Max.   :0           Max.   :0           Max.   :0         Max.   :0     
## 
skim(df_clean)
Data summary
Name df_clean
Number of rows 100000
Number of columns 19
_______________________
Column type frequency:
character 1
Date 1
factor 4
numeric 13
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
customer_id 0 1 36 36 0 1e+05 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
order_date 0 1 2023-01-01 2025-12-31 2024-07-05 1096

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
product_category 0 1 FALSE 7 Ele: 14375, Spo: 14354, Fas: 14327, Toy: 14289
region 0 1 FALSE 6 Oce: 16965, Asi: 16763, Nor: 16749, Sou: 16560
payment_method 0 1 FALSE 4 Cre: 25222, Ban: 25083, Pay: 24985, Cas: 24710
is_returned 0 1 FALSE 2 0: 93940, 1: 6060

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
order_id 0 1 50000.50 28867.66 1.00 25000.75 50000.50 75000.25 100000.00 ▇▇▇▇▇
product_price 0 1 250.96 141.74 4.52 128.39 251.43 372.27 500.00 ▇▇▇▇▇
quantity 0 1 3.09 1.44 1.00 2.00 3.00 4.00 6.00 ▇▅▅▅▁
delivery_days 0 1 4.98 2.58 1.00 3.00 5.00 7.00 9.00 ▇▇▃▇▇
customer_rating 0 1 3.50 0.87 2.00 2.80 3.50 4.20 5.00 ▇▇▇▇▇
discount_percent 0 1 5.01 6.14 0.00 0.00 0.00 10.00 20.00 ▇▃▂▂▁
revenue 0 1 734.15 571.19 4.26 274.11 585.55 1089.86 2699.14 ▇▅▂▁▁
is_quantity_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
is_price_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
is_delivery_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
is_discount_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
is_rating_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁
is_any_outlier 0 1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ▁▁▇▁▁

The cleaned dataset summary provides an overview of the dataset after the data cleaning process. It contains 100,000 rows and 19 columns: the 13 original variables plus the six outlier flag columns added in Section 4.6. The summary also shows basic statistics for each variable and the overall data structure, which confirms that the dataset is ready for exploratory analysis and model development.

5.2 Univariate Analysis

5.2.1 Return Distribution

ggplot(df_clean, aes(x = is_returned, fill = is_returned)) +
  geom_bar() +
  labs(
    title = "Distribution of Returns",
    x = "Returned",
    y = "Count"
  ) +
  theme_minimal()

The return distribution shows the frequency of returned and non-returned orders. This is important for the classification objective because is_returned is the target variable for the return behavior warning model. Returned orders make up only 6,060 of the 100,000 orders (about 6.1%), so the dataset is imbalanced. This should be considered during model development because a model trained on imbalanced data may predict the majority class most of the time and perform poorly in detecting returned orders.
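
To make the effect of class imbalance concrete, the sketch below uses the class counts reported by summary(df_clean) in Section 5.1 to compute the accuracy of a trivial model that always predicts "not returned". The variable names are chosen here for illustration.

```r
# Class counts taken from summary(df_clean) in Section 5.1.
class_counts <- c("0" = 93940, "1" = 6060)

class_share <- class_counts / sum(class_counts)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy <- max(class_share)

round(class_share, 4)
baseline_accuracy
```

Such a model reaches high accuracy while detecting none of the returned orders, which suggests that metrics such as recall or precision for the returned class may be more informative than accuracy alone.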

5.2.2 Quantity Distribution

ggplot(df_clean, aes(x = quantity)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Sales Quantity",
    x = "Quantity",
    y = "Frequency"
  ) +
  theme_minimal()

This chart shows the distribution of the regression target variable, quantity. In this dataset, quantity takes whole-number values from 1 to 6, with a mean of about 3.09 units per order. The chart helps identify whether the quantity values are concentrated, skewed, or contain unusual patterns. Understanding this distribution is important for the regression objective because the shape of the target variable can affect the performance of sales quantity prediction models.

5.2.3 Product Price Distribution

ggplot(df_clean, aes(x = product_price)) +
  geom_histogram(bins = 40, fill = "darkgreen", alpha = 0.7) +
  scale_x_log10() +
  labs(
    title = "Product Price Distribution (Log Scale)",
    x = "Product Price (Log Scale)",
    y = "Frequency"
  ) +
  theme_minimal()

A log scale is often used for prices because they tend to be right-skewed, with many low-to-medium priced products and fewer high-priced products. In this dataset, however, the summary statistics in Section 5.1 show prices spread fairly evenly between 4.52 and 500.00 (mean 250.96, median 251.43), so the log scale mainly compresses the upper part of the range and a linear axis may be easier to read. Since product price may influence both sales quantity and return behavior, understanding its distribution is useful for later regression and classification analysis.

5.3 Bivariate Analysis for Classification Problem: Return Behavior

5.3.1 Delivery Days by Return Status

ggplot(df_clean, aes(x = is_returned, y = delivery_days, fill = is_returned)) +
  geom_boxplot() +
  labs(
    title = "Delivery Days by Return Status",
    x = "Returned",
    y = "Delivery Days"
  ) +
  theme_minimal()

This boxplot compares delivery days between returned and non-returned orders. It helps evaluate whether longer delivery time is associated with a higher probability of return. If returned orders tend to have longer delivery days, this may suggest that delivery delays increase customer dissatisfaction and return risk. In this dataset, however, the correlation between delivery_days and return status reported in Section 5.8 is close to zero (-0.0058), so delivery_days is likely to be a weak predictor on its own.

5.3.2 Return Proportion by Discount Level

df_clean %>%
  mutate(
    discount_group = cut(
      discount_percent,
      breaks = c(-Inf, 5, 15, 30, Inf),
      labels = c("0-5%", "5-15%", "15-30%", "30%+"),
      right = TRUE
    )
  ) %>%
  ggplot(aes(x = discount_group, fill = is_returned)) +
  geom_bar(position = "fill") +
  labs(
    title = "Return Proportion by Discount Level",
    x = "Discount Level",
    y = "Proportion"
  ) +
  theme_minimal()

This visualisation compares the return proportion across different discount levels. It helps determine whether high-discount orders have a different return pattern compared with low-discount orders. If high-discount orders show a higher return proportion, this may indicate that aggressive discounts encourage impulse purchases or uncertain buying decisions, which could increase return risk. Because discount_percent ranges only from 0 to 20 in this dataset, the "30%+" group contains no orders. This insight is useful for both return prediction and promotion strategy planning.
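
The group labels used in the chart can be misread at the boundaries, so it may help to see how cut() assigns individual values. The sketch below applies the same breaks and labels to a few discount values. The values 0, 5, 10, and 20 appear in the data preview; 15 is an assumption added for illustration.

```r
# Discount values used for illustration (15 is an assumed value).
discounts <- c(0, 5, 10, 15, 20)

groups <- cut(
  discounts,
  breaks = c(-Inf, 5, 15, 30, Inf),
  labels = c("0-5%", "5-15%", "15-30%", "30%+"),
  right = TRUE   # intervals are closed on the right: (5, 15] contains 15 but not 5
)

data.frame(discount_percent = discounts, discount_group = groups)
table(groups)
```

With right = TRUE, a discount of exactly 5 falls into "0-5%" and a discount of exactly 15 falls into "5-15%".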

5.3.3 Customer Rating Density by Return Status

ggplot(df_clean, aes(x = customer_rating, fill = is_returned)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Customer Rating Density by Return Status",
    x = "Customer Rating",
    y = "Density"
  ) +
  theme_minimal()

This density plot compares customer rating patterns between returned and non-returned orders. Customer rating can reflect customer satisfaction with product quality, delivery experience, or expectation fulfilment. If returned orders are more concentrated at lower rating values, this suggests that low customer satisfaction may be related to return behavior. Therefore, customer_rating may be an important feature for identifying return risk.

5.4 Bivariate Analysis for Regression Problem: Sales Quantity

5.4.1 Product Price vs Sales Quantity

ggplot(df_clean, aes(x = product_price, y = quantity)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Product Price vs Sales Quantity",
    x = "Product Price",
    y = "Quantity"
  ) +
  theme_minimal()

This scatter plot examines whether product price has a linear relationship with sales quantity. A negative trend would suggest that higher-priced products tend to be purchased in smaller quantities, while a weak relationship would indicate that product price alone may not strongly explain sales quantity. This analysis is useful for the regression objective because product_price may be considered as one of the predictors for sales quantity prediction.

5.4.2 Average Quantity by Region

region_qty <- df_clean %>%
  group_by(region) %>%
  summarise(avg_quantity = mean(quantity), .groups = "drop")

region_qty
## # A tibble: 6 × 2
##   region        avg_quantity
##   <fct>                <dbl>
## 1 Africa                3.10
## 2 Asia                  3.10
## 3 Europe                3.08
## 4 North America         3.08
## 5 Oceania               3.08
## 6 South America         3.08
ggplot(region_qty, aes(x = reorder(region, avg_quantity), y = avg_quantity, fill = region)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Sales Quantity by Region",
    x = "Region",
    y = "Average Quantity"
  ) +
  theme_minimal()

This analysis compares average sales quantity across regions. In this dataset the differences are very small: every region averages between 3.08 and 3.10 units per order, so region on its own is unlikely to explain much of the variation in quantity. If larger regional gaps appeared in other data, regions with higher average quantity might require stronger inventory planning, while regions with lower average quantity might need targeted promotions or further investigation.

5.4.3 Discount vs Quantity by Return Status

ggplot(df_clean, aes(x = discount_percent, y = quantity, color = is_returned)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, method = "lm") +
  labs(
    title = "Discount vs Quantity by Return Status",
    x = "Discount Percent",
    y = "Quantity"
  ) +
  theme_minimal()

This plot explores whether discount percentage is associated with higher sales quantity and whether the relationship differs between returned and non-returned orders. If higher discounts increase quantity but also appear more often among returned orders, the business needs to balance sales growth with return risk. This analysis connects both project objectives because discount percentage may influence sales quantity prediction and return behavior classification.

5.5 Correlation Matrix

numeric_vars <- df_clean %>% select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")

corrplot(
  cor_matrix,
  method = "number",
  type = "upper",
  tl.cex = 0.8,
  number.cex = 0.7
)

The correlation matrix provides an overview of linear relationships among numerical variables. It is useful for identifying highly related variables and possible predictors for regression modelling. For example, variables that have stronger relationships with quantity may be useful for sales quantity prediction. However, correlation only measures linear relationships and does not imply causation, so further modelling and evaluation are still required.
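
One practical detail is that cor() returns NA (with a warning) for any column that has zero variance, and the six outlier flag columns are 0 for every row in this dataset. A possible way to handle this, sketched below in base R with made-up toy data, is to drop constant columns before computing the matrix. The names varies and cor_input are chosen here for illustration.

```r
# Toy data (made up): `flag` is constant, like an outlier flag that is 0 for every row.
toy <- data.frame(
  quantity = c(1, 2, 3, 4, 5),
  price    = c(10, 8, 9, 5, 4),
  flag     = c(0, 0, 0, 0, 0)
)

# Keep only columns that take more than one distinct value.
varies <- vapply(toy, function(x) length(unique(x)) > 1, logical(1))
cor_input <- toy[, varies, drop = FALSE]

cor_matrix <- cor(cor_input)
cor_matrix
```

Identifier columns such as order_id could be excluded in the same step, since their correlations carry no business meaning.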

5.6 Time Trend Analysis

monthly_trend <- df_clean %>%
  group_by(month = floor_date(order_date, "month")) %>%
  summarise(
    total_quantity = sum(quantity),
    return_rate = mean(as.numeric(is_returned) - 1, na.rm = TRUE),
    .groups = "drop"
  )

monthly_trend
## # A tibble: 36 × 3
##    month      total_quantity return_rate
##    <date>              <int>       <dbl>
##  1 2023-01-01           8615      0.0548
##  2 2023-02-01           7410      0.0604
##  3 2023-03-01           8398      0.0561
##  4 2023-04-01           8013      0.0636
##  5 2023-05-01           8532      0.0610
##  6 2023-06-01           8395      0.0618
##  7 2023-07-01           8480      0.0623
##  8 2023-08-01           8694      0.0585
##  9 2023-09-01           8190      0.0545
## 10 2023-10-01           8617      0.0600
## # ℹ 26 more rows
monthly_trend %>%
  pivot_longer(
    cols = c(total_quantity, return_rate),
    names_to = "metric",
    values_to = "value"
  ) %>%
  ggplot(aes(x = month, y = value)) +
  geom_line(color = "steelblue") +
  facet_wrap(~metric, scales = "free_y") +
  labs(
    title = "Monthly Trends: Total Quantity and Return Rate",
    x = "Month",
    y = "Value"
  ) +
  theme_minimal()

The time trend analysis helps identify seasonal patterns in total sales quantity and whether return rates fluctuate across months. This is useful for detecting peak sales periods and monitoring whether return rates increase during high-demand months. For the regression objective, month-based trends may support sales quantity prediction. For the classification objective, monthly return rate patterns may help identify periods with higher return risk.
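
The return rate above is computed with as.numeric(is_returned) - 1, which works because the factor levels are "0" and "1" in that order. The sketch below shows why the subtraction is needed and one alternative that does not rely on level order. The small factor is made up for illustration.

```r
# A factor with levels "0" and "1", as produced by as.factor(is_returned).
returned <- factor(c("0", "1", "0", "0"))

as.numeric(returned)                 # internal level codes (1-based), not the labels
as.numeric(as.character(returned))   # the original 0/1 values

# Comparing against the label avoids relying on level order.
return_rate <- mean(returned == "1")
return_rate
```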

5.7 Category-Level Insights

category_return <- df_clean %>%
  group_by(product_category) %>%
  summarise(
    return_rate = mean(as.integer(is_returned) - 1, na.rm = TRUE),
    total_sales = sum(quantity * product_price * (1 - discount_percent / 100), na.rm = TRUE),
    avg_quantity = mean(quantity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(return_rate))

category_return
## # A tibble: 7 × 4
##   product_category return_rate total_sales avg_quantity
##   <fct>                  <dbl>       <dbl>        <dbl>
## 1 Fashion               0.122    10497301.         3.07
## 2 Automotive            0.0527   10575483.         3.11
## 3 Electronics           0.0513   10488699.         3.09
## 4 Home                  0.0511   10369461.         3.09
## 5 Sports                0.0505   10557716.         3.07
## 6 Toys                  0.0490   10553990.         3.09
## 7 Beauty                0.0478   10372151.         3.08
ggplot(category_return, aes(x = reorder(product_category, return_rate), y = return_rate, fill = product_category)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Return Rate by Product Category",
    x = "Product Category",
    y = "Return Rate"
  ) +
  theme_minimal()

This category-level analysis compares return rate, total sales, and average quantity across product categories. Fashion stands out with a return rate of about 12.2%, more than double that of any other category (all between roughly 4.8% and 5.3%). This may warrant further investigation, such as reviewing product quality, product descriptions, sizing information, or customer expectations. Average quantity, in contrast, varies only between 3.07 and 3.11 across categories. Therefore, product_category appears more useful for return behavior classification than for sales quantity prediction.
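
The gap between categories can be quantified directly from the table above. The sketch below compares the Fashion return rate with the unweighted mean of the other six categories; the rates are copied from the category_return output, and the variable names are chosen here for illustration.

```r
# Return rates copied from the category_return table above.
return_rate <- c(
  Fashion = 0.122, Automotive = 0.0527, Electronics = 0.0513, Home = 0.0511,
  Sports = 0.0505, Toys = 0.0490, Beauty = 0.0478
)

# Unweighted mean of the other categories and the ratio of Fashion to that mean.
other_mean <- mean(return_rate[names(return_rate) != "Fashion"])
ratio <- return_rate[["Fashion"]] / other_mean

round(c(other_categories_mean = other_mean, fashion_ratio = ratio), 3)
```

Because the categories have similar order counts, the unweighted mean should be close to the pooled rate, but this is an approximation.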

5.8 Key Findings

overall_return_rate <- mean(as.numeric(df_clean$is_returned) - 1, na.rm = TRUE) * 100
avg_quantity <- mean(df_clean$quantity, na.rm = TRUE)
top_3_high_return_categories <- head(category_return %>% arrange(desc(return_rate)), 3)
best_region <- region_qty$region[which.max(region_qty$avg_quantity)]
delivery_return_correlation <- cor(
  as.numeric(df_clean$delivery_days),
  as.numeric(df_clean$is_returned) - 1,
  use = "complete.obs"
)

cat("Overall return rate:", round(overall_return_rate, 2), "%\n")
## Overall return rate: 6.06 %
cat("Average quantity per order:", round(avg_quantity, 2), "\n")
## Average quantity per order: 3.09
cat("Correlation between delivery days and return:", round(delivery_return_correlation, 4), "\n")
## Correlation between delivery days and return: -0.0058
cat("Best region for average sales quantity:", as.character(best_region), "\n")
## Best region for average sales quantity: Africa
cat("\nTop 3 high-return categories:\n")
## 
## Top 3 high-return categories:
top_3_high_return_categories
## # A tibble: 3 × 4
##   product_category return_rate total_sales avg_quantity
##   <fct>                  <dbl>       <dbl>        <dbl>
## 1 Fashion               0.122    10497301.         3.07
## 2 Automotive            0.0527   10575483.         3.11
## 3 Electronics           0.0513   10488699.         3.09

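Based on the EDA, several variables are carried forward as candidate predictors for the two objectives. For return behavior classification, product category shows the clearest signal (Fashion's return rate is about 12.2%, against an overall rate of 6.06%). Delivery days, discount level, and customer rating are also retained, although the linear correlation between delivery days and return is close to zero (-0.0058). For sales quantity prediction, product price, discount percentage, region, month, and product category are candidate explanatory variables, even though average quantity varies little across categories. These findings provide a starting point for feature selection and model development in the next stage.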

5.9 Conclusion

Overall, the data cleaning and exploratory analysis process successfully transformed the raw e-commerce sales data into a cleaner and more reliable dataset for further analysis. The cleaning process improved data quality by handling missing values, correcting data types, flagging abnormal records, checking revenue consistency, and removing duplicate records.

The EDA results helped us better understand the main patterns in the dataset, especially in relation to return behavior and sales quantity. These findings support the two project objectives: using is_returned for return behavior classification and using quantity for sales quantity prediction. Therefore, the cleaned dataset is ready to be used in the next stage for model training, evaluation, and business interpretation.

6 Model 1: Regression / Prediction

6.1 Research Topic

Research on E-commerce Sales Quantity Prediction based on Multi-dimensional Features

This model focuses on predicting sales quantity to find growth drivers that can help businesses boost sales and manage stock efficiently. In e-commerce operations, accurate quantity prediction is important because it supports inventory planning, demand forecasting, promotion strategy, and regional stock allocation. By using multi-dimensional features such as product category, product price, discount percentage, region, payment method, delivery days, customer rating, and order time, the model can estimate expected order quantity and identify factors that may influence sales growth.

6.2 Problem Definition

The regression task uses quantity as the target variable. The objective is to predict how many units are likely to be sold for each order based on product, customer, pricing, delivery, rating, payment, regional, and time-related features.

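Unlike the classification model, which predicts return status, this regression model predicts a numerical value. Strictly speaking, quantity is an integer count (1 to 6 in this dataset), but it is treated as a numeric target here. The business value of this model is to help e-commerce companies: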

  • identify features that drive higher sales quantity;
  • support better stock and inventory planning;
  • improve product category and regional sales strategies;
  • understand how price and discount decisions relate to demand.

6.3 Target Variable and Features

The target variable is quantity. The predictors are selected from available multi-dimensional transaction features. Variables such as order_id, customer_id, and revenue are excluded because they are either identifiers or directly related to the target variable. Excluding revenue is important because revenue is calculated using quantity, so including it would cause data leakage.
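The leakage risk described above can be illustrated with a small sketch. It assumes revenue follows the same formula used for total_sales earlier (quantity × price × (1 − discount/100)); the exact definition of the revenue column is an assumption here, and the toy data frame is simulated rather than taken from the project.

```r
set.seed(1)
n <- 200

# Simulated orders; column names mirror the dataset, values are made up
toy <- data.frame(
  quantity         = sample(1:6, n, replace = TRUE),
  product_price    = runif(n, min = 5, max = 500),
  discount_percent = sample(c(0, 5, 10, 20), n, replace = TRUE)
)

# Assumed revenue definition (hypothetical, modelled on the total_sales formula)
toy$revenue <- with(toy, quantity * product_price * (1 - discount_percent / 100))

# With revenue, price, and discount available, quantity is recovered exactly,
# so a model given revenue would simply learn this identity
recovered <- with(toy, revenue / (product_price * (1 - discount_percent / 100)))

print(all.equal(recovered, as.numeric(toy$quantity)))
```

Under that assumption, any model that sees revenue together with price and discount would score almost perfectly without learning anything about demand, which is why the column is dropped.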

library(tidyverse)
library(lubridate)
library(caret)
library(randomForest)
df_reg <- df_clean %>%
  select(
    product_category,
    product_price,
    order_date,
    region,
    payment_method,
    delivery_days,
    customer_rating,
    discount_percent,
    quantity
  ) %>%
  mutate(
    order_date = as.Date(order_date),
    order_month = factor(month(order_date)),
    order_weekday = factor(weekdays(order_date)),
    product_category = factor(product_category),
    region = factor(region),
    payment_method = factor(payment_method),
    quantity = as.numeric(quantity)
  ) %>%
  select(-order_date)

str(df_reg)
## 'data.frame':    100000 obs. of  10 variables:
##  $ product_category: Factor w/ 7 levels "Automotive","Beauty",..: 2 4 2 3 4 2 6 1 5 3 ...
##  $ product_price   : num  190.4 82.2 15.2 310.6 74 ...
##  $ region          : Factor w/ 6 levels "Africa","Asia",..: 3 4 5 3 1 5 1 4 4 3 ...
##  $ payment_method  : Factor w/ 4 levels "BankTransfer",..: 1 3 2 4 4 4 3 1 1 3 ...
##  $ delivery_days   : int  8 5 6 9 3 5 5 3 6 4 ...
##  $ customer_rating : num  3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
##  $ discount_percent: num  0 0 10 5 20 5 0 0 0 0 ...
##  $ quantity        : num  5 3 2 2 4 5 2 4 2 2 ...
##  $ order_month     : Factor w/ 12 levels "1","2","3","4",..: 2 10 6 7 2 5 7 5 3 2 ...
##  $ order_weekday   : Factor w/ 7 levels "Friday","Monday",..: 6 1 7 6 1 1 6 5 3 2 ...
summary(df_reg$quantity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   3.085   4.000   6.000

6.4 Train-Test Split

The dataset is divided into a training set and a testing set. The training set is used to build the models, while the testing set is used to evaluate how well the models predict unseen data.

set.seed(123)

reg_train_index <- createDataPartition(df_reg$quantity, p = 0.8, list = FALSE)
reg_train <- df_reg[reg_train_index, ]
reg_test  <- df_reg[-reg_train_index, ]

dim(reg_train)
## [1] 80002    10
dim(reg_test)
## [1] 19998    10

6.5 Model 1A: Multiple Linear Regression

Multiple Linear Regression is used as the baseline regression model. It is easy to interpret and helps explain the direction and strength of relationships between predictors and sales quantity.

lm_model <- lm(quantity ~ ., data = reg_train)

summary(lm_model)
## 
## Call:
## lm(formula = quantity ~ ., data = reg_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.08544 -1.01899 -0.00285  1.01288  2.08151 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  3.058e+00  3.720e-02  82.199   <2e-16 ***
## product_categoryBeauty      -1.966e-02  1.871e-02  -1.051   0.2933    
## product_categoryElectronics -2.493e-02  1.866e-02  -1.336   0.1816    
## product_categoryFashion     -3.915e-02  1.865e-02  -2.100   0.0358 *  
## product_categoryHome        -1.038e-02  1.866e-02  -0.556   0.5782    
## product_categorySports      -3.254e-02  1.863e-02  -1.746   0.0807 .  
## product_categoryToys        -3.291e-02  1.869e-02  -1.760   0.0784 .  
## product_price                3.056e-05  3.524e-05   0.867   0.3858    
## regionAsia                   5.502e-03  1.730e-02   0.318   0.7504    
## regionEurope                -1.392e-02  1.736e-02  -0.802   0.4224    
## regionNorth America         -6.886e-03  1.729e-02  -0.398   0.6905    
## regionOceania               -1.979e-03  1.728e-02  -0.115   0.9088    
## regionSouth America         -1.447e-02  1.737e-02  -0.833   0.4049    
## payment_methodCash           2.144e-02  1.412e-02   1.518   0.1290    
## payment_methodCreditCard     6.722e-03  1.405e-02   0.478   0.6323    
## payment_methodPayPal         4.163e-03  1.409e-02   0.295   0.7676    
## delivery_days               -1.002e-03  1.935e-03  -0.518   0.6046    
## customer_rating             -5.075e-03  5.747e-03  -0.883   0.3772    
## discount_percent            -1.044e-03  8.110e-04  -1.288   0.1978    
## order_month2                -1.801e-02  2.489e-02  -0.724   0.4693    
## order_month3                -1.379e-02  2.417e-02  -0.570   0.5684    
## order_month4                -6.175e-03  2.446e-02  -0.252   0.8007    
## order_month5                -8.263e-03  2.410e-02  -0.343   0.7316    
## order_month6                -1.627e-02  2.434e-02  -0.668   0.5039    
## order_month7                -1.700e-02  2.428e-02  -0.700   0.4838    
## order_month8                -2.404e-03  2.417e-02  -0.099   0.9208    
## order_month9                -3.633e-02  2.438e-02  -1.490   0.1362    
## order_month10               -3.590e-02  2.420e-02  -1.484   0.1379    
## order_month11                1.001e+00  2.438e-02  41.044   <2e-16 ***
## order_month12               -6.896e-03  2.421e-02  -0.285   0.7757    
## order_weekdayMonday         -1.518e-02  1.871e-02  -0.811   0.4173    
## order_weekdaySaturday        1.384e-02  1.869e-02   0.741   0.4590    
## order_weekdaySunday          7.496e-03  1.876e-02   0.400   0.6895    
## order_weekdayThursday        5.666e-03  1.866e-02   0.304   0.7614    
## order_weekdayTuesday        -4.613e-03  1.866e-02  -0.247   0.8047    
## order_weekdayWednesday      -1.257e-02  1.867e-02  -0.673   0.5009    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.41 on 79966 degrees of freedom
## Multiple R-squared:  0.03802,    Adjusted R-squared:  0.0376 
## F-statistic: 90.31 on 35 and 79966 DF,  p-value: < 2.2e-16
lm_pred <- predict(lm_model, newdata = reg_test)

lm_results <- data.frame(
  Actual = reg_test$quantity,
  Predicted = lm_pred
)

lm_rmse <- RMSE(lm_results$Predicted, lm_results$Actual)
lm_mae <- MAE(lm_results$Predicted, lm_results$Actual)
lm_r2 <- R2(lm_results$Predicted, lm_results$Actual)

cat("Linear Regression RMSE:", round(lm_rmse, 4), "\n")
## Linear Regression RMSE: 1.4171
cat("Linear Regression MAE:", round(lm_mae, 4), "\n")
## Linear Regression MAE: 1.2062
cat("Linear Regression R-squared:", round(lm_r2, 4), "\n")
## Linear Regression R-squared: 0.0356

6.6 Model 1B: Random Forest Regression

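Random Forest Regression is used as a more flexible machine learning model. It can capture non-linear relationships and interactions between predictors, which may be useful because e-commerce sales quantity can be influenced by complex combinations of price, discount, product category, region, and customer experience factors. To keep training time manageable, the model is fitted on a random subsample of at most 20,000 training rows with 100 trees, so its results are not based on the full training set.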

set.seed(123)

rf_reg_train <- reg_train %>%
  slice_sample(n = min(20000, nrow(reg_train)))

rf_reg_model <- randomForest(
  quantity ~ .,
  data = rf_reg_train,
  ntree = 100,
  importance = TRUE
)

rf_reg_model
## 
## Call:
##  randomForest(formula = quantity ~ ., data = rf_reg_train, ntree = 100,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 2.067228
##                     % Var explained: 0.53
rf_reg_pred <- predict(rf_reg_model, newdata = reg_test)

rf_reg_results <- data.frame(
  Actual = reg_test$quantity,
  Predicted = rf_reg_pred
)

rf_reg_rmse <- RMSE(rf_reg_results$Predicted, rf_reg_results$Actual)
rf_reg_mae <- MAE(rf_reg_results$Predicted, rf_reg_results$Actual)
rf_reg_r2 <- R2(rf_reg_results$Predicted, rf_reg_results$Actual)

cat("Random Forest Regression RMSE:", round(rf_reg_rmse, 4), "\n")
## Random Forest Regression RMSE: 1.4317
cat("Random Forest Regression MAE:", round(rf_reg_mae, 4), "\n")
## Random Forest Regression MAE: 1.2346
cat("Random Forest Regression R-squared:", round(rf_reg_r2, 4), "\n")
## Random Forest Regression R-squared: 0.0236

6.7 Regression Model Comparison

The regression models are evaluated using RMSE, MAE, and R-squared. RMSE and MAE measure prediction error, where smaller values indicate better prediction accuracy. R-squared measures how much variation in sales quantity can be explained by the model, where a higher value indicates stronger explanatory power.
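As a reference for how these three metrics are computed, the sketch below defines them in base R on a made-up six-value example. It also shows two common definitions of R-squared: the squared correlation between predictions and observations, and the "traditional" 1 − SSE/SST form. As far as the caret documentation describes, R2() uses the squared-correlation form by default, which can look more favourable than the traditional form; the helper names below are hypothetical and not part of caret.

```r
# Hypothetical helper functions, written in base R for illustration
rmse    <- function(pred, obs) sqrt(mean((pred - obs)^2))
mae     <- function(pred, obs) mean(abs(pred - obs))
r2_corr <- function(pred, obs) cor(pred, obs)^2
r2_trad <- function(pred, obs) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Made-up observed quantities and predictions that hug the middle of the range
obs  <- c(1, 2, 3, 4, 5, 6)
pred <- c(3, 3, 3, 4, 4, 4)

metrics <- c(
  RMSE    = rmse(pred, obs),
  MAE     = mae(pred, obs),
  R2_corr = r2_corr(pred, obs),
  R2_trad = r2_trad(pred, obs)
)
print(round(metrics, 4))
```

In this toy case the two R-squared definitions differ noticeably, because squared correlation ignores how far the predictions are compressed toward the mean.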

regression_results <- data.frame(
  Model = c("Multiple Linear Regression", "Random Forest Regression"),
  RMSE = c(lm_rmse, rf_reg_rmse),
  MAE = c(lm_mae, rf_reg_mae),
  R_squared = c(lm_r2, rf_reg_r2)
)

knitr::kable(
  regression_results,
  digits = 4,
  caption = "Performance Comparison of Regression Models"
)
Performance Comparison of Regression Models
Model                        RMSE     MAE      R_squared
Multiple Linear Regression   1.4171   1.2062   0.0356
Random Forest Regression     1.4317   1.2346   0.0236
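One way to put these error values in context is to compare them with a naive baseline that always predicts the training mean. The sketch below uses simulated quantities (integers from 1 to 6 drawn uniformly, an assumption made only for illustration) rather than the project data.

```r
set.seed(42)

# Simulated train/test targets; the uniform 1..6 distribution is an assumption
y_train <- sample(1:6, 800, replace = TRUE)
y_test  <- sample(1:6, 200, replace = TRUE)

# Naive baseline: predict the training mean for every test order
baseline_pred <- rep(mean(y_train), length(y_test))

baseline_rmse <- sqrt(mean((y_test - baseline_pred)^2))
baseline_mae  <- mean(abs(y_test - baseline_pred))

print(round(c(RMSE = baseline_rmse, MAE = baseline_mae), 4))
```

In the project itself, the same comparison would use mean(reg_train$quantity) as the prediction for every row of reg_test. A model whose RMSE is only marginally below that baseline adds little predictive value.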

6.8 Actual vs Predicted Quantity

The following plots compare actual sales quantity with predicted sales quantity. Points closer to the diagonal line indicate more accurate predictions.

lm_plot_data <- lm_results %>%
  mutate(Model = "Multiple Linear Regression")

rf_plot_data <- rf_reg_results %>%
  mutate(Model = "Random Forest Regression")

bind_rows(lm_plot_data, rf_plot_data) %>%
  ggplot(aes(x = Actual, y = Predicted)) +
  geom_point(alpha = 0.25, color = "steelblue") +
  geom_abline(intercept = 0, slope = 1, color = "red", linewidth = 1) +
  facet_wrap(~Model) +
  labs(
    title = "Actual vs Predicted Sales Quantity",
    x = "Actual Quantity",
    y = "Predicted Quantity"
  ) +
  theme_minimal()

6.9 Feature Importance for Sales Quantity Prediction

Feature importance from the Random Forest Regression model is used to identify the most influential growth drivers for sales quantity prediction. Variables with higher importance contribute more to reducing prediction error in the model.

varImpPlot(
  rf_reg_model,
  main = "Variable Importance for Sales Quantity Prediction"
)
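The idea behind permutation-style importance (the basis of the %IncMSE measure, as far as the randomForest documentation describes it) can be sketched without the package: shuffle one predictor, re-score the model, and record how much the error grows. The example below uses a simple linear model and simulated data in which x2 is pure noise by construction; all names are hypothetical.

```r
set.seed(11)
n <- 400

# Simulated data: y depends on x1 only, x2 is noise by construction
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

train <- toy[1:300, ]
test  <- toy[301:400, ]

fit <- lm(y ~ x1 + x2, data = train)

# Mean squared error of a fitted model on a given data frame
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
base_mse <- mse(fit, test)

# Increase in test MSE after shuffling each predictor in turn
perm_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(fit, shuffled) - base_mse
})

print(round(perm_importance, 3))
```

A predictor whose shuffling barely changes the error contributes little to the predictions, which is a useful check when every importance value in a plot is small.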

6.10 Discussion

The regression models provide a data-driven approach for predicting e-commerce sales quantity using multi-dimensional features. Multiple Linear Regression acts as an interpretable baseline model, while Random Forest Regression captures more complex relationships among predictors.

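In this project, the Random Forest model did not improve on the linear baseline: its test RMSE (1.4317) and MAE (1.2346) are slightly higher than those of Multiple Linear Regression (1.4171 and 1.2062), and its R-squared is lower (0.0236 versus 0.0356). This suggests that the available features contain little non-linear or interaction structure for the Random Forest to exploit, although the model was also trained on a 20,000-row subsample with only 100 trees, which may limit it. Both models explain less than 4% of the variation in quantity.

The linear model points to one clear driver: order_month11 has an estimate of about 1.0 (p < 2e-16), meaning that November orders average roughly one unit more than January orders (the reference month) with the other predictors held fixed. Apart from a small negative Fashion effect (p = 0.036), product category, product price, discount percentage, region, payment method, delivery days, customer rating, and weekday are not significant at the 5% level. The feature importance results should therefore be read with caution: they rank features within a model that explains very little, so they indicate where to look further rather than confirm growth drivers.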
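Whether the effect of discount differs across product categories can be checked directly in a linear model by adding an interaction term and comparing the nested fits. This is a minimal sketch on simulated data, where the interaction is built in by construction; the two-level category and all coefficients are illustrative assumptions.

```r
set.seed(7)
n <- 500

# Simulated orders: discount raises quantity only in category "B" (by construction)
toy <- data.frame(
  category = factor(sample(c("A", "B"), n, replace = TRUE)),
  discount = sample(c(0, 5, 10, 20), n, replace = TRUE)
)
toy$quantity <- 3 + ifelse(toy$category == "B", 0.05 * toy$discount, 0) + rnorm(n)

# Main-effects model versus model with a category-by-discount interaction
fit_main <- lm(quantity ~ category + discount, data = toy)
fit_int  <- lm(quantity ~ category * discount, data = toy)

# F-test for whether the interaction term improves the fit
cmp <- anova(fit_main, fit_int)
print(cmp)
```

Applied to reg_train, the analogous comparison would be between quantity ~ product_category + discount_percent and quantity ~ product_category * discount_percent.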

6.11 Conclusion for Regression / Prediction

This section developed sales quantity prediction models using regression techniques. The target variable was quantity, and the predictors included product, price, discount, region, payment, delivery, rating, and time-based features. The models were evaluated using RMSE, MAE, and R-squared.

The regression analysis supports the research goal of predicting sales quantity to identify growth drivers and improve stock management. By understanding which features are most important for quantity prediction, e-commerce businesses can make better decisions about inventory planning, sales promotion, category strategy, and regional demand management.

7 Model 2: Classification

7.1 Problem Definition

This section focuses on the classification task of predicting whether an order will be returned in an e-commerce environment.

7.2 Target Variable and Features

The target variable is is_returned, where “Yes” indicates a returned order and “No” indicates a non-returned order.

# The required packages were installed before running the analysis.
library(tidyverse)
library(lubridate)
library(caret)
library(pROC)
library(randomForest)
# Build classification dataset from the cleaned dataset created in earlier sections

df_cls <- df_clean %>%
  select(
    product_category,
    product_price,
    quantity,
    order_date,
    region,
    payment_method,
    delivery_days,
    customer_rating,
    discount_percent,
    is_returned
  )%>%
  mutate(
    order_date = as.Date(order_date),
    order_month = factor(month(order_date)),
    order_weekday = factor(weekdays(order_date)),
    is_returned = factor(ifelse(is_returned == 1, "Yes", "No"),
                         levels = c("No", "Yes")),
    product_category = factor(product_category),
    region = factor(region),
    payment_method = factor(payment_method)
  ) %>%
  select(-order_date)

str(df_cls)
## 'data.frame':    100000 obs. of  11 variables:
##  $ product_category: Factor w/ 7 levels "Automotive","Beauty",..: 2 4 2 3 4 2 6 1 5 3 ...
##  $ product_price   : num  190.4 82.2 15.2 310.6 74 ...
##  $ quantity        : int  5 3 2 2 4 5 2 4 2 2 ...
##  $ region          : Factor w/ 6 levels "Africa","Asia",..: 3 4 5 3 1 5 1 4 4 3 ...
##  $ payment_method  : Factor w/ 4 levels "BankTransfer",..: 1 3 2 4 4 4 3 1 1 3 ...
##  $ delivery_days   : int  8 5 6 9 3 5 5 3 6 4 ...
##  $ customer_rating : num  3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
##  $ discount_percent: num  0 0 10 5 20 5 0 0 0 0 ...
##  $ is_returned     : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 1 1 1 1 1 ...
##  $ order_month     : Factor w/ 12 levels "1","2","3","4",..: 2 10 6 7 2 5 7 5 3 2 ...
##  $ order_weekday   : Factor w/ 7 levels "Friday","Monday",..: 6 1 7 6 1 1 6 5 3 2 ...

7.3 Class Distribution

Before model training, the class distribution is checked to understand whether return cases are balanced with non-return cases. This step is important because highly imbalanced data may lead to misleading accuracy and weak detection of returned orders.

table(df_cls$is_returned)
## 
##    No   Yes 
## 93940  6060
prop.table(table(df_cls$is_returned))
## 
##     No    Yes 
## 0.9394 0.0606

7.4 Train-Test Split and Class Balancing

The dataset is divided into a training set and a testing set using stratified sampling. To improve the model’s ability to identify return cases, down-sampling is applied to the training set so that the two classes become more balanced.

set.seed(123)

train_index <- createDataPartition(df_cls$is_returned, p = 0.8, list = FALSE)
train_data <- df_cls[train_index, ]
test_data  <- df_cls[-train_index, ]

# Check class proportions before balancing
prop.table(table(train_data$is_returned))
## 
##     No    Yes 
## 0.9394 0.0606
prop.table(table(test_data$is_returned))
## 
##     No    Yes 
## 0.9394 0.0606
# Down-sampling on training data only
set.seed(123)
train_bal <- downSample(
  x = train_data %>% select(-is_returned),
  y = train_data$is_returned,
  yname = "is_returned"
)

table(train_bal$is_returned)
## 
##   No  Yes 
## 4848 4848
prop.table(table(train_bal$is_returned))
## 
##  No Yes 
## 0.5 0.5

7.5 Model 2A: Logistic Regression

Logistic Regression is used as the baseline classification model because it is widely used, easy to interpret, and suitable for binary outcome prediction. It provides a useful benchmark for comparing with more advanced models.

log_model <- glm(
  is_returned ~ .,
  data = train_bal,
  family = binomial
)

summary(log_model)
## 
## Call:
## glm(formula = is_returned ~ ., family = binomial, data = train_bal)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -0.2204184  0.1628072  -1.354   0.1758    
## product_categoryBeauty      -0.1235981  0.0801757  -1.542   0.1232    
## product_categoryElectronics  0.0305900  0.0791179   0.387   0.6990    
## product_categoryFashion      0.9060763  0.0735131  12.325   <2e-16 ***
## product_categoryHome        -0.1015565  0.0795870  -1.276   0.2019    
## product_categorySports      -0.0156095  0.0798005  -0.196   0.8449    
## product_categoryToys        -0.1079902  0.0795688  -1.357   0.1747    
## product_price               -0.0000821  0.0001457  -0.563   0.5732    
## quantity                    -0.0236625  0.0147006  -1.610   0.1075    
## regionAsia                   0.0547191  0.0732093   0.747   0.4548    
## regionEurope                 0.1091548  0.0736951   1.481   0.1386    
## regionNorth America          0.0680017  0.0726946   0.935   0.3496    
## regionOceania                0.1172048  0.0731439   1.602   0.1091    
## regionSouth America          0.0343883  0.0727422   0.473   0.6364    
## payment_methodCash           0.0981473  0.0586821   1.673   0.0944 .  
## payment_methodCreditCard     0.0146061  0.0584686   0.250   0.8027    
## payment_methodPayPal         0.1016816  0.0587932   1.729   0.0837 .  
## delivery_days                0.0013527  0.0080947   0.167   0.8673    
## customer_rating              0.0164227  0.0238292   0.689   0.4907    
## discount_percent            -0.0002545  0.0034008  -0.075   0.9403    
## order_month2                 0.0972665  0.1036660   0.938   0.3481    
## order_month3                -0.0751276  0.1017743  -0.738   0.4604    
## order_month4                -0.0084290  0.1007660  -0.084   0.9333    
## order_month5                 0.0223852  0.1005536   0.223   0.8238    
## order_month6                 0.1085186  0.1014507   1.070   0.2848    
## order_month7                -0.0211145  0.1018124  -0.207   0.8357    
## order_month8                 0.0792844  0.1009125   0.786   0.4321    
## order_month9                -0.0210754  0.1019059  -0.207   0.8362    
## order_month10                0.1060466  0.1015553   1.044   0.2964    
## order_month11                0.0726548  0.1029465   0.706   0.4803    
## order_month12                0.1772374  0.1004519   1.764   0.0777 .  
## order_weekdayMonday         -0.0777023  0.0774448  -1.003   0.3157    
## order_weekdaySaturday       -0.0614101  0.0769369  -0.798   0.4248    
## order_weekdaySunday         -0.1363081  0.0773793  -1.762   0.0781 .  
## order_weekdayThursday       -0.0585028  0.0770337  -0.759   0.4476    
## order_weekdayTuesday         0.0679177  0.0776318   0.875   0.3816    
## order_weekdayWednesday      -0.1172153  0.0777133  -1.508   0.1315    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13442  on 9695  degrees of freedom
## Residual deviance: 13052  on 9659  degrees of freedom
## AIC: 13126
## 
## Number of Fisher Scoring iterations: 4
# Predict return probability
log_prob <- predict(log_model, newdata = test_data, type = "response")

# Convert probability to class label using 0.5 threshold
log_pred <- ifelse(log_prob >= 0.5, "Yes", "No")
log_pred <- factor(log_pred, levels = c("No", "Yes"))

# Confusion matrix
log_cm <- confusionMatrix(
  data = log_pred,
  reference = test_data$is_returned,
  positive = "Yes"
)

log_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  15119   810
##        Yes  3669   402
##                                           
##                Accuracy : 0.776           
##                  95% CI : (0.7702, 0.7818)
##     No Information Rate : 0.9394          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0648          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.33168         
##             Specificity : 0.80472         
##          Pos Pred Value : 0.09875         
##          Neg Pred Value : 0.94915         
##              Prevalence : 0.06060         
##          Detection Rate : 0.02010         
##    Detection Prevalence : 0.20355         
##       Balanced Accuracy : 0.56820         
##                                           
##        'Positive' Class : Yes             
## 
log_roc <- roc(
  response = test_data$is_returned,
  predictor = log_prob,
  levels = c("No", "Yes")
)

log_auc <- as.numeric(auc(log_roc))
log_auc
## [1] 0.5714915
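Because the model is fitted on a balanced training set but scored on the original class mix, its predicted probabilities are inflated relative to the true return rate. A standard prior-shift correction for under-sampled negatives can be sketched as follows, where beta is the fraction of "No" rows kept by down-sampling. The function name correct_prob is hypothetical, and beta is derived here from the training proportions printed earlier (0.0606 and 0.9394).

```r
# Map a probability from a model trained on down-sampled data back to the
# original class mix. beta = share of majority-class rows that were kept.
correct_prob <- function(p_sampled, beta) {
  beta * p_sampled / (beta * p_sampled - p_sampled + 1)
}

# Keeping as many "No" rows as there are "Yes" rows means beta = P(Yes) / P(No)
beta <- 0.0606 / 0.9394

# A score of 0.5 on the balanced scale corresponds to the base return rate
print(correct_prob(0.5, beta))

# The mapping is monotone, so rankings (and AUC) are unchanged
p <- c(0.2, 0.4, 0.6, 0.8)
print(round(correct_prob(p, beta), 4))
```

This does not make the classifier better at separating the classes; it only puts the probabilities on a scale where a threshold can be chosen against realistic return rates instead of the default 0.5.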

7.6 Model 2B: Random Forest

Random Forest is used as a second classification model because it can capture nonlinear relationships and interactions among variables. It is also useful for identifying the relative importance of predictors in return classification.

set.seed(123)

rf_model <- randomForest(
  is_returned ~ .,
  data = train_bal,
  ntree = 200,
  importance = TRUE
)

rf_model
## 
## Call:
##  randomForest(formula = is_returned ~ ., data = train_bal, ntree = 200,      importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 200
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 44.62%
## Confusion matrix:
##       No  Yes class.error
## No  2780 2068   0.4265677
## Yes 2258 2590   0.4657591
# Predict return class
rf_pred <- predict(rf_model, newdata = test_data, type = "class")

# Confusion matrix
rf_cm <- confusionMatrix(
  data = rf_pred,
  reference = test_data$is_returned,
  positive = "Yes"
)

rf_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No  10906   598
##        Yes  7882   614
##                                           
##                Accuracy : 0.576           
##                  95% CI : (0.5691, 0.5829)
##     No Information Rate : 0.9394          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0228          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.50660         
##             Specificity : 0.58048         
##          Pos Pred Value : 0.07227         
##          Neg Pred Value : 0.94802         
##              Prevalence : 0.06060         
##          Detection Rate : 0.03070         
##    Detection Prevalence : 0.42480         
##       Balanced Accuracy : 0.54354         
##                                           
##        'Positive' Class : Yes             
## 
# Predict return probability
rf_prob <- predict(rf_model, newdata = test_data, type = "prob")[, "Yes"]

rf_roc <- roc(
  response = test_data$is_returned,
  predictor = rf_prob,
  levels = c("No", "Yes")
)

rf_auc <- as.numeric(auc(rf_roc))
rf_auc
## [1] 0.5707515

7.7 Model Comparison

To compare the two models fairly, several evaluation metrics are used, including Accuracy, Precision, Recall, F1-score, and AUC. Since the practical goal is to identify returned orders effectively, Recall and F1-score are especially important.

results_table <- data.frame(
  Model = c("Logistic Regression", "Random Forest"),
  Accuracy = c(log_cm$overall["Accuracy"], rf_cm$overall["Accuracy"]),
  Precision = c(log_cm$byClass["Pos Pred Value"], rf_cm$byClass["Pos Pred Value"]),
  Recall = c(log_cm$byClass["Sensitivity"], rf_cm$byClass["Sensitivity"]),
  F1 = c(log_cm$byClass["F1"], rf_cm$byClass["F1"]),
  AUC = c(log_auc, rf_auc)
)

knitr::kable(results_table, digits = 4, caption = "Performance Comparison of Classification Models")
Performance Comparison of Classification Models
Model                 Accuracy   Precision   Recall   F1       AUC
Logistic Regression   0.776      0.0987      0.3317   0.1522   0.5715
Random Forest         0.576      0.0723      0.5066   0.1265   0.5708
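The Logistic Regression row can be reproduced by hand from its confusion matrix, which makes the relationship between the metrics explicit. The counts below are copied from the log_cm output; the variable names are hypothetical.

```r
# Counts from the Logistic Regression confusion matrix (positive class = "Yes")
tp <- 402    # predicted Yes, actually Yes
fp <- 3669   # predicted Yes, actually No
fn <- 810    # predicted No,  actually Yes
tn <- 15119  # predicted No,  actually No

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)          # Pos Pred Value in caret's output
recall    <- tp / (tp + fn)          # Sensitivity in caret's output
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1), 4))
```

The low precision shows that roughly nine out of ten orders flagged as likely returns are in fact not returned, which matters if each flag triggers a costly intervention.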

7.8 ROC Curve

The ROC curve is plotted to compare the overall discrimination ability of the two models. A model with a larger AUC is considered to have stronger classification performance.

plot(log_roc, col = "blue", main = "ROC Curve Comparison")
plot(rf_roc, col = "red", add = TRUE)
legend(
  "bottomright",
  legend = c("Logistic Regression", "Random Forest"),
  col = c("blue", "red"),
  lwd = 2
)
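AUC can be read as the probability that a randomly chosen returned order receives a higher score than a randomly chosen non-returned order (ties counting as one half). The sketch below computes it from ranks in base R on four made-up scores; auc_rank is a hypothetical helper and not part of pROC.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic, scaled to [0, 1])
auc_rank <- function(score, label, positive = "Yes") {
  pos <- score[label == positive]
  neg <- score[label != positive]
  r <- rank(c(pos, neg))                       # average ranks handle ties
  u <- sum(r[seq_along(pos)]) - length(pos) * (length(pos) + 1) / 2
  u / (length(pos) * length(neg))
}

# Made-up scores: three of the four (Yes, No) pairs are ordered correctly
score <- c(0.9, 0.8, 0.3, 0.2)
label <- c("Yes", "No", "Yes", "No")

print(auc_rank(score, label))
```

An AUC near 0.57 therefore means that the model ranks a returned order above a non-returned one only about 57% of the time, not far from a coin flip.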

7.9 Variable Importance

Variable importance is examined based on the Random Forest model. This helps explain which factors contribute most to return behavior and provides useful business insights for return risk management.

varImpPlot(rf_model)

7.10 Discussion

The classification results show that both Logistic Regression and Random Forest provided only limited predictive performance for return behavior. The two models achieved almost identical AUC values (0.5715 for Logistic Regression and 0.5708 for Random Forest), and both are only slightly above 0.5. This indicates that the selected predictors have limited ability to distinguish returned from non-returned orders.

The weak model performance may be related to class imbalance and the limited set of available features. Return behavior may also depend on factors that are not included in the dataset, such as product quality, customer expectations, item fit, and after-sales experience.

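Even so, the model outputs still provide some business insight. In the logistic regression, the Fashion category is the only predictor significant at the 5% level (estimate 0.906, p < 2e-16), which is consistent with its elevated return rate in the EDA; delivery days, discount, and customer rating do not show a statistically significant association there. The Random Forest variable importance plot should be read alongside this result. Together, these findings can support return-risk monitoring, with category management, and Fashion in particular, as the most evident starting point.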

7.11 Conclusion for Classification

This section developed two classification models to predict whether an order would be returned. Logistic Regression was used as a baseline model, while Random Forest was used as a more flexible machine learning model. The models were evaluated using Accuracy, Balanced Accuracy, Precision, Recall, F1-score, and AUC.

Although both models provided some useful insights into return behavior, their predictive performance was relatively weak. This suggests that the available variables explain only part of return behavior, and additional features or further model tuning may be needed to improve classification performance. Even so, the analysis still offers practical value for understanding return-related patterns in e-commerce.

8 Model Evaluation

classification_summary <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1", "AUC"),
  # unname() drops the element names so they are not reused as row names
  Logistic_Regression = round(unname(c(
    log_cm$overall["Accuracy"],
    log_cm$byClass["Pos Pred Value"],
    log_cm$byClass["Sensitivity"],
    log_cm$byClass["F1"],
    log_auc
  )), 4),
  Random_Forest = round(unname(c(
    rf_cm$overall["Accuracy"],
    rf_cm$byClass["Pos Pred Value"],
    rf_cm$byClass["Sensitivity"],
    rf_cm$byClass["F1"],
    rf_auc
  )), 4)
)

kable(classification_summary, caption = "Table 1: Classification Model Performance Comparison")
Table 1: Classification Model Performance Comparison
Metric      Logistic_Regression   Random_Forest
Accuracy    0.7760                0.5760
Precision   0.0987                0.0723
Recall      0.3317                0.5066
F1          0.1522                0.1265
AUC         0.5715                0.5708
regression_summary <- data.frame(
  Model = c("Linear Regression", "Random Forest"),
  RMSE = c(round(lm_rmse, 4), round(rf_reg_rmse, 4)),
  MAE = c(round(lm_mae, 4), round(rf_reg_mae, 4)),
  R_squared = c(round(lm_r2, 4), round(rf_reg_r2, 4))
)

kable(regression_summary, caption = "Table 2: Regression Model Performance Comparison")
Table 2: Regression Model Performance Comparison
Model               RMSE     MAE      R_squared
Linear Regression   1.4171   1.2062   0.0356
Random Forest       1.4317   1.2346   0.0236

9 Discussion and Conclusion

9.1 Classification Discussion

The classification results indicate that both models have limited predictive ability for return behavior. Although Logistic Regression achieved an accuracy of 77.6%, this metric alone does not reliably reflect model performance because about 94% of orders are non-returns: a model that always predicts “non-return” would reach 93.94% accuracy (the No Information Rate in the confusion matrix output) while detecting no returns at all. In this imbalanced setting, Precision, Recall, F1, and AUC are more meaningful evaluation metrics.

Logistic Regression slightly outperformed Random Forest in Precision (0.0987 vs. 0.0723), F1 (0.1522 vs. 0.1265), and AUC (0.5715 vs. 0.5708), indicating more balanced overall performance. Random Forest achieved higher Recall (0.5066 vs. 0.3317), meaning it identifies more of the orders that were actually returned, but at the cost of more false positives. Overall, Logistic Regression can be considered the better classification model in this project. However, the AUC values of both models are only slightly above 0.5, the level expected from random guessing, which indicates weak ability to distinguish between returned and non-returned orders.
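As a quick consistency check, F1 can be recomputed from the Precision and Recall values in Table 1, since F1 is their harmonic mean. The sketch below uses the rounded table values, so small differences in the fourth decimal place are expected.

```r
# Precision and recall as reported in Table 1 (already rounded to 4 decimals)
precision <- c(Logistic_Regression = 0.0987, Random_Forest = 0.0723)
recall    <- c(Logistic_Regression = 0.3317, Random_Forest = 0.5066)

# F1 is the harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 3)
```

Because the harmonic mean is pulled toward the smaller of the two values, the very low precision of both models keeps F1 low even where Recall is moderate.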

The limited model performance may be attributed to two main factors:

  1. The proportion of returned orders is low, so the classes are clearly imbalanced and the models are more likely to predict “non-return”.
  2. Return behavior is typically influenced by factors such as product quality, size fit, customer expectations, and after-sales experience, none of which are included in the current dataset.

Nevertheless, delivery days, discount percentage, customer rating, and product category still show some association with return behavior, which offers useful input for return management.
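One common way to address the class imbalance described above is to downsample the majority class in the training data before fitting. The following is a minimal base R sketch on simulated data. The data frame `train`, the column names `is_returned` and `delivery_days`, and the 8% return rate are all hypothetical, and this is not necessarily the approach used earlier in this project.

```r
set.seed(123)

# Hypothetical training data; column names and return rate are assumptions
train <- data.frame(
  delivery_days = sample(1:10, 2000, replace = TRUE),
  is_returned   = factor(
    sample(c("No", "Yes"), 2000, replace = TRUE, prob = c(0.92, 0.08)),
    levels = c("No", "Yes")
  )
)

# Keep every returned order and draw an equal number of non-returned orders
yes_idx <- which(train$is_returned == "Yes")
no_idx  <- which(train$is_returned == "No")
keep_no <- sample(no_idx, size = length(yes_idx))

train_balanced <- train[c(yes_idx, keep_no), ]
table(train_balanced$is_returned)
```

Resampling of this kind should be applied to the training set only, so that the test set keeps the original class proportions and the reported metrics stay realistic.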

9.2 Regression Discussion

The regression results similarly show that both models have limited predictive ability for order quantity. Linear Regression slightly outperformed Random Forest in RMSE (1.4171 vs. 1.4317), MAE (1.2062 vs. 1.2346), and R² (0.0356 vs. 0.0236), and can therefore be considered the better regression model in this project. However, the R² values of both models are close to zero, indicating that the available features explain almost none of the variation in order quantity. Additionally, an RMSE of about 1.42 units and an MAE of about 1.21 units per order are relatively large prediction errors, which limits the practical usefulness of the predictions.
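To put the metrics in Table 2 in context, it can help to compare them with a baseline that predicts the mean quantity for every order. The sketch below defines a hypothetical helper, `regression_metrics()`, and applies it to simulated quantities. The 1 to 5 quantity range is an assumption, and the report's own `lm_r2` and `rf_reg_r2` may have been computed differently (for example, as a squared correlation).

```r
# Hypothetical helper computing the three metrics reported in Table 2
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  c(
    RMSE      = sqrt(mean(residuals^2)),
    MAE       = mean(abs(residuals)),
    R_squared = 1 - sum(residuals^2) / sum((actual - mean(actual))^2)
  )
}

# Simulated order quantities between 1 and 5 (assumed range, illustration only)
set.seed(1)
quantity <- sample(1:5, 500, replace = TRUE)

# Baseline: predict the mean quantity for every order
mean_only <- rep(mean(quantity), length(quantity))
regression_metrics(quantity, mean_only)
```

By construction, the mean-only baseline has an R² of zero on the data it was computed from. A fitted model whose R² is close to zero therefore has an RMSE close to this baseline's, meaning it adds little beyond predicting the average.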

An important reason for this outcome is that the current data lacks key features closely related to purchase quantity, such as marketing activities, customer loyalty, inventory status, competitor pricing, and seasonal demand. Therefore, the existing variables cannot adequately capture the true drivers of order quantity. Despite the limited predictive performance, variables such as product category, discount percentage, price, and region still provide some business insights that can support inventory management and promotion strategy optimization.

9.3 Recommendations

9.3.1 Reducing Return Rate

  1. Shorten delivery time
  2. Provide accurate product descriptions and size guides
  3. Avoid excessive discounts (to prevent impulse purchases followed by returns)
  4. Monitor the quality of product categories with high return rates

9.3.2 Increasing Sales Quantity

  1. Pay attention to performance differences across product categories
  2. Allocate inventory based on regional demand patterns
  3. Leverage seasonal trends

9.3.3 Data Collection Recommendations

  1. Record return reasons
  2. Track customer browsing time on product pages
  3. Collect customer historical purchase behavior
  4. Monitor competitor pricing

9.4 Overall Conclusion and Final Statement

This project completed two machine learning tasks based on an e-commerce dataset: return behavior prediction (classification) and order quantity prediction (regression).

  • For the classification task: Logistic Regression performed slightly better than Random Forest, but both models showed weak discriminative ability.
  • For the regression task: Linear Regression performed slightly better than Random Forest, but both models had very limited explanatory power.

This indicates that the features in the current dataset are insufficient for accurate prediction of return behavior and order quantity.

Nevertheless, this project still holds analytical value. The results of data cleaning, exploratory analysis, and model comparison indicate that factors such as delivery days, discount percentage, customer rating, product category, price, and region show some association with returns and order quantity.

Based on these findings, businesses may consider optimizing delivery efficiency, adjusting promotion strategies, strengthening quality management for high-return categories, and collecting richer customer behavior and product information in the future. Overall, even with limited predictive performance, a complete data science analytical process can still provide meaningful support for business decision-making.