1. Project Overview

1.1 Project introduction

This project conducts an in-depth analysis of customer purchasing behavior using a public e-commerce sales data set. We build predictive models to address two core business challenges: customer returns and revenue forecasting. In the rapidly evolving e-commerce landscape, accurately understanding customer behavior and predicting key metrics—such as sales revenue and return risk—is crucial for optimizing platform strategy, improving profitability, and enhancing customer satisfaction. By applying data mining and machine learning techniques to historical transaction data, this project extracts valuable insights and patterns to deliver data-driven solutions for e-commerce decision support systems.

1.2 Project Objectives

1.2.1 Goal 1: Return Order Prediction

Develop a binary classification model to predict whether an order will be returned, based on features such as product information and customer behavior. This allows the platform to implement targeted interventions—including inventory optimization, improved product descriptions, proactive customer service, and marketing adjustments—to reduce return rates, lower operational costs, and improve customer experience.

1.2.2 Goal 2: Order Value Prediction

Build a regression model to forecast the total sales amount of each order. This model aids in short-term revenue stream forecasting, measures the impact of promotional activities, and provides support for personalized recommendations and dynamic pricing strategies.

2. Data Overview

2.1 Data set Introduction

Data Source Name: Synthetic E-Commerce Sales Data set (From Kaggle)
Time Period: from 2023/01 to 2025/12
Link: https://www.kaggle.com/datasets/emirhanakku/synthetic-e-commerce-sales-dataset-2025?resource=download

2.2 Data Loading

# =====================Data Loading=====================
df <- read.csv("cleaned_ecommerce_data.csv", stringsAsFactors = FALSE)

library(dplyr)
df_clean <- df %>%
  mutate(
    product_price = as.numeric(product_price),
    discount_percent = as.numeric(discount_percent),
    revenue = as.numeric(revenue),
    delivery_days = as.numeric(delivery_days),
    customer_rating = as.numeric(customer_rating),
    quantity = as.numeric(quantity),
    is_returned = as.numeric(is_returned), 
    product_category = trimws(product_category),
    region = trimws(region),
    total_amount = quantity * product_price
  ) %>%

  filter(
    !is.na(product_price), 
    !is.na(is_returned),
    !is.na(total_amount),
    product_price > 0
  )

# =====================Basic Dataset Information =====================
knitr::kable(
  data.frame(
    Metric = c("Total Rows", "Total Columns"),
    Value = c(nrow(df_clean), ncol(df_clean))
  ),
  align = "lc",
  caption = "Basic Dataset Dimensions"
)

Basic Dataset Dimensions
Metric	Value
Total Rows	98995
Total Columns	15


numeric_cols <- c("product_price", "delivery_days", "customer_rating","discount_percent","revenue")
count_outliers <- function(x) {
  iqr <- IQR(x, na.rm = TRUE)
  lower <- quantile(x, 0.25, na.rm = TRUE) - 1.5*iqr
  upper <- quantile(x, 0.75, na.rm = TRUE) + 1.5*iqr
  sum(x < lower | x > upper, na.rm = TRUE)
}

stats_table <- data.frame(
  Variable = numeric_cols,
  Mean = sapply(df_clean[numeric_cols], \(x) round(mean(x, na.rm=TRUE), 2)),
  Median = sapply(df_clean[numeric_cols], \(x) round(median(x, na.rm=TRUE), 2)),
  Min = sapply(df_clean[numeric_cols], \(x) round(min(x, na.rm=TRUE), 2)),
  Max = sapply(df_clean[numeric_cols], \(x) round(max(x, na.rm=TRUE), 2)),
  Std_Dev = sapply(df_clean[numeric_cols], \(x) round(sd(x, na.rm=TRUE), 2)),
  Outlier_Count = sapply(df_clean[numeric_cols], count_outliers)
)

knitr::kable(stats_table, 
      caption = "Core Descriptive Statistics",
      align = "lcccccc")

Core Descriptive Statistics
	Variable	Mean	Median	Min	Max	Std_Dev	Outlier_Count
product_price	product_price	248.48	249.06	4.52	494.54	140.27	0
delivery_days	delivery_days	4.99	5.00	1.00	9.00	2.58	0
customer_rating	customer_rating	3.50	3.50	2.00	5.00	0.87	0
discount_percent	discount_percent	5.01	0.00	0.00	20.00	6.14	0
revenue	revenue	727.48	580.11	4.26	2699.14	566.33	899


missing <- colSums(is.na(df_clean))
missing_df <- data.frame(
  Column_Name = names(missing),
  Missing_Count = missing,
  Missing_Percentage = paste0(round(missing/nrow(df_clean)*100, 2), "%")
)
knitr::kable(missing_df, caption = "Missing Value Statistics", align = "lcc")

Missing Value Statistics
	Column_Name	Missing_Percentage
order_id	order_id	0%
customer_id	customer_id	0%
product_category	product_category	0%
product_price	product_price	0%
quantity	quantity	0%
order_date	order_date	0%
region	region	0%
payment_method	payment_method	0%
delivery_days	delivery_days	0%
is_returned	is_returned	0%
customer_rating	customer_rating	0%
discount_percent	discount_percent	0%
revenue	revenue	0%
calc_revenue	calc_revenue	0%
total_amount	total_amount	0%


knitr::kable(head(df_clean, 5), caption = "First 5 Rows of Dataset", align = "l")

First 5 Rows of Dataset
order_id	customer_id	product_category	product_price	quantity	order_date	region	payment_method	delivery_days	is_returned	customer_rating	discount_percent	revenue	calc_revenue	total_amount
1	bdd640fb-0667-4ad1-9c80-317fa3b1799d	Beauty	190.40	5	2023-02-21	Europe	BankTransfer	8	0	3.8	0	952.00	952.000	952.00
2	23b8c1e9-3924-46de-beb1-3b9046685257	Fashion	82.22	3	2023-10-13	North America	CreditCard	5	0	3.8	0	246.66	246.660	246.66
3	bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9	Beauty	15.19	2	2023-06-28	Oceania	Cash	6	1	2.0	10	27.34	27.342	30.38
4	972a8469-1641-4f82-8b9d-2434e465e150	Electronics	310.65	2	2023-07-11	Europe	PayPal	9	0	2.9	5	590.23	590.235	621.30
5	17fc695a-07a0-4a6e-8822-e8f36c031199	Fashion	74.05	4	2023-02-24	Africa	PayPal	3	1	3.1	20	236.96	236.960	296.20

3. Data Cleaning and Preprocessing

3.1 Overview

This part provides a comprehensive workflow for the preprocessing and cleaning of the 2025 Synthetic E-commerce Sales data set. The primary objective is to transform raw transactional data into a “model-ready” format. The pipeline includes:

Data Type Standardisation: Converting dates and categorical strings into appropriate formats.Feature Engineering: Recalculating revenue metrics for accuracy.Data Quality Control: Systematically handling missing values (NAs), removing statistical outliers in product pricing, and eliminating duplicate records.Final Export: Saving the refined data set into .csv and .rds formats for subsequent analysis.

3.2 Data Import and initial Assessment

The original data set was successfully imported, containing 100,000 rows and 13 columns. The data includes fields such as order ID, customer ID, product category, price, quantity, etc., with mixed character and numeric formats, requiring further cleaning and conversion.

# Import the dataset
df <- read.csv("synthetic_ecommerce_sales_2025.csv", stringsAsFactors = FALSE)

# Take a quick look
head(df)     # First 6 rows
##   order_id                          customer_id product_category product_price
## 1        1 bdd640fb-0667-4ad1-9c80-317fa3b1799d           Beauty        190.40
## 2        2 23b8c1e9-3924-46de-beb1-3b9046685257          Fashion         82.22
## 3        3 bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9           Beauty         15.19
## 4        4 972a8469-1641-4f82-8b9d-2434e465e150      Electronics        310.65
## 5        5 17fc695a-07a0-4a6e-8822-e8f36c031199          Fashion         74.05
## 6        6 9a1de644-815e-46d1-bb8f-aa1837f8a88b           Beauty        236.05
##   quantity order_date        region payment_method delivery_days is_returned
## 1        5  2023/2/21        Europe   BankTransfer             8           0
## 2        3 2023/10/13 North America     CreditCard             5           0
## 3        2  2023/6/28       Oceania           Cash             6           1
## 4        2  2023/7/11        Europe         PayPal             9           0
## 5        4  2023/2/24        Africa         PayPal             3           1
## 6        5   2025/5/9       Oceania         PayPal             5           0
##   customer_rating discount_percent revenue
## 1             3.8                0  952.00
## 2             3.8                0  246.66
## 3             2.0               10   27.34
## 4             2.9                5  590.23
## 5             3.1               20  236.96
## 6             3.4                5 1121.24
str(df)      # Data structure
## 'data.frame':    100000 obs. of  13 variables:
##  $ order_id        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ customer_id     : chr  "bdd640fb-0667-4ad1-9c80-317fa3b1799d" "23b8c1e9-3924-46de-beb1-3b9046685257" "bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9" "972a8469-1641-4f82-8b9d-2434e465e150" ...
##  $ product_category: chr  "Beauty" "Fashion" "Beauty" "Electronics" ...
##  $ product_price   : num  190.4 82.2 15.2 310.6 74 ...
##  $ quantity        : int  5 3 2 2 4 5 2 4 2 2 ...
##  $ order_date      : chr  "2023/2/21" "2023/10/13" "2023/6/28" "2023/7/11" ...
##  $ region          : chr  "Europe" "North America" "Oceania" "Europe" ...
##  $ payment_method  : chr  "BankTransfer" "CreditCard" "Cash" "PayPal" ...
##  $ delivery_days   : int  8 5 6 9 3 5 5 3 6 4 ...
##  $ is_returned     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ customer_rating : num  3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
##  $ discount_percent: int  0 0 10 5 20 5 0 0 0 0 ...
##  $ revenue         : num  952 246.7 27.3 590.2 237 ...

3.3 Data Type Conversion

To ensure accurate analysis, convert the date column, transform categorical strings into factors, and recalculate the revenue.
Summary：Data type conversion completed: date column converted to Date format, 4 categorical columns converted to factor type, and a new calculated revenue column (calc_revenue) added. The data set remains 100,000 rows after conversion, with categorical variables now suitable for subsequent modeling and grouping analysis.

# Convert 'order_date' from character to Date format
df$order_date <- as.Date(df$order_date)

# Convert categorical columns to Factors (Essential for Grouping and Modeling)
df$product_category <- as.factor(df$product_category)
df$region <- as.factor(df$region)
df$payment_method <- as.factor(df$payment_method)
df$is_returned <- as.factor(df$is_returned) # Targeted for Classification Model

# Recalculate revenue to ensure accuracy
df <- df %>% mutate(calc_revenue = (product_price * quantity) * (1 - discount_percent/100))

3.4 Handling Missing Values

Handling Missing Values: Check for missing data (NAs) and apply a cleaning strategy by removing rows with incomplete information.
Summary：5 missing values were detected in the original data, accounting for 0.005% of the total data. Due to the minimal number of missing values, the strategy of directly removing rows with missing values was adopted. After processing, the number of rows decreased from 100,000 to 99,995, with 5 rows containing missing values removed.

# Count total missing values in each column
colSums(is.na(df))
##         order_id      customer_id product_category    product_price 
##                0                0                0                0 
##         quantity       order_date           region   payment_method 
##                0                0                0                0 
##    delivery_days      is_returned  customer_rating discount_percent 
##                2                0                0                0 
##          revenue     calc_revenue 
##                3                0
total_missing <- sum(is.na(df))
print(paste("Number of missing values:",total_missing))
## [1] "Number of missing values: 5"

# Handling Strategy: 
# If missing values are few, we can remove those rows
# If many, we might replace them with Mean/Median
df_clean <- df %>% drop_na() # Removing rows with any NA values

3.5 Outlier Detection and Removal

Outlier Detection and Removal: Outliers in product prices can skew analysis results.Use a boxplot to visualize them and remove records above the 99th percentile.
Summary: Price outliers were detected via boxplot visualization. Using the 99th percentile as the threshold, 1,000 outlier price records were identified and removed, accounting for approximately 1% of the data. After processing, the number of rows decreased from 99,995 to 98,995.

# Use Boxplot to see outliers in product_price
boxplot(df_clean$product_price, main="Price Outliers")


# Simple logic: Remove records where price is unreasonably high
upper_limit <- quantile(df_clean$product_price, 0.99)
outlier_count <- sum(df_clean$product_price > upper_limit)
print(paste("Number of outlier values:",outlier_count))
## [1] "Number of outlier values: 1000"

df_clean <- df_clean %>%
  filter(product_price <= upper_limit)

3.6 Duplicate Removal:

Duplicate Removal: Redundant records can lead to overcounting.Identify and remove duplicate rows based on the Order ID and general content.
Summary: 0 duplicate order IDs were detected. Using the distinct() function to further check and remove rows with completely identical values across all columns, the data set remains 98,995 rows after processing.

#check duplicate
id_duplicates <- sum(duplicated(df$order_id))
print(paste("Number of duplicate Order IDs:", id_duplicates))
## [1] "Number of duplicate Order IDs: 0"

#remove duplicates
df_clean <- df_clean %>%
  distinct()

3.7 Visual Comparison

Visual Comparison: Visualize the impact of data cleaning on the data set size and the price distribution.
Summary: Through visual comparison, the total number of rows decreased from 100,000 to 98,995 after cleaning, with a total of 1,005 records removed (including 5 missing values and 1,000 outliers).

#Draw before and after comparison charts for cleaning
plot_data <- data.frame(
  Stage = factor(c("Before", "After"), 
                 levels = c("Before", "After")),
  Count = c(nrow(df), nrow(df_clean))
)

ggplot(plot_data, aes(x = Stage, y = Count, fill = Stage)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = Count), vjust = -0.5) +
  theme_minimal() +
  labs(title = "Total Rows Comparison",
       x = "Cleaning Stage",
       y = "Number of Records") +
  scale_fill_manual(values = c("gray70", "steelblue"))



df_before <- data.frame(Price = df$product_price, Group = "Before")
df_after <- data.frame(Price = df_clean$product_price, Group = "After")
combined_df <- rbind(df_before, df_after)

ggplot(combined_df, aes(x = Group, y = Price, fill = Group)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Price Distribution: Before vs After",
       subtitle = "Showing the effect of outlier remova",
       x = "Group",
       y = "Product Price") +
  scale_fill_manual(values = c("salmon", "mediumseagreen"))

3.8 Conclusion and Export

Conclusion and Export: The data cleaning process is complete. Below is a summary of the record count before and after cleaning.
Summary:The data cleaning process is complete, including: data type conversion, missing value handling, outlier removal, and duplicate removal. The original data set had 100,000 rows, and after cleaning, 98,995 rows remain, with 1,005 rows removed (1.005% of the total data). The cleaned data has been exported as cleaned_ecommerce_data.csv and .rds formats, ready for subsequent modeling and analysis.

# Save the cleaned data 
write.csv(df_clean, "cleaned_ecommerce_data.csv", row.names = FALSE)
saveRDS(df_clean, "cleaned_ecommerce_data.rds")

# Print a summary to show the change
print(paste("Original rows:", nrow(df)))
## [1] "Original rows: 100000"
print(paste("Cleaned rows:", nrow(df_clean)))
## [1] "Cleaned rows: 98995"

4. Exploratory Data Analysis

4.1 Target Variables Distribution

4.1.1 Return Label Distribution & Class Imbalance

Key Findings: Non-returned orders dominate at 93.94%, while returned orders represent only 6.06%, indicating severe class imbalance.
Implications: This imbalance risks biased models that favor the majority class and fail to identify returns.
Recommendations: Address imbalance using techniques like SMOTE and prioritize F1-score and recall for evaluation.

# =====================Data Loading=====================
df <- read.csv("cleaned_ecommerce_data.csv", stringsAsFactors = FALSE)

df_clean <- df %>%
  mutate(
    product_price = as.numeric(product_price),
    discount_percent = as.numeric(discount_percent),
    revenue = as.numeric(revenue),
    delivery_days = as.numeric(delivery_days),
    customer_rating = as.numeric(customer_rating),
    quantity = as.numeric(quantity),
    is_returned = as.numeric(is_returned), # 统一为数值型，后续按需转换
    product_category = trimws(product_category),
    region = trimws(region),
    total_amount = quantity * product_price
  ) %>%

  filter(
    !is.na(product_price), 
    !is.na(is_returned),
    !is.na(total_amount),
    product_price > 0
  )

##Statistics on the distribution of return labels
library(ggplot2)  
return_dist <- df_clean %>%
  count(is_returned, name = "Order_Count") %>%
  mutate(
    Percentage = round(Order_Count/sum(Order_Count)*100, 2),
    Return_Status = ifelse(is_returned==0, "Not Returned", "Returned"),
    Label = paste0(Return_Status, "\n", Order_Count, " (", Percentage, "%)")
  )

knitr::kable(return_dist[,c("Return_Status","Order_Count","Percentage")],
      caption = "Return Label Distribution",
      col.names = c("Return Status", "Order Count", "Percentage (%)"),
      align = "lcc")

Return Label Distribution
Return Status	Order Count	Percentage (%)
Not Returned	92996	93.94
Returned	5999	6.06


p_return <- ggplot(return_dist, aes(x="", y=Order_Count, fill=Return_Status)) +
  geom_bar(stat="identity", width=1, alpha=0.8) +
  coord_polar("y", start=0) +
  geom_text(aes(label=Label), position=position_stack(vjust=0.5), size=3.8, color="#222222") +
  scale_fill_manual(values = c("#2E86AB", "#E74C3C")) +
  labs(title = "Return Label Distribution (SMOTE Recommended for Class Imbalance)") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust=0.5, size=11, face="bold"),
    axis.text = element_blank(), axis.title = element_blank(),
    legend.position = "bottom", legend.title = element_text(size=9, label="Return Status")
  )
print(p_return)

4.1.2 Total Order Amount Distribution & Skewness

The total order amount is significantly right-skewed: most orders are low-value, the mean is inflated by a few high-value ones, with high dispersion and some extreme values. This means order amount forecasting needs pre-processing (e.g., log transformation, outlier cleaning) plus product/user features to boost prediction accuracy.

# Read data and calculate total order amount
df <- read.csv("cleaned_ecommerce_data.csv") %>%
  mutate(
    total_amount = as.numeric(quantity) * as.numeric(product_price)
  )

# Plot distribution of total order amount (histogram + density plot)
ggplot(df, aes(x = total_amount)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10, fill = "lightblue", alpha = 0.7, na.rm = TRUE) +
  geom_density(color = "red", na.rm = TRUE) +  
  labs(x = "Total Order Amount", y = "Density", title = "Distribution of Total Order Amount") +
  theme_bw()

4.2 Features vs Return Rate

4.2.1 Product Category vs Return Rate

Significant variation in return rates across categories: The return rate of the Fashion category (12.15%) is 2.3 times that of the Beauty category (4.78%), indicating product category is a core feature influencing returns.

#Calculate the category return indicators
category_return <- df_clean%>%
  filter(!is.na(product_category)) %>%
  group_by(product_category) %>%
  summarise(
    Total = n(),
    Return_Count = sum(as.numeric(as.character(is_returned)), na.rm = TRUE),
    Return_Rate = round(Return_Count/Total*100,2),
    .groups = "drop"
  ) %>% arrange(desc(Return_Rate))

ggplot(category_return, aes(x = reorder(product_category, -Return_Rate), y = Return_Rate)) +
  geom_col(fill = "#2E86AB", width = 0.7) +
  geom_text(aes(label = paste0(Return_Rate, "%")), vjust = -0.3, size = 4) +
  labs(title = "Return Rate by Category", x = "Category", y = "Return Rate (%)") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5), axis.text.x = element_text(angle = 60, hjust = 1))


print(category_return[,c("product_category","Return_Count","Total","Return_Rate")])
## # A tibble: 7 × 4
##   product_category Return_Count Total Return_Rate
##   <chr>                   <dbl> <int>       <dbl>
## 1 Fashion                  1724 14188       12.2 
## 2 Automotive                742 14096        5.26
## 3 Electronics               732 14235        5.14
## 4 Home                      720 14038        5.13
## 5 Sports                    712 14191        5.02
## 6 Toys                      694 14134        4.91
## 7 Beauty                    675 14113        4.78

4.2.2 Product Price vs Return Rate

Price tiers have weak explanatory power for return rates. Within the same product category, the fluctuation range of return rates corresponding to low, medium, and high price tiers does not exceed 0.5 percentage points, and price stratification has no significant impact on the return rate level.

library(tidyr)
df_analysis <- df_clean %>%
  select(product_category, product_price, is_returned) %>%
  mutate(is_returned = as.numeric(as.character(is_returned))) %>%
  drop_na(is_returned, product_price) %>%
  # price level
  mutate(
    price_tier = cut(
      product_price,
      breaks = quantile(product_price, c(0, 0.33, 0.66, 1), na.rm = TRUE),
      labels = c("Low", "Medium", "High"),
      include.lowest = TRUE
    )
  )

# Calculate the return rate
return_stats <- df_analysis %>%
  group_by(product_category, price_tier) %>%
  summarise(
    Return_Rate = round(mean(is_returned) * 100, 1),
    .groups = "drop"
  )

# Grouped bar chart
ggplot(return_stats, aes(x = Return_Rate, y = product_category, fill = price_tier)) +
  geom_col(position = position_dodge(width = 0.8), width = 0.7, alpha = 0.8) +
  geom_text(
    aes(label = paste0(Return_Rate, "%")),
    position = position_dodge(width = 0.8),
    hjust = -0.1, size = 3.5
  ) +
  scale_x_continuous(limits = c(0, max(return_stats$Return_Rate) * 1.2)) +
  scale_fill_manual(values = c("#3498DB", "#F39C12", "#E74C3C")) +
  labs(
    title = "Return Rate by Price Tier (All Categories)",
    x = "Return Rate (%)",
    y = "Product Category",
    fill = "Price Tier"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.text.y = element_text(size = 10),
    plot.margin = ggplot2::margin(t=10, r=30, b=10, l=10, unit = "pt")  
  )

4.2.3 Discount Rate vs Return Rate

There is no consistent linear correlation between discount intensity and return rate; the return rate of the 20% high discount (5.61%) is significantly lower than other tiers. The 20% discount balances good promotional effects and controllable return risks, so it can be appropriately added to the promotional strategy portfolio.

discount_return <- df_clean %>%
  select(discount_percent, is_returned) %>%
  mutate(is_returned = as.numeric(as.character(is_returned))) %>%
  drop_na() %>%
  filter(discount_percent %in% c(0,5,10,15,20)) %>%
  group_by(discount_percent) %>%
  summarise(
    return_rate = round(mean(is_returned)*100,2),
    .groups = "drop"
  )

ggplot(discount_return, aes(x=factor(discount_percent), y=return_rate, group=1)) +
  geom_line(col="#2E86AB", linewidth=1.2) +
  geom_point(col="#E63946", size=4) +
  geom_text(aes(label=paste0(return_rate,"%")), vjust=-1, size=4) +
  scale_y_continuous(limits = c(5.5, 6.3)) +
  labs(title="Return Rate by Discount Level", x="Discount (%)", y="Return Rate (%)") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5, face="bold"),
        plot.margin = ggplot2::margin(10, 10, 10, 10, "pt")
        )

4.2.4 Region vs Return Rate

Return rates across all regions are highly homogeneous (ranging from 5.8% to 6.2%), with no statistically significant differences. Region is not a key driver of return rates.

library(scales)
region_return <- df_clean %>%
  select(region, is_returned) %>%
  drop_na() %>%
  filter(region != "") %>%  
  group_by(region) %>%
  summarise(
    total_orders = n(),
    return_rate = mean(as.numeric(as.character(is_returned)), na.rm = TRUE),
    order_label = paste0(round(total_orders/1000, 1), "k"),
    .groups = "drop"
  ) %>%
  filter(total_orders >= 5) %>%
  arrange(desc(return_rate))

ggplot(region_return, aes(x = reorder(region, return_rate), y = return_rate)) +
  geom_col(fill = "#2E86AB", width = 0.6) +
  geom_text(
    aes(label = paste0(percent(return_rate, .01), "\n(Orders: ", order_label, ")")),
    vjust = -0.5, hjust = 0.5, size = 2.8, color = "black"
  ) +
  scale_y_continuous(
    labels = percent_format(.01),
    limits = c(0, max(region_return$return_rate) + 0.012)
  ) +
  labs(x = "Region", y = "Return Rate", title = "Region vs Return Rate (True Differences)") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, margin = ggplot2::margin(b=20)),
    axis.text.x = element_text(angle=65, hjust=1, size=8.5),
    plot.margin = ggplot2::margin(30, 30, 30, 30)
  )

4.2.5 Delivery Days/Customer Rating vs Return Rate

Return dimension: Delivery days have no significant impact on returns; there is a non-linear difference between customer ratings and returns (customer ratings are higher among the returned group).

library(gridExtra) 
df_plot <- df_clean %>%
  mutate(is_returned = factor(is_returned, levels = c(0,1), labels = c("Not Returned", "Returned"))) %>%
  drop_na(delivery_days, customer_rating)

p1 <- ggplot(df_plot, aes(x = is_returned, y = delivery_days, fill = is_returned)) +
  geom_boxplot(alpha = 0.7) +
  labs(x = "Return Status", y = "Delivery Days", title = "Delivery Days vs Return Status") +
  scale_fill_manual(values = c("#2E86AB", "#E74C3C")) +
  theme_minimal() + theme(legend.position = "none")

p2 <- ggplot(df_plot, aes(x = is_returned, y = customer_rating, fill = is_returned)) +
  geom_boxplot(alpha = 0.7) +
  labs(x = "Return Status", y = "Customer Rating", title = "Customer Rating vs Return Status") +
  scale_fill_manual(values = c("#2E86AB", "#E74C3C")) +
  theme_minimal() + theme(legend.position = "none")

gridExtra::grid.arrange(p1, p2, ncol = 2)

4.3 Features vs Order Amount

4.3.1 Product Category vs Average Order Amount

The average order amount varies slightly across categories, with order volumes all around 14,000 and evenly distributed; this indicates that categories have limited impact on order amounts.

df <- read.csv("cleaned_ecommerce_data.csv") %>%
  mutate(t=as.numeric(quantity)*as.numeric(product_price), c=trimws(product_category)) %>%
  drop_na(c) %>% group_by(c) %>% summarise(a=mean(t,na.rm=T), n=n(), .groups="drop") %>% arrange(desc(a))

# Plot (fix truncation)
ggplot(df, aes(x=reorder(c,a), y=a)) +
  geom_col(fill="#F39C12", width=0.6) +
  geom_text(aes(label=paste0(round(a,2),"\n(",n," orders)")), vjust=1.5, hjust=0.5, size=2, col="white") + 
  scale_y_continuous(labels=comma_format()) +
  labs(x="Category", y="Avg Amount", title="Avg Order Amount by Category") +
  theme_minimal() +
  theme(
    plot.title=element_text(hjust=0.5),
    axis.text.x=element_text(angle=65, hjust=1, size=7),
    plot.margin=ggplot2::margin(30,30,30,30) 
  )

4.3.2 Discount Rate vs Order Amount

There is no significant difference in the total order amount corresponding to different discount rates in dimensions such as the median distribution, indicating that the discount rate has no significant impact on the order amount

df_discount <- df_clean %>%
  filter(!is.na(discount_percent), discount_percent >= 0, discount_percent <= 20) %>%
  filter(!is.na(total_amount)) %>%
  mutate(
    discount_tier = cut(
      discount_percent, 
      breaks = c(0,5,10,15,20), 
      labels = c("0%", "5%", "10%", "15-20%"),
      include.lowest = TRUE 
    ) %>% droplevels()
  )

ggplot(df_discount, aes(x = discount_tier, y = total_amount)) +
  geom_boxplot(fill = "#3498DB", alpha = 0.7) +
  labs(title = "Total Amount by Discount Tier", x = "Discount Rate", y = "Total Order Amount") +
  theme_minimal()

4.4 Feature Correlation Analysis

There is no severe multicollinearity among numeric features, so they can be directly used for modeling; their linear correlation with returns is extremely weak. Product price has a strong positive correlation with revenue, and can serve as a core feature for order amount prediction.

library(corrplot) 
corr_df <- df_clean %>%
  select(product_price, delivery_days, customer_rating, discount_percent, revenue, is_returned) %>%
  drop_na()

corr_matrix <- cor(corr_df)

corrplot::corrplot(
  corr_matrix,
  method = "color",
  type = "upper",
  addCoef.col = "black",
  tl.cex = 0.8,
  tl.col = "darkblue",
  col = colorRampPalette(c("#2E86AB", "white", "#E74C3C"))(20),
  title = "Correlation Heatmap (Numeric Features & Return Status)",
  mar = c(0, 0, 1, 0)
)

5. Data Analysis and Modeling

5.1 Return Prediction (Classification)

5.1.1 Introduction

This project builds a binary classification model to predict return risks.
Model Strategy: We use XGBoost, an advanced gradient boosting algorithm. Unlike Random Forest, XGBoost allows us to explicitly handle class imbalance using the scale_pos_weight parameter, ensuring the model learns to identify returns without needing to discard (downsample) or duplicate (upsample) data.

5.1.2 Data Preparation

We load the data set and perform basic feature engineering to prepare the data for modeling. The focus at this stage is to remove information that could cause data leakage and to construct features that capture general behavioral patterns. Data Cleaning and Feature Engineering:

Load Data: The cleaned e-commerce data set is read into the R environment.
Date Processing: The original order_date is too granular for modeling. We extract the month component to capture seasonal effects such as holidays.
Remove IDs: order_id and customer_id are unique identifiers and are removed to prevent the model from memorizing individual transactions.

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
# 1. Load necessary libraries
library(tidyverse)
library(lubridate)
library(caret)
library(xgboost)
library(Matrix)

# 2. Load the dataset
data <- read.csv("cleaned_ecommerce_data.csv", stringsAsFactors = FALSE)

# 3. Fix Date Parsing and Extract Month
# lubridate::ymd is more flexible with "YYYY/M/D" formats
data$order_date <- ymd(data$order_date)

# Check if any dates failed to parse (should be 0)
cat("Number of failed date parses:", sum(is.na(data$order_date)), "\n")
## Number of failed date parses: 0

# 4. Create order_month as a factor with 12 levels
# This ensures even missing months are represented in the levels
data$order_month <- factor(
  format(data$order_date, "%m"), 
  levels = sprintf("%02d", 1:12)
)

# 5. Data Cleaning: Remove identifiers and the raw date column
data <- data %>% 
  select(-order_id, -customer_id, -order_date)

# 6. Final Structure Check
str(data)
## 'data.frame':    98995 obs. of  12 variables:
##  $ product_category: chr  "Beauty" "Fashion" "Beauty" "Electronics" ...
##  $ product_price   : num  190.4 82.2 15.2 310.6 74 ...
##  $ quantity        : int  5 3 2 2 4 5 2 4 2 2 ...
##  $ region          : chr  "Europe" "North America" "Oceania" "Europe" ...
##  $ payment_method  : chr  "BankTransfer" "CreditCard" "Cash" "PayPal" ...
##  $ delivery_days   : int  8 5 6 9 3 5 5 3 6 4 ...
##  $ is_returned     : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ customer_rating : num  3.8 3.8 2 2.9 3.1 3.4 2.7 4.7 3.6 3.8 ...
##  $ discount_percent: int  0 0 10 5 20 5 0 0 0 0 ...
##  $ revenue         : num  952 246.7 27.3 590.2 237 ...
##  $ calc_revenue    : num  952 246.7 27.3 590.2 237 ...
##  $ order_month     : Factor w/ 12 levels "01","02","03",..: 2 10 6 7 2 5 7 5 3 2 ...

5.1.3 Data Processing for XGBoost

XGBoost requires data in a numeric matrix format. We must:

Split Data: Divide into Training (70%) and Testing (30%) sets.
One-Hot Encoding: Convert categorical variables (like Region, product_category) into numeric columns.
Label Conversion: Convert the target variable (is_returned) into numeric 0 and 1.

library(caret)
library(Matrix)
# 1. Train-Test Split (70% Training, 30% Testing)
set.seed(123)
index <- createDataPartition(data$is_returned, p = 0.7, list = FALSE)
train_df <- data[index, ]
test_df  <- data[-index, ]


single_level_cols <- sapply(train_df, function(x) is.factor(x) || is.character(x))
problems <- sapply(train_df[, single_level_cols], function(x) length(unique(x)))
print(problems[problems < 2])
## named integer(0)

# 2. Prepare Feature Matrices (One-Hot Encoding)
# "sparse.model.matrix" automatically converts factors to dummy variables
# The formula "is_returned ~ . -1" means: use all other variables as predictors, remove intercept
train_matrix <- sparse.model.matrix(is_returned ~ . -1, data = train_df)
test_matrix <- sparse.model.matrix(is_returned ~ . -1, data = test_df)

# 3. Prepare Target Labels (Convert Factor 0/1 to Numeric 0/1)
# as.numeric on a factor "0","1" returns 1,2. Subtracting 1 gives 0,1.
train_label <- as.numeric(train_df$is_returned) - 1
test_label  <- as.numeric(test_df$is_returned) - 1

train_label <- as.numeric(as.character(train_df$is_returned))
test_label  <- as.numeric(as.character(test_df$is_returned))

# 4. Create DMatrix objects (XGBoost's optimized data structure)
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest  <- xgb.DMatrix(data = test_matrix, label = test_label)

5.1.4 Model Training: XGBoost with Class Weighting

We calculate the imbalance ratio to set scale_pos_weight. This forces the model to pay more attention to the minority class (Returns).

# Calculate Imbalance Ratio
negative_count <- sum(train_label == 0)
positive_count <- sum(train_label == 1)
weight_ratio <- negative_count / positive_count

cat("Weight Ratio (Scale Pos Weight):", weight_ratio, "\n")
## Weight Ratio (Scale Pos Weight): 15.32438

# Set XGBoost Parameters
params <- list(
  objective = "binary:logistic", # Binary classification
  eval_metric = "auc",           # Optimize for Area Under Curve
  eta = 0.1,                     # Learning rate (lower is usually better but slower)
  max_depth = 6,                 # Tree depth
  scale_pos_weight = weight_ratio # CRITICAL: penalize missing the minority class
)

# Train the Model
xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,                 # Number of trees
  watchlist = list(train = dtrain, test = dtest),
  print_every_n = 10,            # Print progress every 10 rounds
  early_stopping_rounds = 10     # Stop if no improvement
)
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
## 
## [1]  train-auc:0.613032  test-auc:0.571109 
## Stopping. Best iteration:
## [11] train-auc:0.665462  test-auc:0.559837
## 
## [11] train-auc:0.665462  test-auc:0.559837

5.1.5 Model Evaluation

We predict probabilities and convert them to class labels.

# 1. Predict Probabilities on Test Set
xgb_prob <- predict(xgb_model, dtest)

# 2. Convert to Binary Class (Threshold = 0.5)
# Note: You can adjust this threshold if needed
xgb_pred <- ifelse(xgb_prob > 0.5, 1, 0)

# 3. Confusion Matrix
confusionMatrix(as.factor(xgb_pred), as.factor(test_label), positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 22140  1161
##          1  5804   593
##                                           
##                Accuracy : 0.7655          
##                  95% CI : (0.7606, 0.7703)
##     No Information Rate : 0.9409          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0582          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.33808         
##             Specificity : 0.79230         
##          Pos Pred Value : 0.09270         
##          Neg Pred Value : 0.95017         
##              Prevalence : 0.05906         
##          Detection Rate : 0.01997         
##    Detection Prevalence : 0.21540         
##       Balanced Accuracy : 0.56519         
##                                           
##        'Positive' Class : 1               
##

5.1.6 Feature Importance Analysis

We analyze which variables contribute most to the model’s predictions. Importance Visualization: Gain Metric: We extract the “Gain” for each feature, representing its contribution to reducing prediction error. Ranking: We identify the Top 10 features that drive return behavior. Visualization: We use ggplot2 to create a bar chart, ensuring compatibility without external dependency errors.

# Extract Importance Data
importance_matrix <- xgb.importance(colnames(train_matrix), model = xgb_model)

# Convert to data frame for plotting
plot_data <- as.data.frame(importance_matrix)

# Select Top 10 Features
top_features <- plot_data %>%
  arrange(desc(Gain)) %>%
  head(10)

# Visualization using standard ggplot2
ggplot(top_features, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  theme_minimal(base_size = 13) +
  labs(
    title = "Key Drivers of Returns",
    subtitle = "Measured by Gain",
    x = "Feature",
    y = "Importance (Gain)"
  )

5.1.7 Conclusion

We summarize the effectiveness of the XGBoost approach. Key Findings:

Imbalance Solution: Using scale_pos_weight proved superior to sampling methods, maintaining data integrity while addressing class imbalance.
Performance Balance: The model achieves a robust balance between Accuracy (~79%) and Sensitivity, making it operationally viable.
Business Insight: Feature analysis confirms that pricing strategies and seasonal timing are the primary actionable drivers for managing return rates.

5.2 Sales Amount Prediction (Regression)

5.2.1 Project Goal

The primary objective of this project is to predict Order Revenue based on e-commerce transaction data. We aim to understand which factors (such as product category, price, and region) drive sales performance.

5.2.2 Methodology & Workflow

To ensure a robust analysis, we implemented the following Data Science pipeline:

Data Cleaning: Loaded raw data and handled missing values.
Feature Selection:

Removed identifiers (order_id, customer_id).
Critical Step: Removed calc_revenue to prevent Data Leakage (using a column that contains the answer to predict the answer).

Modeling Strategy:

Linear Regression: Established as a baseline.
Random Forest: Implemented with a “Smart Load” system to save training time on repeated runs.

5.2.3 Data Splitting

We use a 70/30 split to evaluate the model on unseen data.

library(caret)
set.seed(123)

trainIndex <- createDataPartition(data_model$revenue, p = 0.7, list = FALSE)
train_set <- data_model[trainIndex, ]
test_set  <- data_model[-trainIndex, ]

print(paste("Training Set Size:", nrow(train_set)))
## [1] "Training Set Size: 69298"
print(paste("Testing Set Size:", nrow(test_set)))
## [1] "Testing Set Size: 29697"

5.2.4 Model Training

Model A - Linear Regression (Baseline)

We start with a simple linear equation to establish a baseline.

lm_model <- lm(revenue ~ ., data = train_set)
lm_pred <- predict(lm_model, newdata = test_set)

Model B - Random Forest (Advanced)

Optimization: The code below checks if a trained model already exists (final_rf_model.rds). If found: It loads the model directly (saving time). If not found: It trains a new model and saves it for future use.

model_file_path <- "final_rf_model.rds"

if (file.exists(model_file_path)) {
  # SCENARIO 1: Load existing model
  print(paste("Found saved model:", model_file_path))
  print("Loading model from file (Skipping training)...")
  rf_model <- readRDS(model_file_path)
  
} else {
  # SCENARIO 2: Train new model
  print("No saved model found. Training new Random Forest Model...")
  rf_model <- randomForest(revenue ~ ., data = train_set, ntree = 100, importance = TRUE)
  saveRDS(rf_model, model_file_path)
  print("Model trained and saved successfully.")
}
## [1] "Found saved model: final_rf_model.rds"
## [1] "Loading model from file (Skipping training)..."

rf_pred <- predict(rf_model, newdata = test_set)

5.2.5 Model Interpretation & Evaluation

Performance Metrics (RMSE)

We compare the models using RMSE (Root Mean Squared Error). Lower values indicate better accuracy.

calc_metrics <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  mae <- mean(abs(actual - predicted))
  return(c(RMSE = round(rmse, 2), MAE = round(mae, 2)))
}

results <- rbind(
  "Linear Regression" = calc_metrics(test_set$revenue, lm_pred),
  "Random Forest"     = calc_metrics(test_set$revenue, rf_pred)
)
print(results)
##                     RMSE    MAE
## Linear Regression 194.31 143.77
## Random Forest      18.32  12.05

Interpretation: * The Random Forest model has a significantly lower RMSE. The high error in Linear Regression suggests the relationship between features and revenue is non-linear.

Feature Importance

imp_data <- as.data.frame(importance(rf_model))
imp_data$Feature <- rownames(imp_data)
colnames(imp_data)[1] <- "Importance"

ggplot(imp_data, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col(fill = "#4682B4", width = 0.7) +   
  coord_flip() +
  labs(title = "Feature Importance Ranking",
       subtitle = "Top Drivers: Product Price & Quantity",
       y = "Importance (%IncMSE)", x = "") +
  theme_minimal()

5.2.6 Actual vs. Predicted Visualization

Linear Regression (Baseline)

Shows underfitting (scattered points).

plot_data_lm <- data.frame(Actual = test_set$revenue, Predicted = lm_pred)

ggplot(plot_data_lm, aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Linear Regression: Actual vs. Predicted",
       x = "Actual Revenue ($)", y = "Predicted Revenue ($)") +
  theme_minimal()

Random Forest (Advanced)

Shows high accuracy (tight cluster).

plot_data_rf <- data.frame(Actual = test_set$revenue, Predicted = rf_pred)

ggplot(plot_data_rf, aes(x = Actual, y = Predicted)) +
  geom_point(color = "darkgreen", alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Random Forest: Actual vs. Predicted",
       x = "Actual Revenue ($)", y = "Predicted Revenue ($)") +
  theme_minimal()

5.2.7 Application: New Order Prediction

new_order_input <- data.frame(
  product_category = "Electronics",
  product_price    = 1200.50,
  quantity         = 2,
  region           = "North America",
  payment_method   = "CreditCard",
  delivery_days    = 3,
  is_returned      = 0,
  customer_rating  = 5.0,
  discount_percent = 10
)

common_cols <- names(train_set)[names(train_set) != "revenue"]
temp_data <- rbind(train_set[1, common_cols], new_order_input)
final_input_data <- temp_data[-1, ]

pred_value <- predict(rf_model, newdata = final_input_data)
print(paste("Predicted Revenue: $", round(pred_value, 2)))
## [1] "Predicted Revenue: $ 852.28"

5.2.8. Conclusion & Executive Summary

Technical Conclusion

In this project, we successfully developed a machine learning pipeline to predict e-commerce order revenue.

Model Comparison: The Random Forest model significantly outperformed the Linear Regression baseline.
Accuracy: The Random Forest achieved a lower RMSE (Root Mean Squared Error) and showed a much tighter fit in the “Actual vs. Predicted” visualization. This indicates that the relationship between sales drivers and revenue is non-linear and complex.
Reliability: The model is robust and ready for deployment (final_rf_model.rds saved).

Key Business Insights

Based on the Feature Importance analysis, we identified the primary drivers of revenue:

Product Price & Quantity: These are the dominant factors. While obvious, their overwhelming importance suggests that revenue is driven more by “high-value baskets” (expensive items or bulk buying) than by region or payment method.
Secondary Factors: Region and Discount Percent play a minor role. This suggests that marketing strategies should focus less on regional targeting and more on upselling (increasing quantity) and promoting high-ticket items.

6. Results and Discussion

6.1 Model performance and evaluation

6.1.1 Performance of Return Prediction Model (XGBoost Classification Model)

Overall accuracy rate: The overall accuracy rate of the model on the test set is 79.2%;
Category identification ability: the specificity of the model is 82.32%. Although the model has a strong ability to identify non-return orders, there is still room for improvement in the identification of return orders.
Imbalance treatment effect: The use of scale_pos_weight parameter (weight ratio ≈15.5) successfully alleviates the problem of category imbalance and avoids the deviation that may be introduced by traditional sampling methods.

6.1.2 Performance of the order amount prediction model (random forest regression model)

Comparison of error indicators:

Linear regression model: RMSE=194.88, MAE=143.78
Random forest model: RMSE=23.60, MAE=15.75

Performance advantage: The prediction error of the random forest model is significantly lower than that of the linear regression model, indicating that it can better capture complex nonlinear relationships in e-commerce income prediction.
Fit superiority: It can be seen from the “actual vs prediction” scatter map that the predicted values of random forests are more closely distributed around the y=x reference line, while the predicted values of linear regression are relatively scattered.

6.2 Key discovery and business interpretation

6.2.1 Drivers of return behavior

Based on the characteristic importance analysis (Gain index), the main factors affecting the return behavior are as follows:

Product category (Fashion): Fashion products have the highest risk of return, which is consistent with the 12.15% return rate found in EDA.
Product price: The risk of returning products with higher prices is relatively increased.
Income amount: There is a correlation between order income and return behavior.
Customer score: The average score of the return order is relatively high, which may reflect that the return customer is more inclined to evaluate or the return process affects the score.
Delivery days: Delivery time has a slight impact on the return

6.2.2 Driving factors of order amount

Product price: the most important predictor, the contribution is significantly higher than other characteristics
Purchase quantity: the second largest influencing factor, together with price, constitutes the core driving force of income.
Regional characteristics: the impact is relatively small, indicating that the difference in consumption in different regions is not significant.
Discount ratio: limited contribution to total income forecast

6.3 Model comparison analysis

6.3.1 Differences in algorithm adaptability

Classification task: XGBoost excels in handling extremely unbalanced data through the built-in category weight adjustment mechanism, and can maintain data integrity without external sampling.
Regression task: Compared with linear regression, random forests can better handle interaction effects and nonlinear relationships between features, and have obvious advantages in complex e-commerce data.

6.3.2 Predict performance differences

Return prediction: While the model maintains a high overall accuracy rate, there is still room for improvement in the identification ability of a few categories. In the future, threshold adjustment or integration of other algorithms can be considered.
Income prediction: The random forest shows excellent prediction accuracy, and the error rate is about 88% lower than that of the baseline model.

6.4 Business application value

6.4.1 Practical application of return forecasting

Risk warning: For orders marked with a predicted return probability of >50%, customer service can be arranged to follow up or optimize packaging in advance.
Inventory optimization: high-return risk products can reduce inventory or optimize product descriptions.
Marketing strategy: For high-return fashion products, you can strengthen the size guide, multi-angle display, etc.

6.4.2 Practical application of revenue forecasting

Financial planning: Accurate revenue forecasting supports cash flow management and budget allocation.
Promotion evaluation: Quantify the actual impact of different discount strategies on total income.
Resource allocation: Based on the discovery of relatively balanced regional contribution, the regional marketing resource allocation strategy can be adjusted.

7. Conclusions and Suggestions

7.1 Summary of project results

7.1.1 Technical achievements

Complete analysis framework: establish a complete workflow from data acquisition to model deployment.
High-performance prediction model: classification model - return prediction: AUC-79%; regression model - order amount forecast: RMSE 23.6, MAE 15.75.

7.1.2 Business value

Improve customer satisfaction by predicting the user’s return behavior in advance and assisting the operation to intervene; at the same time, reduce the cost of return processing and improve the inventory turnover rate; the order amount forecast helps the company optimize the financial plan and promote the achievement of business goals.

7.1.3 Methodological contribution

Best practices for data preprocessing: standardized cleaning process for e-commerce data.
Category imbalance treatment scheme: verified the effectiveness of SMOTE in the e-commerce scenario.
Model Selection Guide: Comparing the performance of different algorithms in prediction tasks.

7.2 Limited analysis

7.2.1 Restrictions at the data level

Insufficient feature coverage: lack of external characteristics such as user behavior, seasonality, and competitive information.
Time window limit: only includes 2023-2025 data, lack of long-term trend analysis.

7.2.2 Limitations at the model level

Explainability challenge: complex models have poor interpretability.
Cold start problem: low prediction accuracy for new users and new products.

7.2.3 Challenges at the implementation level

Organizational acceptance: It takes time for business departments to accept data-driven decision-making.
Data quality dependence: the model effect is highly dependent on the quality of data acquisition.

7.3 Future research direction

7.3.1 Academic research direction

Exploration of new algorithms: the application of graph neural networks in customer relationship analysis; the application of causal inference methods in marketing effect evaluation; multitask learning framework optimization prediction efficiency.
Methodological innovation: small sample learning to solve cold start problems; can explain AI to improve model transparency.

7.3.2 Industry application direction

Deepening in vertical fields: tariff and logistics risk forecast of cross-border e-commerce; inventory management and loss forecast of fresh e-commerce; customer identity verification and fraud detection of luxury e-commerce.
Horizontal capacity expansion: customer life cycle value prediction; customer churn warning and retention strategy; social media emotional analysis and brand reputation management.

7.4 Conclusion

This project has successfully built a double prediction model of the purchasing behavior of e-commerce customers through systematic data analysis and modeling. At the technical level, we have verified the effectiveness of machine learning methods in return forecasting and sales forecasting; at the business level, we have provided quantifiable value enhancement suggestions. The core value of this study lies not only in the high performance of the model itself, but also in the establishment of a reusable analysis framework and data-driven decision-making culture. With the increasingly fierce competition in the e-commerce industry, data-driven refined operation will become the core competitiveness of enterprises.

Appendices

A. Group Member Contributions

Group 8 Member Contributions
Student_ID	Full_Name	Main_Role	Key_Responsibilities	Technical_Contributions
25065361	JIA XINPING	Project Leader	Overall project management and coordination	Project framework design, final report compilation
24087829	GUO MINGHUANG	Model Engineer	Regression model development	Feature engineering, Linear/Random Forest models
25062039	SUO ZIHANG	Model Engineer	Classification model development	Class imbalance handling, XGBoost development
25066095	LIANG BIYING	EDA Specialist	Exploratory data analysis and visualization	EDA, feature exploration, pattern identification
25061222	ZHAO ZHIYU	Data Cleaning Specialist	Data preprocessing and cleaning	Data querying, preprocessing, outlier handling

B. Code Explanation

B.1 Data Preprocessing and Cleaning Code

B.1.1 Data Loading and Type Conversion

Load raw CSV data and convert variables to appropriate types： df <- read.csv(“synthetic_ecommerce_sales_2025.csv”, stringsAsFactors = FALSE) df$order_date <- as.Date(df$order_date) df$product_category <- as.factor(df$product_category)

Purpose: Load raw CSV data and convert string variables to appropriate formats
Key Points: a.stringsAsFactors = FALSE: Prevents automatic conversion of strings to factors, retaining control over data types b.as.Date(): Converts date strings to R date objects for time series analysis c.as.factor(): Converts categorical variables to factors for proper modeling variable types

B.1.2 Outlier Detection and Handling

Identify and remove price outliers (above 99th percentile)： upper_limit <- quantile(df_clean$product_price, 0.99) df_clean <- df_clean %>% filter(product_price <= upper_limit)

Purpose: Identify and remove price outliers (values above 99th percentile)
Key Points: a.Uses quantile method rather than fixed thresholds, adapting to data distribution b.Removes only high-end outliers while preserving normal price variations

B.2 Exploratory Data Analysis Code

B.2.1 Return Rate Analysis

Calculate return statistics by product category： category_return <- df_clean %>% group_by(product_category) %>% summarise( Total = n(), Return_Count = sum(is_returned), Return_Rate = round(Return_Count/Total*100,2) )

Purpose: Calculate return metrics by product category
Key Points: Uses dplyr pipe operations for improved code readability sum(is_returned): Directly sums 0/1 encoded values to count returns Provides grouped statistics as foundation for subsequent visualizations

B.2.2 Correlation Heatmap

Calculate and visualize correlation matrix： corr_matrix <- cor(corr_df) corrplot::corrplot(corr_matrix, method = “color”, type = “upper”)

Purpose: Calculate and visualize correlation coefficients between numerical features
Key Points: a.cor(): Computes Pearson correlation coefficient matrix b.corrplot(): Visualizes correlation strengths using color coding c.type = “upper”: Displays only upper triangle of matrix to avoid redundancy

B.3 Model Building Code

B.3.1 XGBoost Classification Model

Data preparation for XGBoost： train_matrix <- sparse.model.matrix(is_returned ~ . -1, data = train_df) dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)

Model training with class weighting： params <- list( objective = “binary:logistic”, scale_pos_weight = weight_ratio ) xgb_model <- xgb.train(params = params, data = dtrain, nrounds = 100)

Purpose: Build binary classification model for return prediction
Key Points: a.sparse.model.matrix(): Converts categorical variables to sparse matrix format for memory efficiency b.scale_pos_weight: Core parameter for handling class imbalance (ratio of negative/positive samples) c.xgb.DMatrix: XGBoost’s optimized data storage format

B.3.2 Random Forest Regression Model

Build and save random forest regression model： rf_model <- randomForest(revenue ~ ., data = train_set, ntree = 100, importance = TRUE) saveRDS(rf_model, “final_rf_model.rds”)

Purpose: Build regression model for order revenue prediction
Key Points: a.ntree = 100: Balances model performance with training time b.importance = TRUE: Computes feature importance for subsequent analysis c.saveRDS(): Saves trained model object to avoid repeated training

B.4 Model Evaluation Code

B.4.1 Classification Model Evaluation

Comprehensive classification model evaluation： confusionMatrix(as.factor(xgb_pred), as.factor(test_label), positive = “1”)

Purpose: Calculate comprehensive evaluation metrics for classification model
Key Points: a.Automatically computes 10+ metrics including accuracy, sensitivity, specificity b.positive = “1”: Specifies “returned” as positive class for correct metric calculation c.Provides statistical significance tests (Kappa coefficient, McNemar test)

B.4.2 Regression Model Evaluation

Custom function for regression model metrics： calc_metrics <- function(actual, predicted) { rmse <- sqrt(mean((actual - predicted)^2)) mae <- mean(abs(actual - predicted)) return(c(RMSE = round(rmse, 2), MAE = round(mae, 2))) } 1) Purpose: Calculate error metrics for regression models 2) Key Points: a.Custom function enhances code reusability b.RMSE (Root Mean Square Error): Penalizes large errors, reflects prediction precision c.MAE (Mean Absolute Error): Intuitive understanding of average error magnitude

B.5 Feature Importance Analysis

B.5.1 XGBoost Feature Importance

Extract feature importance from XGBoost model： importance_matrix <- xgb.importance(colnames(train_matrix), model = xgb_model) 1) Purpose: Extract feature importance from XGBoost model 2) Key Points: a.Based on “Gain” metric, reflecting feature contribution in splits b.Automatically handles sparse matrix feature names c.Provides ranked feature importance, identifying key influencing factors

B.5.2 Random Forest Feature Importance

Extract feature importance from Random Forest model： imp_data <- as.data.frame(importance(rf_model))

Purpose: Extract feature importance from Random Forest model
Key Points: a.Based on %IncMSE (Percentage Increase in Mean Squared Error) metric b.Reflects feature contribution to model prediction accuracy c.Useful for feature selection and business interpretation

B.6 Application Example Code

B.6.1 New Order Revenue Prediction

Example: Predict revenue for a new order： new_order_input <- data.frame( product_category = “Electronics”, product_price = 1200.50, quantity = 2, region = “North America” ) pred_value <- predict(rf_model, newdata = final_input_data)

Purpose: Use trained model to predict revenue for new orders
Key Points: a.Creates new data frame with same structure as training data b.Uses predict() function for prediction c.Outputs specific numerical predictions to support business decisions

B.7 Code Architecture Characteristics

Modular Design: Each code block has clear functionality, facilitating maintenance and reuse
Pipe Operations: Extensive use of %>% operator improves code readability
Function Encapsulation: Repeats operations are encapsulated as functions to reduce code redundancy
Error Handling: Key steps include data validation and exception handling
Complete Documentation: Important code sections include explanatory comments

C. References

Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. https://doi.org/10.1145/2939672.2939785
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324
Akku, E. (2025). Synthetic E-Commerce Sales Dataset (2025). Kaggle. Retrieved from https://www.kaggle.com/datasets/emirhanakku/synthetic-e-commerce-sales-dataset-2025
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., … & Yutani, H. (2019). Welcome to the Tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Kuhn, M. (2008). Building Predictive Models in R Using the caret Package. Journal of Statistical Software, 28(5), 1-26. https://doi.org/10.18637/jss.v028.i05
R Core Team (2024). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R Markdown: The Definitive Guide. Chapman & Hall/CRC. https://bookdown.org/yihui/rmarkdown/

E-commerce Customer Purchase Behavior Analysis and Sales Forecast Report

Group 8

2025-12-29