1 1. Introduction

This report presents the data cleaning and exploratory data analysis (EDA) process for the synthetic e-commerce sales dataset. The analysis focuses on two main business problems:

Classification problem: return behavior warning, using is_returned as the target variable.
Regression problem: sales quantity prediction, using quantity as the target variable.

The purpose of the data cleaning stage is to prepare a reliable processed dataset by handling missing values, correcting data types, identifying abnormal records, checking revenue consistency, and removing duplicate records. The EDA stage is used to understand return patterns, sales quantity distribution, category-level performance, regional differences, and possible relationships among numerical variables.

2 2. Load Packages and Data

library(tidyverse)
library(lubridate)
library(skimr)
library(corrplot)
library(ggplot2)
library(dplyr)

# Important for RPubs:
# Put this Rmd file and synthetic_ecommerce_sales_2025.csv in the same folder before knitting.
# Avoid using local absolute paths such as C:/Users/... because RPubs cannot access them.

file_path <- "synthetic_ecommerce_sales_2025.csv"
df_raw <- read.csv(file_path, stringsAsFactors = FALSE)

glimpse(df_raw)

## Rows: 100,000
## Columns: 13
## $ order_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id      <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <chr> "Beauty", "Fashion", "Beauty", "Electronics", "Fashio…
## $ product_price    <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity         <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date       <chr> "2023-02-21", "2023-10-13", "2023-06-28", "2023-07-11…
## $ region           <chr> "Europe", "North America", "Oceania", "Europe", "Afri…
## $ payment_method   <chr> "BankTransfer", "CreditCard", "Cash", "PayPal", "PayP…
## $ delivery_days    <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned      <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating  <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <int> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue          <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…

skim(df_raw)

Data summary
Name	df_raw
Number of rows	100000
Number of columns	13
_______________________
Column type frequency:
character	5
numeric	8
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
customer_id	1	36	36	100000
product_category	1	4	11	7
order_date	1	10	10	1096
region	1	4	13	6
payment_method	1	4	12	4

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
order_id	1	50000.50	28867.66	1.00	25000.75	50000.50	75000.25	100000.00	▇▇▇▇▇
product_price	1	250.96	141.74	4.52	128.39	251.43	372.27	500.00	▇▇▇▇▇
quantity	1	3.09	1.44	1.00	2.00	3.00	4.00	6.00	▇▅▅▅▁
delivery_days	1	4.98	2.58	1.00	3.00	5.00	7.00	9.00	▇▇▃▇▇
is_returned	1	0.06	0.24	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁
customer_rating	1	3.50	0.87	2.00	2.80	3.50	4.20	5.00	▇▇▇▇▇
discount_percent	1	5.01	6.14	0.00	0.00	0.00	10.00	20.00	▇▃▂▂▁
revenue	1	734.15	571.19	4.26	274.11	585.55	1089.86	2699.14	▇▅▂▁▁

3 3. Data Cleaning

3.1 3.1 Check Missing Values

missing_summary <- colSums(is.na(df_raw))
missing_summary

##         order_id      customer_id product_category    product_price 
##                0                0                0                0 
##         quantity       order_date           region   payment_method 
##                0                0                0                0 
##    delivery_days      is_returned  customer_rating discount_percent 
##                0                0                0                0 
##          revenue 
##                0

3.2 3.2 Remove Columns with More Than 50% Missing Values

high_na <- names(missing_summary[missing_summary > 0.5 * nrow(df_raw)])

if(length(high_na) > 0) {
  df_clean <- df_raw %>% select(-all_of(high_na))
} else {
  df_clean <- df_raw
}

high_na

## character(0)

3.3 3.3 Convert Data Types

The following columns are converted into suitable data types:

order_date as Date
is_returned, product_category, region, and payment_method as factor variables
quantity and delivery_days as integer variables
product_price, discount_percent, customer_rating, and revenue as numeric variables, if available

df_clean <- df_clean %>%
  mutate(
    order_date = as.Date(order_date, format = "%Y-%m-%d"),
    is_returned = as.factor(is_returned),
    product_category = as.factor(product_category),
    region = as.factor(region),
    payment_method = as.factor(payment_method),
    quantity = as.integer(quantity),
    product_price = as.numeric(product_price),
    discount_percent = as.numeric(discount_percent),
    delivery_days = as.integer(delivery_days),
    customer_rating = as.numeric(customer_rating)
  )

if("revenue" %in% names(df_clean)) {
  df_clean$revenue <- as.numeric(df_clean$revenue)
}

glimpse(df_clean)

## Rows: 100,000
## Columns: 13
## $ order_id         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id      <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <fct> Beauty, Fashion, Beauty, Electronics, Fashion, Beauty…
## $ product_price    <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity         <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date       <date> 2023-02-21, 2023-10-13, 2023-06-28, 2023-07-11, 2023…
## $ region           <fct> Europe, North America, Oceania, Europe, Africa, Ocean…
## $ payment_method   <fct> BankTransfer, CreditCard, Cash, PayPal, PayPal, PayPa…
## $ delivery_days    <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned      <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating  <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <dbl> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue          <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…

3.4 3.4 Handle Missing Values

For numerical variables, missing values are replaced with the median. For factor variables, missing values are replaced with the mode.

# Fill numerical missing values using median
numeric_cols <- df_clean %>% select(where(is.numeric)) %>% names()

for(col in numeric_cols) {
  if(any(is.na(df_clean[[col]]))) {
    df_clean[[col]][is.na(df_clean[[col]])] <- median(df_clean[[col]], na.rm = TRUE)
  }
}

# Fill factor missing values using mode
factor_cols <- df_clean %>% select(where(is.factor)) %>% names()

for(col in factor_cols) {
  if(any(is.na(df_clean[[col]]))) {
    mode_val <- names(sort(table(df_clean[[col]]), decreasing = TRUE))[1]
    df_clean[[col]][is.na(df_clean[[col]])] <- mode_val
  }
}

colSums(is.na(df_clean))

##         order_id      customer_id product_category    product_price 
##                0                0                0                0 
##         quantity       order_date           region   payment_method 
##                0                0                0                0 
##    delivery_days      is_returned  customer_rating discount_percent 
##                0                0                0                0 
##          revenue 
##                0

3.5 3.5 Flag Outliers Based on Business Rules

Outliers are flagged instead of immediately removed, because abnormal values may still provide useful business information. The following rules are used:

quantity > 100 or quantity <= 0
product_price > 10000 or product_price <= 0
delivery_days > 30 or delivery_days < 0
discount_percent > 100 or discount_percent < 0
customer_rating > 5 or customer_rating < 1

df_clean <- df_clean %>%
  mutate(
    is_quantity_outlier = if_else(quantity > 100 | quantity <= 0, 1, 0),
    is_price_outlier = if_else(product_price > 10000 | product_price <= 0, 1, 0),
    is_delivery_outlier = if_else(delivery_days > 30 | delivery_days < 0, 1, 0),
    is_discount_outlier = if_else(discount_percent > 100 | discount_percent < 0, 1, 0),
    is_rating_outlier = if_else(customer_rating > 5 | customer_rating < 1, 1, 0),
    is_any_outlier = pmax(
      is_quantity_outlier,
      is_price_outlier,
      is_delivery_outlier,
      is_discount_outlier,
      is_rating_outlier
    )
  )

outlier_summary <- df_clean %>%
  summarise(
    quantity_outliers = sum(is_quantity_outlier),
    price_outliers = sum(is_price_outlier),
    delivery_outliers = sum(is_delivery_outlier),
    discount_outliers = sum(is_discount_outlier),
    rating_outliers = sum(is_rating_outlier),
    any_outliers = sum(is_any_outlier)
  )

outlier_summary

##   quantity_outliers price_outliers delivery_outliers discount_outliers
## 1                 0              0                 0                 0
##   rating_outliers any_outliers
## 1               0            0

3.6 3.6 Check Revenue Consistency

If the dataset contains a revenue column, revenue is recalculated using:

\[ Revenue = Product\ Price \times Quantity \times (1 - Discount\ Percent / 100) \]

Rows are kept when the difference between original revenue and calculated revenue is less than 0.01.

if("revenue" %in% names(df_clean)) {
  rows_before_revenue_check <- nrow(df_clean)
  
  df_clean <- df_clean %>%
    mutate(revenue_calc = product_price * quantity * (1 - discount_percent / 100)) %>%
    filter(abs(revenue - revenue_calc) < 0.01) %>%
    select(-revenue_calc)
  
  rows_after_revenue_check <- nrow(df_clean)
  revenue_removed <- rows_before_revenue_check - rows_after_revenue_check
  
  revenue_removed
}

## [1] 0

3.7 3.7 Remove Duplicate Rows

rows_before_duplicates <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
rows_after_duplicates <- nrow(df_clean)
duplicate_rows_removed <- rows_before_duplicates - rows_after_duplicates

duplicate_rows_removed

## [1] 0

3.8 3.8 Save Cleaned Dataset

output_path <- "ecommerce_cleaned.csv"
write.csv(df_clean, output_path, row.names = FALSE)

cat("Cleaned data saved to:", output_path, "\n")

## Cleaned data saved to: ecommerce_cleaned.csv

cat("Cleaned dataset dimensions:", nrow(df_clean), "rows and", ncol(df_clean), "columns\n")

## Cleaned dataset dimensions: 100000 rows and 19 columns

4 4. Exploratory Data Analysis

4.1 4.1 Dataset Summary After Cleaning

dim(df_clean)

## [1] 100000     19

summary(df_clean)

##     order_id      customer_id           product_category product_price    
##  Min.   :     1   Length:100000      Automotive :14239   Min.   :  4.518  
##  1st Qu.: 25001   Class :character   Beauty     :14234   1st Qu.:128.387  
##  Median : 50001   Mode  :character   Electronics:14375   Median :251.430  
##  Mean   : 50001                      Fashion    :14327   Mean   :250.963  
##  3rd Qu.: 75000                      Home       :14182   3rd Qu.:372.270  
##  Max.   :100000                      Sports     :14354   Max.   :500.000  
##                                      Toys       :14289                    
##     quantity       order_date                   region     
##  Min.   :1.000   Min.   :2023-01-01   Africa       :16450  
##  1st Qu.:2.000   1st Qu.:2023-10-02   Asia         :16763  
##  Median :3.000   Median :2024-07-05   Europe       :16513  
##  Mean   :3.085   Mean   :2024-07-02   North America:16749  
##  3rd Qu.:4.000   3rd Qu.:2025-04-04   Oceania      :16965  
##  Max.   :6.000   Max.   :2025-12-31   South America:16560  
##                                                            
##       payment_method  delivery_days   is_returned customer_rating
##  BankTransfer:25083   Min.   :1.000   0:93940     Min.   :2.0    
##  Cash        :24710   1st Qu.:3.000   1: 6060     1st Qu.:2.8    
##  CreditCard  :25222   Median :5.000               Median :3.5    
##  PayPal      :24985   Mean   :4.985               Mean   :3.5    
##                       3rd Qu.:7.000               3rd Qu.:4.2    
##                       Max.   :9.000               Max.   :5.0    
##                                                                  
##  discount_percent    revenue        is_quantity_outlier is_price_outlier
##  Min.   : 0.000   Min.   :   4.26   Min.   :0           Min.   :0       
##  1st Qu.: 0.000   1st Qu.: 274.11   1st Qu.:0           1st Qu.:0       
##  Median : 0.000   Median : 585.55   Median :0           Median :0       
##  Mean   : 5.015   Mean   : 734.15   Mean   :0           Mean   :0       
##  3rd Qu.:10.000   3rd Qu.:1089.86   3rd Qu.:0           3rd Qu.:0       
##  Max.   :20.000   Max.   :2699.14   Max.   :0           Max.   :0       
##                                                                         
##  is_delivery_outlier is_discount_outlier is_rating_outlier is_any_outlier
##  Min.   :0           Min.   :0           Min.   :0         Min.   :0     
##  1st Qu.:0           1st Qu.:0           1st Qu.:0         1st Qu.:0     
##  Median :0           Median :0           Median :0         Median :0     
##  Mean   :0           Mean   :0           Mean   :0         Mean   :0     
##  3rd Qu.:0           3rd Qu.:0           3rd Qu.:0         3rd Qu.:0     
##  Max.   :0           Max.   :0           Max.   :0         Max.   :0     
##

skim(df_clean)

Data summary
Name	df_clean
Number of rows	100000
Number of columns	19
_______________________
Column type frequency:
character	1
Date	1
factor	4
numeric	13
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
customer_id	0	1	36	36	0	1e+05	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
order_date	0	1	2023-01-01	2025-12-31	2024-07-05	1096

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
product_category	1	FALSE	7	Ele: 14375, Spo: 14354, Fas: 14327, Toy: 14289
region	1	FALSE	6	Oce: 16965, Asi: 16763, Nor: 16749, Sou: 16560
payment_method	1	FALSE	4	Cre: 25222, Ban: 25083, Pay: 24985, Cas: 24710
is_returned	1	FALSE	2	0: 93940, 1: 6060

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
order_id	1	50000.50	28867.66	1.00	25000.75	50000.50	75000.25	100000.00	▇▇▇▇▇
product_price	1	250.96	141.74	4.52	128.39	251.43	372.27	500.00	▇▇▇▇▇
quantity	1	3.09	1.44	1.00	2.00	3.00	4.00	6.00	▇▅▅▅▁
delivery_days	1	4.98	2.58	1.00	3.00	5.00	7.00	9.00	▇▇▃▇▇
customer_rating	1	3.50	0.87	2.00	2.80	3.50	4.20	5.00	▇▇▇▇▇
discount_percent	1	5.01	6.14	0.00	0.00	0.00	10.00	20.00	▇▃▂▂▁
revenue	1	734.15	571.19	4.26	274.11	585.55	1089.86	2699.14	▇▅▂▁▁
is_quantity_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
is_price_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
is_delivery_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
is_discount_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
is_rating_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁
is_any_outlier	1	0.00	0.00	0.00	0.00	0.00	0.00	0.00	▁▁▇▁▁

4.2 4.2 Univariate Analysis

4.2.1 Return Distribution

ggplot(df_clean, aes(x = is_returned, fill = is_returned)) +
  geom_bar() +
  labs(
    title = "Distribution of Returns",
    x = "Returned",
    y = "Count"
  ) +
  theme_minimal()

The return distribution shows whether the dataset is balanced or imbalanced for the classification task. If returned orders represent a small proportion of total orders, this indicates class imbalance and should be considered during model development.

4.2.2 Quantity Distribution

ggplot(df_clean, aes(x = quantity)) +
  geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
  labs(
    title = "Distribution of Sales Quantity",
    x = "Quantity",
    y = "Frequency"
  ) +
  theme_minimal()

This chart shows the typical order quantity pattern and helps identify whether most transactions are concentrated within a narrow quantity range.

4.2.3 Product Price Distribution

ggplot(df_clean, aes(x = product_price)) +
  geom_histogram(bins = 40, fill = "darkgreen", alpha = 0.7) +
  scale_x_log10() +
  labs(
    title = "Product Price Distribution (Log Scale)",
    x = "Product Price (Log Scale)",
    y = "Frequency"
  ) +
  theme_minimal()

The log scale is used because product prices are often right-skewed, with many low-to-medium priced products and fewer high-priced products.

5 5. Bivariate Analysis for Classification Problem: Return Behavior

5.1 5.1 Delivery Days by Return Status

ggplot(df_clean, aes(x = is_returned, y = delivery_days, fill = is_returned)) +
  geom_boxplot() +
  labs(
    title = "Delivery Days by Return Status",
    x = "Returned",
    y = "Delivery Days"
  ) +
  theme_minimal()

This chart helps evaluate whether longer delivery time is associated with a higher probability of return.

5.2 5.2 Return Proportion by Discount Level

df_clean %>%
  mutate(
    discount_group = cut(
      discount_percent,
      breaks = c(-Inf, 5, 15, 30, Inf),
      labels = c("0-5%", "5-15%", "15-30%", "30%+"),
      right = TRUE
    )
  ) %>%
  ggplot(aes(x = discount_group, fill = is_returned)) +
  geom_bar(position = "fill") +
  labs(
    title = "Return Proportion by Discount Level",
    x = "Discount Level",
    y = "Proportion"
  ) +
  theme_minimal()

This visualisation is used to observe whether high-discount orders have a different return pattern compared with low-discount orders.

5.3 5.3 Customer Rating Density by Return Status

ggplot(df_clean, aes(x = customer_rating, fill = is_returned)) +
  geom_density(alpha = 0.5) +
  labs(
    title = "Customer Rating Density by Return Status",
    x = "Customer Rating",
    y = "Density"
  ) +
  theme_minimal()

This density plot helps compare customer satisfaction levels between returned and non-returned orders.

6 6. Bivariate Analysis for Regression Problem: Sales Quantity

6.1 6.1 Product Price vs Sales Quantity

ggplot(df_clean, aes(x = product_price, y = quantity)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Product Price vs Sales Quantity",
    x = "Product Price",
    y = "Quantity"
  ) +
  theme_minimal()

This scatter plot is used to examine whether product price has a linear relationship with order quantity.

6.2 6.2 Average Quantity by Region

region_qty <- df_clean %>%
  group_by(region) %>%
  summarise(avg_quantity = mean(quantity), .groups = "drop")

region_qty

## # A tibble: 6 × 2
##   region        avg_quantity
##   <fct>                <dbl>
## 1 Africa                3.10
## 2 Asia                  3.10
## 3 Europe                3.08
## 4 North America         3.08
## 5 Oceania               3.08
## 6 South America         3.08

ggplot(region_qty, aes(x = reorder(region, avg_quantity), y = avg_quantity, fill = region)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Average Sales Quantity by Region",
    x = "Region",
    y = "Average Quantity"
  ) +
  theme_minimal()

This analysis shows regional differences in sales quantity and can support region-specific stock planning and marketing strategies.

6.3 6.3 Discount vs Quantity by Return Status

ggplot(df_clean, aes(x = discount_percent, y = quantity, color = is_returned)) +
  geom_point(alpha = 0.5) +
  geom_smooth(se = FALSE, method = "lm") +
  labs(
    title = "Discount vs Quantity by Return Status",
    x = "Discount Percent",
    y = "Quantity"
  ) +
  theme_minimal()

This plot explores whether discount percentage is associated with higher order quantity and whether the relationship differs by return status.

7 7. Correlation Matrix

numeric_vars <- df_clean %>% select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")

corrplot(
  cor_matrix,
  method = "number",
  type = "upper",
  tl.cex = 0.8,
  number.cex = 0.7
)

The correlation matrix provides an overview of linear relationships among numerical variables. It is useful for identifying highly related variables and possible predictors for regression modelling.

8 8. Time Trend Analysis

monthly_trend <- df_clean %>%
  group_by(month = floor_date(order_date, "month")) %>%
  summarise(
    total_quantity = sum(quantity),
    return_rate = mean(as.numeric(is_returned) - 1, na.rm = TRUE),
    .groups = "drop"
  )

monthly_trend

## # A tibble: 36 × 3
##    month      total_quantity return_rate
##    <date>              <int>       <dbl>
##  1 2023-01-01           8615      0.0548
##  2 2023-02-01           7410      0.0604
##  3 2023-03-01           8398      0.0561
##  4 2023-04-01           8013      0.0636
##  5 2023-05-01           8532      0.0610
##  6 2023-06-01           8395      0.0618
##  7 2023-07-01           8480      0.0623
##  8 2023-08-01           8694      0.0585
##  9 2023-09-01           8190      0.0545
## 10 2023-10-01           8617      0.0600
## # ℹ 26 more rows

monthly_trend %>%
  pivot_longer(
    cols = c(total_quantity, return_rate),
    names_to = "metric",
    values_to = "value"
  ) %>%
  ggplot(aes(x = month, y = value)) +
  geom_line(color = "steelblue") +
  facet_wrap(~metric, scales = "free_y") +
  labs(
    title = "Monthly Trends: Total Quantity and Return Rate",
    x = "Month",
    y = "Value"
  ) +
  theme_minimal()

The time trend analysis helps identify seasonal patterns in sales quantity and whether return rates fluctuate across months.

9 9. Category-Level Insights

category_return <- df_clean %>%
  group_by(product_category) %>%
  summarise(
    return_rate = mean(as.integer(is_returned) - 1, na.rm = TRUE),
    total_sales = sum(quantity * product_price * (1 - discount_percent / 100), na.rm = TRUE),
    avg_quantity = mean(quantity, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(desc(return_rate))

category_return

## # A tibble: 7 × 4
##   product_category return_rate total_sales avg_quantity
##   <fct>                  <dbl>       <dbl>        <dbl>
## 1 Fashion               0.122    10497301.         3.07
## 2 Automotive            0.0527   10575483.         3.11
## 3 Electronics           0.0513   10488699.         3.09
## 4 Home                  0.0511   10369461.         3.09
## 5 Sports                0.0505   10557716.         3.07
## 6 Toys                  0.0490   10553990.         3.09
## 7 Beauty                0.0478   10372151.         3.08

ggplot(category_return, aes(x = reorder(product_category, return_rate), y = return_rate, fill = product_category)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Return Rate by Product Category",
    x = "Product Category",
    y = "Return Rate"
  ) +
  theme_minimal()

This category-level analysis identifies product categories with relatively high return rates. These categories may require further investigation, such as reviewing product quality, product descriptions, sizing information, or customer expectations.

10 10. Key Findings

overall_return_rate <- mean(as.numeric(df_clean$is_returned) - 1, na.rm = TRUE) * 100
avg_quantity <- mean(df_clean$quantity, na.rm = TRUE)
top_3_high_return_categories <- head(category_return %>% arrange(desc(return_rate)), 3)
best_region <- region_qty$region[which.max(region_qty$avg_quantity)]
delivery_return_correlation <- cor(
  as.numeric(df_clean$delivery_days),
  as.numeric(df_clean$is_returned) - 1,
  use = "complete.obs"
)

cat("Overall return rate:", round(overall_return_rate, 2), "%\n")

## Overall return rate: 6.06 %

cat("Average quantity per order:", round(avg_quantity, 2), "\n")

## Average quantity per order: 3.09

cat("Correlation between delivery days and return:", round(delivery_return_correlation, 4), "\n")

## Correlation between delivery days and return: -0.0058

cat("Best region for average sales quantity:", as.character(best_region), "\n")

## Best region for average sales quantity: Africa

cat("\nTop 3 high-return categories:\n")

## 
## Top 3 high-return categories:

top_3_high_return_categories

## # A tibble: 3 × 4
##   product_category return_rate total_sales avg_quantity
##   <fct>                  <dbl>       <dbl>        <dbl>
## 1 Fashion               0.122    10497301.         3.07
## 2 Automotive            0.0527   10575483.         3.11
## 3 Electronics           0.0513   10488699.         3.09

Based on the EDA, the dataset provides useful information for both return behavior classification and sales quantity prediction. Return-related analysis suggests that variables such as delivery days, discount level, customer rating, and product category may be useful predictors for identifying return risk. For sales quantity prediction, product price, discount percentage, region, month, and product category may provide useful explanatory power.

11 11. Conclusion

The data cleaning process successfully prepared the raw e-commerce sales data for further analysis by checking missing values, converting data types, imputing missing values, flagging outliers, validating revenue consistency, and removing duplicate records. The EDA results provide initial insights into return behavior and sales quantity patterns.

For the classification objective, is_returned can be used as the target variable to develop a return behavior warning model. For the regression objective, quantity can be used as the target variable to predict sales quantity. The cleaned dataset can be used in the next stage for model training, evaluation, and business interpretation.

E-commerce Sales Data Cleaning and Exploratory Analysis

Group4

2026-06-06