This report presents the data cleaning and exploratory data analysis (EDA) process for the synthetic e-commerce sales dataset. The analysis focuses on two main business problems:
is_returned as the target variable.quantity as the target variable.The purpose of the data cleaning stage is to prepare a reliable processed dataset by handling missing values, correcting data types, identifying abnormal records, checking revenue consistency, and removing duplicate records. The EDA stage is used to understand return patterns, sales quantity distribution, category-level performance, regional differences, and possible relationships among numerical variables.
library(tidyverse)
library(lubridate)
library(skimr)
library(corrplot)
library(ggplot2)
library(dplyr)
# Important for RPubs:
# Put this Rmd file and synthetic_ecommerce_sales_2025.csv in the same folder before knitting.
# Avoid using local absolute paths such as C:/Users/... because RPubs cannot access them.
file_path <- "synthetic_ecommerce_sales_2025.csv"
df_raw <- read.csv(file_path, stringsAsFactors = FALSE)
glimpse(df_raw)
## Rows: 100,000
## Columns: 13
## $ order_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <chr> "Beauty", "Fashion", "Beauty", "Electronics", "Fashio…
## $ product_price <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date <chr> "2023-02-21", "2023-10-13", "2023-06-28", "2023-07-11…
## $ region <chr> "Europe", "North America", "Oceania", "Europe", "Afri…
## $ payment_method <chr> "BankTransfer", "CreditCard", "Cash", "PayPal", "PayP…
## $ delivery_days <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned <int> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <int> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
skim(df_raw)
| Name | df_raw |
| Number of rows | 100000 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| customer_id | 0 | 1 | 36 | 36 | 0 | 100000 | 0 |
| product_category | 0 | 1 | 4 | 11 | 0 | 7 | 0 |
| order_date | 0 | 1 | 10 | 10 | 0 | 1096 | 0 |
| region | 0 | 1 | 4 | 13 | 0 | 6 | 0 |
| payment_method | 0 | 1 | 4 | 12 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 0 | 1 | 50000.50 | 28867.66 | 1.00 | 25000.75 | 50000.50 | 75000.25 | 100000.00 | ▇▇▇▇▇ |
| product_price | 0 | 1 | 250.96 | 141.74 | 4.52 | 128.39 | 251.43 | 372.27 | 500.00 | ▇▇▇▇▇ |
| quantity | 0 | 1 | 3.09 | 1.44 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▅▅▁ |
| delivery_days | 0 | 1 | 4.98 | 2.58 | 1.00 | 3.00 | 5.00 | 7.00 | 9.00 | ▇▇▃▇▇ |
| is_returned | 0 | 1 | 0.06 | 0.24 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| customer_rating | 0 | 1 | 3.50 | 0.87 | 2.00 | 2.80 | 3.50 | 4.20 | 5.00 | ▇▇▇▇▇ |
| discount_percent | 0 | 1 | 5.01 | 6.14 | 0.00 | 0.00 | 0.00 | 10.00 | 20.00 | ▇▃▂▂▁ |
| revenue | 0 | 1 | 734.15 | 571.19 | 4.26 | 274.11 | 585.55 | 1089.86 | 2699.14 | ▇▅▂▁▁ |
missing_summary <- colSums(is.na(df_raw))
missing_summary
## order_id customer_id product_category product_price
## 0 0 0 0
## quantity order_date region payment_method
## 0 0 0 0
## delivery_days is_returned customer_rating discount_percent
## 0 0 0 0
## revenue
## 0
high_na <- names(missing_summary[missing_summary > 0.5 * nrow(df_raw)])
if(length(high_na) > 0) {
df_clean <- df_raw %>% select(-all_of(high_na))
} else {
df_clean <- df_raw
}
high_na
## character(0)
The following columns are converted into suitable data types:
order_date as Dateis_returned, product_category,
region, and payment_method as factor
variablesquantity and delivery_days as integer
variablesproduct_price, discount_percent,
customer_rating, and revenue as numeric
variables, if availabledf_clean <- df_clean %>%
mutate(
order_date = as.Date(order_date, format = "%Y-%m-%d"),
is_returned = as.factor(is_returned),
product_category = as.factor(product_category),
region = as.factor(region),
payment_method = as.factor(payment_method),
quantity = as.integer(quantity),
product_price = as.numeric(product_price),
discount_percent = as.numeric(discount_percent),
delivery_days = as.integer(delivery_days),
customer_rating = as.numeric(customer_rating)
)
if("revenue" %in% names(df_clean)) {
df_clean$revenue <- as.numeric(df_clean$revenue)
}
glimpse(df_clean)
## Rows: 100,000
## Columns: 13
## $ order_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ customer_id <chr> "bdd640fb-0667-4ad1-9c80-317fa3b1799d", "23b8c1e9-392…
## $ product_category <fct> Beauty, Fashion, Beauty, Electronics, Fashion, Beauty…
## $ product_price <dbl> 190.40, 82.22, 15.19, 310.65, 74.05, 236.05, 471.39, …
## $ quantity <int> 5, 3, 2, 2, 4, 5, 2, 4, 2, 2, 4, 2, 1, 3, 1, 5, 3, 2,…
## $ order_date <date> 2023-02-21, 2023-10-13, 2023-06-28, 2023-07-11, 2023…
## $ region <fct> Europe, North America, Oceania, Europe, Africa, Ocean…
## $ payment_method <fct> BankTransfer, CreditCard, Cash, PayPal, PayPal, PayPa…
## $ delivery_days <int> 8, 5, 6, 9, 3, 5, 5, 3, 6, 4, 8, 9, 8, 5, 2, 8, 7, 3,…
## $ is_returned <fct> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_rating <dbl> 3.8, 3.8, 2.0, 2.9, 3.1, 3.4, 2.7, 4.7, 3.6, 3.8, 4.5…
## $ discount_percent <dbl> 0, 0, 10, 5, 20, 5, 0, 0, 0, 0, 0, 20, 10, 10, 5, 10,…
## $ revenue <dbl> 952.00, 246.66, 27.34, 590.23, 236.96, 1121.24, 942.7…
For numerical variables, missing values are replaced with the median. For factor variables, missing values are replaced with the mode.
# Fill numerical missing values using median
numeric_cols <- df_clean %>% select(where(is.numeric)) %>% names()
for(col in numeric_cols) {
if(any(is.na(df_clean[[col]]))) {
df_clean[[col]][is.na(df_clean[[col]])] <- median(df_clean[[col]], na.rm = TRUE)
}
}
# Fill factor missing values using mode
factor_cols <- df_clean %>% select(where(is.factor)) %>% names()
for(col in factor_cols) {
if(any(is.na(df_clean[[col]]))) {
mode_val <- names(sort(table(df_clean[[col]]), decreasing = TRUE))[1]
df_clean[[col]][is.na(df_clean[[col]])] <- mode_val
}
}
colSums(is.na(df_clean))
## order_id customer_id product_category product_price
## 0 0 0 0
## quantity order_date region payment_method
## 0 0 0 0
## delivery_days is_returned customer_rating discount_percent
## 0 0 0 0
## revenue
## 0
Outliers are flagged instead of immediately removed, because abnormal values may still provide useful business information. The following rules are used:
quantity > 100 or quantity <= 0product_price > 10000 or
product_price <= 0delivery_days > 30 or
delivery_days < 0discount_percent > 100 or
discount_percent < 0customer_rating > 5 or
customer_rating < 1df_clean <- df_clean %>%
mutate(
is_quantity_outlier = if_else(quantity > 100 | quantity <= 0, 1, 0),
is_price_outlier = if_else(product_price > 10000 | product_price <= 0, 1, 0),
is_delivery_outlier = if_else(delivery_days > 30 | delivery_days < 0, 1, 0),
is_discount_outlier = if_else(discount_percent > 100 | discount_percent < 0, 1, 0),
is_rating_outlier = if_else(customer_rating > 5 | customer_rating < 1, 1, 0),
is_any_outlier = pmax(
is_quantity_outlier,
is_price_outlier,
is_delivery_outlier,
is_discount_outlier,
is_rating_outlier
)
)
outlier_summary <- df_clean %>%
summarise(
quantity_outliers = sum(is_quantity_outlier),
price_outliers = sum(is_price_outlier),
delivery_outliers = sum(is_delivery_outlier),
discount_outliers = sum(is_discount_outlier),
rating_outliers = sum(is_rating_outlier),
any_outliers = sum(is_any_outlier)
)
outlier_summary
## quantity_outliers price_outliers delivery_outliers discount_outliers
## 1 0 0 0 0
## rating_outliers any_outliers
## 1 0 0
If the dataset contains a revenue column, revenue is
recalculated using:
\[ Revenue = Product\ Price \times Quantity \times (1 - Discount\ Percent / 100) \]
Rows are kept when the difference between original revenue and calculated revenue is less than 0.01.
if("revenue" %in% names(df_clean)) {
rows_before_revenue_check <- nrow(df_clean)
df_clean <- df_clean %>%
mutate(revenue_calc = product_price * quantity * (1 - discount_percent / 100)) %>%
filter(abs(revenue - revenue_calc) < 0.01) %>%
select(-revenue_calc)
rows_after_revenue_check <- nrow(df_clean)
revenue_removed <- rows_before_revenue_check - rows_after_revenue_check
revenue_removed
}
## [1] 0
rows_before_duplicates <- nrow(df_clean)
df_clean <- df_clean %>% distinct()
rows_after_duplicates <- nrow(df_clean)
duplicate_rows_removed <- rows_before_duplicates - rows_after_duplicates
duplicate_rows_removed
## [1] 0
output_path <- "ecommerce_cleaned.csv"
write.csv(df_clean, output_path, row.names = FALSE)
cat("Cleaned data saved to:", output_path, "\n")
## Cleaned data saved to: ecommerce_cleaned.csv
cat("Cleaned dataset dimensions:", nrow(df_clean), "rows and", ncol(df_clean), "columns\n")
## Cleaned dataset dimensions: 100000 rows and 19 columns
dim(df_clean)
## [1] 100000 19
summary(df_clean)
## order_id customer_id product_category product_price
## Min. : 1 Length:100000 Automotive :14239 Min. : 4.518
## 1st Qu.: 25001 Class :character Beauty :14234 1st Qu.:128.387
## Median : 50001 Mode :character Electronics:14375 Median :251.430
## Mean : 50001 Fashion :14327 Mean :250.963
## 3rd Qu.: 75000 Home :14182 3rd Qu.:372.270
## Max. :100000 Sports :14354 Max. :500.000
## Toys :14289
## quantity order_date region
## Min. :1.000 Min. :2023-01-01 Africa :16450
## 1st Qu.:2.000 1st Qu.:2023-10-02 Asia :16763
## Median :3.000 Median :2024-07-05 Europe :16513
## Mean :3.085 Mean :2024-07-02 North America:16749
## 3rd Qu.:4.000 3rd Qu.:2025-04-04 Oceania :16965
## Max. :6.000 Max. :2025-12-31 South America:16560
##
## payment_method delivery_days is_returned customer_rating
## BankTransfer:25083 Min. :1.000 0:93940 Min. :2.0
## Cash :24710 1st Qu.:3.000 1: 6060 1st Qu.:2.8
## CreditCard :25222 Median :5.000 Median :3.5
## PayPal :24985 Mean :4.985 Mean :3.5
## 3rd Qu.:7.000 3rd Qu.:4.2
## Max. :9.000 Max. :5.0
##
## discount_percent revenue is_quantity_outlier is_price_outlier
## Min. : 0.000 Min. : 4.26 Min. :0 Min. :0
## 1st Qu.: 0.000 1st Qu.: 274.11 1st Qu.:0 1st Qu.:0
## Median : 0.000 Median : 585.55 Median :0 Median :0
## Mean : 5.015 Mean : 734.15 Mean :0 Mean :0
## 3rd Qu.:10.000 3rd Qu.:1089.86 3rd Qu.:0 3rd Qu.:0
## Max. :20.000 Max. :2699.14 Max. :0 Max. :0
##
## is_delivery_outlier is_discount_outlier is_rating_outlier is_any_outlier
## Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
skim(df_clean)
| Name | df_clean |
| Number of rows | 100000 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| Date | 1 |
| factor | 4 |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| customer_id | 0 | 1 | 36 | 36 | 0 | 1e+05 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| order_date | 0 | 1 | 2023-01-01 | 2025-12-31 | 2024-07-05 | 1096 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| product_category | 0 | 1 | FALSE | 7 | Ele: 14375, Spo: 14354, Fas: 14327, Toy: 14289 |
| region | 0 | 1 | FALSE | 6 | Oce: 16965, Asi: 16763, Nor: 16749, Sou: 16560 |
| payment_method | 0 | 1 | FALSE | 4 | Cre: 25222, Ban: 25083, Pay: 24985, Cas: 24710 |
| is_returned | 0 | 1 | FALSE | 2 | 0: 93940, 1: 6060 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| order_id | 0 | 1 | 50000.50 | 28867.66 | 1.00 | 25000.75 | 50000.50 | 75000.25 | 100000.00 | ▇▇▇▇▇ |
| product_price | 0 | 1 | 250.96 | 141.74 | 4.52 | 128.39 | 251.43 | 372.27 | 500.00 | ▇▇▇▇▇ |
| quantity | 0 | 1 | 3.09 | 1.44 | 1.00 | 2.00 | 3.00 | 4.00 | 6.00 | ▇▅▅▅▁ |
| delivery_days | 0 | 1 | 4.98 | 2.58 | 1.00 | 3.00 | 5.00 | 7.00 | 9.00 | ▇▇▃▇▇ |
| customer_rating | 0 | 1 | 3.50 | 0.87 | 2.00 | 2.80 | 3.50 | 4.20 | 5.00 | ▇▇▇▇▇ |
| discount_percent | 0 | 1 | 5.01 | 6.14 | 0.00 | 0.00 | 0.00 | 10.00 | 20.00 | ▇▃▂▂▁ |
| revenue | 0 | 1 | 734.15 | 571.19 | 4.26 | 274.11 | 585.55 | 1089.86 | 2699.14 | ▇▅▂▁▁ |
| is_quantity_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_price_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_delivery_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_discount_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_rating_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
| is_any_outlier | 0 | 1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ▁▁▇▁▁ |
ggplot(df_clean, aes(x = is_returned, fill = is_returned)) +
geom_bar() +
labs(
title = "Distribution of Returns",
x = "Returned",
y = "Count"
) +
theme_minimal()
The return distribution shows whether the dataset is balanced or imbalanced for the classification task. If returned orders represent a small proportion of total orders, this indicates class imbalance and should be considered during model development.
ggplot(df_clean, aes(x = quantity)) +
geom_histogram(bins = 30, fill = "steelblue", alpha = 0.7) +
labs(
title = "Distribution of Sales Quantity",
x = "Quantity",
y = "Frequency"
) +
theme_minimal()
This chart shows the typical order quantity pattern and helps identify whether most transactions are concentrated within a narrow quantity range.
ggplot(df_clean, aes(x = product_price)) +
geom_histogram(bins = 40, fill = "darkgreen", alpha = 0.7) +
scale_x_log10() +
labs(
title = "Product Price Distribution (Log Scale)",
x = "Product Price (Log Scale)",
y = "Frequency"
) +
theme_minimal()
The log scale is used because product prices are often right-skewed, with many low-to-medium priced products and fewer high-priced products.
ggplot(df_clean, aes(x = is_returned, y = delivery_days, fill = is_returned)) +
geom_boxplot() +
labs(
title = "Delivery Days by Return Status",
x = "Returned",
y = "Delivery Days"
) +
theme_minimal()
This chart helps evaluate whether longer delivery time is associated with a higher probability of return.
df_clean %>%
mutate(
discount_group = cut(
discount_percent,
breaks = c(-Inf, 5, 15, 30, Inf),
labels = c("0-5%", "5-15%", "15-30%", "30%+"),
right = TRUE
)
) %>%
ggplot(aes(x = discount_group, fill = is_returned)) +
geom_bar(position = "fill") +
labs(
title = "Return Proportion by Discount Level",
x = "Discount Level",
y = "Proportion"
) +
theme_minimal()
This visualisation is used to observe whether high-discount orders have a different return pattern compared with low-discount orders.
ggplot(df_clean, aes(x = customer_rating, fill = is_returned)) +
geom_density(alpha = 0.5) +
labs(
title = "Customer Rating Density by Return Status",
x = "Customer Rating",
y = "Density"
) +
theme_minimal()
This density plot helps compare customer satisfaction levels between returned and non-returned orders.
ggplot(df_clean, aes(x = product_price, y = quantity)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(
title = "Product Price vs Sales Quantity",
x = "Product Price",
y = "Quantity"
) +
theme_minimal()
This scatter plot is used to examine whether product price has a linear relationship with order quantity.
region_qty <- df_clean %>%
group_by(region) %>%
summarise(avg_quantity = mean(quantity), .groups = "drop")
region_qty
## # A tibble: 6 × 2
## region avg_quantity
## <fct> <dbl>
## 1 Africa 3.10
## 2 Asia 3.10
## 3 Europe 3.08
## 4 North America 3.08
## 5 Oceania 3.08
## 6 South America 3.08
ggplot(region_qty, aes(x = reorder(region, avg_quantity), y = avg_quantity, fill = region)) +
geom_col() +
coord_flip() +
labs(
title = "Average Sales Quantity by Region",
x = "Region",
y = "Average Quantity"
) +
theme_minimal()
This analysis shows regional differences in sales quantity and can support region-specific stock planning and marketing strategies.
ggplot(df_clean, aes(x = discount_percent, y = quantity, color = is_returned)) +
geom_point(alpha = 0.5) +
geom_smooth(se = FALSE, method = "lm") +
labs(
title = "Discount vs Quantity by Return Status",
x = "Discount Percent",
y = "Quantity"
) +
theme_minimal()
This plot explores whether discount percentage is associated with higher order quantity and whether the relationship differs by return status.
numeric_vars <- df_clean %>% select(where(is.numeric))
cor_matrix <- cor(numeric_vars, use = "complete.obs")
corrplot(
cor_matrix,
method = "number",
type = "upper",
tl.cex = 0.8,
number.cex = 0.7
)
The correlation matrix provides an overview of linear relationships among numerical variables. It is useful for identifying highly related variables and possible predictors for regression modelling.
monthly_trend <- df_clean %>%
group_by(month = floor_date(order_date, "month")) %>%
summarise(
total_quantity = sum(quantity),
return_rate = mean(as.numeric(is_returned) - 1, na.rm = TRUE),
.groups = "drop"
)
monthly_trend
## # A tibble: 36 × 3
## month total_quantity return_rate
## <date> <int> <dbl>
## 1 2023-01-01 8615 0.0548
## 2 2023-02-01 7410 0.0604
## 3 2023-03-01 8398 0.0561
## 4 2023-04-01 8013 0.0636
## 5 2023-05-01 8532 0.0610
## 6 2023-06-01 8395 0.0618
## 7 2023-07-01 8480 0.0623
## 8 2023-08-01 8694 0.0585
## 9 2023-09-01 8190 0.0545
## 10 2023-10-01 8617 0.0600
## # ℹ 26 more rows
monthly_trend %>%
pivot_longer(
cols = c(total_quantity, return_rate),
names_to = "metric",
values_to = "value"
) %>%
ggplot(aes(x = month, y = value)) +
geom_line(color = "steelblue") +
facet_wrap(~metric, scales = "free_y") +
labs(
title = "Monthly Trends: Total Quantity and Return Rate",
x = "Month",
y = "Value"
) +
theme_minimal()
The time trend analysis helps identify seasonal patterns in sales quantity and whether return rates fluctuate across months.
category_return <- df_clean %>%
group_by(product_category) %>%
summarise(
return_rate = mean(as.integer(is_returned) - 1, na.rm = TRUE),
total_sales = sum(quantity * product_price * (1 - discount_percent / 100), na.rm = TRUE),
avg_quantity = mean(quantity, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(desc(return_rate))
category_return
## # A tibble: 7 × 4
## product_category return_rate total_sales avg_quantity
## <fct> <dbl> <dbl> <dbl>
## 1 Fashion 0.122 10497301. 3.07
## 2 Automotive 0.0527 10575483. 3.11
## 3 Electronics 0.0513 10488699. 3.09
## 4 Home 0.0511 10369461. 3.09
## 5 Sports 0.0505 10557716. 3.07
## 6 Toys 0.0490 10553990. 3.09
## 7 Beauty 0.0478 10372151. 3.08
ggplot(category_return, aes(x = reorder(product_category, return_rate), y = return_rate, fill = product_category)) +
geom_col() +
coord_flip() +
labs(
title = "Return Rate by Product Category",
x = "Product Category",
y = "Return Rate"
) +
theme_minimal()
This category-level analysis identifies product categories with relatively high return rates. These categories may require further investigation, such as reviewing product quality, product descriptions, sizing information, or customer expectations.
overall_return_rate <- mean(as.numeric(df_clean$is_returned) - 1, na.rm = TRUE) * 100
avg_quantity <- mean(df_clean$quantity, na.rm = TRUE)
top_3_high_return_categories <- head(category_return %>% arrange(desc(return_rate)), 3)
best_region <- region_qty$region[which.max(region_qty$avg_quantity)]
delivery_return_correlation <- cor(
as.numeric(df_clean$delivery_days),
as.numeric(df_clean$is_returned) - 1,
use = "complete.obs"
)
cat("Overall return rate:", round(overall_return_rate, 2), "%\n")
## Overall return rate: 6.06 %
cat("Average quantity per order:", round(avg_quantity, 2), "\n")
## Average quantity per order: 3.09
cat("Correlation between delivery days and return:", round(delivery_return_correlation, 4), "\n")
## Correlation between delivery days and return: -0.0058
cat("Best region for average sales quantity:", as.character(best_region), "\n")
## Best region for average sales quantity: Africa
cat("\nTop 3 high-return categories:\n")
##
## Top 3 high-return categories:
top_3_high_return_categories
## # A tibble: 3 × 4
## product_category return_rate total_sales avg_quantity
## <fct> <dbl> <dbl> <dbl>
## 1 Fashion 0.122 10497301. 3.07
## 2 Automotive 0.0527 10575483. 3.11
## 3 Electronics 0.0513 10488699. 3.09
Based on the EDA, the dataset provides useful information for both return behavior classification and sales quantity prediction. Return-related analysis suggests that variables such as delivery days, discount level, customer rating, and product category may be useful predictors for identifying return risk. For sales quantity prediction, product price, discount percentage, region, month, and product category may provide useful explanatory power.
The data cleaning process successfully prepared the raw e-commerce sales data for further analysis by checking missing values, converting data types, imputing missing values, flagging outliers, validating revenue consistency, and removing duplicate records. The EDA results provide initial insights into return behavior and sales quantity patterns.
For the classification objective, is_returned can be
used as the target variable to develop a return behavior warning model.
For the regression objective, quantity can be used as the
target variable to predict sales quantity. The cleaned dataset can be
used in the next stage for model training, evaluation, and business
interpretation.