| NAME | MATRIC NUMBER | ROLE |
|---|---|---|
| ONG CHEN YAN | 25093281 | Data Analysis |
| QIN ZHONG YI | 25084810 | Data Cleaning |
| CEN LINZE | 25089511 | Modelling |
| YANG PEIPEI | 24229716 | EDA |
| ZHANG WENQIAN | 25076646 | Evaluation |
In the modern e-commerce and logistics landscape, customer retention
is directly tied to operational efficiency. When customers have a
fantastic experience and leave positive feedback, they become brand
advocates, which is often the main reason a company rises above its
competitors. To ensure a smooth logistic processing, predicting package
weight is essential to accurately calculate shipping costs, forecast
fulfilment space, and prevent unexpected carrier fees while classifying
shipment on time rate is the promising to the customer for receiving
their parcel on time. It is vital for a company to highlight package
weight forecasting and shipment on time monitoring to maintain a quality
shipment management and prevent customer churn. This project focused on
leveraging machine learning to predict package weight as well as to
classify the shipment on time rate based on historical shipping data,
courier performance, and customer feedback features and is targeted to
improve company management meanwhile reduce the rate of customer churn.
The questions that going to be answered upon the completion of the
project are:
1. How to predict package weight based on E-Commerce
shipping data?
2. How to classify the shipment on time rate using
machine learning model?
3. What is performance of the regression and
classification model?
The answers that going to be discovered from the dataset
included:
1. The most critical factor that determine package
weight.
2. The important features used to classify the shipment on
time rate.
3. The performance of regression model in predicting
package weight.
4. The performance of classification model in
predicting shipment on time rate.
The objectives of this project included:
1. To predict package
weight based on E-Commerce shipping data.
2. To classify shipment
on time rate using machine learning model.
3. To evaluate the
performance of the regression and classification model.
Author: ONG CHEN YAN
The dataset used in this project is obtained from online resources
Kaggle. The title of the dataset is “E-Commerce Shipping Data” which is
basically a customer database from electronics products company. This
dataset was published in the year of 2021 with the goal to discover key
insights from customer database. The dataset consists of 11499
observations and 12 variables. The variables in the dataset included:
1. ID: ID Number of Customers.
2.
Warehouse block: The Company have big Warehouse which
is divided in to block such as A,B,C,D,E.
3. Mode of
shipment:The Company Ships the products in multiple way such as
Ship, Flight and Road.
4. Customer care calls: The
number of calls made from enquiry for enquiry of the shipment.
5.
Customer rating: The company has rated from every
customer. 1 is the lowest (Worst), 5 is the highest (Best).
6.
Cost of the product: Cost of the Product in US
Dollars.
7. Prior purchases: The Number of Prior
Purchase.
8. Product importance: The company has
categorized the product in the various parameter such as low, medium,
high.
9. Gender: Male and Female.
10.
Discount offered: Discount offered on that specific
product.
11. Weight in gms: It is the weight in
grams.
12. Reached on time: It is the target
variable, where 1 Indicates that the product has NOT reached on time and
0 indicates it has reached on time.
Author: QIN ZHONGYI
This section performs an initial inspection of the raw dataset to check dimensions and identify data anomalies such as missing values, special characters, and formatting errors.
This step imports the raw CSV dataset into the R environment and calculates the total dimensions (rows × columns) to verify if it meets the minimum requirement of 100,000 data points.
# Load the raw polluted dataset
raw_data <- read.csv("Dataset/Highly_Polluted_Dataset.csv")
# Calculate dimensions
n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
total_dim <- n_rows * n_cols
# Output results
print(paste("Total Rows:", n_rows))[1] "Total Rows: 11499"
[1] "Total Columns: 12"
[1] "Total Dimension: 137988"
# Automated check
if(total_dim > 100000) {
print("Status: Requirement Met (>100,000)")
} else {
print("Status: Warning - Requirement Not Met")
}[1] "Status: Requirement Met (>100,000)"
This step outputs the first six rows of the dataset to observe the raw data structure and identify obvious formatting issues, such as symbols embedded in numerical columns.
ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating
1 1 D Flight 4 NA
2 2 4 5
3 3 A Flight 2 2
4 4 B Flight 3 3
5 5 C Flight 2 2
6 6 F flyght 3 1
Cost_of_the_Product Prior_purchases Product_importance Gender
1 177 3 low FEMALE
2 216 2 low M
3 $183 4 low M
4 176 4 medium M
5 184 3 medium F
6 3 medium FEMALE
Discount_offered Weight_in_gms Reached.on.Time_Y.N
1 44 1233.0 1
2 59 3088.0 1
3 48 3374.0 1
4 10 1177.0 1
5 46 2484.0 1
6 12 1
The glimpse() function is used to perform a detailed
inspection of the data types assigned to each variable. This step is
critical for identifying which numerical columns are incorrectly stored
as ‘character’ strings due to the presence of non-numeric “pollution”
(e.g., currency symbols and units). Identifying these discrepancies now
informs our cleaning strategy in Step 2B.
Rows: 11,499
Columns: 12
$ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Warehouse_block <chr> "D", "", "A", "B", "C", "F", "D", "f ", "A", "b", …
$ Mode_of_Shipment <chr> "Flight", "", "Flight", "Flight", "Flight", "flygh…
$ Customer_care_calls <chr> "4", "4", "2", "3", "2", "3", "3", "4", "3", "3", …
$ Customer_rating <dbl> NA, 5, 2, 3, 2, 1, 4, 1, 4, 2, NA, 5, 5, 4, 3, 3, …
$ Cost_of_the_Product <chr> "177", "216", "$183", "176", "184", "", "250", "23…
$ Prior_purchases <dbl> 3, 2, 4, 4, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2,…
$ Product_importance <chr> "low", "low", "low", "medium", "medium", "medium",…
$ Gender <chr> "FEMALE", "M", "M", "M", "F", "FEMALE", "F", "F", …
$ Discount_offered <dbl> 44, 59, 48, 10, 46, 12, 3, 48, 11, 29, 12, 32, 1, …
$ Weight_in_gms <chr> "1233.0", "3088.0", "3374.0", "1177.0", "2484.0", …
$ Reached.on.Time_Y.N <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …
This step temporarily converts implicit missing values (e.g., empty
strings, “?”, “N/A”) to standard NA format and uses an
UpSet plot to visualize the distribution and intersection of missing
data across variables.
# 1. Creating a temporary standardized view for accurate detection ONLY
missing_viz_data <- raw_data %>%
mutate(across(everything(), as.character)) %>%
mutate(across(everything(), trimws)) %>%
mutate(across(everything(), ~case_when(
. == "" ~ NA_character_,
grepl("^(?i)(none|null|n/a|\\?|nan)$", .) ~ NA_character_,
TRUE ~ .
)))
# 2. Comprehensive Statistical Summary based on the true missingness
cat("Actual Missingness Profile:\n")Actual Missingness Profile:
# A tibble: 10 × 3
variable n_miss pct_miss
<chr> <int> <num>
1 Warehouse_block 575 5.00
2 Mode_of_Shipment 575 5.00
3 Customer_care_calls 575 5.00
4 Customer_rating 575 5.00
5 Cost_of_the_Product 575 5.00
6 Prior_purchases 575 5.00
7 Product_importance 575 5.00
8 Gender 575 5.00
9 Discount_offered 575 5.00
10 Reached.on.Time_Y.N 575 5.00
# 3. Advanced Visualization (UpSet Plot)
gg_miss_upset(missing_viz_data,
nsets = 10,
main.bar.color = "#2c3e50",
sets.bar.color = "#e74c3c")This subsection identifies and quantifies specific data anomalies to guide the subsequent cleaning steps.
We first investigate the features that should be strictly numerical,
such as Cost_of_the_Product and Weight_in_gms.
Preliminary inspection suggests that these fields contain currency
symbols ($), units (g/grams), and other non-numeric suffixes. These
“pollutants” force R to treat these variables as character strings,
preventing direct statistical computation.
# Function to extract unique non-numeric characters from a column
detect_non_numeric <- function(column_data) {
symbols <- str_extract_all(as.character(column_data), "[^0-9.]+") %>%
unlist() %>% unique() %>% na.omit()
return(paste(symbols, collapse = ", "))
}
# Counting affected rows for cost and weight
cost_polluted <- sum(str_detect(as.character(raw_data$Cost_of_the_Product), "[^0-9.]+"), na.rm = TRUE)
weight_polluted <- sum(str_detect(as.character(raw_data$Weight_in_gms), "[^0-9.]+"), na.rm = TRUE)
# Summary table for symbols
symbol_summary <- data.frame(
Feature = c("Cost of the Product", "Weight in gms"),
Detected_Symbols = c(detect_non_numeric(raw_data$Cost_of_the_Product),
detect_non_numeric(raw_data$Weight_in_gms)),
Number = c(cost_polluted, weight_polluted)
)
knitr::kable(symbol_summary, col.names = c("Feature", "Detected Symbols", "Number"))| Feature | Detected Symbols | Number |
|---|---|---|
| Cost of the Product | $ | 1120 |
| Weight in gms | nang, g, - | 685 |
This step calculates the frequencies of categorical variables to identify misspelled labels, inconsistent casing, and unstandardized values.
# Identifying Shipment anomalies (Excluding standard: Flight, Ship, Road)
shipment_anomalies <- raw_data %>%
count(Mode_of_Shipment, name = "Number") %>%
filter(!Mode_of_Shipment %in% c("Flight", "Ship", "Road"))
# Identifying Gender anomalies (Excluding standard: F, M)
gender_anomalies <- raw_data %>%
count(Gender, name = "Number") %>%
filter(!Gender %in% c("F", "M"))
# Combining results for a unified presentation
label_errors <- bind_rows(
mutate(shipment_anomalies, Feature = "Mode of Shipment") %>% rename(Value = Mode_of_Shipment),
mutate(gender_anomalies, Feature = "Gender") %>% rename(Value = Gender)
) %>% select(Feature, Value, Number)
# Displaying the summary table
knitr::kable(label_errors, col.names = c("Feature ", "Error value name ", "Number"))| Feature | Error value name | Number |
|---|---|---|
| Mode of Shipment | 575 | |
| Mode of Shipment | Flight | 155 |
| Mode of Shipment | FLIGHT | 156 |
| Mode of Shipment | ROAD | 182 |
| Mode of Shipment | S.h.i.p | 759 |
| Mode of Shipment | flyght | 135 |
| Mode of Shipment | road | 171 |
| Mode of Shipment | ship | 711 |
| Gender | 575 | |
| Gender | FEMALE | 561 |
| Gender | Female | 534 |
| Gender | MALE | 572 |
| Gender | Male | 537 |
| Gender | f | 548 |
| Gender | m | 530 |
This step scans the ID column to identify and count
duplicate primary keys, excluding empty strings and missing values.
# 1. Removing true NAs AND empty/whitespace strings
duplicate_report <- raw_data %>%
filter(!is.na(ID) & trimws(as.character(ID)) != "") %>%
filter(duplicated(ID) | duplicated(ID, fromLast = TRUE)) %>%
count(ID, name = "Number") %>%
arrange(desc(Number))
# 2. Calculating the total sum BEFORE subsetting
total_duplicate_rows <- sum(duplicate_report$Number)
# 3. Generating the summary table for the top 10 only
knitr::kable(head(duplicate_report, 10), col.names = c("Error value name (ID)", "Number"))| Error value name (ID) | Number |
|---|
# 4. Displaying the final full-dataset count summary at the bottom
cat("\n--------------------------------------------------\n")
--------------------------------------------------
Total unique numerical IDs with duplicates: 0
Total rows affected by duplicates: 0
This step filters the ordinal variables (e.g.,
Customer_rating) to identify values that fall outside the
defined standard range (1 to 5).
# Detecting values in Customer_rating that are not within the standard 1-5 range
rating_anomalies <- raw_data %>%
count(Customer_rating, name = "Number") %>%
filter(!Customer_rating %in% c("1", "2", "3", "4", "5")) %>%
arrange(desc(Number))
# Generating the summary table
knitr::kable(rating_anomalies, col.names = c("Error values", "Number"))| Error values | Number |
|---|---|
| NA | 575 |
This step counts the occurrences of specific special characters
(e.g., *, #, @, !!)
across the entire dataset to profile the distribution of data noise.
# Function to count the occurrences of specific noise characters across all columns
count_noise_pattern <- function(char_pattern) {
raw_data %>%
filter(if_any(everything(), ~str_detect(as.character(.), fixed(char_pattern)))) %>%
nrow()
}
# Generating the distribution profile of identified noise characters
noise_profile_data <- data.frame(
Error_Type = c("Asterisk (*)", "Hashtag (#)", "At Symbol (@)", "Double Bang (!!)"),
Number = c(count_noise_pattern("*"), count_noise_pattern("#"),
count_noise_pattern("@"), count_noise_pattern("!!"))
)
# Displaying the noise distribution table
knitr::kable(noise_profile_data, col.names = c("Noise", "Number"))| Noise | Number |
|---|---|
| Asterisk (*) | 0 |
| Hashtag (#) | 0 |
| At Symbol (@) | 0 |
| Double Bang (!!) | 0 |
Author: QIN ZHONG YI
This section sequentially resolves the anomalies identified in Part 2A. The cleaning process includes deduplication, noise character removal, data type casting, categorical standardization, and missing value imputation.
This step removes non-numeric characters from the ID
column, converts it to a numeric format, and filters out duplicate
records based on the unique ID to ensure row-level data
integrity.
library(dplyr)
library(stringr)
# 1. Record initial state
rows_before <- nrow(raw_data)
# 2. Clean ID column and deduplicate
cleaned_data <- raw_data %>%
# Strip everything except numbers and decimals from ID
mutate(ID = as.numeric(str_remove_all(as.character(ID), "[^0-9.]"))) %>%
# Remove completely blank/NA IDs, then deduplicate
filter(!is.na(ID)) %>%
distinct(ID, .keep_all = TRUE)
cat("Rows before:", rows_before, "\nRows after Step 1:", nrow(cleaned_data), "\n")Rows before: 11499
Rows after Step 1: 11499
This step utilizes regular expressions to search and remove specific non-numeric characters (e.g., currency symbols, metric units, and special symbols) across all columns, followed by trimming whitespace.
# 1. Defining the universal noise pattern (covering all identified pollutants)
noise_pattern <- "\\$|USD|g|grams|\\*|#|@|!!"
# 2. Executing global noise stripping and whitespace trimming
cleaned_data <- cleaned_data %>%
mutate(across(everything(), ~str_remove_all(as.character(.), noise_pattern))) %>%
mutate(across(everything(), trimws))
cat("Universal noise stripping completed successfully.\n")Universal noise stripping completed successfully.
This step converts various textual representations of missing data
(e.g., “?”, “None”, “N/A”, empty strings) into standard NA
values. Subsequently, variables representing numeric features are
explicitly cast to the numeric data type.
# 1. Standardize hidden textual NAs into system NAs
cleaned_data <- cleaned_data %>%
mutate(across(everything(), ~na_if(., ""))) %>%
mutate(across(everything(), ~case_when(
grepl("^(?i)(none|null|nan|n/a|na|\\?)$", .) ~ NA_character_,
TRUE ~ .
)))
# 2. Force numeric type casting for mathematical columns
numeric_columns <- c("Cost_of_the_Product", "Weight_in_gms", "Customer_rating",
"Customer_care_calls", "Prior_purchases", "Discount_offered",
"Reached.on.Time_Y.N")
cleaned_data <- cleaned_data %>%
mutate(
across(all_of(numeric_columns), as.numeric),
Weight_in_gms = abs(Weight_in_gms)
)
cat("Type casting completed. Hidden strings converted to true NAs.\n")Type casting completed. Hidden strings converted to true NAs.
This step corrects spelling errors and standardizes labels for
categorical variables (e.g., Mode_of_Shipment,
Gender, Product_importance) using pattern
matching, and unifies the text casing to ensure categorical
consistency.
# 1. Execute deterministic recoding with fuzzy logic
cleaned_data <- cleaned_data %>%
mutate(
# Robust alignment for Mode_of_Shipment
Mode_of_Shipment = case_when(
grepl("(?i)fly|flight|fliht", Mode_of_Shipment) ~ "Flight",
grepl("(?i)s\\.?h\\.?i\\.?p", Mode_of_Shipment) ~ "Ship",
grepl("(?i)road", Mode_of_Shipment) ~ "Road",
TRUE ~ Mode_of_Shipment
),
# Robust alignment for Gender
Gender = case_when(
grepl("(?i)^f", Gender) ~ "F",
grepl("(?i)^m", Gender) ~ "M",
TRUE ~ Gender
),
# Robust alignment & casing for Product_importance (Fixing "hih")
Product_importance = case_when(
grepl("(?i)hih|high", Product_importance) ~ "high",
TRUE ~ tolower(Product_importance)
),
# Standardize textual casing and remove noise strings for Warehouse_block
Warehouse_block = case_when(
grepl("(?i)warehouse_a", Warehouse_block) ~ "A",
grepl("(?i)d_block", Warehouse_block) ~ "D",
TRUE ~ toupper(Warehouse_block)
)
)
cat("Categorical recoding successful. Distinct Product Importance Levels:",
paste(unique(na.omit(cleaned_data$Product_importance)), collapse=", "), "\n")Categorical recoding successful. Distinct Product Importance Levels: low, medium, high
This final step imputes remaining NA values using the
median for continuous numerical variables and the mode for categorical
variables. The cleaned dataset is then audited for zero missingness and
exported to a CSV file for downstream analysis.
# 1. Define a robust mode calculation function
calc_mode <- function(x) {
ux <- unique(na.omit(x))
ux[which.max(tabulate(match(x, ux)))]
}
# 2. Execute Universal Imputation Strategy
cleaned_data <- cleaned_data %>%
mutate(
# Micro-Adjustment: Catch any literal "NA" or "nan" strings disguised as text
Warehouse_block = ifelse(toupper(as.character(Warehouse_block)) %in% c("NA", "NAN", ""), NA_character_, Warehouse_block),
# Median Imputation for numerical features
Cost_of_the_Product = coalesce(Cost_of_the_Product, median(Cost_of_the_Product, na.rm = TRUE)),
Weight_in_gms = coalesce(Weight_in_gms, median(Weight_in_gms, na.rm = TRUE)),
Customer_rating = coalesce(Customer_rating, median(Customer_rating, na.rm = TRUE)),
Customer_care_calls = coalesce(Customer_care_calls, median(Customer_care_calls, na.rm = TRUE)),
Prior_purchases = coalesce(Prior_purchases, median(Prior_purchases, na.rm = TRUE)),
Discount_offered = coalesce(Discount_offered, median(Discount_offered, na.rm = TRUE)),
# Mode Imputation for categorical/binary features
Reached.on.Time_Y.N = coalesce(Reached.on.Time_Y.N, calc_mode(Reached.on.Time_Y.N)),
Warehouse_block = coalesce(Warehouse_block, calc_mode(Warehouse_block)),
Mode_of_Shipment = coalesce(Mode_of_Shipment, calc_mode(Mode_of_Shipment)),
Product_importance = coalesce(Product_importance, calc_mode(Product_importance)),
Gender = coalesce(Gender, calc_mode(Gender))
)
# 3. Final Integrity Audit
cat("\n--- Final Integrity Audit ---\n")
--- Final Integrity Audit ---
Total Missing Values: 0
Final Dataset Dimensions: 11499 rows x 12 columns
--- Final Cleaned Data Preview ---
| ID | Warehouse_block | Mode_of_Shipment | Customer_care_calls | Customer_rating | Cost_of_the_Product | Prior_purchases | Product_importance | Gender | Discount_offered | Weight_in_gms | Reached.on.Time_Y.N |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | D | Flight | 4 | 3 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 2 | F | Ship | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |
| 6 | F | Flight | 3 | 1 | 216 | 3 | medium | F | 12 | 4145 | 1 |
| 7 | D | Flight | 3 | 4 | 250 | 3 | low | F | 3 | 2371 | 1 |
| 8 | F | Flight | 4 | 1 | 233 | 2 | low | F | 48 | 2804 | 1 |
| 9 | A | Flight | 3 | 4 | 150 | 3 | low | F | 11 | 1861 | 1 |
| 10 | B | Flight | 3 | 2 | 164 | 3 | medium | F | 29 | 1187 | 1 |
# 5. Exporting the final "Gold Standard" Dataset
export_path <- "Dataset/Cleaned_Dataset.csv"
write.csv(cleaned_data, file = export_path, row.names = FALSE)
cat("\nSUCCESS: Dataset exported to:\n", export_path, "\n")
SUCCESS: Dataset exported to:
Dataset/Cleaned_Dataset.csv
Author: YANG PEIPEI
In this section, a comprehensive Exploratory Data Analysis (EDA) is
conducted on the cleaned dataset. The primary goal of this phase is not
just to summarize the data, but to uncover underlying patterns,
distributions, and relationships. These insights will be applied to
logically motivate our two upcoming machine learning objectives: 1.
Regression Task: Predicting package weight
(Weight_in_gms). 2. Classification Task:
Predicting shipment on-time rate (Reached.on.Time_Y.N).
First, individual variables is analyzed to understand their fundamental distributions. This step mainly helps to identify skewness, distinct clusters, and the baseline distribution of the target variables.
The distribution of package weight, cost, and discount offered is observed.
p1 <- ggplot(eda_data, aes(x = as.numeric(Customer_rating))) +
geom_histogram(bins = 5, fill = "skyblue", color = "black", alpha = 0.8) +
theme_minimal() + labs(title = "Distribution of Customer Rating", x = "Customer Rating", y = "Count")
p2 <- ggplot(eda_data, aes(x = Cost_of_the_Product)) +
geom_histogram(bins = 30, fill = "lightgreen", color = "black", alpha = 0.8) +
theme_minimal() + labs(title = "Distribution of Product Cost", x = "Cost (USD)", y = "Count")
p3 <- ggplot(eda_data, aes(x = as.numeric(Product_importance))) +
geom_histogram(bins = 5, fill = "salmon", color = "black", alpha = 0.8) +
theme_minimal() + labs(title = "Distribution of Product Importance", x = "Product Importance 1(Low), 2(Medium), 3(High)", y = "Count")
grid.arrange(p1, p2, p3, ncol = 2)
Conclusion:
- Customer_rating
distribution showed that there is no bias in the dataset to certain
group of customer rating
- Cost_of_the_Product is
right-skewed due to several extreme high-cost observations.
-
Product_importance distribution showed that low importance
products are the most common, medium importance products are the second
largest group while high importance products are relatively
rare.
Understanding the distribution of the two modeling targets is essential before building predictive models.
t1 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, fill = Reached.on.Time_Y.N)) +
geom_bar(color = "black", alpha = 0.8) +
scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
theme_minimal() + labs(title = "Shipment On-Time Status", x = "0 = On Time, 1 = Delayed", y = "Count") +
theme(legend.position = "none")
t2 <- ggplot(eda_data, aes(x =Weight_in_gms)) +
geom_histogram(bins = 30, fill = "salmon", color = "black", alpha = 0.8) +
theme_minimal() + labs(title = "Package Weight Distribution", x = "Package Weight", y = "Count")
grid.arrange(t1, t2, ncol = 2)
Conclusion:
- For On-Time Status
(Reached.on.Time_Y.N), class 1
(Delayed) has a noticeably higher count than class 0
(On-Time). This slight class imbalance is crucial context for our
classification model evaluation later.
- For Package
Weight(Weight_in_gms), it displays a clear bimodal
distribution, suggesting two distinct types of weight range (lightweight
vs. heavy).
In this section, the interaction of predictor variables with target variables is analyze. This step is critical because observing distinct patterns here justifies the necessity of building robust machine learning models.
The factors influencing that influenced Weight_in_gmsis
explored.
# Package Weight vs Product Importance
r1 <- ggplot(eda_data, aes(x = Weight_in_gms, y = Product_importance)) +
geom_boxplot(alpha = 0.6) +
theme_minimal() + labs(title = "Package Weight vs Product Importance", y = "Product Importance") + theme(legend.position = "none")
# Package Weight vs Product Cost
r2 <- ggplot(eda_data, aes(x = Weight_in_gms, y = Cost_of_the_Product)) +
geom_point(alpha = 0.4, color = "purple") +
theme_minimal() + labs(title = "Package Weight vs Product Cost", y = "Product Cost")
# Package Weight vs Mode of Shipment
max_package_weight_mode <- eda_data %>%
group_by(Mode_of_Shipment) %>%
summarise(
Max_Weight = max(Weight_in_gms, na.rm = TRUE)
)
r3 <- ggplot(max_package_weight_mode,
aes(x = Mode_of_Shipment,
y = Max_Weight,
fill = Mode_of_Shipment)) +
geom_col() +
theme_minimal() +
labs(
title = "Maximum Package Weight by Mode of Shipment",
x = "Mode of Shipment",
y = "Maximum Package Weight (g)"
)
grid.arrange(r1, r2, r3, ncol = 2)Conclusion for Regression Motivation:
1.
Multiple Interactions: The factors influencing package
weight appear to be multifaceted. It closely related to product costs,
mode_of_Shipment and product importance.
2. Subtle
Trends: The product which is high important have larger package
weight. Besides, ship can delivered package with the highest maximum
weight followed by road and flight.
Actionable
Takeaway: Product weight is driven by complex interactions
among the available predictors. This necessitates a robust
Multivariate Regression Model to synthesize these features and
accurately predict the expected package weight.
The variables which strongly distinguish between on-time deliveries and delayed deliveries is explored.
# Weight vs. On-Time
b1 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, y = Weight_in_gms, fill = Reached.on.Time_Y.N)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
theme_minimal() + labs(title = "Weight vs Delivery Status", x = "Status (0: On-time, 1: Delayed)") +
theme(legend.position = "none")
# Discount vs. On-Time
b2 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, y = Discount_offered, fill = Reached.on.Time_Y.N)) +
geom_boxplot(alpha = 0.7) +
scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
theme_minimal() + labs(title = "Discount vs Delivery Status", x = "Status") +
theme(legend.position = "none")
# Product Importance vs. On-Time
b3 <- ggplot(eda_data, aes(x = Product_importance, fill = Reached.on.Time_Y.N)) +
geom_bar(position = "fill", color = "black") +
scale_fill_manual(values = c("#2ecc71", "#e74c3c"), name="Status") +
theme_minimal() + labs(title = "Importance vs Delivery Status", y = "Proportion")
grid.arrange(b1, b2, b3, ncol = 2)Conclusion for Classification Motivation:
1.
Weight and discount show visible differences across delivery
status: The boxplots reveal useful patterns. Packages that
weigh between 2000g and 4000g are more prone to delays. Furthermore,
packages with high discount offers are almost exclusively delayed
(Target=1).
2. Categorical features also matter:
The proportion chart shows that high importance products
actually suffer from a slightly higher proportion of delays.
Actionable Takeaway: Because multiple variables
exhibit non-linear and overlapping effects on delivery delays, simple
heuristics won’t suffice.
This motivates the use of Classification Models such
as Logistic Regression, Linear Discriminant Analysis, and Classification
Tree to map these feature interactions and predict delays.
Finally, the correlation matrix for all continuous numerical features is computed. This ensures any multicollinearity is awared before passing these features into the models.
# Select only numeric continuous columns for correlation
numeric_data <- eda_data %>%
dplyr::select(Customer_care_calls, Cost_of_the_Product, Prior_purchases, Discount_offered, Weight_in_gms)
# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")
# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black",
tl.srt = 45, addCoef.col = "black", number.cex = 0.8,
col = colorRampPalette(c("#e74c3c", "white", "#3498db"))(200),
title = "Correlation Matrix of Numerical Features",
mar = c(0,0,1,0))Conclusion:
- Cost_of_the_Product
has very weak correlations with the other numerical variables,
suggesting that product cost might not be a strong standalone predictor.
- Overall, there are no extremely high correlations (e.g., > 0.8)
between the independent variables. This confirms the absence of
severe multicollinearity, meaning we can safely use all these
numerical features simultaneously in both our Classification and
Regression models without causing data leakage.
Through rigorous EDA, we have established the following: 1. The data distributions have distinct characteristics, notably the bimodal weight distribution and right-skewed discount distribution. 2. Variables like weight and discount are strong indicators of shipment delays, providing an excellent foundation for the Classification task. 3. Package weight maintains complex variance influenced by various shipment features, justifying the use of a Regression model to predict the package weight. 4. The numerical features are mostly independent of one another, clearing the way for smooth model training.
With the data thoroughly explored and modeling motivations clearly justified, the next part which is Part 4: Predictive Modeling can be proceeded.
Author: CEN LINZE
This section builds two supervised machine learning tasks based on the EDA findings:
Weight_in_gms.Reached.on.Time_Y.N.# Load cleaned dataset again for modelling to ensure correct data types
model_data <- read.csv("Dataset/Cleaned_Dataset.csv")
model_data <- model_data %>%
mutate(
Warehouse_block = as.factor(Warehouse_block),
Mode_of_Shipment = as.factor(Mode_of_Shipment),
Product_importance = as.factor(Product_importance),
Gender = as.factor(Gender),
Reached.on.Time_Y.N = factor(Reached.on.Time_Y.N, levels = c(0, 1)),
Customer_rating = as.numeric(Customer_rating),
Customer_care_calls = as.numeric(Customer_care_calls),
Cost_of_the_Product = as.numeric(Cost_of_the_Product),
Prior_purchases = as.numeric(Prior_purchases),
Discount_offered = as.numeric(Discount_offered),
Weight_in_gms = as.numeric(Weight_in_gms)
)
set.seed(7004)
train_index <- sample(seq_len(nrow(model_data)), size = 0.7 * nrow(model_data))
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]
cat("Training set:", nrow(train_data), "rows\n")## Training set: 8049 rows
## Testing set: 3450 rows
Three regression models are trained and compared:
reg_formula <- Weight_in_gms ~ Warehouse_block + Mode_of_Shipment +
Customer_care_calls + Cost_of_the_Product + Prior_purchases +
Product_importance + Gender + Discount_offered + Customer_rating +
Reached.on.Time_Y.N
# Model 1: Multiple Linear Regression
lm_model <- lm(reg_formula, data = train_data)
# Model 2: Polynomial Linear Regression
poly_lm_model <- lm(
Weight_in_gms ~ Warehouse_block + Mode_of_Shipment +
Customer_care_calls + poly(Cost_of_the_Product, 2) +
Prior_purchases + Product_importance + Gender +
Discount_offered + Customer_rating + Reached.on.Time_Y.N,
data = train_data
)
# Model 3: Regression Tree
reg_tree_model <- rpart(
reg_formula,
data = train_data,
method = "anova",
control = rpart.control(cp = 0.001)
)
pred_lm <- predict(lm_model, newdata = test_data)
pred_poly_lm <- predict(poly_lm_model, newdata = test_data)
pred_reg_tree <- predict(reg_tree_model, newdata = test_data)
# Weight cannot be negative
clip_weight <- function(x) {
pmax(x, 0)
}
pred_lm <- clip_weight(pred_lm)
pred_poly_lm <- clip_weight(pred_poly_lm)
pred_reg_tree <- clip_weight(pred_reg_tree)Three classification models are trained and compared:
class_formula <- Reached.on.Time_Y.N ~ Warehouse_block + Mode_of_Shipment +
Customer_care_calls + Customer_rating + Cost_of_the_Product + Prior_purchases +
Product_importance + Gender + Discount_offered + Weight_in_gms
# Model 1: Logistic Regression
logistic_model <- glm(
class_formula,
data = train_data,
family = binomial
)
# Model 2: Linear Discriminant Analysis
lda_model <- MASS::lda(class_formula, data = train_data)
# Model 3: Classification Tree
class_tree_model <- rpart(
class_formula,
data = train_data,
method = "class",
control = rpart.control(cp = 0.001)
)
# Logistic Regression Predictions
logistic_prob <- predict(logistic_model, newdata = test_data, type = "response")
pred_logistic <- factor(ifelse(logistic_prob >= 0.5, "1", "0"), levels = c("0", "1"))
# LDA Predictions
lda_output <- predict(lda_model, newdata = test_data)
pred_lda <- factor(lda_output$class, levels = c("0", "1"))
lda_prob <- lda_output$posterior[, "1"]
# Classification Tree Predictions
tree_prob_matrix <- predict(class_tree_model, newdata = test_data, type = "prob")
class_tree_prob <- tree_prob_matrix[, "1"]
pred_class_tree <- factor(ifelse(class_tree_prob >= 0.5, "1", "0"), levels = c("0", "1"))Author: ZHANG WENQIAN
The regression models are evaluated using R-Squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
regression_metrics <- function(actual, predicted) {
sse <- sum((actual - predicted)^2)
sst <- sum((actual - mean(actual))^2)
data.frame(
R_Squared = 1 - sse / sst,
MAE = mean(abs(actual - predicted)),
MSE = mean((actual - predicted)^2),
RMSE = sqrt(mean((actual - predicted)^2))
)
}
regression_results <- bind_rows(
cbind(Model = "Multiple Linear Regression",
regression_metrics(test_data$Weight_in_gms, pred_lm)),
cbind(Model = "Polynomial Linear Regression",
regression_metrics(test_data$Weight_in_gms, pred_poly_lm)),
cbind(Model = "Regression Tree",
regression_metrics(test_data$Weight_in_gms, pred_reg_tree))
)
regression_results <- regression_results %>%
mutate(across(where(is.numeric), ~round(.x, 4)))
knitr::kable(
regression_results,
caption = "Regression Model Performance for Package Weight Prediction"
)| Model | R_Squared | MAE | MSE | RMSE |
|---|---|---|---|---|
| Multiple Linear Regression | 0.2576 | 1147.5766 | 1900353 | 1378.533 |
| Polynomial Linear Regression | 0.2576 | 1147.3718 | 1900252 | 1378.496 |
| Regression Tree | 0.5787 | 732.8797 | 1078478 | 1038.498 |
best_regression_model <- regression_results$Model[which.min(regression_results$RMSE)]
cat("Best regression model based on RMSE:", best_regression_model)## Best regression model based on RMSE: Regression Tree
Based on RMSE, the best regression model is selected as the model with the lowest prediction error. This is important because inaccurate prediction of package weights may lead to poor warehouse resource allocation and logistical planning errors.
The classification models are evaluated using Accuracy, Precision, Recall (Sensitivity), F1-Score, and AUC. Since delayed shipments are the main concern, recall and F1-score are more useful when choosing the final classification model.
calculate_auc <- function(actual, probability, positive_class = "1") {
actual_binary <- ifelse(actual == positive_class, 1, 0)
pos <- probability[actual_binary == 1]
neg <- probability[actual_binary == 0]
if (length(pos) == 0 | length(neg) == 0) return(NA)
auc_value <- (sum(rank(c(pos, neg))[1:length(pos)]) -
length(pos) * (length(pos) + 1) / 2) / (length(pos) * length(neg))
return(auc_value)
}
classification_metrics <- function(actual, predicted, probability, model_name, positive_class = "1") {
actual <- factor(actual, levels = c("0", "1"))
predicted <- factor(predicted, levels = c("0", "1"))
tp <- sum(actual == positive_class & predicted == positive_class)
tn <- sum(actual != positive_class & predicted != positive_class)
fp <- sum(actual != positive_class & predicted == positive_class)
fn <- sum(actual == positive_class & predicted != positive_class)
accuracy <- (tp + tn) / (tp + tn + fp + fn)
precision <- ifelse((tp + fp) == 0, 0, tp / (tp + fp))
recall <- ifelse((tp + fn) == 0, 0, tp / (tp + fn))
f1_score <- ifelse((precision + recall) == 0, 0,
2 * precision * recall / (precision + recall))
auc <- calculate_auc(actual, probability, positive_class)
data.frame(
Model = model_name,
Accuracy = round(accuracy, 4),
Precision = round(precision, 4),
Recall = round(recall, 4),
F1_Score = round(f1_score, 4),
AUC = round(auc, 4)
)
}
classification_results <- bind_rows(
classification_metrics(test_data$Reached.on.Time_Y.N, pred_logistic, logistic_prob, "Logistic Regression"),
classification_metrics(test_data$Reached.on.Time_Y.N, pred_lda, lda_prob, "Linear Discriminant Analysis"),
classification_metrics(test_data$Reached.on.Time_Y.N, pred_class_tree, class_tree_prob, "Classification Tree")
)
knitr::kable(
classification_results,
caption = "Classification Model Performance for Shipment Delay Prediction"
)| Model | Accuracy | Precision | Recall | F1_Score | AUC |
|---|---|---|---|---|---|
| Logistic Regression | 0.6362 | 0.6899 | 0.8135 | 0.7466 | 0.6832 |
| Linear Discriminant Analysis | 0.6499 | 0.6837 | 0.8720 | 0.7664 | 0.6951 |
| Classification Tree | 0.6339 | 0.7153 | 0.7382 | 0.7266 | 0.7047 |
best_class_model <- classification_results$Model[which.max(classification_results$F1_Score)]
cat("Best classification model based on F1-score:", best_class_model, "\n\n")## Best classification model based on F1-score: Linear Discriminant Analysis
if (best_class_model == "Logistic Regression") {
print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_logistic))
} else if (best_class_model == "Linear Discriminant Analysis") {
print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_lda))
} else {
print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_class_tree))
}## Predicted
## Actual 0 1
## 0 260 917
## 1 291 1982
classification_results %>%
pivot_longer(cols = c(Accuracy, Precision, Recall, F1_Score, AUC),
names_to = "Metric", values_to = "Value") %>%
ggplot(aes(x = Model, y = Value, fill = Model)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ Metric) +
coord_cartesian(ylim = c(0, 1)) +
theme_minimal() +
labs(title = "Classification Model Comparison",
x = "Model", y = "Score") +
theme(axis.text.x = element_text(angle = 15, hjust = 1))The best classification model is chosen based on its F1-Score to effectively balance the trade-off between precision and recall, ensuring that the company accurately identifies as many delayed packages as possible without an excessive false alarm rate.
Based on results obtained, Linear Discriminant Analysis has the best performance for shipment on time classification while Regression Tree has the best performance for package weight prediction. For package weight, it influences by multiple factors. One of the factors is product importance where product with high importance are typically larger weight. For shipment on time rate, it showed that product with weight 2000g to 4000g as well as product with high discount rate usually having delay shipment. Future works are recommended such as to obtain more features like dimension of package, quantity of parcel per shipment, shipment distance to increase the data accuracy.