Group 6

Group Members:

NAME	MATRIC NUMBER	ROLE
ONG CHEN YAN	25093281	Data Analysis
QIN ZHONG YI	25084810	Data Cleaning
CEN LINZE	25089511	Modelling
YANG PEIPEI	24229716	EDA
ZHANG WENQIAN	25076646	Evaluation

Introduction

In the modern e-commerce and logistics landscape, customer retention is directly tied to operational efficiency. When customers have a fantastic experience and leave positive feedback, they become brand advocates, which is often the main reason a company rises above its competitors. To ensure a smooth logistic processing, predicting package weight is essential to accurately calculate shipping costs, forecast fulfilment space, and prevent unexpected carrier fees while classifying shipment on time rate is the promising to the customer for receiving their parcel on time. It is vital for a company to highlight package weight forecasting and shipment on time monitoring to maintain a quality shipment management and prevent customer churn. This project focused on leveraging machine learning to predict package weight as well as to classify the shipment on time rate based on historical shipping data, courier performance, and customer feedback features and is targeted to improve company management meanwhile reduce the rate of customer churn.

Questions

The questions that going to be answered upon the completion of the project are:
1. How to predict package weight based on E-Commerce shipping data?
2. How to classify the shipment on time rate using machine learning model?
3. What is performance of the regression and classification model?

Answers

The answers that going to be discovered from the dataset included:
1. The most critical factor that determine package weight.
2. The important features used to classify the shipment on time rate.
3. The performance of regression model in predicting package weight.
4. The performance of classification model in predicting shipment on time rate.

Objectives

The objectives of this project included:
1. To predict package weight based on E-Commerce shipping data.
2. To classify shipment on time rate using machine learning model.
3. To evaluate the performance of the regression and classification model.

Methodology

Part 1: Data Collection

Author: ONG CHEN YAN

The dataset used in this project is obtained from online resources Kaggle. The title of the dataset is “E-Commerce Shipping Data” which is basically a customer database from electronics products company. This dataset was published in the year of 2021 with the goal to discover key insights from customer database. The dataset consists of 11499 observations and 12 variables. The variables in the dataset included:

1. ID: ID Number of Customers.
2. Warehouse block: The Company have big Warehouse which is divided in to block such as A,B,C,D,E.
3. Mode of shipment:The Company Ships the products in multiple way such as Ship, Flight and Road.
4. Customer care calls: The number of calls made from enquiry for enquiry of the shipment.
5. Customer rating: The company has rated from every customer. 1 is the lowest (Worst), 5 is the highest (Best).
6. Cost of the product: Cost of the Product in US Dollars.
7. Prior purchases: The Number of Prior Purchase.
8. Product importance: The company has categorized the product in the various parameter such as low, medium, high.
9. Gender: Male and Female.
10. Discount offered: Discount offered on that specific product.
11. Weight in gms: It is the weight in grams.
12. Reached on time: It is the target variable, where 1 Indicates that the product has NOT reached on time and 0 indicates it has reached on time.

Part 2: Data Preprocessing

2.1 Exploratory Data Analysis (EDA) before Data Cleaning

Author: QIN ZHONGYI

This section performs an initial inspection of the raw dataset to check dimensions and identify data anomalies such as missing values, special characters, and formatting errors.

2.1.1 Data Import and Dimensionality Verification

This step imports the raw CSV dataset into the R environment and calculates the total dimensions (rows × columns) to verify if it meets the minimum requirement of 100,000 data points.

# Load the raw polluted dataset
raw_data <- read.csv("Dataset/Highly_Polluted_Dataset.csv")

# Calculate dimensions
n_rows <- nrow(raw_data)
n_cols <- ncol(raw_data)
total_dim <- n_rows * n_cols

# Output results
print(paste("Total Rows:", n_rows))

[1] "Total Rows: 11499"

print(paste("Total Columns:", n_cols))

[1] "Total Columns: 12"

print(paste("Total Dimension:", total_dim))

[1] "Total Dimension: 137988"

# Automated check
if(total_dim > 100000) {
  print("Status: Requirement Met (>100,000)")
} else {
  print("Status: Warning - Requirement Not Met")
}

[1] "Status: Requirement Met (>100,000)"

2.1.2 Initial Data Format Inspection

This step outputs the first six rows of the dataset to observe the raw data structure and identify obvious formatting issues, such as symbols embedded in numerical columns.

# Display the top 6 rows of the raw dataset
head(raw_data)

  ID Warehouse_block Mode_of_Shipment Customer_care_calls Customer_rating
1  1               D           Flight                   4              NA
2  2                                                    4               5
3  3               A           Flight                   2               2
4  4               B           Flight                   3               3
5  5               C           Flight                   2               2
6  6               F           flyght                   3               1
  Cost_of_the_Product Prior_purchases Product_importance Gender
1                 177               3                low FEMALE
2                 216               2                low      M
3                $183               4                low      M
4                 176               4             medium      M
5                 184               3             medium      F
6                                   3             medium FEMALE
  Discount_offered Weight_in_gms Reached.on.Time_Y.N
1               44        1233.0                   1
2               59        3088.0                   1
3               48        3374.0                   1
4               10        1177.0                   1
5               46        2484.0                   1
6               12                                 1

2.1.3 Data Structure and Type Analysis

The glimpse() function is used to perform a detailed inspection of the data types assigned to each variable. This step is critical for identifying which numerical columns are incorrectly stored as ‘character’ strings due to the presence of non-numeric “pollution” (e.g., currency symbols and units). Identifying these discrepancies now informs our cleaning strategy in Step 2B.

# Check the detailed structure and data types
glimpse(raw_data)

Rows: 11,499
Columns: 12
$ ID                  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Warehouse_block     <chr> "D", "", "A", "B", "C", "F", "D", "f ", "A", "b", …
$ Mode_of_Shipment    <chr> "Flight", "", "Flight", "Flight", "Flight", "flygh…
$ Customer_care_calls <chr> "4", "4", "2", "3", "2", "3", "3", "4", "3", "3", …
$ Customer_rating     <dbl> NA, 5, 2, 3, 2, 1, 4, 1, 4, 2, NA, 5, 5, 4, 3, 3, …
$ Cost_of_the_Product <chr> "177", "216", "$183", "176", "184", "", "250", "23…
$ Prior_purchases     <dbl> 3, 2, 4, 4, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 2,…
$ Product_importance  <chr> "low", "low", "low", "medium", "medium", "medium",…
$ Gender              <chr> "FEMALE", "M", "M", "M", "F", "FEMALE", "F", "F", …
$ Discount_offered    <dbl> 44, 59, 48, 10, 46, 12, 3, 48, 11, 29, 12, 32, 1, …
$ Weight_in_gms       <chr> "1233.0", "3088.0", "3374.0", "1177.0", "2484.0", …
$ Reached.on.Time_Y.N <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", …

2.1.4 Missing Value Detection and Visualization

This step temporarily converts implicit missing values (e.g., empty strings, “?”, “N/A”) to standard NA format and uses an UpSet plot to visualize the distribution and intersection of missing data across variables.

# 1. Creating a temporary standardized view for accurate detection ONLY
missing_viz_data <- raw_data %>%
  mutate(across(everything(), as.character)) %>%
  mutate(across(everything(), trimws)) %>%
  mutate(across(everything(), ~case_when(
    . == "" ~ NA_character_,
    grepl("^(?i)(none|null|n/a|\\?|nan)$", .) ~ NA_character_,
    TRUE ~ .
  )))

# 2. Comprehensive Statistical Summary based on the true missingness
cat("Actual Missingness Profile:\n")

Actual Missingness Profile:

miss_var_summary(missing_viz_data) %>% head(10)

# A tibble: 10 × 3
   variable            n_miss pct_miss
   <chr>                <int>    <num>
 1 Warehouse_block        575     5.00
 2 Mode_of_Shipment       575     5.00
 3 Customer_care_calls    575     5.00
 4 Customer_rating        575     5.00
 5 Cost_of_the_Product    575     5.00
 6 Prior_purchases        575     5.00
 7 Product_importance     575     5.00
 8 Gender                 575     5.00
 9 Discount_offered       575     5.00
10 Reached.on.Time_Y.N    575     5.00

# 3. Advanced Visualization (UpSet Plot)
gg_miss_upset(missing_viz_data, 
              nsets = 10, 
              main.bar.color = "#2c3e50", 
              sets.bar.color = "#e74c3c")

2.1.5 Data Quality Issues Detection

This subsection identifies and quantifies specific data anomalies to guide the subsequent cleaning steps.

2.1.5.1 Symbol Pollution in Numerical Columns

We first investigate the features that should be strictly numerical, such as Cost_of_the_Product and Weight_in_gms. Preliminary inspection suggests that these fields contain currency symbols ($), units (g/grams), and other non-numeric suffixes. These “pollutants” force R to treat these variables as character strings, preventing direct statistical computation.

# Function to extract unique non-numeric characters from a column
detect_non_numeric <- function(column_data) {
  symbols <- str_extract_all(as.character(column_data), "[^0-9.]+") %>% 
    unlist() %>% unique() %>% na.omit()
  return(paste(symbols, collapse = ", "))
}

# Counting affected rows for cost and weight
cost_polluted <- sum(str_detect(as.character(raw_data$Cost_of_the_Product), "[^0-9.]+"), na.rm = TRUE)
weight_polluted <- sum(str_detect(as.character(raw_data$Weight_in_gms), "[^0-9.]+"), na.rm = TRUE)

# Summary table for symbols
symbol_summary <- data.frame(
  Feature = c("Cost of the Product", "Weight in gms"),
  Detected_Symbols = c(detect_non_numeric(raw_data$Cost_of_the_Product), 
                      detect_non_numeric(raw_data$Weight_in_gms)),
  Number = c(cost_polluted, weight_polluted)
)

knitr::kable(symbol_summary, col.names = c("Feature", "Detected Symbols", "Number"))

Feature	Detected Symbols	Number
Cost of the Product	$	1120
Weight in gms	nang, g, -	685

2.1.5.2 Categorical Label Anomalies

This step calculates the frequencies of categorical variables to identify misspelled labels, inconsistent casing, and unstandardized values.

Anomalous Labels in Shipment & Gender

# Identifying Shipment anomalies (Excluding standard: Flight, Ship, Road)
shipment_anomalies <- raw_data %>%
  count(Mode_of_Shipment, name = "Number") %>%
  filter(!Mode_of_Shipment %in% c("Flight", "Ship", "Road"))

# Identifying Gender anomalies (Excluding standard: F, M)
gender_anomalies <- raw_data %>%
  count(Gender, name = "Number") %>%
  filter(!Gender %in% c("F", "M"))

# Combining results for a unified presentation
label_errors <- bind_rows(
  mutate(shipment_anomalies, Feature = "Mode of Shipment") %>% rename(Value = Mode_of_Shipment),
  mutate(gender_anomalies, Feature = "Gender") %>% rename(Value = Gender)
) %>% select(Feature, Value, Number)

# Displaying the summary table
knitr::kable(label_errors, col.names = c("Feature   ", "Error value name   ", "Number"))

Feature	Error value name	Number
Mode of Shipment		575
Mode of Shipment	Flight	155
Mode of Shipment	FLIGHT	156
Mode of Shipment	ROAD	182
Mode of Shipment	S.h.i.p	759
Mode of Shipment	flyght	135
Mode of Shipment	road	171
Mode of Shipment	ship	711
Gender		575
Gender	FEMALE	561
Gender	Female	534
Gender	MALE	572
Gender	Male	537
Gender	f	548
Gender	m	530

2.1.5.3 Duplicate Record Detection

This step scans the ID column to identify and count duplicate primary keys, excluding empty strings and missing values.

Sample of Detected Duplicate IDs

# 1. Removing true NAs AND empty/whitespace strings
duplicate_report <- raw_data %>%
  filter(!is.na(ID) & trimws(as.character(ID)) != "") %>% 
  filter(duplicated(ID) | duplicated(ID, fromLast = TRUE)) %>%
  count(ID, name = "Number") %>%
  arrange(desc(Number))

# 2. Calculating the total sum BEFORE subsetting
total_duplicate_rows <- sum(duplicate_report$Number)

# 3. Generating the summary table for the top 10 only
knitr::kable(head(duplicate_report, 10), col.names = c("Error value name (ID)", "Number"))

Error value name (ID)	Number

# 4. Displaying the final full-dataset count summary at the bottom
cat("\n--------------------------------------------------\n")


--------------------------------------------------

cat("Total unique numerical IDs with duplicates:", nrow(duplicate_report), "\n")

Total unique numerical IDs with duplicates: 0

cat("Total rows affected by duplicates:", total_duplicate_rows, "\n")

Total rows affected by duplicates: 0

2.1.5.4 Value Range Validation

This step filters the ordinal variables (e.g., Customer_rating) to identify values that fall outside the defined standard range (1 to 5).

Package Weight Value Anomalies

# Detecting values in Customer_rating that are not within the standard 1-5 range
rating_anomalies <- raw_data %>%
  count(Customer_rating, name = "Number") %>%
  filter(!Customer_rating %in% c("1", "2", "3", "4", "5")) %>%
  arrange(desc(Number))

# Generating the summary table
knitr::kable(rating_anomalies, col.names = c("Error values", "Number"))

Error values	Number
NA	575

2.1.5.5 Noise Character Profiling

This step counts the occurrences of specific special characters (e.g., *, #, @, !!) across the entire dataset to profile the distribution of data noise.

Distribution of Random Noise Characters

# Function to count the occurrences of specific noise characters across all columns
count_noise_pattern <- function(char_pattern) {
  raw_data %>%
    filter(if_any(everything(), ~str_detect(as.character(.), fixed(char_pattern)))) %>%
    nrow()
}

# Generating the distribution profile of identified noise characters
noise_profile_data <- data.frame(
  Error_Type = c("Asterisk (*)", "Hashtag (#)", "At Symbol (@)", "Double Bang (!!)"),
  Number = c(count_noise_pattern("*"), count_noise_pattern("#"), 
            count_noise_pattern("@"), count_noise_pattern("!!"))
)

# Displaying the noise distribution table
knitr::kable(noise_profile_data, col.names = c("Noise", "Number"))

Noise	Number
Asterisk (*)	0
Hashtag (#)	0
At Symbol (@)	0
Double Bang (!!)	0

2.2 Data Cleaning

Author: QIN ZHONG YI

This section sequentially resolves the anomalies identified in Part 2A. The cleaning process includes deduplication, noise character removal, data type casting, categorical standardization, and missing value imputation.

2.2.1 Identifier Cleaning and Deduplication

This step removes non-numeric characters from the ID column, converts it to a numeric format, and filters out duplicate records based on the unique ID to ensure row-level data integrity.

library(dplyr)
library(stringr)

# 1. Record initial state
rows_before <- nrow(raw_data)

# 2. Clean ID column and deduplicate
cleaned_data <- raw_data %>%
  # Strip everything except numbers and decimals from ID
  mutate(ID = as.numeric(str_remove_all(as.character(ID), "[^0-9.]"))) %>%
  # Remove completely blank/NA IDs, then deduplicate
  filter(!is.na(ID)) %>%
  distinct(ID, .keep_all = TRUE)

cat("Rows before:", rows_before, "\nRows after Step 1:", nrow(cleaned_data), "\n")

Rows before: 11499 
Rows after Step 1: 11499

2.2.2 Noise Character Removal

This step utilizes regular expressions to search and remove specific non-numeric characters (e.g., currency symbols, metric units, and special symbols) across all columns, followed by trimming whitespace.

# 1. Defining the universal noise pattern (covering all identified pollutants)
noise_pattern <- "\\$|USD|g|grams|\\*|#|@|!!"

# 2. Executing global noise stripping and whitespace trimming
cleaned_data <- cleaned_data %>%
  mutate(across(everything(), ~str_remove_all(as.character(.), noise_pattern))) %>%
  mutate(across(everything(), trimws))

cat("Universal noise stripping completed successfully.\n")

Universal noise stripping completed successfully.

2.2.3 Missing Value Standardization and Type Casting

This step converts various textual representations of missing data (e.g., “?”, “None”, “N/A”, empty strings) into standard NA values. Subsequently, variables representing numeric features are explicitly cast to the numeric data type.

# 1. Standardize hidden textual NAs into system NAs
cleaned_data <- cleaned_data %>%
  mutate(across(everything(), ~na_if(., ""))) %>%
  mutate(across(everything(), ~case_when(
    grepl("^(?i)(none|null|nan|n/a|na|\\?)$", .) ~ NA_character_,
    TRUE ~ .
  )))

# 2. Force numeric type casting for mathematical columns
numeric_columns <- c("Cost_of_the_Product", "Weight_in_gms", "Customer_rating", 
                     "Customer_care_calls", "Prior_purchases", "Discount_offered", 
                     "Reached.on.Time_Y.N")

cleaned_data <- cleaned_data %>%
  mutate(
    across(all_of(numeric_columns), as.numeric),
    Weight_in_gms = abs(Weight_in_gms)
  )

cat("Type casting completed. Hidden strings converted to true NAs.\n")

Type casting completed. Hidden strings converted to true NAs.

2.2.4 Categorical Variable Standardization

This step corrects spelling errors and standardizes labels for categorical variables (e.g., Mode_of_Shipment, Gender, Product_importance) using pattern matching, and unifies the text casing to ensure categorical consistency.

# 1. Execute deterministic recoding with fuzzy logic
cleaned_data <- cleaned_data %>%
  mutate(
    # Robust alignment for Mode_of_Shipment
    Mode_of_Shipment = case_when(
      grepl("(?i)fly|flight|fliht", Mode_of_Shipment) ~ "Flight",
      grepl("(?i)s\\.?h\\.?i\\.?p", Mode_of_Shipment) ~ "Ship",
      grepl("(?i)road", Mode_of_Shipment)      ~ "Road",
      TRUE ~ Mode_of_Shipment
    ),
    
    # Robust alignment for Gender
    Gender = case_when(
      grepl("(?i)^f", Gender) ~ "F",
      grepl("(?i)^m", Gender) ~ "M",
      TRUE ~ Gender
    ),
    
    # Robust alignment & casing for Product_importance (Fixing "hih")
    Product_importance = case_when(
      grepl("(?i)hih|high", Product_importance) ~ "high",
      TRUE ~ tolower(Product_importance)
    ),
    
    # Standardize textual casing and remove noise strings for Warehouse_block
    Warehouse_block = case_when(
      grepl("(?i)warehouse_a", Warehouse_block) ~ "A",
      grepl("(?i)d_block", Warehouse_block) ~ "D",
      TRUE ~ toupper(Warehouse_block)
    )
  )

cat("Categorical recoding successful. Distinct Product Importance Levels:", 
    paste(unique(na.omit(cleaned_data$Product_importance)), collapse=", "), "\n")

Categorical recoding successful. Distinct Product Importance Levels: low, medium, high

2.2.5 Missing Value Imputation and Data Export

This final step imputes remaining NA values using the median for continuous numerical variables and the mode for categorical variables. The cleaned dataset is then audited for zero missingness and exported to a CSV file for downstream analysis.

# 1. Define a robust mode calculation function
calc_mode <- function(x) {
  ux <- unique(na.omit(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# 2. Execute Universal Imputation Strategy
cleaned_data <- cleaned_data %>%
  mutate(
    # Micro-Adjustment: Catch any literal "NA" or "nan" strings disguised as text
    Warehouse_block = ifelse(toupper(as.character(Warehouse_block)) %in% c("NA", "NAN", ""), NA_character_, Warehouse_block),
    
    # Median Imputation for numerical features
    Cost_of_the_Product = coalesce(Cost_of_the_Product, median(Cost_of_the_Product, na.rm = TRUE)),
    Weight_in_gms       = coalesce(Weight_in_gms, median(Weight_in_gms, na.rm = TRUE)),
    Customer_rating     = coalesce(Customer_rating, median(Customer_rating, na.rm = TRUE)),
    Customer_care_calls = coalesce(Customer_care_calls, median(Customer_care_calls, na.rm = TRUE)),
    Prior_purchases     = coalesce(Prior_purchases, median(Prior_purchases, na.rm = TRUE)),
    Discount_offered    = coalesce(Discount_offered, median(Discount_offered, na.rm = TRUE)),
    
    # Mode Imputation for categorical/binary features
    Reached.on.Time_Y.N = coalesce(Reached.on.Time_Y.N, calc_mode(Reached.on.Time_Y.N)),
    Warehouse_block     = coalesce(Warehouse_block, calc_mode(Warehouse_block)),
    Mode_of_Shipment    = coalesce(Mode_of_Shipment, calc_mode(Mode_of_Shipment)),
    Product_importance  = coalesce(Product_importance, calc_mode(Product_importance)),
    Gender              = coalesce(Gender, calc_mode(Gender))
  )

# 3. Final Integrity Audit
cat("\n--- Final Integrity Audit ---\n")


--- Final Integrity Audit ---

cat("Total Missing Values:", sum(is.na(cleaned_data)), "\n")

Total Missing Values: 0

cat("Final Dataset Dimensions:", nrow(cleaned_data), "rows x", ncol(cleaned_data), "columns\n\n")

Final Dataset Dimensions: 11499 rows x 12 columns

# 4. Final Cleaned Data Preview
cat("--- Final Cleaned Data Preview ---\n")

--- Final Cleaned Data Preview ---

knitr::kable(head(cleaned_data, 10))

ID	Warehouse_block	Mode_of_Shipment	Customer_care_calls	Customer_rating	Cost_of_the_Product	Prior_purchases	Product_importance	Gender	Discount_offered	Weight_in_gms	Reached.on.Time_Y.N
1	D	Flight	4	3	177	3	low	F	44	1233	1
2	F	Ship	4	5	216	2	low	M	59	3088	1
3	A	Flight	2	2	183	4	low	M	48	3374	1
4	B	Flight	3	3	176	4	medium	M	10	1177	1
5	C	Flight	2	2	184	3	medium	F	46	2484	1
6	F	Flight	3	1	216	3	medium	F	12	4145	1
7	D	Flight	3	4	250	3	low	F	3	2371	1
8	F	Flight	4	1	233	2	low	F	48	2804	1
9	A	Flight	3	4	150	3	low	F	11	1861	1
10	B	Flight	3	2	164	3	medium	F	29	1187	1

# 5. Exporting the final "Gold Standard" Dataset
export_path <- "Dataset/Cleaned_Dataset.csv"
write.csv(cleaned_data, file = export_path, row.names = FALSE)
cat("\nSUCCESS: Dataset exported to:\n", export_path, "\n")


SUCCESS: Dataset exported to:
 Dataset/Cleaned_Dataset.csv

3. Exploratory Data Analysis (EDA)

Author: YANG PEIPEI

In this section, a comprehensive Exploratory Data Analysis (EDA) is conducted on the cleaned dataset. The primary goal of this phase is not just to summarize the data, but to uncover underlying patterns, distributions, and relationships. These insights will be applied to logically motivate our two upcoming machine learning objectives: 1. Regression Task: Predicting package weight (Weight_in_gms). 2. Classification Task: Predicting shipment on-time rate (Reached.on.Time_Y.N).

3.1 Univariate Analysis

First, individual variables is analyzed to understand their fundamental distributions. This step mainly helps to identify skewness, distinct clusters, and the baseline distribution of the target variables.

3.1.1 Distribution of Numerical Features

The distribution of package weight, cost, and discount offered is observed.

p1 <- ggplot(eda_data, aes(x = as.numeric(Customer_rating))) +
  geom_histogram(bins = 5, fill = "skyblue", color = "black", alpha = 0.8) +
  theme_minimal() + labs(title = "Distribution of Customer Rating", x = "Customer Rating", y = "Count")

p2 <- ggplot(eda_data, aes(x = Cost_of_the_Product)) +
  geom_histogram(bins = 30, fill = "lightgreen", color = "black", alpha = 0.8) +
  theme_minimal() + labs(title = "Distribution of Product Cost", x = "Cost (USD)", y = "Count")

p3 <- ggplot(eda_data, aes(x = as.numeric(Product_importance))) +
  geom_histogram(bins = 5, fill = "salmon", color = "black", alpha = 0.8) +
  theme_minimal() + labs(title = "Distribution of Product Importance", x = "Product Importance 1(Low), 2(Medium), 3(High)", y = "Count")

grid.arrange(p1, p2, p3, ncol = 2)

Conclusion:
- Customer_rating distribution showed that there is no bias in the dataset to certain group of customer rating
- Cost_of_the_Product is right-skewed due to several extreme high-cost observations.
- Product_importance distribution showed that low importance products are the most common, medium importance products are the second largest group while high importance products are relatively rare.

3.1.2 Distribution of Target Variables

Understanding the distribution of the two modeling targets is essential before building predictive models.

t1 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, fill = Reached.on.Time_Y.N)) +
  geom_bar(color = "black", alpha = 0.8) +
  scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
  theme_minimal() + labs(title = "Shipment On-Time Status", x = "0 = On Time, 1 = Delayed", y = "Count") +
  theme(legend.position = "none")

t2 <- ggplot(eda_data, aes(x =Weight_in_gms)) +
  geom_histogram(bins = 30, fill = "salmon", color = "black", alpha = 0.8) +
  theme_minimal() + labs(title = "Package Weight Distribution", x = "Package Weight", y = "Count")

grid.arrange(t1, t2, ncol = 2)

Conclusion:
- For On-Time Status (Reached.on.Time_Y.N), class 1 (Delayed) has a noticeably higher count than class 0 (On-Time). This slight class imbalance is crucial context for our classification model evaluation later.
- For Package Weight(Weight_in_gms), it displays a clear bimodal distribution, suggesting two distinct types of weight range (lightweight vs. heavy).

3.2 Bivariate Analysis & Modeling Motivation

In this section, the interaction of predictor variables with target variables is analyze. This step is critical because observing distinct patterns here justifies the necessity of building robust machine learning models.

3.2.1 Motivating the Regression Task: Predicting Package Weight

The factors influencing that influenced Weight_in_gmsis explored.

# Package Weight vs Product Importance
r1 <- ggplot(eda_data, aes(x = Weight_in_gms, y = Product_importance)) +
  geom_boxplot(alpha = 0.6) +
  theme_minimal() + labs(title = "Package Weight vs Product Importance", y = "Product Importance") + theme(legend.position = "none")

# Package Weight vs Product Cost
r2 <- ggplot(eda_data, aes(x = Weight_in_gms, y = Cost_of_the_Product)) +
  geom_point(alpha = 0.4, color = "purple") +
  theme_minimal() + labs(title = "Package Weight vs Product Cost", y = "Product Cost")

# Package Weight vs Mode of Shipment
max_package_weight_mode <- eda_data %>%
  group_by(Mode_of_Shipment) %>%
  summarise(
    Max_Weight = max(Weight_in_gms, na.rm = TRUE)
  )

r3 <- ggplot(max_package_weight_mode,
             aes(x = Mode_of_Shipment,
                 y = Max_Weight,
                 fill = Mode_of_Shipment)) +
  geom_col() +
  theme_minimal() +
  labs(
    title = "Maximum Package Weight by Mode of Shipment",
    x = "Mode of Shipment",
    y = "Maximum Package Weight (g)"
  )


grid.arrange(r1, r2, r3, ncol = 2)

Conclusion for Regression Motivation:
1. Multiple Interactions: The factors influencing package weight appear to be multifaceted. It closely related to product costs, mode_of_Shipment and product importance.
2. Subtle Trends: The product which is high important have larger package weight. Besides, ship can delivered package with the highest maximum weight followed by road and flight.
Actionable Takeaway: Product weight is driven by complex interactions among the available predictors. This necessitates a robust Multivariate Regression Model to synthesize these features and accurately predict the expected package weight.

3.2.2 Motivating the Classification Task: Predicting On-Time Delivery

The variables which strongly distinguish between on-time deliveries and delayed deliveries is explored.

# Weight vs. On-Time
b1 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, y = Weight_in_gms, fill = Reached.on.Time_Y.N)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
  theme_minimal() + labs(title = "Weight vs Delivery Status", x = "Status (0: On-time, 1: Delayed)") +
  theme(legend.position = "none")

# Discount vs. On-Time
b2 <- ggplot(eda_data, aes(x = Reached.on.Time_Y.N, y = Discount_offered, fill = Reached.on.Time_Y.N)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_manual(values = c("#2ecc71", "#e74c3c")) +
  theme_minimal() + labs(title = "Discount vs Delivery Status", x = "Status") +
  theme(legend.position = "none")

# Product Importance vs. On-Time
b3 <- ggplot(eda_data, aes(x = Product_importance, fill = Reached.on.Time_Y.N)) +
  geom_bar(position = "fill", color = "black") +
  scale_fill_manual(values = c("#2ecc71", "#e74c3c"), name="Status") +
  theme_minimal() + labs(title = "Importance vs Delivery Status", y = "Proportion")

grid.arrange(b1, b2, b3, ncol = 2)

Conclusion for Classification Motivation:
1. Weight and discount show visible differences across delivery status: The boxplots reveal useful patterns. Packages that weigh between 2000g and 4000g are more prone to delays. Furthermore, packages with high discount offers are almost exclusively delayed (Target=1).
2. Categorical features also matter: The proportion chart shows that high importance products actually suffer from a slightly higher proportion of delays.

Actionable Takeaway: Because multiple variables exhibit non-linear and overlapping effects on delivery delays, simple heuristics won’t suffice.

This motivates the use of Classification Models such as Logistic Regression, Linear Discriminant Analysis, and Classification Tree to map these feature interactions and predict delays.

3.3 Multivariate Correlation Analysis

Finally, the correlation matrix for all continuous numerical features is computed. This ensures any multicollinearity is awared before passing these features into the models.

# Select only numeric continuous columns for correlation
numeric_data <- eda_data %>%
  dplyr::select(Customer_care_calls, Cost_of_the_Product, Prior_purchases, Discount_offered, Weight_in_gms)

# Calculate correlation matrix
cor_matrix <- cor(numeric_data, use = "complete.obs")

# Plot correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", 
         tl.srt = 45, addCoef.col = "black", number.cex = 0.8, 
         col = colorRampPalette(c("#e74c3c", "white", "#3498db"))(200),
         title = "Correlation Matrix of Numerical Features",
         mar = c(0,0,1,0))

Conclusion:
- Cost_of_the_Product has very weak correlations with the other numerical variables, suggesting that product cost might not be a strong standalone predictor. - Overall, there are no extremely high correlations (e.g., > 0.8) between the independent variables. This confirms the absence of severe multicollinearity, meaning we can safely use all these numerical features simultaneously in both our Classification and Regression models without causing data leakage.

3.4 Summary

Through rigorous EDA, we have established the following: 1. The data distributions have distinct characteristics, notably the bimodal weight distribution and right-skewed discount distribution. 2. Variables like weight and discount are strong indicators of shipment delays, providing an excellent foundation for the Classification task. 3. Package weight maintains complex variance influenced by various shipment features, justifying the use of a Regression model to predict the package weight. 4. The numerical features are mostly independent of one another, clearing the way for smooth model training.

With the data thoroughly explored and modeling motivations clearly justified, the next part which is Part 4: Predictive Modeling can be proceeded.

4. Predictive Modeling

Author: CEN LINZE

This section builds two supervised machine learning tasks based on the EDA findings:

Regression Task: Predicting Weight_in_gms.
Classification Task: Predicting shipment delay status using Reached.on.Time_Y.N.

4.1 Data Preparation and Train-Test Split

# Load cleaned dataset again for modelling to ensure correct data types
model_data <- read.csv("Dataset/Cleaned_Dataset.csv")

model_data <- model_data %>%
  mutate(
    Warehouse_block = as.factor(Warehouse_block),
    Mode_of_Shipment = as.factor(Mode_of_Shipment),
    Product_importance = as.factor(Product_importance),
    Gender = as.factor(Gender),
    Reached.on.Time_Y.N = factor(Reached.on.Time_Y.N, levels = c(0, 1)),
    Customer_rating = as.numeric(Customer_rating),
    Customer_care_calls = as.numeric(Customer_care_calls),
    Cost_of_the_Product = as.numeric(Cost_of_the_Product),
    Prior_purchases = as.numeric(Prior_purchases),
    Discount_offered = as.numeric(Discount_offered),
    Weight_in_gms = as.numeric(Weight_in_gms)
  )

set.seed(7004)

train_index <- sample(seq_len(nrow(model_data)), size = 0.7 * nrow(model_data))
train_data <- model_data[train_index, ]
test_data <- model_data[-train_index, ]

cat("Training set:", nrow(train_data), "rows\n")

## Training set: 8049 rows

cat("Testing set:", nrow(test_data), "rows\n")

## Testing set: 3450 rows

4.2 Regression Models: Predicting Package Weight

Three regression models are trained and compared:

Multiple Linear Regression
Polynomial Linear Regression
Regression Tree

reg_formula <- Weight_in_gms ~ Warehouse_block + Mode_of_Shipment +
  Customer_care_calls + Cost_of_the_Product + Prior_purchases +
  Product_importance + Gender + Discount_offered + Customer_rating +
  Reached.on.Time_Y.N

# Model 1: Multiple Linear Regression
lm_model <- lm(reg_formula, data = train_data)

# Model 2: Polynomial Linear Regression
poly_lm_model <- lm(
  Weight_in_gms ~ Warehouse_block + Mode_of_Shipment +
    Customer_care_calls + poly(Cost_of_the_Product, 2) +
    Prior_purchases + Product_importance + Gender +
    Discount_offered + Customer_rating + Reached.on.Time_Y.N,
  data = train_data
)

# Model 3: Regression Tree
reg_tree_model <- rpart(
  reg_formula,
  data = train_data,
  method = "anova",
  control = rpart.control(cp = 0.001)
)

pred_lm <- predict(lm_model, newdata = test_data)
pred_poly_lm <- predict(poly_lm_model, newdata = test_data)
pred_reg_tree <- predict(reg_tree_model, newdata = test_data)

# Weight cannot be negative
clip_weight <- function(x) {
  pmax(x, 0)
}

pred_lm <- clip_weight(pred_lm)
pred_poly_lm <- clip_weight(pred_poly_lm)
pred_reg_tree <- clip_weight(pred_reg_tree)

4.3 Classification Models: Predicting Shipment Delay

Three classification models are trained and compared:

Logistic Regression
Linear Discriminant Analysis (LDA)
Classification Tree

class_formula <- Reached.on.Time_Y.N ~ Warehouse_block + Mode_of_Shipment +
  Customer_care_calls + Customer_rating + Cost_of_the_Product + Prior_purchases +
  Product_importance + Gender + Discount_offered + Weight_in_gms

# Model 1: Logistic Regression
logistic_model <- glm(
  class_formula,
  data = train_data,
  family = binomial
)

# Model 2: Linear Discriminant Analysis
lda_model <- MASS::lda(class_formula, data = train_data)

# Model 3: Classification Tree
class_tree_model <- rpart(
  class_formula,
  data = train_data,
  method = "class",
  control = rpart.control(cp = 0.001)
)

# Logistic Regression Predictions
logistic_prob <- predict(logistic_model, newdata = test_data, type = "response")
pred_logistic <- factor(ifelse(logistic_prob >= 0.5, "1", "0"), levels = c("0", "1"))

# LDA Predictions
lda_output <- predict(lda_model, newdata = test_data)
pred_lda <- factor(lda_output$class, levels = c("0", "1"))
lda_prob <- lda_output$posterior[, "1"]

# Classification Tree Predictions
tree_prob_matrix <- predict(class_tree_model, newdata = test_data, type = "prob")
class_tree_prob <- tree_prob_matrix[, "1"]
pred_class_tree <- factor(ifelse(class_tree_prob >= 0.5, "1", "0"), levels = c("0", "1"))

5. Model Evaluation

Author: ZHANG WENQIAN

5.1 Regression Model Evaluation

The regression models are evaluated using R-Squared, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).

regression_metrics <- function(actual, predicted) {
  sse <- sum((actual - predicted)^2)
  sst <- sum((actual - mean(actual))^2)
  
  data.frame(
    R_Squared = 1 - sse / sst,
    MAE = mean(abs(actual - predicted)),
    MSE = mean((actual - predicted)^2),
    RMSE = sqrt(mean((actual - predicted)^2))
  )
}

regression_results <- bind_rows(
  cbind(Model = "Multiple Linear Regression",
        regression_metrics(test_data$Weight_in_gms, pred_lm)),
  cbind(Model = "Polynomial Linear Regression",
        regression_metrics(test_data$Weight_in_gms, pred_poly_lm)),
  cbind(Model = "Regression Tree",
        regression_metrics(test_data$Weight_in_gms, pred_reg_tree))
)

regression_results <- regression_results %>%
  mutate(across(where(is.numeric), ~round(.x, 4)))

knitr::kable(
  regression_results,
  caption = "Regression Model Performance for Package Weight Prediction"
)

Regression Model Performance for Package Weight Prediction
Model	R_Squared	MAE	MSE	RMSE
Multiple Linear Regression	0.2576	1147.5766	1900353	1378.533
Polynomial Linear Regression	0.2576	1147.3718	1900252	1378.496
Regression Tree	0.5787	732.8797	1078478	1038.498

best_regression_model <- regression_results$Model[which.min(regression_results$RMSE)]
cat("Best regression model based on RMSE:", best_regression_model)

## Best regression model based on RMSE: Regression Tree

Based on RMSE, the best regression model is selected as the model with the lowest prediction error. This is important because inaccurate prediction of package weights may lead to poor warehouse resource allocation and logistical planning errors.

5.2 Classification Model Evaluation

The classification models are evaluated using Accuracy, Precision, Recall (Sensitivity), F1-Score, and AUC. Since delayed shipments are the main concern, recall and F1-score are more useful when choosing the final classification model.

calculate_auc <- function(actual, probability, positive_class = "1") {
  actual_binary <- ifelse(actual == positive_class, 1, 0)
  pos <- probability[actual_binary == 1]
  neg <- probability[actual_binary == 0]
  if (length(pos) == 0 | length(neg) == 0) return(NA)
  auc_value <- (sum(rank(c(pos, neg))[1:length(pos)]) -
                  length(pos) * (length(pos) + 1) / 2) / (length(pos) * length(neg))
  return(auc_value)
}

classification_metrics <- function(actual, predicted, probability, model_name, positive_class = "1") {
  actual <- factor(actual, levels = c("0", "1"))
  predicted <- factor(predicted, levels = c("0", "1"))
  
  tp <- sum(actual == positive_class & predicted == positive_class)
  tn <- sum(actual != positive_class & predicted != positive_class)
  fp <- sum(actual != positive_class & predicted == positive_class)
  fn <- sum(actual == positive_class & predicted != positive_class)
  
  accuracy <- (tp + tn) / (tp + tn + fp + fn)
  precision <- ifelse((tp + fp) == 0, 0, tp / (tp + fp))
  recall <- ifelse((tp + fn) == 0, 0, tp / (tp + fn))
  f1_score <- ifelse((precision + recall) == 0, 0,
                     2 * precision * recall / (precision + recall))
  auc <- calculate_auc(actual, probability, positive_class)
  
  data.frame(
    Model = model_name,
    Accuracy = round(accuracy, 4),
    Precision = round(precision, 4),
    Recall = round(recall, 4),
    F1_Score = round(f1_score, 4),
    AUC = round(auc, 4)
  )
}

classification_results <- bind_rows(
  classification_metrics(test_data$Reached.on.Time_Y.N, pred_logistic, logistic_prob, "Logistic Regression"),
  classification_metrics(test_data$Reached.on.Time_Y.N, pred_lda, lda_prob, "Linear Discriminant Analysis"),
  classification_metrics(test_data$Reached.on.Time_Y.N, pred_class_tree, class_tree_prob, "Classification Tree")
)

knitr::kable(
  classification_results,
  caption = "Classification Model Performance for Shipment Delay Prediction"
)

Classification Model Performance for Shipment Delay Prediction
Model	Accuracy	Precision	Recall	F1_Score	AUC
Logistic Regression	0.6362	0.6899	0.8135	0.7466	0.6832
Linear Discriminant Analysis	0.6499	0.6837	0.8720	0.7664	0.6951
Classification Tree	0.6339	0.7153	0.7382	0.7266	0.7047

Best Classification Model Details

best_class_model <- classification_results$Model[which.max(classification_results$F1_Score)]
cat("Best classification model based on F1-score:", best_class_model, "\n\n")

## Best classification model based on F1-score: Linear Discriminant Analysis

if (best_class_model == "Logistic Regression") {
  print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_logistic))
} else if (best_class_model == "Linear Discriminant Analysis") {
  print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_lda))
} else {
  print(table(Actual = test_data$Reached.on.Time_Y.N, Predicted = pred_class_tree))
}

##       Predicted
## Actual    0    1
##      0  260  917
##      1  291 1982

Classification Metrics Visualization

classification_results %>%
  pivot_longer(cols = c(Accuracy, Precision, Recall, F1_Score, AUC),
               names_to = "Metric", values_to = "Value") %>%
  ggplot(aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Metric) +
  coord_cartesian(ylim = c(0, 1)) +
  theme_minimal() +
  labs(title = "Classification Model Comparison",
       x = "Model", y = "Score") +
  theme(axis.text.x = element_text(angle = 15, hjust = 1))

The best classification model is chosen based on its F1-Score to effectively balance the trade-off between precision and recall, ensuring that the company accurately identifies as many delayed packages as possible without an excessive false alarm rate.

Conclusion

Based on results obtained, Linear Discriminant Analysis has the best performance for shipment on time classification while Regression Tree has the best performance for package weight prediction. For package weight, it influences by multiple factors. One of the factors is product importance where product with high importance are typically larger weight. For shipment on time rate, it showed that product with weight 2000g to 4000g as well as product with high discount rate usually having delay shipment. Future works are recommended such as to obtain more features like dimension of package, quantity of parcel per shipment, shipment distance to increase the data accuracy.

Predicting Package Weight and Shipment On-Time Rate with Machine Learning Model

ZHANG WENQIAN, YANG PEIPEI, CEN LINZE, ONG CHEN YAN, QIN ZHONG YI

2026-06-11