Project Title: “E-Commerce User Behavior Modeling through Classification, Regression and Predictive Analytics”

Introduction

The “E-commerce Customer Behaviour and Purchase” dataset is a rich, structured representation of customer interactions on an e-commerce platform. The Kaggle dataset is described as synthetically generated with the Python Faker library; although it is not real-world data, it is modelled to imitate realistic e-commerce behaviour patterns.

The dataset is generated to facilitate various data science and machine learning tasks related to online retail, such as customer churn prediction, market basket analysis, recommendation systems, and purchasing trend analysis. It is well-structured in tabular form: each row represents a unique transaction made by a customer, and each column captures a specific attribute of the transaction, the customer, or their purchase behaviour.

The dataset spans multiple relevant aspects of customer purchasing behaviour. Key features include customer demographic information such as Customer ID, Customer Name, Age, and Gender, along with transactional data like Purchase Date, Product Category, Product Price, Quantity, and Total Purchase Amount. It also includes behavioural indicators such as Payment Method, Returns, and Churn. These features collectively allow for in-depth analyses of consumer habits and provide a strong foundation for predictive modelling.

Two dataset files are provided; the file analysed here comprises 250,000 records and 13 columns. Given its intended use in modelling customer behaviour over time, the large number of records plausibly serves to simulate e-commerce activity across many customers and timeframes.

Project Objectives

  1. To identify factors influencing customer churn, helping businesses improve retention strategies.

  2. To model user behavior using classification and regression techniques to predict important customer outcomes.

# Reading the dataset
df <- read.csv("ecommerce_customer_data_large.csv", stringsAsFactors = FALSE)
# Load necessary libraries
library(tidyverse)   # For data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)       # For machine learning functions
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(e1071)       # For SVM and other ML models
library(rpart)       # For decision trees
library(nnet)        # For neural networks
library(MASS)        # For logistic regression
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(ggplot2)     # For visualization
library(dplyr)       # For data wrangling
library(lubridate)
# View the structure and summary of df

# Structure (data types)
str(df)
## 'data.frame':    250000 obs. of  13 variables:
##  $ Customer.ID          : int  44605 44605 44605 44605 44605 13738 13738 13738 13738 13738 ...
##  $ Purchase.Date        : chr  "2023-05-03 21:30:02" "2021-05-16 13:57:44" "2020-07-13 06:16:57" "2023-01-17 13:14:36" ...
##  $ Product.Category     : chr  "Home" "Electronics" "Books" "Electronics" ...
##  $ Product.Price        : int  177 174 413 396 259 191 205 370 12 40 ...
##  $ Quantity             : int  1 3 1 3 4 3 1 5 2 4 ...
##  $ Total.Purchase.Amount: int  2427 2448 2345 937 2598 3722 2773 1486 2175 4327 ...
##  $ Payment.Method       : chr  "PayPal" "PayPal" "Credit Card" "Cash" ...
##  $ Customer.Age         : int  31 31 31 31 31 27 27 27 27 27 ...
##  $ Returns              : num  1 1 1 0 1 1 NA 1 NA 0 ...
##  $ Customer.Name        : chr  "John Rivera" "John Rivera" "John Rivera" "John Rivera" ...
##  $ Age                  : int  31 31 31 31 31 27 27 27 27 27 ...
##  $ Gender               : chr  "Female" "Female" "Female" "Female" ...
##  $ Churn                : int  0 0 0 0 0 0 0 0 0 0 ...
# Quick overview
glimpse(df)
## Rows: 250,000
## Columns: 13
## $ Customer.ID           <int> 44605, 44605, 44605, 44605, 44605, 13738, 13738,…
## $ Purchase.Date         <chr> "2023-05-03 21:30:02", "2021-05-16 13:57:44", "2…
## $ Product.Category      <chr> "Home", "Electronics", "Books", "Electronics", "…
## $ Product.Price         <int> 177, 174, 413, 396, 259, 191, 205, 370, 12, 40, …
## $ Quantity              <int> 1, 3, 1, 3, 4, 3, 1, 5, 2, 4, 3, 1, 2, 4, 1, 2, …
## $ Total.Purchase.Amount <int> 2427, 2448, 2345, 937, 2598, 3722, 2773, 1486, 2…
## $ Payment.Method        <chr> "PayPal", "PayPal", "Credit Card", "Cash", "PayP…
## $ Customer.Age          <int> 31, 31, 31, 31, 31, 27, 27, 27, 27, 27, 27, 27, …
## $ Returns               <dbl> 1, 1, 1, 0, 1, 1, NA, 1, NA, 0, NA, 1, 0, 0, 0, …
## $ Customer.Name         <chr> "John Rivera", "John Rivera", "John Rivera", "Jo…
## $ Age                   <int> 31, 31, 31, 31, 31, 27, 27, 27, 27, 27, 27, 27, …
## $ Gender                <chr> "Female", "Female", "Female", "Female", "Female"…
## $ Churn                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Descriptive statistics
summary(df)
##   Customer.ID    Purchase.Date      Product.Category   Product.Price  
##  Min.   :    1   Length:250000      Length:250000      Min.   : 10.0  
##  1st Qu.:12590   Class :character   Class :character   1st Qu.:132.0  
##  Median :25011   Mode  :character   Mode  :character   Median :255.0  
##  Mean   :25018                                         Mean   :254.7  
##  3rd Qu.:37441                                         3rd Qu.:377.0  
##  Max.   :50000                                         Max.   :500.0  
##                                                                       
##     Quantity     Total.Purchase.Amount Payment.Method      Customer.Age 
##  Min.   :1.000   Min.   : 100          Length:250000      Min.   :18.0  
##  1st Qu.:2.000   1st Qu.:1476          Class :character   1st Qu.:30.0  
##  Median :3.000   Median :2725          Mode  :character   Median :44.0  
##  Mean   :3.005   Mean   :2725                             Mean   :43.8  
##  3rd Qu.:4.000   3rd Qu.:3975                             3rd Qu.:57.0  
##  Max.   :5.000   Max.   :5350                             Max.   :70.0  
##                                                                         
##     Returns      Customer.Name           Age          Gender         
##  Min.   :0.0     Length:250000      Min.   :18.0   Length:250000     
##  1st Qu.:0.0     Class :character   1st Qu.:30.0   Class :character  
##  Median :1.0     Mode  :character   Median :44.0   Mode  :character  
##  Mean   :0.5                        Mean   :43.8                     
##  3rd Qu.:1.0                        3rd Qu.:57.0                     
##  Max.   :1.0                        Max.   :70.0                     
##  NA's   :47382                                                       
##      Churn       
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2005  
##  3rd Qu.:0.0000  
##  Max.   :1.0000  
## 
# Dimensions: Rows and Columns
dim(df)
## [1] 250000     13
# Column Names
colnames(df)
##  [1] "Customer.ID"           "Purchase.Date"         "Product.Category"     
##  [4] "Product.Price"         "Quantity"              "Total.Purchase.Amount"
##  [7] "Payment.Method"        "Customer.Age"          "Returns"              
## [10] "Customer.Name"         "Age"                   "Gender"               
## [13] "Churn"
# Check the first few rows
head(df)
##   Customer.ID       Purchase.Date Product.Category Product.Price Quantity
## 1       44605 2023-05-03 21:30:02             Home           177        1
## 2       44605 2021-05-16 13:57:44      Electronics           174        3
## 3       44605 2020-07-13 06:16:57            Books           413        1
## 4       44605 2023-01-17 13:14:36      Electronics           396        3
## 5       44605 2021-05-01 11:29:27            Books           259        4
## 6       13738 2022-08-25 06:48:33             Home           191        3
##   Total.Purchase.Amount Payment.Method Customer.Age Returns  Customer.Name Age
## 1                  2427         PayPal           31       1    John Rivera  31
## 2                  2448         PayPal           31       1    John Rivera  31
## 3                  2345    Credit Card           31       1    John Rivera  31
## 4                   937           Cash           31       0    John Rivera  31
## 5                  2598         PayPal           31       1    John Rivera  31
## 6                  3722    Credit Card           27       1 Lauren Johnson  27
##   Gender Churn
## 1 Female     0
## 2 Female     0
## 3 Female     0
## 4 Female     0
## 5 Female     0
## 6 Female     0

Data Cleaning & Preprocessing

  • Data Cleaning

Removing duplicates

Handling missing values

Fixing column names or formats

Converting data types (e.g., character to date)

Dropping redundant columns (e.g., Customer_Age, which duplicates Age)

  • Data Preprocessing

Encoding categorical variables into numeric formats

Feature scaling (normalization/standardization)

Creating new features

Binning or discretizing variables (a short sketch of the encoding, scaling, and binning steps follows this list)
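Of these steps, encoding is applied later (during the SMOTE preparation), while scaling and binning are not otherwise shown. Below is a minimal sketch of all three; it assumes the underscore column names created in the cleaning step, and the Age cut points and new column names (Gender_Code, Price_Scaled, Age_Group) are illustrative choices, not part of the original analysis.

# Encoding: convert a categorical variable to integer codes
df$Gender_Code <- as.integer(factor(df$Gender))

# Feature scaling: z-score standardization of a numeric variable
df$Price_Scaled <- as.numeric(scale(df$Product_Price))

# Binning: discretize Age into labelled groups (cut points are illustrative)
df$Age_Group <- cut(df$Age,
                    breaks = c(17, 30, 45, 60, 70),
                    labels = c("18-30", "31-45", "46-60", "61-70"))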

# Check how many duplicate rows exist
sum(duplicated(df))
## [1] 0
# Check for missing values
colSums(is.na(df))
##           Customer.ID         Purchase.Date      Product.Category 
##                     0                     0                     0 
##         Product.Price              Quantity Total.Purchase.Amount 
##                     0                     0                     0 
##        Payment.Method          Customer.Age               Returns 
##                     0                     0                 47382 
##         Customer.Name                   Age                Gender 
##                     0                     0                     0 
##                 Churn 
##                     0
# Check rows with missing values in 'Returns'
missing_rows <- df[is.na(df$Returns), ]

 # View the first few rows with missing values
head(missing_rows)
##    Customer.ID       Purchase.Date Product.Category Product.Price Quantity
## 7        13738 2023-07-25 05:17:24      Electronics           205        1
## 9        13738 2021-12-21 03:29:05             Home            12        2
## 11       33969 2023-02-28 19:58:23         Clothing           410        3
## 22       42650 2022-01-26 12:50:30      Electronics           105        2
## 27       42650 2020-05-24 06:45:48      Electronics           307        3
## 40       19676 2021-03-23 17:36:30      Electronics           218        4
##    Total.Purchase.Amount Payment.Method Customer.Age Returns  Customer.Name Age
## 7                   2773    Credit Card           27      NA Lauren Johnson  27
## 9                   2175           Cash           27      NA Lauren Johnson  27
## 11                  5018    Credit Card           27      NA    Carol Allen  27
## 22                  3721    Credit Card           20      NA   Curtis Smith  20
## 27                   973         PayPal           20      NA   Curtis Smith  20
## 40                  2866         PayPal           57      NA      Linda Lee  57
##    Gender Churn
## 7  Female     0
## 9  Female     0
## 11   Male     0
## 22 Female     0
## 27 Female     0
## 40   Male     0
# Replace all dots with underscores in column names
colnames(df) <- gsub("\\.", "_", colnames(df))

# Check the column name after change
colnames(df)
##  [1] "Customer_ID"           "Purchase_Date"         "Product_Category"     
##  [4] "Product_Price"         "Quantity"              "Total_Purchase_Amount"
##  [7] "Payment_Method"        "Customer_Age"          "Returns"              
## [10] "Customer_Name"         "Age"                   "Gender"               
## [13] "Churn"
# Convert data types (as.Date keeps only the date part; the time of day is dropped)
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns <- as.integer(df$Returns)

# Create new features
# Extract components
df$Purchase_Day_Name <- weekdays(df$Purchase_Date)
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Month_Name <- month(df$Purchase_Date, label = TRUE, abbr = FALSE)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
head(df)
##   Customer_ID Purchase_Date Product_Category Product_Price Quantity
## 1       44605    2023-05-03             Home           177        1
## 2       44605    2021-05-16      Electronics           174        3
## 3       44605    2020-07-13            Books           413        1
## 4       44605    2023-01-17      Electronics           396        3
## 5       44605    2021-05-01            Books           259        4
## 6       13738    2022-08-25             Home           191        3
##   Total_Purchase_Amount Payment_Method Customer_Age Returns  Customer_Name Age
## 1                  2427         PayPal           31       1    John Rivera  31
## 2                  2448         PayPal           31       1    John Rivera  31
## 3                  2345    Credit Card           31       1    John Rivera  31
## 4                   937           Cash           31       0    John Rivera  31
## 5                  2598         PayPal           31       1    John Rivera  31
## 6                  3722    Credit Card           27       1 Lauren Johnson  27
##   Gender Churn Purchase_Day_Name Purchase_Month Purchase_Month_Name
## 1 Female     0         Wednesday              5                 May
## 2 Female     0            Sunday              5                 May
## 3 Female     0            Monday              7                July
## 4 Female     0           Tuesday              1             January
## 5 Female     0          Saturday              5                 May
## 6 Female     0          Thursday              8              August
##   Purchase_Year
## 1          2023
## 2          2021
## 3          2020
## 4          2023
## 5          2021
## 6          2022
# Check for total number of numerical features and categorical features in the dataset
num_features <- sum(sapply(df, is.numeric))
cat_features <- sum(sapply(df, function(x) is.factor(x) || is.character(x)))

cat("Numerical features:", num_features, "\n")
## Numerical features: 9
cat("Categorical features:", cat_features, "\n")
## Categorical features: 7
# Show the names of the numerical and categorical features
numerical_features <- names(df)[sapply(df, is.numeric)]
categorical_features <- names(df)[sapply(df, function(x) is.factor(x) || is.character(x))]

cat("Numerical Features:\n")
## Numerical Features:
print(numerical_features)
## [1] "Customer_ID"           "Product_Price"         "Quantity"             
## [4] "Total_Purchase_Amount" "Customer_Age"          "Returns"              
## [7] "Age"                   "Churn"                 "Purchase_Month"
cat("\nCategorical Features:\n")
## 
## Categorical Features:
print(categorical_features)
## [1] "Product_Category"    "Payment_Method"      "Customer_Name"      
## [4] "Gender"              "Purchase_Day_Name"   "Purchase_Month_Name"
## [7] "Purchase_Year"
# Count the total number of churn and return
# Total Churned
sum(df$Churn == 1)
## [1] 50130
# Total Returned
sum(df$Returns == 1)
## [1] NA

Notice that the sum of returns comes back as NA: sum() propagates missing values, which means the ‘Returns’ column contains NAs.

# Count the total number of returns (ignoring missing values with na.rm = TRUE)
total_returns <- sum(df$Returns == 1, na.rm = TRUE)
print(total_returns)
## [1] 101476
# Assuming NA (missing value) means no return, fill with 0
df$Returns[is.na(df$Returns)] <- 0
total_returns <- sum(df$Returns == 1)
print(total_returns)
## [1] 101476
# Check whether any NA remains in the 'Returns' column
sum(is.na(df$Returns))
## [1] 0
# Count the total number of churn and return
# Total Churned
sum(df$Churn == 1)
## [1] 50130
# Total Returned
sum(df$Returns == 1)
## [1] 101476

There are no NA values left in the dataset, confirming that the missing values were filled with 0. There are 101,476 returns in total.

Data Analysis (EDA)

Total Revenue and Average Revenue

# Calculate for total revenue and average revenue

# Total Revenue
total_revenue <- sum(df$Total_Purchase_Amount, na.rm = TRUE)
print(total_revenue)
## [1] 681346299
# Average Revenue
average_revenue <- mean(df$Total_Purchase_Amount, na.rm = TRUE)
print(average_revenue)
## [1] 2725.385

Top 5 Spending Customers

# Show the top 5 spending customers

top_customers <- df %>%
  group_by(Customer_ID, Customer_Name) %>%
  summarise(Total_Spending = sum(Total_Purchase_Amount, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(Total_Spending)) %>%
  slice_head(n = 5)

top_customers
## # A tibble: 5 × 3
##   Customer_ID Customer_Name     Total_Spending
##         <int> <chr>                      <int>
## 1       39895 Reginald Gonzales          50659
## 2       39717 Joseph Kaiser              50496
## 3       48382 Katelyn Clark              50179
## 4        6633 Andre Spencer              48499
## 5       49743 Bryan Gonzalez             47015
# Plot bar graph
ggplot(top_customers, aes(x = reorder(Customer_Name, -Total_Spending), y = Total_Spending, fill = Customer_Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 5 Highest Spending Customers",
       x = "Customer Name",
       y = "Total Spending") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Total_Spending, 2)), vjust = -0.3)

Busiest Sales Day of the Week

# Determine the Busiest Sales Day of the week
sales_by_day <- df %>%
  group_by(Purchase_Day_Name) %>%
  summarise(Total_Sales = sum(Total_Purchase_Amount, na.rm = TRUE)) %>%
  arrange(desc(Total_Sales))

# Show result
print(sales_by_day)
## # A tibble: 7 × 2
##   Purchase_Day_Name Total_Sales
##   <chr>                   <int>
## 1 Wednesday            98336024
## 2 Thursday             97990590
## 3 Saturday             97466717
## 4 Friday               97178917
## 5 Tuesday              96894118
## 6 Monday               96842007
## 7 Sunday               96637926
# Reorder factor based on Total_Sales (in descending order)
sales_by_day$Purchase_Day_Name <- factor(sales_by_day$Purchase_Day_Name, levels = sales_by_day$Purchase_Day_Name)

# Plot bar graph
ggplot(sales_by_day, aes(x = Purchase_Day_Name, y = Total_Sales, fill = Purchase_Day_Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Sales by Day of the Week", x = "Day", y = "Total Sales") +
  theme_minimal() +
  theme(legend.position = "none")

Product Category

# List out all the product category
df %>%
  distinct(Product_Category)
##   Product_Category
## 1             Home
## 2      Electronics
## 3            Books
## 4         Clothing
# Count products per category
category_counts <- df %>%
  group_by(Product_Category) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100,
         Label = paste0(Product_Category, " (", round(Percentage, 1), "%)"))

# Plot pie chart
ggplot(category_counts, aes(x = "", y = Count, fill = Product_Category)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
  labs(title = "Product Category Distribution") +
  theme_void()

Payment Type

# List out all the payment type
df %>%
  distinct(Payment_Method)
##   Payment_Method
## 1         PayPal
## 2    Credit Card
## 3           Cash
# Count number of transactions per payment method
payment_counts <- df %>%
  group_by(Payment_Method) %>%
  summarise(Count = n()) %>%
  arrange(desc(Count))

# Plot bar graph
ggplot(payment_counts, aes(x = reorder(Payment_Method, -Count), y = Count, fill = Payment_Method)) +
  geom_bar(stat = "identity") +
  labs(title = "Transaction Count by Payment Method",
       x = "Payment Method",
       y = "Number of Transactions") +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = Count), vjust = -0.3)

Univariate Analysis (Numerical Variables)

Numerical variables (Age, Product_Price, Total_Purchase_Amount): understand distribution, outliers, and spread.

hist(): Used to visualize the distribution of a numerical feature.

boxplot(): Used to detect outliers and visualize spread.

summary(): Provides a quick overview of key statistics like mean, median, and quartiles.

# Customer Age Histogram
hist(df$Age,
     main = "Histogram of Customer Age",
     xlab = "Customer Age",
     col = "lightblue",
     border = "black",
     breaks = 15)  # Adjust number of bins with breaks

# Product Price Histogram
hist(df$Product_Price,
     main = "Histogram of Product Price",
     xlab = "Product Price",
     col = "lightgreen",
     border = "black",
     breaks = 20)

# Total Purchase Amount Histogram
hist(df$Total_Purchase_Amount,
     main = "Histogram of Total Purchase Amount",
     xlab = "Total Purchase Amount",
     col = "lightcoral",
     border = "black",
     breaks = 25)

# Customer Age Boxplot
boxplot(df$Age,
        main = "Boxplot of Customer Age",
        ylab = "Customer Age",
        col = "lightblue",
        border = "blue")

# Product Price Boxplot
boxplot(df$Product_Price,
        main = "Boxplot of Product Price",
        ylab = "Product Price",
        col = "lightgreen",
        border = "green")

# Total Purchase Amount Boxplot
boxplot(df$Total_Purchase_Amount,
        main = "Boxplot of Total Purchase Amount",
        ylab = "Total Purchase Amount",
        col = "lightcoral",
        border = "red")

# Summary for Customer Age
summary(df$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0    30.0    44.0    43.8    57.0    70.0
# Summary for Product Price
summary(df$Product_Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0   132.0   255.0   254.7   377.0   500.0
# Summary for Total Purchase Amount
summary(df$Total_Purchase_Amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     100    1476    2725    2725    3975    5350

Univariate Analysis (Categorical Variables)

Categorical variables (Gender, Product_Category, Payment_Method, Returns, Churn): see the frequency of each category.

table(): Generates a frequency table for categorical data, showing how many times each category appears.

barplot(): Visualizes categorical data using bar charts.

prop.table(): Converts the frequency table to proportions (useful for relative frequency analysis).

# Frequency Table for Gender
table(df$Gender)
## 
## Female   Male 
## 124324 125676
# Frequency Table for Product Category
table(df$Product_Category)
## 
##       Books    Clothing Electronics        Home 
##       62247       62581       62630       62542
# Frequency Table for Payment Method
table(df$Payment_Method)
## 
##        Cash Credit Card      PayPal 
##       83012       83547       83441
# Frequency Table for Returns
table(df$Returns)
## 
##      0      1 
## 148524 101476
# Frequency Table for Churn
table(df$Churn)
## 
##      0      1 
## 199870  50130
# Bar Plot for Gender
gender_counts <- table(df$Gender)
barplot(gender_counts,
        main = "Gender Distribution",
        col = c("lightblue", "lightgreen"),
        xlab = "Gender",
        ylab = "Count",
        names.arg = names(gender_counts))

# Bar Plot for Product Category
product_category_counts <- table(df$Product_Category)
barplot(product_category_counts,
        main = "Product Category Distribution",
        col = rainbow(length(product_category_counts)),
        xlab = "Product Category",
        ylab = "Count",
        las = 2)  # Rotate axis labels for better readability

# Bar Plot for Payment Method
payment_method_counts <- table(df$Payment_Method)
barplot(payment_method_counts,
        main = "Payment Method Distribution",
        col = c("lightblue", "lightcoral", "lightgreen"),
        xlab = "Payment Method",
        ylab = "Count",
        names.arg = names(payment_method_counts))

# Bar Plot for Returns
returns_counts <- table(df$Returns)
barplot(returns_counts,
        main = "Returns Distribution",
        col = c("lightblue", "lightpink"),
        xlab = "Returns",
        ylab = "Count",
        names.arg = names(returns_counts))

# Bar Plot for Churn
churn_counts <- table(df$Churn)
barplot(churn_counts,
        main = "Churn Distribution",
        col = c("lightgreen", "lightcoral"),
        xlab = "Churn",
        ylab = "Count",
        names.arg = names(churn_counts))

# Bar Plot with Proportions for Gender
gender_proportions <- prop.table(table(df$Gender))
barplot(gender_proportions,
        main = "Gender Distribution (Proportions)",
        col = c("lightblue", "lightgreen"),
        xlab = "Gender",
        ylab = "Proportion",
        names.arg = names(gender_proportions))

# Bar Plot with Proportions for Product Category
product_category_proportions <- prop.table(table(df$Product_Category))
barplot(product_category_proportions,
        main = "Product Category Distribution (Proportions)",
        col = c("lightblue", "lightgreen"),
        xlab = "Product Category",
        ylab = "Proportion",
        names.arg = names(product_category_proportions))

# Bar Plot with Proportions for Payment Method
payment_method_proportions <- prop.table(table(df$Payment_Method))
barplot(payment_method_proportions,
        main = "Payment Method Distribution (Proportions)",
        col = c("lightblue", "lightgreen"),
        xlab = "Payment Method",
        ylab = "Proportion",
        names.arg = names(payment_method_proportions))

# Bar Plot with Proportions for Returns
returns_proportions <- prop.table(table(df$Returns))
barplot(returns_proportions,
        main = "Payment Method Distribution (Proportions)",
        col = c("lightblue", "lightgreen"),
        xlab = "Returns",
        ylab = "Proportion",
        names.arg = names(returns_proportions))

# Bar Plot with Proportions for Churn
churn_proportions <- prop.table(table(df$Churn))
barplot(churn_proportions,
        main = "Churn Distribution (Proportions)",
        col = c("lightgreen", "lightcoral"),
        xlab = "Churn",
        ylab = "Proportion",
        names.arg = names(churn_proportions))

Bivariate Analysis (Churn vs. Features)

# Boxplot to compare Total Purchase Amount by Churn status
boxplot(Total_Purchase_Amount ~ Churn,
        data = df,
        main = "Comparison of Total Purchase Amount by Churn Status",
        xlab = "Churn",
        ylab = "Total Purchase Amount",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Churned", "Churned"))

# Boxplot for Customer Age vs. Churn status
boxplot(Age ~ Churn,
        data = df,
        main = "Comparison of Customer Age by Churn Status",
        xlab = "Churn",
        ylab = "Customer Age",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Churned", "Churned"))

# Boxplot for Product Price vs. Churn status
boxplot(Product_Price ~ Churn,
        data = df,
        main = "Comparison of Product Price by Churn Status",
        xlab = "Churn",
        ylab = "Product Price",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Churned", "Churned"))

library(RColorBrewer)

# Create a bar plot for Gender vs Churn
gender_churn_table <- table(df$Gender, df$Churn)
barplot(gender_churn_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral"),
        legend = rownames(gender_churn_table),
        main = "Gender vs Churn",
        xlab = "Churn Status",
        ylab = "Count",
        names.arg = c("Non-Churned", "Churned"))

# Create a bar plot for Payment Method vs Churn
payment_method_churn_table <- table(df$Payment_Method, df$Churn)
barplot(payment_method_churn_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral", "lightyellow"),
        legend = rownames(payment_method_churn_table),
        main = "Payment Method vs Churn",
        xlab = "Payment Method Status",
        ylab = "Count",
        names.arg = c("Non-Churned", "Churned"))

# Create a bar plot for Product Category vs Churn
product_category_churn_table <- table(df$Product_Category, df$Churn)
barplot(product_category_churn_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral", "lightyellow", "lightgreen"),
        legend = rownames(product_category_churn_table),
        main = "Product Category vs Churn",
        xlab = "Churn Status",
        ylab = "Count",
        names.arg = c("Non-Churned", "Churned"))

# Create a bar plot for Purchase Month Name vs Churn
month_colors <- brewer.pal(12, "Set3")
purchase_month_name_churn_table <- table(df$Purchase_Month_Name, df$Churn)
barplot(purchase_month_name_churn_table,
        beside = TRUE,
        col = month_colors,
        legend = rownames(purchase_month_name_churn_table),
        main = "Purchase Month Name vs Churn",
        xlab = "Churn Status",
        ylab = "Count",
        names.arg = c("Non-Churned", "Churned"))

Bivariate Analysis (Returns vs. Features)

# Boxplot to compare Total Purchase Amount by Return status
boxplot(Total_Purchase_Amount ~ Returns,
        data = df,
        main = "Comparison of Total Purchase Amount by Return Status",
        xlab = "Return",
        ylab = "Total Purchase Amount",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Returned", "Returned"))

# Boxplot for Customer Age vs. Return status
boxplot(Age ~ Returns,
        data = df,
        main = "Comparison of Customer Age by Return Status",
        xlab = "Return",
        ylab = "Customer Age",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Returned", "Returned"))

# Boxplot for Product Price vs. Return status
boxplot(Product_Price ~ Returns,
        data = df,
        main = "Comparison of Product Price by Return Status",
        xlab = "Return",
        ylab = "Product Price",
        col = c("lightblue", "lightcoral"),
        names = c("Non-Returned", "Returned"))

# Contingency table for Gender vs Returns
table(df$Gender, df$Returns)
##         
##              0     1
##   Female 74040 50284
##   Male   74484 51192
# Contingency table for Payment Method vs Returns
table(df$Payment_Method, df$Returns)
##              
##                   0     1
##   Cash        49326 33686
##   Credit Card 49689 33858
##   PayPal      49509 33932
# Contingency table for Product Category vs Returns
table(df$Product_Category, df$Returns)
##              
##                   0     1
##   Books       36841 25406
##   Clothing    37279 25302
##   Electronics 37182 25448
##   Home        37222 25320
# Contingency table for Purchase Month Name vs Returns
table(df$Purchase_Month_Name, df$Returns)
##            
##                 0     1
##   January   13509  9373
##   February  12373  8508
##   March     13470  9368
##   April     13220  8918
##   May       13689  9305
##   June      13249  8883
##   July      13655  9378
##   August    13776  9384
##   September 11388  7579
##   October   10078  6973
##   November   9901  6789
##   December  10216  7018
# Create a bar plot for Gender vs Returns
gender_returns_table <- table(df$Gender, df$Returns)
barplot(gender_returns_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral"),
        legend = rownames(gender_returns_table),
        main = "Gender vs Returns",
        xlab = "Returns Status",
        ylab = "Count",
        names.arg = c("Non-Returned", "Returned"))

# Create a bar plot for Payment Method vs Returns
payment_method_returns_table <- table(df$Payment_Method, df$Returns)
barplot(payment_method_returns_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral", "lightyellow"),
        legend = rownames(payment_method_returns_table),
        main = "Payment Method vs Returns",
        xlab = "Returns Status",
        ylab = "Count",
        names.arg = c("Non-Returned", "Returned"))

# Create a bar plot for Product Category vs Returns
product_category_returns_table <- table(df$Product_Category, df$Returns)
barplot(product_category_returns_table,
        beside = TRUE,
        col = c("lightblue", "lightcoral", "lightyellow", "lightgreen"),
        legend = rownames(product_category_returns_table),
        main = "Product Category vs Returns",
        xlab = "Returns Status",
        ylab = "Count",
        names.arg = c("Non-Returned", "Returned"))

# Create a bar plot for Purchase Month Name vs Returns
month_colors <- brewer.pal(12, "Set3")
purchase_month_name_returns_table <- table(df$Purchase_Month_Name, df$Returns)
barplot(purchase_month_name_returns_table,
        beside = TRUE,
        col = month_colors,
        legend = rownames(purchase_month_name_returns_table),
        main = "Purchase Month Name vs Returns",
        xlab = "Returns Status",
        ylab = "Count",
        names.arg = c("Non-Returned", "Returned"))

Return Rate (by Product Category, Gender, Purchase Month Name, Payment Method)

# Return rate by product category
return_rate <- df %>%
  group_by(Product_Category) %>%
  summarise(Returns = sum(Returns == "1", na.rm = TRUE),
            Total_Orders = n(),
            Return_Rate = Returns / Total_Orders)

print(return_rate)
## # A tibble: 4 × 4
##   Product_Category Returns Total_Orders Return_Rate
##   <chr>              <int>        <int>       <dbl>
## 1 Books              25406        62247       0.408
## 2 Clothing           25302        62581       0.404
## 3 Electronics        25448        62630       0.406
## 4 Home               25320        62542       0.405
# Return rate by gender
return_rate_gender <- df %>%
  group_by(Gender) %>%
  summarise(Returns = sum(Returns == "1", na.rm = TRUE),
            Total_Orders = n(),
            Return_Rate = Returns / Total_Orders)

print(return_rate_gender)
## # A tibble: 2 × 4
##   Gender Returns Total_Orders Return_Rate
##   <chr>    <int>        <int>       <dbl>
## 1 Female   50284       124324       0.404
## 2 Male     51192       125676       0.407
ggplot(return_rate_gender, aes(x = Gender, y = Return_Rate, fill = Gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Return Rate by Gender", x = "Gender", y = "Return Rate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format())

# Return rate by purchase month name
return_rate_month <- df %>%
  group_by(Purchase_Month_Name) %>%
  summarise(Returns = sum(Returns == "1", na.rm = TRUE),
            Total_Orders = n(),
            Return_Rate = Returns / Total_Orders)

print(return_rate_month)
## # A tibble: 12 × 4
##    Purchase_Month_Name Returns Total_Orders Return_Rate
##    <ord>                 <int>        <int>       <dbl>
##  1 January                9373        22882       0.410
##  2 February               8508        20881       0.407
##  3 March                  9368        22838       0.410
##  4 April                  8918        22138       0.403
##  5 May                    9305        22994       0.405
##  6 June                   8883        22132       0.401
##  7 July                   9378        23033       0.407
##  8 August                 9384        23160       0.405
##  9 September              7579        18967       0.400
## 10 October                6973        17051       0.409
## 11 November               6789        16690       0.407
## 12 December               7018        17234       0.407
return_rate_month$Purchase_Month_Name <- factor(return_rate_month$Purchase_Month_Name, levels = month.name)

ggplot(return_rate_month, aes(x = Purchase_Month_Name, y = Return_Rate, fill = Purchase_Month_Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Return Rate by Month", x = "Month", y = "Return Rate") +
  theme_minimal() +
  scale_fill_manual(values = brewer.pal(12, "Set3")) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Return rate by payment method
return_rate_payment <- df %>%
  group_by(Payment_Method) %>%
  summarise(Returns = sum(Returns == "1", na.rm = TRUE),
            Total_Orders = n(),
            Return_Rate = Returns / Total_Orders)

print(return_rate_payment)
## # A tibble: 3 × 4
##   Payment_Method Returns Total_Orders Return_Rate
##   <chr>            <int>        <int>       <dbl>
## 1 Cash             33686        83012       0.406
## 2 Credit Card      33858        83547       0.405
## 3 PayPal           33932        83441       0.407
ggplot(return_rate_payment, aes(x = Payment_Method, y = Return_Rate, fill = Payment_Method)) +
  geom_bar(stat = "identity") +
  labs(title = "Return Rate by Payment Method", x = "Payment Method", y = "Return Rate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Churn Rate (by Product Category, Gender, Purchase Month Name, Payment Method)

# Churn rate by product category
churn_rate_category <- df %>%
  group_by(Product_Category) %>%
  summarise(Churned = sum(Churn == "1", na.rm = TRUE),
            Total_Customers = n(),
            Churn_Rate = Churned / Total_Customers)

print(churn_rate_category)
## # A tibble: 4 × 4
##   Product_Category Churned Total_Customers Churn_Rate
##   <chr>              <int>           <int>      <dbl>
## 1 Books              12499           62247      0.201
## 2 Clothing           12593           62581      0.201
## 3 Electronics        12553           62630      0.200
## 4 Home               12485           62542      0.200
ggplot(churn_rate_category, aes(x = reorder(Product_Category, -Churn_Rate), y = Churn_Rate, fill = Product_Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Churn Rate by Product Category", x = "Product Category", y = "Churn Rate") +
  theme_minimal() +
  scale_fill_manual(values = brewer.pal(n = min(12, length(unique(churn_rate_category$Product_Category))), name = "Set3")) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")

# Churn rate by gender
churn_rate_gender <- df %>%
  group_by(Gender) %>%
  summarise(Churned = sum(Churn == "1", na.rm = TRUE),
            Total_Customers = n(),
            Churn_Rate = Churned / Total_Customers)

print(churn_rate_gender)
## # A tibble: 2 × 4
##   Gender Churned Total_Customers Churn_Rate
##   <chr>    <int>           <int>      <dbl>
## 1 Female   25067          124324      0.202
## 2 Male     25063          125676      0.199
ggplot(churn_rate_gender, aes(x = Gender, y = Churn_Rate, fill = Gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Churn Rate by Gender", x = "Gender", y = "Churn Rate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format())

# Churn rate by purchase month name
churn_rate_month <- df %>%
  group_by(Purchase_Month_Name) %>%
  summarise(Churned = sum(Churn == "1", na.rm = TRUE),
            Total_Customers = n(),
            Churn_Rate = Churned / Total_Customers)

# Sort month names properly
churn_rate_month$Purchase_Month_Name <- factor(churn_rate_month$Purchase_Month_Name, levels = month.name)

print(churn_rate_month)
## # A tibble: 12 × 4
##    Purchase_Month_Name Churned Total_Customers Churn_Rate
##    <ord>                 <int>           <int>      <dbl>
##  1 January                4560           22882      0.199
##  2 February               4183           20881      0.200
##  3 March                  4560           22838      0.200
##  4 April                  4493           22138      0.203
##  5 May                    4558           22994      0.198
##  6 June                   4455           22132      0.201
##  7 July                   4702           23033      0.204
##  8 August                 4677           23160      0.202
##  9 September              3756           18967      0.198
## 10 October                3399           17051      0.199
## 11 November               3271           16690      0.196
## 12 December               3516           17234      0.204
ggplot(churn_rate_month, aes(x = Purchase_Month_Name, y = Churn_Rate, fill = Purchase_Month_Name)) +
  geom_bar(stat = "identity") +
  labs(title = "Churn Rate by Month", x = "Month", y = "Churn Rate") +
  theme_minimal() +
  scale_fill_manual(values = brewer.pal(12, "Set3")) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Churn rate by payment method
churn_rate_payment <- df %>%
  group_by(Payment_Method) %>%
  summarise(Churned = sum(Churn == "1", na.rm = TRUE),
            Total_Customers = n(),
            Churn_Rate = Churned / Total_Customers)

print(churn_rate_payment)
## # A tibble: 3 × 4
##   Payment_Method Churned Total_Customers Churn_Rate
##   <chr>            <int>           <int>      <dbl>
## 1 Cash             16812           83012      0.203
## 2 Credit Card      16427           83547      0.197
## 3 PayPal           16891           83441      0.202
ggplot(churn_rate_payment, aes(x = Payment_Method, y = Churn_Rate, fill = Payment_Method)) +
  geom_bar(stat = "identity") +
  labs(title = "Churn Rate by Payment Method", x = "Payment Method", y = "Churn Rate") +
  theme_minimal() +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

Data Modelling

1. Classification (Churn Prediction)

# Required libraries
library(caret) # Classification and Regression Training
library(e1071) # Tools for SVM, Naive Bayes, and more
library(randomForest) # Random Forest Algorithm
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost) # Extreme Gradient Boosting
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(MLmetrics) # Evaluation Metrics for ML Models
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(lubridate)
library(dplyr)
# Load and Prepare Data
df <- read.csv("ecommerce_customer_data_large.csv", stringsAsFactors = FALSE)
colnames(df) <- gsub("\\.", "_", colnames(df))
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns <- as.integer(df$Returns)
df <- df %>% dplyr::select(-Customer_Age)
# Convert relevant columns to factors
df$Gender <- as.factor(df$Gender)
df$Payment_Method <- as.factor(df$Payment_Method)
df$Product_Category <- as.factor(df$Product_Category)
df$Churn <- as.factor(df$Churn) # Target
# Train-test split
set.seed(42)
trainIndex <- createDataPartition(df$Churn, p = 0.80, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]

# Select relevant features
selected_vars <- c("Churn", "Product_Category", "Product_Price", "Quantity",
                   "Total_Purchase_Amount", "Payment_Method", "Age", "Gender")
train <- train[, selected_vars]
test <- test[, selected_vars]

Customer Churn (Classification)

Models to be used: 1. Logistic Regression, 2. Random Forest, 3. XGBoost

1. Logistic Regression (Churn Prediction)
# Fit model
model_log <- glm(
  Churn ~ ., data = train, family = "binomial"
)

# Predict on test data
pred_prob_log <- predict(model_log, newdata = test, type = "response")
pred_log <- factor(ifelse(pred_prob_log > 0.5, "1", "0"), levels = levels(test$Churn))

# Evaluation
conf_mat_log <- confusionMatrix(pred_log, test$Churn, positive = "1")
print(conf_mat_log)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39974 10026
##          1     0     0
##                                          
##                Accuracy : 0.7995         
##                  95% CI : (0.7959, 0.803)
##     No Information Rate : 0.7995         
##     P-Value [Acc > NIR] : 0.5027         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.0000         
##             Specificity : 1.0000         
##          Pos Pred Value :    NaN         
##          Neg Pred Value : 0.7995         
##              Prevalence : 0.2005         
##          Detection Rate : 0.0000         
##    Detection Prevalence : 0.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 1              
## 
# Metrics
accuracy_log <- conf_mat_log$overall["Accuracy"]
precision_log <- conf_mat_log$byClass["Pos Pred Value"]
recall_log <- conf_mat_log$byClass["Sensitivity"]
f1_log <- 2 * ((precision_log * recall_log) / (precision_log + recall_log))

cat("\n--- Logistic Regression Metrics ---\n")
## 
## --- Logistic Regression Metrics ---
cat("Accuracy:", accuracy_log, "\n")
## Accuracy: 0.79948
cat("Precision:", precision_log, "\n")
## Precision: NaN
cat("Recall:", recall_log, "\n")
## Recall: 0
cat("F1 Score:", f1_log, "\n")
## F1 Score: NaN
2. Random Forest (Churn Prediction)
set.seed(42)
model_rf <- randomForest(Churn ~ ., data = train, ntree = 100)
pred_rf <- predict(model_rf, test)

# Evaluation
conf_mat_rf <- confusionMatrix(pred_rf, test$Churn, positive = "1")
print(conf_mat_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39971 10025
##          1     3     1
##                                           
##                Accuracy : 0.7994          
##                  95% CI : (0.7959, 0.8029)
##     No Information Rate : 0.7995          
##     P-Value [Acc > NIR] : 0.5116          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 9.974e-05       
##             Specificity : 9.999e-01       
##          Pos Pred Value : 2.500e-01       
##          Neg Pred Value : 7.995e-01       
##              Prevalence : 2.005e-01       
##          Detection Rate : 2.000e-05       
##    Detection Prevalence : 8.000e-05       
##       Balanced Accuracy : 5.000e-01       
##                                           
##        'Positive' Class : 1               
## 
# Metrics
accuracy_rf <- conf_mat_rf$overall["Accuracy"]
precision_rf <- conf_mat_rf$byClass["Pos Pred Value"]
recall_rf <- conf_mat_rf$byClass["Sensitivity"]
f1_rf <- 2 * ((precision_rf * recall_rf) / (precision_rf + recall_rf))

cat("\n--- Random Forest Metrics ---\n")
## 
## --- Random Forest Metrics ---
cat("Accuracy:", accuracy_rf, "\n")
## Accuracy: 0.79944
cat("Precision:", precision_rf, "\n")
## Precision: 0.25
cat("Recall:", recall_rf, "\n")
## Recall: 9.974067e-05
cat("F1 Score:", f1_rf, "\n")
## F1 Score: 0.0001994018
3. XGBoost (Churn Prediction)
# Prepare design matrices
train_matrix <- model.matrix(Churn ~ . -1, data = train)
test_matrix <- model.matrix(Churn ~ . -1, data = test)

# Prepare binary labels
train_label <- ifelse(train$Churn == "1", 1, 0)
test_label <- ifelse(test$Churn == "1", 1, 0)

# DMatrix objects
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

# Set parameters and train
params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss"
)

set.seed(42)
model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  verbose = 0
)

# Predict
pred_prob_xgb <- predict(model_xgb, dtest)
pred_xgb <- factor(ifelse(pred_prob_xgb > 0.5, "1", "0"), levels = levels(test$Churn))

# Evaluation
conf_mat_xgb <- confusionMatrix(pred_xgb, test$Churn, positive = "1")
print(conf_mat_xgb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39960 10016
##          1    14    10
##                                           
##                Accuracy : 0.7994          
##                  95% CI : (0.7959, 0.8029)
##     No Information Rate : 0.7995          
##     P-Value [Acc > NIR] : 0.5205          
##                                           
##                   Kappa : 0.001           
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.0009974       
##             Specificity : 0.9996498       
##          Pos Pred Value : 0.4166667       
##          Neg Pred Value : 0.7995838       
##              Prevalence : 0.2005200       
##          Detection Rate : 0.0002000       
##    Detection Prevalence : 0.0004800       
##       Balanced Accuracy : 0.5003236       
##                                           
##        'Positive' Class : 1               
## 
# Metrics
accuracy_xgb <- conf_mat_xgb$overall["Accuracy"]
precision_xgb <- conf_mat_xgb$byClass["Pos Pred Value"]
recall_xgb <- conf_mat_xgb$byClass["Sensitivity"]
f1_xgb <- 2 * ((precision_xgb * recall_xgb) / (precision_xgb + recall_xgb))

cat("\n--- XGBoost Metrics ---\n")
## 
## --- XGBoost Metrics ---
cat("Accuracy:", accuracy_xgb, "\n")
## Accuracy: 0.7994
cat("Precision:", precision_xgb, "\n")
## Precision: 0.4166667
cat("Recall:", recall_xgb, "\n")
## Recall: 0.0009974067
cat("F1 Score:", f1_xgb, "\n")
## F1 Score: 0.00199005
Summary Table
# Extract and compute metrics after each model evaluation

# Logistic Regression
log_accuracy <- as.numeric(conf_mat_log$overall["Accuracy"])
log_precision <- as.numeric(conf_mat_log$byClass["Pos Pred Value"])
log_recall <- as.numeric(conf_mat_log$byClass["Sensitivity"])
log_f1 <- 2 * ((log_precision * log_recall) / (log_precision + log_recall))

# Random Forest
rf_accuracy <- as.numeric(conf_mat_rf$overall["Accuracy"])
rf_precision <- as.numeric(conf_mat_rf$byClass["Pos Pred Value"])
rf_recall <- as.numeric(conf_mat_rf$byClass["Sensitivity"])
rf_f1 <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))

# XGBoost
xgb_accuracy <- as.numeric(conf_mat_xgb$overall["Accuracy"])
xgb_precision <- as.numeric(conf_mat_xgb$byClass["Pos Pred Value"])
xgb_recall <- as.numeric(conf_mat_xgb$byClass["Sensitivity"])
xgb_f1 <- 2 * ((xgb_precision * xgb_recall) / (xgb_precision + xgb_recall))

# Create summary table
summary_table <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "XGBoost"),
  Accuracy = c(log_accuracy, rf_accuracy, xgb_accuracy),
  Precision = c(log_precision, rf_precision, xgb_precision),
  Recall = c(log_recall, rf_recall, xgb_recall),
  F1_Score = c(log_f1, rf_f1, xgb_f1),
  stringsAsFactors = FALSE
)

# Round numeric columns
summary_table[, 2:5] <- round(summary_table[, 2:5], 6)

# View summary
print(summary_table)
##                 Model Accuracy Precision   Recall F1_Score
## 1 Logistic Regression  0.79948       NaN 0.000000      NaN
## 2       Random Forest  0.79944  0.250000 0.000100 0.000199
## 3             XGBoost  0.79940  0.416667 0.000997 0.001990
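As a cross-check on the manual F1 calculations above, the MLmetrics package (attached in the setup chunk but otherwise unused) offers a one-line equivalent; this is a sketch, and f1_xgb_alt is an illustrative name.

# Cross-check the manual F1 computation with MLmetrics (positive class "1")
f1_xgb_alt <- MLmetrics::F1_Score(y_true = test$Churn, y_pred = pred_xgb, positive = "1")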

SMOTE

All three models essentially predict the majority class: accuracy sits at the 79.95% no-information rate while recall for churners is near zero, a symptom of the roughly 80/20 class imbalance in Churn. To address this, we rebalance the data with SMOTE (Synthetic Minority Over-sampling Technique) and re-fit the same three models.

# Read the file
df <- read.csv("ecommerce_customer_data_large.csv")
head(df)
##   Customer.ID       Purchase.Date Product.Category Product.Price Quantity
## 1       44605 2023-05-03 21:30:02             Home           177        1
## 2       44605 2021-05-16 13:57:44      Electronics           174        3
## 3       44605 2020-07-13 06:16:57            Books           413        1
## 4       44605 2023-01-17 13:14:36      Electronics           396        3
## 5       44605 2021-05-01 11:29:27            Books           259        4
## 6       13738 2022-08-25 06:48:33             Home           191        3
##   Total.Purchase.Amount Payment.Method Customer.Age Returns  Customer.Name Age
## 1                  2427         PayPal           31       1    John Rivera  31
## 2                  2448         PayPal           31       1    John Rivera  31
## 3                  2345    Credit Card           31       1    John Rivera  31
## 4                   937           Cash           31       0    John Rivera  31
## 5                  2598         PayPal           31       1    John Rivera  31
## 6                  3722    Credit Card           27       1 Lauren Johnson  27
##   Gender Churn
## 1 Female     0
## 2 Female     0
## 3 Female     0
## 4 Female     0
## 5 Female     0
## 6 Female     0
# Clean column names
colnames(df) <- gsub("\\.", "_", colnames(df))


# Convert data types
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)

# Drop redundant column
df <- df %>% dplyr::select(-Customer_Age)

# Extract date-related features
df$Purchase_Day_Name <- weekdays(df$Purchase_Date)
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Month_Name <- month(df$Purchase_Date, label = TRUE, abbr = FALSE)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")

# Drop non-numeric or unnecessary columns for modeling
df_numeric <- df %>% dplyr::select(-Purchase_Date)

# Convert all character/factor variables to numeric
df_numeric[] <- lapply(df_numeric, function(x) {
  if (is.factor(x) || is.character(x)) {
    return(as.numeric(as.factor(x)))
  } else {
    return(x)
  }
})
#load SMOTE library
library(smotefamily)
# Apply SMOTE
set.seed(42)
smote_result <- SMOTE(
  X = df_numeric[, -which(names(df_numeric) == "Churn")],
  target = df_numeric$Churn,
  K = 5
)

df_smote <- smote_result$data
df_smote$Churn <- factor(df_smote$class, levels = c(0, 1))
df_smote$class <- NULL

# Split data
set.seed(42)
trainIndex <- createDataPartition(df_smote$Churn, p = 0.8, list = FALSE)
train_smote <- df_smote[trainIndex, ]
test_smote <- df_smote[-trainIndex, ]
1. Logistic Regression (with SMOTE)
model_log <- glm(Churn ~ ., data = train_smote, family = "binomial")
pred_prob_log <- predict(model_log, newdata = test_smote, type = "response")
pred_log <- factor(ifelse(pred_prob_log > 0.5, "1", "0"), levels = c("0", "1"))
test_smote$Churn <- factor(test_smote$Churn, levels = c("0", "1"))
conf_log <- confusionMatrix(pred_log, test_smote$Churn, positive = "1")

# Metrics
f1_log <- 2 * (conf_log$byClass["Pos Pred Value"] * conf_log$byClass["Sensitivity"]) /
  (conf_log$byClass["Pos Pred Value"] + conf_log$byClass["Sensitivity"])
2. Random Forest (with SMOTE)
model_rf <- randomForest(Churn ~ ., data = train_smote, ntree = 100)
pred_rf <- predict(model_rf, test_smote)
pred_rf <- factor(pred_rf, levels = c("0", "1"))
conf_rf <- confusionMatrix(pred_rf, test_smote$Churn, positive = "1")
f1_rf <- 2 * (conf_rf$byClass["Pos Pred Value"] * conf_rf$byClass["Sensitivity"]) /
  (conf_rf$byClass["Pos Pred Value"] + conf_rf$byClass["Sensitivity"])
3. XGBoost (with SMOTE)
train_matrix <- model.matrix(Churn ~ . -1, data = train_smote)
test_matrix <- model.matrix(Churn ~ . -1, data = test_smote)
train_label <- as.numeric(as.character(train_smote$Churn))
test_label <- as.numeric(as.character(test_smote$Churn))
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)

params <- list(
  objective = "binary:logistic",
  eval_metric = "logloss"
)

model_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,
  verbose = 0
)

pred_prob_xgb <- predict(model_xgb, dtest)
pred_xgb <- factor(ifelse(pred_prob_xgb > 0.5, "1", "0"), levels = c("0", "1"))
conf_xgb <- confusionMatrix(pred_xgb, test_smote$Churn, positive = "1")
f1_xgb <- 2 * (conf_xgb$byClass["Pos Pred Value"] * conf_xgb$byClass["Sensitivity"]) /
  (conf_xgb$byClass["Pos Pred Value"] + conf_xgb$byClass["Sensitivity"])
Summary
cat("\n--- Logistic Regression Metrics ---\n")
## 
## --- Logistic Regression Metrics ---
print(conf_log)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39974 30078
##          1     0     0
##                                          
##                Accuracy : 0.5706         
##                  95% CI : (0.567, 0.5743)
##     No Information Rate : 0.5706         
##     P-Value [Acc > NIR] : 0.5016         
##                                          
##                   Kappa : 0              
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.0000         
##             Specificity : 1.0000         
##          Pos Pred Value :    NaN         
##          Neg Pred Value : 0.5706         
##              Prevalence : 0.4294         
##          Detection Rate : 0.0000         
##    Detection Prevalence : 0.0000         
##       Balanced Accuracy : 0.5000         
##                                          
##        'Positive' Class : 1              
## 
cat("F1 Score:", f1_log, "\n")
## F1 Score: NaN
cat("\n--- Random Forest Metrics ---\n")
## 
## --- Random Forest Metrics ---
print(conf_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39971 10031
##          1     3 20047
##                                           
##                Accuracy : 0.8568          
##                  95% CI : (0.8541, 0.8594)
##     No Information Rate : 0.5706          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6951          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6665          
##             Specificity : 0.9999          
##          Pos Pred Value : 0.9999          
##          Neg Pred Value : 0.7994          
##              Prevalence : 0.4294          
##          Detection Rate : 0.2862          
##    Detection Prevalence : 0.2862          
##       Balanced Accuracy : 0.8332          
##                                           
##        'Positive' Class : 1               
## 
cat("F1 Score:", f1_rf, "\n")
## F1 Score: 0.7998324
cat("\n--- XGBoost Metrics ---\n")
## 
## --- XGBoost Metrics ---
print(conf_xgb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 39948  9666
##          1    26 20412
##                                           
##                Accuracy : 0.8616          
##                  95% CI : (0.8591, 0.8642)
##     No Information Rate : 0.5706          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.706           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.6786          
##             Specificity : 0.9993          
##          Pos Pred Value : 0.9987          
##          Neg Pred Value : 0.8052          
##              Prevalence : 0.4294          
##          Detection Rate : 0.2914          
##    Detection Prevalence : 0.2918          
##       Balanced Accuracy : 0.8390          
##                                           
##        'Positive' Class : 1               
## 
cat("F1 Score:", f1_xgb, "\n")
## F1 Score: 0.80814

2. Regression

Finding the numeric features with the highest correlation and testing the categorical features with ANOVA hypothesis tests

library(corrplot)
## corrplot 0.95 loaded
# Correlation matrix for numeric variables (no feature shows meaningful correlation with the target apart from Age)
num_vars <- df %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", tl.cex = 0.8)

# Scatter plots vs. Total Purchase Amount
ggplot(df, aes(x = Product_Price, y = Total_Purchase_Amount)) +
  geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(df, aes(x = Quantity, y = Total_Purchase_Amount)) +
  geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

ggplot(df, aes(x = Age, y = Total_Purchase_Amount)) +
  geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

# Correlations with the target variable
target_corr <- cor_matrix[, "Total_Purchase_Amount"]
# Sort in decreasing order
sorted_corr <- sort(abs(target_corr), decreasing = TRUE)
print(sorted_corr)
## Total_Purchase_Amount                   Age        Purchase_Month 
##          1.0000000000          0.0565518329          0.0016261821 
##           Customer_ID         Product_Price              Quantity 
##          0.0013087744          0.0012972971          0.0012335733 
##               Returns                 Churn 
##          0.0010873979          0.0007055865
# Highest correlations with the target: Age (0.057) and Purchase_Month (0.0016)
# Next, test the significance of the categorical features using ANOVA
# (Payment_Method shows the strongest effect, i.e. the lowest p-value)

# Convert categorical variables to factors
df$Product_Category <- as.factor(df$Product_Category)
df$Payment_Method <- as.factor(df$Payment_Method)
df$Gender <- as.factor(df$Gender)

# Convert Purchase_Date to date format and extract useful features
df$Purchase_Date <- as.POSIXct(df$Purchase_Date)
df$Purchase_Year <- as.factor(format(df$Purchase_Date, "%Y"))
df$Purchase_Month <- as.factor(format(df$Purchase_Date, "%m"))
df$Purchase_Day <- as.factor(format(df$Purchase_Date, "%d"))
df$Purchase_Weekday <- as.factor(weekdays(df$Purchase_Date))

# For each categorical variable, perform ANOVA with Total_Purchase_Amount
anova_results <- aov(Total_Purchase_Amount ~ Product_Category, data = df)
summary(anova_results)
##                      Df    Sum Sq Mean Sq F value Pr(>F)  
## Product_Category      3 1.417e+07 4723972    2.27 0.0782 .
## Residuals        249996 5.202e+11 2080994                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_results <- aov(Total_Purchase_Amount ~ Payment_Method, data = df)
summary(anova_results)
##                    Df    Sum Sq  Mean Sq F value Pr(>F)   
## Payment_Method      2 2.432e+07 12158700   5.843 0.0029 **
## Residuals      249997 5.202e+11  2080945                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_results <- aov(Total_Purchase_Amount ~ Gender, data = df)
summary(anova_results)
##                 Df    Sum Sq Mean Sq F value Pr(>F)
## Gender           1 1.178e+06 1177981   0.566  0.452
## Residuals   249998 5.203e+11 2081029
# Lowest p-values: Product_Category (0.0782) and Payment_Method (0.0029)
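The three aov() calls above follow the same pattern, so they can be compacted into a single loop (a sketch, assuming df as prepared above):

# Run the same one-way ANOVA for each categorical predictor
cat_vars <- c("Product_Category", "Payment_Method", "Gender")
for (v in cat_vars) {
  f <- reformulate(v, response = "Total_Purchase_Amount")
  cat("\nANOVA:", v, "\n")
  print(summary(aov(f, data = df)))
}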

Keep the most statistically relevant features

  1. Age
  2. Product_Category
  3. Payment_Method
colnames(df)
##  [1] "Customer_ID"           "Purchase_Date"         "Product_Category"     
##  [4] "Product_Price"         "Quantity"              "Total_Purchase_Amount"
##  [7] "Payment_Method"        "Returns"               "Customer_Name"        
## [10] "Age"                   "Gender"                "Churn"                
## [13] "Purchase_Day_Name"     "Purchase_Month"        "Purchase_Month_Name"  
## [16] "Purchase_Year"         "Purchase_Day"          "Purchase_Weekday"
# Select the target and the chosen predictors
df_model <- df %>%
  dplyr::select(Total_Purchase_Amount, Age, Product_Category, Payment_Method) %>%
  na.omit()
df_model$Product_Category <- as.factor(df_model$Product_Category)
df_model$Payment_Method <- as.factor(df_model$Payment_Method)

# Create a polynomial term (Age^2) for polynomial regression
df_model$Age2 <- df_model$Age^2
# Split: 80% train / 20% test (seed set for reproducibility)
set.seed(42)
split <- createDataPartition(df_model$Total_Purchase_Amount, p = 0.8, list = FALSE)
train_data <- df_model[split, ]
test_data <- df_model[-split, ]
# Normalize numeric variables for linear models

pre_proc <- preProcess(train_data[, c("Age", "Age2")], method = c("center", "scale"))
train_data_norm <- train_data
test_data_norm <- test_data

train_data_norm[, c("Age", "Age2")] <- predict(pre_proc, train_data[, c("Age", "Age2")])
test_data_norm[, c("Age", "Age2")] <- predict(pre_proc, test_data[, c("Age", "Age2")])
1. Linear Regression
train_control <- trainControl(method = "cv", number = 5)
model_lm <- train(Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
                  data = train_data_norm, method = "lm", trControl = train_control)
2. Polynomial Regression (not run; the model call is kept for reference)
#model_poly <- train(Total_Purchase_Amount ~ Age + Age2 + Product_Category + Payment_Method,
                    #data = train_data_norm, method = "lm", trControl = train_control)
3. Random Forest
tune_grid <- expand.grid(mtry = c(2, 3))

model_rf <- train(
  Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
  data = train_data,
  method = "rf",
  trControl = trainControl(method = "cv", number = 3),
  tuneGrid = tune_grid,
  ntree = 100
)
4. Gradient Boosting (GBM)
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install gbm only when it is missing; an unconditional install.packages()
# inside a knitted document can fail with locked-DLL warnings, as it did here
if (!requireNamespace("gbm", quietly = TRUE)) install.packages("gbm")
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
model_gbm <- train(Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
                   data = train_data,
                   method = "gbm",
                   trControl = train_control,
                   verbose = FALSE)
Summary Table
pred_lm   <- predict(model_lm, test_data_norm)
#pred_poly <- predict(model_poly, test_data_norm)
pred_rf   <- predict(model_rf, test_data)
pred_gbm  <- predict(model_gbm, test_data)
# Install Metrics only when it is not already available
if (!requireNamespace("Metrics", quietly = TRUE)) install.packages("Metrics")
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(caret)

results <- data.frame(
  Model = c("Linear Regression",  "Random Forest", "Gradient Boosting"),
  RMSE = c(
    rmse(test_data$Total_Purchase_Amount, pred_lm),
    rmse(test_data$Total_Purchase_Amount, pred_rf),
    rmse(test_data$Total_Purchase_Amount, pred_gbm)
  ),
  MAE = c(
    mae(test_data$Total_Purchase_Amount, pred_lm),
    mae(test_data$Total_Purchase_Amount, pred_rf),
    mae(test_data$Total_Purchase_Amount, pred_gbm)
  ),
  R2 = c(
    R2(pred_lm, test_data$Total_Purchase_Amount),
    R2(pred_rf, test_data$Total_Purchase_Amount),
    R2(pred_gbm, test_data$Total_Purchase_Amount)
  ),
  MAPE = c(
    mape(test_data$Total_Purchase_Amount, pred_lm),
    mape(test_data$Total_Purchase_Amount, pred_rf),
    mape(test_data$Total_Purchase_Amount, pred_gbm)
  )
)

# Print the results
print(results)
##               Model     RMSE      MAE          R2     MAPE
## 1 Linear Regression 1441.212 1249.446 0.003189443 1.026447
## 2     Random Forest 1441.329 1249.578 0.003184376 1.027608
## 3 Gradient Boosting 1441.171 1249.385 0.003246331 1.026308
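Two things stand out in this table. First, Metrics::mape() returns a proportion, so a MAPE of about 1.03 means the average percentage error is roughly 103%. Second, R² values near zero mean the models barely improve on a constant prediction; a quick baseline check (a sketch, using the train/test objects from above) makes this concrete:

# Baseline: predict the training-set mean for every test observation
baseline_pred <- rep(mean(train_data$Total_Purchase_Amount), nrow(test_data))
rmse(test_data$Total_Purchase_Amount, baseline_pred)
mae(test_data$Total_Purchase_Amount, baseline_pred)

If the fitted models' RMSE is essentially the same as this baseline, the chosen features carry almost no signal about spend.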

3. Data Modeling - Predictive Analysis

# Using product category as the only independent variable and 'Returns' as the target

# Load libraries
library(ggplot2)
library(dplyr)
library(caret)
library(randomForest)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
## 
##     auc
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(xgboost)
library(e1071)
# Load and Prepare Dataset
df <- read.csv("ecommerce_customer_data_large.csv")

# Change the colnames with underscore instead of dot
colnames(df) <- gsub("\\.", "_", colnames(df))

# Fill missing Returns with 0
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- factor(df$Returns, levels = c(0, 1))  # Ensure correct factor levels

# Convert to factor
df$Product_Category <- as.factor(df$Product_Category)

# Encode numeric for XGBoost
df$Product_Category_Num <- as.numeric(df$Product_Category)
# Train-test split (80-20)
set.seed(42)
train_index <- createDataPartition(df$Returns, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Initialize result data frame
df_metrics <- data.frame(
  Model = character(),
  Accuracy = numeric(),
  Precision = numeric(),
  Recall = numeric(),
  F1_Score = numeric(),
  stringsAsFactors = FALSE
)
1. Logistic Regression
log_model <- glm(Returns ~ Product_Category, data = train_data, family = "binomial")
log_probs <- predict(log_model, newdata = test_data, type = "response")
log_preds <- ifelse(log_probs > 0.5, 1, 0)
# Explicit factor levels keep the predictions aligned with the reference
cm_log <- confusionMatrix(factor(log_preds, levels = c(0, 1)), test_data$Returns, positive = "1")
df_metrics <- rbind(df_metrics, data.frame(
  Model = "Logistic Regression",
  Accuracy = cm_log$overall["Accuracy"],
  Precision = cm_log$byClass["Precision"],
  Recall = cm_log$byClass["Recall"],
  F1_Score = cm_log$byClass["F1"]
))

roc_log <- roc(test_data$Returns, log_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_log, main = "ROC Curve - Logistic Regression")

2. Random Forest
rf_model <- randomForest(Returns ~ Product_Category, data = train_data, ntree = 100)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_preds <- ifelse(rf_probs > 0.5, 1, 0)
cm_rf <- confusionMatrix(factor(rf_preds, levels = c(0, 1)), test_data$Returns, positive = "1")
df_metrics <- rbind(df_metrics, data.frame(
  Model = "Random Forest",
  Accuracy = cm_rf$overall["Accuracy"],
  Precision = cm_rf$byClass["Precision"],
  Recall = cm_rf$byClass["Recall"],
  F1_Score = cm_rf$byClass["F1"]
))

roc_rf <- roc(test_data$Returns, rf_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_rf, col = "blue", main = "ROC Curve - Random Forest")

3. XGBoost
# Convert Returns to numeric for XGBoost (ensure 0 and 1)
train_label <- as.numeric(as.character(train_data$Returns))
test_label <- as.numeric(as.character(test_data$Returns))

# DMatrix for XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data$Product_Category_Num), label = train_label)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data$Product_Category_Num), label = test_label)

# Train XGBoost model
xgb_model <- xgboost(data = train_matrix, objective = "binary:logistic", nrounds = 50, verbose = 0)

# Predict and evaluate
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
cm_xgb <- confusionMatrix(factor(xgb_preds, levels = c(0, 1)), test_data$Returns, positive = "1")
# Add to df_metrics
df_metrics <- rbind(df_metrics, data.frame(
  Model = "XGBoost",
  Accuracy = cm_xgb$overall["Accuracy"],
  Precision = cm_xgb$byClass["Precision"],
  Recall = cm_xgb$byClass["Recall"],
  F1_Score = cm_xgb$byClass["F1"]
))

roc_xgb <- roc(test_data$Returns, xgb_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_xgb, col = "darkgreen", main = "ROC Curve - XGBoost")
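The three ROC curves above are drawn on separate plots; overlaying them (a sketch using pROC's add argument) makes the comparison direct:

# Overlay all three ROC curves for a direct comparison
plot(roc_log, col = "black", main = "ROC Curve Comparison")
plot(roc_rf, col = "blue", add = TRUE)
plot(roc_xgb, col = "darkgreen", add = TRUE)
legend("bottomright",
       legend = c("Logistic Regression", "Random Forest", "XGBoost"),
       col = c("black", "blue", "darkgreen"), lwd = 2)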

print(df_metrics)
##                         Model  Accuracy Precision Recall F1_Score
## Accuracy  Logistic Regression 0.5940919        NA      0       NA
## Accuracy1       Random Forest 0.5940919        NA      0       NA
## Accuracy2             XGBoost 0.5940919        NA      0       NA
# Checking why the results are identical across the three models
table(df$Product_Category, df$Returns)
##              
##                   0     1
##   Books       36841 25406
##   Clothing    37279 25302
##   Electronics 37182 25448
##   Home        37222 25320
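Converting these counts to row proportions shows just how uniform the return rate is across categories (a quick check, assuming df as above):

# Return rate within each product category
round(prop.table(table(df$Product_Category, df$Returns), margin = 1), 3)
# every category returns at roughly the same ~0.40-0.41 rate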
  • Analysis showed that return rates across product categories are remarkably similar (~40%), indicating that Product_Category alone is not a strong predictor of return behavior. We recommend incorporating additional customer, transactional, or product-level features to build a more predictive model.

  • The logistic regression model performs poorly here; we recommend relying on tree-based models such as random forest and XGBoost instead.

Feature Importance
# Using only the 4 most important features (Total_Purchase_Amount, Product_Price, Age, Purchase_Month)

# Load required libraries (4 selected features, no hyperparameter tuning)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)

# Data Preprocessing

# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")

# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))

# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)

# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")

# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)

# Select only required features
df_model <- df %>%
  dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)

# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))

# Random Forest (trained on a 30% subsample of the training data to keep
# runtime manageable; the full-data version is the commented line below)
#rf_model <- randomForest(as.factor(Returns) ~ ., data = train_data)
set.seed(123)
train_data_small <- train_data %>% dplyr::sample_frac(0.3)
rf_model <- randomForest(as.factor(Returns) ~ ., data = train_data_small)
rf_preds <- predict(rf_model, test_data[, -1])

# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data$Returns)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)

xgb_model <- xgboost(
  data = train_matrix,
  objective = "binary:logistic",
  nrounds = 100,
  max_depth = 6,
  eta = 0.1,
  verbose = 0
)

xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
length(rf_preds)
## [1] 49999
length(xgb_preds)
## [1] 50000
length(true_labels)
## [1] 50000
evaluate_model <- function(preds, true_labels, model_name) {
  preds <- factor(preds, levels = c(0, 1))
  true_labels <- factor(true_labels, levels = c(0, 1))  # add this to avoid level mismatch
  cm <- confusionMatrix(preds, true_labels, positive = "1")
  data.frame(
    Model = model_name,
    Accuracy = round(cm$overall["Accuracy"], 4),
    Precision = round(cm$byClass["Precision"], 4),
    Recall = round(cm$byClass["Recall"], 4),
    F1_Score = round(cm$byClass["F1"], 4)
  )
}
# Trim true_labels and xgb_preds to match rf_preds
true_labels_trimmed <- true_labels[1:length(rf_preds)]
xgb_preds_trimmed <- xgb_preds[1:length(rf_preds)]

# Now evaluate
results <- bind_rows(
  evaluate_model(rf_preds, true_labels_trimmed, "Random Forest"),
  evaluate_model(xgb_preds_trimmed, true_labels_trimmed, "XGBoost")
)
# Print results
print(results)
##                      Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest   0.5948        NA 0.0000       NA
## Accuracy...2       XGBoost   0.5939    0.3544 0.0028   0.0055
# Feature Importance (random forest)
varImpPlot(rf_model)

# Feature Importance (XGBoost)
importance <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance)

From the feature-importance plots of both Random Forest and XGBoost, we can observe that Total_Purchase_Amount, Product_Price, Age, and Purchase_Month are the four most important features to use for predictive analysis.

Top 4 Features vs Returns
# Using only the 4 most important features (Total_Purchase_Amount, Product_Price, Age, Purchase_Month)

# Load required libraries (4 selected features, no hyperparameter tuning)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)

# Data Preprocessing

# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")

# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))

# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)

# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")

# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)

# Select only required features
df_model <- df %>%
  dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)

# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))

# Random Forest (again on a 30% subsample, now with ntree = 100)
set.seed(123)
train_sample <- train_data %>% dplyr::sample_frac(0.3)
rf_model <- randomForest(as.factor(Returns) ~ ., data = train_sample, ntree = 100)
rf_preds <- predict(rf_model, test_data[, -1])

# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data$Returns)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)

xgb_model <- xgboost(
  data = train_matrix,
  objective = "binary:logistic",
  nrounds = 100,
  max_depth = 6,
  eta = 0.1,
  verbose = 0
)

xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)

# Ensure alignment
min_len <- min(length(rf_preds), length(xgb_preds), length(true_labels))
rf_preds_trimmed <- rf_preds[1:min_len]
xgb_preds_trimmed <- xgb_preds[1:min_len]
true_labels_trimmed <- true_labels[1:min_len]

# Evaluation function (explicit factor levels avoid level mismatches)
evaluate_model <- function(preds, true_labels, model_name) {
  preds <- factor(preds, levels = c(0, 1))
  true_labels <- factor(true_labels, levels = c(0, 1))
  
  cm <- confusionMatrix(preds, true_labels, positive = "1")
  
  data.frame(
    Model = model_name,
    Accuracy = round(cm$overall["Accuracy"], 4),
    Precision = round(cm$byClass["Precision"], 4),
    Recall = round(cm$byClass["Recall"], 4),
    F1_Score = round(cm$byClass["F1"], 4)
  )
}

# Evaluate all models using trimmed inputs
results <- bind_rows(
  evaluate_model(rf_preds_trimmed, true_labels_trimmed, "Random Forest"),
  evaluate_model(xgb_preds_trimmed, true_labels_trimmed, "XGBoost")
)

# Output
print(results)
##                      Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest   0.5948        NA 0.0000       NA
## Accuracy...2       XGBoost   0.5939    0.3544 0.0028   0.0055
# Check class ratio

sum(df$Returns == 1) / sum(df$Returns == 0)
## [1] 0.6832296

The class ratio is 0.683, meaning that for every negative case (non-return) there are approximately 0.68 positive cases (returns).

Returns (1) ≈ 40.6%

No Returns (0) ≈ 59.4%
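These percentages follow directly from the ratio: 0.683 / 1.683 ≈ 0.406. Equivalently (assuming df is still in scope):

# Class proportions as percentages; roughly 59.4% non-returns, 40.6% returns
round(prop.table(table(df$Returns)) * 100, 1)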

Class weight

Class weights are used to assign different importance (penalty) to each class when training a model. This helps the model pay more attention to the minority class during training.

In randomForest(), classwt = c("0" = 1, "1" = 1.5) gives the return class (1) extra weight, so the model penalizes misclassifying a return more heavily than misclassifying a non-return (0).

In xgboost, we use weight = ifelse(Returns == 1, 1.5, 1) to give higher weight to return instances.
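As an alternative to per-row weights, xgboost also has a built-in scale_pos_weight parameter, conventionally set to the ratio of negative to positive cases. Here that ratio is 1 / 0.683 ≈ 1.46, close to the hand-picked weight of 1.5. A sketch (assuming df as loaded above; the hypothetical params_weighted list would be passed to xgb.train):

# Built-in alternative: scale_pos_weight = ratio of negatives to positives
neg_pos_ratio <- sum(df$Returns == 0) / sum(df$Returns == 1)
params_weighted <- list(
  objective = "binary:logistic",
  scale_pos_weight = neg_pos_ratio
)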

# Load required libraries (using class weights to improve performance)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)



# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")

# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))

# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)

# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")

# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)

# Select only required features
df_model <- df %>%
  dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)

# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data  <- df_model[-train_index, ]

# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))

# Random Forest
set.seed(123)
rf_model <- randomForest(
  as.factor(Returns) ~ .,
  data = train_data,
  classwt = c("0" = 1, "1" = 1.5),  # class weight to penalize minority class
  ntree = 100
)
rf_preds <- predict(rf_model, test_data[, -1])

# XGBoost with class weights
train_matrix <- xgb.DMatrix(
  data = as.matrix(train_data[, -1]),
  label = train_data$Returns,
  weight = ifelse(train_data$Returns == 1, 1.5, 1)
)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)

xgb_model <- xgboost(
  data = train_matrix,
  objective = "binary:logistic",
  nrounds = 100,
  max_depth = 6,
  eta = 0.1,
  verbose = 0
)
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)

# Evaluation Function
evaluate_model <- function(preds, true_labels, model_name) {
  preds <- factor(preds, levels = c(0, 1))
  cm <- confusionMatrix(preds, true_labels, positive = "1")
  data.frame(
    Model = model_name,
    Accuracy = round(cm$overall["Accuracy"], 4),
    Precision = round(cm$byClass["Precision"], 4),
    Recall = round(cm$byClass["Recall"], 4),
    F1_Score = round(cm$byClass["F1"], 4)
  )
}

# Evaluate all models
results <- bind_rows(
  evaluate_model(rf_preds, true_labels, "Random Forest"),
  evaluate_model(xgb_preds, true_labels, "XGBoost")
)

# Print results
print(results)
##                      Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest   0.5506    0.4094 0.2466   0.3078
## Accuracy...2       XGBoost   0.4604    0.4031 0.6898   0.5088
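With weighted models, the default 0.5 probability cutoff is not necessarily the best operating point; the threshold itself can be tuned on the predicted probabilities. A sketch using pROC (already loaded above), selecting the cutoff that maximizes Youden's J:

# Choose a probability cutoff that balances sensitivity and specificity
roc_weighted <- roc(true_labels, xgb_probs)
coords(roc_weighted, x = "best", best.method = "youden")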

Conclusion

This analysis applied classification, regression, and predictive techniques to understand customer behavior in three key areas: churn, returns, and spending.

Churn Prediction: XGBoost combined with SMOTE achieved the best results, with exceptionally high precision and improved recall. This model enables businesses to proactively identify customers at risk of leaving and take targeted retention actions.

Return Prediction: Although return rates were consistent (~40%) across product categories, category alone was not a strong predictor. The most effective models used top features beyond category, offering deeper insights into return behavior and guiding inventory and policy decisions.

Spending Prediction: Regression models struggled to predict customer spend, with R² values near zero. The randomly scattered residuals suggest the predictions were unbiased, but the available features lack explanatory power.

Overall, while classification tasks like churn detection performed well, regression and category-based predictions highlight the need for richer behavioral and product-level features. Future improvements should focus on advanced feature engineering to enhance model accuracy and business value.