Project Title “E-Commerce User Behavior Modeling through Classification, Regression and Predictive Analytics”
The “E-commerce Customer Behaviour and Purchase” dataset serves as a rich and structured representation of online-shopping-customer interactions within an E-commerce platform. The dataset from Kaggle is described as a synthetically generated dataset using the Python Faker library. Although it is not real-world data, the dataset is modelled to reflect or imitate realistic e-commerce behaviour patterns.
The dataset is generated to facilitate various types of data science and machine learning tasks related to online retail, such as customer churn prediction, market basket analysis, recommendation systems, and purchasing trend analysis etc. The dataset is well-structured in tabular form, where each row represents a unique transaction made by a customer, and each column captures a specific attribute related to the transaction, the customer, or their purchase behaviour.
The dataset spans multiple relevant aspects of customer purchasing behaviour. Key features include customer demographic information such as Customer ID, Customer Name, Age, and Gender, along with transactional data like Purchase Date, Product Category, Product Price, Quantity, and Total Purchase Amount. It also includes behavioural indicators such as Payment Method, returns and customer churn. These features collectively allow for in-depth analyses of consumer habits and provide a strong foundation for predictive modelling.
Two sets of datasets comprising 250000 records and 13 columns are provided. Given its use case in modelling customer behaviour over time, it can be assumed that it spans a large number of records to accurately simulate e-commerce activities across different customers and timeframes.
To identify factors influencing customer churn, helping businesses improve retention strategies.
To model user behavior using classification and regression techniques to predict important customer outcomes.
# Reading the dataset
df <- read.csv("ecommerce_customer_data_large.csv", stringsAsFactors = FALSE)
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # For machine learning functions
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(e1071) # For SVM and other ML models
library(rpart) # For decision trees
library(nnet) # For neural networks
library(MASS) # For logistic regression
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(ggplot2) # For visualization
library(dplyr) # For data wrangling```
library(lubridate)
# View structure and summary (df)
# Structure (data types)
str(df)
## 'data.frame': 250000 obs. of 13 variables:
## $ Customer.ID : int 44605 44605 44605 44605 44605 13738 13738 13738 13738 13738 ...
## $ Purchase.Date : chr "2023-05-03 21:30:02" "2021-05-16 13:57:44" "2020-07-13 06:16:57" "2023-01-17 13:14:36" ...
## $ Product.Category : chr "Home" "Electronics" "Books" "Electronics" ...
## $ Product.Price : int 177 174 413 396 259 191 205 370 12 40 ...
## $ Quantity : int 1 3 1 3 4 3 1 5 2 4 ...
## $ Total.Purchase.Amount: int 2427 2448 2345 937 2598 3722 2773 1486 2175 4327 ...
## $ Payment.Method : chr "PayPal" "PayPal" "Credit Card" "Cash" ...
## $ Customer.Age : int 31 31 31 31 31 27 27 27 27 27 ...
## $ Returns : num 1 1 1 0 1 1 NA 1 NA 0 ...
## $ Customer.Name : chr "John Rivera" "John Rivera" "John Rivera" "John Rivera" ...
## $ Age : int 31 31 31 31 31 27 27 27 27 27 ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Churn : int 0 0 0 0 0 0 0 0 0 0 ...
# Quick overview
glimpse(df)
## Rows: 250,000
## Columns: 13
## $ Customer.ID <int> 44605, 44605, 44605, 44605, 44605, 13738, 13738,…
## $ Purchase.Date <chr> "2023-05-03 21:30:02", "2021-05-16 13:57:44", "2…
## $ Product.Category <chr> "Home", "Electronics", "Books", "Electronics", "…
## $ Product.Price <int> 177, 174, 413, 396, 259, 191, 205, 370, 12, 40, …
## $ Quantity <int> 1, 3, 1, 3, 4, 3, 1, 5, 2, 4, 3, 1, 2, 4, 1, 2, …
## $ Total.Purchase.Amount <int> 2427, 2448, 2345, 937, 2598, 3722, 2773, 1486, 2…
## $ Payment.Method <chr> "PayPal", "PayPal", "Credit Card", "Cash", "PayP…
## $ Customer.Age <int> 31, 31, 31, 31, 31, 27, 27, 27, 27, 27, 27, 27, …
## $ Returns <dbl> 1, 1, 1, 0, 1, 1, NA, 1, NA, 0, NA, 1, 0, 0, 0, …
## $ Customer.Name <chr> "John Rivera", "John Rivera", "John Rivera", "Jo…
## $ Age <int> 31, 31, 31, 31, 31, 27, 27, 27, 27, 27, 27, 27, …
## $ Gender <chr> "Female", "Female", "Female", "Female", "Female"…
## $ Churn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
# Descriptive statistics
summary(df)
## Customer.ID Purchase.Date Product.Category Product.Price
## Min. : 1 Length:250000 Length:250000 Min. : 10.0
## 1st Qu.:12590 Class :character Class :character 1st Qu.:132.0
## Median :25011 Mode :character Mode :character Median :255.0
## Mean :25018 Mean :254.7
## 3rd Qu.:37441 3rd Qu.:377.0
## Max. :50000 Max. :500.0
##
## Quantity Total.Purchase.Amount Payment.Method Customer.Age
## Min. :1.000 Min. : 100 Length:250000 Min. :18.0
## 1st Qu.:2.000 1st Qu.:1476 Class :character 1st Qu.:30.0
## Median :3.000 Median :2725 Mode :character Median :44.0
## Mean :3.005 Mean :2725 Mean :43.8
## 3rd Qu.:4.000 3rd Qu.:3975 3rd Qu.:57.0
## Max. :5.000 Max. :5350 Max. :70.0
##
## Returns Customer.Name Age Gender
## Min. :0.0 Length:250000 Min. :18.0 Length:250000
## 1st Qu.:0.0 Class :character 1st Qu.:30.0 Class :character
## Median :1.0 Mode :character Median :44.0 Mode :character
## Mean :0.5 Mean :43.8
## 3rd Qu.:1.0 3rd Qu.:57.0
## Max. :1.0 Max. :70.0
## NA's :47382
## Churn
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2005
## 3rd Qu.:0.0000
## Max. :1.0000
##
# Dimensions: Rows and Columns
dim(df)
## [1] 250000 13
# Column Names
colnames(df)
## [1] "Customer.ID" "Purchase.Date" "Product.Category"
## [4] "Product.Price" "Quantity" "Total.Purchase.Amount"
## [7] "Payment.Method" "Customer.Age" "Returns"
## [10] "Customer.Name" "Age" "Gender"
## [13] "Churn"
# Check for first few columns
head(df)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 2023-05-03 21:30:02 Home 177 1
## 2 44605 2021-05-16 13:57:44 Electronics 174 3
## 3 44605 2020-07-13 06:16:57 Books 413 1
## 4 44605 2023-01-17 13:14:36 Electronics 396 3
## 5 44605 2021-05-01 11:29:27 Books 259 4
## 6 13738 2022-08-25 06:48:33 Home 191 3
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 2427 PayPal 31 1 John Rivera 31
## 2 2448 PayPal 31 1 John Rivera 31
## 3 2345 Credit Card 31 1 John Rivera 31
## 4 937 Cash 31 0 John Rivera 31
## 5 2598 PayPal 31 1 John Rivera 31
## 6 3722 Credit Card 27 1 Lauren Johnson 27
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Female 0
## 4 Female 0
## 5 Female 0
## 6 Female 0
Removing duplicates
Handling missing values
Fixing column names or formats
Converting data types (e.g., character to date)
Drop similar column (e.g., Customer_Age)
Encoding categorical variables into numeric formats
Feature scaling (normalization/standardization)
Creating new features
Binning or discretizing variables
# Check how many duplicate rows exist
sum(duplicated(df))
## [1] 0
# Check for missing values
colSums(is.na(df))
## Customer.ID Purchase.Date Product.Category
## 0 0 0
## Product.Price Quantity Total.Purchase.Amount
## 0 0 0
## Payment.Method Customer.Age Returns
## 0 0 47382
## Customer.Name Age Gender
## 0 0 0
## Churn
## 0
# Check rows with missing values in 'Returns'
missing_rows <- df[is.na(df$Returns), ]
# View the first few rows with missing values
head(missing_rows)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 7 13738 2023-07-25 05:17:24 Electronics 205 1
## 9 13738 2021-12-21 03:29:05 Home 12 2
## 11 33969 2023-02-28 19:58:23 Clothing 410 3
## 22 42650 2022-01-26 12:50:30 Electronics 105 2
## 27 42650 2020-05-24 06:45:48 Electronics 307 3
## 40 19676 2021-03-23 17:36:30 Electronics 218 4
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 7 2773 Credit Card 27 NA Lauren Johnson 27
## 9 2175 Cash 27 NA Lauren Johnson 27
## 11 5018 Credit Card 27 NA Carol Allen 27
## 22 3721 Credit Card 20 NA Curtis Smith 20
## 27 973 PayPal 20 NA Curtis Smith 20
## 40 2866 PayPal 57 NA Linda Lee 57
## Gender Churn
## 7 Female 0
## 9 Female 0
## 11 Male 0
## 22 Female 0
## 27 Female 0
## 40 Male 0
# Replace all dots with underscores in column names
colnames(df) <- gsub("\\.", "_", colnames(df))
# Check the column name after change
colnames(df)
## [1] "Customer_ID" "Purchase_Date" "Product_Category"
## [4] "Product_Price" "Quantity" "Total_Purchase_Amount"
## [7] "Payment_Method" "Customer_Age" "Returns"
## [10] "Customer_Name" "Age" "Gender"
## [13] "Churn"
# Convert data types
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns <- as.integer(df$Returns)
# Create new features
# Extract components
df$Purchase_Day_Name <- weekdays(df$Purchase_Date)
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Month_Name <- month(df$Purchase_Date, label = TRUE, abbr = FALSE)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
head(df)
## Customer_ID Purchase_Date Product_Category Product_Price Quantity
## 1 44605 2023-05-03 Home 177 1
## 2 44605 2021-05-16 Electronics 174 3
## 3 44605 2020-07-13 Books 413 1
## 4 44605 2023-01-17 Electronics 396 3
## 5 44605 2021-05-01 Books 259 4
## 6 13738 2022-08-25 Home 191 3
## Total_Purchase_Amount Payment_Method Customer_Age Returns Customer_Name Age
## 1 2427 PayPal 31 1 John Rivera 31
## 2 2448 PayPal 31 1 John Rivera 31
## 3 2345 Credit Card 31 1 John Rivera 31
## 4 937 Cash 31 0 John Rivera 31
## 5 2598 PayPal 31 1 John Rivera 31
## 6 3722 Credit Card 27 1 Lauren Johnson 27
## Gender Churn Purchase_Day_Name Purchase_Month Purchase_Month_Name
## 1 Female 0 Wednesday 5 May
## 2 Female 0 Sunday 5 May
## 3 Female 0 Monday 7 July
## 4 Female 0 Tuesday 1 January
## 5 Female 0 Saturday 5 May
## 6 Female 0 Thursday 8 August
## Purchase_Year
## 1 2023
## 2 2021
## 3 2020
## 4 2023
## 5 2021
## 6 2022
# Check for total number of numerical features and categorical features in the dataset
num_features <- sum(sapply(df, is.numeric))
cat_features <- sum(sapply(df, function(x) is.factor(x) || is.character(x)))
cat("Numerical features:", num_features, "\n")
## Numerical features: 9
cat("Categorical features:", cat_features, "\n")
## Categorical features: 7
# Show the names of the numerical and categorical features
numerical_features <- names(df)[sapply(df, is.numeric)]
categorical_features <- names(df)[sapply(df, function(x) is.factor(x) || is.character(x))]
cat("Numerical Features:\n")
## Numerical Features:
print(numerical_features)
## [1] "Customer_ID" "Product_Price" "Quantity"
## [4] "Total_Purchase_Amount" "Customer_Age" "Returns"
## [7] "Age" "Churn" "Purchase_Month"
cat("\nCategorical Features:\n")
##
## Categorical Features:
print(categorical_features)
## [1] "Product_Category" "Payment_Method" "Customer_Name"
## [4] "Gender" "Purchase_Day_Name" "Purchase_Month_Name"
## [7] "Purchase_Year"
# Count the total number of churn and return
# Total Churned
sum(df$Churn == 1)
## [1] 50130
# Total Returned
sum(df$Returns == 1)
## [1] NA
Notice that for the sum of returned, the result showed , which means there is NA in the dataset for ‘Returns’
# Count the total number of return (only with value, set na.rm to TRUE)
total_returns <- sum(df$Returns == 1, na.rm = TRUE)
print(total_returns)
## [1] 101476
# Assuming the NA (missing value) is no return, we fill up with '0'
df$Returns[is.na(df$Returns)] <- 0
total_returns <- sum(df$Returns == 1)
print(total_returns)
## [1] 101476
# Check whether there is still NA in the dataset
sum(is.na(df$returns))
## [1] 0
# Count the total number of churn and return
# Total Churned
sum(df$Churn == 1)
## [1] 50130
# Total Returned
sum(df$Returns == 1)
## [1] 101476
There is no NA value in the dataset anymore, which means we successfully fill up missing value with ‘0’ There is a total of 101476 of returns.
Total Revenue and Average Revenue
# Calculate for total revenue and average revenue
# Total Revenue
total_revenue <- sum(df$Total_Purchase_Amount, na.rm = TRUE)
print(total_revenue)
## [1] 681346299
# Average Revenue
average_revenue <- mean(df$Total_Purchase_Amount, na.rm = TRUE)
print(average_revenue)
## [1] 2725.385
Top 5 Spending Customers
# Show the top 5 spending customers
top_customers <- df %>%
group_by(Customer_ID, Customer_Name) %>%
summarise(Total_Spending = sum(Total_Purchase_Amount, na.rm = TRUE), .groups = "drop") %>%
arrange(desc(Total_Spending)) %>%
slice_head(n = 5)
top_customers
## # A tibble: 5 × 3
## Customer_ID Customer_Name Total_Spending
## <int> <chr> <int>
## 1 39895 Reginald Gonzales 50659
## 2 39717 Joseph Kaiser 50496
## 3 48382 Katelyn Clark 50179
## 4 6633 Andre Spencer 48499
## 5 49743 Bryan Gonzalez 47015
# Plot bar graph
ggplot(top_customers, aes(x = reorder(Customer_Name, -Total_Spending), y = Total_Spending, fill = Customer_Name)) +
geom_bar(stat = "identity") +
labs(title = "Top 5 Highest Spending Customers",
x = "Customer Name",
y = "Total Spending") +
theme_minimal() +
theme(legend.position = "none") +
geom_text(aes(label = round(Total_Spending, 2)), vjust = -0.3)
Busiest Sales Day of the Week
# Determine the Busiest Sales Day of the week
sales_by_day <- df %>%
group_by(Purchase_Day_Name) %>%
summarise(Total_Sales = sum(Total_Purchase_Amount, na.rm = TRUE)) %>%
arrange(desc(Total_Sales))
# Show result
print(sales_by_day)
## # A tibble: 7 × 2
## Purchase_Day_Name Total_Sales
## <chr> <int>
## 1 Wednesday 98336024
## 2 Thursday 97990590
## 3 Saturday 97466717
## 4 Friday 97178917
## 5 Tuesday 96894118
## 6 Monday 96842007
## 7 Sunday 96637926
# Reorder factor based on Total_Sales (in descending order)
sales_by_day$Purchase_Day_Name <- factor(sales_by_day$Purchase_Day_Name, levels = sales_by_day$Purchase_Day_Name)
# Plot bar graph
ggplot(sales_by_day, aes(x = Purchase_Day_Name, y = Total_Sales, fill = Purchase_Day_Name)) +
geom_bar(stat = "identity") +
labs(title = "Total Sales by Day of the Week", x = "Day", y = "Total Sales") +
theme_minimal() +
theme(legend.position = "none")
Product Category
# List out all the product category
df %>%
distinct(Product_Category)
## Product_Category
## 1 Home
## 2 Electronics
## 3 Books
## 4 Clothing
# Count products per category
category_counts <- df %>%
group_by(Product_Category) %>%
summarise(Count = n()) %>%
mutate(Percentage = Count / sum(Count) * 100,
Label = paste0(Product_Category, " (", round(Percentage, 1), "%)"))
# Plot pie chart
ggplot(category_counts, aes(x = "", y = Count, fill = Product_Category)) +
geom_bar(stat = "identity", width = 1) +
coord_polar("y", start = 0) +
geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 4) +
labs(title = "Product Category Distribution") +
theme_void()
Payment Type
# List out all the payment type
df %>%
distinct(Payment_Method)
## Payment_Method
## 1 PayPal
## 2 Credit Card
## 3 Cash
# Count number of transactions per payment method
payment_counts <- df %>%
group_by(Payment_Method) %>%
summarise(Count = n()) %>%
arrange(desc(Count))
# Plot bar graph
ggplot(payment_counts, aes(x = reorder(Payment_Method, -Count), y = Count, fill = Payment_Method)) +
geom_bar(stat = "identity") +
labs(title = "Transaction Count by Payment Method",
x = "Payment Method",
y = "Number of Transactions") +
theme_minimal() +
theme(legend.position = "none") +
geom_text(aes(label = Count), vjust = -0.3)
Univariate Analysis (Numerical Variables) Univariate Analysis
Numerical Variables (Customer_Age, Product_Price, Total_Purchase_Amount) • Understand distribution, outliers, and spread
hist(): Used to visualize the distribution of a numerical feature.
boxplot(): Used to detect outliers and visualize spread.
summary(): Provides a quick overview of key statistics like mean, median, and quartiles.
# Customer Age Histogram
hist(df$Age,
main = "Histogram of Customer Age",
xlab = "Customer Age",
col = "lightblue",
border = "black",
breaks = 15) # Adjust number of bins with breaks
# Product Price Histogram
hist(df$Product_Price,
main = "Histogram of Product Price",
xlab = "Product Price",
col = "lightgreen",
border = "black",
breaks = 20)
# Total Purchase Amount Histogram
hist(df$Total_Purchase_Amount,
main = "Histogram of Total Purchase Amount",
xlab = "Total Purchase Amount",
col = "lightcoral",
border = "black",
breaks = 25)
# Customer Age Boxplot
boxplot(df$Age,
main = "Boxplot of Customer Age",
ylab = "Customer Age",
col = "lightblue",
border = "blue")
# Product Price Boxplot
boxplot(df$Product_Price,
main = "Boxplot of Product Price",
ylab = "Product Price",
col = "lightgreen",
border = "green")
# Total Purchase Amount Boxplot
boxplot(df$Total_Purchase_Amount,
main = "Boxplot of Total Purchase Amount",
ylab = "Total Purchase Amount",
col = "lightcoral",
border = "red")
# Summary for Customer Age
summary(df$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 30.0 44.0 43.8 57.0 70.0
# Summary for Product Price
summary(df$Product_Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.0 132.0 255.0 254.7 377.0 500.0
# Summary for Total Purchase Amount
summary(df$Total_Purchase_Amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 100 1476 2725 2725 3975 5350
Univariate Analysis (Categorical Variables) Categorical Variables (Gender, Product_Category, Payment_Method, Returns, Churn) • See frequency of categories
table(): Generates a frequency table for categorical data, showing how many times each category appears.
barplot(): Visualizes categorical data using bar charts.
prop.table(): Converts the frequency table to proportions (useful for relative frequency analysis).
# Frequency Table for Gender
table(df$Gender)
##
## Female Male
## 124324 125676
# Frequency Table for Product Category
table(df$Product_Category)
##
## Books Clothing Electronics Home
## 62247 62581 62630 62542
# Frequency Table for Payment Method
table(df$Payment_Method)
##
## Cash Credit Card PayPal
## 83012 83547 83441
# Frequency Table for Returns
table(df$Returns)
##
## 0 1
## 148524 101476
# Frequency Table for Churn
table(df$Churn)
##
## 0 1
## 199870 50130
# Bar Plot for Gender
gender_counts <- table(df$Gender)
barplot(gender_counts,
main = "Gender Distribution",
col = c("lightblue", "lightgreen"),
xlab = "Gender",
ylab = "Count",
names.arg = names(gender_counts))
# Bar Plot for Product Category
product_category_counts <- table(df$Product_Category)
barplot(product_category_counts,
main = "Product Category Distribution",
col = rainbow(length(product_category_counts)),
xlab = "Product Category",
ylab = "Count",
las = 2) # Rotate axis labels for better readability
# Bar Plot for Payment Method
payment_method_counts <- table(df$Payment_Method)
barplot(payment_method_counts,
main = "Payment Method Distribution",
col = c("lightblue", "lightcoral", "lightgreen"),
xlab = "Payment Method",
ylab = "Count",
names.arg = names(payment_method_counts))
# Bar Plot for Returns
returns_counts <- table(df$Returns)
barplot(returns_counts,
main = "Returns Distribution",
col = c("lightblue", "lightpink"),
xlab = "Returns",
ylab = "Count",
names.arg = names(returns_counts))
# Bar Plot for Churn
churn_counts <- table(df$Churn)
barplot(churn_counts,
main = "Churn Distribution",
col = c("lightgreen", "lightcoral"),
xlab = "Churn",
ylab = "Count",
names.arg = names(churn_counts))
# Bar Plot with Proportions for Gender
gender_proportions <- prop.table(table(df$Gender))
barplot(gender_proportions,
main = "Gender Distribution (Proportions)",
col = c("lightblue", "lightgreen"),
xlab = "Gender",
ylab = "Proportion",
names.arg = names(gender_proportions))
# Bar Plot with Proportions for Product Category
product_category_proportions <- prop.table(table(df$Product_Category))
barplot(product_category_proportions,
main = "Product Category Distribution (Proportions)",
col = c("lightblue", "lightgreen"),
xlab = "Product Category",
ylab = "Proportion",
names.arg = names(product_category_proportions))
# Bar Plot with Proportions for Payment Method
payment_method_proportions <- prop.table(table(df$Payment_Method))
barplot(payment_method_proportions,
main = "Payment Method Distribution (Proportions)",
col = c("lightblue", "lightgreen"),
xlab = "Payment Method",
ylab = "Proportion",
names.arg = names(payment_method_proportions))
# Bar Plot with Proportions for Returns
returns_proportions <- prop.table(table(df$Returns))
barplot(returns_proportions,
main = "Payment Method Distribution (Proportions)",
col = c("lightblue", "lightgreen"),
xlab = "Returns",
ylab = "Proportion",
names.arg = names(returns_proportions))
# Bar Plot with Proportions for Churn
churn_proportions <- prop.table(table(df$Churn))
barplot(churn_proportions,
main = "Churn Distribution (Proportions)",
col = c("lightgreen", "lightcoral"),
xlab = "Churn",
ylab = "Proportion",
names.arg = names(churn_proportions))
Bivariate Analysis (Churn VS Features)
# Boxplot to compare Total Purchase Amount by Churn status
boxplot(Total_Purchase_Amount ~ Churn,
data = df,
main = "Comparison of Total Purchase Amount by Churn Status",
xlab = "Churn",
ylab = "Total Purchase Amount",
col = c("lightblue", "lightcoral"),
names = c("Non-Churned", "Churned"))
# Boxplot for Customer Age vs. Churn status
boxplot(Age ~ Churn,
data = df,
main = "Comparison of Customer Age by Churn Status",
xlab = "Churn",
ylab = "Customer Age",
col = c("lightblue", "lightcoral"),
names = c("Non-Churned", "Churned"))
# Boxplot for Product Price vs. Churn status
boxplot(Product_Price ~ Churn,
data = df,
main = "Comparison of Product Price by Churn Status",
xlab = "Churn",
ylab = "Product Price",
col = c("lightblue", "lightcoral"),
names = c("Non-Churned", "Churned"))
library(RColorBrewer)
# Create a bar plot for Gender vs Churn
gender_churn_table <- table(df$Gender, df$Churn)
barplot(gender_churn_table,
beside = TRUE,
col = c("lightblue", "lightcoral"),
legend = rownames(gender_churn_table),
main = "Gender vs Churn",
xlab = "Churn Status",
ylab = "Count",
names.arg = c("Non-Churned", "Churned"))
# Create a bar plot for Payment Method vs Churn
payment_method_churn_table <- table(df$Payment_Method, df$Churn)
barplot(payment_method_churn_table,
beside = TRUE,
col = c("lightblue", "lightcoral", "lightyellow"),
legend = rownames(payment_method_churn_table),
main = "Payment Method vs Churn",
xlab = "Payment Method Status",
ylab = "Count",
names.arg = c("Non-Churned", "Churned"))
# Create a bar plot for Product Category vs Churn
product_category_churn_table <- table(df$Product_Category, df$Churn)
barplot(product_category_churn_table,
beside = TRUE,
col = c("lightblue", "lightcoral", "lightyellow", "lightgreen"),
legend = rownames(product_category_churn_table),
main = "Product Category vs Churn",
xlab = "Churn Status",
ylab = "Count",
names.arg = c("Non-Churned", "Churned"))
# Create a bar plot for Purchase Month Name vs Churn
month_colors <- brewer.pal(12, "Set3")
purchase_month_name_churn_table <- table(df$Purchase_Month_Name, df$Churn)
barplot(purchase_month_name_churn_table,
beside = TRUE,
col = month_colors,
legend = rownames(purchase_month_name_churn_table),
main = "Purchase Month Name vs Churn",
xlab = "Churn Status",
ylab = "Count",
names.arg = c("Non-Churned", "Churned"))
Bivariate Analysis (Return VS Features)
# Boxplot to compare Total Purchase Amount by Return status
boxplot(Total_Purchase_Amount ~ Returns,
data = df,
main = "Comparison of Total Purchase Amount by Return Status",
xlab = "Return",
ylab = "Total Purchase Amount",
col = c("lightblue", "lightcoral"),
names = c("Non-Returned", "Returned"))
# Boxplot for Customer Age vs. Return status
boxplot(Age ~ Returns,
data = df,
main = "Comparison of Customer Age by Return Status",
xlab = "Return",
ylab = "Customer Age",
col = c("lightblue", "lightcoral"),
names = c("Non-Returned", "Returned"))
# Boxplot for Product Price vs. Return status
boxplot(Product_Price ~ Returns,
data = df,
main = "Comparison of Product Price by Return Status",
xlab = "Return",
ylab = "Product Price",
col = c("lightblue", "lightcoral"),
nnames = c("Non-Returned", "Returned"))
# Contingency table for Gender vs Returns
table(df$Gender, df$Returns)
##
## 0 1
## Female 74040 50284
## Male 74484 51192
# Contingency table for Payment Method vs Returns
table(df$Payment_Method, df$Returns)
##
## 0 1
## Cash 49326 33686
## Credit Card 49689 33858
## PayPal 49509 33932
# Contingency table for Product Category vs Returns
table(df$Product_Category, df$Returns)
##
## 0 1
## Books 36841 25406
## Clothing 37279 25302
## Electronics 37182 25448
## Home 37222 25320
# Contingency table for Purchase Month Name vs Returns
table(df$Purchase_Month_Name, df$Returns)
##
## 0 1
## January 13509 9373
## February 12373 8508
## March 13470 9368
## April 13220 8918
## May 13689 9305
## June 13249 8883
## July 13655 9378
## August 13776 9384
## September 11388 7579
## October 10078 6973
## November 9901 6789
## December 10216 7018
# Create a bar plot for Gender vs Returns
gender_returns_table <- table(df$Gender, df$Returns)
barplot(gender_returns_table,
beside = TRUE,
col = c("lightblue", "lightcoral"),
legend = rownames(gender_returns_table),
main = "Gender vs Returns",
xlab = "Returns Status",
ylab = "Count",
names.arg = c("Non-Returned", "Returned"))
# Create a bar plot for Payment Method vs Returns
payment_method_returns_table <- table(df$Payment_Method, df$Returns)
barplot(payment_method_returns_table,
beside = TRUE,
col = c("lightblue", "lightcoral", "lightyellow"),
legend = rownames(payment_method_returns_table),
main = "Payment Method vs Returns",
xlab = "Returns Status",
ylab = "Count",
names.arg = c("Non-Returned", "Returned"))
# Create a bar plot for Product Category vs Returns
product_category_returns_table <- table(df$Product_Category, df$Returns)
barplot(product_category_returns_table,
beside = TRUE,
col = c("lightblue", "lightcoral", "lightyellow", "lightgreen"),
legend = rownames(product_category_returns_table),
main = "Product Category vs Returns",
xlab = "Returns Status",
ylab = "Count",
names.arg = c("Non-Returned", "Returned"))
# Create a bar plot for Purchase Month Name vs Returns
month_colors <- brewer.pal(12, "Set3")
purchase_month_name_returns_table <- table(df$Purchase_Month_Name, df$Returns)
barplot(purchase_month_name_returns_table,
beside = TRUE,
col = month_colors,
legend = rownames(purchase_month_name_returns_table),
main = "Purchase Month Name vs Returns",
xlab = "Returns Status",
ylab = "Count",
names.arg = c("Non-Returned", "Returned"))
Return Rate Return Rate (by product category, gender, purchase month name, payment method)
# Return rate by product category
return_rate <- df %>%
group_by(Product_Category) %>%
summarise(Returns = sum(Returns == "1", na.rm = TRUE),
Total_Orders = n(),
Return_Rate = Returns / Total_Orders)
print(return_rate)
## # A tibble: 4 × 4
## Product_Category Returns Total_Orders Return_Rate
## <chr> <int> <int> <dbl>
## 1 Books 25406 62247 0.408
## 2 Clothing 25302 62581 0.404
## 3 Electronics 25448 62630 0.406
## 4 Home 25320 62542 0.405
# Return rate by gender
return_rate_gender <- df %>%
group_by(Gender) %>%
summarise(Returns = sum(Returns == "1", na.rm = TRUE),
Total_Orders = n(),
Return_Rate = Returns / Total_Orders)
print(return_rate_gender)
## # A tibble: 2 × 4
## Gender Returns Total_Orders Return_Rate
## <chr> <int> <int> <dbl>
## 1 Female 50284 124324 0.404
## 2 Male 51192 125676 0.407
ggplot(return_rate_gender, aes(x = Gender, y = Return_Rate, fill = Gender)) +
geom_bar(stat = "identity") +
labs(title = "Return Rate by Gender", x = "Gender", y = "Return Rate") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format())
# Return rate by purchase month name
return_rate_month <- df %>%
group_by(Purchase_Month_Name) %>%
summarise(Returns = sum(Returns == "1", na.rm = TRUE),
Total_Orders = n(),
Return_Rate = Returns / Total_Orders)
print(return_rate_month)
## # A tibble: 12 × 4
## Purchase_Month_Name Returns Total_Orders Return_Rate
## <ord> <int> <int> <dbl>
## 1 January 9373 22882 0.410
## 2 February 8508 20881 0.407
## 3 March 9368 22838 0.410
## 4 April 8918 22138 0.403
## 5 May 9305 22994 0.405
## 6 June 8883 22132 0.401
## 7 July 9378 23033 0.407
## 8 August 9384 23160 0.405
## 9 September 7579 18967 0.400
## 10 October 6973 17051 0.409
## 11 November 6789 16690 0.407
## 12 December 7018 17234 0.407
return_rate_month$Purchase_Month_Name <- factor(return_rate_month$Purchase_Month_Name, levels = month.name)
ggplot(return_rate_month, aes(x = Purchase_Month_Name, y = Return_Rate, fill = Purchase_Month_Name)) +
geom_bar(stat = "identity") +
labs(title = "Return Rate by Month", x = "Month", y = "Return Rate") +
theme_minimal() +
scale_fill_manual(values = brewer.pal(12, "Set3")) +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Return rate by payment method
return_rate_payment <- df %>%
group_by(Payment_Method) %>%
summarise(Returns = sum(Returns == "1", na.rm = TRUE),
Total_Orders = n(),
Return_Rate = Returns / Total_Orders)
print(return_rate_payment)
## # A tibble: 3 × 4
## Payment_Method Returns Total_Orders Return_Rate
## <chr> <int> <int> <dbl>
## 1 Cash 33686 83012 0.406
## 2 Credit Card 33858 83547 0.405
## 3 PayPal 33932 83441 0.407
ggplot(return_rate_payment, aes(x = Payment_Method, y = Return_Rate, fill = Payment_Method)) +
geom_bar(stat = "identity") +
labs(title = "Return Rate by Payment Method", x = "Payment Method", y = "Return Rate") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Churn Rate Churn Rate (by product category, gender, purchase month name, payment method)
# Churn rate by product category
churn_rate_category <- df %>%
group_by(Product_Category) %>%
summarise(Churned = sum(Churn == "1", na.rm = TRUE),
Total_Customers = n(),
Churn_Rate = Churned / Total_Customers)
print(churn_rate_category)
## # A tibble: 4 × 4
## Product_Category Churned Total_Customers Churn_Rate
## <chr> <int> <int> <dbl>
## 1 Books 12499 62247 0.201
## 2 Clothing 12593 62581 0.201
## 3 Electronics 12553 62630 0.200
## 4 Home 12485 62542 0.200
ggplot(churn_rate_category, aes(x = reorder(Product_Category, -Churn_Rate), y = Churn_Rate, fill = Product_Category)) +
geom_bar(stat = "identity") +
labs(title = "Churn Rate by Product Category", x = "Product Category", y = "Churn Rate") +
theme_minimal() +
scale_fill_manual(values = brewer.pal(n = min(12, length(unique(churn_rate_category$Product_Category))), name = "Set3")) +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")
# Churn rate by gender
churn_rate_gender <- df %>%
group_by(Gender) %>%
summarise(Churned = sum(Churn == "1", na.rm = TRUE),
Total_Customers = n(),
Churn_Rate = Churned / Total_Customers)
print(churn_rate_gender)
## # A tibble: 2 × 4
## Gender Churned Total_Customers Churn_Rate
## <chr> <int> <int> <dbl>
## 1 Female 25067 124324 0.202
## 2 Male 25063 125676 0.199
ggplot(churn_rate_gender, aes(x = Gender, y = Churn_Rate, fill = Gender)) +
geom_bar(stat = "identity") +
labs(title = "Churn Rate by Gender", x = "Gender", y = "Churn Rate") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format())
# Churn rate by purchase month name
churn_rate_month <- df %>%
group_by(Purchase_Month_Name) %>%
summarise(Churned = sum(Churn == "1", na.rm = TRUE),
Total_Customers = n(),
Churn_Rate = Churned / Total_Customers)
# Sort month names properly
churn_rate_month$Purchase_Month_Name <- factor(churn_rate_month$Purchase_Month_Name, levels = month.name)
print(churn_rate_month)
## # A tibble: 12 × 4
## Purchase_Month_Name Churned Total_Customers Churn_Rate
## <ord> <int> <int> <dbl>
## 1 January 4560 22882 0.199
## 2 February 4183 20881 0.200
## 3 March 4560 22838 0.200
## 4 April 4493 22138 0.203
## 5 May 4558 22994 0.198
## 6 June 4455 22132 0.201
## 7 July 4702 23033 0.204
## 8 August 4677 23160 0.202
## 9 September 3756 18967 0.198
## 10 October 3399 17051 0.199
## 11 November 3271 16690 0.196
## 12 December 3516 17234 0.204
ggplot(churn_rate_month, aes(x = Purchase_Month_Name, y = Churn_Rate, fill = Purchase_Month_Name)) +
geom_bar(stat = "identity") +
labs(title = "Churn Rate by Month", x = "Month", y = "Churn Rate") +
theme_minimal() +
scale_fill_manual(values = brewer.pal(12, "Set3")) +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Churn rate by payment method
churn_rate_payment <- df %>%
group_by(Payment_Method) %>%
summarise(Churned = sum(Churn == "1", na.rm = TRUE),
Total_Customers = n(),
Churn_Rate = Churned / Total_Customers)
print(churn_rate_payment)
## # A tibble: 3 × 4
## Payment_Method Churned Total_Customers Churn_Rate
## <chr> <int> <int> <dbl>
## 1 Cash 16812 83012 0.203
## 2 Credit Card 16427 83547 0.197
## 3 PayPal 16891 83441 0.202
ggplot(churn_rate_payment, aes(x = Payment_Method, y = Churn_Rate, fill = Payment_Method)) +
geom_bar(stat = "identity") +
labs(title = "Churn Rate by Payment Method", x = "Payment Method", y = "Churn Rate") +
theme_minimal() +
scale_y_continuous(labels = scales::percent_format()) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
# Required libraries
library(caret) # Classification and Regression Training
library(e1071) # Tools for SVM, Naive Bayes, and more
library(randomForest) # Random Forest Algorithm
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
library(xgboost) # Extreme Gradient Boosting
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(MLmetrics) # Evaluation Metrics for ML Models
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
library(lubridate)
library(dplyr)
# Load and Prepare Data
df <- read.csv("ecommerce_customer_data_large.csv", stringsAsFactors = FALSE)
colnames(df) <- gsub("\\.", "_", colnames(df))
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns <- as.integer(df$Returns)
df <- df %>% dplyr::select(-Customer_Age)
# Convert relevant columns to factors
df$Gender <- as.factor(df$Gender)
df$Payment_Method <- as.factor(df$Payment_Method)
df$Product_Category <- as.factor(df$Product_Category)
df$Churn <- as.factor(df$Churn) # Target
# Train-test split
set.seed(42)
trainIndex <- createDataPartition(df$Churn, p = 0.80, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
# Select relevant features
selected_vars <- c("Churn", "Product_Category", "Product_Price", "Quantity",
"Total_Purchase_Amount", "Payment_Method", "Age", "Gender")
train <- train[, selected_vars]
test <- test[, selected_vars]
Customer Churn (Classification)
Model to be use: 1. Logistic Regression 2. Random Forest 3. XGBoost
# Fit model
model_log <- glm(
Churn ~ ., data = train, family = "binomial"
)
# Predict on test data
pred_prob_log <- predict(model_log, newdata = test, type = "response")
pred_log <- factor(ifelse(pred_prob_log > 0.5, "1", "0"), levels = levels(test$Churn))
# Evaluation
conf_mat_log <- confusionMatrix(pred_log, test$Churn, positive = "1")
print(conf_mat_log)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39974 10026
## 1 0 0
##
## Accuracy : 0.7995
## 95% CI : (0.7959, 0.803)
## No Information Rate : 0.7995
## P-Value [Acc > NIR] : 0.5027
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.7995
## Prevalence : 0.2005
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
# Metrics
accuracy_log <- conf_mat_log$overall["Accuracy"]
precision_log <- conf_mat_log$byClass["Pos Pred Value"]
recall_log <- conf_mat_log$byClass["Sensitivity"]
f1_log <- 2 * ((precision_log * recall_log) / (precision_log + recall_log))
cat("\n--- Logistic Regression Metrics ---\n")
##
## --- Logistic Regression Metrics ---
cat("Accuracy:", accuracy_log, "\n")
## Accuracy: 0.79948
cat("Precision:", precision_log, "\n")
## Precision: NaN
cat("Recall:", recall_log, "\n")
## Recall: 0
cat("F1 Score:", f1_log, "\n")
## F1 Score: NaN
set.seed(42)
model_rf <- randomForest(Churn ~ ., data = train, ntree = 100)
pred_rf <- predict(model_rf, test)
# Evaluation
conf_mat_rf <- confusionMatrix(pred_rf, test$Churn, positive = "1")
print(conf_mat_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39971 10025
## 1 3 1
##
## Accuracy : 0.7994
## 95% CI : (0.7959, 0.8029)
## No Information Rate : 0.7995
## P-Value [Acc > NIR] : 0.5116
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 9.974e-05
## Specificity : 9.999e-01
## Pos Pred Value : 2.500e-01
## Neg Pred Value : 7.995e-01
## Prevalence : 2.005e-01
## Detection Rate : 2.000e-05
## Detection Prevalence : 8.000e-05
## Balanced Accuracy : 5.000e-01
##
## 'Positive' Class : 1
##
# Metrics
accuracy_rf <- conf_mat_rf$overall["Accuracy"]
precision_rf <- conf_mat_rf$byClass["Pos Pred Value"]
recall_rf <- conf_mat_rf$byClass["Sensitivity"]
f1_rf <- 2 * ((precision_rf * recall_rf) / (precision_rf + recall_rf))
cat("\n--- Random Forest Metrics ---\n")
##
## --- Random Forest Metrics ---
cat("Accuracy:", accuracy_rf, "\n")
## Accuracy: 0.79944
cat("Precision:", precision_rf, "\n")
## Precision: 0.25
cat("Recall:", recall_rf, "\n")
## Recall: 9.974067e-05
cat("F1 Score:", f1_rf, "\n")
## F1 Score: 0.0001994018
# Prepare design matrices
train_matrix <- model.matrix(Churn ~ . -1, data = train)
test_matrix <- model.matrix(Churn ~ . -1, data = test)
# Prepare binary labels
train_label <- ifelse(train$Churn == "1", 1, 0)
test_label <- ifelse(test$Churn == "1", 1, 0)
# DMatrix objects
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Set parameters and train
params <- list(
objective = "binary:logistic",
eval_metric = "logloss"
)
set.seed(42)
model_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
verbose = 0
)
# Predict
pred_prob_xgb <- predict(model_xgb, dtest)
pred_xgb <- factor(ifelse(pred_prob_xgb > 0.5, "1", "0"), levels = levels(test$Churn))
# Evaluation
conf_mat_xgb <- confusionMatrix(pred_xgb, test$Churn, positive = "1")
print(conf_mat_xgb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39960 10016
## 1 14 10
##
## Accuracy : 0.7994
## 95% CI : (0.7959, 0.8029)
## No Information Rate : 0.7995
## P-Value [Acc > NIR] : 0.5205
##
## Kappa : 0.001
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0009974
## Specificity : 0.9996498
## Pos Pred Value : 0.4166667
## Neg Pred Value : 0.7995838
## Prevalence : 0.2005200
## Detection Rate : 0.0002000
## Detection Prevalence : 0.0004800
## Balanced Accuracy : 0.5003236
##
## 'Positive' Class : 1
##
# Metrics
accuracy_xgb <- conf_mat_xgb$overall["Accuracy"]
precision_xgb <- conf_mat_xgb$byClass["Pos Pred Value"]
recall_xgb <- conf_mat_xgb$byClass["Sensitivity"]
f1_xgb <- 2 * ((precision_xgb * recall_xgb) / (precision_xgb + recall_xgb))
cat("\n--- XGBoost Metrics ---\n")
##
## --- XGBoost Metrics ---
cat("Accuracy:", accuracy_xgb, "\n")
## Accuracy: 0.7994
cat("Precision:", precision_xgb, "\n")
## Precision: 0.4166667
cat("Recall:", recall_xgb, "\n")
## Recall: 0.0009974067
cat("F1 Score:", f1_xgb, "\n")
## F1 Score: 0.00199005
# Extract and compute metrics after each model evaluation
# Logistic Regression
log_accuracy <- as.numeric(conf_mat_log$overall["Accuracy"])
log_precision <- as.numeric(conf_mat_log$byClass["Pos Pred Value"])
log_recall <- as.numeric(conf_mat_log$byClass["Sensitivity"])
log_f1 <- 2 * ((log_precision * log_recall) / (log_precision + log_recall))
# Random Forest
rf_accuracy <- as.numeric(conf_mat_rf$overall["Accuracy"])
rf_precision <- as.numeric(conf_mat_rf$byClass["Pos Pred Value"])
rf_recall <- as.numeric(conf_mat_rf$byClass["Sensitivity"])
rf_f1 <- 2 * ((rf_precision * rf_recall) / (rf_precision + rf_recall))
# XGBoost
xgb_accuracy <- as.numeric(conf_mat_xgb$overall["Accuracy"])
xgb_precision <- as.numeric(conf_mat_xgb$byClass["Pos Pred Value"])
xgb_recall <- as.numeric(conf_mat_xgb$byClass["Sensitivity"])
xgb_f1 <- 2 * ((xgb_precision * xgb_recall) / (xgb_precision + xgb_recall))
# Create summary table
summary_table <- data.frame(
Model = c("Logistic Regression", "Random Forest", "XGBoost"),
Accuracy = c(log_accuracy, rf_accuracy, xgb_accuracy),
Precision = c(log_precision, rf_precision, xgb_precision),
Recall = c(log_recall, rf_recall, xgb_recall),
F1_Score = c(log_f1, rf_f1, xgb_f1),
stringsAsFactors = FALSE
)
# Round numeric columns
summary_table[, 2:5] <- round(summary_table[, 2:5], 6)
# View summary
print(summary_table)
## Model Accuracy Precision Recall F1_Score
## 1 Logistic Regression 0.79948 NaN 0.000000 NaN
## 2 Random Forest 0.79944 0.250000 0.000100 0.000199
## 3 XGBoost 0.79940 0.416667 0.000997 0.001990
xgb_accuracy <- conf_mat_xgb$overall["Accuracy"]
xgb_precision <- conf_mat_xgb$byClass["Pos Pred Value"]
xgb_recall <- conf_mat_xgb$byClass["Sensitivity"]
xgb_f1 <- 2 * ((xgb_precision * xgb_recall) / (xgb_precision + xgb_recall))
# Read the file
df <- read.csv("ecommerce_customer_data_large.csv")
head(df)
## Customer.ID Purchase.Date Product.Category Product.Price Quantity
## 1 44605 2023-05-03 21:30:02 Home 177 1
## 2 44605 2021-05-16 13:57:44 Electronics 174 3
## 3 44605 2020-07-13 06:16:57 Books 413 1
## 4 44605 2023-01-17 13:14:36 Electronics 396 3
## 5 44605 2021-05-01 11:29:27 Books 259 4
## 6 13738 2022-08-25 06:48:33 Home 191 3
## Total.Purchase.Amount Payment.Method Customer.Age Returns Customer.Name Age
## 1 2427 PayPal 31 1 John Rivera 31
## 2 2448 PayPal 31 1 John Rivera 31
## 3 2345 Credit Card 31 1 John Rivera 31
## 4 937 Cash 31 0 John Rivera 31
## 5 2598 PayPal 31 1 John Rivera 31
## 6 3722 Credit Card 27 1 Lauren Johnson 27
## Gender Churn
## 1 Female 0
## 2 Female 0
## 3 Female 0
## 4 Female 0
## 5 Female 0
## 6 Female 0
# Clean column names
colnames(df) <- gsub("\\.", "_", colnames(df))
# Convert data types
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)
# Drop redundant column
df <- df %>% dplyr::select(-Customer_Age)
# Extract date-related features
df$Purchase_Day_Name <- weekdays(df$Purchase_Date)
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Month_Name <- month(df$Purchase_Date, label = TRUE, abbr = FALSE)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
# Drop non-numeric or unnecessary columns for modeling
df_numeric <- df %>% dplyr::select(-Purchase_Date)
# Convert all character/factor variables to numeric
df_numeric[] <- lapply(df_numeric, function(x) {
if (is.factor(x) || is.character(x)) {
return(as.numeric(as.factor(x)))
} else {
return(x)
}
})
#load SMOTE library
library(smotefamily)
# Apply SMOTE
set.seed(42)
smote_result <- SMOTE(
X = df_numeric[, -which(names(df_numeric) == "Churn")],
target = df_numeric$Churn,
K = 5
)
df_smote <- smote_result$data
df_smote$Churn <- factor(df_smote$class, levels = c(0, 1))
df_smote$class <- NULL
# Split data
set.seed(42)
trainIndex <- createDataPartition(df_smote$Churn, p = 0.8, list = FALSE)
train_smote <- df_smote[trainIndex, ]
test_smote <- df_smote[-trainIndex, ]
model_log <- glm(Churn ~ ., data = train_smote, family = "binomial")
pred_prob_log <- predict(model_log, newdata = test_smote, type = "response")
pred_log <- factor(ifelse(pred_prob_log > 0.5, "1", "0"), levels = c("0", "1"))
test_smote$Churn <- factor(test_smote$Churn, levels = c("0", "1"))
conf_log <- confusionMatrix(pred_log, test_smote$Churn, positive = "1")
# Metrics
f1_log <- 2 * (conf_log$byClass["Pos Pred Value"] * conf_log$byClass["Sensitivity"]) /
(conf_log$byClass["Pos Pred Value"] + conf_log$byClass["Sensitivity"])
model_rf <- randomForest(Churn ~ ., data = train_smote, ntree = 100)
pred_rf <- predict(model_rf, test_smote)
pred_rf <- factor(pred_rf, levels = c("0", "1"))
conf_rf <- confusionMatrix(pred_rf, test_smote$Churn, positive = "1")
f1_rf <- 2 * (conf_rf$byClass["Pos Pred Value"] * conf_rf$byClass["Sensitivity"]) /
(conf_rf$byClass["Pos Pred Value"] + conf_rf$byClass["Sensitivity"])
train_matrix <- model.matrix(Churn ~ . -1, data = train_smote)
test_matrix <- model.matrix(Churn ~ . -1, data = test_smote)
train_label <- as.numeric(as.character(train_smote$Churn))
test_label <- as.numeric(as.character(test_smote$Churn))
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
params <- list(
objective = "binary:logistic",
eval_metric = "logloss"
)
model_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
verbose = 0
)
pred_prob_xgb <- predict(model_xgb, dtest)
pred_xgb <- factor(ifelse(pred_prob_xgb > 0.5, "1", "0"), levels = c("0", "1"))
conf_xgb <- confusionMatrix(pred_xgb, test_smote$Churn, positive = "1")
f1_xgb <- 2 * (conf_xgb$byClass["Pos Pred Value"] * conf_xgb$byClass["Sensitivity"]) /
(conf_xgb$byClass["Pos Pred Value"] + conf_xgb$byClass["Sensitivity"])
cat("\n--- Logistic Regression Metrics ---\n")
##
## --- Logistic Regression Metrics ---
print(conf_log)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39974 30078
## 1 0 0
##
## Accuracy : 0.5706
## 95% CI : (0.567, 0.5743)
## No Information Rate : 0.5706
## P-Value [Acc > NIR] : 0.5016
##
## Kappa : 0
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.0000
## Specificity : 1.0000
## Pos Pred Value : NaN
## Neg Pred Value : 0.5706
## Prevalence : 0.4294
## Detection Rate : 0.0000
## Detection Prevalence : 0.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : 1
##
cat("F1 Score:", f1_log, "\n")
## F1 Score: NaN
cat("\n--- Random Forest Metrics ---\n")
##
## --- Random Forest Metrics ---
print(conf_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39971 10031
## 1 3 20047
##
## Accuracy : 0.8568
## 95% CI : (0.8541, 0.8594)
## No Information Rate : 0.5706
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6951
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6665
## Specificity : 0.9999
## Pos Pred Value : 0.9999
## Neg Pred Value : 0.7994
## Prevalence : 0.4294
## Detection Rate : 0.2862
## Detection Prevalence : 0.2862
## Balanced Accuracy : 0.8332
##
## 'Positive' Class : 1
##
cat("F1 Score:", f1_rf, "\n")
## F1 Score: 0.7998324
cat("\n--- XGBoost Metrics ---\n")
##
## --- XGBoost Metrics ---
print(conf_xgb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 39948 9666
## 1 26 20412
##
## Accuracy : 0.8616
## 95% CI : (0.8591, 0.8642)
## No Information Rate : 0.5706
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.706
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6786
## Specificity : 0.9993
## Pos Pred Value : 0.9987
## Neg Pred Value : 0.8052
## Prevalence : 0.4294
## Detection Rate : 0.2914
## Detection Prevalence : 0.2918
## Balanced Accuracy : 0.8390
##
## 'Positive' Class : 1
##
cat("F1 Score:", f1_xgb, "\n")
## F1 Score: 0.80814
Finding highest correlation numeric features and categorical features (annova hypo test)
library(corrplot)
## corrplot 0.95 loaded
# Correlation matrix for numeric variables (all features do not show any correlation other than age)
num_vars <- df %>% dplyr::select(where(is.numeric))
cor_matrix <- cor(num_vars, use = "complete.obs")
corrplot(cor_matrix, method = "color", tl.cex = 0.8)
# Scatter plots vs. Total Purchase Amount
ggplot(df, aes(x = Product_Price, y = Total_Purchase_Amount)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(df, aes(x = Quantity, y = Total_Purchase_Amount)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(df, aes(x = Age, y = Total_Purchase_Amount)) +
geom_point(alpha = 0.5) + geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# Correlations with the target variable
target_corr <- cor_matrix[, "Total_Purchase_Amount"]
# Sort in decreasing order
sorted_corr <- sort(abs(target_corr), decreasing = TRUE)
print(sorted_corr)
## Total_Purchase_Amount Age Purchase_Month
## 1.0000000000 0.0565518329 0.0016261821
## Customer_ID Product_Price Quantity
## 0.0013087744 0.0012972971 0.0012335733
## Returns Churn
## 0.0010873979 0.0007055865
# highest correlation - age (0.056), Purchase_month (0.0016)
# to test the significant imapct of categorical features
#(using annova) (paymentMethod show highest impact, lowest p value)
# Convert categorical variables to factors
df$Product_Category <- as.factor(df$Product_Category)
df$Payment_Method <- as.factor(df$Payment_Method)
df$Gender <- as.factor(df$Gender)
# Convert Purchase_Date to date format and extract useful features
df$Purchase_Date <- as.POSIXct(df$Purchase_Date)
df$Purchase_Year <- as.factor(format(df$Purchase_Date, "%Y"))
df$Purchase_Month <- as.factor(format(df$Purchase_Date, "%m"))
df$Purchase_Day <- as.factor(format(df$Purchase_Date, "%d"))
df$Purchase_Weekday <- as.factor(weekdays(df$Purchase_Date))
# For each categorical variable, perform ANOVA with Total_Purchase_Amount
anova_results <- aov(Total_Purchase_Amount ~ Product_Category, data = df)
summary(anova_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Product_Category 3 1.417e+07 4723972 2.27 0.0782 .
## Residuals 249996 5.202e+11 2080994
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_results <- aov(Total_Purchase_Amount ~ Payment_Method, data = df)
summary(anova_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Payment_Method 2 2.432e+07 12158700 5.843 0.0029 **
## Residuals 249997 5.202e+11 2080945
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova_results <- aov(Total_Purchase_Amount ~ Gender, data = df)
summary(anova_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 1.178e+06 1177981 0.566 0.452
## Residuals 249998 5.203e+11 2081029
# best pvalue - product_category and payment method (0.0782 and 0.0029 respectively)
Keep the most statistically relevant feautures
colnames(df)
## [1] "Customer_ID" "Purchase_Date" "Product_Category"
## [4] "Product_Price" "Quantity" "Total_Purchase_Amount"
## [7] "Payment_Method" "Returns" "Customer_Name"
## [10] "Age" "Gender" "Churn"
## [13] "Purchase_Day_Name" "Purchase_Month" "Purchase_Month_Name"
## [16] "Purchase_Year" "Purchase_Day" "Purchase_Weekday"
# Read the file
df_model <- df %>%
dplyr::select(Total_Purchase_Amount, Age, Product_Category, Payment_Method) %>%
na.omit()
df_model$Product_Category <- as.factor(df_model$Product_Category)
df_model$Payment_Method <- as.factor(df_model$Payment_Method)
# Create polynomial terms (Age^2) for polynomial regression
#df_model$Customer_Age2 <- df_model$Customer_Age^2
df_model$Age2 <- df_model$Age^2
# Split: 80% train / 20% test
split <- createDataPartition(df_model$Total_Purchase_Amount, p = 0.8, list = FALSE)
train_data <- df_model[split, ]
test_data <- df_model[-split, ]
# Normalize numeric variables for linear models
pre_proc <- preProcess(train_data[, c("Age", "Age2")], method = c("center", "scale"))
train_data_norm <- train_data
test_data_norm <- test_data
train_data_norm[, c("Age", "Age2")] <- predict(pre_proc, train_data[, c("Age", "Age2")])
test_data_norm[, c("Age", "Age2")] <- predict(pre_proc, test_data[, c("Age", "Age2")])
train_control <- trainControl(method = "cv", number = 5)
model_lm <- train(Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
data = train_data_norm, method = "lm", trControl = train_control)
#model_poly <- train(Total_Purchase_Amount ~ Age + Age2 + Product_Category + Payment_Method,
#data = train_data_norm, method = "lm", trControl = train_control)
tune_grid <- expand.grid(mtry = c(2, 3))
model_rf <- train(
Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 3),
tuneGrid = tune_grid,
ntree = 100
)
#model_rf <- train(Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
#data = train_data,
#method = "rf",
#trControl = train_control)
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("gbm")
## Installing package into 'C:/Users/hooih/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'gbm' successfully unpacked and MD5 sums checked
## Warning: cannot remove prior installation of package 'gbm'
## Warning in file.copy(savedcopy, lib, recursive = TRUE): problem copying
## C:\Users\hooih\AppData\Local\R\win-library\4.4\00LOCK\gbm\libs\x64\gbm.dll to
## C:\Users\hooih\AppData\Local\R\win-library\4.4\gbm\libs\x64\gbm.dll: Permission
## denied
## Warning: restored 'gbm'
##
## The downloaded binary packages are in
## C:\Users\hooih\AppData\Local\Temp\RtmpSS3zOG\downloaded_packages
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
model_gbm <- train(Total_Purchase_Amount ~ Age + Product_Category + Payment_Method,
data = train_data,
method = "gbm",
trControl = train_control,
verbose = FALSE)
pred_lm <- predict(model_lm, test_data_norm)
#pred_poly <- predict(model_poly, test_data_norm)
pred_rf <- predict(model_rf, test_data)
pred_gbm <- predict(model_gbm, test_data)
install.packages("Metrics")
## Installing package into 'C:/Users/hooih/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'Metrics' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\hooih\AppData\Local\Temp\RtmpSS3zOG\downloaded_packages
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(caret)
results <- data.frame(
Model = c("Linear Regression", "Random Forest", "Gradient Boosting"),
RMSE = c(
rmse(test_data$Total_Purchase_Amount, pred_lm),
rmse(test_data$Total_Purchase_Amount, pred_rf),
rmse(test_data$Total_Purchase_Amount, pred_gbm)
),
MAE = c(
mae(test_data$Total_Purchase_Amount, pred_lm),
mae(test_data$Total_Purchase_Amount, pred_rf),
mae(test_data$Total_Purchase_Amount, pred_gbm)
),
R2 = c(
R2(pred_lm, test_data$Total_Purchase_Amount),
R2(pred_rf, test_data$Total_Purchase_Amount),
R2(pred_gbm, test_data$Total_Purchase_Amount)
),
MAPE = c(
mape(test_data$Total_Purchase_Amount, pred_lm),
mape(test_data$Total_Purchase_Amount, pred_rf),
mape(test_data$Total_Purchase_Amount, pred_gbm)
)
)
# Print the results
print(results)
## Model RMSE MAE R2 MAPE
## 1 Linear Regression 1441.212 1249.446 0.003189443 1.026447
## 2 Random Forest 1441.329 1249.578 0.003184376 1.027608
## 3 Gradient Boosting 1441.171 1249.385 0.003246331 1.026308
# Using product category as the only independent variable and 'Returns' as the target
# Load libraries
library(ggplot2)
library(dplyr)
library(caret)
library(randomForest)
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(xgboost)
library(e1071)
# Load and Prepare Dataset
df <- read.csv("ecommerce_customer_data_large.csv")
# Change the colnames with underscore instead of dot
colnames(df) <- gsub("\\.", "_", colnames(df))
# Fill missing Returns with 0
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- factor(df$Returns, levels = c(0, 1)) # Ensure correct factor levels
# Convert to factor
df$Product_Category <- as.factor(df$Product_Category)
# Encode numeric for XGBoost
df$Product_Category_Num <- as.numeric(df$Product_Category)
# Train-test split (80-20)
set.seed(42)
train_index <- createDataPartition(df$Returns, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
# Initialize result data frame
df_metrics <- data.frame(
Model = character(),
Accuracy = numeric(),
Precision = numeric(),
Recall = numeric(),
F1_Score = numeric(),
stringsAsFactors = FALSE
)
log_model <- glm(Returns ~ Product_Category, data = train_data, family = "binomial")
log_probs <- predict(log_model, newdata = test_data, type = "response")
log_preds <- ifelse(log_probs > 0.5, 1, 0)
cm_log <- confusionMatrix(factor(log_preds), factor(test_data$Returns), positive = "1")
## Warning in confusionMatrix.default(factor(log_preds),
## factor(test_data$Returns), : Levels are not in the same order for reference and
## data. Refactoring data to match.
df_metrics <- rbind(df_metrics, data.frame(
Model = "Logistic Regression",
Accuracy = cm_log$overall["Accuracy"],
Precision = cm_log$byClass["Precision"],
Recall = cm_log$byClass["Recall"],
F1_Score = cm_log$byClass["F1"]
))
roc_log <- roc(test_data$Returns, log_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_log, main = "ROC Curve Comparison")
rf_model <- randomForest(Returns ~ Product_Category, data = train_data, ntree = 100)
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")[,2]
rf_preds <- ifelse(rf_probs > 0.5, 1, 0)
cm_rf <- confusionMatrix(factor(rf_preds), factor(test_data$Returns), positive = "1")
## Warning in confusionMatrix.default(factor(rf_preds), factor(test_data$Returns),
## : Levels are not in the same order for reference and data. Refactoring data to
## match.
df_metrics <- rbind(df_metrics, data.frame(
Model = "Random Forest",
Accuracy = cm_rf$overall["Accuracy"],
Precision = cm_rf$byClass["Precision"],
Recall = cm_rf$byClass["Recall"],
F1_Score = cm_rf$byClass["F1"]
))
roc_rf <- roc(test_data$Returns, rf_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_rf, col = "blue", main = "ROC Curve - Random Forest")
# Convert Returns to numeric for XGBoost (ensure 0 and 1)
train_label <- as.numeric(as.character(train_data$Returns))
test_label <- as.numeric(as.character(test_data$Returns))
# DMatrix for XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data$Product_Category_Num), label = train_label)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data$Product_Category_Num), label = test_label)
# Train XGBoost model
xgb_model <- xgboost(data = train_matrix, objective = "binary:logistic", nrounds = 50, verbose = 0)
# Predict and evaluate
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
cm_xgb <- confusionMatrix(factor(xgb_preds), factor(test_data$Returns), positive = "1")
## Warning in confusionMatrix.default(factor(xgb_preds),
## factor(test_data$Returns), : Levels are not in the same order for reference and
## data. Refactoring data to match.
# Add to df_metrics
df_metrics <- rbind(df_metrics, data.frame(
Model = "XGBoost",
Accuracy = cm_xgb$overall["Accuracy"],
Precision = cm_xgb$byClass["Precision"],
Recall = cm_xgb$byClass["Recall"],
F1_Score = cm_xgb$byClass["F1"]
))
roc_xgb <- roc(test_data$Returns, xgb_probs)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_xgb, col = "darkgreen", main = "ROC Curve - XGBoost")
print(df_metrics)
## Model Accuracy Precision Recall F1_Score
## Accuracy Logistic Regression 0.5940919 NA 0 NA
## Accuracy1 Random Forest 0.5940919 NA 0 NA
## Accuracy2 XGBoost 0.5940919 NA 0 NA
# Checking why the result are the same for 3 models
table(df$Product_Category, df$Returns)
##
## 0 1
## Books 36841 25406
## Clothing 37279 25302
## Electronics 37182 25448
## Home 37222 25320
Analysis showed that return rates across product categories are remarkably similar (~40%), indicating that Product_Category alone is not a strong predictor of return behavior. Recommend to incorporate additional customer, transactional, or product-level features to build a more predictive model.
Logistic Regression model have poor performance, recommend to use only tree-based models such as random forest and XGBoost.
# Using only 4 feature importance (Total_Purchase_Amount, Product_Price, Age, Purchase_Month)
# Load required libraries (4 important features without hyperparameter tuning)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)
# Data Preprocessing
# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")
# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))
# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)
# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)
# Select only required features
df_model <- df %>%
dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)
# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))
# Random Forest
#set.seed(123)
#rf_model <- randomForest(as.factor(Returns) ~ ., data = train_data)
#rf_preds <- predict(rf_model, test_data[, -1])
set.seed(123)
train_data_small <- train_data %>% dplyr::sample_frac(0.3) # use 30% of training data
rf_model <- randomForest(as.factor(Returns) ~ ., data = train_data_small)
# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data$Returns)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)
xgb_model <- xgboost(
data = train_matrix,
objective = "binary:logistic",
nrounds = 100,
max_depth = 6,
eta = 0.1,
verbose = 0
)
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
length(rf_preds)
## [1] 49999
length(xgb_preds)
## [1] 50000
length(true_labels)
## [1] 50000
evaluate_model <- function(preds, true_labels, model_name) {
preds <- factor(preds, levels = c(0, 1))
true_labels <- factor(true_labels, levels = c(0, 1)) # add this to avoid level mismatch
cm <- confusionMatrix(preds, true_labels, positive = "1")
data.frame(
Model = model_name,
Accuracy = round(cm$overall["Accuracy"], 4),
Precision = round(cm$byClass["Precision"], 4),
Recall = round(cm$byClass["Recall"], 4),
F1_Score = round(cm$byClass["F1"], 4)
)
}
# Trim true_labels and xgb_preds to match rf_preds
true_labels_trimmed <- true_labels[1:length(rf_preds)]
xgb_preds_trimmed <- xgb_preds[1:length(rf_preds)]
# Now evaluate
results <- bind_rows(
evaluate_model(rf_preds, true_labels_trimmed, "Random Forest"),
evaluate_model(xgb_preds_trimmed, true_labels_trimmed, "XGBoost")
)
# Print results
print(results)
## Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest 0.5948 NA 0.0000 NA
## Accuracy...2 XGBoost 0.5939 0.3544 0.0028 0.0055
# Feature Importance (random forest)
varImpPlot(rf_model)
# Feature Importance (XGBoost)
importance <- xgb.importance(model = xgb_model)
xgb.plot.importance(importance)
From both feature importance (from Random Forest and XGBoost), we can observed that Total_Purchase_Amount, Product_Price, Age, Purchase_Month are the top 4 feature importance to be used for predictive analysis.
# Using only 4 feature importance (Total_Purchase_Amount, Product_Price, Age, Purchase_Month)
# Load required libraries (4 important features without hyperparameter tuning)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)
# Data Preprocessing
# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")
# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))
# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)
# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)
# Select only required features
df_model <- df %>%
dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)
# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))
# Random Forest
#set.seed(123)
#rf_model <- randomForest(as.factor(Returns) ~ ., data = train_data)
#rf_preds <- predict(rf_model, test_data[, -1])
set.seed(123)
train_sample <- train_data %>% dplyr::sample_frac(0.3) # Use 30% of data
rf_model <- randomForest(as.factor(Returns) ~ ., data = train_sample, ntree = 100)
# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data[, -1]), label = train_data$Returns)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)
xgb_model <- xgboost(
data = train_matrix,
objective = "binary:logistic",
nrounds = 100,
max_depth = 6,
eta = 0.1,
verbose = 0
)
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
# Evaluation Function
evaluate_model <- function(preds, true_labels, model_name) {
preds <- factor(preds, levels = c(0, 1))
cm <- confusionMatrix(preds, true_labels, positive = "1")
data.frame(
Model = model_name,
Accuracy = round(cm$overall["Accuracy"], 4),
Precision = round(cm$byClass["Precision"], 4),
Recall = round(cm$byClass["Recall"], 4),
F1_Score = round(cm$byClass["F1"], 4)
)
}
# Ensure alignment
min_len <- min(length(rf_preds), length(xgb_preds), length(true_labels))
rf_preds_trimmed <- rf_preds[1:min_len]
xgb_preds_trimmed <- xgb_preds[1:min_len]
true_labels_trimmed <- true_labels[1:min_len]
# Evaluation function (define above this chunk)
evaluate_model <- function(preds, true_labels, model_name) {
preds <- factor(preds, levels = c(0, 1))
true_labels <- factor(true_labels, levels = c(0, 1))
cm <- confusionMatrix(preds, true_labels, positive = "1")
data.frame(
Model = model_name,
Accuracy = round(cm$overall["Accuracy"], 4),
Precision = round(cm$byClass["Precision"], 4),
Recall = round(cm$byClass["Recall"], 4),
F1_Score = round(cm$byClass["F1"], 4)
)
}
# Evaluate all models using trimmed inputs
results <- bind_rows(
evaluate_model(rf_preds_trimmed, true_labels_trimmed, "Random Forest"),
evaluate_model(xgb_preds_trimmed, true_labels_trimmed, "XGBoost")
)
# Output
print(results)
## Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest 0.5948 NA 0.0000 NA
## Accuracy...2 XGBoost 0.5939 0.3544 0.0028 0.0055
# Check class ratio
sum(df$Returns == 1) / sum(df$Returns == 0)
## [1] 0.6832296
The class ratio is 0.683. It means that for every 1 negative (non-return), there are approximately 0.68 positive cases (returns).
Returns (1) ≈ 40.6%
No Returns (0) ≈ 59.4%
Class weights are used to assign different importance (penalty) to each class when training a model. This helps the model pay more attention to the minority class during training.
In randomForest(), classwt = c(“0” = 1, “1” = 1.5) means the model will penalize misclassifying a return (1) more than misclassifying a non-return (0).
In xgboost, we use weight = ifelse(Returns == 1, 1.5, 1) to give higher weight to return instances.
# Load required libraries (using calss weight to improve the performance)
library(caret)
library(dplyr)
library(xgboost)
library(randomForest)
library(lubridate)
# Load your dataset
df <- read.csv("ecommerce_customer_data_large.csv")
# Replace dots in column names
colnames(df) <- gsub("\\.", "_", colnames(df))
# Fill missing Returns with 0 and convert to integer
df$Returns[is.na(df$Returns)] <- 0
df$Returns <- as.integer(df$Returns)
# Convert Purchase_Date and engineer features
df$Purchase_Date <- as.Date(df$Purchase_Date, format = "%Y-%m-%d")
df$Purchase_Month <- month(df$Purchase_Date)
df$Purchase_Year <- format(df$Purchase_Date, "%Y")
# Drop Customer_Age
df <- df %>% dplyr::select(-Customer_Age)
# Select only required features
df_model <- df %>%
dplyr::select(Returns, Total_Purchase_Amount, Product_Price, Age, Purchase_Month)
# Train-test split
set.seed(123)
train_index <- createDataPartition(df_model$Returns, p = 0.8, list = FALSE)
train_data <- df_model[train_index, ]
test_data <- df_model[-train_index, ]
# True labels for evaluation
true_labels <- factor(test_data$Returns, levels = c(0, 1))
# Random Forest
set.seed(123)
rf_model <- randomForest(
as.factor(Returns) ~ .,
data = train_data,
classwt = c("0" = 1, "1" = 1.5), # class weight to penalize minority class
ntree = 100
)
rf_preds <- predict(rf_model, test_data[, -1])
# XGBoost with class weights
train_matrix <- xgb.DMatrix(
data = as.matrix(train_data[, -1]),
label = train_data$Returns,
weight = ifelse(train_data$Returns == 1, 1.5, 1)
)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[, -1]), label = test_data$Returns)
xgb_model <- xgboost(
data = train_matrix,
objective = "binary:logistic",
nrounds = 100,
max_depth = 6,
eta = 0.1,
verbose = 0
)
xgb_probs <- predict(xgb_model, test_matrix)
xgb_preds <- ifelse(xgb_probs > 0.5, 1, 0)
# Evaluation Function
evaluate_model <- function(preds, true_labels, model_name) {
preds <- factor(preds, levels = c(0, 1))
cm <- confusionMatrix(preds, true_labels, positive = "1")
data.frame(
Model = model_name,
Accuracy = round(cm$overall["Accuracy"], 4),
Precision = round(cm$byClass["Precision"], 4),
Recall = round(cm$byClass["Recall"], 4),
F1_Score = round(cm$byClass["F1"], 4)
)
}
# Evaluate all models
results <- bind_rows(
evaluate_model(rf_preds, true_labels, "Random Forest"),
evaluate_model(xgb_preds, true_labels, "XGBoost")
)
# Print results
print(results)
## Model Accuracy Precision Recall F1_Score
## Accuracy...1 Random Forest 0.5506 0.4094 0.2466 0.3078
## Accuracy...2 XGBoost 0.4604 0.4031 0.6898 0.5088
This analysis applied classification, regression, and predictive techniques to understand customer behavior in three key areas: churn, returns, and spending.
Churn Prediction: XGBoost combined with SMOTE achieved the best results, with exceptionally high precision and improved recall. This model enables businesses to proactively identify customers at risk of leaving and take targeted retention actions.
Return Prediction: Although return rates were consistent (~40%) across product categories, category alone was not a strong predictor. The most effective models used top features beyond category, offering deeper insights into return behavior and guiding inventory and policy decisions.
Spending Prediction: Regression models struggled to accurately predict customer spend, showing very low R² values. The random residuals suggest predictions were unbiased but lacked strong explanatory variables.
Overall, while classification tasks like churn detection performed well, regression and category-based predictions highlight the need for richer behavioral and product-level features. Future improvements should focus on advanced feature engineering to enhance model accuracy and business value.