1.1 Background
Black Friday is the day after the United States Thanksgiving holiday. The term “Black Friday” originated in the 1960s, initially used by Philadelphia police to describe the heavy and disruptive pedestrian and vehicle traffic that occurred on the day after Thanksgiving. Eventually, retailers started using the term to signify the start of the holiday shopping season and to attract customers with special sales and deals. It is now recognized as one of the busiest shopping days of the year, with retailers offering substantial discounts, limited-time offers, doorbuster deals, and other incentives to attract shoppers.
As consumer demand is anticipated to surge in a particular period, hence sales prediction is relatively crucial for various business planning such as supply chain management, resource optimization, inventory management, sales performance evaluation, and market analysis. Conventionally, most businesses rely heavily on a knowledge base and predicting sales trends (Cheriyan, S. el at., 2018). Since traditional methods are not very helpful for business growth in terms of revenue, the use of machine learning approaches proves to be an important point for designing the business plan taking into account consumer shopping behavior (Aher, A., el at., 2021).
In the era of Big Data and advanced technologies, companies are increasingly relying on data-driven strategies to gain a competitive advantage. By leveraging the available sales history dataset, machine learning is able to provide valuable business insight that facilitates business development and implementation of effective business strategies (Sathyanarayana, S., el at., 2023). In particular, accurate and timely sales forecasting helps businesses to optimize inventory management, allocate resources efficiently, and develop targeted marketing strategies. In addition, understanding the underlying patterns and factors that influence sales can provide valuable insights into consumer behavior and preferences.
In this paper, we aim to identify underlying customer purchase behavior against various products as well as to analyze data correlation among the variables. Subsequently, to develop an effective sales forecasting model by evaluating the performance of XG Boost and Linear Regression.
1.2 About Dataset
The dataset of this project is a purchase summary of various customers for selected high-volume products from last month for a retail company “ABC Private Limited”. The dataset consists of 550k rows and 12 columns with different data types, breakdown by 6 columns with integer data type, 5 columns with string data type and 1 column labeled as ID.
The variables in the dataset are listed as followed:
Source link: https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale?resource=download
1.3 Project Questions
1.4 Project Objectives
Data Overview
# Import dataset
df=read.csv("train.csv",header = T)
# View first 6 rows of df
head(df)
# View structure of df
str(df)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : chr "P00069042" "P00248942" "P00087842" "P00085442" ...
## $ Gender : chr "F" "F" "F" "F" ...
## $ Age : chr "0-17" "0-17" "0-17" "0-17" ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : chr "A" "A" "A" "A" ...
## $ Stay_In_Current_City_Years: chr "2" "2" "2" "2" ...
## $ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : int NA 6 NA 14 NA 2 8 15 16 NA ...
## $ Product_Category_3 : int NA 14 NA NA NA NA 17 NA NA NA ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
# View the number of unique values for each category
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sapply(df, function(x) n_distinct(x))
## User_ID Product_ID
## 5891 3631
## Gender Age
## 2 7
## Occupation City_Category
## 21 3
## Stay_In_Current_City_Years Marital_Status
## 5 2
## Product_Category_1 Product_Category_2
## 20 18
## Product_Category_3 Purchase
## 16 18105
summary(df)
## User_ID Product_ID Gender Age
## Min. :1000001 Length:550068 Length:550068 Length:550068
## 1st Qu.:1001516 Class :character Class :character Class :character
## Median :1003077 Mode :character Mode :character Mode :character
## Mean :1003029
## 3rd Qu.:1004478
## Max. :1006040
##
## Occupation City_Category Stay_In_Current_City_Years
## Min. : 0.000 Length:550068 Length:550068
## 1st Qu.: 2.000 Class :character Class :character
## Median : 7.000 Mode :character Mode :character
## Mean : 8.077
## 3rd Qu.:14.000
## Max. :20.000
##
## Marital_Status Product_Category_1 Product_Category_2 Product_Category_3
## Min. :0.0000 Min. : 1.000 Min. : 2.00 Min. : 3.0
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 5.00 1st Qu.: 9.0
## Median :0.0000 Median : 5.000 Median : 9.00 Median :14.0
## Mean :0.4097 Mean : 5.404 Mean : 9.84 Mean :12.7
## 3rd Qu.:1.0000 3rd Qu.: 8.000 3rd Qu.:15.00 3rd Qu.:16.0
## Max. :1.0000 Max. :20.000 Max. :18.00 Max. :18.0
## NA's :173638 NA's :383247
## Purchase
## Min. : 12
## 1st Qu.: 5823
## Median : 8047
## Mean : 9264
## 3rd Qu.:12054
## Max. :23961
##
Data Preprocessing
Duplicated rows and null removal
# Check if there is duplicated rows
any(duplicated(df))
## [1] FALSE
# Check if there is any NA in df
any(is.na(df))
## [1] TRUE
# Check total of NA for every columns
colSums(is.na(df))
## User_ID Product_ID
## 0 0
## Gender Age
## 0 0
## Occupation City_Category
## 0 0
## Stay_In_Current_City_Years Marital_Status
## 0 0
## Product_Category_1 Product_Category_2
## 0 173638
## Product_Category_3 Purchase
## 383247 0
# Replace NA values in Product Category 2 & 3 with "0"
df$Product_Category_2[is.na(df$Product_Category_2)] = 0
df$Product_Category_3[is.na(df$Product_Category_3)] = 0
# All NA are successfully removed
colSums(is.na(df))
## User_ID Product_ID
## 0 0
## Gender Age
## 0 0
## Occupation City_Category
## 0 0
## Stay_In_Current_City_Years Marital_Status
## 0 0
## Product_Category_1 Product_Category_2
## 0 0
## Product_Category_3 Purchase
## 0 0
Data Encoding
# Map Female(F) into 0 and Male(M) into 1
df$Gender = ifelse(df$Gender == "F", 0, 1)
# Check unique variables available in "Age" columns
unique(df$Age)
## [1] "0-17" "55+" "26-35" "46-50" "51-55" "36-45" "18-25"
# Map Age range value from youngest to eldest into 1 to 7 accordingly
df$Age = ifelse(df$Age == "0-17", 1,
ifelse(df$Age == "18-25", 2,
ifelse(df$Age == "26-35", 3,
ifelse(df$Age == "36-45", 4,
ifelse(df$Age == "46-50", 5,
ifelse(df$Age == "51-55", 6, 7))))))
# Double check replaced unique variables
unique(df$Gender)
## [1] 0 1
unique(df$Age)
## [1] 1 7 3 5 6 4 2
# Check unique values of Stay_in_Current_City_Years
unique(df$Stay_In_Current_City_Years)
## [1] "2" "4+" "3" "1" "0"
# Change "4+" to 4 and convert remaining values in the column into numeric data type
df$Stay_In_Current_City_Years=ifelse(df$Stay_In_Current_City_Years == "4+", 4,as.numeric(df$Stay_In_Current_City_Years))
## Warning in ifelse(df$Stay_In_Current_City_Years == "4+", 4,
## as.numeric(df$Stay_In_Current_City_Years)): NAs introduced by coercion
str(df)
## 'data.frame': 550068 obs. of 12 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : chr "P00069042" "P00248942" "P00087842" "P00085442" ...
## $ Gender : num 0 0 0 0 1 1 1 1 1 1 ...
## $ Age : num 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : int 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : chr "A" "A" "A" "A" ...
## $ Stay_In_Current_City_Years: num 2 2 2 2 4 3 2 2 2 1 ...
## $ Marital_Status : int 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : int 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : num 0 6 0 14 0 2 8 15 16 0 ...
## $ Product_Category_3 : num 0 14 0 0 0 0 17 0 0 0 ...
## $ Purchase : int 8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
Data Transformation
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.2 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Add a new column (colvalue) with all values equal to 1
# Then do "spread"
df_transformed_city = df %>% mutate(colvalue=1) %>% spread(City_Category, colvalue)
# Convert "NA" value to 0
df_transformed_city[is.na(df_transformed_city)]<-0
# Rearrange City Category back into position
df1=select(df_transformed_city,c("User_ID","Product_ID","Gender","Age","Occupation","A","B","C","Stay_In_Current_City_Years","Marital_Status","Product_Category_1","Product_Category_2","Product_Category_3","Purchase"))
# Rename each dummy variables by adding prefix "City_" in front of them
names(df1)[names(df1) == "A"] <- "City_A"
names(df1)[names(df1) == "B"] <- "City_B"
names(df1)[names(df1) == "C"] <- "City_C"
Data Type Conversion
# Remove substring "P00" from Product_ID
df1$Product_ID <- gsub("P00", "", df1$Product_ID)
# Convert values in Product_ID into numeric data type
df1$Product_ID <- as.numeric(df1$Product_ID)
# Convert remaining variables into numeric data type (for ease of correlation)
df1$Occupation=as.numeric(df1$Occupation)
df1$Marital_Status=as.numeric(df1$Marital_Status)
df1$Product_Category_1=as.numeric(df1$Product_Category_1)
df1$Purchase=as.numeric(df1$Purchase)
str(df1)
## 'data.frame': 550068 obs. of 14 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : num 69042 248942 87842 85442 285442 ...
## $ Gender : num 0 0 0 0 1 1 1 1 1 1 ...
## $ Age : num 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : num 10 10 10 10 16 15 7 7 7 20 ...
## $ City_A : num 1 1 1 1 0 1 0 0 0 1 ...
## $ City_B : num 0 0 0 0 0 0 1 1 1 0 ...
## $ City_C : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Stay_In_Current_City_Years: num 2 2 2 2 4 3 2 2 2 1 ...
## $ Marital_Status : num 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : num 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : num 0 6 0 14 0 2 8 15 16 0 ...
## $ Product_Category_3 : num 0 14 0 0 0 0 17 0 0 0 ...
## $ Purchase : num 8370 15200 1422 1057 7969 ...
**Extra codes to make both df and df1 similar except for data transformation part (City)
# Convert remaining columns into numeric data type
df$Occupation=as.numeric(df$Occupation)
df$Marital_Status=as.numeric(df$Marital_Status)
df$Product_Category_1=as.numeric(df$Product_Category_1)
df$Purchase=as.numeric(df$Purchase)
df=select(df,-User_ID)
str(df)
## 'data.frame': 550068 obs. of 11 variables:
## $ Product_ID : chr "P00069042" "P00248942" "P00087842" "P00085442" ...
## $ Gender : num 0 0 0 0 1 1 1 1 1 1 ...
## $ Age : num 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : num 10 10 10 10 16 15 7 7 7 20 ...
## $ City_Category : chr "A" "A" "A" "A" ...
## $ Stay_In_Current_City_Years: num 2 2 2 2 4 3 2 2 2 1 ...
## $ Marital_Status : num 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : num 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : num 0 6 0 14 0 2 8 15 16 0 ...
## $ Product_Category_3 : num 0 14 0 0 0 0 17 0 0 0 ...
## $ Purchase : num 8370 15200 1422 1057 7969 ...
str(df1)
## 'data.frame': 550068 obs. of 14 variables:
## $ User_ID : int 1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
## $ Product_ID : num 69042 248942 87842 85442 285442 ...
## $ Gender : num 0 0 0 0 1 1 1 1 1 1 ...
## $ Age : num 1 1 1 1 7 3 5 5 5 3 ...
## $ Occupation : num 10 10 10 10 16 15 7 7 7 20 ...
## $ City_A : num 1 1 1 1 0 1 0 0 0 1 ...
## $ City_B : num 0 0 0 0 0 0 1 1 1 0 ...
## $ City_C : num 0 0 0 0 1 0 0 0 0 0 ...
## $ Stay_In_Current_City_Years: num 2 2 2 2 4 3 2 2 2 1 ...
## $ Marital_Status : num 0 0 0 0 0 0 1 1 1 1 ...
## $ Product_Category_1 : num 3 1 12 12 8 1 1 1 1 8 ...
## $ Product_Category_2 : num 0 6 0 14 0 2 8 15 16 0 ...
## $ Product_Category_3 : num 0 14 0 0 0 0 17 0 0 0 ...
## $ Purchase : num 8370 15200 1422 1057 7969 ...
Export cleaned data for further EDA and modelling
# Save df into "output_without_transformation.csv"
write.csv(df, file = "output_without_transformation.csv", row.names = T)
# Save df into "output_with_transformation.csv"
write.csv(df1, file = "output_with_transformation.csv", row.names = T)
# Load required libraries
library(dplyr)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(rpart)
# Read the cleaned data file (output_with_transformation.csv)
df1 <- read.csv("output_with_transformation.csv", header = TRUE)
# Summary statistics
summary(df1)
## X User_ID Product_ID Gender
## Min. : 1 Min. :1000001 Min. : 142 Min. :0.0000
## 1st Qu.:137518 1st Qu.:1001516 1st Qu.: 97142 1st Qu.:1.0000
## Median :275035 Median :1003077 Median :169742 Median :1.0000
## Mean :275035 Mean :1003029 Mean :174745 Mean :0.7531
## 3rd Qu.:412551 3rd Qu.:1004478 3rd Qu.:259142 3rd Qu.:1.0000
## Max. :550068 Max. :1006040 Max. :375436 Max. :1.0000
## Age Occupation City_A City_B
## Min. :1.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:3.000 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :3.000 Median : 7.000 Median :0.0000 Median :0.0000
## Mean :3.496 Mean : 8.077 Mean :0.2685 Mean :0.4203
## 3rd Qu.:4.000 3rd Qu.:14.000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :7.000 Max. :20.000 Max. :1.0000 Max. :1.0000
## City_C Stay_In_Current_City_Years Marital_Status
## Min. :0.0000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :2.000 Median :0.0000
## Mean :0.3112 Mean :1.858 Mean :0.4097
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :1.0000 Max. :4.000 Max. :1.0000
## Product_Category_1 Product_Category_2 Product_Category_3 Purchase
## Min. : 1.000 Min. : 0.000 Min. : 0.000 Min. : 12
## 1st Qu.: 1.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5823
## Median : 5.000 Median : 5.000 Median : 0.000 Median : 8047
## Mean : 5.404 Mean : 6.735 Mean : 3.842 Mean : 9264
## 3rd Qu.: 8.000 3rd Qu.:14.000 3rd Qu.: 8.000 3rd Qu.:12054
## Max. :20.000 Max. :18.000 Max. :18.000 Max. :23961
# How many unique User_IDs are there in the dataset?
length(unique(df1$User_ID)) # The store had 5891 customers
## [1] 5891
# How many items did each customer purchase?
Unique_UserID <- as.data.frame(table(df1$User_ID))
names(Unique_UserID) <- c("User_ID", "Customer_Purchase_Count")
head(Unique_UserID)
# Due to the large dataset, the average values were used for this analysis
new_data <- df1 %>%
group_by(User_ID, Age, Gender, Occupation, City_A, City_B, City_C, Stay_In_Current_City_Years, Marital_Status) %>%
summarise(Avg_Product_Category_1 = mean(Product_Category_1),
Avg_Product_Category_2 = mean(Product_Category_2),
Avg_Product_Category_3 = mean(Product_Category_3),
Avg_Purchase = mean(Purchase))
## `summarise()` has grouped output by 'User_ID', 'Age', 'Gender', 'Occupation',
## 'City_A', 'City_B', 'City_C', 'Stay_In_Current_City_Years'. You can override
## using the `.groups` argument.
Which Age group/gender had the highest purchase by product category?
Product Category 1
ggplot(new_data, aes(Gender, Avg_Product_Category_1, fill = Gender)) +
geom_col(width = 0.4) +
facet_wrap(~ Age) +
labs(title = "Age Group/Gender Vs Product Category 1")
Product Category 2
ggplot(new_data, aes(Gender, Avg_Product_Category_2, fill = Gender)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Age Group/Gender Vs Product Category 2")
Product Category 3
ggplot(new_data, aes(Gender, Avg_Product_Category_3, fill = Gender)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Age Group/Gender Vs Product Category 3")
Age group versus Average purchase amount
ggplot(new_data, aes(Age, Avg_Purchase, fill = Gender)) +
geom_col() +
facet_wrap(~ Gender) +
labs(title = "Age Group/Gender Vs Average Purchase Amount")
Which product category raked in the most money wrapped with age? Product Category 1
ggplot(new_data, aes(Avg_Product_Category_1, Avg_Purchase, color = Age)) +
geom_point() +
facet_wrap(~ Age) +
labs(title = "Product_Cat_1_Avg Vs Avg_Purchase_Amount")
Product Category 2
ggplot(new_data, aes(Avg_Product_Category_2, Avg_Purchase, color = Age)) +
geom_point() +
facet_wrap(~ Age) +
labs(title = "Product_Cat_2_Avg Vs Avg_Purchase_Amount")
Product Category 3
ggplot(new_data, aes(Avg_Product_Category_3, Avg_Purchase, color = Age)) +
geom_point() +
facet_wrap(~ Age) +
labs(title = "Product_Cat_3_Avg Vs Avg_Purchase_Amount")
Explore the occupation variable versus the product categories and the purchase amount
Which occupation had more influence on product purchase?
Product Category 1
ggplot(new_data, aes(Occupation, Avg_Product_Category_1, fill = Occupation)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Occupation Vs Product Category 1")
Product Category 2
ggplot(new_data, aes(Occupation, Avg_Product_Category_2, fill = Occupation)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Occupation Vs Product Category 2")
Product Category 3
ggplot(new_data, aes(Occupation, Avg_Product_Category_3, fill = Occupation)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Occupation Vs Product Category 3")
Which occupation spent the most money?
ggplot(new_data, aes(Occupation, Avg_Purchase, fill = Occupation)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Occupation Vs Average Purchase Amount")
Explore the city category variable versus the product categories and the purchase amount
Box plot for average product category 1
ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_1)) +
geom_boxplot() +
xlab("City Category") +
ylab("Average Product Category 1") +
ggtitle("Average Product Category 1 by City Category")
Box plot for average product category 2
ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_2)) +
geom_boxplot() +
xlab("City Category") +
ylab("Average Product Category 2") +
ggtitle("Average Product Category 2 by City Category")
Box plot for average product category 3
ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_3)) +
geom_boxplot() +
xlab("City Category") +
ylab("Average Product Category 3") +
ggtitle("Average Product Category 3 by City Category")
City Category versus Average purchase amount
ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Purchase)) +
geom_boxplot() +
xlab("City Category") +
ylab("Average Purchase") +
ggtitle("Average Purchase by City Category")
Explore the Stay in current city variable versus the product categories and the purchase amount Product Category 1
ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_1, fill = Stay_In_Current_City_Years)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Stay in current city Vs Product_Category 1")
Product Category 2
ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_2, fill = Stay_In_Current_City_Years)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Stay in current city Vs Product_Category 2")
Product category 3
ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_3, fill = Stay_In_Current_City_Years)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Stay in current city Vs Product_Category 3")
Stay in current city versus Average purchase amount
ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Purchase, fill = Stay_In_Current_City_Years)) +
geom_col() +
facet_wrap(~ Age) +
labs(title = "Stay in current city Vs Avg_Purchase_Amount")
Explore the marital status variable. #Box plot of average product category 1 by marital status
ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_1)) +
geom_boxplot() +
xlab("Marital Status") +
ylab("Average Product Category 1") +
ggtitle("Average Product Category 1 by Marital Status")
Box plot of average product category 2 by marital status
ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_2)) +
geom_boxplot() +
xlab("Marital Status") +
ylab("Average Product Category 2") +
ggtitle("Average Product Category 2 by Marital Status")
Box plot of average product category 3 by marital status
ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_3)) +
geom_boxplot() +
xlab("Marital Status") +
ylab("Average Product Category 3") +
ggtitle("Average Product Category 3 by Marital Status")
Marital status versus average purchase amount
ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Purchase)) +
geom_boxplot() +
xlab("Marital Status") +
ylab("Average Purchase") +
ggtitle("Average Purchase by Marital Status")
Correlation analysis
# Select numerical columns for correlation analysis
numerical_cols <- c("Age", "Occupation", "City_A", "City_B", "City_C", "Stay_In_Current_City_Years", "Marital_Status", "Avg_Product_Category_1", "Avg_Product_Category_2", "Avg_Product_Category_3", "Avg_Purchase")
# Subset data for correlation analysis
cor_data <- new_data[numerical_cols]
# Calculate correlation matrix
cor_matrix <- cor(cor_data)
# Plot correlation matrix
corrplot(cor_matrix, method = "circle", tl.col = "black", tl.srt = 45)
4.1 XGBOOST
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(1)
data <- read.csv("output_with_transformation.csv", header = TRUE)
data <- data[, -1]
summary(data)
## User_ID Product_ID Gender Age
## Min. :1000001 Min. : 142 Min. :0.0000 Min. :1.000
## 1st Qu.:1001516 1st Qu.: 97142 1st Qu.:1.0000 1st Qu.:3.000
## Median :1003077 Median :169742 Median :1.0000 Median :3.000
## Mean :1003029 Mean :174745 Mean :0.7531 Mean :3.496
## 3rd Qu.:1004478 3rd Qu.:259142 3rd Qu.:1.0000 3rd Qu.:4.000
## Max. :1006040 Max. :375436 Max. :1.0000 Max. :7.000
## Occupation City_A City_B City_C
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median : 7.000 Median :0.0000 Median :0.0000 Median :0.0000
## Mean : 8.077 Mean :0.2685 Mean :0.4203 Mean :0.3112
## 3rd Qu.:14.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :20.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Stay_In_Current_City_Years Marital_Status Product_Category_1
## Min. :0.000 Min. :0.0000 Min. : 1.000
## 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 1.000
## Median :2.000 Median :0.0000 Median : 5.000
## Mean :1.858 Mean :0.4097 Mean : 5.404
## 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.: 8.000
## Max. :4.000 Max. :1.0000 Max. :20.000
## Product_Category_2 Product_Category_3 Purchase
## Min. : 0.000 Min. : 0.000 Min. : 12
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 5823
## Median : 5.000 Median : 0.000 Median : 8047
## Mean : 6.735 Mean : 3.842 Mean : 9264
## 3rd Qu.:14.000 3rd Qu.: 8.000 3rd Qu.:12054
## Max. :18.000 Max. :18.000 Max. :23961
sapply(data, class)
## User_ID Product_ID
## "integer" "integer"
## Gender Age
## "integer" "integer"
## Occupation City_A
## "integer" "integer"
## City_B City_C
## "integer" "integer"
## Stay_In_Current_City_Years Marital_Status
## "integer" "integer"
## Product_Category_1 Product_Category_2
## "integer" "integer"
## Product_Category_3 Purchase
## "integer" "integer"
data[] <- sapply(data, as.numeric)
indices <- createDataPartition(data$Purchase, p = 0.8, list = FALSE)
train_data <- data[indices,]
test_data <- data[-indices,]
train_control <- trainControl(method = "CV",
number = 3,
verboseIter = FALSE,
allowParallel = TRUE)
# Training the XGBoost model
xgb_model <- xgboost(
data = as.matrix(train_data[, -14]), # Use matrix representation of features
label = train_data[, 14], # Specify the target variable
nrounds = 50, # Set the number of boosting rounds
max_depth = 4, # Set the maximum depth of each tree
eta = 0.3, # Set the learning rate
gamma = 0, # Set the minimum loss reduction required for a split
colsample_bytree = 1, # Set the fraction of columns used by each tree
min_child_weight = 1, # Set the minimum sum of instance weight required for a child
subsample = 1, # Set the fraction of instances used by each tree
verbose = FALSE # Print progress messages
)
# Making predictions on the test data
predictions <- predict(xgb_model, as.matrix(test_data[, -14]))
#prediction:
xgb.pred <- predictions
mse = mean((test_data$Purchase - xgb.pred)^2)
mae = caret::MAE(test_data$Purchase, xgb.pred)
rmse = caret::RMSE(test_data$Purchase, xgb.pred)
# Calculate R2 score
SSR <- sum((test_data$Purchase - xgb.pred)^2)
SST <- sum((test_data$Purchase - mean(test_data$Purchase))^2)
R2_score <- 1 - (SSR/SST)
R2_score
## [1] 0.6862764
4.2 Linear Regression
# Linear Regression Model
lm_model <- lm(Purchase ~ . - User_ID - Product_ID, data = train_data)
# Making predictions on the test data
lm_predictions <- predict(lm_model, newdata = test_data)
# Calculate metrics
lm_mse <- mean((test_data$Purchase - lm_predictions)^2)
lm_mae <- caret::MAE(test_data$Purchase, lm_predictions)
lm_rmse <- caret::RMSE(test_data$Purchase, lm_predictions)
# Calculate R2 score
lm_SSR <- sum((test_data$Purchase - lm_predictions)^2)
lm_SST <- sum((test_data$Purchase - mean(test_data$Purchase))^2)
lm_R2_score <- 1 - (lm_SSR/lm_SST)
lm_model # Print the linear regression model
##
## Call:
## lm(formula = Purchase ~ . - User_ID - Product_ID, data = train_data)
##
## Coefficients:
## (Intercept) Gender
## 10110.569 484.877
## Age Occupation
## 106.094 5.131
## City_A City_B
## -618.180 -479.990
## City_C Stay_In_Current_City_Years
## NA 6.838
## Marital_Status Product_Category_1
## -49.118 -347.429
## Product_Category_2 Product_Category_3
## 11.393 143.946
lm_mse # Print MSE
## [1] 21267874
lm_mae # Print MAE
## [1] 3527.213
lm_rmse # Print RMSE
## [1] 4611.711
lm_R2_score # Print R2 score
## [1] 0.1531761
result_xgboost <- c(mse = round(mse,2), mae = round(mae,2), rmse = round(rmse,2), R2_score = round(R2_score,2))
result_lm <- c(mse = round(lm_mse,2), mae = round(lm_mae,2), rmse = round(lm_rmse,2), R2_score = round(lm_R2_score,2))
summary_result <- data.frame(XGBoost = result_xgboost, LinearRegression = result_lm)
summary_result
Based on the table above, the XGBoost model generally outperforms the Linear Regression model, as it achieves lower values for MSE, MAE, and RMSE, indicating better predictive accuracy. Additionally, the XGBoost model has a higher R-squared (R2) score, indicating that it explains a larger proportion of the variance in the target variable compared to the Linear Regression model.
In conclusion, we are able to fulfill the targeted objectives. Firstly, we are able to identify the customer purchase behavior against various products and able to analyse data correlation at EDA stage. Secondly, we able to develop a sales forecasting model using XGBoost & Linear Regression, subsequently evaluate the best model, which is XGBoost.
By developing an efficient and effective Sales Prediction Model, retailer are able to achieve revenue optimization through performing various planning, such as supply chain management, manpower planning and market analysis.
As a result, they can react quickly to the dynamic change of the market and gain market position.
Aher, A., Kanan, R., & Vispute, S. (2021). Data Analysis and Price Prediction of Black Friday Sales using Machine Learning Techniques. International Journal of Engineering Research & Technology (IJERT), 10, 7. doi:10.17577/IJERTV10IS070271
Alagarsamy, S., Varma, K. G., Harshitha, K., Hareesh, K., & Varshini, K. (2023). Predictive Analytics for Black Friday Sales using Machine Learning Technique. 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), 389-393. doi:10.1109/IDCIoT56793.2023.10053454
Cheriyan, S., Ibrahim, S., Mohanan, S., & Treesa, S. (2018). Intelligent Sales Prediction Using Machine Learning Techniques. 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE), 53-58. doi:10.1109/iCCECOME.2018.8659115
Sathyanarayana, S., Apeksha, C., Chethana, S., Chinmayee, H. C., & Abhishree, G, L. (2023). Big Mart Sales Prediction Using Machine Learning. International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 12. doi:10.17148/IJARCCE.2023.124112