Table of content

  1. Introduction
  1. Data Preparation
  1. Exploratory Data Analysis
  1. Modeling
  1. Model Evaluation
  1. Conclusion

1. Introduction

1.1 Background

Black Friday is the day after the United States Thanksgiving holiday. The term “Black Friday” originated in the 1960s, initially used by Philadelphia police to describe the heavy and disruptive pedestrian and vehicle traffic that occurred on the day after Thanksgiving. Eventually, retailers started using the term to signify the start of the holiday shopping season and to attract customers with special sales and deals. It is now recognized as one of the busiest shopping days of the year, with retailers offering substantial discounts, limited-time offers, doorbuster deals, and other incentives to attract shoppers.

As consumer demand is anticipated to surge in a particular period, hence sales prediction is relatively crucial for various business planning such as supply chain management, resource optimization, inventory management, sales performance evaluation, and market analysis. Conventionally, most businesses rely heavily on a knowledge base and predicting sales trends (Cheriyan, S. el at., 2018). Since traditional methods are not very helpful for business growth in terms of revenue, the use of machine learning approaches proves to be an important point for designing the business plan taking into account consumer shopping behavior (Aher, A., el at., 2021).

In the era of Big Data and advanced technologies, companies are increasingly relying on data-driven strategies to gain a competitive advantage. By leveraging the available sales history dataset, machine learning is able to provide valuable business insight that facilitates business development and implementation of effective business strategies (Sathyanarayana, S., el at., 2023). In particular, accurate and timely sales forecasting helps businesses to optimize inventory management, allocate resources efficiently, and develop targeted marketing strategies. In addition, understanding the underlying patterns and factors that influence sales can provide valuable insights into consumer behavior and preferences.

In this paper, we aim to identify underlying customer purchase behavior against various products as well as to analyze data correlation among the variables. Subsequently, to develop an effective sales forecasting model by evaluating the performance of XG Boost and Linear Regression.

1.2 About Dataset

The dataset of this project is a purchase summary of various customers for selected high-volume products from last month for a retail company “ABC Private Limited”. The dataset consists of 550k rows and 12 columns with different data types, breakdown by 6 columns with integer data type, 5 columns with string data type and 1 column labeled as ID.

The variables in the dataset are listed as followed:

  1. User_ID: This variable represents the unique ID of the user.
  2. Product_ID: This variable represents the unique ID of the product.
  3. Gender: This variable indicates the sex of the user (male or female).
  4. Age: This variable represents the age of the user, grouped into bins or categories.
  5. Occupation: This variable refers to the occupation of the user. The occupation values are likely masked or anonymized for privacy reasons.
  6. City_Category: This variable represents the category of the city where the user is located. It is categorized as A, B, or C.
  7. Stay_In_Current_City_Years: This variable denotes the number of years the user has stayed in the current city.
  8. Marital_Status: This variable indicates the marital status of the user (e.g., 0 for unmarried, 1 for married).
  9. Product_Category_1, Product_Category_2, Product_Category_3: These variables represent the product categories to which a product belongs. The categories may be masked or anonymized.
  10. Purchase: This variable is the target variable and represents the purchase amount made by the user.

Source link: https://www.kaggle.com/datasets/rajeshrampure/black-friday-sale?resource=download

1.3 Project Questions

  1. What is the customer purchase behavior against various products and the variables’ correlation?
  2. Which is the best Machine Learning model for Sales Prediction?

1.4 Project Objectives

  1. To identify customer purchase behavior against various products and analyze data correlation among the variables
  2. To develop an effective sales forecasting model by evaluating the performance of XG Boost and Linear Regression.

2. Data Preparation

Data Overview

# Import dataset 
df=read.csv("train.csv",header = T)

# View first 6 rows of df
head(df)
# View structure of df
str(df)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : chr  "P00069042" "P00248942" "P00087842" "P00085442" ...
##  $ Gender                    : chr  "F" "F" "F" "F" ...
##  $ Age                       : chr  "0-17" "0-17" "0-17" "0-17" ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : chr  "A" "A" "A" "A" ...
##  $ Stay_In_Current_City_Years: chr  "2" "2" "2" "2" ...
##  $ Marital_Status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : int  NA 6 NA 14 NA 2 8 15 16 NA ...
##  $ Product_Category_3        : int  NA 14 NA NA NA NA 17 NA NA NA ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...
# View the number of unique values for each category
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sapply(df, function(x) n_distinct(x))
##                    User_ID                 Product_ID 
##                       5891                       3631 
##                     Gender                        Age 
##                          2                          7 
##                 Occupation              City_Category 
##                         21                          3 
## Stay_In_Current_City_Years             Marital_Status 
##                          5                          2 
##         Product_Category_1         Product_Category_2 
##                         20                         18 
##         Product_Category_3                   Purchase 
##                         16                      18105
summary(df)
##     User_ID         Product_ID           Gender              Age           
##  Min.   :1000001   Length:550068      Length:550068      Length:550068     
##  1st Qu.:1001516   Class :character   Class :character   Class :character  
##  Median :1003077   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1003029                                                           
##  3rd Qu.:1004478                                                           
##  Max.   :1006040                                                           
##                                                                            
##    Occupation     City_Category      Stay_In_Current_City_Years
##  Min.   : 0.000   Length:550068      Length:550068             
##  1st Qu.: 2.000   Class :character   Class :character          
##  Median : 7.000   Mode  :character   Mode  :character          
##  Mean   : 8.077                                                
##  3rd Qu.:14.000                                                
##  Max.   :20.000                                                
##                                                                
##  Marital_Status   Product_Category_1 Product_Category_2 Product_Category_3
##  Min.   :0.0000   Min.   : 1.000     Min.   : 2.00      Min.   : 3.0      
##  1st Qu.:0.0000   1st Qu.: 1.000     1st Qu.: 5.00      1st Qu.: 9.0      
##  Median :0.0000   Median : 5.000     Median : 9.00      Median :14.0      
##  Mean   :0.4097   Mean   : 5.404     Mean   : 9.84      Mean   :12.7      
##  3rd Qu.:1.0000   3rd Qu.: 8.000     3rd Qu.:15.00      3rd Qu.:16.0      
##  Max.   :1.0000   Max.   :20.000     Max.   :18.00      Max.   :18.0      
##                                      NA's   :173638     NA's   :383247    
##     Purchase    
##  Min.   :   12  
##  1st Qu.: 5823  
##  Median : 8047  
##  Mean   : 9264  
##  3rd Qu.:12054  
##  Max.   :23961  
## 

Data Preprocessing

Duplicated rows and null removal

# Check if there is duplicated rows
any(duplicated(df))
## [1] FALSE
# Check if there is any NA in df
any(is.na(df))
## [1] TRUE
# Check total of NA for every columns
colSums(is.na(df))
##                    User_ID                 Product_ID 
##                          0                          0 
##                     Gender                        Age 
##                          0                          0 
##                 Occupation              City_Category 
##                          0                          0 
## Stay_In_Current_City_Years             Marital_Status 
##                          0                          0 
##         Product_Category_1         Product_Category_2 
##                          0                     173638 
##         Product_Category_3                   Purchase 
##                     383247                          0
# Replace NA values in Product Category 2 & 3 with "0"
df$Product_Category_2[is.na(df$Product_Category_2)] = 0
df$Product_Category_3[is.na(df$Product_Category_3)] = 0

# All NA are successfully removed
colSums(is.na(df))
##                    User_ID                 Product_ID 
##                          0                          0 
##                     Gender                        Age 
##                          0                          0 
##                 Occupation              City_Category 
##                          0                          0 
## Stay_In_Current_City_Years             Marital_Status 
##                          0                          0 
##         Product_Category_1         Product_Category_2 
##                          0                          0 
##         Product_Category_3                   Purchase 
##                          0                          0

Data Encoding

# Map Female(F) into 0 and Male(M) into 1
df$Gender = ifelse(df$Gender == "F", 0, 1)

# Check unique variables available in "Age" columns 
unique(df$Age)
## [1] "0-17"  "55+"   "26-35" "46-50" "51-55" "36-45" "18-25"
# Map Age range value from youngest to eldest into 1 to 7 accordingly
df$Age = ifelse(df$Age == "0-17", 1, 
                    ifelse(df$Age == "18-25", 2,
                           ifelse(df$Age == "26-35", 3,
                                  ifelse(df$Age == "36-45", 4,
                                         ifelse(df$Age == "46-50", 5,
                                                ifelse(df$Age == "51-55", 6, 7))))))
# Double check replaced unique variables
unique(df$Gender)
## [1] 0 1
unique(df$Age)
## [1] 1 7 3 5 6 4 2
# Check unique values of Stay_in_Current_City_Years
unique(df$Stay_In_Current_City_Years)
## [1] "2"  "4+" "3"  "1"  "0"
# Change "4+" to 4 and convert remaining values in the column into numeric data type 
df$Stay_In_Current_City_Years=ifelse(df$Stay_In_Current_City_Years == "4+", 4,as.numeric(df$Stay_In_Current_City_Years)) 
## Warning in ifelse(df$Stay_In_Current_City_Years == "4+", 4,
## as.numeric(df$Stay_In_Current_City_Years)): NAs introduced by coercion
str(df)
## 'data.frame':    550068 obs. of  12 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : chr  "P00069042" "P00248942" "P00087842" "P00085442" ...
##  $ Gender                    : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ Age                       : num  1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : int  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : chr  "A" "A" "A" "A" ...
##  $ Stay_In_Current_City_Years: num  2 2 2 2 4 3 2 2 2 1 ...
##  $ Marital_Status            : int  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : int  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : num  0 6 0 14 0 2 8 15 16 0 ...
##  $ Product_Category_3        : num  0 14 0 0 0 0 17 0 0 0 ...
##  $ Purchase                  : int  8370 15200 1422 1057 7969 15227 19215 15854 15686 7871 ...

Data Transformation

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.2     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Add a new column (colvalue) with all values equal to 1
# Then do "spread"
df_transformed_city = df %>% mutate(colvalue=1) %>% spread(City_Category, colvalue)
# Convert "NA" value to 0
df_transformed_city[is.na(df_transformed_city)]<-0
# Rearrange City Category back into position
df1=select(df_transformed_city,c("User_ID","Product_ID","Gender","Age","Occupation","A","B","C","Stay_In_Current_City_Years","Marital_Status","Product_Category_1","Product_Category_2","Product_Category_3","Purchase"))

# Rename each dummy variables by adding prefix "City_" in front of them
names(df1)[names(df1) == "A"] <- "City_A"
names(df1)[names(df1) == "B"] <- "City_B"
names(df1)[names(df1) == "C"] <- "City_C"

Data Type Conversion

# Remove substring "P00" from Product_ID 
df1$Product_ID <- gsub("P00", "", df1$Product_ID)

# Convert values in Product_ID into numeric data type
df1$Product_ID <- as.numeric(df1$Product_ID)

# Convert remaining variables into numeric data type (for ease of correlation)
df1$Occupation=as.numeric(df1$Occupation)
df1$Marital_Status=as.numeric(df1$Marital_Status)
df1$Product_Category_1=as.numeric(df1$Product_Category_1)
df1$Purchase=as.numeric(df1$Purchase)

str(df1)
## 'data.frame':    550068 obs. of  14 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : num  69042 248942 87842 85442 285442 ...
##  $ Gender                    : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ Age                       : num  1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : num  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_A                    : num  1 1 1 1 0 1 0 0 0 1 ...
##  $ City_B                    : num  0 0 0 0 0 0 1 1 1 0 ...
##  $ City_C                    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Stay_In_Current_City_Years: num  2 2 2 2 4 3 2 2 2 1 ...
##  $ Marital_Status            : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : num  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : num  0 6 0 14 0 2 8 15 16 0 ...
##  $ Product_Category_3        : num  0 14 0 0 0 0 17 0 0 0 ...
##  $ Purchase                  : num  8370 15200 1422 1057 7969 ...

**Extra codes to make both df and df1 similar except for data transformation part (City)

# Convert remaining columns into numeric data type
df$Occupation=as.numeric(df$Occupation)
df$Marital_Status=as.numeric(df$Marital_Status)
df$Product_Category_1=as.numeric(df$Product_Category_1)
df$Purchase=as.numeric(df$Purchase)
df=select(df,-User_ID)
str(df)
## 'data.frame':    550068 obs. of  11 variables:
##  $ Product_ID                : chr  "P00069042" "P00248942" "P00087842" "P00085442" ...
##  $ Gender                    : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ Age                       : num  1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : num  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_Category             : chr  "A" "A" "A" "A" ...
##  $ Stay_In_Current_City_Years: num  2 2 2 2 4 3 2 2 2 1 ...
##  $ Marital_Status            : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : num  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : num  0 6 0 14 0 2 8 15 16 0 ...
##  $ Product_Category_3        : num  0 14 0 0 0 0 17 0 0 0 ...
##  $ Purchase                  : num  8370 15200 1422 1057 7969 ...
str(df1)
## 'data.frame':    550068 obs. of  14 variables:
##  $ User_ID                   : int  1000001 1000001 1000001 1000001 1000002 1000003 1000004 1000004 1000004 1000005 ...
##  $ Product_ID                : num  69042 248942 87842 85442 285442 ...
##  $ Gender                    : num  0 0 0 0 1 1 1 1 1 1 ...
##  $ Age                       : num  1 1 1 1 7 3 5 5 5 3 ...
##  $ Occupation                : num  10 10 10 10 16 15 7 7 7 20 ...
##  $ City_A                    : num  1 1 1 1 0 1 0 0 0 1 ...
##  $ City_B                    : num  0 0 0 0 0 0 1 1 1 0 ...
##  $ City_C                    : num  0 0 0 0 1 0 0 0 0 0 ...
##  $ Stay_In_Current_City_Years: num  2 2 2 2 4 3 2 2 2 1 ...
##  $ Marital_Status            : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ Product_Category_1        : num  3 1 12 12 8 1 1 1 1 8 ...
##  $ Product_Category_2        : num  0 6 0 14 0 2 8 15 16 0 ...
##  $ Product_Category_3        : num  0 14 0 0 0 0 17 0 0 0 ...
##  $ Purchase                  : num  8370 15200 1422 1057 7969 ...

Export cleaned data for further EDA and modelling

# Save df into "output_without_transformation.csv"
write.csv(df, file = "output_without_transformation.csv", row.names = T)

# Save df into "output_with_transformation.csv"
write.csv(df1, file = "output_with_transformation.csv", row.names = T)

3. Exploratory Data Analysis

# Load required libraries
library(dplyr)
library(ggplot2)
library(corrplot)
## corrplot 0.92 loaded
library(rpart)
# Read the cleaned data file (output_with_transformation.csv)
df1 <- read.csv("output_with_transformation.csv", header = TRUE)
# Summary statistics
summary(df1)
##        X             User_ID          Product_ID         Gender      
##  Min.   :     1   Min.   :1000001   Min.   :   142   Min.   :0.0000  
##  1st Qu.:137518   1st Qu.:1001516   1st Qu.: 97142   1st Qu.:1.0000  
##  Median :275035   Median :1003077   Median :169742   Median :1.0000  
##  Mean   :275035   Mean   :1003029   Mean   :174745   Mean   :0.7531  
##  3rd Qu.:412551   3rd Qu.:1004478   3rd Qu.:259142   3rd Qu.:1.0000  
##  Max.   :550068   Max.   :1006040   Max.   :375436   Max.   :1.0000  
##       Age          Occupation         City_A           City_B      
##  Min.   :1.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:3.000   1st Qu.: 2.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :3.000   Median : 7.000   Median :0.0000   Median :0.0000  
##  Mean   :3.496   Mean   : 8.077   Mean   :0.2685   Mean   :0.4203  
##  3rd Qu.:4.000   3rd Qu.:14.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :7.000   Max.   :20.000   Max.   :1.0000   Max.   :1.0000  
##      City_C       Stay_In_Current_City_Years Marital_Status  
##  Min.   :0.0000   Min.   :0.000              Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:1.000              1st Qu.:0.0000  
##  Median :0.0000   Median :2.000              Median :0.0000  
##  Mean   :0.3112   Mean   :1.858              Mean   :0.4097  
##  3rd Qu.:1.0000   3rd Qu.:3.000              3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :4.000              Max.   :1.0000  
##  Product_Category_1 Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 1.000     Min.   : 0.000     Min.   : 0.000     Min.   :   12  
##  1st Qu.: 1.000     1st Qu.: 0.000     1st Qu.: 0.000     1st Qu.: 5823  
##  Median : 5.000     Median : 5.000     Median : 0.000     Median : 8047  
##  Mean   : 5.404     Mean   : 6.735     Mean   : 3.842     Mean   : 9264  
##  3rd Qu.: 8.000     3rd Qu.:14.000     3rd Qu.: 8.000     3rd Qu.:12054  
##  Max.   :20.000     Max.   :18.000     Max.   :18.000     Max.   :23961
# How many unique User_IDs are there in the dataset?
length(unique(df1$User_ID))  # The store had 5891 customers
## [1] 5891
# How many items did each customer purchase?
Unique_UserID <- as.data.frame(table(df1$User_ID))
names(Unique_UserID) <- c("User_ID", "Customer_Purchase_Count")
head(Unique_UserID)
# Due to the large dataset, the average values were used for this analysis
new_data <- df1 %>%
  group_by(User_ID, Age, Gender, Occupation, City_A, City_B, City_C, Stay_In_Current_City_Years, Marital_Status) %>%
  summarise(Avg_Product_Category_1 = mean(Product_Category_1),
            Avg_Product_Category_2 = mean(Product_Category_2),
            Avg_Product_Category_3 = mean(Product_Category_3),
            Avg_Purchase = mean(Purchase))
## `summarise()` has grouped output by 'User_ID', 'Age', 'Gender', 'Occupation',
## 'City_A', 'City_B', 'City_C', 'Stay_In_Current_City_Years'. You can override
## using the `.groups` argument.

Which Age group/gender had the highest purchase by product category?

Product Category 1

ggplot(new_data, aes(Gender, Avg_Product_Category_1, fill = Gender)) + 
  geom_col(width = 0.4) + 
  facet_wrap(~ Age) +
  labs(title = "Age Group/Gender Vs Product Category 1")

Product Category 2

ggplot(new_data, aes(Gender, Avg_Product_Category_2, fill = Gender)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Age Group/Gender Vs Product Category 2")

Product Category 3

ggplot(new_data, aes(Gender, Avg_Product_Category_3, fill = Gender)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Age Group/Gender Vs Product Category 3")

Age group versus Average purchase amount

ggplot(new_data, aes(Age, Avg_Purchase, fill = Gender)) + 
  geom_col() + 
  facet_wrap(~ Gender) +
  labs(title = "Age Group/Gender Vs Average Purchase Amount")

Which product category raked in the most money wrapped with age? Product Category 1

ggplot(new_data, aes(Avg_Product_Category_1, Avg_Purchase, color = Age)) + 
  geom_point() + 
  facet_wrap(~ Age) +
  labs(title = "Product_Cat_1_Avg Vs Avg_Purchase_Amount")

Product Category 2

ggplot(new_data, aes(Avg_Product_Category_2, Avg_Purchase, color = Age)) + 
  geom_point() + 
  facet_wrap(~ Age) +
  labs(title = "Product_Cat_2_Avg Vs Avg_Purchase_Amount")

Product Category 3

ggplot(new_data, aes(Avg_Product_Category_3, Avg_Purchase, color = Age)) + 
  geom_point() + 
  facet_wrap(~ Age) +
  labs(title = "Product_Cat_3_Avg Vs Avg_Purchase_Amount")

Explore the occupation variable versus the product categories and the purchase amount

Which occupation had more influence on product purchase?

Product Category 1

ggplot(new_data, aes(Occupation, Avg_Product_Category_1, fill = Occupation)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Occupation Vs Product Category 1")

Product Category 2

ggplot(new_data, aes(Occupation, Avg_Product_Category_2, fill = Occupation)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Occupation Vs Product Category 2")

Product Category 3

ggplot(new_data, aes(Occupation, Avg_Product_Category_3, fill = Occupation)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Occupation Vs Product Category 3")

Which occupation spent the most money?

ggplot(new_data, aes(Occupation, Avg_Purchase, fill = Occupation)) + 
  geom_col() + 
  facet_wrap(~ Age) +
  labs(title = "Occupation Vs Average Purchase Amount")

Explore the city category variable versus the product categories and the purchase amount

Box plot for average product category 1

ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_1)) +
  geom_boxplot() +
  xlab("City Category") +
  ylab("Average Product Category 1") +
  ggtitle("Average Product Category 1 by City Category")

Box plot for average product category 2

ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_2)) +
  geom_boxplot() +
  xlab("City Category") +
  ylab("Average Product Category 2") +
  ggtitle("Average Product Category 2 by City Category")

Box plot for average product category 3

ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Product_Category_3)) +
  geom_boxplot() +
  xlab("City Category") +
  ylab("Average Product Category 3") +
  ggtitle("Average Product Category 3 by City Category")

City Category versus Average purchase amount

ggplot(new_data, aes(x = ifelse(City_A == 1, "City_A", ifelse(City_B == 1, "City_B", "City_C")), y = Avg_Purchase)) +
  geom_boxplot() +
  xlab("City Category") +
  ylab("Average Purchase") +
  ggtitle("Average Purchase by City Category")

Explore the Stay in current city variable versus the product categories and the purchase amount Product Category 1

ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_1, fill = Stay_In_Current_City_Years)) + 
  geom_col() +
  facet_wrap(~ Age) +
  labs(title = "Stay in current city Vs Product_Category 1")

Product Category 2

ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_2, fill = Stay_In_Current_City_Years)) + 
  geom_col() +
  facet_wrap(~ Age) +
  labs(title = "Stay in current city Vs Product_Category 2")

Product category 3

ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Product_Category_3, fill = Stay_In_Current_City_Years)) + 
  geom_col() +
  facet_wrap(~ Age) +
  labs(title = "Stay in current city Vs Product_Category 3")

Stay in current city versus Average purchase amount

ggplot(new_data, aes(Stay_In_Current_City_Years, Avg_Purchase, fill = Stay_In_Current_City_Years)) + 
  geom_col() +
  facet_wrap(~ Age) +
  labs(title = "Stay in current city Vs Avg_Purchase_Amount")

Explore the marital status variable. #Box plot of average product category 1 by marital status

ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_1)) +
  geom_boxplot() +
  xlab("Marital Status") +
  ylab("Average Product Category 1") +
  ggtitle("Average Product Category 1 by Marital Status")

Box plot of average product category 2 by marital status

ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_2)) +
  geom_boxplot() +
  xlab("Marital Status") +
  ylab("Average Product Category 2") +
  ggtitle("Average Product Category 2 by Marital Status")

Box plot of average product category 3 by marital status

ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Product_Category_3)) +
  geom_boxplot() +
  xlab("Marital Status") +
  ylab("Average Product Category 3") +
  ggtitle("Average Product Category 3 by Marital Status")

Marital status versus average purchase amount

ggplot(new_data, aes(x = factor(Marital_Status), y = Avg_Purchase)) +
  geom_boxplot() +
  xlab("Marital Status") +
  ylab("Average Purchase") +
  ggtitle("Average Purchase by Marital Status")

Correlation analysis

# Select numerical columns for correlation analysis
numerical_cols <- c("Age", "Occupation", "City_A", "City_B", "City_C", "Stay_In_Current_City_Years", "Marital_Status", "Avg_Product_Category_1", "Avg_Product_Category_2", "Avg_Product_Category_3", "Avg_Purchase")

# Subset data for correlation analysis
cor_data <- new_data[numerical_cols]

# Calculate correlation matrix
cor_matrix <- cor(cor_data)

# Plot correlation matrix
corrplot(cor_matrix, method = "circle", tl.col = "black", tl.srt = 45)

4. Modeling

4.1 XGBOOST

library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(1)
data <- read.csv("output_with_transformation.csv", header = TRUE)
data <- data[, -1]
summary(data)
##     User_ID          Product_ID         Gender            Age       
##  Min.   :1000001   Min.   :   142   Min.   :0.0000   Min.   :1.000  
##  1st Qu.:1001516   1st Qu.: 97142   1st Qu.:1.0000   1st Qu.:3.000  
##  Median :1003077   Median :169742   Median :1.0000   Median :3.000  
##  Mean   :1003029   Mean   :174745   Mean   :0.7531   Mean   :3.496  
##  3rd Qu.:1004478   3rd Qu.:259142   3rd Qu.:1.0000   3rd Qu.:4.000  
##  Max.   :1006040   Max.   :375436   Max.   :1.0000   Max.   :7.000  
##    Occupation         City_A           City_B           City_C      
##  Min.   : 0.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 7.000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   : 8.077   Mean   :0.2685   Mean   :0.4203   Mean   :0.3112  
##  3rd Qu.:14.000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :20.000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##  Stay_In_Current_City_Years Marital_Status   Product_Category_1
##  Min.   :0.000              Min.   :0.0000   Min.   : 1.000    
##  1st Qu.:1.000              1st Qu.:0.0000   1st Qu.: 1.000    
##  Median :2.000              Median :0.0000   Median : 5.000    
##  Mean   :1.858              Mean   :0.4097   Mean   : 5.404    
##  3rd Qu.:3.000              3rd Qu.:1.0000   3rd Qu.: 8.000    
##  Max.   :4.000              Max.   :1.0000   Max.   :20.000    
##  Product_Category_2 Product_Category_3    Purchase    
##  Min.   : 0.000     Min.   : 0.000     Min.   :   12  
##  1st Qu.: 0.000     1st Qu.: 0.000     1st Qu.: 5823  
##  Median : 5.000     Median : 0.000     Median : 8047  
##  Mean   : 6.735     Mean   : 3.842     Mean   : 9264  
##  3rd Qu.:14.000     3rd Qu.: 8.000     3rd Qu.:12054  
##  Max.   :18.000     Max.   :18.000     Max.   :23961
sapply(data, class)
##                    User_ID                 Product_ID 
##                  "integer"                  "integer" 
##                     Gender                        Age 
##                  "integer"                  "integer" 
##                 Occupation                     City_A 
##                  "integer"                  "integer" 
##                     City_B                     City_C 
##                  "integer"                  "integer" 
## Stay_In_Current_City_Years             Marital_Status 
##                  "integer"                  "integer" 
##         Product_Category_1         Product_Category_2 
##                  "integer"                  "integer" 
##         Product_Category_3                   Purchase 
##                  "integer"                  "integer"
data[] <- sapply(data, as.numeric) 

indices <- createDataPartition(data$Purchase, p = 0.8, list = FALSE)
train_data <- data[indices,]
test_data <- data[-indices,]
train_control <- trainControl(method = "CV",
                              number = 3,
                              verboseIter = FALSE,
                              allowParallel = TRUE)
# Training the XGBoost model
xgb_model <- xgboost(
  data = as.matrix(train_data[, -14]),  # Use matrix representation of features
  label = train_data[, 14],  # Specify the target variable
  nrounds = 50,  # Set the number of boosting rounds
  max_depth = 4,  # Set the maximum depth of each tree
  eta = 0.3,  # Set the learning rate
  gamma = 0,  # Set the minimum loss reduction required for a split
  colsample_bytree = 1,  # Set the fraction of columns used by each tree
  min_child_weight = 1,  # Set the minimum sum of instance weight required for a child
  subsample = 1,  # Set the fraction of instances used by each tree
  verbose = FALSE  # Print progress messages
)
# Making predictions on the test data
predictions <- predict(xgb_model, as.matrix(test_data[, -14]))
#prediction:
xgb.pred <- predictions

mse = mean((test_data$Purchase - xgb.pred)^2)
mae = caret::MAE(test_data$Purchase, xgb.pred)
rmse = caret::RMSE(test_data$Purchase, xgb.pred)
# Calculate R2 score
SSR <- sum((test_data$Purchase - xgb.pred)^2)
SST <- sum((test_data$Purchase - mean(test_data$Purchase))^2)
R2_score <- 1 - (SSR/SST)

R2_score
## [1] 0.6862764

4.2 Linear Regression

# Linear Regression Model
lm_model <- lm(Purchase ~ . - User_ID - Product_ID, data = train_data)
# Making predictions on the test data
lm_predictions <- predict(lm_model, newdata = test_data)
# Calculate metrics
lm_mse <- mean((test_data$Purchase - lm_predictions)^2)
lm_mae <- caret::MAE(test_data$Purchase, lm_predictions)
lm_rmse <- caret::RMSE(test_data$Purchase, lm_predictions)
# Calculate R2 score
lm_SSR <- sum((test_data$Purchase - lm_predictions)^2)
lm_SST <- sum((test_data$Purchase - mean(test_data$Purchase))^2)
lm_R2_score <- 1 - (lm_SSR/lm_SST)
lm_model  # Print the linear regression model
## 
## Call:
## lm(formula = Purchase ~ . - User_ID - Product_ID, data = train_data)
## 
## Coefficients:
##                (Intercept)                      Gender  
##                  10110.569                     484.877  
##                        Age                  Occupation  
##                    106.094                       5.131  
##                     City_A                      City_B  
##                   -618.180                    -479.990  
##                     City_C  Stay_In_Current_City_Years  
##                         NA                       6.838  
##             Marital_Status          Product_Category_1  
##                    -49.118                    -347.429  
##         Product_Category_2          Product_Category_3  
##                     11.393                     143.946
lm_mse  # Print MSE
## [1] 21267874
lm_mae  # Print MAE
## [1] 3527.213
lm_rmse  # Print RMSE
## [1] 4611.711
lm_R2_score  # Print R2 score
## [1] 0.1531761

5. Model Evaluation

result_xgboost <- c(mse = round(mse,2), mae = round(mae,2), rmse = round(rmse,2), R2_score = round(R2_score,2))
result_lm <- c(mse = round(lm_mse,2), mae = round(lm_mae,2), rmse = round(lm_rmse,2), R2_score = round(lm_R2_score,2))
summary_result <- data.frame(XGBoost =  result_xgboost, LinearRegression = result_lm)
summary_result

Based on the table above, the XGBoost model generally outperforms the Linear Regression model, as it achieves lower values for MSE, MAE, and RMSE, indicating better predictive accuracy. Additionally, the XGBoost model has a higher R-squared (R2) score, indicating that it explains a larger proportion of the variance in the target variable compared to the Linear Regression model.

6. Conclusion

In conclusion, we are able to fulfill the targeted objectives. Firstly, we are able to identify the customer purchase behavior against various products and able to analyse data correlation at EDA stage. Secondly, we able to develop a sales forecasting model using XGBoost & Linear Regression, subsequently evaluate the best model, which is XGBoost.

By developing an efficient and effective Sales Prediction Model, retailer are able to achieve revenue optimization through performing various planning, such as supply chain management, manpower planning and market analysis.

As a result, they can react quickly to the dynamic change of the market and gain market position.

7. Reference

Aher, A., Kanan, R., & Vispute, S. (2021). Data Analysis and Price Prediction of Black Friday Sales using Machine Learning Techniques. International Journal of Engineering Research & Technology (IJERT), 10, 7. doi:10.17577/IJERTV10IS070271

Alagarsamy, S., Varma, K. G., Harshitha, K., Hareesh, K., & Varshini, K. (2023). Predictive Analytics for Black Friday Sales using Machine Learning Technique. 2023 International Conference on Intelligent Data Communication Technologies and Internet of Things (IDCIoT), 389-393. doi:10.1109/IDCIoT56793.2023.10053454

Cheriyan, S., Ibrahim, S., Mohanan, S., & Treesa, S. (2018). Intelligent Sales Prediction Using Machine Learning Techniques. 2018 International Conference on Computing, Electronics & Communications Engineering (iCCECE), 53-58. doi:10.1109/iCCECOME.2018.8659115

Sathyanarayana, S., Apeksha, C., Chethana, S., Chinmayee, H. C., & Abhishree, G, L. (2023). Big Mart Sales Prediction Using Machine Learning. International Journal of Advanced Research in Computer and Communication Engineering (IJARCCE), 12. doi:10.17148/IJARCCE.2023.124112