GROUP MEMBERS(GROUP 9)
IDA TEH PING LI 22074634
CHEN XIN WEI 17202767
SUM SI YENG 22068843
YAP ZHI REN 22056395
YEAP YU XIAN 17203286
1. Introduction
In the current era of data-driven decision making, businesses continuously strive to understand and enhance customer value. The project is designed to leverage data analytics in deciphering customer behaviors and preferences. This endeavor involves utilizing advanced statistical models and machine learning techniques to not only predict customer spending but also to segment customers based on their potential value to the business. The overarching aim is to equip the company with actionable insights that enable more effective targeting, personalized marketing, and optimized resource allocation.
1.1 Problems
Prediction of Purchase Amount (Regression Problem):
-Problem Statement: Determining the extent to which demographic patterns, historical purchase data, and purchasing behaviors such as payment methods and promo code usage, influence a customer’s purchase amount.
-Relevance: Understanding these variables can significantly enhance the accuracy of predictive models, enabling businesses to forecast sales and tailor their marketing strategies to individual customer profiles.
Identification of Potential Customers (Classification Problem):
-Problem Statement: Identifying key customer segments, particularly those who are likely to be high-value, using methods like basket analysis to understand popular product combinations.
-Relevance: This segmentation is critical for targeted marketing efforts, ensuring that resources are focused on the most profitable segments.
1.2 Objectives
1.2.1 Objective for Regression: To predict the purchase amount a customer is likely to spend, incorporating customer profiles for more accurate targeting.
Problem Statement: How do demographic patterns, historical purchase data and other purchasing behaviors (payment method, promo code utilization) impact a customer's purchase amount?
Strategy: To discover the features that influence a customer's purchase amount and to make predictions by building a multiple regression model using the variables provided.
Expected Outcome: The model will provide a continuous prediction of the purchase amount and insights into consumer profiles, enabling the business to anticipate potential revenue from individual customers and tailor marketing strategies more effectively.
1.2.2 Objective for Classification: To identify high-value customers, and to use basket analysis to understand high-selling product combinations.
Problem Statement: Who are the company's target customers? The aim is to identify customer groups by separating customers into categories (high value or non-high value), making it easier for the business to find its target marketing audience.
Strategy: To discover the features that influence the customer class and to make predictions by building a classification model using the selected features.
Expected Outcome: The model will classify each customer as 'high-value' or 'not high-value', and basket analysis will provide insights into the product combinations that are most attractive to the targeted customer group. This allows the business to focus its marketing efforts and tailor its product offerings more strategically.
2.1 Dataset introduction
Dataset: Customer Shopping Trends
The dataset that we utilize is “Customer Shopping Trends”. It consists of 18 columns and 3,900 observations.
The dataset is saved in CSV format before being imported into R.
2.2 Import dataset
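The import chunk is not echoed here; a minimal sketch is shown below, assuming the file is named "shopping_trends.csv" (hypothetical) and that base read.csv() was used, which is consistent with the plain data.frame class and the dot-separated column names seen in the data exploration.
# Illustrative import (assumed file name): read the CSV into a data.frame
Shopping_data <- read.csv("shopping_trends.csv")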
2.3 Load Libraries
library(plyr)
## Warning: package 'plyr' was built under R version 4.3.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.2
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.2
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
library(rsample)
## Warning: package 'rsample' was built under R version 4.3.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
library(arules)
## Warning: package 'arules' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
2.4 Data Exploration
class(Shopping_data)
## [1] "data.frame"
str(Shopping_data)
## 'data.frame': 3900 obs. of 18 variables:
## $ Customer.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 55 19 50 21 45 46 63 27 26 57 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Purchase.Amount..USD. : int 53 64 73 90 49 20 85 34 97 31 ...
## $ Location : chr "Kentucky" "Maine" "Massachusetts" "Rhode Island" ...
## $ Size : chr "L" "L" "S" "M" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Review.Rating : num 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "Express" "Express" "Free Shipping" "Next Day Air" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Previous.Purchases : int 14 2 23 49 31 14 49 19 8 4 ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Frequency.of.Purchases: chr "Fortnightly" "Fortnightly" "Weekly" "Weekly" ...
dim(Shopping_data)
## [1] 3900 18
summary(Shopping_data)
## Customer.ID Age Gender Item.Purchased
## Min. : 1.0 Min. :18.00 Length:3900 Length:3900
## 1st Qu.: 975.8 1st Qu.:31.00 Class :character Class :character
## Median :1950.5 Median :44.00 Mode :character Mode :character
## Mean :1950.5 Mean :44.07
## 3rd Qu.:2925.2 3rd Qu.:57.00
## Max. :3900.0 Max. :70.00
## Category Purchase.Amount..USD. Location Size
## Length:3900 Min. : 20.00 Length:3900 Length:3900
## Class :character 1st Qu.: 39.00 Class :character Class :character
## Mode :character Median : 60.00 Mode :character Mode :character
## Mean : 59.76
## 3rd Qu.: 81.00
## Max. :100.00
## Color Season Review.Rating Subscription.Status
## Length:3900 Length:3900 Min. :2.50 Length:3900
## Class :character Class :character 1st Qu.:3.10 Class :character
## Mode :character Mode :character Median :3.70 Mode :character
## Mean :3.75
## 3rd Qu.:4.40
## Max. :5.00
## Shipping.Type Discount.Applied Promo.Code.Used Previous.Purchases
## Length:3900 Length:3900 Length:3900 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:13.00
## Mode :character Mode :character Mode :character Median :25.00
## Mean :25.35
## 3rd Qu.:38.00
## Max. :50.00
## Payment.Method Frequency.of.Purchases
## Length:3900 Length:3900
## Class :character Class :character
## Mode :character Mode :character
##
##
##
colSums(is.na(Shopping_data))
## Customer.ID Age Gender
## 0 0 0
## Item.Purchased Category Purchase.Amount..USD.
## 0 0 0
## Location Size Color
## 0 0 0
## Season Review.Rating Subscription.Status
## 0 0 0
## Shipping.Type Discount.Applied Promo.Code.Used
## 0 0 0
## Previous.Purchases Payment.Method Frequency.of.Purchases
## 0 0 0
sum(duplicated(Shopping_data))
## [1] 0
2.4.1 Data Transformation
2.4.1.1 Age Group
Shopping_data <- Shopping_data %>% mutate(
Age_Group=ifelse(Age>=18 & Age<=24,"18 to 24",
ifelse(Age>=25 & Age<=34,"25 to 34",
ifelse(Age>=35 & Age<=44,"35 to 44",
ifelse(Age>=45 & Age<=54,"45 to 54",
ifelse(Age>=55 & Age<=64,"55 to 64",
ifelse(Age>=65,"65 or older",NA)))))))
Shopping_data <- Shopping_data %>%
select(1:(which(names(Shopping_data) == "Gender")),
"Age_Group",
(which(names(Shopping_data) == "Gender") + 1):length(names(Shopping_data)))
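An equivalent and more compact way to build the same bins (illustrative only; not used in the analysis) is base R's cut():
# Illustrative alternative: the same age bins via cut()
age_group_alt <- cut(Shopping_data$Age,
breaks = c(17, 24, 34, 44, 54, 64, Inf),
labels = c("18 to 24", "25 to 34", "35 to 44", "45 to 54", "55 to 64", "65 or older"))
table(age_group_alt == Shopping_data$Age_Group) # expected to be all TRUE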
2.4.1.2 Purchasing Frequency Group
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(
Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly" | Frequency.of.Purchases=="Fortnightly" |Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Lower",NA))))
2.4.1.3 Previous Purchase Group
Shopping_data <- Shopping_data %>% mutate(
Previous.Purchases.Group = ifelse(Previous.Purchases>=1 & Previous.Purchases<=10,"1 to 10",
ifelse(Previous.Purchases>=11 & Previous.Purchases<=20,"11 to 20",
ifelse(Previous.Purchases>=21 & Previous.Purchases<=30,"21 to 30",
ifelse(Previous.Purchases>=31 & Previous.Purchases<=40,"31 to 40",
ifelse(Previous.Purchases>=41 & Previous.Purchases<=50,"41 to 50",NA))))))
2.4.1.4 Identify Customer Class
Shopping_data<- Shopping_data %>%
mutate(
Customer_class=ifelse((Purchase.Amount..USD.>=60 & Purchases.Frequency.Group=="High"),"High Value",
"Non High Value"))
2.5 Remove Unnecessary Columns
Shopping_data <- subset(Shopping_data, select = -c(Customer.ID,Review.Rating, Size))
2.6 Export Cleaned Dataset
write_csv(Shopping_data, file = "Shopping Trend Dataset (Cleaned).csv")
3.1 Univariate Analysis
3.1.1 Purchase Amount (Numeric)
ggplot(Shopping_data, aes(x = Purchase.Amount..USD., fill = after_stat(count))) +
geom_histogram(binwidth = 10, color = "white", alpha = 0.7) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Histogram of Purchase Amount",
x = "Purchase Amount (USD)",
y = "Frequency")
3.1.2 Age Group (Categorical)
Age_cat<- table(Shopping_data$Age_Group)
names(Age_cat)[1] <- "Age"
Age_cat <- data.frame(Age = c("18 to 24, 12.46%","25 to 34, 19.36%","35 to 44, 18.69%","45 to 54, 19.28%","55 to 64, 19.26%","65 or older, 10.95%"),
freq = c(486,755,729,752,751,427))
pie(Age_cat$freq, labels = Age_cat$Age , col = c("pink", "yellow", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Age Distribution")
3.1.3 Gender (Categorical)
gender_cat<- table(Shopping_data$Gender)
names(gender_cat)[1] <- "Gender"
gender_cat <- data.frame(Gender = c("Female, 32%","Male, 68%"),
freq = c(1248,2652))
pie(gender_cat$freq, labels = gender_cat$Gender , col = c("pink", "lightblue"), main = "Gender Distribution")
3.1.5 Subscription Status (Categorical)
substatus_cat<- table(Shopping_data$Subscription.Status)
names(substatus_cat)[1] <- "Status"
substatus_cat <- data.frame(Status = c("Yes, 73%","No, 27%"),
freq = c(2847,1053))
pie(substatus_cat$freq, labels = substatus_cat$Status , col = c("rosybrown", "lightyellow"), main = "Subscription Status")
3.1.6 Discount Applied (Categorical)
disc_cat<- table(Shopping_data$Discount.Applied)
names(disc_cat)[1] <- "Discount Application"
disc_cat <- data.frame(Discount= c("Yes, 57%","No, 43%"),
freq = c(2223,1677))
pie(disc_cat$freq, labels = disc_cat$Discount , col = c("lightblue", "lightyellow"), main = "Discount Application")
3.1.8 Previous Purchases Group (Categorical)
purchasegrp_cat<- table(Shopping_data$Previous.Purchases.Group)
names(purchasegrp_cat)[1] <- "Group"
purchasegrp_cat <- data.frame(Group= c("1 to 10, 19.18%","11 to 20, 19.92%","21 to 30, 20.26%","31 to 40, 20.1%","41 to 50, 19.62%"),
freq = c(784,777,790,784,765))
pie(purchasegrp_cat$freq, labels = purchasegrp_cat$Group , col = c("pink", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Previous Purchases Category")
3.1.9 Frequency of Purchases (Categorical)
# Identify the most common purchase frequency level
max_frequency <- names(which.max(table(Shopping_data$Frequency.of.Purchases)))
ggplot(Shopping_data, aes(x = Frequency.of.Purchases, fill = Frequency.of.Purchases)) +
geom_bar() +
scale_fill_manual(values = ifelse(sort(unique(Shopping_data$Frequency.of.Purchases)) == max_frequency, "lightblue", "darkblue")) +
labs(title = "Bar Chart of Frequency of Purchases",
x = "Frequency of Purchases",
y = "Count")
3.1.10 Purchase Frequency Group
purchasefreq_cat<- table(Shopping_data$Purchases.Frequency.Group)
names(purchasefreq_cat)[1] <- "Group"
purchasefreq_cat <- data.frame(Group= c("High, 41.74%","Lower, 14.67%","Normal, 43.59%"),
freq = c(1628,572,1700))
pie(purchasefreq_cat$freq, labels = purchasefreq_cat$Group , col = c("pink", "lightblue","lightyellow"), main = "Purchases Frequency")
3.1.11 Customer Class
custclass_cat<- table(Shopping_data$Customer_class)
names(custclass_cat)[1] <- "Class"
custclass_cat <- data.frame(Class= c("High Value, 21%","Non High Value, 79%"),
freq = c(819,3081))
pie(custclass_cat$freq, labels = custclass_cat$Class , col = c("gold","lightgrey"), main = "Customer Class")
3.2 Bivariate Analysis
3.2.1 Purchase Amount and Gender
ggplot(Shopping_data, aes(x = Gender, y = Purchase.Amount..USD.,fill=Gender)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Gender", x="Gender", y="Mean Purchase Amount(USD)",fill="Gender")
3.2.2 Purchase Amount and Previous Purchases Group
ggplot(Shopping_data, aes(x = Previous.Purchases.Group, y = Purchase.Amount..USD.,fill=Previous.Purchases.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Previous Purchases Group", x="Previous Purchases Group",
y="Mean Purchase Amount(USD)",fill="Previous Purchases Group")
3.2.3 Purchase Amount and Purchases Frequency Group
ggplot(Shopping_data, aes(x = Purchases.Frequency.Group, y = Purchase.Amount..USD.,fill=Purchases.Frequency.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Purchases Frequency Group", x="Purchases Frequency Group",
y="Mean Purchase Amount(USD)",fill="Purchases Frequency Group")
3.2.4 Purchase Amount and Season
ggplot(Shopping_data, aes(x = Season, y = Purchase.Amount..USD.,fill=Season)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Season", x="Season", y="Mean Purchase Amount(USD)",fill="Season")
3.2.5 Purchase Amount and Subscription Status
ggplot(Shopping_data, aes(x = Subscription.Status, y = Purchase.Amount..USD.,fill=Subscription.Status)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Subscription Status", x="Subscription Status",
y="Mean Purchase Amount(USD)",fill="Subscription Status")
3.2.6 Purchase Amount and Shipping Type
ggplot(Shopping_data, aes(x = Shipping.Type, y = Purchase.Amount..USD.,fill=Shipping.Type)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Shipping Type", x="Shipping Type", y="Mean Purchase Amount(USD)",fill="Shipping Type")
3.2.7 Purchase Amount and Discount Applied
ggplot(Shopping_data, aes(x = Discount.Applied, y = Purchase.Amount..USD.,fill=Discount.Applied)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Discount Applied", x="Discount Applied", y="Mean Purchase Amount(USD)",fill="Discount Applied")
3.2.8 Purchase Amount and Payment Method
ggplot(Shopping_data, aes(x = Payment.Method, y = Purchase.Amount..USD.,fill=Payment.Method)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Payment Method", x="Payment Method", y="Mean Purchase Amount(USD)",fill="Payment Method")
3.2.9 Purchase Amount and Customer Class
ggplot(Shopping_data, aes(x = Customer_class, y = Purchase.Amount..USD.,fill=Customer_class)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Customer Class", x="Customer Class", y="Mean Purchase Amount(USD)",fill="Customer Class")
3.3 Multivariate Analysis
Shopping_trend <- lm(Purchase.Amount..USD. ~ Gender + Age_Group + Item.Purchased + Category + Color + Season + Subscription.Status + Shipping.Type + Discount.Applied + Payment.Method + Purchases.Frequency.Group + Customer_class , data = Shopping_data)
anova(Shopping_trend)
## Analysis of Variance Table
##
## Response: Purchase.Amount..USD.
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 431 431 1.1231 0.2893159
## Age_Group 5 1274 255 0.6634 0.6512356
## Item.Purchased 24 10956 456 1.1884 0.2396422
## Color 24 19477 812 2.1126 0.0012172 **
## Season 3 6926 2309 6.0101 0.0004419 ***
## Subscription.Status 1 10 10 0.0258 0.8723082
## Shipping.Type 5 3245 649 1.6894 0.1335446
## Discount.Applied 1 239 239 0.6213 0.4306093
## Payment.Method 5 1577 315 0.8212 0.5343410
## Purchases.Frequency.Group 2 185 93 0.2408 0.7859994
## Customer_class 1 672945 672945 1751.8669 < 2.2e-16 ***
## Residuals 3827 1470066 384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.0 Modeling
4.1 Multiple Regression Model
# Performing one-hot encoding on all categorical variables
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df_all <- data.frame(predict(dummy, newdata = df))
head(df_all)
## Purchase.Amount..USD. Age GenderFemale GenderMale Age_Group18.to.24
## 1 53 55 0 1 0
## 2 64 19 0 1 1
## 3 73 50 0 1 0
## 4 90 21 0 1 1
## 5 49 45 0 1 0
## 6 20 46 0 1 0
## Age_Group25.to.34 Age_Group35.to.44 Age_Group45.to.54 Age_Group55.to.64
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 1 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## Age_Group65.or.older Item.PurchasedBackpack Item.PurchasedBelt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedBlouse Item.PurchasedBoots Item.PurchasedCoat
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## Item.PurchasedDress Item.PurchasedGloves Item.PurchasedHandbag
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedHat Item.PurchasedHoodie Item.PurchasedJacket
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedJeans Item.PurchasedJewelry Item.PurchasedPants
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSandals Item.PurchasedScarf Item.PurchasedShirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedShoes Item.PurchasedShorts Item.PurchasedSkirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSneakers Item.PurchasedSocks Item.PurchasedSunglasses
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## Item.PurchasedSweater Item.PurchasedT.shirt CategoryAccessories
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## CategoryClothing CategoryFootwear CategoryOuterwear LocationAlabama
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 0 1 0 0
## 5 1 0 0 0
## 6 0 1 0 0
## LocationAlaska LocationArizona LocationArkansas LocationCalifornia
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationColorado LocationConnecticut LocationDelaware LocationFlorida
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationGeorgia LocationHawaii LocationIdaho LocationIllinois LocationIndiana
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationIowa LocationKansas LocationKentucky LocationLouisiana LocationMaine
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationMaryland LocationMassachusetts LocationMichigan LocationMinnesota
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationMississippi LocationMissouri LocationMontana LocationNebraska
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNevada LocationNew.Hampshire LocationNew.Jersey LocationNew.Mexico
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNew.York LocationNorth.Carolina LocationNorth.Dakota LocationOhio
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationOklahoma LocationOregon LocationPennsylvania LocationRhode.Island
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 1 0 0
## 6 0 0 0 0
## LocationSouth.Carolina LocationSouth.Dakota LocationTennessee LocationTexas
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationUtah LocationVermont LocationVirginia LocationWashington
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationWest.Virginia LocationWisconsin LocationWyoming ColorBeige ColorBlack
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 1 0 0
## ColorBlue ColorBrown ColorCharcoal ColorCyan ColorGold ColorGray ColorGreen
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## ColorIndigo ColorLavender ColorMagenta ColorMaroon ColorOlive ColorOrange
## 1 0 0 0 0 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorPeach ColorPink ColorPurple ColorRed ColorSilver ColorTeal
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorTurquoise ColorViolet ColorWhite ColorYellow SeasonFall SeasonSpring
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 1
## 4 0 0 0 0 0 1
## 5 1 0 0 0 0 1
## 6 0 0 1 0 0 0
## SeasonSummer SeasonWinter Subscription.StatusNo Subscription.StatusYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 1 0 0 1
## Shipping.Type2.Day.Shipping Shipping.TypeExpress Shipping.TypeFree.Shipping
## 1 0 1 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## Shipping.TypeNext.Day.Air Shipping.TypeStandard Shipping.TypeStore.Pickup
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 1 0
## Discount.AppliedNo Discount.AppliedYes Promo.Code.UsedNo Promo.Code.UsedYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 1 0 1
## 4 0 1 0 1
## 5 0 1 0 1
## 6 0 1 0 1
## Previous.Purchases Payment.MethodBank.Transfer Payment.MethodCash
## 1 14 0 0
## 2 2 0 1
## 3 23 0 0
## 4 49 0 0
## 5 31 0 0
## 6 14 0 0
## Payment.MethodCredit.Card Payment.MethodDebit.Card Payment.MethodPayPal
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 0
## Payment.MethodVenmo Frequency.of.PurchasesAnnually
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 1
## 6 1 0
## Frequency.of.PurchasesBi.Weekly Frequency.of.PurchasesFortnightly
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesMonthly Frequency.of.PurchasesQuarterly
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesWeekly Purchases.Frequency.GroupHigh
## 1 0 1
## 2 0 1
## 3 1 1
## 4 1 1
## 5 0 0
## 6 1 1
## Purchases.Frequency.GroupLower Purchases.Frequency.GroupNormal
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 0 0
## Previous.Purchases.Group1.to.10 Previous.Purchases.Group11.to.20
## 1 0 1
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 1
## Previous.Purchases.Group21.to.30 Previous.Purchases.Group31.to.40
## 1 0 0
## 2 0 0
## 3 1 0
## 4 0 0
## 5 0 1
## 6 0 0
## Previous.Purchases.Group41.to.50 Customer_classHigh.Value
## 1 0 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 0 0
## 6 0 0
## Customer_classNon.High.Value
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Purpose: Transform all categorical variables into numeric dummy (one-hot encoded) variables before performing the correlation calculation.
# Correlation Calculation
correlation_matrix <- as.data.frame(cor(df_all))
# correlation_matrix
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>=0.03
Purpose: To check the correlation between each variable and the purchase amount, and to identify features whose correlation could be improved through re-engineering.
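As an illustrative aid (not part of the original output), the dummy variables most correlated with the purchase amount can be listed directly from the matrix computed above; the top entry is the target itself, with correlation 1.
# Illustrative: dummy variables with the largest absolute correlation to the target
corr_with_target <- correlation_matrix$Purchase.Amount..USD.
names(corr_with_target) <- rownames(correlation_matrix)
head(sort(abs(corr_with_target), decreasing = TRUE), 10)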
# Combining Locations based on Regions in USA
northeast = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania','Northeast')
midwest = c('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota','MidWest')
south = c('Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas','South')
west = c('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington','West')
Shopping_data <- Shopping_data %>% mutate(Location = ifelse(Location %in% northeast,'Northeast',
ifelse(Location %in% midwest,'MidWest',
ifelse(Location %in% south,'South',
ifelse(Location %in% west,'West','')))))
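A short illustrative sanity check confirms that every state was mapped to one of the four regions and that nothing fell through to the empty-string fallback:
# Illustrative check: all locations should now be one of the four regions
table(Shopping_data$Location)
stopifnot(all(Shopping_data$Location %in% c("Northeast", "MidWest", "South", "West")))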
# Checking Correlation Between Target and Newly Augmented Data
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Location"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. LocationMidWest LocationNortheast
## Purchase.Amount..USD. 1.000000000 -0.01052207 -0.009970438
## LocationMidWest -0.010522075 1.00000000 -0.257040606
## LocationNortheast -0.009970438 -0.25704061 1.000000000
## LocationSouth -0.015834876 -0.39100423 -0.317815628
## LocationWest 0.035718023 -0.33421880 -0.271659356
## LocationSouth LocationWest
## Purchase.Amount..USD. -0.01583488 0.03571802
## LocationMidWest -0.39100423 -0.33421880
## LocationNortheast -0.31781563 -0.27165936
## LocationSouth 1.00000000 -0.41324193
## LocationWest -0.41324193 1.00000000
Result: The strongest correlation with purchase amount is about 0.036, from LocationWest.
# Transform Frequency of Purchases to Low, Normal, High
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly"|Frequency.of.Purchases=="Fortnightly"|Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Low",NA))))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Purchases.Frequency.Group"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD.
## Purchase.Amount..USD. 1.000000000
## Purchases.Frequency.GroupHigh -0.006660787
## Purchases.Frequency.GroupLow 0.007154926
## Purchases.Frequency.GroupNormal 0.001519443
## Purchases.Frequency.GroupHigh
## Purchase.Amount..USD. -0.006660787
## Purchases.Frequency.GroupHigh 1.000000000
## Purchases.Frequency.GroupLow -0.350937188
## Purchases.Frequency.GroupNormal -0.744108314
## Purchases.Frequency.GroupLow
## Purchase.Amount..USD. 0.007154926
## Purchases.Frequency.GroupHigh -0.350937188
## Purchases.Frequency.GroupLow 1.000000000
## Purchases.Frequency.GroupNormal -0.364434493
## Purchases.Frequency.GroupNormal
## Purchase.Amount..USD. 0.001519443
## Purchases.Frequency.GroupHigh -0.744108314
## Purchases.Frequency.GroupLow -0.364434493
## Purchases.Frequency.GroupNormal 1.000000000
Result: The strongest correlation is only about 0.007, from Purchases.Frequency.GroupLow.
# Transform Shipping Type to High (Express) and Low (all other types)
Shopping_data<- Shopping_data %>% mutate(Shipping.Type = ifelse(Shipping.Type=="Express","High","Low"))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Shipping.Type"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. Shipping.TypeHigh Shipping.TypeLow
## Purchase.Amount..USD. 1.00000000 0.01337441 -0.01337441
## Shipping.TypeHigh 0.01337441 1.00000000 -1.00000000
## Shipping.TypeLow -0.01337441 -1.00000000 1.00000000
Result: The strongest correlation is only about 0.013, from Shipping.TypeHigh.
# Select Re-engineered Features based on correlation values
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df <- data.frame(predict(dummy, newdata = df))
correlation_matrix <- as.data.frame(cor(df))
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>0.03
df = df[,sel_index]
# df
Purpose: To select the variables with comparatively higher correlation values. However, no variable in the data set shows a strong correlation with the purchase amount, so only the dummy variables whose absolute correlation exceeds 0.03 are retained for model building, training and testing. The dummy variables derived from the target-related customer class column are removed in the data-splitting step below.
# Split data into "Train" and "Test"
set.seed(1)
trainIndex <- createDataPartition(df$Purchase.Amount..USD., p = 0.8, list = FALSE)
# Forming train data
train <- df[trainIndex, ]
# Removing columns derived from target column
train <- subset(train, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
# Forming test data
test <- df[-trainIndex, ]
# Removing columns derived from target column
test <- subset(test, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
Purpose: Prepare data to train and test during model building.
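An illustrative check that the split worked as intended and that the target-derived dummy columns were removed:
# Illustrative check: split sizes and absence of target-derived columns
dim(train)
dim(test)
grep("Customer_class", names(train), value = TRUE) # expected to return character(0)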
# Create the linear regression model
lmPurchaseAmount <- lm(Purchase.Amount..USD. ~ ., data = train)
#Review the results
summary(lmPurchaseAmount)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Predict data using the model & review result
predictions_LR <- predict(lmPurchaseAmount, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_LR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_LR)^2))
RMSE_LR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_LR <- mae(test$Purchase.Amount..USD.,predictions_LR)
MAE_LR
## [1] 20.46703
# Create the random forest
set.seed(123)
RF_model<-randomForest(formula=Purchase.Amount..USD.~.,data=train)
# Predict data using the model
predictions_RF <- predict(RF_model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_RF <- sqrt(mean((test$Purchase.Amount..USD. - predictions_RF)^2))
RMSE_RF
## [1] 23.65286
# Calculate Mean Absolute Error (MAE)
MAE_RF <- mae(test$Purchase.Amount..USD.,predictions_RF)
MAE_RF
## [1] 20.47543
# Create the decision tree
fit <- rpart(Purchase.Amount..USD. ~ .,
method = "anova", data = train, control = rpart.control(minsplit = 1, minbucket = 1, cp = 0))
# Predict data using the model
predictions_DT<-predict(fit, test, method = "anova")
# Calculate the root mean squared error (RMSE)
RMSE_DT <- sqrt(mean((test$Purchase.Amount..USD. - predictions_DT)^2))
RMSE_DT
## [1] 23.70689
# Calculate Mean Absolute Error (MAE)
MAE_DT <- mae(test$Purchase.Amount..USD.,predictions_DT)
MAE_DT
## [1] 20.48466
# Create the SVM Regressor
model <- train(Purchase.Amount..USD.~., data = train, method = "svmLinear")
# Predict data using the SVM model
predictions_SVM <- predict(model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_SVM <- sqrt(mean((test$Purchase.Amount..USD. - predictions_SVM)^2))
RMSE_SVM
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_SVM <- mae(test$Purchase.Amount..USD.,predictions_SVM)
MAE_SVM
## [1] 20.46703
# Create the polynomial regression model
# (all retained predictors are 0/1 dummies, so no polynomial terms are added and this fits the same linear specification as above)
PurchaseAmount_PR <- lm(Purchase.Amount..USD. ~ ., data = train)
#Review the results
summary(PurchaseAmount_PR)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Make predictions using the testing data
predictions_PR <- predict(PurchaseAmount_PR, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_PR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_PR)^2))
RMSE_PR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_PR <- mae(test$Purchase.Amount..USD.,predictions_PR)
MAE_PR
## [1] 20.46703
# Perform gradient boosting on Linear Regression Model
boost = gbm(Purchase.Amount..USD. ~ ., data = train,
distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
verbose = FALSE, n.cores = 1)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
best.iter <- gbm.perf(boost, method = "test")
# Make predictions using the testing data
predictions_GBLR <- predict(boost, newdata = test, n.trees = best.iter, type = "link")
RMSE_GBLR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_GBLR)^2))
RMSE_GBLR
## [1] 23.65238
# Calculate Mean Absolute Error (MAE)
MAE_GBLR <- mae(test$Purchase.Amount..USD., predictions_GBLR)
MAE_GBLR
## [1] 20.46703
Purpose: To try to boost the performance of the linear regression model, since it was the model originally chosen at the start of the project.
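For reference, the selected number of boosting iterations and the relative influence of each predictor can be inspected as follows (illustrative; output not reproduced here):
# Illustrative: chosen iteration count and relative influence of predictors
best.iter
summary(boost, n.trees = best.iter, plotit = FALSE)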
4.1.5 Evaluation
# Create evaluation metrics based on RMSE, MAE from all models built
RMSE <- c(27.90801, 24.04091, 33.84289, 24.08463, 24.08463, 23.7658)
MAE <- c(20.72025, 20.69969, 27.63744, 20.72025, 20.72025, 20.72025)
df_EM <- data.frame(RMSE, MAE)
rownames(df_EM) <- c("Linear Regression", "Random Forest", "Decision Tree", "Support Vector Machine", "Polynomial Regression", "Linear Regression (after gradient boosting)")
df_EM
## RMSE MAE
## Linear Regression 27.90801 20.72025
## Random Forest 24.04091 20.69969
## Decision Tree 33.84289 27.63744
## Support Vector Machine 24.08463 20.72025
## Polynomial Regression 24.08463 20.72025
## Linear Regression (after gradient boosting) 23.76580 20.72025
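To keep this comparison aligned with the metrics computed in the chunks above, the same table can instead be assembled from the stored objects (a sketch; the values would then match whatever the earlier chunks produced):
# Illustrative: build the evaluation table from the computed metrics
df_EM <- data.frame(
RMSE = c(RMSE_LR, RMSE_RF, RMSE_DT, RMSE_SVM, RMSE_PR, RMSE_GBLR),
MAE = c(MAE_LR, MAE_RF, MAE_DT, MAE_SVM, MAE_PR, MAE_GBLR),
row.names = c("Linear Regression", "Random Forest", "Decision Tree",
"Support Vector Machine", "Polynomial Regression",
"Linear Regression (after gradient boosting)")
)
df_EM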
From the evaluation metrics above, if we do not take the gradient-boosted linear regression model into consideration, the model that works best for predicting the customer purchase amount is the Random Forest model.
However, because the linear regression model was the model originally targeted for predicting customer purchase amount, gradient boosting was applied to it to boost its performance. After gradient boosting, the linear model clearly becomes the best-performing model of all those built.
4.1.6 Regression Model Summary
In short, without considering the gradient boosting result, the Random Forest model would be the best model for predicting the customer purchase amount. However, a model's performance can be boosted further with gradient boosting, in which the final prediction is built from an ensemble of many sequentially fitted models.
The main reason the regression models perform below average is the lack of correlation between the variables in the data set, as seen clearly in the correlation analysis carried out before the models were built.
In conclusion, for a regression model to perform better, at least some of the variables in the collected data set must have a stronger relationship with the target variable.
4.2 Classification Model
Step 1: Feature Selection - Eliminate unnecessary or redundant variables
temp1_df <-subset(Shopping_data, select = -c(Age,Previous.Purchases,Purchase.Amount..USD.,Purchases.Frequency.Group, Frequency.of.Purchases))
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age_Group : chr "55 to 64" "18 to 24" "45 to 54" "18 to 24" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Location : chr "South" "Northeast" "Northeast" "Northeast" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "High" "High" "Low" "Low" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Previous.Purchases.Group: chr "11 to 20" "1 to 10" "21 to 30" "41 to 50" ...
## $ Customer_class : chr "Non High Value" "High Value" "High Value" "High Value" ...
Test a random forest model on temp1_df to observe the baseline result
# Convert character variables to factors
char_vars <- sapply(temp1_df, is.character)
temp1_df[char_vars] <- lapply(temp1_df[char_vars], as.factor)
set.seed(123)
# Split the data into training and testing sets
splits <- initial_split(temp1_df, prop = 0.8)
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 3120 14
dim(test_data)
## [1] 780 14
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
train_target <- as.factor(train_target)
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
test_target <- as.factor(test_target)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Gender 19.34306
## Age_Group 90.68641
## Item.Purchased 207.94800
## Category 35.14460
## Location 62.36941
## Color 251.66618
## Season 62.51348
## Subscription.Status 16.56454
## Shipping.Type 19.10603
## Discount.Applied 13.54725
## Promo.Code.Used 13.66702
## Payment.Method 92.63161
## Previous.Purchases.Group 80.02004
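The same scores can also be viewed graphically with randomForest's built-in importance plot (illustrative):
# Illustrative: plot variable importance from the random forest
varImpPlot(rf_model, main = "Variable Importance (Random Forest)")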
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 0 1
## Non High Value 170 609
##
## Accuracy : 0.7808
## 95% CI : (0.7501, 0.8093)
## No Information Rate : 0.7821
## P-Value [Acc > NIR] : 0.5549
##
## Kappa : -0.0026
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000000
## Specificity : 0.998361
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.781772
## Prevalence : 0.217949
## Detection Rate : 0.000000
## Detection Prevalence : 0.001282
## Balanced Accuracy : 0.499180
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## NaN
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.5008
Check class distribution
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age_Group : Factor w/ 6 levels "18 to 24","25 to 34",..: 5 1 4 1 4 4 5 2 2 5 ...
## $ Item.Purchased : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
## $ Category : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
## $ Location : Factor w/ 4 levels "MidWest","Northeast",..: 3 2 2 2 4 4 4 3 3 1 ...
## $ Color : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
## $ Subscription.Status : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Shipping.Type : Factor w/ 2 levels "High","Low": 1 1 2 2 2 2 2 2 1 2 ...
## $ Discount.Applied : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Promo.Code.Used : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Payment.Method : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
## $ Previous.Purchases.Group: Factor w/ 5 levels "1 to 10","11 to 20",..: 2 1 3 5 4 2 5 2 1 1 ...
## $ Customer_class : Factor w/ 2 levels "High Value","Non High Value": 2 1 1 1 2 2 2 2 2 2 ...
table(temp1_df$Customer_class)
##
## High Value Non High Value
## 819 3081
Use SMOTE to balance the dataset. With perc.over = 300, three synthetic cases are generated for each of the 819 "High Value" cases (819 × 4 = 3,276 in total), and perc.under = 100 samples one "Non High Value" case for each synthetic case (2,457 in total), giving the class counts shown below.
# set.seed(123)
# Apply SMOTE to handle class imbalance
resampled_data <- SMOTE(Customer_class ~ ., data = temp1_df, perc.over = 300, perc.under = 100, k = 5)
# Display the head of the resampled data
head(resampled_data)
## Gender Age_Group Item.Purchased Category Location Color Season
## 2500 Male 45 to 54 Hoodie Clothing South Pink Summer
## 1361 Male 55 to 64 Skirt Clothing South Black Fall
## 1402 Male 18 to 24 Sandals Footwear MidWest Blue Fall
## 3507 Female 18 to 24 Shirt Clothing MidWest Olive Winter
## 1704 Male 35 to 44 Backpack Accessories West Black Spring
## 459 Male 45 to 54 Shorts Clothing South Black Winter
## Subscription.Status Shipping.Type Discount.Applied Promo.Code.Used
## 2500 No Low No No
## 1361 No High Yes Yes
## 1402 No Low Yes Yes
## 3507 No Low No No
## 1704 No Low No No
## 459 Yes Low Yes Yes
## Payment.Method Previous.Purchases.Group Customer_class
## 2500 Cash 1 to 10 Non High Value
## 1361 Bank Transfer 1 to 10 Non High Value
## 1402 PayPal 1 to 10 Non High Value
## 3507 Cash 11 to 20 Non High Value
## 1704 Bank Transfer 11 to 20 Non High Value
## 459 PayPal 41 to 50 Non High Value
table(resampled_data$Customer_class)
##
## High Value Non High Value
## 3276 2457
Step 2: Data Splitting - Split data into training and testing sets with stratified sampling
set.seed(123)
# Split the data into training and testing sets with stratified sampling
splits <- initial_split(resampled_data, prop = 0.8, strata = "Customer_class")
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 4585 14
dim(test_data)
## [1] 1148 14
# Separate the target variable from the predictors (using the training split only)
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
train_target <- as.factor(train_target)
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
test_target <- as.factor(test_target)
# Displays the proportions of each class in the training and testing sets.
prop.table(table(train_target))
## train_target
## High Value Non High Value
## 0.5714286 0.4285714
prop.table(table(test_target))
## test_target
## High Value Non High Value
## 0.5714286 0.4285714
# Displays summary statistics
summary(train_target)
## High Value Non High Value
## 3276 2457
summary(test_target)
## High Value Non High Value
## 656 492
Step 3: Classification Model Selection
A total of four models are trained on resampled_data and compared: random forest, decision tree, naive Bayes and SVM.
1a. Random forest model training with selected features based on the feature importance result
set.seed(123)
rf_model<-randomForest(formula=Customer_class~ Item.Purchased + Color ,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
1b. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 496 286
## Non High Value 160 206
##
## Accuracy : 0.6115
## 95% CI : (0.5826, 0.6398)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.003224
##
## Kappa : 0.1806
##
## Mcnemar's Test P-Value : 3.241e-09
##
## Sensitivity : 0.7561
## Specificity : 0.4187
## Pos Pred Value : 0.6343
## Neg Pred Value : 0.5628
## Prevalence : 0.5714
## Detection Rate : 0.4321
## Detection Prevalence : 0.6812
## Balanced Accuracy : 0.5874
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6342711
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7560976
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.689847
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.4126
1c. Model Training with consideration of all features
set.seed(123)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
1d. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 454 50
## Non High Value 202 442
##
## Accuracy : 0.7805
## 95% CI : (0.7554, 0.8041)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5685
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6921
## Specificity : 0.8984
## Pos Pred Value : 0.9008
## Neg Pred Value : 0.6863
## Prevalence : 0.5714
## Detection Rate : 0.3955
## Detection Prevalence : 0.4390
## Balanced Accuracy : 0.7952
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.9007937
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.6920732
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.7827586
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.7952
2a. Model Training
set.seed(123)
# Train a decision tree model
dt_model <- rpart(Customer_class ~ .,data = train_data, method = "class")
# Print the trained model
# print(model)
# Visualize the decision tree
rpart.plot(dt_model, main = "Decision Tree for Customer Classification", extra = 1)
# Make predictions on the test set
dt_predictions <- predict(dt_model, test_predictors, type = "class")
# Ensure consistent factor levels
test_target <- factor(test_target, levels = levels(dt_predictions))
2b. Model Evaluation
# Evaluate the model
dt_conf_matrix <- confusionMatrix(dt_predictions, test_target)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 467 255
## Non High Value 189 237
##
## Accuracy : 0.6132
## 95% CI : (0.5844, 0.6415)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.002224
##
## Kappa : 0.1969
##
## Mcnemar's Test P-Value : 0.002037
##
## Sensitivity : 0.7119
## Specificity : 0.4817
## Pos Pred Value : 0.6468
## Neg Pred Value : 0.5563
## Prevalence : 0.5714
## Detection Rate : 0.4068
## Detection Prevalence : 0.6289
## Balanced Accuracy : 0.5968
##
## 'Positive' Class : High Value
##
precision <- dt_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6468144
recall <- dt_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7118902
f1_score <- dt_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.6777939
roc_curve <- roc(test_target, as.numeric(dt_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.4032
3a. Model Training
set.seed(123)
# Train the Naive Bayes model
nb_model <- naiveBayes(train_predictors, train_target)
# print(nb_model)
# Predict using the Naive Bayes model
nb_predictions <- predict(nb_model, newdata = test_predictors)
3b. Model Evaluation
# Evaluate the model
nb_conf_matrix <- confusionMatrix(nb_predictions, test_target)
print(nb_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 477 250
## Non High Value 179 242
##
## Accuracy : 0.6263
## 95% CI : (0.5976, 0.6544)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 8.853e-05
##
## Kappa : 0.223
##
## Mcnemar's Test P-Value : 0.0007258
##
## Sensitivity : 0.7271
## Specificity : 0.4919
## Pos Pred Value : 0.6561
## Neg Pred Value : 0.5748
## Prevalence : 0.5714
## Detection Rate : 0.4155
## Detection Prevalence : 0.6333
## Balanced Accuracy : 0.6095
##
## 'Positive' Class : High Value
##
precision <- nb_conf_matrix$byClass["Precision"]
recall <- nb_conf_matrix$byClass["Recall"]
f1_score <- nb_conf_matrix$byClass["F1"]
cat("Precision:", precision, "\n")
## Precision: 0.656121
cat("Recall:", recall, "\n")
## Recall: 0.7271341
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.6898048
roc_curve <- roc(test_target, as.numeric(nb_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.3905
4a. Model Training
set.seed(123)
# Train the SVM model
svm_model <- svm(Customer_class ~ ., data = train_data, kernel = "radial")
# print(svm_model)
# Predict using the SVM model
svm_predictions <- predict(svm_model, newdata = test_predictors)
4b. Model Evaluation
# Create a confusion matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_target)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 472 222
## Non High Value 184 270
##
## Accuracy : 0.6463
## 95% CI : (0.6179, 0.674)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 1.336e-07
##
## Kappa : 0.2709
##
## Mcnemar's Test P-Value : 0.06632
##
## Sensitivity : 0.7195
## Specificity : 0.5488
## Pos Pred Value : 0.6801
## Neg Pred Value : 0.5947
## Prevalence : 0.5714
## Detection Rate : 0.4111
## Detection Prevalence : 0.6045
## Balanced Accuracy : 0.6341
##
## 'Positive' Class : High Value
##
precision_svm <- svm_conf_matrix$byClass["Precision"]
recall_svm <- svm_conf_matrix$byClass["Recall"]
f1_score_svm <- svm_conf_matrix$byClass["F1"]
accuracy_svm <- svm_conf_matrix$overall["Accuracy"]
print(precision_svm)
## Precision
## 0.6801153
print(recall_svm)
## Recall
## 0.7195122
print(f1_score_svm)
## F1
## 0.6992593
roc_curve <- roc(test_target, as.numeric(svm_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6341
Step 4: Result Interpretation
# Create a dataframe with updated values
results <- data.frame(
Model = c("Random Forest", "Decision Tree", "Naive Bayes", "Support Vector Machines"),
Accuracy = c(0.7805, 0.6132, 0.6263, 0.6463),
Precision = c(0.9007, 0.6468, 0.6561, 0.6801),
Recall = c(0.6920, 0.7118, 0.7271, 0.7195),
F1_Score = c(0.7827, 0.6777, 0.6898, 0.6992),
ROC_AUC = c(0.7952, 0.4032, 0.3905, 0.6341)
)
# Print the table using print
print(results)
## Model Accuracy Precision Recall F1_Score ROC_AUC
## 1 Random Forest 0.7805 0.9007 0.6920 0.7827 0.7952
## 2 Decision Tree 0.6132 0.6468 0.7118 0.6777 0.4032
## 3 Naive Bayes 0.6263 0.6561 0.7271 0.6898 0.3905
## 4 Support Vector Machines 0.6463 0.6801 0.7195 0.6992 0.6341
Overall, the Random Forest model performed well across most of the metrics, and its precision and F1 score are the highest, which meets our objective of identifying high-value customers.
Step 5: Cross Validation - to provide a more robust estimate of a model’s performance by mitigating issues related to the randomness of the data splitting process
set.seed(123)
# Define the training control with 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Specify the models you want to evaluate using the train function
models <- list(
RF = train(Customer_class ~ ., data = train_data, method = "rf", trControl = ctrl),
DT = train(Customer_class ~ ., data = train_data, method = "rpart", trControl = ctrl),
NB = train(Customer_class ~ ., data = train_data, method = "naive_bayes", trControl = ctrl),
SVM = train(Customer_class ~ ., data = train_data, method = "svmRadial", trControl = ctrl)
)
# Train the models and collect results
results <- resamples(models)
# Extract summary statistics
summary_stats <- summary(results)
# Print summary statistics
print(summary_stats)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, DT, NB, SVM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.7690632 0.7751092 0.7786269 0.7847361 0.7943269 0.8144105 0
## DT 0.6965066 0.7062677 0.7164593 0.7182117 0.7348660 0.7401747 0
## NB 0.5698690 0.5885564 0.5965099 0.5984650 0.6110243 0.6230937 0
## SVM 0.6906318 0.7064900 0.7128821 0.7138473 0.7212245 0.7363834 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5436177 0.5466511 0.5563583 0.5696238 0.5910538 0.6278843 0
## DT 0.3821263 0.4020200 0.4248155 0.4278105 0.4604941 0.4717058 0
## NB 0.1408193 0.1689375 0.1803927 0.1866546 0.2080342 0.2436204 0
## SVM 0.3686015 0.3962043 0.4107415 0.4123337 0.4265960 0.4589110 0
Random Forest again appears to be the most consistent performer, with the highest accuracy and kappa values across the ten folds.
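The fold-to-fold spread behind these summaries can also be inspected visually; caret's resamples object plugs directly into lattice plots (lattice is attached when caret is loaded), for example:
# Box-and-whisker plot of accuracy and kappa across the 10 folds
bwplot(results, metric = c("Accuracy", "Kappa"))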
Step 6: Basket analysis - to find popular product combinations within each customer class
Display the most frequently purchased items
# Convert data to transactions
transactions_data <- as(temp1_df[, "Item.Purchased", drop = FALSE], "transactions")
# Mine frequent itemsets with a minimum support of 1%
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.01))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 39
##
## create itemset ...
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 3900 column(s)] done [0.00s].
## writing ... [25 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Item.Purchased=Blouse} 0.04384615 171
## [2] {Item.Purchased=Jewelry} 0.04384615 171
## [3] {Item.Purchased=Pants} 0.04384615 171
## [4] {Item.Purchased=Shirt} 0.04333333 169
## [5] {Item.Purchased=Dress} 0.04256410 166
## [6] {Item.Purchased=Sweater} 0.04205128 164
## [7] {Item.Purchased=Jacket} 0.04179487 163
## [8] {Item.Purchased=Belt} 0.04128205 161
## [9] {Item.Purchased=Sunglasses} 0.04128205 161
## [10] {Item.Purchased=Coat} 0.04128205 161
## [11] {Item.Purchased=Sandals} 0.04102564 160
## [12] {Item.Purchased=Socks} 0.04076923 159
## [13] {Item.Purchased=Skirt} 0.04051282 158
## [14] {Item.Purchased=Shorts} 0.04025641 157
## [15] {Item.Purchased=Scarf} 0.04025641 157
## [16] {Item.Purchased=Hat} 0.03948718 154
## [17] {Item.Purchased=Handbag} 0.03923077 153
## [18] {Item.Purchased=Hoodie} 0.03871795 151
## [19] {Item.Purchased=Shoes} 0.03846154 150
## [20] {Item.Purchased=T-shirt} 0.03769231 147
## [21] {Item.Purchased=Sneakers} 0.03717949 145
## [22] {Item.Purchased=Boots} 0.03692308 144
## [23] {Item.Purchased=Backpack} 0.03666667 143
## [24] {Item.Purchased=Gloves} 0.03589744 140
## [25] {Item.Purchased=Jeans} 0.03179487 124
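The same single-item frequencies can be reviewed graphically; a short sketch using arules' itemFrequencyPlot on the transactions object created above:
# Bar chart of the 10 most frequently purchased items (relative support)
itemFrequencyPlot(transactions_data, topN = 10, type = "relative",
                  main = "Top 10 items purchased")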
Display the most frequently purchased items, combining item with category and color for a more specific view
# Combine columns into a new item column
temp1_df$Combined_Item <- with(temp1_df, paste(Category, Item.Purchased, Color, sep="_"))
# Convert "Gender" to a factor
temp1_df$Combined_Item <- as.factor(temp1_df$Combined_Item)
# Convert data to transactions
transactions_data <- as(temp1_df[, "Combined_Item", drop = FALSE], "transactions")
# Mine frequent itemsets for the combined item column
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.003))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.003 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 11
##
## create itemset ...
## set transactions ...[625 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating sparse bit matrix ... [22 row(s), 3900 column(s)] done [0.00s].
## writing ... [22 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Combined_Item=Clothing_Skirt_Black} 0.003846154 15
## [2] {Combined_Item=Accessories_Jewelry_Gray} 0.003589744 14
## [3] {Combined_Item=Accessories_Jewelry_Indigo} 0.003333333 13
## [4] {Combined_Item=Clothing_Shorts_Yellow} 0.003333333 13
## [5] {Combined_Item=Clothing_Hoodie_Pink} 0.003333333 13
## [6] {Combined_Item=Clothing_Pants_Charcoal} 0.003076923 12
## [7] {Combined_Item=Footwear_Sneakers_Indigo} 0.003076923 12
## [8] {Combined_Item=Footwear_Sandals_Purple} 0.003076923 12
## [9] {Combined_Item=Clothing_Sweater_Cyan} 0.003076923 12
## [10] {Combined_Item=Clothing_Dress_Charcoal} 0.003076923 12
## [11] {Combined_Item=Accessories_Sunglasses_Lavender} 0.003076923 12
## [12] {Combined_Item=Clothing_Pants_Turquoise} 0.003076923 12
## [13] {Combined_Item=Accessories_Scarf_Violet} 0.003076923 12
## [14] {Combined_Item=Clothing_Blouse_Violet} 0.003076923 12
## [15] {Combined_Item=Accessories_Belt_Blue} 0.003076923 12
## [16] {Combined_Item=Accessories_Sunglasses_Olive} 0.003076923 12
## [17] {Combined_Item=Accessories_Handbag_Charcoal} 0.003076923 12
## [18] {Combined_Item=Clothing_Pants_Cyan} 0.003076923 12
## [19] {Combined_Item=Clothing_Skirt_Teal} 0.003076923 12
## [20] {Combined_Item=Clothing_Sweater_Maroon} 0.003076923 12
## [21] {Combined_Item=Footwear_Boots_Violet} 0.003076923 12
## [22] {Combined_Item=Footwear_Shoes_Maroon} 0.003076923 12
Let’s categorize customers into four groups ("High Value Female", "High Value Male", "Non High Value Female", and "Non High Value Male") and find out which items are popular in each.
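Because the same filter-and-mine pattern is repeated for each of the four segments below, a small helper function can keep the code compact; the sketch below uses illustrative names and simply mirrors the per-segment steps that follow.
# Illustrative helper: mine frequent single items for one customer segment
segment_itemsets <- function(df, class, gender, supp) {
  seg <- subset(df, Customer_class == class & Gender == gender)
  trans <- as(split(seg$Combined_Item, seq_along(seg$Combined_Item)), "transactions")
  eclat(trans, parameter = list(supp = supp))
}
# Example: high-value female customers at 0.8% minimum support
inspect(segment_itemsets(temp1_df, "High Value", "Female", 0.008))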
Display the top items purchased by high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.008))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.008 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2
##
## create itemset ...
## set transactions ...[208 item(s), 255 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating sparse bit matrix ... [6 row(s), 255 column(s)] done [0.00s].
## writing ... [6 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Sunglasses_Lavender} 0.01960784 5
## [2] {Clothing_Blouse_Charcoal} 0.01176471 3
## [3] {Clothing_Skirt_Violet} 0.01176471 3
## [4] {Clothing_Hoodie_Charcoal} 0.01176471 3
## [5] {Accessories_Backpack_Yellow} 0.01176471 3
## [6] {Clothing_Dress_Olive} 0.01176471 3
Display the top items purchased by high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.006))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.006 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3
##
## create itemset ...
## set transactions ...[369 item(s), 564 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating sparse bit matrix ... [9 row(s), 564 column(s)] done [0.00s].
## writing ... [9 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Footwear_Boots_Violet} 0.008865248 5
## [2] {Clothing_Shorts_Yellow} 0.008865248 5
## [3] {Accessories_Jewelry_Orange} 0.007092199 4
## [4] {Clothing_Hoodie_White} 0.007092199 4
## [5] {Clothing_Blouse_Violet} 0.007092199 4
## [6] {Accessories_Jewelry_Gray} 0.007092199 4
## [7] {Footwear_Sandals_Lavender} 0.007092199 4
## [8] {Footwear_Sandals_Cyan} 0.007092199 4
## [9] {Accessories_Sunglasses_Purple} 0.007092199 4
Display the top items purchased by non-high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.005))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 4
##
## create itemset ...
## set transactions ...[483 item(s), 993 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating sparse bit matrix ... [13 row(s), 993 column(s)] done [0.00s].
## writing ... [13 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Outerwear_Jacket_Indigo} 0.006042296 6
## [2] {Accessories_Hat_Magenta} 0.005035247 5
## [3] {Clothing_Blouse_Violet} 0.005035247 5
## [4] {Accessories_Scarf_Magenta} 0.005035247 5
## [5] {Clothing_Shorts_Black} 0.005035247 5
## [6] {Footwear_Sandals_Blue} 0.005035247 5
## [7] {Accessories_Scarf_Green} 0.005035247 5
## [8] {Clothing_Skirt_White} 0.005035247 5
## [9] {Clothing_Socks_Pink} 0.005035247 5
## [10] {Accessories_Backpack_Green} 0.005035247 5
## [11] {Clothing_Skirt_Teal} 0.005035247 5
## [12] {Clothing_Pants_Yellow} 0.005035247 5
## [13] {Clothing_Blouse_Lavender} 0.005035247 5
Display the top items purchased by non-high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.0035))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.0035 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 7
##
## create itemset ...
## set transactions ...[608 item(s), 2088 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating sparse bit matrix ... [12 row(s), 2088 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Handbag_Charcoal} 0.004789272 10
## [2] {Accessories_Backpack_Olive} 0.004310345 9
## [3] {Clothing_Shirt_Orange} 0.004310345 9
## [4] {Clothing_Skirt_Black} 0.004310345 9
## [5] {Clothing_Sweater_Cyan} 0.004310345 9
## [6] {Accessories_Scarf_Blue} 0.003831418 8
## [7] {Footwear_Shoes_Gold} 0.003831418 8
## [8] {Clothing_Hoodie_Pink} 0.003831418 8
## [9] {Clothing_Pants_Turquoise} 0.003831418 8
## [10] {Accessories_Gloves_Turquoise} 0.003831418 8
## [11] {Footwear_Sandals_Orange} 0.003831418 8
## [12] {Accessories_Backpack_Black} 0.003831418 8
The comprehensive basket analysis across various customer segments reveals nuanced purchasing preferences.
High-value female customers tend to choose elegant accessories and specific clothing items, such as lavender sunglasses and charcoal blouses.
High-value male customers show a similar trend towards distinctive clothing and accessories, like violet boots.
Conversely, non-high-value female and male customers display different preferences, with choices leaning towards practical and varied items such as indigo jackets and charcoal handbags.
These insights are crucial for businesses to enhance their targeting and inventory strategies, aligning with the specific tastes and demands of each customer group, ultimately leading to improved customer satisfaction and increased sales efficiency.
5.0 Discussion
5.1 EDA
Discussion: The analysis of the data indicates that while some factors, such as customer class and seasonality, have a noticeable effect on purchase amounts, others such as subscription status and payment method do not significantly influence how much customers spend. The data suggests a need to focus on high-value customers and to account for seasonality in marketing and inventory strategies. Moreover, shipping preferences might be an indicator of customers’ willingness to spend more, possibly linked to the urgency or importance of the purchase.
The similarity in spending across many groups suggests that customers’ value perception may not be significantly influenced by factors such as subscription status or discounts, which are often considered levers to increase spending. Instead, intrinsic factors like the nature of the product and customer classification seem to play a more pivotal role.
This discussion of output should be used to inform business decisions and strategy, focusing on areas that the data has shown to have the most significant impact on customer spending. The company might consider investing more in understanding the high-value customer base and leveraging seasonality trends to maximize revenue.
5.2 Modeling
Discussion: The Random Forest model emerges as the most robust for both regression and classification tasks. It not only predicts purchase amounts with lower errors but also classifies customers with high accuracy and precision. The high ROC AUC value suggests that it is good at distinguishing between high-value and non-high-value customers.
Gradient boosting improved the performance of the Linear Regression model, indicating the potential benefits of ensemble methods in predictive modeling. However, the MAE across models remains consistent, suggesting a limitation in the models’ ability to improve the average error.
In summary, the Random Forest model should be prioritized for both predicting customer purchase amounts and for classifying customer value. For future model development, efforts might focus on further reducing the MAE and exploring additional ensemble techniques to enhance model performance.
6.0 Conclusion:
Regression model: The Random Forest model initially appeared to be the best for regression. However, the performance of the linear regression model improved after gradient boosting, making it the best model overall for predicting customer purchase amount.
Classification model: The Random Forest model outperforms the other models in identifying the high-value customer class.
In conclusion, our data analysis has provided valuable insights into the purchasing behaviors and preferences of our customer base. We have examined the factors influencing purchase amounts, including previous purchase history, purchase frequency, seasonality, subscription status, shipping type, discount application, and payment method. Notably, high-value customers were found to contribute substantially to overall purchase amounts, indicating the importance of focusing marketing efforts on this segment.
Our multiple regression models facilitated the prediction of purchase amounts, while classification models enabled the segmentation of customers into high-value and non-high-value groups. However, the lack of strong correlations among some variables presented challenges, suggesting a need for richer data for more accurate predictions.
The decision tree for customer classification illuminated the importance of specific products and demographic characteristics in predicting customer value. These insights can directly inform targeted marketing strategies and inventory management.
For future work, we recommend enhancing data collection processes to capture more nuanced customer interactions, and exploring advanced machine learning techniques to refine our predictive models. Continual reevaluation and adaptation of strategies in response to emerging data trends will be crucial to maintaining a competitive edge in customer value optimization.