GROUP MEMBERS (GROUP 9)
IDA TEH PING LI 22074634
CHEN XIN WEI 17202767
SUM SI YENG 22068843
YAP ZHI REN 22056395
YEAP YU XIAN 17203286
1. Introduction
In the current era of data-driven decision making, businesses continuously strive to understand and enhance customer value. The project titled “Customer Value Prediction and Segmentation: Purchasing Amount Predictive Modeling and Potential Customer Identification” is designed to leverage data analytics in deciphering customer behaviors and preferences. This endeavor involves utilizing advanced statistical models and machine learning techniques not only to predict customer spending but also to segment customers based on their potential value to the business. The overarching aim is to equip the company with actionable insights that enable more effective targeting, personalized marketing, and optimized resource allocation.
1.1 Problems
Prediction of Purchase Amount (Regression Problem):
-Problem Statement: Determining the extent to which demographic patterns, historical purchase data, and purchasing behaviors (such as payment methods and promo code usage) influence a customer’s purchase amount.
-Relevance: Understanding these variables can significantly enhance the accuracy of predictive models, enabling businesses to forecast sales and tailor their marketing strategies to individual customer profiles.
Identification of Potential Customers (Classification Problem):
-Problem Statement: Identifying key customer segments, particularly those who are likely to be high-value, using methods like basket analysis to understand popular product combinations.
-Relevance: This segmentation is critical for targeted marketing efforts, ensuring that resources are focused on the most profitable segments.
1.2 Objectives
Objective for Regression:
Goal: To develop a multiple regression model that accurately predicts the purchase amount a customer is likely to spend, based on a variety of influencing factors.
Delivery Method: Implementing a multiple regression approach, potentially supplemented with significance testing and model fitness evaluation, to uncover key features influencing customer spend.
Expected Outcome: A robust model offering continuous predictions of purchase amounts, coupled with insights into consumer profiles for strategic business planning.
Objective for Classification:
Goal: To classify customers into distinct categories, emphasizing the identification of high-value customers through basket analysis.
Delivery Method: Creating a classification model (such as a decision tree) that integrates basket analysis to discern patterns in high-selling product combinations.
Expected Outcome: A classification system that distinguishes ‘high-value’ customers from others, providing insights into product combinations preferred by these segments for strategic marketing and product development.
2.1 Dataset introduction
Dataset: Customer Shopping Trends
The dataset that we utilize is “Customer Shopping Trends”. This dataset consists of 18 columns and 3,900 observations.
The dataset is saved in CSV format before being imported into R.
2.2 Import dataset
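The import step itself is not shown; a minimal sketch, assuming the file is saved as “shopping_trends.csv” in the working directory (the file name is illustrative):
# Read the CSV into a data frame; base read.csv() turns the original headers
# (e.g. "Purchase Amount (USD)") into the dotted column names seen in str() below.
Shopping_data <- read.csv("shopping_trends.csv", stringsAsFactors = FALSE)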
2.3 Call libraries
library(plyr)
## Warning: package 'plyr' was built under R version 4.3.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.2
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.2
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
library(rsample)
## Warning: package 'rsample' was built under R version 4.3.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
library(arules)
## Warning: package 'arules' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
2.4 Data Exploration
class(Shopping_data)
## [1] "data.frame"
str(Shopping_data)
## 'data.frame': 3900 obs. of 18 variables:
## $ Customer.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 55 19 50 21 45 46 63 27 26 57 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Purchase.Amount..USD. : int 53 64 73 90 49 20 85 34 97 31 ...
## $ Location : chr "Kentucky" "Maine" "Massachusetts" "Rhode Island" ...
## $ Size : chr "L" "L" "S" "M" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Review.Rating : num 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "Express" "Express" "Free Shipping" "Next Day Air" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Previous.Purchases : int 14 2 23 49 31 14 49 19 8 4 ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Frequency.of.Purchases: chr "Fortnightly" "Fortnightly" "Weekly" "Weekly" ...
dim(Shopping_data)
## [1] 3900 18
summary(Shopping_data)
## Customer.ID Age Gender Item.Purchased
## Min. : 1.0 Min. :18.00 Length:3900 Length:3900
## 1st Qu.: 975.8 1st Qu.:31.00 Class :character Class :character
## Median :1950.5 Median :44.00 Mode :character Mode :character
## Mean :1950.5 Mean :44.07
## 3rd Qu.:2925.2 3rd Qu.:57.00
## Max. :3900.0 Max. :70.00
## Category Purchase.Amount..USD. Location Size
## Length:3900 Min. : 20.00 Length:3900 Length:3900
## Class :character 1st Qu.: 39.00 Class :character Class :character
## Mode :character Median : 60.00 Mode :character Mode :character
## Mean : 59.76
## 3rd Qu.: 81.00
## Max. :100.00
## Color Season Review.Rating Subscription.Status
## Length:3900 Length:3900 Min. :2.50 Length:3900
## Class :character Class :character 1st Qu.:3.10 Class :character
## Mode :character Mode :character Median :3.70 Mode :character
## Mean :3.75
## 3rd Qu.:4.40
## Max. :5.00
## Shipping.Type Discount.Applied Promo.Code.Used Previous.Purchases
## Length:3900 Length:3900 Length:3900 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:13.00
## Mode :character Mode :character Mode :character Median :25.00
## Mean :25.35
## 3rd Qu.:38.00
## Max. :50.00
## Payment.Method Frequency.of.Purchases
## Length:3900 Length:3900
## Class :character Class :character
## Mode :character Mode :character
##
##
##
colSums(is.na(Shopping_data))
## Customer.ID Age Gender
## 0 0 0
## Item.Purchased Category Purchase.Amount..USD.
## 0 0 0
## Location Size Color
## 0 0 0
## Season Review.Rating Subscription.Status
## 0 0 0
## Shipping.Type Discount.Applied Promo.Code.Used
## 0 0 0
## Previous.Purchases Payment.Method Frequency.of.Purchases
## 0 0 0
sum(duplicated(Shopping_data))
## [1] 0
2.4.1 Data Transformation
2.4.1.1 Age Group
Shopping_data <- Shopping_data %>% mutate(
Age_Group=ifelse(Age>=18 & Age<=24,"18 to 24",
ifelse(Age>=25 & Age<=34,"25 to 34",
ifelse(Age>=35 & Age<=44,"35 to 44",
ifelse(Age>=45 & Age<=54,"45 to 54",
ifelse(Age>=55 & Age<=64,"55 to 64",
ifelse(Age>=65,"65 or older",NA)))))))
Shopping_data <- Shopping_data %>%
select(1:(which(names(Shopping_data) == "Gender")),
"Age_Group",
(which(names(Shopping_data) == "Gender") + 1):length(names(Shopping_data)))
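An equivalent sketch of the same two steps (age banding and column placement) using dplyr::case_when() and relocate(); this assumes dplyr >= 1.0 and is not part of the original pipeline:
# Same banding as the nested ifelse() above (ages in this data start at 18),
# then move the new column directly after Gender
Shopping_data <- Shopping_data %>%
  mutate(Age_Group = case_when(
    Age <= 24 ~ "18 to 24",
    Age <= 34 ~ "25 to 34",
    Age <= 44 ~ "35 to 44",
    Age <= 54 ~ "45 to 54",
    Age <= 64 ~ "55 to 64",
    TRUE      ~ "65 or older")) %>%
  relocate(Age_Group, .after = Gender)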
2.4.1.2 Purchasing Frequency Group
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(
Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly" | Frequency.of.Purchases=="Fortnightly" |Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Lower",NA))))
2.4.1.3 Previous Purchase Group
Shopping_data <- Shopping_data %>% mutate(
Previous.Purchases.Group = ifelse(Previous.Purchases>=1 & Previous.Purchases<=10,"1 to 10",
ifelse(Previous.Purchases>=11 & Previous.Purchases<=20,"11 to 20",
ifelse(Previous.Purchases>=21 & Previous.Purchases<=30,"21 to 30",
ifelse(Previous.Purchases>=31 & Previous.Purchases<=40,"31 to 40",
ifelse(Previous.Purchases>=41 & Previous.Purchases<=50,"41 to 50",NA))))))
2.4.1.4 Identify Customer Class
Shopping_data<- Shopping_data %>%
mutate(
Customer_class=ifelse((Purchase.Amount..USD.>=60 & Purchases.Frequency.Group=="High"),"High Value",
"Non High Value"))
2.5 Remove unnecessary columns
Shopping_data <- subset(Shopping_data, select = -c(Customer.ID,Review.Rating, Size))
2.6 Export cleaned dataset
write_csv(Shopping_data, file = "Shopping Trend Dataset (Cleaned).csv")
3.1 Univariate Analysis
3.1.1 Purchase Amount (num)
ggplot(Shopping_data, aes(x = Purchase.Amount..USD., fill = ..count..)) +
geom_histogram(binwidth = 10, color = "white", aes(y = ..count..), alpha = 0.7) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Histogram of Purchase Amount",
x = "Purchase Amount (USD)",
y = "Frequency")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
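A sketch of the same histogram written with the after_stat() syntax recommended by the deprecation warning above (the plot itself is unchanged):
# Identical plot, using the non-deprecated after_stat(count) notation
ggplot(Shopping_data, aes(x = Purchase.Amount..USD.)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 10, color = "white", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Histogram of Purchase Amount",
       x = "Purchase Amount (USD)",
       y = "Frequency")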
3.1.2 age group (cat)
Age_cat <- as.data.frame(table(Shopping_data$Age_Group))
names(Age_cat) <- c("Age", "freq")
pie(Age_cat$freq, labels = Age_cat$Age , col = c("pink", "yellow", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Age Distribution")
3.1.3 gender (cat)
gender_cat <- as.data.frame(table(Shopping_data$Gender))
names(gender_cat) <- c("Gender", "freq")
pie(gender_cat$freq, labels = gender_cat$Gender , col = c("pink", "lightblue"), main = "Gender Distribution")
3.1.5 subscription status (cat)
substatus_cat <- as.data.frame(table(Shopping_data$Subscription.Status))
names(substatus_cat) <- c("Status", "freq")
pie(substatus_cat$freq, labels = substatus_cat$Status , col = c("rosybrown", "lightyellow"), main = "Subscription Status")
3.1.6 discount applied (cat)
disc_cat <- as.data.frame(table(Shopping_data$Discount.Applied))
names(disc_cat) <- c("Discount", "freq")
pie(disc_cat$freq, labels = disc_cat$Discount , col = c("lightblue", "lightyellow"), main = "Discount Application")
3.1.8 previous purchase group
purchasegrp_cat <- as.data.frame(table(Shopping_data$Previous.Purchases.Group))
names(purchasegrp_cat) <- c("Group", "freq")
pie(purchasegrp_cat$freq, labels = purchasegrp_cat$Group , col = c("pink", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Previous Purchases Category")
3.1.9 frequency of purchases (cat)
# Identify the most common purchase frequency so it can be highlighted
freq_counts <- table(Shopping_data$Frequency.of.Purchases)
most_common <- names(freq_counts)[which.max(freq_counts)]
ggplot(Shopping_data, aes(x = Frequency.of.Purchases,
                          fill = Frequency.of.Purchases == most_common)) +
geom_bar() +
scale_fill_manual(values = c("FALSE" = "darkblue", "TRUE" = "lightblue"), guide = "none") +
labs(title = "Bar Chart of Frequency of Purchases",
x = "Frequency of Purchases",
y = "Count")
3.1.10 Purchase Frequency Group
purchasefreq_cat <- as.data.frame(table(Shopping_data$Purchases.Frequency.Group))
names(purchasefreq_cat) <- c("Group", "freq")
pie(purchasefreq_cat$freq, labels = purchasefreq_cat$Group , col = c("pink", "lightblue","lightyellow"), main = "Purchases Frequency")
3.1.11 Customer Class
custclass_cat <- as.data.frame(table(Shopping_data$Customer_class))
names(custclass_cat) <- c("Class", "freq")
pie(custclass_cat$freq, labels = custclass_cat$Class , col = c("gold","lightgrey"), main = "Customer Class")
3.2 Bivariate Analysis
3.2.1 purchased amount and gender
ggplot(Shopping_data, aes(x = Gender, y = Purchase.Amount..USD.,fill=Gender)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Gender", x="Gender", y="Mean Purchase Amount(USD)",fill="Gender")
3.2.2 purchased amount and historical purchase data
ggplot(Shopping_data, aes(x = Previous.Purchases.Group, y = Purchase.Amount..USD.,fill=Previous.Purchases.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Previous Purchases Group", x="Previous Purchases Group",
y="Mean Purchase Amount(USD)",fill="Previous Purchases Group")
3.2.3 purchased amount and frequency of purchases
ggplot(Shopping_data, aes(x = Purchases.Frequency.Group, y = Purchase.Amount..USD.,fill=Purchases.Frequency.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Purchases Frequency Group", x="Purchases Frequency Group",
y="Mean Purchase Amount(USD)",fill="Purchases Frequency Group")
3.2.4 purchased amount and season
ggplot(Shopping_data, aes(x = Season, y = Purchase.Amount..USD.,fill=Season)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Season", x="Season", y="Mean Purchase Amount(USD)",fill="Season")
3.2.5 purchased amount and Subscription Status
ggplot(Shopping_data, aes(x = Subscription.Status, y = Purchase.Amount..USD.,fill=Subscription.Status)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Subscription Status", x="Subscription Status",
y="Mean Purchase Amount(USD)",fill="Subscription Status")
3.2.6 purchased amount and Shipping.Type
ggplot(Shopping_data, aes(x = Shipping.Type, y = Purchase.Amount..USD.,fill=Shipping.Type)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Shipping Type", x="Shipping Type", y="Mean Purchase Amount(USD)",fill="Shipping Type")
3.2.7 purchased amount and Discount Applied
ggplot(Shopping_data, aes(x = Discount.Applied, y = Purchase.Amount..USD.,fill=Discount.Applied)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Discount Applied", x="Discount Applied", y="Mean Purchase Amount(USD)",fill="Discount Applied")
3.2.8 purchased amount and Payment Method
ggplot(Shopping_data, aes(x = Payment.Method, y = Purchase.Amount..USD.,fill=Payment.Method)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Payment Method", x="Payment Method", y="Mean Purchase Amount(USD)",fill="Payment Method")
3.2.9 purchased amount and Customer Class
ggplot(Shopping_data, aes(x = Customer_class, y = Purchase.Amount..USD.,fill=Customer_class)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Customer Class", x="Customer Class", y="Mean Purchase Amount(USD)",fill="Customer Class")
3.3 Multivariate Analysis
Shopping_trend <- lm(Purchase.Amount..USD. ~ Gender + Age_Group + Item.Purchased + Category + Color + Season + Subscription.Status + Shipping.Type + Discount.Applied + Payment.Method + Purchases.Frequency.Group + Customer_class , data = Shopping_data)
anova(Shopping_trend)
## Analysis of Variance Table
##
## Response: Purchase.Amount..USD.
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 431 431 1.1231 0.2893159
## Age_Group 5 1274 255 0.6634 0.6512356
## Item.Purchased 24 10956 456 1.1884 0.2396422
## Color 24 19477 812 2.1126 0.0012172 **
## Season 3 6926 2309 6.0101 0.0004419 ***
## Subscription.Status 1 10 10 0.0258 0.8723082
## Shipping.Type 5 3245 649 1.6894 0.1335446
## Discount.Applied 1 239 239 0.6213 0.4306093
## Payment.Method 5 1577 315 0.8212 0.5343410
## Purchases.Frequency.Group 2 185 93 0.2408 0.7859994
## Customer_class 1 672945 672945 1751.8669 < 2.2e-16 ***
## Residuals 3827 1470066 384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.0 Modeling
4.1 Multiple Regression Model
4.1.1 Objective
The objective of building this regression model is to predict the purchase amount a customer is likely to spend, incorporating customer profiles for more accurate targeting.
4.1.2 Problem Statement
How do demographic patterns, historical purchase data and other purchasing behaviors (payment method, promo code utilization) impact a customer’s purchase amount?
4.1.3 Strategy
To discover the features that impact a customer’s purchase amount and to make predictions by creating a multiple regression model using the variables provided.
4.1.4 Expected Outcome
The model will provide a continuous prediction of the purchase amount and insights into consumer profiles, enabling the business to anticipate potential revenue from individual customers and tailor marketing strategies more effectively.
# Performing one hot encoding to all categorical variables
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df_all <- data.frame(predict(dummy, newdata = df))
head(df_all)
## Purchase.Amount..USD. Age GenderFemale GenderMale Age_Group18.to.24
## 1 53 55 0 1 0
## 2 64 19 0 1 1
## 3 73 50 0 1 0
## 4 90 21 0 1 1
## 5 49 45 0 1 0
## 6 20 46 0 1 0
## Age_Group25.to.34 Age_Group35.to.44 Age_Group45.to.54 Age_Group55.to.64
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 1 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## Age_Group65.or.older Item.PurchasedBackpack Item.PurchasedBelt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedBlouse Item.PurchasedBoots Item.PurchasedCoat
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## Item.PurchasedDress Item.PurchasedGloves Item.PurchasedHandbag
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedHat Item.PurchasedHoodie Item.PurchasedJacket
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedJeans Item.PurchasedJewelry Item.PurchasedPants
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSandals Item.PurchasedScarf Item.PurchasedShirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedShoes Item.PurchasedShorts Item.PurchasedSkirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSneakers Item.PurchasedSocks Item.PurchasedSunglasses
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## Item.PurchasedSweater Item.PurchasedT.shirt CategoryAccessories
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## CategoryClothing CategoryFootwear CategoryOuterwear LocationAlabama
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 0 1 0 0
## 5 1 0 0 0
## 6 0 1 0 0
## LocationAlaska LocationArizona LocationArkansas LocationCalifornia
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationColorado LocationConnecticut LocationDelaware LocationFlorida
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationGeorgia LocationHawaii LocationIdaho LocationIllinois LocationIndiana
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationIowa LocationKansas LocationKentucky LocationLouisiana LocationMaine
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationMaryland LocationMassachusetts LocationMichigan LocationMinnesota
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationMississippi LocationMissouri LocationMontana LocationNebraska
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNevada LocationNew.Hampshire LocationNew.Jersey LocationNew.Mexico
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNew.York LocationNorth.Carolina LocationNorth.Dakota LocationOhio
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationOklahoma LocationOregon LocationPennsylvania LocationRhode.Island
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 1 0 0
## 6 0 0 0 0
## LocationSouth.Carolina LocationSouth.Dakota LocationTennessee LocationTexas
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationUtah LocationVermont LocationVirginia LocationWashington
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationWest.Virginia LocationWisconsin LocationWyoming ColorBeige ColorBlack
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 1 0 0
## ColorBlue ColorBrown ColorCharcoal ColorCyan ColorGold ColorGray ColorGreen
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## ColorIndigo ColorLavender ColorMagenta ColorMaroon ColorOlive ColorOrange
## 1 0 0 0 0 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorPeach ColorPink ColorPurple ColorRed ColorSilver ColorTeal
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorTurquoise ColorViolet ColorWhite ColorYellow SeasonFall SeasonSpring
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 1
## 4 0 0 0 0 0 1
## 5 1 0 0 0 0 1
## 6 0 0 1 0 0 0
## SeasonSummer SeasonWinter Subscription.StatusNo Subscription.StatusYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 1 0 0 1
## Shipping.Type2.Day.Shipping Shipping.TypeExpress Shipping.TypeFree.Shipping
## 1 0 1 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## Shipping.TypeNext.Day.Air Shipping.TypeStandard Shipping.TypeStore.Pickup
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 1 0
## Discount.AppliedNo Discount.AppliedYes Promo.Code.UsedNo Promo.Code.UsedYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 1 0 1
## 4 0 1 0 1
## 5 0 1 0 1
## 6 0 1 0 1
## Previous.Purchases Payment.MethodBank.Transfer Payment.MethodCash
## 1 14 0 0
## 2 2 0 1
## 3 23 0 0
## 4 49 0 0
## 5 31 0 0
## 6 14 0 0
## Payment.MethodCredit.Card Payment.MethodDebit.Card Payment.MethodPayPal
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 0
## Payment.MethodVenmo Frequency.of.PurchasesAnnually
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 1
## 6 1 0
## Frequency.of.PurchasesBi.Weekly Frequency.of.PurchasesFortnightly
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesMonthly Frequency.of.PurchasesQuarterly
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesWeekly Purchases.Frequency.GroupHigh
## 1 0 1
## 2 0 1
## 3 1 1
## 4 1 1
## 5 0 0
## 6 1 1
## Purchases.Frequency.GroupLower Purchases.Frequency.GroupNormal
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 0 0
## Previous.Purchases.Group1.to.10 Previous.Purchases.Group11.to.20
## 1 0 1
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 1
## Previous.Purchases.Group21.to.30 Previous.Purchases.Group31.to.40
## 1 0 0
## 2 0 0
## 3 1 0
## 4 0 0
## 5 0 1
## 6 0 0
## Previous.Purchases.Group41.to.50 Customer_classHigh.Value
## 1 0 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 0 0
## 6 0 0
## Customer_classNon.High.Value
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Purpose: Transform all categorical variables into numerical dummy (one-hot encoded) variables before performing the correlation calculation.
# Correlation Calculation
correlation_matrix <- as.data.frame(cor(df_all))
# correlation_matrix
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>=0.03
Purpose: To check the correlation between each variable and the target, and to identify variables whose correlation might be improved through feature engineering.
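A short sketch (not in the original script) listing the strongest absolute correlations with the target, which is what the threshold above screens for:
# Rank variables by absolute correlation with Purchase.Amount..USD.
target_cor <- correlation_matrix$Purchase.Amount..USD.
names(target_cor) <- rownames(correlation_matrix)
head(sort(abs(target_cor), decreasing = TRUE), 10)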
# Combining Locations based on Regions in USA
northeast = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania','Northeast')
midwest = c('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota','MidWest')
south = c('Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas','South')
west = c('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington','West')
Shopping_data <- Shopping_data %>% mutate(Location = ifelse(Location %in% northeast,'Northeast',
ifelse(Location %in% midwest,'MidWest',
ifelse(Location %in% south,'South',
ifelse(Location %in% west,'West','')))))
# Checking Correlation Between Target and Newly Augmented Data
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Location"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. LocationMidWest LocationNortheast
## Purchase.Amount..USD. 1.000000000 -0.01052207 -0.009970438
## LocationMidWest -0.010522075 1.00000000 -0.257040606
## LocationNortheast -0.009970438 -0.25704061 1.000000000
## LocationSouth -0.015834876 -0.39100423 -0.317815628
## LocationWest 0.035718023 -0.33421880 -0.271659356
## LocationSouth LocationWest
## Purchase.Amount..USD. -0.01583488 0.03571802
## LocationMidWest -0.39100423 -0.33421880
## LocationNortheast -0.31781563 -0.27165936
## LocationSouth 1.00000000 -0.41324193
## LocationWest -0.41324193 1.00000000
Result: the strongest correlation with the target is about 0.036, from LocationWest.
# Transform Frequency of Purchases to Low, Normal, High
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly"|Frequency.of.Purchases=="Fortnightly"|Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Low",NA))))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Purchases.Frequency.Group"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD.
## Purchase.Amount..USD. 1.000000000
## Purchases.Frequency.GroupHigh -0.006660787
## Purchases.Frequency.GroupLow 0.007154926
## Purchases.Frequency.GroupNormal 0.001519443
## Purchases.Frequency.GroupHigh
## Purchase.Amount..USD. -0.006660787
## Purchases.Frequency.GroupHigh 1.000000000
## Purchases.Frequency.GroupLow -0.350937188
## Purchases.Frequency.GroupNormal -0.744108314
## Purchases.Frequency.GroupLow
## Purchase.Amount..USD. 0.007154926
## Purchases.Frequency.GroupHigh -0.350937188
## Purchases.Frequency.GroupLow 1.000000000
## Purchases.Frequency.GroupNormal -0.364434493
## Purchases.Frequency.GroupNormal
## Purchase.Amount..USD. 0.001519443
## Purchases.Frequency.GroupHigh -0.744108314
## Purchases.Frequency.GroupLow -0.364434493
## Purchases.Frequency.GroupNormal 1.000000000
Result: the strongest correlation is only about 0.007, from Purchases.Frequency.GroupLow.
# Transform Shipping Type into two groups: High (Express) and Low (all other types)
Shopping_data<- Shopping_data %>% mutate(Shipping.Type = ifelse(Shipping.Type=="Express","High","Low"))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Shipping.Type"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. Shipping.TypeHigh Shipping.TypeLow
## Purchase.Amount..USD. 1.00000000 0.01337441 -0.01337441
## Shipping.TypeHigh 0.01337441 1.00000000 -1.00000000
## Shipping.TypeLow -0.01337441 -1.00000000 1.00000000
Result: the strongest correlation is about 0.013, from Shipping.TypeHigh.
# Select Re-engineered Features based on correlation values
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df <- data.frame(predict(dummy, newdata = df))
correlation_matrix <- as.data.frame(cor(df))
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>0.03
df = df[,sel_index]
# df
Purpose: Select the variables whose absolute correlation with the target exceeds 0.03 (roughly the strongest values found). Even these cannot be considered well correlated with the purchase amount, so all of the selected variables were carried into the model building, training and testing phases; only the variables derived from the target column (Customer_class) were removed.
# Split data into "Train" and "Test"
set.seed(1)
trainIndex <- createDataPartition(df$Purchase.Amount..USD., p = 0.8, list = FALSE)
# Forming train data
train <- df[trainIndex, ]
# Removing columns derived from target column
train <- subset(train, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
# Forming test data
test <- df[-trainIndex, ]
# Removing columns derived from target column
test <- subset(test, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
Purpose: Prepare data to train and test during model building.
# Create the linear regression
lmPurchaseAmount <-lm(train, data = train)
#Review the results
summary(lmPurchaseAmount)
##
## Call:
## lm(formula = train, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Predict data using the model & review result
predictions_LR <- predict(lmPurchaseAmount, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_LR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_LR)^2))
RMSE_LR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_LR <- mae(test$Purchase.Amount..USD.,predictions_LR)
MAE_LR
## [1] 20.46703
# Calculate r^2
R2_LR <- R2(test$Purchase.Amount..USD.,predictions_LR)
R2_LR
## [1] 0.009135115
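The same three metrics are recomputed for every model below; a small hypothetical helper (the function name and wrapper are illustrative, not part of the original script) that bundles them, with R2 computed as the squared correlation, consistent with the values reported above:
# Hypothetical helper: RMSE, MAE and R^2 for a vector of predictions
eval_regression <- function(actual, predicted) {
  c(RMSE = sqrt(mean((actual - predicted)^2)),
    MAE  = mean(abs(actual - predicted)),
    R2   = cor(actual, predicted)^2)
}
# Example: eval_regression(test$Purchase.Amount..USD., predictions_LR)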
# Create the random forest
set.seed(123)
RF_model<-randomForest(formula=Purchase.Amount..USD.~.,data=train)
# Predict data using the model
predictions_RF <- predict(RF_model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_RF <- sqrt(mean((test$Purchase.Amount..USD. - predictions_RF)^2))
RMSE_RF
## [1] 23.65286
# Calculate Mean Absolute Error (MAE)
MAE_RF <- mae(test$Purchase.Amount..USD.,predictions_RF)
MAE_RF
## [1] 20.47543
# Calculate r^2
R2_RF <- R2(test$Purchase.Amount..USD.,predictions_RF)
R2_RF
## [1] 0.009238454
# Create the decision tree
fit <- rpart(Purchase.Amount..USD.~.,
method = "anova", data = train, control =rpart.control(minsplit =1,minbucket=1, cp=0))
png(file = "decTreeGFG.png", width = 100,
height = 10)
# Predict data using the model
predictions_DT<-predict(fit, test, method = "anova")
# Calculate the root mean squared error (RMSE)
RMSE_DT <- sqrt(mean((test$Purchase.Amount..USD. - predictions_DT)^2))
RMSE_DT
## [1] 23.70689
# Calculate Mean Absolute Error (MAE)
MAE_DT <- mae(test$Purchase.Amount..USD.,predictions_DT)
MAE_DT
## [1] 20.48466
# Calculate r^2
R2_DT <- R2(test$Purchase.Amount..USD.,predictions_DT)
R2_DT
## [1] 0.00563934
# Create the SVM Regressor
model <- train(Purchase.Amount..USD.~., data = train, method = "svmLinear")
# Predict data using the model
predictions_SVM <- predict(model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_SVM <- sqrt(mean((test$Purchase.Amount..USD. - predictions_SVM)^2))
RMSE_SVM
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_SVM <- mae(test$Purchase.Amount..USD.,predictions_SVM)
MAE_SVM
## [1] 20.46703
# Calculate r^2
R2_SVM <- R2(test$Purchase.Amount..USD.,predictions_SVM)
R2_SVM
## [1] 0.009135115
# Create the polynomial regression (note: no higher-degree terms are added here, so this fits the same linear model on the selected features)
PurchaseAmount_PR <-lm(Purchase.Amount..USD. ~., data = train)
#Review the results
summary(PurchaseAmount_PR)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Make predictions using the testing data
predictions_PR <- predict(PurchaseAmount_PR, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_PR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_PR)^2))
RMSE_PR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_PR <- mae(test$Purchase.Amount..USD.,predictions_PR)
MAE_PR
## [1] 20.46703
# Calculate r^2
R2_PR <- R2(test$Purchase.Amount..USD.,predictions_PR)
R2_PR
## [1] 0.009135115
# Perform gradient boosting on Linear Regression Model
boost = gbm(Purchase.Amount..USD. ~ ., data = train,
distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
verbose = FALSE, n.cores = 1)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
best.iter <- gbm.perf(boost, method = "test")
# Make predictions using the testing data
predictions_GBLR <- predict(boost, newdata = test, n.trees = best.iter, type = "link")
RMSE_GBLR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_GBLR)^2))
RMSE_GBLR
## [1] 23.65238
# Calculate Mean Absolute Error (MAE)
MAE_GBLR <- mae(test$Purchase.Amount..USD.,predictions_GBLR)
MAE_GBLR
## [1] 20.46703
# Calculate r^2
R2_GBLR <- R2(test$Purchase.Amount..USD.,predictions_GBLR)
R2_GBLR
## [1] 0.009135115
Purpose: Attempt to boost the performance of the linear regression model, since it was the model chosen at the beginning of the project.
4.1.5 Evaluation
# Create evaluation metrics based on RMSE, MAE and R2 from all models built
RMSE <- c(27.90801, 24.04091, 33.84289, 24.08463, 24.08463, 23.7658)
MAE <- c(20.72025, 20.69969, 27.63744, 20.72025, 20.72025, 20.72025)
R2 <- c(1.95E-05, 0.000300061, 1.25E-06, 1.95E-05, 1.95E-05, 1.95E-05)
df_EM <- data.frame(RMSE, MAE, R2)
rownames(df_EM) <- c("Linear Regression", "Random Forest", "Decision Tree", "Support Vector Machine", "Polynomial Regression", "Linear Regression (after gradient boosting)")
df_EM
## RMSE MAE R2
## Linear Regression 27.90801 20.72025 0.000019500
## Random Forest 24.04091 20.69969 0.000300061
## Decision Tree 33.84289 27.63744 0.000001250
## Support Vector Machine 24.08463 20.72025 0.000019500
## Polynomial Regression 24.08463 20.72025 0.000019500
## Linear Regression (after gradient boosting) 23.76580 20.72025 0.000019500
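The metrics above are typed in by hand; a sketch assembling the same table directly from the metric objects computed earlier, so the table cannot drift from the model outputs:
# Build the comparison table from the stored metric objects
df_EM <- data.frame(
  RMSE = c(RMSE_LR, RMSE_RF, RMSE_DT, RMSE_SVM, RMSE_PR, RMSE_GBLR),
  MAE  = c(MAE_LR,  MAE_RF,  MAE_DT,  MAE_SVM,  MAE_PR,  MAE_GBLR),
  R2   = c(R2_LR,   R2_RF,   R2_DT,   R2_SVM,   R2_PR,   R2_GBLR),
  row.names = c("Linear Regression", "Random Forest", "Decision Tree",
                "Support Vector Machine", "Polynomial Regression",
                "Linear Regression (after gradient boosting)")
)
df_EM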
From the evaluation metrics above, if we do not take the gradient-boosted linear regression into account, the model that works best for predicting the customer purchase amount is the Random Forest model.
However, because linear regression was the model originally targeted for predicting the customer purchase amount, gradient boosting was applied to it to improve its performance. After boosting, the linear model clearly performs the best of all the models built.
4.1.6 Regression Model Summary
In short, without considering the gradient boosting result, the Random Forest model would be the best choice for predicting the customer purchase amount. However, a model’s performance can be improved with gradient boosting, in which the final prediction is built up from an ensemble of many sequentially fitted models.
The main reason the different regression models perform below average is the lack of correlation between the variables in the dataset and the purchase amount, as was clearly visible in the correlation analysis carried out before model building.
In conclusion, for a regression model to perform better, at least some of the variables in the collected dataset must have a reasonably strong relationship with the target.
4.2 Classification model
4.2.1 Objective
The objective of building this classification model is to identify high-value customers, and to use basket analysis to understand high-selling product combinations.
4.2.2 Problem Statement
Who are the company’s target customers? The aim is to separate customers into categories (high value or non-high value) to help the business find its target marketing audience.
4.2.3 Strategy
To discover the features that impact the customer class and to make predictions by building a classification model using the selected features.
4.2.4 Expected Outcome
The model will classify each customer as ‘high-value’ or ‘not high-value’, and basket analysis will provide insights into product combinations that are most attractive to targeted customer group. This allows the business to focus marketing efforts and tailor its product offerings more strategically.
Step 1: Feature Selection - Eliminate unnecessary or redundant variables
temp1_df <-subset(Shopping_data, select = -c(Age,Previous.Purchases,Purchase.Amount..USD.,Purchases.Frequency.Group, Frequency.of.Purchases))
# temp1_df
colnames(temp1_df)
## [1] "Gender" "Age_Group"
## [3] "Item.Purchased" "Category"
## [5] "Location" "Color"
## [7] "Season" "Subscription.Status"
## [9] "Shipping.Type" "Discount.Applied"
## [11] "Promo.Code.Used" "Payment.Method"
## [13] "Previous.Purchases.Group" "Customer_class"
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age_Group : chr "55 to 64" "18 to 24" "45 to 54" "18 to 24" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Location : chr "South" "Northeast" "Northeast" "Northeast" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "High" "High" "Low" "Low" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Previous.Purchases.Group: chr "11 to 20" "1 to 10" "21 to 30" "41 to 50" ...
## $ Customer_class : chr "Non High Value" "High Value" "High Value" "High Value" ...
Check class distribution
# Convert character variables to factors
char_vars <- sapply(temp1_df, is.character)
temp1_df[char_vars] <- lapply(temp1_df[char_vars], as.factor)
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age_Group : Factor w/ 6 levels "18 to 24","25 to 34",..: 5 1 4 1 4 4 5 2 2 5 ...
## $ Item.Purchased : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
## $ Category : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
## $ Location : Factor w/ 4 levels "MidWest","Northeast",..: 3 2 2 2 4 4 4 3 3 1 ...
## $ Color : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
## $ Subscription.Status : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Shipping.Type : Factor w/ 2 levels "High","Low": 1 1 2 2 2 2 2 2 1 2 ...
## $ Discount.Applied : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Promo.Code.Used : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Payment.Method : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
## $ Previous.Purchases.Group: Factor w/ 5 levels "1 to 10","11 to 20",..: 2 1 3 5 4 2 5 2 1 1 ...
## $ Customer_class : Factor w/ 2 levels "High Value","Non High Value": 2 1 1 1 2 2 2 2 2 2 ...
table(temp1_df$Customer_class)
##
## High Value Non High Value
## 819 3081
Use SMOTE to balance the dataset
set.seed(123)
# Apply SMOTE to handle class imbalance
resampled_data <- SMOTE(Customer_class ~ ., data = temp1_df, perc.over = 300, perc.under = 100, k = 5)
# Display the head of the resampled data
head(resampled_data)
## Gender Age_Group Item.Purchased Category Location Color Season
## 2820 Female 65 or older Shirt Clothing South Lavender Spring
## 49 Male 35 to 44 Shirt Clothing West Orange Summer
## 423 Male 45 to 54 Pants Clothing MidWest Purple Fall
## 3080 Female 45 to 54 Belt Accessories West Maroon Fall
## 2745 Female 35 to 44 Shirt Clothing West Silver Winter
## 2961 Female 55 to 64 Coat Outerwear MidWest Olive Spring
## Subscription.Status Shipping.Type Discount.Applied Promo.Code.Used
## 2820 No Low No No
## 49 Yes Low Yes Yes
## 423 Yes Low Yes Yes
## 3080 No Low No No
## 2745 No Low No No
## 2961 No Low No No
## Payment.Method Previous.Purchases.Group Customer_class
## 2820 Cash 11 to 20 Non High Value
## 49 Cash 31 to 40 Non High Value
## 423 Debit Card 31 to 40 Non High Value
## 3080 Bank Transfer 1 to 10 Non High Value
## 2745 Credit Card 31 to 40 Non High Value
## 2961 Credit Card 41 to 50 Non High Value
table(resampled_data$Customer_class)
##
## High Value Non High Value
## 3276 2457
Step 2: Data Splitting - Split data into training and testing sets with stratified sampling
set.seed(123)
# Split the data into training and testing sets with stratified sampling
splits <- initial_split(resampled_data, prop = 0.8, strata = "Customer_class")
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 4585 14
dim(test_data)
## [1] 1148 14
# Displays the ratio of rows in the training set compared to the original data
dim(train_data) / dim(resampled_data)
## [1] 0.7997558 1.0000000
# Displays the proportions of each class in the training and testing sets.
prop.table(table(train_data$Customer_class))
##
## High Value Non High Value
## 0.5714286 0.4285714
prop.table(table(test_data$Customer_class))
##
## High Value Non High Value
## 0.5714286 0.4285714
# Displays summary statistics
summary(train_data$Customer_class)
## High Value Non High Value
## 2620 1965
summary(test_data$Customer_class)
## High Value Non High Value
## 656 492
Step 3: Classification Model Selection
1a. Model Training
set.seed(123)
train_data$Customer_class <- as.factor(train_data$Customer_class)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Gender 66.23866
## Age_Group 155.99608
## Item.Purchased 440.53226
## Category 84.14500
## Location 105.15283
## Color 451.23273
## Season 103.81943
## Subscription.Status 58.28065
## Shipping.Type 141.37565
## Discount.Applied 110.15461
## Promo.Code.Used 130.63489
## Payment.Method 154.86953
## Previous.Purchases.Group 131.66865
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_data)
1b. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$Customer_class)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 476 44
## Non High Value 180 448
##
## Accuracy : 0.8049
## 95% CI : (0.7807, 0.8274)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6149
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7256
## Specificity : 0.9106
## Pos Pred Value : 0.9154
## Neg Pred Value : 0.7134
## Prevalence : 0.5714
## Detection Rate : 0.4146
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.8181
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.9153846
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7256098
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.8095238
roc_curve <- roc(test_data$Customer_class, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.8181
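The ROC curve above is built from hard class labels, which gives a single-threshold curve; a sketch (not in the original script) using the class probabilities from the same random forest usually gives a smoother, more informative curve:
# ROC from predicted probabilities of the "High Value" class
rf_prob <- predict(rf_model, test_data, type = "prob")[, "High Value"]
roc_prob <- roc(test_data$Customer_class, rf_prob,
                levels = c("Non High Value", "High Value"))
auc(roc_prob)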
1c. Test random forest with selected features
set.seed(123)
train_data$Customer_class <- as.factor(train_data$Customer_class)
rf_model<-randomForest(formula=Customer_class~ Item.Purchased + Color ,data=train_data)
# rf_model
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Item.Purchased 285.8403
## Color 258.7243
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_data)
1d. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$Customer_class)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 490 260
## Non High Value 166 232
##
## Accuracy : 0.6289
## 95% CI : (0.6002, 0.6569)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 4.227e-05
##
## Kappa : 0.2238
##
## Mcnemar's Test P-Value : 6.610e-06
##
## Sensitivity : 0.7470
## Specificity : 0.4715
## Pos Pred Value : 0.6533
## Neg Pred Value : 0.5829
## Prevalence : 0.5714
## Detection Rate : 0.4268
## Detection Prevalence : 0.6533
## Balanced Accuracy : 0.6092
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6533333
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7469512
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.6970128
roc_curve <- roc(test_data$Customer_class, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.3908
2a. Model Training
set.seed(123)
# Train a decision tree model
model <- rpart(Customer_class ~ Gender + Age_Group + Item.Purchased + Location + Subscription.Status + Shipping.Type + Discount.Applied + Promo.Code.Used + Payment.Method + Previous.Purchases.Group,
data = train_data,
method = "class")
# Print the trained model
# print(model)
# Visualize the decision tree
rpart.plot(model, main = "Decision Tree for Customer Classification", extra = 1)
# Make predictions on the test set
dt_predictions <- predict(model, test_data, type = "class")
# Ensure consistent factor levels
test_data$Customer_class <- factor(test_data$Customer_class, levels = levels(dt_predictions))
2b. Model Evaluation
# Evaluate the model
dt_conf_matrix <- confusionMatrix(dt_predictions, test_data$Customer_class)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 448 72
## Non High Value 208 420
##
## Accuracy : 0.7561
## 95% CI : (0.7302, 0.7807)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5187
##
## Mcnemar's Test P-Value : 7.158e-16
##
## Sensitivity : 0.6829
## Specificity : 0.8537
## Pos Pred Value : 0.8615
## Neg Pred Value : 0.6688
## Prevalence : 0.5714
## Detection Rate : 0.3902
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.7683
##
## 'Positive' Class : High Value
##
precision <- dt_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.8615385
recall <- dt_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.6829268
f1_score <- dt_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.7619048
roc_curve <- roc(test_data$Customer_class, as.numeric(dt_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.7683
3a. Model Training
set.seed(123)
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
# Train the Naive Bayes model
nb_model <- naiveBayes(train_predictors, train_target)
# print(nb_model)
# Predict using the Naive Bayes model
nb_predictions <- predict(nb_model, newdata = test_predictors)
3b. Model Evaluation
# Evaluate the model
nb_conf_matrix <- confusionMatrix(nb_predictions, test_target)
print(nb_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 448 237
## Non High Value 208 255
##
## Accuracy : 0.6124
## 95% CI : (0.5835, 0.6407)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.002683
##
## Kappa : 0.2027
##
## Mcnemar's Test P-Value : 0.184400
##
## Sensitivity : 0.6829
## Specificity : 0.5183
## Pos Pred Value : 0.6540
## Neg Pred Value : 0.5508
## Prevalence : 0.5714
## Detection Rate : 0.3902
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.6006
##
## 'Positive' Class : High Value
##
accuracy <- nb_conf_matrix$overall["Accuracy"]
precision <- nb_conf_matrix$byClass["Precision"]
recall <- nb_conf_matrix$byClass["Recall"]
f1_score <- nb_conf_matrix$byClass["F1"]
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.6123693
cat("Precision:", precision, "\n")
## Precision: 0.6540146
cat("Recall:", recall, "\n")
## Recall: 0.6829268
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.6681581
roc_curve <- roc(test_data$Customer_class, as.numeric(nb_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6006
4a. Model Training
set.seed(123)
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
# Train the SVM model
svm_model <- svm(Customer_class ~ ., data = train_data, kernel = "radial")
# print(svm_model)
# Predict using the SVM model
svm_predictions <- predict(svm_model, newdata = test_data)
4b. Model Evaluation
# Create a confusion matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_target)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 429 167
## Non High Value 227 325
##
## Accuracy : 0.6568
## 95% CI : (0.6285, 0.6843)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 2.086e-09
##
## Kappa : 0.3098
##
## Mcnemar's Test P-Value : 0.002955
##
## Sensitivity : 0.6540
## Specificity : 0.6606
## Pos Pred Value : 0.7198
## Neg Pred Value : 0.5888
## Prevalence : 0.5714
## Detection Rate : 0.3737
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.6573
##
## 'Positive' Class : High Value
##
precision_svm <- svm_conf_matrix$byClass["Precision"]
recall_svm <- svm_conf_matrix$byClass["Recall"]
f1_score_svm <- svm_conf_matrix$byClass["F1"]
accuracy_svm <- svm_conf_matrix$overall["Accuracy"]
print(precision_svm)
## Precision
## 0.7197987
print(recall_svm)
## Recall
## 0.6539634
print(f1_score_svm)
## F1
## 0.6853035
print(accuracy_svm)
## Accuracy
## 0.6567944
roc_curve <- roc(test_data$Customer_class, as.numeric(svm_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6573
Step 4: Result Interpretation
results <- data.frame(
Model = c("Random Forest", "Decision Tree", "Naive Bayes", "Support Vector Machines"),
Accuracy = c(0.8049, 0.7561, 0.6124, 0.6568),
Precision = c(0.9154, 0.8615, 0.6540, 0.7198),
Recall = c(0.7256, 0.6829, 0.6829, 0.6540),
F1_Score = c(0.8095, 0.7619, 0.6682, 0.6853),
ROC_AUC = c(0.8181, 0.7683, 0.6006, 0.6573)
)
# Print the table using print
print(results)
## Model Accuracy Precision Recall F1_Score ROC_AUC
## 1 Random Forest 0.8049 0.9154 0.7256 0.8095 0.8181
## 2 Decision Tree 0.7561 0.8615 0.6829 0.7619 0.7683
## 3 Naive Bayes 0.6124 0.6540 0.6829 0.6682 0.6006
## 4 Support Vector Machines 0.6568 0.7198 0.6540 0.6853 0.6573
Overall, the Random Forest model performed best across all metrics.
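The comparison table above is also typed in by hand; a small hypothetical helper (illustrative only) that pulls the same metrics from a caret confusionMatrix object and an AUC value:
# Hypothetical helper: extract the reported metrics from a confusionMatrix and an AUC
metrics_row <- function(cm, auc_value) {
  data.frame(Accuracy  = unname(cm$overall["Accuracy"]),
             Precision = unname(cm$byClass["Precision"]),
             Recall    = unname(cm$byClass["Recall"]),
             F1_Score  = unname(cm$byClass["F1"]),
             ROC_AUC   = as.numeric(auc_value))
}
# Example: metrics_row(dt_conf_matrix, 0.7683)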
Step 5: Basket Analysis - Find popular product combinations among customers
Display the most frequently purchased items
# Convert data to transactions
transactions_data <- as(temp1_df[, "Item.Purchased", drop = FALSE], "transactions")
# Display frequent itemsets
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.01))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 39
##
## create itemset ...
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 3900 column(s)] done [0.00s].
## writing ... [25 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Item.Purchased=Blouse} 0.04384615 171
## [2] {Item.Purchased=Jewelry} 0.04384615 171
## [3] {Item.Purchased=Pants} 0.04384615 171
## [4] {Item.Purchased=Shirt} 0.04333333 169
## [5] {Item.Purchased=Dress} 0.04256410 166
## [6] {Item.Purchased=Sweater} 0.04205128 164
## [7] {Item.Purchased=Jacket} 0.04179487 163
## [8] {Item.Purchased=Belt} 0.04128205 161
## [9] {Item.Purchased=Sunglasses} 0.04128205 161
## [10] {Item.Purchased=Coat} 0.04128205 161
## [11] {Item.Purchased=Sandals} 0.04102564 160
## [12] {Item.Purchased=Socks} 0.04076923 159
## [13] {Item.Purchased=Skirt} 0.04051282 158
## [14] {Item.Purchased=Shorts} 0.04025641 157
## [15] {Item.Purchased=Scarf} 0.04025641 157
## [16] {Item.Purchased=Hat} 0.03948718 154
## [17] {Item.Purchased=Handbag} 0.03923077 153
## [18] {Item.Purchased=Hoodie} 0.03871795 151
## [19] {Item.Purchased=Shoes} 0.03846154 150
## [20] {Item.Purchased=T-shirt} 0.03769231 147
## [21] {Item.Purchased=Sneakers} 0.03717949 145
## [22] {Item.Purchased=Boots} 0.03692308 144
## [23] {Item.Purchased=Backpack} 0.03666667 143
## [24] {Item.Purchased=Gloves} 0.03589744 140
## [25] {Item.Purchased=Jeans} 0.03179487 124
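Because every transaction here contains exactly one purchased item, the frequent itemsets above are simply item frequencies. The same information can be visualised directly with arules (a short sketch using the transactions object created above):
# Bar chart of the 10 most frequently purchased items (absolute counts)
itemFrequencyPlot(transactions_data, topN = 10, type = "absolute",
                  main = "Top 10 Items Purchased")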
Display the most frequently purchased items, combined with category and color, for a more specific view
# Combine columns into a new item column
temp1_df$Combined_Item <- with(temp1_df, paste(Category, Item.Purchased, Color, sep="_"))
# Convert the combined item column to a factor
temp1_df$Combined_Item <- as.factor(temp1_df$Combined_Item)
# Convert data to transactions
transactions_data <- as(temp1_df[, "Combined_Item", drop = FALSE], "transactions")
# Display the first 20 rows of frequent itemsets
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.003))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.003 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 11
##
## create itemset ...
## set transactions ...[625 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating sparse bit matrix ... [22 row(s), 3900 column(s)] done [0.00s].
## writing ... [22 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Combined_Item=Clothing_Skirt_Black} 0.003846154 15
## [2] {Combined_Item=Accessories_Jewelry_Gray} 0.003589744 14
## [3] {Combined_Item=Accessories_Jewelry_Indigo} 0.003333333 13
## [4] {Combined_Item=Clothing_Shorts_Yellow} 0.003333333 13
## [5] {Combined_Item=Clothing_Hoodie_Pink} 0.003333333 13
## [6] {Combined_Item=Clothing_Pants_Charcoal} 0.003076923 12
## [7] {Combined_Item=Footwear_Sneakers_Indigo} 0.003076923 12
## [8] {Combined_Item=Footwear_Sandals_Purple} 0.003076923 12
## [9] {Combined_Item=Clothing_Sweater_Cyan} 0.003076923 12
## [10] {Combined_Item=Clothing_Dress_Charcoal} 0.003076923 12
## [11] {Combined_Item=Accessories_Sunglasses_Lavender} 0.003076923 12
## [12] {Combined_Item=Clothing_Pants_Turquoise} 0.003076923 12
## [13] {Combined_Item=Accessories_Scarf_Violet} 0.003076923 12
## [14] {Combined_Item=Clothing_Blouse_Violet} 0.003076923 12
## [15] {Combined_Item=Accessories_Belt_Blue} 0.003076923 12
## [16] {Combined_Item=Accessories_Sunglasses_Olive} 0.003076923 12
## [17] {Combined_Item=Accessories_Handbag_Charcoal} 0.003076923 12
## [18] {Combined_Item=Clothing_Pants_Cyan} 0.003076923 12
## [19] {Combined_Item=Clothing_Skirt_Teal} 0.003076923 12
## [20] {Combined_Item=Clothing_Sweater_Maroon} 0.003076923 12
## [21] {Combined_Item=Footwear_Boots_Violet} 0.003076923 12
## [22] {Combined_Item=Footwear_Shoes_Maroon} 0.003076923 12
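With 22 combinations returned, sorting by support before inspection makes the top combinations easier to read (a small sketch using arules functions already in use above):
# Show only the 10 highest-support category/item/colour combinations
inspect(head(sort(frequent_itemsets, by = "support"), 10))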
Let’s categorize customers into four groups, “High Value Female”, “High Value Male”, “Non High Value Female”, and “Non High Value Male”, and find out which items are popular within each group.
Display the top items purchased by high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.008))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.008 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2
##
## create itemset ...
## set transactions ...[208 item(s), 255 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating sparse bit matrix ... [6 row(s), 255 column(s)] done [0.00s].
## writing ... [6 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Sunglasses_Lavender} 0.01960784 5
## [2] {Clothing_Blouse_Charcoal} 0.01176471 3
## [3] {Clothing_Skirt_Violet} 0.01176471 3
## [4] {Clothing_Hoodie_Charcoal} 0.01176471 3
## [5] {Accessories_Backpack_Yellow} 0.01176471 3
## [6] {Clothing_Dress_Olive} 0.01176471 3
Display the top items purchased by high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.006))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.006 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3
##
## create itemset ...
## set transactions ...[369 item(s), 564 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating sparse bit matrix ... [9 row(s), 564 column(s)] done [0.00s].
## writing ... [9 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Footwear_Boots_Violet} 0.008865248 5
## [2] {Clothing_Shorts_Yellow} 0.008865248 5
## [3] {Accessories_Jewelry_Orange} 0.007092199 4
## [4] {Clothing_Hoodie_White} 0.007092199 4
## [5] {Clothing_Blouse_Violet} 0.007092199 4
## [6] {Accessories_Jewelry_Gray} 0.007092199 4
## [7] {Footwear_Sandals_Lavender} 0.007092199 4
## [8] {Footwear_Sandals_Cyan} 0.007092199 4
## [9] {Accessories_Sunglasses_Purple} 0.007092199 4
Display the top items purchased by non-high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.005))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 4
##
## create itemset ...
## set transactions ...[483 item(s), 993 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating sparse bit matrix ... [13 row(s), 993 column(s)] done [0.00s].
## writing ... [13 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Outerwear_Jacket_Indigo} 0.006042296 6
## [2] {Accessories_Hat_Magenta} 0.005035247 5
## [3] {Clothing_Blouse_Violet} 0.005035247 5
## [4] {Accessories_Scarf_Magenta} 0.005035247 5
## [5] {Clothing_Shorts_Black} 0.005035247 5
## [6] {Footwear_Sandals_Blue} 0.005035247 5
## [7] {Accessories_Scarf_Green} 0.005035247 5
## [8] {Clothing_Skirt_White} 0.005035247 5
## [9] {Clothing_Socks_Pink} 0.005035247 5
## [10] {Accessories_Backpack_Green} 0.005035247 5
## [11] {Clothing_Skirt_Teal} 0.005035247 5
## [12] {Clothing_Pants_Yellow} 0.005035247 5
## [13] {Clothing_Blouse_Lavender} 0.005035247 5
Display the top items purchased by non-high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.0035))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.0035 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 7
##
## create itemset ...
## set transactions ...[608 item(s), 2088 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating sparse bit matrix ... [12 row(s), 2088 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Handbag_Charcoal} 0.004789272 10
## [2] {Accessories_Backpack_Olive} 0.004310345 9
## [3] {Clothing_Shirt_Orange} 0.004310345 9
## [4] {Clothing_Skirt_Black} 0.004310345 9
## [5] {Clothing_Sweater_Cyan} 0.004310345 9
## [6] {Accessories_Scarf_Blue} 0.003831418 8
## [7] {Footwear_Shoes_Gold} 0.003831418 8
## [8] {Clothing_Hoodie_Pink} 0.003831418 8
## [9] {Clothing_Pants_Turquoise} 0.003831418 8
## [10] {Accessories_Gloves_Turquoise} 0.003831418 8
## [11] {Footwear_Sandals_Orange} 0.003831418 8
## [12] {Accessories_Backpack_Black} 0.003831418 8
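The four group analyses above repeat the same filter-to-transactions-to-eclat pattern. A small helper function (a sketch, not part of the original code) would let the same analysis be rerun for any class, gender, and support threshold:
# Helper: frequent single items for one customer class / gender combination
top_items_by_group <- function(df, class_label, gender_label, min_support) {
  filtered <- subset(df, Customer_class == class_label & Gender == gender_label)
  trans <- as(split(filtered$Combined_Item, seq_along(filtered$Combined_Item)),
              "transactions")
  itemsets <- eclat(trans, parameter = list(supp = min_support))
  inspect(sort(itemsets, by = "support"))
}
# Example: reproduces the high-value female analysis above
# top_items_by_group(temp1_df, "High Value", "Female", 0.008)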