GROUP MEMBERS (GROUP 9)
IDA TEH PING LI 22074634
CHEN XIN WEI 17202767
SUM SI YENG 22068843
YAP ZHI REN 22056395
YEAP YU XIAN 17203286
1. Introduction
In the current era of data-driven decision making, businesses continuously strive to understand and enhance customer value. The project titled “Customer Value Prediction and Segmentation: Purchasing Amount Predictive Modeling and Potential Customer Identification” is designed to leverage data analytics in deciphering customer behaviors and preferences. This endeavor involves utilizing advanced statistical models and machine learning techniques not only to predict customer spending but also to segment customers based on their potential value to the business. The overarching aim is to equip the company with actionable insights that enable more effective targeting, personalized marketing, and optimized resource allocation.
1.1 Problems
Prediction of Purchase Amount (Regression Problem):
-Problem Statement: Determining the extent to which demographic patterns, historical purchase data, and purchasing behaviors (such as payment methods and promo code usage) influence a customer’s purchase amount.
-Relevance: Understanding these variables can significantly enhance the accuracy of predictive models, enabling businesses to forecast sales and tailor their marketing strategies to individual customer profiles.
Identification of Potential Customers (Classification Problem):
-Problem Statement: Identifying key customer segments, particularly those who are likely to be high-value, using methods like basket analysis to understand popular product combinations.
-Relevance: This segmentation is critical for targeted marketing efforts, ensuring that resources are focused on the most profitable segments.
1.2 Objectives
Objective for Regression:
Goal: To develop a multiple regression model that accurately predicts the purchase amount a customer is likely to spend, based on a variety of influencing factors.
Delivery Method: Implementing a multiple regression approach, potentially supplemented with significance testing and model fitness evaluation, to uncover key features influencing customer spend.
Expected Outcome: A robust model offering continuous predictions of purchase amounts, coupled with insights into consumer profiles for strategic business planning.
Objective for Classification:
Goal: To classify customers into distinct categories, emphasizing the identification of high-value customers through basket analysis.
Delivery Method: Creating a classification model (such as a decision tree) that integrates basket analysis to discern patterns in high-selling product combinations.
Expected Outcome: A classification system that distinguishes ‘high-value’ customers from others, providing insights into product combinations preferred by these segments for strategic marketing and product development.
2.1 Dataset introduction
Dataset: Customer Shopping Trends
The dataset that we utilize is “Customer Shopping Trends”. This dataset consists of 18 columns and 3,900 observations.
The dataset is saved in CSV format before being imported into R.
2.2 Import dataset
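The import step itself is not shown; a minimal sketch, assuming the file is saved as “shopping_trends.csv” in the working directory (the file name is illustrative):
# Read the CSV into a data frame; base read.csv() turns the original headers
# (e.g. "Purchase Amount (USD)") into the dotted column names seen in str() below.
Shopping_data <- read.csv("shopping_trends.csv", stringsAsFactors = FALSE)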
2.3 Call libraries
library(plyr)
## Warning: package 'plyr' was built under R version 4.3.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.2
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.2
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
library(rsample)
## Warning: package 'rsample' was built under R version 4.3.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
library(arules)
## Warning: package 'arules' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
2.4 Data Exploration
class(Shopping_data)
## [1] "data.frame"
str(Shopping_data)
## 'data.frame': 3900 obs. of 18 variables:
## $ Customer.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 55 19 50 21 45 46 63 27 26 57 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Purchase.Amount..USD. : int 53 64 73 90 49 20 85 34 97 31 ...
## $ Location : chr "Kentucky" "Maine" "Massachusetts" "Rhode Island" ...
## $ Size : chr "L" "L" "S" "M" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Review.Rating : num 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "Express" "Express" "Free Shipping" "Next Day Air" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Previous.Purchases : int 14 2 23 49 31 14 49 19 8 4 ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Frequency.of.Purchases: chr "Fortnightly" "Fortnightly" "Weekly" "Weekly" ...
dim(Shopping_data)
## [1] 3900 18
summary(Shopping_data)
## Customer.ID Age Gender Item.Purchased
## Min. : 1.0 Min. :18.00 Length:3900 Length:3900
## 1st Qu.: 975.8 1st Qu.:31.00 Class :character Class :character
## Median :1950.5 Median :44.00 Mode :character Mode :character
## Mean :1950.5 Mean :44.07
## 3rd Qu.:2925.2 3rd Qu.:57.00
## Max. :3900.0 Max. :70.00
## Category Purchase.Amount..USD. Location Size
## Length:3900 Min. : 20.00 Length:3900 Length:3900
## Class :character 1st Qu.: 39.00 Class :character Class :character
## Mode :character Median : 60.00 Mode :character Mode :character
## Mean : 59.76
## 3rd Qu.: 81.00
## Max. :100.00
## Color Season Review.Rating Subscription.Status
## Length:3900 Length:3900 Min. :2.50 Length:3900
## Class :character Class :character 1st Qu.:3.10 Class :character
## Mode :character Mode :character Median :3.70 Mode :character
## Mean :3.75
## 3rd Qu.:4.40
## Max. :5.00
## Shipping.Type Discount.Applied Promo.Code.Used Previous.Purchases
## Length:3900 Length:3900 Length:3900 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:13.00
## Mode :character Mode :character Mode :character Median :25.00
## Mean :25.35
## 3rd Qu.:38.00
## Max. :50.00
## Payment.Method Frequency.of.Purchases
## Length:3900 Length:3900
## Class :character Class :character
## Mode :character Mode :character
##
##
##
colSums(is.na(Shopping_data))
## Customer.ID Age Gender
## 0 0 0
## Item.Purchased Category Purchase.Amount..USD.
## 0 0 0
## Location Size Color
## 0 0 0
## Season Review.Rating Subscription.Status
## 0 0 0
## Shipping.Type Discount.Applied Promo.Code.Used
## 0 0 0
## Previous.Purchases Payment.Method Frequency.of.Purchases
## 0 0 0
sum(duplicated(Shopping_data))
## [1] 0
2.4.1 Data Transformation
2.4.1.1 Age Group
Shopping_data <- Shopping_data %>% mutate(
Age_Group=ifelse(Age>=18 & Age<=24,"18 to 24",
ifelse(Age>=25 & Age<=34,"25 to 34",
ifelse(Age>=35 & Age<=44,"35 to 44",
ifelse(Age>=45 & Age<=54,"45 to 54",
ifelse(Age>=55 & Age<=64,"55 to 64",
ifelse(Age>=65,"65 or older",NA)))))))
Shopping_data <- Shopping_data %>%
select(1:(which(names(Shopping_data) == "Gender")),
"Age_Group",
(which(names(Shopping_data) == "Gender") + 1):length(names(Shopping_data)))
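An equivalent sketch of the same two steps (age banding and column placement) using dplyr::case_when() and relocate(); this assumes dplyr >= 1.0 and is not part of the original pipeline:
# Same banding as the nested ifelse() above (ages in this data start at 18),
# then move the new column directly after Gender
Shopping_data <- Shopping_data %>%
  mutate(Age_Group = case_when(
    Age <= 24 ~ "18 to 24",
    Age <= 34 ~ "25 to 34",
    Age <= 44 ~ "35 to 44",
    Age <= 54 ~ "45 to 54",
    Age <= 64 ~ "55 to 64",
    TRUE      ~ "65 or older")) %>%
  relocate(Age_Group, .after = Gender)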
2.4.1.2 Purchasing Frequency Group
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(
Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly" | Frequency.of.Purchases=="Fortnightly" |Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Lower",NA))))
2.4.1.3 Previous Purchase Group
Shopping_data <- Shopping_data %>% mutate(
Previous.Purchases.Group = ifelse(Previous.Purchases>=1 & Previous.Purchases<=10,"1 to 10",
ifelse(Previous.Purchases>=11 & Previous.Purchases<=20,"11 to 20",
ifelse(Previous.Purchases>=21 & Previous.Purchases<=30,"21 to 30",
ifelse(Previous.Purchases>=31 & Previous.Purchases<=40,"31 to 40",
ifelse(Previous.Purchases>=41 & Previous.Purchases<=50,"41 to 50",NA))))))
2.4.1.4 Identify Customer Class
Shopping_data<- Shopping_data %>%
mutate(
Customer_class=ifelse((Purchase.Amount..USD.>=60 & Purchases.Frequency.Group=="High"),"High Value",
"Non High Value"))
2.5 Remove unnecessary columns
Shopping_data <- subset(Shopping_data, select = -c(Customer.ID,Review.Rating, Size))
2.6 Export cleaned dataset
write_csv(Shopping_data, file = "Shopping Trend Dataset (Cleaned).csv")
3.1 Univariate Analysis
3.1.1 Purchase Amount (num)
ggplot(Shopping_data, aes(x = Purchase.Amount..USD., fill = ..count..)) +
geom_histogram(binwidth = 10, color = "white", aes(y = ..count..), alpha = 0.7) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Histogram of Purchase Amount",
x = "Purchase Amount (USD)",
y = "Frequency")
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
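A sketch of the same histogram written with the after_stat() syntax recommended by the deprecation warning above (the plot itself is unchanged):
# Identical plot, using the non-deprecated after_stat(count) notation
ggplot(Shopping_data, aes(x = Purchase.Amount..USD.)) +
  geom_histogram(aes(fill = after_stat(count)), binwidth = 10, color = "white", alpha = 0.7) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "Histogram of Purchase Amount",
       x = "Purchase Amount (USD)",
       y = "Frequency")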
3.1.2 age group (cat)
Age_cat <- as.data.frame(table(Shopping_data$Age_Group))
names(Age_cat) <- c("Age", "freq")
pie(Age_cat$freq, labels = Age_cat$Age , col = c("pink", "yellow", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Age Distribution")
3.1.3 gender (cat)
gender_cat <- as.data.frame(table(Shopping_data$Gender))
names(gender_cat) <- c("Gender", "freq")
pie(gender_cat$freq, labels = gender_cat$Gender , col = c("pink", "lightblue"), main = "Gender Distribution")
3.1.5 subscription status (cat)
substatus_cat <- as.data.frame(table(Shopping_data$Subscription.Status))
names(substatus_cat) <- c("Status", "freq")
pie(substatus_cat$freq, labels = substatus_cat$Status , col = c("rosybrown", "lightyellow"), main = "Subscription Status")
3.1.6 discount applied (cat)
disc_cat <- as.data.frame(table(Shopping_data$Discount.Applied))
names(disc_cat) <- c("Discount", "freq")
pie(disc_cat$freq, labels = disc_cat$Discount , col = c("lightblue", "lightyellow"), main = "Discount Application")
3.1.8 previous purchase group
purchasegrp_cat <- as.data.frame(table(Shopping_data$Previous.Purchases.Group))
names(purchasegrp_cat) <- c("Group", "freq")
pie(purchasegrp_cat$freq, labels = purchasegrp_cat$Group , col = c("pink", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Previous Purchases Category")
3.1.9 frequency of purchases (cat)
# Identify the most common purchase frequency so it can be highlighted
freq_counts <- table(Shopping_data$Frequency.of.Purchases)
most_common <- names(freq_counts)[which.max(freq_counts)]
ggplot(Shopping_data, aes(x = Frequency.of.Purchases,
                          fill = Frequency.of.Purchases == most_common)) +
geom_bar() +
scale_fill_manual(values = c("FALSE" = "darkblue", "TRUE" = "lightblue"), guide = "none") +
labs(title = "Bar Chart of Frequency of Purchases",
x = "Frequency of Purchases",
y = "Count")
3.1.10 Purchase Frequency Group
purchasefreq_cat <- as.data.frame(table(Shopping_data$Purchases.Frequency.Group))
names(purchasefreq_cat) <- c("Group", "freq")
pie(purchasefreq_cat$freq, labels = purchasefreq_cat$Group , col = c("pink", "lightblue","lightyellow"), main = "Purchases Frequency")
3.1.11 Customer Class
custclass_cat <- as.data.frame(table(Shopping_data$Customer_class))
names(custclass_cat) <- c("Class", "freq")
pie(custclass_cat$freq, labels = custclass_cat$Class , col = c("gold","lightgrey"), main = "Customer Class")
3.2 Bivariate Analysis
3.2.1 purchased amount and gender
ggplot(Shopping_data, aes(x = Gender, y = Purchase.Amount..USD.,fill=Gender)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Gender", x="Gender", y="Mean Purchase Amount(USD)",fill="Gender")
3.2.2 purchased amount and historical purchase data
ggplot(Shopping_data, aes(x = Previous.Purchases.Group, y = Purchase.Amount..USD.,fill=Previous.Purchases.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Previous Purchases Group", x="Previous Purchases Group",
y="Mean Purchase Amount(USD)",fill="Previous Purchases Group")
3.2.3 purchased amount and frequency of purchases
ggplot(Shopping_data, aes(x = Purchases.Frequency.Group, y = Purchase.Amount..USD.,fill=Purchases.Frequency.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Purchases Frequency Group", x="Purchases Frequency Group",
y="Mean Purchase Amount(USD)",fill="Purchases Frequency Group")
3.2.4 purchased amount and season
ggplot(Shopping_data, aes(x = Season, y = Purchase.Amount..USD.,fill=Season)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Season", x="Season", y="Mean Purchase Amount(USD)",fill="Season")
3.2.5 purchased amount and Subscription Status
ggplot(Shopping_data, aes(x = Subscription.Status, y = Purchase.Amount..USD.,fill=Subscription.Status)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Subscription Status", x="Subscription Status",
y="Mean Purchase Amount(USD)",fill="Subscription Status")
3.2.6 purchased amount and Shipping.Type
ggplot(Shopping_data, aes(x = Shipping.Type, y = Purchase.Amount..USD.,fill=Shipping.Type)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Shipping Type", x="Shipping Type", y="Mean Purchase Amount(USD)",fill="Shipping Type")
3.2.7 purchased amount and Discount Applied
ggplot(Shopping_data, aes(x = Discount.Applied, y = Purchase.Amount..USD.,fill=Discount.Applied)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Discount Applied", x="Discount Applied", y="Mean Purchase Amount(USD)",fill="Discount Applied")
3.2.8 purchased amount and Payment Method
ggplot(Shopping_data, aes(x = Payment.Method, y = Purchase.Amount..USD.,fill=Payment.Method)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Payment Method", x="Payment Method", y="Mean Purchase Amount(USD)",fill="Payment Method")
3.2.9 purchased amount and Customer Class
ggplot(Shopping_data, aes(x = Customer_class, y = Purchase.Amount..USD.,fill=Customer_class)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Customer Class", x="Customer Class", y="Mean Purchase Amount(USD)",fill="Customer Class")
3.3 Multivariate Analysis
Shopping_trend <- lm(Purchase.Amount..USD. ~ Gender + Age_Group + Item.Purchased + Category + Color + Season + Subscription.Status + Shipping.Type + Discount.Applied + Payment.Method + Purchases.Frequency.Group + Customer_class , data = Shopping_data)
anova(Shopping_trend)
## Analysis of Variance Table
##
## Response: Purchase.Amount..USD.
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 431 431 1.1231 0.2893159
## Age_Group 5 1274 255 0.6634 0.6512356
## Item.Purchased 24 10956 456 1.1884 0.2396422
## Color 24 19477 812 2.1126 0.0012172 **
## Season 3 6926 2309 6.0101 0.0004419 ***
## Subscription.Status 1 10 10 0.0258 0.8723082
## Shipping.Type 5 3245 649 1.6894 0.1335446
## Discount.Applied 1 239 239 0.6213 0.4306093
## Payment.Method 5 1577 315 0.8212 0.5343410
## Purchases.Frequency.Group 2 185 93 0.2408 0.7859994
## Customer_class 1 672945 672945 1751.8669 < 2.2e-16 ***
## Residuals 3827 1470066 384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.0 Modeling
4.1 Multiple Regression Model
4.1.1 Objective
The objective of building this regression model is to predict the purchase amount a customer is likely to spend, incorporating customer profiles for more accurate targeting.
4.1.2 Problem Statement
How do demographic patterns, historical purchase data and other purchasing behaviors (payment method, promo code utilization) impact a customer’s purchase amount?
4.1.3 Strategy
To discover the features that impact a customer’s purchase amount and to make predictions by creating a multiple regression model using the variables provided.
4.1.4 Expected Outcome
The model will provide a continuous prediction of the purchase amount and insights into consumer profiles, enabling the business to anticipate potential revenue from individual customers and tailor marketing strategies more effectively.
# Performing one hot encoding to all categorical variables
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df_all <- data.frame(predict(dummy, newdata = df))
head(df_all)
## Purchase.Amount..USD. Age GenderFemale GenderMale Age_Group18.to.24
## 1 53 55 0 1 0
## 2 64 19 0 1 1
## 3 73 50 0 1 0
## 4 90 21 0 1 1
## 5 49 45 0 1 0
## 6 20 46 0 1 0
## Age_Group25.to.34 Age_Group35.to.44 Age_Group45.to.54 Age_Group55.to.64
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 1 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## Age_Group65.or.older Item.PurchasedBackpack Item.PurchasedBelt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedBlouse Item.PurchasedBoots Item.PurchasedCoat
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## Item.PurchasedDress Item.PurchasedGloves Item.PurchasedHandbag
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedHat Item.PurchasedHoodie Item.PurchasedJacket
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedJeans Item.PurchasedJewelry Item.PurchasedPants
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSandals Item.PurchasedScarf Item.PurchasedShirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedShoes Item.PurchasedShorts Item.PurchasedSkirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSneakers Item.PurchasedSocks Item.PurchasedSunglasses
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## Item.PurchasedSweater Item.PurchasedT.shirt CategoryAccessories
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## CategoryClothing CategoryFootwear CategoryOuterwear LocationAlabama
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 0 1 0 0
## 5 1 0 0 0
## 6 0 1 0 0
## LocationAlaska LocationArizona LocationArkansas LocationCalifornia
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationColorado LocationConnecticut LocationDelaware LocationFlorida
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationGeorgia LocationHawaii LocationIdaho LocationIllinois LocationIndiana
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationIowa LocationKansas LocationKentucky LocationLouisiana LocationMaine
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationMaryland LocationMassachusetts LocationMichigan LocationMinnesota
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationMississippi LocationMissouri LocationMontana LocationNebraska
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNevada LocationNew.Hampshire LocationNew.Jersey LocationNew.Mexico
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNew.York LocationNorth.Carolina LocationNorth.Dakota LocationOhio
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationOklahoma LocationOregon LocationPennsylvania LocationRhode.Island
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 1 0 0
## 6 0 0 0 0
## LocationSouth.Carolina LocationSouth.Dakota LocationTennessee LocationTexas
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationUtah LocationVermont LocationVirginia LocationWashington
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationWest.Virginia LocationWisconsin LocationWyoming ColorBeige ColorBlack
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 1 0 0
## ColorBlue ColorBrown ColorCharcoal ColorCyan ColorGold ColorGray ColorGreen
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## ColorIndigo ColorLavender ColorMagenta ColorMaroon ColorOlive ColorOrange
## 1 0 0 0 0 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorPeach ColorPink ColorPurple ColorRed ColorSilver ColorTeal
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorTurquoise ColorViolet ColorWhite ColorYellow SeasonFall SeasonSpring
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 1
## 4 0 0 0 0 0 1
## 5 1 0 0 0 0 1
## 6 0 0 1 0 0 0
## SeasonSummer SeasonWinter Subscription.StatusNo Subscription.StatusYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 1 0 0 1
## Shipping.Type2.Day.Shipping Shipping.TypeExpress Shipping.TypeFree.Shipping
## 1 0 1 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## Shipping.TypeNext.Day.Air Shipping.TypeStandard Shipping.TypeStore.Pickup
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 1 0
## Discount.AppliedNo Discount.AppliedYes Promo.Code.UsedNo Promo.Code.UsedYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 1 0 1
## 4 0 1 0 1
## 5 0 1 0 1
## 6 0 1 0 1
## Previous.Purchases Payment.MethodBank.Transfer Payment.MethodCash
## 1 14 0 0
## 2 2 0 1
## 3 23 0 0
## 4 49 0 0
## 5 31 0 0
## 6 14 0 0
## Payment.MethodCredit.Card Payment.MethodDebit.Card Payment.MethodPayPal
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 0
## Payment.MethodVenmo Frequency.of.PurchasesAnnually
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 1
## 6 1 0
## Frequency.of.PurchasesBi.Weekly Frequency.of.PurchasesFortnightly
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesMonthly Frequency.of.PurchasesQuarterly
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesWeekly Purchases.Frequency.GroupHigh
## 1 0 1
## 2 0 1
## 3 1 1
## 4 1 1
## 5 0 0
## 6 1 1
## Purchases.Frequency.GroupLower Purchases.Frequency.GroupNormal
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 0 0
## Previous.Purchases.Group1.to.10 Previous.Purchases.Group11.to.20
## 1 0 1
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 1
## Previous.Purchases.Group21.to.30 Previous.Purchases.Group31.to.40
## 1 0 0
## 2 0 0
## 3 1 0
## 4 0 0
## 5 0 1
## 6 0 0
## Previous.Purchases.Group41.to.50 Customer_classHigh.Value
## 1 0 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 0 0
## 6 0 0
## Customer_classNon.High.Value
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Purpose: Transform all categorical variables into numerical dummy (one-hot encoded) variables before performing the correlation calculation.
# Correlation Calculation
correlation_matrix <- as.data.frame(cor(df_all))
# correlation_matrix
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>=0.03
Purpose: To check the correlation between each variable and the target, and to identify variables whose correlation might be improved through feature engineering.
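A short sketch (not in the original script) listing the strongest absolute correlations with the target, which is what the threshold above screens for:
# Rank variables by absolute correlation with Purchase.Amount..USD.
target_cor <- correlation_matrix$Purchase.Amount..USD.
names(target_cor) <- rownames(correlation_matrix)
head(sort(abs(target_cor), decreasing = TRUE), 10)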
# Combining Locations based on Regions in USA
northeast = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania','Northeast')
midwest = c('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota','MidWest')
south = c('Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas','South')
west = c('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington','West')
Shopping_data <- Shopping_data %>% mutate(Location = ifelse(Location %in% northeast,'Northeast',
ifelse(Location %in% midwest,'MidWest',
ifelse(Location %in% south,'South',
ifelse(Location %in% west,'West','')))))
# Checking Correlation Between Target and Newly Augmented Data
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Location"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. LocationMidWest LocationNortheast
## Purchase.Amount..USD. 1.000000000 -0.01052207 -0.009970438
## LocationMidWest -0.010522075 1.00000000 -0.257040606
## LocationNortheast -0.009970438 -0.25704061 1.000000000
## LocationSouth -0.015834876 -0.39100423 -0.317815628
## LocationWest 0.035718023 -0.33421880 -0.271659356
## LocationSouth LocationWest
## Purchase.Amount..USD. -0.01583488 0.03571802
## LocationMidWest -0.39100423 -0.33421880
## LocationNortheast -0.31781563 -0.27165936
## LocationSouth 1.00000000 -0.41324193
## LocationWest -0.41324193 1.00000000
Result: the strongest correlation with the target is about 0.036, from LocationWest.
# Transform Frequency of Purchases to Low, Normal, High
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly"|Frequency.of.Purchases=="Fortnightly"|Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Low",NA))))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Purchases.Frequency.Group"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD.
## Purchase.Amount..USD. 1.000000000
## Purchases.Frequency.GroupHigh -0.006660787
## Purchases.Frequency.GroupLow 0.007154926
## Purchases.Frequency.GroupNormal 0.001519443
## Purchases.Frequency.GroupHigh
## Purchase.Amount..USD. -0.006660787
## Purchases.Frequency.GroupHigh 1.000000000
## Purchases.Frequency.GroupLow -0.350937188
## Purchases.Frequency.GroupNormal -0.744108314
## Purchases.Frequency.GroupLow
## Purchase.Amount..USD. 0.007154926
## Purchases.Frequency.GroupHigh -0.350937188
## Purchases.Frequency.GroupLow 1.000000000
## Purchases.Frequency.GroupNormal -0.364434493
## Purchases.Frequency.GroupNormal
## Purchase.Amount..USD. 0.001519443
## Purchases.Frequency.GroupHigh -0.744108314
## Purchases.Frequency.GroupLow -0.364434493
## Purchases.Frequency.GroupNormal 1.000000000
Result: the strongest correlation is only about 0.007, from Purchases.Frequency.GroupLow.
# Transform Shipping Type into two groups: High (Express) and Low (all other types)
Shopping_data<- Shopping_data %>% mutate(Shipping.Type = ifelse(Shipping.Type=="Express","High","Low"))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Shipping.Type"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. Shipping.TypeHigh Shipping.TypeLow
## Purchase.Amount..USD. 1.00000000 0.01337441 -0.01337441
## Shipping.TypeHigh 0.01337441 1.00000000 -1.00000000
## Shipping.TypeLow -0.01337441 -1.00000000 1.00000000
Result: the strongest correlation is about 0.013, from Shipping.TypeHigh.
# Select Re-engineered Features based on correlation values
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df <- data.frame(predict(dummy, newdata = df))
correlation_matrix <- as.data.frame(cor(df))
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>0.03
df = df[,sel_index]
# df
Purpose: Select the variables whose absolute correlation with the target exceeds 0.03 (roughly the strongest values found). Even these cannot be considered well correlated with the purchase amount, so all of the selected variables were carried into the model building, training and testing phases; only the variables derived from the target column (Customer_class) were removed.
# Split data into "Train" and "Test"
set.seed(1)
trainIndex <- createDataPartition(df$Purchase.Amount..USD., p = 0.8, list = FALSE)
# Forming train data
train <- df[trainIndex, ]
# Removing columns derived from target column
train <- subset(train, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
# Forming test data
test <- df[-trainIndex, ]
# Removing columns derived from target column
test <- subset(test, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
Purpose: Prepare data to train and test during model building.
# Create the linear regression
lmPurchaseAmount <-lm(train, data = train)
#Review the results
summary(lmPurchaseAmount)
##
## Call:
## lm(formula = train, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Predict data using the model & review result
predictions_LR <- predict(lmPurchaseAmount, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_LR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_LR)^2))
RMSE_LR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_LR <- mae(test$Purchase.Amount..USD.,predictions_LR)
MAE_LR
## [1] 20.46703
# Calculate r^2
R2_LR <- R2(test$Purchase.Amount..USD.,predictions_LR)
R2_LR
## [1] 0.009135115
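The same three metrics are recomputed for every model below; a small hypothetical helper (the function name and wrapper are illustrative, not part of the original script) that bundles them, with R2 computed as the squared correlation, consistent with the values reported above:
# Hypothetical helper: RMSE, MAE and R^2 for a vector of predictions
eval_regression <- function(actual, predicted) {
  c(RMSE = sqrt(mean((actual - predicted)^2)),
    MAE  = mean(abs(actual - predicted)),
    R2   = cor(actual, predicted)^2)
}
# Example: eval_regression(test$Purchase.Amount..USD., predictions_LR)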
# Create the random forest
set.seed(123)
RF_model<-randomForest(formula=Purchase.Amount..USD.~.,data=train)
# Predict data using the model
predictions_RF <- predict(RF_model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_RF <- sqrt(mean((test$Purchase.Amount..USD. - predictions_RF)^2))
RMSE_RF
## [1] 23.65286
# Calculate Mean Absolute Error (MAE)
MAE_RF <- mae(test$Purchase.Amount..USD.,predictions_RF)
MAE_RF
## [1] 20.47543
# Calculate r^2
R2_RF <- R2(test$Purchase.Amount..USD.,predictions_RF)
R2_RF
## [1] 0.009238454
# Create the decision tree
fit <- rpart(Purchase.Amount..USD.~.,
method = "anova", data = train, control =rpart.control(minsplit =1,minbucket=1, cp=0))
png(file = "decTreeGFG.png", width = 100,
height = 10)
# Predict data using the model
predictions_DT<-predict(fit, test, method = "anova")
# Calculate the root mean squared error (RMSE)
RMSE_DT <- sqrt(mean((test$Purchase.Amount..USD. - predictions_DT)^2))
RMSE_DT
## [1] 23.70689
# Calculate Mean Absolute Error (MAE)
MAE_DT <- mae(test$Purchase.Amount..USD.,predictions_DT)
MAE_DT
## [1] 20.48466
# Calculate r^2
R2_DT <- R2(test$Purchase.Amount..USD.,predictions_DT)
R2_DT
## [1] 0.00563934
# Create the SVM Regressor
model <- train(Purchase.Amount..USD.~., data = train, method = "svmLinear")
# Predict data using the model
predictions_SVM <- predict(model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_SVM <- sqrt(mean((test$Purchase.Amount..USD. - predictions_SVM)^2))
RMSE_SVM
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_SVM <- mae(test$Purchase.Amount..USD.,predictions_SVM)
MAE_SVM
## [1] 20.46703
# Calculate r^2
R2_SVM <- R2(test$Purchase.Amount..USD.,predictions_SVM)
R2_SVM
## [1] 0.009135115
# Create the polynomial regression (note: no higher-degree terms are added here, so this fits the same linear model on the selected features)
PurchaseAmount_PR <-lm(Purchase.Amount..USD. ~., data = train)
#Review the results
summary(PurchaseAmount_PR)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Make predictions using the testing data
predictions_PR <- predict(PurchaseAmount_PR, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_PR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_PR)^2))
RMSE_PR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_PR <- mae(test$Purchase.Amount..USD.,predictions_PR)
MAE_PR
## [1] 20.46703
# Calculate r^2
R2_PR <- R2(test$Purchase.Amount..USD.,predictions_PR)
R2_PR
## [1] 0.009135115
# Perform gradient boosting on Linear Regression Model
boost = gbm(Purchase.Amount..USD. ~ ., data = train,
distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
verbose = FALSE, n.cores = 1)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
best.iter <- gbm.perf(boost, method = "test")
# Make predictions using the testing data
predictions_GBLR <- predict(boost, newdata = test, n.trees = best.iter, type = "link")
RMSE_GBLR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_GBLR)^2))
RMSE_GBLR
## [1] 23.65238
# Calculate Mean Absolute Error (MAE)
MAE_GBLR <- mae(test$Purchase.Amount..USD.,predictions_GBLR)
MAE_GBLR
## [1] 20.46703
# Calculate r^2
R2_GBLR <- R2(test$Purchase.Amount..USD.,predictions_GBLR)
R2_GBLR
## [1] 0.009135115
Purpose: Attempt to boost the performance of the linear regression model, since it was the model chosen at the beginning of the project.
4.1.5 Evaluation
# Create evaluation metrics based on RMSE, MAE and R2 from all models built
RMSE <- c(27.90801, 24.04091, 33.84289, 24.08463, 24.08463, 23.7658)
MAE <- c(20.72025, 20.69969, 27.63744, 20.72025, 20.72025, 20.72025)
R2 <- c(1.95E-05, 0.000300061, 1.25E-06, 1.95E-05, 1.95E-05, 1.95E-05)
df_EM <- data.frame(RMSE, MAE, R2)
rownames(df_EM) <- c("Linear Regression", "Random Forest", "Decision Tree", "Support Vector Machine", "Polynomial Regression", "Linear Regression (after gradient boosting)")
df_EM
## RMSE MAE R2
## Linear Regression 27.90801 20.72025 0.000019500
## Random Forest 24.04091 20.69969 0.000300061
## Decision Tree 33.84289 27.63744 0.000001250
## Support Vector Machine 24.08463 20.72025 0.000019500
## Polynomial Regression 24.08463 20.72025 0.000019500
## Linear Regression (after gradient boosting) 23.76580 20.72025 0.000019500
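The metrics above are typed in by hand; a sketch assembling the same table directly from the metric objects computed earlier, so the table cannot drift from the model outputs:
# Build the comparison table from the stored metric objects
df_EM <- data.frame(
  RMSE = c(RMSE_LR, RMSE_RF, RMSE_DT, RMSE_SVM, RMSE_PR, RMSE_GBLR),
  MAE  = c(MAE_LR,  MAE_RF,  MAE_DT,  MAE_SVM,  MAE_PR,  MAE_GBLR),
  R2   = c(R2_LR,   R2_RF,   R2_DT,   R2_SVM,   R2_PR,   R2_GBLR),
  row.names = c("Linear Regression", "Random Forest", "Decision Tree",
                "Support Vector Machine", "Polynomial Regression",
                "Linear Regression (after gradient boosting)")
)
df_EM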
From the evaluation metrics above, if we do not take the gradient-boosted linear regression into account, the model that works best for predicting the customer purchase amount is the Random Forest model.
However, because linear regression was the model originally targeted for predicting the customer purchase amount, gradient boosting was applied to it to improve its performance. After boosting, the linear model clearly performs the best of all the models built.
4.1.6 Regression Model Summary
In short, without considering the gradient boosting result, the Random Forest model would be the best choice for predicting the customer purchase amount. However, a model’s performance can be improved with gradient boosting, in which the final prediction is built up from an ensemble of many sequentially fitted models.
The main reason the different regression models perform below average is the lack of correlation between the variables in the dataset and the purchase amount, as was clearly visible in the correlation analysis carried out before model building.
In conclusion, for a regression model to perform better, at least some of the variables in the collected dataset must have a reasonably strong relationship with the target.
4.2 Classification model
4.2.1 Objective
The objective of building this classification model is to identify high-value customers, and to use basket analysis to understand high-selling product combinations.
4.2.2 Problem Statement
Who are the company’s target customers? The aim is to separate customers into categories (high value or non-high value) to help the business find its target marketing audience.
4.2.3 Strategy
To discover the features that impact the customer class and to make predictions by building a classification model using the selected features.
4.2.4 Expected Outcome
The model will classify each customer as ‘high-value’ or ‘not high-value’, and basket analysis will provide insights into product combinations that are most attractive to targeted customer group. This allows the business to focus marketing efforts and tailor its product offerings more strategically.
Step 1: Feature Selection - Eliminate unnecessary or redundant variables
temp1_df <-subset(Shopping_data, select = -c(Age,Previous.Purchases,Purchase.Amount..USD.,Purchases.Frequency.Group, Frequency.of.Purchases))
# temp1_df
colnames(temp1_df)
## [1] "Gender" "Age_Group"
## [3] "Item.Purchased" "Category"
## [5] "Location" "Color"
## [7] "Season" "Subscription.Status"
## [9] "Shipping.Type" "Discount.Applied"
## [11] "Promo.Code.Used" "Payment.Method"
## [13] "Previous.Purchases.Group" "Customer_class"
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age_Group : chr "55 to 64" "18 to 24" "45 to 54" "18 to 24" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Location : chr "South" "Northeast" "Northeast" "Northeast" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "High" "High" "Low" "Low" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Previous.Purchases.Group: chr "11 to 20" "1 to 10" "21 to 30" "41 to 50" ...
## $ Customer_class : chr "Non High Value" "High Value" "High Value" "High Value" ...
Check class distribution
# Convert character variables to factors
char_vars <- sapply(temp1_df, is.character)
temp1_df[char_vars] <- lapply(temp1_df[char_vars], as.factor)
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age_Group : Factor w/ 6 levels "18 to 24","25 to 34",..: 5 1 4 1 4 4 5 2 2 5 ...
## $ Item.Purchased : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
## $ Category : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
## $ Location : Factor w/ 4 levels "MidWest","Northeast",..: 3 2 2 2 4 4 4 3 3 1 ...
## $ Color : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
## $ Subscription.Status : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Shipping.Type : Factor w/ 2 levels "High","Low": 1 1 2 2 2 2 2 2 1 2 ...
## $ Discount.Applied : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Promo.Code.Used : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Payment.Method : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
## $ Previous.Purchases.Group: Factor w/ 5 levels "1 to 10","11 to 20",..: 2 1 3 5 4 2 5 2 1 1 ...
## $ Customer_class : Factor w/ 2 levels "High Value","Non High Value": 2 1 1 1 2 2 2 2 2 2 ...
table(temp1_df$Customer_class)
##
## High Value Non High Value
## 819 3081
Use SMOTE to balance the dataset
set.seed(123)
# Apply SMOTE to handle class imbalance
resampled_data <- SMOTE(Customer_class ~ ., data = temp1_df, perc.over = 300, perc.under = 100, k = 5)
# Display the head of the resampled data
head(resampled_data)
## Gender Age_Group Item.Purchased Category Location Color Season
## 2820 Female 65 or older Shirt Clothing South Lavender Spring
## 49 Male 35 to 44 Shirt Clothing West Orange Summer
## 423 Male 45 to 54 Pants Clothing MidWest Purple Fall
## 3080 Female 45 to 54 Belt Accessories West Maroon Fall
## 2745 Female 35 to 44 Shirt Clothing West Silver Winter
## 2961 Female 55 to 64 Coat Outerwear MidWest Olive Spring
## Subscription.Status Shipping.Type Discount.Applied Promo.Code.Used
## 2820 No Low No No
## 49 Yes Low Yes Yes
## 423 Yes Low Yes Yes
## 3080 No Low No No
## 2745 No Low No No
## 2961 No Low No No
## Payment.Method Previous.Purchases.Group Customer_class
## 2820 Cash 11 to 20 Non High Value
## 49 Cash 31 to 40 Non High Value
## 423 Debit Card 31 to 40 Non High Value
## 3080 Bank Transfer 1 to 10 Non High Value
## 2745 Credit Card 31 to 40 Non High Value
## 2961 Credit Card 41 to 50 Non High Value
table(resampled_data$Customer_class)
##
## High Value Non High Value
## 3276 2457
Step 2: Data Splitting - Split data into training and testing sets with stratified sampling
set.seed(123)
# Split the data into training and testing sets with stratified sampling
splits <- initial_split(resampled_data, prop = 0.8, strata = "Customer_class")
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 4585 14
dim(test_data)
## [1] 1148 14
# Displays the ratio of rows in the training set compared to the original data
dim(train_data) / dim(resampled_data)
## [1] 0.7997558 1.0000000
# Displays the proportions of each class in the training and testing sets.
prop.table(table(train_data$Customer_class))
##
## High Value Non High Value
## 0.5714286 0.4285714
prop.table(table(test_data$Customer_class))
##
## High Value Non High Value
## 0.5714286 0.4285714
# Displays summary statistics
summary(train_data$Customer_class)
## High Value Non High Value
## 2620 1965
summary(test_data$Customer_class)
## High Value Non High Value
## 656 492
Step 3: Classification Model Selection
1a. Model Training
set.seed(123)
train_data$Customer_class <- as.factor(train_data$Customer_class)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Gender 66.23866
## Age_Group 155.99608
## Item.Purchased 440.53226
## Category 84.14500
## Location 105.15283
## Color 451.23273
## Season 103.81943
## Subscription.Status 58.28065
## Shipping.Type 141.37565
## Discount.Applied 110.15461
## Promo.Code.Used 130.63489
## Payment.Method 154.86953
## Previous.Purchases.Group 131.66865
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_data)
1b. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$Customer_class)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 476 44
## Non High Value 180 448
##
## Accuracy : 0.8049
## 95% CI : (0.7807, 0.8274)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6149
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.7256
## Specificity : 0.9106
## Pos Pred Value : 0.9154
## Neg Pred Value : 0.7134
## Prevalence : 0.5714
## Detection Rate : 0.4146
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.8181
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.9153846
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7256098
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.8095238
roc_curve <- roc(test_data$Customer_class, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.8181
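The ROC curve above is built from hard class labels, which gives a single-threshold curve; a sketch (not in the original script) using the class probabilities from the same random forest usually gives a smoother, more informative curve:
# ROC from predicted probabilities of the "High Value" class
rf_prob <- predict(rf_model, test_data, type = "prob")[, "High Value"]
roc_prob <- roc(test_data$Customer_class, rf_prob,
                levels = c("Non High Value", "High Value"))
auc(roc_prob)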
1c. Test random forest with selected features
set.seed(123)
train_data$Customer_class <- as.factor(train_data$Customer_class)
rf_model<-randomForest(formula=Customer_class~ Item.Purchased + Color ,data=train_data)
# rf_model
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Item.Purchased 285.8403
## Color 258.7243
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_data)
1d. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_data$Customer_class)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 490 260
## Non High Value 166 232
##
## Accuracy : 0.6289
## 95% CI : (0.6002, 0.6569)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 4.227e-05
##
## Kappa : 0.2238
##
## Mcnemar's Test P-Value : 6.610e-06
##
## Sensitivity : 0.7470
## Specificity : 0.4715
## Pos Pred Value : 0.6533
## Neg Pred Value : 0.5829
## Prevalence : 0.5714
## Detection Rate : 0.4268
## Detection Prevalence : 0.6533
## Balanced Accuracy : 0.6092
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6533333
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7469512
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.6970128
roc_curve <- roc(test_data$Customer_class, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.3908
2a. Model Training
set.seed(123)
# Train a decision tree model
model <- rpart(Customer_class ~ Gender + Age_Group + Item.Purchased + Location + Subscription.Status + Shipping.Type + Discount.Applied + Promo.Code.Used + Payment.Method + Previous.Purchases.Group,
data = train_data,
method = "class")
# Print the trained model
# print(model)
# Visualize the decision tree
rpart.plot(model, main = "Decision Tree for Customer Classification", extra = 1)
# Make predictions on the test set
dt_predictions <- predict(model, test_data, type = "class")
# Ensure consistent factor levels
test_data$Customer_class <- factor(test_data$Customer_class, levels = levels(dt_predictions))
2b. Model Evaluation
# Evaluate the model
dt_conf_matrix <- confusionMatrix(dt_predictions, test_data$Customer_class)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 448 72
## Non High Value 208 420
##
## Accuracy : 0.7561
## 95% CI : (0.7302, 0.7807)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5187
##
## Mcnemar's Test P-Value : 7.158e-16
##
## Sensitivity : 0.6829
## Specificity : 0.8537
## Pos Pred Value : 0.8615
## Neg Pred Value : 0.6688
## Prevalence : 0.5714
## Detection Rate : 0.3902
## Detection Prevalence : 0.4530
## Balanced Accuracy : 0.7683
##
## 'Positive' Class : High Value
##
precision <- dt_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.8615385
recall <- dt_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.6829268
f1_score <- dt_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.7619048
roc_curve <- roc(test_data$Customer_class, as.numeric(dt_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.7683
3a. Model Training
set.seed(123)
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
# Train the Naive Bayes model
nb_model <- naiveBayes(train_predictors, train_target)
# print(nb_model)
# Predict using the Naive Bayes model
nb_predictions <- predict(nb_model, newdata = test_predictors)
3b. Model Evaluation
# Evaluate the model
nb_conf_matrix <- confusionMatrix(nb_predictions, test_target)
print(nb_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 448 237
## Non High Value 208 255
##
## Accuracy : 0.6124
## 95% CI : (0.5835, 0.6407)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.002683
##
## Kappa : 0.2027
##
## Mcnemar's Test P-Value : 0.184400
##
## Sensitivity : 0.6829
## Specificity : 0.5183
## Pos Pred Value : 0.6540
## Neg Pred Value : 0.5508
## Prevalence : 0.5714
## Detection Rate : 0.3902
## Detection Prevalence : 0.5967
## Balanced Accuracy : 0.6006
##
## 'Positive' Class : High Value
##
accuracy <- nb_conf_matrix$overall["Accuracy"]
precision <- nb_conf_matrix$byClass["Precision"]
recall <- nb_conf_matrix$byClass["Recall"]
f1_score <- nb_conf_matrix$byClass["F1"]
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.6123693
cat("Precision:", precision, "\n")
## Precision: 0.6540146
cat("Recall:", recall, "\n")
## Recall: 0.6829268
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.6681581
roc_curve <- roc(test_data$Customer_class, as.numeric(nb_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6006
4a. Model Training
set.seed(123)
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
# Train the SVM model
svm_model <- svm(Customer_class ~ ., data = train_data, kernel = "radial")
# print(svm_model)
# Predict using the SVM model
svm_predictions <- predict(svm_model, newdata = test_data)
4b. Model Evaluation
# Create a confusion matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_target)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 429 167
## Non High Value 227 325
##
## Accuracy : 0.6568
## 95% CI : (0.6285, 0.6843)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 2.086e-09
##
## Kappa : 0.3098
##
## Mcnemar's Test P-Value : 0.002955
##
## Sensitivity : 0.6540
## Specificity : 0.6606
## Pos Pred Value : 0.7198
## Neg Pred Value : 0.5888
## Prevalence : 0.5714
## Detection Rate : 0.3737
## Detection Prevalence : 0.5192
## Balanced Accuracy : 0.6573
##
## 'Positive' Class : High Value
##
precision_svm <- svm_conf_matrix$byClass["Precision"]
recall_svm <- svm_conf_matrix$byClass["Recall"]
f1_score_svm <- svm_conf_matrix$byClass["F1"]
accuracy_svm <- svm_conf_matrix$overall["Accuracy"]
print(precision_svm)
## Precision
## 0.7197987
print(recall_svm)
## Recall
## 0.6539634
print(f1_score_svm)
## F1
## 0.6853035
print(accuracy_svm)
## Accuracy
## 0.6567944
roc_curve <- roc(test_data$Customer_class, as.numeric(svm_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6573
Step 4: Result Interpretation
results <- data.frame(
Model = c("Random Forest", "Decision Tree", "Naive Bayes", "Support Vector Machines"),
Accuracy = c(0.8049, 0.7561, 0.6124, 0.6568),
Precision = c(0.9154, 0.8615, 0.6540, 0.7198),
Recall = c(0.7256, 0.6829, 0.6829, 0.6540),
F1_Score = c(0.8095, 0.7619, 0.6682, 0.6853),
ROC_AUC = c(0.8181, 0.7683, 0.6006, 0.6573)
)
# Print the table using print
print(results)
## Model Accuracy Precision Recall F1_Score ROC_AUC
## 1 Random Forest 0.8049 0.9154 0.7256 0.8095 0.8181
## 2 Decision Tree 0.7561 0.8615 0.6829 0.7619 0.7683
## 3 Naive Bayes 0.6124 0.6540 0.6829 0.6682 0.6006
## 4 Support Vector Machines 0.6568 0.7198 0.6540 0.6853 0.6573
Overall, the Random Forest model performed best across all metrics.
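The comparison table above is also typed in by hand; a small hypothetical helper (illustrative only) that pulls the same metrics from a caret confusionMatrix object and an AUC value:
# Hypothetical helper: extract the reported metrics from a confusionMatrix and an AUC
metrics_row <- function(cm, auc_value) {
  data.frame(Accuracy  = unname(cm$overall["Accuracy"]),
             Precision = unname(cm$byClass["Precision"]),
             Recall    = unname(cm$byClass["Recall"]),
             F1_Score  = unname(cm$byClass["F1"]),
             ROC_AUC   = as.numeric(auc_value))
}
# Example: metrics_row(dt_conf_matrix, 0.7683)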
Step 5: Basket Analysis - Find popular product combinations among customers
Display the most frequently purchased items
# Convert data to transactions
transactions_data <- as(temp1_df[, "Item.Purchased", drop = FALSE], "transactions")
# Display frequent itemsets
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.01))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 39
##
## create itemset ...
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 3900 column(s)] done [0.00s].
## writing ... [25 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Item.Purchased=Blouse} 0.04384615 171
## [2] {Item.Purchased=Jewelry} 0.04384615 171
## [3] {Item.Purchased=Pants} 0.04384615 171
## [4] {Item.Purchased=Shirt} 0.04333333 169
## [5] {Item.Purchased=Dress} 0.04256410 166
## [6] {Item.Purchased=Sweater} 0.04205128 164
## [7] {Item.Purchased=Jacket} 0.04179487 163
## [8] {Item.Purchased=Belt} 0.04128205 161
## [9] {Item.Purchased=Sunglasses} 0.04128205 161
## [10] {Item.Purchased=Coat} 0.04128205 161
## [11] {Item.Purchased=Sandals} 0.04102564 160
## [12] {Item.Purchased=Socks} 0.04076923 159
## [13] {Item.Purchased=Skirt} 0.04051282 158
## [14] {Item.Purchased=Shorts} 0.04025641 157
## [15] {Item.Purchased=Scarf} 0.04025641 157
## [16] {Item.Purchased=Hat} 0.03948718 154
## [17] {Item.Purchased=Handbag} 0.03923077 153
## [18] {Item.Purchased=Hoodie} 0.03871795 151
## [19] {Item.Purchased=Shoes} 0.03846154 150
## [20] {Item.Purchased=T-shirt} 0.03769231 147
## [21] {Item.Purchased=Sneakers} 0.03717949 145
## [22] {Item.Purchased=Boots} 0.03692308 144
## [23] {Item.Purchased=Backpack} 0.03666667 143
## [24] {Item.Purchased=Gloves} 0.03589744 140
## [25] {Item.Purchased=Jeans} 0.03179487 124
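Because every transaction here contains exactly one purchased item, the frequent itemsets above are simply item frequencies. The same information can be visualised directly with arules (a short sketch using the transactions object created above):
# Bar chart of the 10 most frequently purchased items (absolute counts)
itemFrequencyPlot(transactions_data, topN = 10, type = "absolute",
                  main = "Top 10 Items Purchased")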
Display the most frequently purchased items, combined with category and color, for a more specific view
# Combine columns into a new item column
temp1_df$Combined_Item <- with(temp1_df, paste(Category, Item.Purchased, Color, sep="_"))
# Convert the combined item column to a factor
temp1_df$Combined_Item <- as.factor(temp1_df$Combined_Item)
# Convert data to transactions
transactions_data <- as(temp1_df[, "Combined_Item", drop = FALSE], "transactions")
# Display the first 20 rows of frequent itemsets
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.003))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.003 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 11
##
## create itemset ...
## set transactions ...[625 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating sparse bit matrix ... [22 row(s), 3900 column(s)] done [0.00s].
## writing ... [22 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Combined_Item=Clothing_Skirt_Black} 0.003846154 15
## [2] {Combined_Item=Accessories_Jewelry_Gray} 0.003589744 14
## [3] {Combined_Item=Accessories_Jewelry_Indigo} 0.003333333 13
## [4] {Combined_Item=Clothing_Shorts_Yellow} 0.003333333 13
## [5] {Combined_Item=Clothing_Hoodie_Pink} 0.003333333 13
## [6] {Combined_Item=Clothing_Pants_Charcoal} 0.003076923 12
## [7] {Combined_Item=Footwear_Sneakers_Indigo} 0.003076923 12
## [8] {Combined_Item=Footwear_Sandals_Purple} 0.003076923 12
## [9] {Combined_Item=Clothing_Sweater_Cyan} 0.003076923 12
## [10] {Combined_Item=Clothing_Dress_Charcoal} 0.003076923 12
## [11] {Combined_Item=Accessories_Sunglasses_Lavender} 0.003076923 12
## [12] {Combined_Item=Clothing_Pants_Turquoise} 0.003076923 12
## [13] {Combined_Item=Accessories_Scarf_Violet} 0.003076923 12
## [14] {Combined_Item=Clothing_Blouse_Violet} 0.003076923 12
## [15] {Combined_Item=Accessories_Belt_Blue} 0.003076923 12
## [16] {Combined_Item=Accessories_Sunglasses_Olive} 0.003076923 12
## [17] {Combined_Item=Accessories_Handbag_Charcoal} 0.003076923 12
## [18] {Combined_Item=Clothing_Pants_Cyan} 0.003076923 12
## [19] {Combined_Item=Clothing_Skirt_Teal} 0.003076923 12
## [20] {Combined_Item=Clothing_Sweater_Maroon} 0.003076923 12
## [21] {Combined_Item=Footwear_Boots_Violet} 0.003076923 12
## [22] {Combined_Item=Footwear_Shoes_Maroon} 0.003076923 12
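With 22 combinations returned, sorting by support before inspection makes the top combinations easier to read (a small sketch using arules functions already in use above):
# Show only the 10 highest-support category/item/colour combinations
inspect(head(sort(frequent_itemsets, by = "support"), 10))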
Let’s categorize customers into four groups, “High Value Female”, “High Value Male”, “Non High Value Female”, and “Non High Value Male”, and find out which items are popular within each group.
Display the top items purchased by high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.008))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.008 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2
##
## create itemset ...
## set transactions ...[208 item(s), 255 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating sparse bit matrix ... [6 row(s), 255 column(s)] done [0.00s].
## writing ... [6 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Sunglasses_Lavender} 0.01960784 5
## [2] {Clothing_Blouse_Charcoal} 0.01176471 3
## [3] {Clothing_Skirt_Violet} 0.01176471 3
## [4] {Clothing_Hoodie_Charcoal} 0.01176471 3
## [5] {Accessories_Backpack_Yellow} 0.01176471 3
## [6] {Clothing_Dress_Olive} 0.01176471 3
Display the top items purchased by high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.006))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.006 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3
##
## create itemset ...
## set transactions ...[369 item(s), 564 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating sparse bit matrix ... [9 row(s), 564 column(s)] done [0.00s].
## writing ... [9 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Footwear_Boots_Violet} 0.008865248 5
## [2] {Clothing_Shorts_Yellow} 0.008865248 5
## [3] {Accessories_Jewelry_Orange} 0.007092199 4
## [4] {Clothing_Hoodie_White} 0.007092199 4
## [5] {Clothing_Blouse_Violet} 0.007092199 4
## [6] {Accessories_Jewelry_Gray} 0.007092199 4
## [7] {Footwear_Sandals_Lavender} 0.007092199 4
## [8] {Footwear_Sandals_Cyan} 0.007092199 4
## [9] {Accessories_Sunglasses_Purple} 0.007092199 4
Display the top items purchased by non-high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.005))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 4
##
## create itemset ...
## set transactions ...[483 item(s), 993 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating sparse bit matrix ... [13 row(s), 993 column(s)] done [0.00s].
## writing ... [13 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Outerwear_Jacket_Indigo} 0.006042296 6
## [2] {Accessories_Hat_Magenta} 0.005035247 5
## [3] {Clothing_Blouse_Violet} 0.005035247 5
## [4] {Accessories_Scarf_Magenta} 0.005035247 5
## [5] {Clothing_Shorts_Black} 0.005035247 5
## [6] {Footwear_Sandals_Blue} 0.005035247 5
## [7] {Accessories_Scarf_Green} 0.005035247 5
## [8] {Clothing_Skirt_White} 0.005035247 5
## [9] {Clothing_Socks_Pink} 0.005035247 5
## [10] {Accessories_Backpack_Green} 0.005035247 5
## [11] {Clothing_Skirt_Teal} 0.005035247 5
## [12] {Clothing_Pants_Yellow} 0.005035247 5
## [13] {Clothing_Blouse_Lavender} 0.005035247 5
Display the top items purchased by non-high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.0035))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.0035 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 7
##
## create itemset ...
## set transactions ...[608 item(s), 2088 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating sparse bit matrix ... [12 row(s), 2088 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Handbag_Charcoal} 0.004789272 10
## [2] {Accessories_Backpack_Olive} 0.004310345 9
## [3] {Clothing_Shirt_Orange} 0.004310345 9
## [4] {Clothing_Skirt_Black} 0.004310345 9
## [5] {Clothing_Sweater_Cyan} 0.004310345 9
## [6] {Accessories_Scarf_Blue} 0.003831418 8
## [7] {Footwear_Shoes_Gold} 0.003831418 8
## [8] {Clothing_Hoodie_Pink} 0.003831418 8
## [9] {Clothing_Pants_Turquoise} 0.003831418 8
## [10] {Accessories_Gloves_Turquoise} 0.003831418 8
## [11] {Footwear_Sandals_Orange} 0.003831418 8
## [12] {Accessories_Backpack_Black} 0.003831418 8
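The four group analyses above repeat the same filter-to-transactions-to-eclat pattern. A small helper function (a sketch, not part of the original code) would let the same analysis be rerun for any class, gender, and support threshold:
# Helper: frequent single items for one customer class / gender combination
top_items_by_group <- function(df, class_label, gender_label, min_support) {
  filtered <- subset(df, Customer_class == class_label & Gender == gender_label)
  trans <- as(split(filtered$Combined_Item, seq_along(filtered$Combined_Item)),
              "transactions")
  itemsets <- eclat(trans, parameter = list(supp = min_support))
  inspect(sort(itemsets, by = "support"))
}
# Example: reproduces the high-value female analysis above
# top_items_by_group(temp1_df, "High Value", "Female", 0.008)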