GROUP MEMBERS(GROUP 9)
IDA TEH PING LI 22074634
CHEN XIN WEI 17202767
SUM SI YENG 22068843
YAP ZHI REN 22056395
YEAP YU XIAN 17203286
1. Introduction
In the current era of data-driven decision making, businesses continuously strive to understand and enhance customer value. The project is designed to leverage data analytics in deciphering customer behaviors and preferences. This endeavor involves utilizing advanced statistical models and machine learning techniques to not only predict customer spending but also to segment customers based on their potential value to the business. The overarching aim is to equip the company with actionable insights that enable more effective targeting, personalized marketing, and optimized resource allocation.
1.1 Problems
Prediction of Purchase Amount (Regression Problem):
-Problem Statement: Determining the extent to which demographic patterns, historical purchase data, and purchasing behaviors such as payment methods and promo code usage, influence a customer’s purchase amount.
-Relevance: Understanding these variables can significantly enhance the accuracy of predictive models, enabling businesses to forecast sales and tailor their marketing strategies to individual customer profiles.
Identification of Potential Customers (Classification Problem):
-Problem Statement: Identifying key customer segments, particularly those who are likely to be high-value, using methods like basket analysis to understand popular product combinations.
-Relevance: This segmentation is critical for targeted marketing efforts, ensuring that resources are focused on the most profitable segments.
1.2 Objectives
1.2.1 Objective for Regression: To predict the purchase amount a customer is likely to spend, incorporating customer profiles for more accurate targeting.
Problem Statement: How do demographic patterns, historical purchase data and other purchasing behaviors (payment method, promo code utilization) impact a customer's purchase amount?
Strategy: To discover the features that influence a customer's purchase amount and to make predictions by building a multiple regression model using the variables provided.
Expected Outcome: The model will provide a continuous prediction of the purchase amount and insights into consumer profiles, enabling the business to anticipate potential revenue from individual customers and tailor marketing strategies more effectively.
1.2.2 Objective for Classification: To identify high-value customers, and to use basket analysis to understand high-selling product combinations.
Problem Statement: Who are the company's target customers? The aim is to identify customer groups by separating customers into categories (high value or non-high value), making it easier for the business to find its target marketing audience.
Strategy: To discover the features that influence the customer class and to make predictions by building a classification model using the selected features.
Expected Outcome: The model will classify each customer as 'high-value' or 'not high-value', and basket analysis will provide insights into the product combinations that are most attractive to the targeted customer group. This allows the business to focus its marketing efforts and tailor its product offerings more strategically.
2.1 Dataset introduction
Dataset: Customer Shopping Trends
The dataset that we utilize is “Customer Shopping Trends”. It consists of 18 columns and 3,900 observations.
The dataset is saved in CSV format before being imported into R.
2.2 Import dataset
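The import chunk is not echoed here; a minimal sketch is shown below, assuming the file is named "shopping_trends.csv" (hypothetical) and that base read.csv() was used, which is consistent with the plain data.frame class and the dot-separated column names seen in the data exploration.
# Illustrative import (assumed file name): read the CSV into a data.frame
Shopping_data <- read.csv("shopping_trends.csv")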
2.3 Load Libraries
library(plyr)
## Warning: package 'plyr' was built under R version 4.3.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.2
## corrplot 0.92 loaded
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(rpart)
## Warning: package 'rpart' was built under R version 4.3.2
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.2
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.3.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
library(DMwR)
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
##
## Attaching package: 'DMwR'
## The following object is masked from 'package:plyr':
##
## join
library(rsample)
## Warning: package 'rsample' was built under R version 4.3.2
library(pROC)
## Warning: package 'pROC' was built under R version 4.3.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following object is masked from 'package:Metrics':
##
## auc
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
library(arules)
## Warning: package 'arules' was built under R version 4.3.2
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:kernlab':
##
## size
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
##
## Attaching package: 'e1071'
## The following object is masked from 'package:rsample':
##
## permutations
2.4 Data Exploration
class(Shopping_data)
## [1] "data.frame"
str(Shopping_data)
## 'data.frame': 3900 obs. of 18 variables:
## $ Customer.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : int 55 19 50 21 45 46 63 27 26 57 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Purchase.Amount..USD. : int 53 64 73 90 49 20 85 34 97 31 ...
## $ Location : chr "Kentucky" "Maine" "Massachusetts" "Rhode Island" ...
## $ Size : chr "L" "L" "S" "M" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Review.Rating : num 3.1 3.1 3.1 3.5 2.7 2.9 3.2 3.2 2.6 4.8 ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "Express" "Express" "Free Shipping" "Next Day Air" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Previous.Purchases : int 14 2 23 49 31 14 49 19 8 4 ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Frequency.of.Purchases: chr "Fortnightly" "Fortnightly" "Weekly" "Weekly" ...
dim(Shopping_data)
## [1] 3900 18
summary(Shopping_data)
## Customer.ID Age Gender Item.Purchased
## Min. : 1.0 Min. :18.00 Length:3900 Length:3900
## 1st Qu.: 975.8 1st Qu.:31.00 Class :character Class :character
## Median :1950.5 Median :44.00 Mode :character Mode :character
## Mean :1950.5 Mean :44.07
## 3rd Qu.:2925.2 3rd Qu.:57.00
## Max. :3900.0 Max. :70.00
## Category Purchase.Amount..USD. Location Size
## Length:3900 Min. : 20.00 Length:3900 Length:3900
## Class :character 1st Qu.: 39.00 Class :character Class :character
## Mode :character Median : 60.00 Mode :character Mode :character
## Mean : 59.76
## 3rd Qu.: 81.00
## Max. :100.00
## Color Season Review.Rating Subscription.Status
## Length:3900 Length:3900 Min. :2.50 Length:3900
## Class :character Class :character 1st Qu.:3.10 Class :character
## Mode :character Mode :character Median :3.70 Mode :character
## Mean :3.75
## 3rd Qu.:4.40
## Max. :5.00
## Shipping.Type Discount.Applied Promo.Code.Used Previous.Purchases
## Length:3900 Length:3900 Length:3900 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.:13.00
## Mode :character Mode :character Mode :character Median :25.00
## Mean :25.35
## 3rd Qu.:38.00
## Max. :50.00
## Payment.Method Frequency.of.Purchases
## Length:3900 Length:3900
## Class :character Class :character
## Mode :character Mode :character
##
##
##
colSums(is.na(Shopping_data))
## Customer.ID Age Gender
## 0 0 0
## Item.Purchased Category Purchase.Amount..USD.
## 0 0 0
## Location Size Color
## 0 0 0
## Season Review.Rating Subscription.Status
## 0 0 0
## Shipping.Type Discount.Applied Promo.Code.Used
## 0 0 0
## Previous.Purchases Payment.Method Frequency.of.Purchases
## 0 0 0
sum(duplicated(Shopping_data))
## [1] 0
2.4.1 Data Transformation
2.4.1.1 Age Group
Shopping_data <- Shopping_data %>% mutate(
Age_Group=ifelse(Age>=18 & Age<=24,"18 to 24",
ifelse(Age>=25 & Age<=34,"25 to 34",
ifelse(Age>=35 & Age<=44,"35 to 44",
ifelse(Age>=45 & Age<=54,"45 to 54",
ifelse(Age>=55 & Age<=64,"55 to 64",
ifelse(Age>=65,"65 or older",NA)))))))
Shopping_data <- Shopping_data %>%
select(1:(which(names(Shopping_data) == "Gender")),
"Age_Group",
(which(names(Shopping_data) == "Gender") + 1):length(names(Shopping_data)))
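An equivalent and more compact way to build the same bins (illustrative only; not used in the analysis) is base R's cut():
# Illustrative alternative: the same age bins via cut()
age_group_alt <- cut(Shopping_data$Age,
breaks = c(17, 24, 34, 44, 54, 64, Inf),
labels = c("18 to 24", "25 to 34", "35 to 44", "45 to 54", "55 to 64", "65 or older"))
table(age_group_alt == Shopping_data$Age_Group) # expected to be all TRUE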
2.4.1.2 Purchasing Frequency Group
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(
Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly" | Frequency.of.Purchases=="Fortnightly" |Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Lower",NA))))
2.4.1.3 Previous Purchase Group
Shopping_data <- Shopping_data %>% mutate(
Previous.Purchases.Group = ifelse(Previous.Purchases>=1 & Previous.Purchases<=10,"1 to 10",
ifelse(Previous.Purchases>=11 & Previous.Purchases<=20,"11 to 20",
ifelse(Previous.Purchases>=21 & Previous.Purchases<=30,"21 to 30",
ifelse(Previous.Purchases>=31 & Previous.Purchases<=40,"31 to 40",
ifelse(Previous.Purchases>=41 & Previous.Purchases<=50,"41 to 50",NA))))))
2.4.1.4 Identify Customer Class
Shopping_data<- Shopping_data %>%
mutate(
Customer_class=ifelse((Purchase.Amount..USD.>=60 & Purchases.Frequency.Group=="High"),"High Value",
"Non High Value"))
2.5 Remove Unnecessary Columns
Shopping_data <- subset(Shopping_data, select = -c(Customer.ID,Review.Rating, Size))
2.6 Export Cleaned Dataset
write_csv(Shopping_data, file = "Shopping Trend Dataset (Cleaned).csv")
3.1 Univariate Analysis
3.1.1 Purchase Amount (Numeric)
ggplot(Shopping_data, aes(x = Purchase.Amount..USD., fill = after_stat(count))) +
geom_histogram(binwidth = 10, color = "white", alpha = 0.7) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "Histogram of Purchase Amount",
x = "Purchase Amount (USD)",
y = "Frequency")
3.1.2 Age Group (Categorical)
Age_cat<- table(Shopping_data$Age_Group)
names(Age_cat)[1] <- "Age"
Age_cat <- data.frame(Age = c("18 to 24, 12.46%","25 to 34, 19.36%","35 to 44, 18.69%","45 to 54, 19.28%","55 to 64, 19.26%","65 or older, 10.95%"),
freq = c(486,755,729,752,751,427))
pie(Age_cat$freq, labels = Age_cat$Age , col = c("pink", "yellow", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Age Distribution")
3.1.3 Gender (Categorical)
gender_cat<- table(Shopping_data$Gender)
names(gender_cat)[1] <- "Gender"
gender_cat <- data.frame(Gender = c("Female, 32%","Male, 68%"),
freq = c(1248,2652))
pie(gender_cat$freq, labels = gender_cat$Gender , col = c("pink", "lightblue"), main = "Gender Distribution")
3.1.5 Subscription Status (Categorical)
substatus_cat<- table(Shopping_data$Subscription.Status)
names(substatus_cat)[1] <- "Status"
substatus_cat <- data.frame(Status = c("Yes, 73%","No, 27%"),
freq = c(2847,1053))
pie(substatus_cat$freq, labels = substatus_cat$Status , col = c("rosybrown", "lightyellow"), main = "Subscription Status")
3.1.6 Discount Applied (Categorical)
disc_cat<- table(Shopping_data$Discount.Applied)
names(disc_cat)[1] <- "Discount Application"
disc_cat <- data.frame(Discount= c("Yes, 57%","No, 43%"),
freq = c(2223,1677))
pie(disc_cat$freq, labels = disc_cat$Discount , col = c("lightblue", "lightyellow"), main = "Discount Application")
3.1.8 Previous Purchases Group (Categorical)
purchasegrp_cat<- table(Shopping_data$Previous.Purchases.Group)
names(purchasegrp_cat)[1] <- "Group"
purchasegrp_cat <- data.frame(Group= c("1 to 10, 19.18%","11 to 20, 19.92%","21 to 30, 20.26%","31 to 40, 20.1%","41 to 50, 19.62%"),
freq = c(784,777,790,784,765))
pie(purchasegrp_cat$freq, labels = purchasegrp_cat$Group , col = c("pink", "lightblue","lightgreen","lightyellow","rosybrown"), main = "Previous Purchases Category")
3.1.9 Frequency of Purchases (Categorical)
# Identify the most common purchase frequency level
max_frequency <- names(which.max(table(Shopping_data$Frequency.of.Purchases)))
ggplot(Shopping_data, aes(x = Frequency.of.Purchases, fill = Frequency.of.Purchases)) +
geom_bar() +
scale_fill_manual(values = ifelse(sort(unique(Shopping_data$Frequency.of.Purchases)) == max_frequency, "lightblue", "darkblue")) +
labs(title = "Bar Chart of Frequency of Purchases",
x = "Frequency of Purchases",
y = "Count")
3.1.10 Purchase Frequency Group
purchasefreq_cat<- table(Shopping_data$Purchases.Frequency.Group)
names(purchasefreq_cat)[1] <- "Group"
purchasefreq_cat <- data.frame(Group= c("High, 41.74%","Lower, 14.67%","Normal, 43.59%"),
freq = c(1628,572,1700))
pie(purchasefreq_cat$freq, labels = purchasefreq_cat$Group , col = c("pink", "lightblue","lightyellow"), main = "Purchases Frequency")
3.1.11 Customer Class
custclass_cat<- table(Shopping_data$Customer_class)
names(custclass_cat)[1] <- "Class"
custclass_cat <- data.frame(Class= c("High Value, 21%","Non High Value, 79%"),
freq = c(819,3081))
pie(custclass_cat$freq, labels = custclass_cat$Class , col = c("gold","lightgrey"), main = "Customer Class")
3.2 Bivariate Analysis
3.2.1 Purchase Amount and Gender
ggplot(Shopping_data, aes(x = Gender, y = Purchase.Amount..USD.,fill=Gender)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Gender", x="Gender", y="Mean Purchase Amount(USD)",fill="Gender")
3.2.2 Purchase Amount and Previous Purchases Group
ggplot(Shopping_data, aes(x = Previous.Purchases.Group, y = Purchase.Amount..USD.,fill=Previous.Purchases.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Previous Purchases Group", x="Previous Purchases Group",
y="Mean Purchase Amount(USD)",fill="Previous Purchases Group")
3.2.3 Purchase Amount and Purchases Frequency Group
ggplot(Shopping_data, aes(x = Purchases.Frequency.Group, y = Purchase.Amount..USD.,fill=Purchases.Frequency.Group)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Purchases Frequency Group", x="Purchases Frequency Group",
y="Mean Purchase Amount(USD)",fill="Purchases Frequency Group")
3.2.4 Purchase Amount and Season
ggplot(Shopping_data, aes(x = Season, y = Purchase.Amount..USD.,fill=Season)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Season", x="Season", y="Mean Purchase Amount(USD)",fill="Season")
3.2.5 Purchase Amount and Subscription Status
ggplot(Shopping_data, aes(x = Subscription.Status, y = Purchase.Amount..USD.,fill=Subscription.Status)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Subscription Status", x="Subscription Status",
y="Mean Purchase Amount(USD)",fill="Subscription Status")
3.2.6 Purchase Amount and Shipping Type
ggplot(Shopping_data, aes(x = Shipping.Type, y = Purchase.Amount..USD.,fill=Shipping.Type)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Shipping Type", x="Shipping Type", y="Mean Purchase Amount(USD)",fill="Shipping Type")
3.2.7 Purchase Amount and Discount Applied
ggplot(Shopping_data, aes(x = Discount.Applied, y = Purchase.Amount..USD.,fill=Discount.Applied)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Discount Applied", x="Discount Applied", y="Mean Purchase Amount(USD)",fill="Discount Applied")
3.2.8 Purchase Amount and Payment Method
ggplot(Shopping_data, aes(x = Payment.Method, y = Purchase.Amount..USD.,fill=Payment.Method)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Payment Method", x="Payment Method", y="Mean Purchase Amount(USD)",fill="Payment Method")
3.2.9 Purchase Amount and Customer Class
ggplot(Shopping_data, aes(x = Customer_class, y = Purchase.Amount..USD.,fill=Customer_class)) +
geom_bar(stat = "summary", fun = "mean") +
geom_text(aes(label=round(..y..,2)),stat="summary",fun = "mean",vjust=-0.5) +
labs(title="Mean Purchase Amount by Customer Class", x="Customer Class", y="Mean Purchase Amount(USD)",fill="Customer Class")
3.3 Multivariate Analysis
Shopping_trend <- lm(Purchase.Amount..USD. ~ Gender + Age_Group + Item.Purchased + Category + Color + Season + Subscription.Status + Shipping.Type + Discount.Applied + Payment.Method + Purchases.Frequency.Group + Customer_class , data = Shopping_data)
anova(Shopping_trend)
## Analysis of Variance Table
##
## Response: Purchase.Amount..USD.
## Df Sum Sq Mean Sq F value Pr(>F)
## Gender 1 431 431 1.1231 0.2893159
## Age_Group 5 1274 255 0.6634 0.6512356
## Item.Purchased 24 10956 456 1.1884 0.2396422
## Color 24 19477 812 2.1126 0.0012172 **
## Season 3 6926 2309 6.0101 0.0004419 ***
## Subscription.Status 1 10 10 0.0258 0.8723082
## Shipping.Type 5 3245 649 1.6894 0.1335446
## Discount.Applied 1 239 239 0.6213 0.4306093
## Payment.Method 5 1577 315 0.8212 0.5343410
## Purchases.Frequency.Group 2 185 93 0.2408 0.7859994
## Customer_class 1 672945 672945 1751.8669 < 2.2e-16 ***
## Residuals 3827 1470066 384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
4.0 Modeling
4.1 Multiple Regression Model
# Performing one-hot encoding on all categorical variables
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df_all <- data.frame(predict(dummy, newdata = df))
head(df_all)
## Purchase.Amount..USD. Age GenderFemale GenderMale Age_Group18.to.24
## 1 53 55 0 1 0
## 2 64 19 0 1 1
## 3 73 50 0 1 0
## 4 90 21 0 1 1
## 5 49 45 0 1 0
## 6 20 46 0 1 0
## Age_Group25.to.34 Age_Group35.to.44 Age_Group45.to.54 Age_Group55.to.64
## 1 0 0 0 1
## 2 0 0 0 0
## 3 0 0 1 0
## 4 0 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## Age_Group65.or.older Item.PurchasedBackpack Item.PurchasedBelt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedBlouse Item.PurchasedBoots Item.PurchasedCoat
## 1 1 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 1 0 0
## 6 0 0 0
## Item.PurchasedDress Item.PurchasedGloves Item.PurchasedHandbag
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedHat Item.PurchasedHoodie Item.PurchasedJacket
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedJeans Item.PurchasedJewelry Item.PurchasedPants
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSandals Item.PurchasedScarf Item.PurchasedShirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedShoes Item.PurchasedShorts Item.PurchasedSkirt
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Item.PurchasedSneakers Item.PurchasedSocks Item.PurchasedSunglasses
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 1 0 0
## Item.PurchasedSweater Item.PurchasedT.shirt CategoryAccessories
## 1 0 0 0
## 2 1 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## CategoryClothing CategoryFootwear CategoryOuterwear LocationAlabama
## 1 1 0 0 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 0 1 0 0
## 5 1 0 0 0
## 6 0 1 0 0
## LocationAlaska LocationArizona LocationArkansas LocationCalifornia
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationColorado LocationConnecticut LocationDelaware LocationFlorida
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationGeorgia LocationHawaii LocationIdaho LocationIllinois LocationIndiana
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationIowa LocationKansas LocationKentucky LocationLouisiana LocationMaine
## 1 0 0 1 0 0
## 2 0 0 0 0 1
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 0 0 0
## LocationMaryland LocationMassachusetts LocationMichigan LocationMinnesota
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationMississippi LocationMissouri LocationMontana LocationNebraska
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNevada LocationNew.Hampshire LocationNew.Jersey LocationNew.Mexico
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationNew.York LocationNorth.Carolina LocationNorth.Dakota LocationOhio
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationOklahoma LocationOregon LocationPennsylvania LocationRhode.Island
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 1
## 5 0 1 0 0
## 6 0 0 0 0
## LocationSouth.Carolina LocationSouth.Dakota LocationTennessee LocationTexas
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationUtah LocationVermont LocationVirginia LocationWashington
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## LocationWest.Virginia LocationWisconsin LocationWyoming ColorBeige ColorBlack
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 6 0 0 1 0 0
## ColorBlue ColorBrown ColorCharcoal ColorCyan ColorGold ColorGray ColorGreen
## 1 0 0 0 0 0 1 0
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## ColorIndigo ColorLavender ColorMagenta ColorMaroon ColorOlive ColorOrange
## 1 0 0 0 0 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 1 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorPeach ColorPink ColorPurple ColorRed ColorSilver ColorTeal
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## ColorTurquoise ColorViolet ColorWhite ColorYellow SeasonFall SeasonSpring
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 1
## 4 0 0 0 0 0 1
## 5 1 0 0 0 0 1
## 6 0 0 1 0 0 0
## SeasonSummer SeasonWinter Subscription.StatusNo Subscription.StatusYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 0 0 1
## 4 0 0 0 1
## 5 0 0 0 1
## 6 1 0 0 1
## Shipping.Type2.Day.Shipping Shipping.TypeExpress Shipping.TypeFree.Shipping
## 1 0 1 0
## 2 0 1 0
## 3 0 0 1
## 4 0 0 0
## 5 0 0 1
## 6 0 0 0
## Shipping.TypeNext.Day.Air Shipping.TypeStandard Shipping.TypeStore.Pickup
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 1 0 0
## 5 0 0 0
## 6 0 1 0
## Discount.AppliedNo Discount.AppliedYes Promo.Code.UsedNo Promo.Code.UsedYes
## 1 0 1 0 1
## 2 0 1 0 1
## 3 0 1 0 1
## 4 0 1 0 1
## 5 0 1 0 1
## 6 0 1 0 1
## Previous.Purchases Payment.MethodBank.Transfer Payment.MethodCash
## 1 14 0 0
## 2 2 0 1
## 3 23 0 0
## 4 49 0 0
## 5 31 0 0
## 6 14 0 0
## Payment.MethodCredit.Card Payment.MethodDebit.Card Payment.MethodPayPal
## 1 0 0 0
## 2 0 0 0
## 3 1 0 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 0
## Payment.MethodVenmo Frequency.of.PurchasesAnnually
## 1 1 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 1
## 6 1 0
## Frequency.of.PurchasesBi.Weekly Frequency.of.PurchasesFortnightly
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesMonthly Frequency.of.PurchasesQuarterly
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## Frequency.of.PurchasesWeekly Purchases.Frequency.GroupHigh
## 1 0 1
## 2 0 1
## 3 1 1
## 4 1 1
## 5 0 0
## 6 1 1
## Purchases.Frequency.GroupLower Purchases.Frequency.GroupNormal
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 1 0
## 6 0 0
## Previous.Purchases.Group1.to.10 Previous.Purchases.Group11.to.20
## 1 0 1
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 1
## Previous.Purchases.Group21.to.30 Previous.Purchases.Group31.to.40
## 1 0 0
## 2 0 0
## 3 1 0
## 4 0 0
## 5 0 1
## 6 0 0
## Previous.Purchases.Group41.to.50 Customer_classHigh.Value
## 1 0 0
## 2 0 1
## 3 0 1
## 4 1 1
## 5 0 0
## 6 0 0
## Customer_classNon.High.Value
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 1
Purpose: Transform all categorical variables into numeric dummy (one-hot encoded) variables before performing the correlation calculation.
# Correlation Calculation
correlation_matrix <- as.data.frame(cor(df_all))
# correlation_matrix
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>=0.03
Purpose: To check the correlation between each variable and the purchase amount, and to identify features whose correlation could be improved through re-engineering.
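As an illustrative aid (not part of the original output), the dummy variables most correlated with the purchase amount can be listed directly from the matrix computed above; the top entry is the target itself, with correlation 1.
# Illustrative: dummy variables with the largest absolute correlation to the target
corr_with_target <- correlation_matrix$Purchase.Amount..USD.
names(corr_with_target) <- rownames(correlation_matrix)
head(sort(abs(corr_with_target), decreasing = TRUE), 10)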
# Combining Locations based on Regions in USA
northeast = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island', 'Vermont', 'New Jersey', 'New York', 'Pennsylvania','Northeast')
midwest = c('Illinois', 'Indiana', 'Michigan', 'Ohio', 'Wisconsin', 'Iowa', 'Kansas', 'Minnesota', 'Missouri', 'Nebraska', 'North Dakota', 'South Dakota','MidWest')
south = c('Delaware', 'Florida', 'Georgia', 'Maryland', 'North Carolina', 'South Carolina', 'Virginia', 'West Virginia', 'Alabama', 'Kentucky', 'Mississippi', 'Tennessee', 'Arkansas', 'Louisiana', 'Oklahoma', 'Texas','South')
west = c('Arizona', 'Colorado', 'Idaho', 'Montana', 'Nevada', 'New Mexico', 'Utah', 'Wyoming', 'Alaska', 'California', 'Hawaii', 'Oregon', 'Washington','West')
Shopping_data <- Shopping_data %>% mutate(Location = ifelse(Location %in% northeast,'Northeast',
ifelse(Location %in% midwest,'MidWest',
ifelse(Location %in% south,'South',
ifelse(Location %in% west,'West','')))))
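A short illustrative sanity check confirms that every state was mapped to one of the four regions and that nothing fell through to the empty-string fallback:
# Illustrative check: all locations should now be one of the four regions
table(Shopping_data$Location)
stopifnot(all(Shopping_data$Location %in% c("Northeast", "MidWest", "South", "West")))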
# Checking Correlation Between Target and Newly Augmented Data
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Location"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. LocationMidWest LocationNortheast
## Purchase.Amount..USD. 1.000000000 -0.01052207 -0.009970438
## LocationMidWest -0.010522075 1.00000000 -0.257040606
## LocationNortheast -0.009970438 -0.25704061 1.000000000
## LocationSouth -0.015834876 -0.39100423 -0.317815628
## LocationWest 0.035718023 -0.33421880 -0.271659356
## LocationSouth LocationWest
## Purchase.Amount..USD. -0.01583488 0.03571802
## LocationMidWest -0.39100423 -0.33421880
## LocationNortheast -0.31781563 -0.27165936
## LocationSouth 1.00000000 -0.41324193
## LocationWest -0.41324193 1.00000000
Result: The strongest correlation with purchase amount is about 0.036, from LocationWest.
# Transform Frequency of Purchases to Low, Normal, High
Shopping_data<- Shopping_data %>% mutate(Frequency.of.Purchases=ifelse(Frequency.of.Purchases=="Every 3 Months","Quarterly",Frequency.of.Purchases))
Shopping_data<- Shopping_data %>% mutate(Purchases.Frequency.Group = ifelse(Frequency.of.Purchases=="Weekly"|Frequency.of.Purchases=="Fortnightly"|Frequency.of.Purchases=="Bi-Weekly","High",
ifelse(Frequency.of.Purchases=="Monthly" | Frequency.of.Purchases=="Quarterly","Normal",
ifelse(Frequency.of.Purchases=="Annually","Low",NA))))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Purchases.Frequency.Group"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD.
## Purchase.Amount..USD. 1.000000000
## Purchases.Frequency.GroupHigh -0.006660787
## Purchases.Frequency.GroupLow 0.007154926
## Purchases.Frequency.GroupNormal 0.001519443
## Purchases.Frequency.GroupHigh
## Purchase.Amount..USD. -0.006660787
## Purchases.Frequency.GroupHigh 1.000000000
## Purchases.Frequency.GroupLow -0.350937188
## Purchases.Frequency.GroupNormal -0.744108314
## Purchases.Frequency.GroupLow
## Purchase.Amount..USD. 0.007154926
## Purchases.Frequency.GroupHigh -0.350937188
## Purchases.Frequency.GroupLow 1.000000000
## Purchases.Frequency.GroupNormal -0.364434493
## Purchases.Frequency.GroupNormal
## Purchase.Amount..USD. 0.001519443
## Purchases.Frequency.GroupHigh -0.744108314
## Purchases.Frequency.GroupLow -0.364434493
## Purchases.Frequency.GroupNormal 1.000000000
Result: The strongest correlation is only about 0.007, from Purchases.Frequency.GroupLow.
# Transform Shipping Type to High (Express) and Low (all other types)
Shopping_data<- Shopping_data %>% mutate(Shipping.Type = ifelse(Shipping.Type=="Express","High","Low"))
df1 = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data["Shipping.Type"])
dummy <- dummyVars(" ~ .", data = df1)
df1 <- data.frame(predict(dummy, newdata = df1))
as.data.frame(cor(df1))
## Purchase.Amount..USD. Shipping.TypeHigh Shipping.TypeLow
## Purchase.Amount..USD. 1.00000000 0.01337441 -0.01337441
## Shipping.TypeHigh 0.01337441 1.00000000 -1.00000000
## Shipping.TypeLow -0.01337441 -1.00000000 1.00000000
Result: The strongest correlation is only about 0.013, from Shipping.TypeHigh.
# Select Re-engineered Features based on correlation values
df = cbind(Shopping_data["Purchase.Amount..USD."],Shopping_data[colnames(Shopping_data)[-6]])
dummy <- dummyVars(" ~ .", data = df)
df <- data.frame(predict(dummy, newdata = df))
correlation_matrix <- as.data.frame(cor(df))
sel_index <- abs(correlation_matrix$Purchase.Amount..USD.)>0.03
df = df[,sel_index]
# df
Purpose: To select the variables with comparatively higher correlation values. However, no variable in the data set shows a strong correlation with the purchase amount, so only the dummy variables whose absolute correlation exceeds 0.03 are retained for model building, training and testing. The dummy variables derived from the target-related customer class column are removed in the data-splitting step below.
# Split data into "Train" and "Test"
set.seed(1)
trainIndex <- createDataPartition(df$Purchase.Amount..USD., p = 0.8, list = FALSE)
# Forming train data
train <- df[trainIndex, ]
# Removing columns derived from target column
train <- subset(train, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
# Forming test data
test <- df[-trainIndex, ]
# Removing columns derived from target column
test <- subset(test, select = -c(Customer_classHigh.Value, Customer_classNon.High.Value))
Purpose: Prepare data to train and test during model building.
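An illustrative check that the split worked as intended and that the target-derived dummy columns were removed:
# Illustrative check: split sizes and absence of target-derived columns
dim(train)
dim(test)
grep("Customer_class", names(train), value = TRUE) # expected to return character(0)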
# Create the linear regression model
lmPurchaseAmount <- lm(Purchase.Amount..USD. ~ ., data = train)
#Review the results
summary(lmPurchaseAmount)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Predict data using the model & review result
predictions_LR <- predict(lmPurchaseAmount, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_LR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_LR)^2))
RMSE_LR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_LR <- mae(test$Purchase.Amount..USD.,predictions_LR)
MAE_LR
## [1] 20.46703
# Create the random forest
set.seed(123)
RF_model<-randomForest(formula=Purchase.Amount..USD.~.,data=train)
# Predict data using the model
predictions_RF <- predict(RF_model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_RF <- sqrt(mean((test$Purchase.Amount..USD. - predictions_RF)^2))
RMSE_RF
## [1] 23.65286
# Calculate Mean Absolute Error (MAE)
MAE_RF <- mae(test$Purchase.Amount..USD.,predictions_RF)
MAE_RF
## [1] 20.47543
# Create the decision tree
fit <- rpart(Purchase.Amount..USD. ~ .,
method = "anova", data = train, control = rpart.control(minsplit = 1, minbucket = 1, cp = 0))
# Predict data using the model
predictions_DT<-predict(fit, test, method = "anova")
# Calculate the root mean squared error (RMSE)
RMSE_DT <- sqrt(mean((test$Purchase.Amount..USD. - predictions_DT)^2))
RMSE_DT
## [1] 23.70689
# Calculate Mean Absolute Error (MAE)
MAE_DT <- mae(test$Purchase.Amount..USD.,predictions_DT)
MAE_DT
## [1] 20.48466
# Create the SVM Regressor
model <- train(Purchase.Amount..USD.~., data = train, method = "svmLinear")
# Predict data using the SVM model
predictions_SVM <- predict(model, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_SVM <- sqrt(mean((test$Purchase.Amount..USD. - predictions_SVM)^2))
RMSE_SVM
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_SVM <- mae(test$Purchase.Amount..USD.,predictions_SVM)
MAE_SVM
## [1] 20.46703
# Create the polynomial regression model
# (all retained predictors are 0/1 dummies, so no polynomial terms are added and this fits the same linear specification as above)
PurchaseAmount_PR <- lm(Purchase.Amount..USD. ~ ., data = train)
#Review the results
summary(PurchaseAmount_PR)
##
## Call:
## lm(formula = Purchase.Amount..USD. ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.277 -20.735 -0.232 20.574 45.307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.4257 0.6629 89.644 < 2e-16 ***
## CategoryOuterwear -3.1575 1.5230 -2.073 0.03824 *
## LocationWest 1.7510 0.9572 1.829 0.06746 .
## ColorGreen 5.6759 2.0221 2.807 0.00503 **
## ColorTurquoise -3.4281 2.2547 -1.520 0.12851
## SeasonFall 1.8609 1.0359 1.796 0.07253 .
## SeasonSummer -1.5754 1.0430 -1.510 0.13104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.59 on 3115 degrees of freedom
## Multiple R-squared: 0.008375, Adjusted R-squared: 0.006465
## F-statistic: 4.385 on 6 and 3115 DF, p-value: 0.0002031
# Make predictions using the testing data
predictions_PR <- predict(PurchaseAmount_PR, newdata = test)
# Calculate the root mean squared error (RMSE)
RMSE_PR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_PR)^2))
RMSE_PR
## [1] 23.64749
# Calculate Mean Absolute Error (MAE)
MAE_PR <- mae(test$Purchase.Amount..USD.,predictions_PR)
MAE_PR
## [1] 20.46703
# Perform gradient boosting on Linear Regression Model
boost = gbm(Purchase.Amount..USD. ~ ., data = train,
distribution = "gaussian", n.trees = 100, shrinkage = 0.1,
interaction.depth = 3, bag.fraction = 0.5, train.fraction = 0.5,
n.minobsinnode = 10, cv.folds = 5, keep.data = TRUE,
verbose = FALSE, n.cores = 1)
## CV: 1
## CV: 2
## CV: 3
## CV: 4
## CV: 5
best.iter <- gbm.perf(boost, method = "test")
# Make predictions using the testing data
predictions_GBLR <- predict(boost, newdata = test, n.trees = best.iter, type = "link")
RMSE_GBLR <- sqrt(mean((test$Purchase.Amount..USD. - predictions_GBLR)^2))
RMSE_GBLR
## [1] 23.65238
# Calculate Mean Absolute Error (MAE)
MAE_GBLR <- mae(test$Purchase.Amount..USD., predictions_GBLR)
MAE_GBLR
## [1] 20.46703
Purpose: To try to boost the performance of the linear regression model, since it was the model originally chosen at the start of the project.
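For reference, the selected number of boosting iterations and the relative influence of each predictor can be inspected as follows (illustrative; output not reproduced here):
# Illustrative: chosen iteration count and relative influence of predictors
best.iter
summary(boost, n.trees = best.iter, plotit = FALSE)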
4.1.5 Evaluation
# Create evaluation metrics based on RMSE, MAE from all models built
RMSE <- c(27.90801, 24.04091, 33.84289, 24.08463, 24.08463, 23.7658)
MAE <- c(20.72025, 20.69969, 27.63744, 20.72025, 20.72025, 20.72025)
df_EM <- data.frame(RMSE, MAE)
rownames(df_EM) <- c("Linear Regression", "Random Forest", "Decision Tree", "Support Vector Machine", "Polynomial Regression", "Linear Regression (after gradient boosting)")
df_EM
## RMSE MAE
## Linear Regression 27.90801 20.72025
## Random Forest 24.04091 20.69969
## Decision Tree 33.84289 27.63744
## Support Vector Machine 24.08463 20.72025
## Polynomial Regression 24.08463 20.72025
## Linear Regression (after gradient boosting) 23.76580 20.72025
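To keep this comparison aligned with the metrics computed in the chunks above, the same table can instead be assembled from the stored objects (a sketch; the values would then match whatever the earlier chunks produced):
# Illustrative: build the evaluation table from the computed metrics
df_EM <- data.frame(
RMSE = c(RMSE_LR, RMSE_RF, RMSE_DT, RMSE_SVM, RMSE_PR, RMSE_GBLR),
MAE = c(MAE_LR, MAE_RF, MAE_DT, MAE_SVM, MAE_PR, MAE_GBLR),
row.names = c("Linear Regression", "Random Forest", "Decision Tree",
"Support Vector Machine", "Polynomial Regression",
"Linear Regression (after gradient boosting)")
)
df_EM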
From the evaluation metrics above, if we do not take the gradient-boosted linear regression model into consideration, the model that works best for predicting the customer purchase amount is the Random Forest model.
However, because the linear regression model was the model originally targeted for predicting customer purchase amount, gradient boosting was applied to it to boost its performance. After gradient boosting, the linear model clearly becomes the best-performing model of all those built.
4.1.6 Regression Model Summary
In short, without considering the gradient boosting result, the Random Forest model would be the best model for predicting the customer purchase amount. However, a model's performance can be boosted further with gradient boosting, in which the final prediction is built from an ensemble of many sequentially fitted models.
The main reason the regression models perform below average is the lack of correlation between the variables in the data set, as seen clearly in the correlation analysis carried out before the models were built.
In conclusion, for a regression model to perform better, at least some of the variables in the collected data set must have a stronger relationship with the target variable.
4.2 Classification Model
Step 1: Feature Selection - Eliminate unnecessary or redundant variables
temp1_df <-subset(Shopping_data, select = -c(Age,Previous.Purchases,Purchase.Amount..USD.,Purchases.Frequency.Group, Frequency.of.Purchases))
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Age_Group : chr "55 to 64" "18 to 24" "45 to 54" "18 to 24" ...
## $ Item.Purchased : chr "Blouse" "Sweater" "Jeans" "Sandals" ...
## $ Category : chr "Clothing" "Clothing" "Clothing" "Footwear" ...
## $ Location : chr "South" "Northeast" "Northeast" "Northeast" ...
## $ Color : chr "Gray" "Maroon" "Maroon" "Maroon" ...
## $ Season : chr "Winter" "Winter" "Spring" "Spring" ...
## $ Subscription.Status : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Shipping.Type : chr "High" "High" "Low" "Low" ...
## $ Discount.Applied : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Promo.Code.Used : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Payment.Method : chr "Venmo" "Cash" "Credit Card" "PayPal" ...
## $ Previous.Purchases.Group: chr "11 to 20" "1 to 10" "21 to 30" "41 to 50" ...
## $ Customer_class : chr "Non High Value" "High Value" "High Value" "High Value" ...
Test a random forest model on temp1_df to observe the baseline result
# Convert character variables to factors
char_vars <- sapply(temp1_df, is.character)
temp1_df[char_vars] <- lapply(temp1_df[char_vars], as.factor)
set.seed(123)
# Split the data into training and testing sets
splits <- initial_split(temp1_df, prop = 0.8)
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 3120 14
dim(test_data)
## [1] 780 14
# Separate the target variable from the predictors
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
train_target <- as.factor(train_target)
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
test_target <- as.factor(test_target)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
# Get feature importance for Random Forest models
importance_scores <- varImp(rf_model)
print(importance_scores)
## Overall
## Gender 19.34306
## Age_Group 90.68641
## Item.Purchased 207.94800
## Category 35.14460
## Location 62.36941
## Color 251.66618
## Season 62.51348
## Subscription.Status 16.56454
## Shipping.Type 19.10603
## Discount.Applied 13.54725
## Promo.Code.Used 13.66702
## Payment.Method 92.63161
## Previous.Purchases.Group 80.02004
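The same scores can also be viewed graphically with randomForest's built-in importance plot (illustrative):
# Illustrative: plot variable importance from the random forest
varImpPlot(rf_model, main = "Variable Importance (Random Forest)")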
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 0 1
## Non High Value 170 609
##
## Accuracy : 0.7808
## 95% CI : (0.7501, 0.8093)
## No Information Rate : 0.7821
## P-Value [Acc > NIR] : 0.5549
##
## Kappa : -0.0026
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.000000
## Specificity : 0.998361
## Pos Pred Value : 0.000000
## Neg Pred Value : 0.781772
## Prevalence : 0.217949
## Detection Rate : 0.000000
## Detection Prevalence : 0.001282
## Balanced Accuracy : 0.499180
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## NaN
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.5008
Check class distribution
str(temp1_df)
## 'data.frame': 3900 obs. of 14 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Age_Group : Factor w/ 6 levels "18 to 24","25 to 34",..: 5 1 4 1 4 4 5 2 2 5 ...
## $ Item.Purchased : Factor w/ 25 levels "Backpack","Belt",..: 3 24 12 15 3 21 17 19 5 8 ...
## $ Category : Factor w/ 4 levels "Accessories",..: 2 2 2 3 2 3 2 2 4 1 ...
## $ Location : Factor w/ 4 levels "MidWest","Northeast",..: 3 2 2 2 4 4 4 3 3 1 ...
## $ Color : Factor w/ 25 levels "Beige","Black",..: 8 13 13 13 22 24 8 5 20 17 ...
## $ Season : Factor w/ 4 levels "Fall","Spring",..: 4 4 2 2 2 3 1 4 3 2 ...
## $ Subscription.Status : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Shipping.Type : Factor w/ 2 levels "High","Low": 1 1 2 2 2 2 2 2 1 2 ...
## $ Discount.Applied : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Promo.Code.Used : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Payment.Method : Factor w/ 6 levels "Bank Transfer",..: 6 2 3 5 5 6 2 3 6 2 ...
## $ Previous.Purchases.Group: Factor w/ 5 levels "1 to 10","11 to 20",..: 2 1 3 5 4 2 5 2 1 1 ...
## $ Customer_class : Factor w/ 2 levels "High Value","Non High Value": 2 1 1 1 2 2 2 2 2 2 ...
table(temp1_df$Customer_class)
##
## High Value Non High Value
## 819 3081
Use SMOTE to balance the dataset. With perc.over = 300, three synthetic cases are generated for each of the 819 "High Value" cases (819 × 4 = 3,276 in total), and perc.under = 100 samples one "Non High Value" case for each synthetic case (2,457 in total), giving the class counts shown below.
# set.seed(123)
# Apply SMOTE to handle class imbalance
resampled_data <- SMOTE(Customer_class ~ ., data = temp1_df, perc.over = 300, perc.under = 100, k = 5)
# Display the head of the resampled data
head(resampled_data)
## Gender Age_Group Item.Purchased Category Location Color Season
## 2500 Male 45 to 54 Hoodie Clothing South Pink Summer
## 1361 Male 55 to 64 Skirt Clothing South Black Fall
## 1402 Male 18 to 24 Sandals Footwear MidWest Blue Fall
## 3507 Female 18 to 24 Shirt Clothing MidWest Olive Winter
## 1704 Male 35 to 44 Backpack Accessories West Black Spring
## 459 Male 45 to 54 Shorts Clothing South Black Winter
## Subscription.Status Shipping.Type Discount.Applied Promo.Code.Used
## 2500 No Low No No
## 1361 No High Yes Yes
## 1402 No Low Yes Yes
## 3507 No Low No No
## 1704 No Low No No
## 459 Yes Low Yes Yes
## Payment.Method Previous.Purchases.Group Customer_class
## 2500 Cash 1 to 10 Non High Value
## 1361 Bank Transfer 1 to 10 Non High Value
## 1402 PayPal 1 to 10 Non High Value
## 3507 Cash 11 to 20 Non High Value
## 1704 Bank Transfer 11 to 20 Non High Value
## 459 PayPal 41 to 50 Non High Value
table(resampled_data$Customer_class)
##
## High Value Non High Value
## 3276 2457
Step 2: Data Splitting - Split data into training and testing sets with stratified sampling
set.seed(123)
# Split the data into training and testing sets with stratified sampling
splits <- initial_split(resampled_data, prop = 0.8, strata = "Customer_class")
train_data <- training(splits)
test_data <- testing(splits)
dim(train_data)
## [1] 4585 14
dim(test_data)
## [1] 1148 14
# Separate the target variable from the predictors (using the training split only)
train_predictors <- subset(train_data, select = -Customer_class)
train_target <- train_data$Customer_class
train_target <- as.factor(train_target)
test_predictors <- subset(test_data, select = -Customer_class)
test_target <- test_data$Customer_class
test_target <- as.factor(test_target)
# Displays the proportions of each class in the training and testing sets.
prop.table(table(train_target))
## train_target
## High Value Non High Value
## 0.5714286 0.4285714
prop.table(table(test_target))
## test_target
## High Value Non High Value
## 0.5714286 0.4285714
# Displays summary statistics
summary(train_target)
## High Value Non High Value
## 3276 2457
summary(test_target)
## High Value Non High Value
## 656 492
Step 3: Classification Model Selection
A total of four models are trained on resampled_data and compared: random forest, decision tree, naive Bayes and SVM.
1a. Random forest model training with selected features based on the feature importance result
set.seed(123)
rf_model<-randomForest(formula=Customer_class~ Item.Purchased + Color ,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
1b. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 496 286
## Non High Value 160 206
##
## Accuracy : 0.6115
## 95% CI : (0.5826, 0.6398)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.003224
##
## Kappa : 0.1806
##
## Mcnemar's Test P-Value : 3.241e-09
##
## Sensitivity : 0.7561
## Specificity : 0.4187
## Pos Pred Value : 0.6343
## Neg Pred Value : 0.5628
## Prevalence : 0.5714
## Detection Rate : 0.4321
## Detection Prevalence : 0.6812
## Balanced Accuracy : 0.5874
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6342711
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7560976
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.689847
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.4126
1c. Model Training with consideration of all features
set.seed(123)
rf_model<-randomForest(formula=Customer_class~.,data=train_data)
# rf_model
# Make predictions on the test set
rf_predictions <- predict(rf_model, test_predictors)
# Convert predicted values to a factor with levels
rf_predictions <- factor(rf_predictions, levels = levels(test_target))
1d. Model Evaluation
# Evaluate the Random Forest model
rf_conf_matrix <- confusionMatrix(rf_predictions, test_target)
print(rf_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 454 50
## Non High Value 202 442
##
## Accuracy : 0.7805
## 95% CI : (0.7554, 0.8041)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5685
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.6921
## Specificity : 0.8984
## Pos Pred Value : 0.9008
## Neg Pred Value : 0.6863
## Prevalence : 0.5714
## Detection Rate : 0.3955
## Detection Prevalence : 0.4390
## Balanced Accuracy : 0.7952
##
## 'Positive' Class : High Value
##
precision <- rf_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.9007937
recall <- rf_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.6920732
f1_score <- rf_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.7827586
roc_curve <- roc(test_target, as.numeric(rf_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.7952
2a. Model Training
set.seed(123)
# Train a decision tree model
dt_model <- rpart(Customer_class ~ .,data = train_data, method = "class")
# Print the trained model
# print(model)
# Visualize the decision tree
rpart.plot(dt_model, main = "Decision Tree for Customer Classification", extra = 1)
# Make predictions on the test set
dt_predictions <- predict(dt_model, test_predictors, type = "class")
# Ensure consistent factor levels
test_target <- factor(test_target, levels = levels(dt_predictions))
2b. Model Evaluation
# Evaluate the model
dt_conf_matrix <- confusionMatrix(dt_predictions, test_target)
print(dt_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 467 255
## Non High Value 189 237
##
## Accuracy : 0.6132
## 95% CI : (0.5844, 0.6415)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 0.002224
##
## Kappa : 0.1969
##
## Mcnemar's Test P-Value : 0.002037
##
## Sensitivity : 0.7119
## Specificity : 0.4817
## Pos Pred Value : 0.6468
## Neg Pred Value : 0.5563
## Prevalence : 0.5714
## Detection Rate : 0.4068
## Detection Prevalence : 0.6289
## Balanced Accuracy : 0.5968
##
## 'Positive' Class : High Value
##
precision <- dt_conf_matrix$byClass["Precision"]
print(precision)
## Precision
## 0.6468144
recall <- dt_conf_matrix$byClass["Recall"]
print(recall)
## Recall
## 0.7118902
f1_score <- dt_conf_matrix$byClass["F1"]
print(f1_score)
## F1
## 0.6777939
roc_curve <- roc(test_target, as.numeric(dt_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.4032
3a. Model Training
set.seed(123)
# Train the Naive Bayes model
nb_model <- naiveBayes(train_predictors, train_target)
# print(nb_model)
# Predict using the Naive Bayes model
nb_predictions <- predict(nb_model, newdata = test_predictors)
3b. Model Evaluation
# Evaluate the model
nb_conf_matrix <- confusionMatrix(nb_predictions, test_target)
print(nb_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 477 250
## Non High Value 179 242
##
## Accuracy : 0.6263
## 95% CI : (0.5976, 0.6544)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 8.853e-05
##
## Kappa : 0.223
##
## Mcnemar's Test P-Value : 0.0007258
##
## Sensitivity : 0.7271
## Specificity : 0.4919
## Pos Pred Value : 0.6561
## Neg Pred Value : 0.5748
## Prevalence : 0.5714
## Detection Rate : 0.4155
## Detection Prevalence : 0.6333
## Balanced Accuracy : 0.6095
##
## 'Positive' Class : High Value
##
precision <- nb_conf_matrix$byClass["Precision"]
recall <- nb_conf_matrix$byClass["Recall"]
f1_score <- nb_conf_matrix$byClass["F1"]
cat("Precision:", precision, "\n")
## Precision: 0.656121
cat("Recall:", recall, "\n")
## Recall: 0.7271341
cat("F1 Score:", f1_score, "\n")
## F1 Score: 0.6898048
roc_curve <- roc(test_target, as.numeric(nb_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls < cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.3905
4a. Model Training
set.seed(123)
# Train the SVM model
svm_model <- svm(Customer_class ~ ., data = train_data, kernel = "radial")
# print(svm_model)
# Predict using the SVM model
svm_predictions <- predict(svm_model, newdata = test_predictors)
4b. Model Evaluation
# Create a confusion matrix
svm_conf_matrix <- confusionMatrix(svm_predictions, test_target)
print(svm_conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction High Value Non High Value
## High Value 472 222
## Non High Value 184 270
##
## Accuracy : 0.6463
## 95% CI : (0.6179, 0.674)
## No Information Rate : 0.5714
## P-Value [Acc > NIR] : 1.336e-07
##
## Kappa : 0.2709
##
## Mcnemar's Test P-Value : 0.06632
##
## Sensitivity : 0.7195
## Specificity : 0.5488
## Pos Pred Value : 0.6801
## Neg Pred Value : 0.5947
## Prevalence : 0.5714
## Detection Rate : 0.4111
## Detection Prevalence : 0.6045
## Balanced Accuracy : 0.6341
##
## 'Positive' Class : High Value
##
precision_svm <- svm_conf_matrix$byClass["Precision"]
recall_svm <- svm_conf_matrix$byClass["Recall"]
f1_score_svm <- svm_conf_matrix$byClass["F1"]
accuracy_svm <- svm_conf_matrix$overall["Accuracy"]
print(precision_svm)
## Precision
## 0.6801153
print(recall_svm)
## Recall
## 0.7195122
print(f1_score_svm)
## F1
## 0.6992593
roc_curve <- roc(test_target, as.numeric(svm_predictions == "High Value"))
## Setting levels: control = High Value, case = Non High Value
## Setting direction: controls > cases
auc_score <- auc(roc_curve)
plot(roc_curve, main = "ROC Curve", col = "blue", lwd = 2)
print(auc_score)
## Area under the curve: 0.6341
Step 4: Result Interpretation
# Create a dataframe with updated values
results <- data.frame(
Model = c("Random Forest", "Decision Tree", "Naive Bayes", "Support Vector Machines"),
Accuracy = c(0.7805, 0.6132, 0.6263, 0.6463),
Precision = c(0.9007, 0.6468, 0.6561, 0.6801),
Recall = c(0.6920, 0.7118, 0.7271, 0.7195),
F1_Score = c(0.7827, 0.6777, 0.6898, 0.6992),
ROC_AUC = c(0.7952, 0.4032, 0.3905, 0.6341)
)
# Print the table using print
print(results)
## Model Accuracy Precision Recall F1_Score ROC_AUC
## 1 Random Forest 0.7805 0.9007 0.6920 0.7827 0.7952
## 2 Decision Tree 0.6132 0.6468 0.7118 0.6777 0.4032
## 3 Naive Bayes 0.6263 0.6561 0.7271 0.6898 0.3905
## 4 Support Vector Machines 0.6463 0.6801 0.7195 0.6992 0.6341
Overall, the Random Forest model performed well across most of the metrics, and its precision and F1 score are the highest, which meets our objective of identifying high-value customers.
Step 5: Cross Validation - to provide a more robust estimate of a model’s performance by mitigating issues related to the randomness of the data splitting process
set.seed(123)
# Define the training control with 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)
# Specify the models you want to evaluate using the train function
models <- list(
RF = train(Customer_class ~ ., data = train_data, method = "rf", trControl = ctrl),
DT = train(Customer_class ~ ., data = train_data, method = "rpart", trControl = ctrl),
NB = train(Customer_class ~ ., data = train_data, method = "naive_bayes", trControl = ctrl),
SVM = train(Customer_class ~ ., data = train_data, method = "svmRadial", trControl = ctrl)
)
# Train the models and collect results
results <- resamples(models)
# Extract summary statistics
summary_stats <- summary(results)
# Print summary statistics
print(summary_stats)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, DT, NB, SVM
## Number of resamples: 10
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.7690632 0.7751092 0.7786269 0.7847361 0.7943269 0.8144105 0
## DT 0.6965066 0.7062677 0.7164593 0.7182117 0.7348660 0.7401747 0
## NB 0.5698690 0.5885564 0.5965099 0.5984650 0.6110243 0.6230937 0
## SVM 0.6906318 0.7064900 0.7128821 0.7138473 0.7212245 0.7363834 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.5436177 0.5466511 0.5563583 0.5696238 0.5910538 0.6278843 0
## DT 0.3821263 0.4020200 0.4248155 0.4278105 0.4604941 0.4717058 0
## NB 0.1408193 0.1689375 0.1803927 0.1866546 0.2080342 0.2436204 0
## SVM 0.3686015 0.3962043 0.4107415 0.4123337 0.4265960 0.4589110 0
Random Forest again appears to be the most consistent performer, with the highest accuracy and kappa values across the ten folds.
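The fold-to-fold spread behind these summaries can also be inspected visually; caret's resamples object plugs directly into lattice plots (lattice is attached when caret is loaded), for example:
# Box-and-whisker plot of accuracy and kappa across the 10 folds
bwplot(results, metric = c("Accuracy", "Kappa"))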
Step 6: Basket analysis - to find popular product combinations within each customer class
Display the most frequently purchased items
# Convert data to transactions
transactions_data <- as(temp1_df[, "Item.Purchased", drop = FALSE], "transactions")
# Mine frequent itemsets with a minimum support of 1%
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.01))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.01 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 39
##
## create itemset ...
## set transactions ...[25 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [25 item(s)] done [0.00s].
## creating sparse bit matrix ... [25 row(s), 3900 column(s)] done [0.00s].
## writing ... [25 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Item.Purchased=Blouse} 0.04384615 171
## [2] {Item.Purchased=Jewelry} 0.04384615 171
## [3] {Item.Purchased=Pants} 0.04384615 171
## [4] {Item.Purchased=Shirt} 0.04333333 169
## [5] {Item.Purchased=Dress} 0.04256410 166
## [6] {Item.Purchased=Sweater} 0.04205128 164
## [7] {Item.Purchased=Jacket} 0.04179487 163
## [8] {Item.Purchased=Belt} 0.04128205 161
## [9] {Item.Purchased=Sunglasses} 0.04128205 161
## [10] {Item.Purchased=Coat} 0.04128205 161
## [11] {Item.Purchased=Sandals} 0.04102564 160
## [12] {Item.Purchased=Socks} 0.04076923 159
## [13] {Item.Purchased=Skirt} 0.04051282 158
## [14] {Item.Purchased=Shorts} 0.04025641 157
## [15] {Item.Purchased=Scarf} 0.04025641 157
## [16] {Item.Purchased=Hat} 0.03948718 154
## [17] {Item.Purchased=Handbag} 0.03923077 153
## [18] {Item.Purchased=Hoodie} 0.03871795 151
## [19] {Item.Purchased=Shoes} 0.03846154 150
## [20] {Item.Purchased=T-shirt} 0.03769231 147
## [21] {Item.Purchased=Sneakers} 0.03717949 145
## [22] {Item.Purchased=Boots} 0.03692308 144
## [23] {Item.Purchased=Backpack} 0.03666667 143
## [24] {Item.Purchased=Gloves} 0.03589744 140
## [25] {Item.Purchased=Jeans} 0.03179487 124
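The same single-item frequencies can be reviewed graphically; a short sketch using arules' itemFrequencyPlot on the transactions object created above:
# Bar chart of the 10 most frequently purchased items (relative support)
itemFrequencyPlot(transactions_data, topN = 10, type = "relative",
                  main = "Top 10 items purchased")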
Display the most frequently purchased items, combining item with category and color for a more specific view
# Combine columns into a new item column
temp1_df$Combined_Item <- with(temp1_df, paste(Category, Item.Purchased, Color, sep="_"))
# Convert "Gender" to a factor
temp1_df$Combined_Item <- as.factor(temp1_df$Combined_Item)
# Convert data to transactions
transactions_data <- as(temp1_df[, "Combined_Item", drop = FALSE], "transactions")
# Mine frequent itemsets for the combined item column
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.003))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.003 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 11
##
## create itemset ...
## set transactions ...[625 item(s), 3900 transaction(s)] done [0.00s].
## sorting and recoding items ... [22 item(s)] done [0.00s].
## creating sparse bit matrix ... [22 row(s), 3900 column(s)] done [0.00s].
## writing ... [22 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Combined_Item=Clothing_Skirt_Black} 0.003846154 15
## [2] {Combined_Item=Accessories_Jewelry_Gray} 0.003589744 14
## [3] {Combined_Item=Accessories_Jewelry_Indigo} 0.003333333 13
## [4] {Combined_Item=Clothing_Shorts_Yellow} 0.003333333 13
## [5] {Combined_Item=Clothing_Hoodie_Pink} 0.003333333 13
## [6] {Combined_Item=Clothing_Pants_Charcoal} 0.003076923 12
## [7] {Combined_Item=Footwear_Sneakers_Indigo} 0.003076923 12
## [8] {Combined_Item=Footwear_Sandals_Purple} 0.003076923 12
## [9] {Combined_Item=Clothing_Sweater_Cyan} 0.003076923 12
## [10] {Combined_Item=Clothing_Dress_Charcoal} 0.003076923 12
## [11] {Combined_Item=Accessories_Sunglasses_Lavender} 0.003076923 12
## [12] {Combined_Item=Clothing_Pants_Turquoise} 0.003076923 12
## [13] {Combined_Item=Accessories_Scarf_Violet} 0.003076923 12
## [14] {Combined_Item=Clothing_Blouse_Violet} 0.003076923 12
## [15] {Combined_Item=Accessories_Belt_Blue} 0.003076923 12
## [16] {Combined_Item=Accessories_Sunglasses_Olive} 0.003076923 12
## [17] {Combined_Item=Accessories_Handbag_Charcoal} 0.003076923 12
## [18] {Combined_Item=Clothing_Pants_Cyan} 0.003076923 12
## [19] {Combined_Item=Clothing_Skirt_Teal} 0.003076923 12
## [20] {Combined_Item=Clothing_Sweater_Maroon} 0.003076923 12
## [21] {Combined_Item=Footwear_Boots_Violet} 0.003076923 12
## [22] {Combined_Item=Footwear_Shoes_Maroon} 0.003076923 12
Let’s categorize customers into four groups ("High Value Female", "High Value Male", "Non High Value Female", and "Non High Value Male") and find out which items are popular in each.
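Because the same filter-and-mine pattern is repeated for each of the four segments below, a small helper function can keep the code compact; the sketch below uses illustrative names and simply mirrors the per-segment steps that follow.
# Illustrative helper: mine frequent single items for one customer segment
segment_itemsets <- function(df, class, gender, supp) {
  seg <- subset(df, Customer_class == class & Gender == gender)
  trans <- as(split(seg$Combined_Item, seq_along(seg$Combined_Item)), "transactions")
  eclat(trans, parameter = list(supp = supp))
}
# Example: high-value female customers at 0.8% minimum support
inspect(segment_itemsets(temp1_df, "High Value", "Female", 0.008))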
Display the top items purchased by high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.008))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.008 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 2
##
## create itemset ...
## set transactions ...[208 item(s), 255 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating sparse bit matrix ... [6 row(s), 255 column(s)] done [0.00s].
## writing ... [6 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Sunglasses_Lavender} 0.01960784 5
## [2] {Clothing_Blouse_Charcoal} 0.01176471 3
## [3] {Clothing_Skirt_Violet} 0.01176471 3
## [4] {Clothing_Hoodie_Charcoal} 0.01176471 3
## [5] {Accessories_Backpack_Yellow} 0.01176471 3
## [6] {Clothing_Dress_Olive} 0.01176471 3
Display the top items purchased by high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.006))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.006 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 3
##
## create itemset ...
## set transactions ...[369 item(s), 564 transaction(s)] done [0.00s].
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating sparse bit matrix ... [9 row(s), 564 column(s)] done [0.00s].
## writing ... [9 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Footwear_Boots_Violet} 0.008865248 5
## [2] {Clothing_Shorts_Yellow} 0.008865248 5
## [3] {Accessories_Jewelry_Orange} 0.007092199 4
## [4] {Clothing_Hoodie_White} 0.007092199 4
## [5] {Clothing_Blouse_Violet} 0.007092199 4
## [6] {Accessories_Jewelry_Gray} 0.007092199 4
## [7] {Footwear_Sandals_Lavender} 0.007092199 4
## [8] {Footwear_Sandals_Cyan} 0.007092199 4
## [9] {Accessories_Sunglasses_Purple} 0.007092199 4
Display the top items purchased by non-high-value female customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Female")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.005))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.005 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 4
##
## create itemset ...
## set transactions ...[483 item(s), 993 transaction(s)] done [0.00s].
## sorting and recoding items ... [13 item(s)] done [0.00s].
## creating sparse bit matrix ... [13 row(s), 993 column(s)] done [0.00s].
## writing ... [13 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Outerwear_Jacket_Indigo} 0.006042296 6
## [2] {Accessories_Hat_Magenta} 0.005035247 5
## [3] {Clothing_Blouse_Violet} 0.005035247 5
## [4] {Accessories_Scarf_Magenta} 0.005035247 5
## [5] {Clothing_Shorts_Black} 0.005035247 5
## [6] {Footwear_Sandals_Blue} 0.005035247 5
## [7] {Accessories_Scarf_Green} 0.005035247 5
## [8] {Clothing_Skirt_White} 0.005035247 5
## [9] {Clothing_Socks_Pink} 0.005035247 5
## [10] {Accessories_Backpack_Green} 0.005035247 5
## [11] {Clothing_Skirt_Teal} 0.005035247 5
## [12] {Clothing_Pants_Yellow} 0.005035247 5
## [13] {Clothing_Blouse_Lavender} 0.005035247 5
Display the top items purchased by non-high-value male customers
# Filter data
filtered_data <- subset(temp1_df, Customer_class == "Non High Value" & Gender == "Male")
# Convert data to transactions using transactions function
transactions_data <- as(split(filtered_data$Combined_Item, seq_along(filtered_data$Combined_Item)), "transactions")
# Display frequent itemsets for the specific category
frequent_itemsets <- eclat(transactions_data, parameter = list(supp = 0.0035))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.0035 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 7
##
## create itemset ...
## set transactions ...[608 item(s), 2088 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating sparse bit matrix ... [12 row(s), 2088 column(s)] done [0.00s].
## writing ... [12 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(frequent_itemsets)
## items support count
## [1] {Accessories_Handbag_Charcoal} 0.004789272 10
## [2] {Accessories_Backpack_Olive} 0.004310345 9
## [3] {Clothing_Shirt_Orange} 0.004310345 9
## [4] {Clothing_Skirt_Black} 0.004310345 9
## [5] {Clothing_Sweater_Cyan} 0.004310345 9
## [6] {Accessories_Scarf_Blue} 0.003831418 8
## [7] {Footwear_Shoes_Gold} 0.003831418 8
## [8] {Clothing_Hoodie_Pink} 0.003831418 8
## [9] {Clothing_Pants_Turquoise} 0.003831418 8
## [10] {Accessories_Gloves_Turquoise} 0.003831418 8
## [11] {Footwear_Sandals_Orange} 0.003831418 8
## [12] {Accessories_Backpack_Black} 0.003831418 8
The comprehensive basket analysis across various customer segments reveals nuanced purchasing preferences.
High-value female customers tend to choose elegant accessories and specific clothing items, such as lavender sunglasses and charcoal blouses.
High-value male customers show a similar trend towards distinctive clothing and accessories, like violet boots.
Conversely, non-high-value female and male customers display different preferences, with choices leaning towards practical and varied items such as indigo jackets and charcoal handbags.
These insights are crucial for businesses to enhance their targeting and inventory strategies, aligning with the specific tastes and demands of each customer group, ultimately leading to improved customer satisfaction and increased sales efficiency.
5.0 Discussion
5.1 EDA
Discussion: The analysis of the data indicates that while some factors, such as customer class and seasonality, have a noticeable effect on purchase amounts, others such as subscription status and payment method do not significantly influence how much customers spend. The data suggests a need to focus on high-value customers and to account for seasonality in marketing and inventory strategies. Moreover, shipping preferences might be an indicator of customers’ willingness to spend more, possibly linked to the urgency or importance of the purchase.
The similarity in spending across many groups suggests that customers’ value perception may not be significantly influenced by factors such as subscription status or discounts, which are often considered levers to increase spending. Instead, intrinsic factors like the nature of the product and customer classification seem to play a more pivotal role.
This discussion of output should be used to inform business decisions and strategy, focusing on areas that the data has shown to have the most significant impact on customer spending. The company might consider investing more in understanding the high-value customer base and leveraging seasonality trends to maximize revenue.
5.2 Modeling
Discussion: The Random Forest model emerges as the most robust for both regression and classification tasks. It not only predicts purchase amounts with lower errors but also classifies customers with high accuracy and precision. The high ROC AUC value suggests that it is good at distinguishing between high-value and non-high-value customers.
Gradient boosting improved the performance of the Linear Regression model, indicating the potential benefits of ensemble methods in predictive modeling. However, the MAE across models remains consistent, suggesting a limitation in the models’ ability to improve the average error.
In summary, the Random Forest model should be prioritized for both predicting customer purchase amounts and for classifying customer value. For future model development, efforts might focus on further reducing the MAE and exploring additional ensemble techniques to enhance model performance.
6.0 Conclusion:
Regression model: The Random Forest model initially appeared to be the best for regression. However, the performance of the linear regression model improved after gradient boosting, making it the best model overall for predicting customer purchase amount.
Classification model: The Random Forest model outperforms the other models in identifying the high-value customer class.
In conclusion, our data analysis has provided valuable insights into the purchasing behaviors and preferences of our customer base. We have examined the factors influencing purchase amounts, including previous purchase history, purchase frequency, seasonality, subscription status, shipping type, discount application, and payment method. Notably, high-value customers were found to contribute substantially to overall purchase amounts, indicating the importance of focusing marketing efforts on this segment.
Our multiple regression models facilitated the prediction of purchase amounts, while classification models enabled the segmentation of customers into high-value and non-high-value groups. However, the lack of strong correlations among some variables presented challenges, suggesting a need for richer data for more accurate predictions.
The decision tree for customer classification illuminated the importance of specific products and demographic characteristics in predicting customer value. These insights can directly inform targeted marketing strategies and inventory management.
For future work, we recommend enhancing data collection processes to capture more nuanced customer interactions, and exploring advanced machine learning techniques to refine our predictive models. Continual reevaluation and adaptation of strategies in response to emerging data trends will be crucial to maintaining a competitive edge in customer value optimization.