This data was downloaded from Kaggle (https://www.kaggle.com/carrie1/ecommerce-data); its original source is the UCI Machine Learning Repository.
Loading libraries
library(data.table)
data.table 1.12.6 using 2 threads (see ?getDTthreads). Latest news: r-datatable.com
Attaching package: ‘data.table’
The following objects are masked from ‘package:dplyr’:
between, first, last
The following object is masked from ‘package:purrr’:
transpose
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
library(tidyverse)
library(arules)
Attaching package: ‘arules’
The following object is masked from ‘package:car’:
recode
The following object is masked from ‘package:dplyr’:
recode
The following objects are masked from ‘package:base’:
abbreviate, write
library(arulesViz)
Loading required package: grid
Registered S3 method overwritten by 'seriation':
method from
reorder.hclust gclus
Registered S3 methods overwritten by 'htmltools':
method from
print.html tools:rstudio
print.shiny.tag tools:rstudio
print.shiny.tag.list tools:rstudio
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
library(knitr)
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:randomForest’:
combine
The following object is masked from ‘package:dplyr’:
combine
library(lubridate)
Attaching package: ‘lubridate’
The following objects are masked from ‘package:data.table’:
hour, isoweek, mday, minute, month, quarter, second, wday, week, yday, year
The following object is masked from ‘package:base’:
date
library(plyr)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Attaching package: ‘plyr’
The following object is masked from ‘package:lubridate’:
here
The following objects are masked from ‘package:dplyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
The following object is masked from ‘package:purrr’:
compact
Loading data
dim(df_data)
[1] 406829 8
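The step that creates df_data is not shown in this notebook. A minimal sketch of how it might look, assuming the Kaggle file is a plain CSV read with data.table's fread; the temporary file, its two illustrative rows, and the column names Description, Quantity and UnitPrice are assumptions made so the example runs on its own:

```r
library(data.table)

# Hypothetical file: a tiny CSV written to a temp path so the sketch runs as-is.
# For the real data, point csv_path at the downloaded Kaggle file instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country",
  "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850,United Kingdom",
  "536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850,United Kingdom"
), csv_path)

# fread returns a data.table; convert to a plain data frame for the dplyr steps
df_data <- as.data.frame(fread(csv_path))
dim(df_data)
```

Reading InvoiceDate as text here matches the later as.Date() conversion, which parses the month/day/year format.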
Data Cleaning and Feature Engineering
# Removing rows with missing values
df_data <- df_data %>%
  drop_na()
# Converting the identifier columns to factors and InvoiceDate to Date
df_data <- df_data %>%
  mutate(InvoiceNo = as.factor(InvoiceNo), StockCode = as.factor(StockCode),
         InvoiceDate = as.Date(InvoiceDate, format = '%m/%d/%Y %H:%M'), CustomerID = as.factor(CustomerID),
         Country = as.factor(Country))
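The RFM step uses a column named total_dolar whose creation is not shown. A plausible sketch, assuming it is the value of each invoice line (quantity times unit price) and that the columns are named Quantity and UnitPrice; the toy data frame is hypothetical:

```r
library(dplyr)

# Toy stand-in for df_data (hypothetical values)
toy <- data.frame(
  InvoiceNo = c("A1", "A1", "A2"),
  Quantity  = c(6, 2, 10),
  UnitPrice = c(2.5, 4.0, 1.0)
)

# Assumption: total_dolar is the monetary value of each invoice line
toy <- toy %>%
  dplyr::mutate(total_dolar = Quantity * UnitPrice)

toy$total_dolar
```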
Creating Recency, Frequency and Monetary variables for every customer for RFM analysis.
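# dplyr::summarise is explicit because plyr, loaded after dplyr, masks summarise and ignores group_by()
# recency: days from the reference date to the last purchase; monitery: average spend per invoice
df_RFM <- df_data %>%
  group_by(CustomerID) %>%
  dplyr::summarise(recency = as.numeric(as.Date("2012-01-01") - max(InvoiceDate)),
                   frequenci = n_distinct(InvoiceNo),
                   monitery = sum(total_dolar) / n_distinct(InvoiceNo))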
# Let's see how these variables are distributed
ggplot(df_RFM,aes(x=recency)) + geom_histogram()
ggplot(df_RFM,aes(x=frequenci)) + geom_histogram()
ggplot(df_RFM,aes(x=monitery)) + geom_histogram()
These customers can now be clustered into different groups by:
1. Scoring them on quantile values of Recency, Frequency and Monetary
2. K-means clustering
3. Hierarchical clustering
Let's choose hierarchical clustering here.
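The clustering code itself is not included in this notebook. A minimal sketch of hierarchical clustering on RFM-style variables, using base R only; the synthetic data, the Ward linkage and k = 2 are all assumptions chosen for illustration, not the notebook's actual settings:

```r
# Synthetic stand-in for df_RFM (hypothetical values, same column names as above)
set.seed(42)
rfm <- data.frame(
  recency   = c(rnorm(20, 30, 5),   rnorm(20, 300, 20)),
  frequenci = c(rnorm(20, 12, 2),   rnorm(20, 2, 0.5)),
  monitery  = c(rnorm(20, 400, 50), rnorm(20, 150, 30))
)

# Standardise so that no single variable dominates the Euclidean distance
rfm_scaled <- scale(rfm)

# Ward linkage on Euclidean distances (an assumed choice)
hc <- hclust(dist(rfm_scaled), method = "ward.D2")

# Cut the dendrogram into k groups; k = 2 is an arbitrary illustration
rfm$cluster <- cutree(hc, k = 2)

# Average R, F and M per cluster to help interpret the groups
aggregate(cbind(recency, frequenci, monitery) ~ cluster, data = rfm, FUN = mean)
```

Scaling matters here because recency, frequency and monetary value are on very different scales; plot(hc) draws the dendrogram and can help in picking k.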
Market Basket Analysis to see which products were sold together. This can be combined with the targeting/discounting strategy above to highlight products that are frequently bought together.
class(tr)
[1] "transactions"
attr(,"package")
[1] "arules"
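How tr was built is not shown. One common way to get a transactions object from invoice-level rows is to group product descriptions by invoice and coerce the resulting list; the sketch below does this on a hypothetical toy data frame, and the column name Description is an assumption:

```r
library(arules)

# Toy invoice lines (hypothetical); one row per product on an invoice
toy <- data.frame(
  InvoiceNo   = c("A1", "A1", "A2", "A2", "A2", "A3"),
  Description = c("TEA CUP", "TEA SAUCER", "TEA CUP", "TEA SAUCER",
                  "CAKE STAND", "CAKE STAND"),
  stringsAsFactors = FALSE
)

# Group product descriptions by invoice, dropping repeats within an invoice
baskets <- lapply(split(toy$Description, toy$InvoiceNo), unique)

# Coerce the list of baskets to the arules "transactions" class
tr_toy <- as(baskets, "transactions")
class(tr_toy)
size(tr_toy)  # number of distinct items per transaction
```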
Let's see what the most frequently sold products look like.
itemFrequencyPlot(tr, topN = 20, type = "absolute", col = RColorBrewer::brewer.pal(8, 'Pastel2'), main = "Absolute Item Frequency Plot")
# Hanging T-Light holder is sold the most, followed by Regency Cakestand 3 tier
# Using the apriori algorithm to set up association rules
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8,maxlen=10))
Apriori
Parameter specification:
Algorithmic control:
Absolute minimum support count: 22
set item appearances ...[0 item(s)] done [0.03s].
set transactions ...[8181 item(s), 22191 transaction(s)] done [0.56s].
sorting and recoding items ... [2623 item(s)] done [0.03s].
creating transaction tree ... done [0.02s].
checking subsets of size 1 2 3 4 5 6 7 8 9 10
Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
done [0.64s].
writing ... [49122 rule(s)] done [0.10s].
creating S4 object ... done [0.08s].
summary(association.rules)
set of 49122 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5 6 7 8 9 10
105 2111 6854 16424 14855 6102 1937 613 121
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 5.000 5.000 5.499 6.000 10.000
summary of quality measures:
support confidence lift count
Min. :0.001036 Min. :0.8000 Min. : 9.846 Min. : 23.00
1st Qu.:0.001082 1st Qu.:0.8333 1st Qu.: 22.237 1st Qu.: 24.00
Median :0.001262 Median :0.8788 Median : 28.760 Median : 28.00
Mean :0.001417 Mean :0.8849 Mean : 64.589 Mean : 31.45
3rd Qu.:0.001532 3rd Qu.:0.9259 3rd Qu.: 69.200 3rd Qu.: 34.00
Max. :0.015997 Max. :1.0000 Max. :715.839 Max. :355.00
mining info:
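The summary aggregates over all rules. To see which individual rules carry the highest lift, one option is to sort the rule set by lift and inspect the top few. The sketch below uses the Groceries data bundled with arules as a stand-in for tr (an assumption, so that the example runs on its own); the same calls should apply to association.rules:

```r
library(arules)

# Stand-in data: the Groceries transactions shipped with arules
data("Groceries")

# Same thresholds as above, with progress output silenced
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8, maxlen = 10),
                 control = list(verbose = FALSE))

# Sort by lift (descending by default) and keep the five strongest rules
top_rules <- head(sort(rules, by = "lift"), 5)
inspect(top_rules)
```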
We can see that Billboard font design & Wrap, Art lights & funk monkey, Swiss roll towel and chocolate spots have a much higher probability of being sold together than would be expected from how often each is sold individually.
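The comparison of "sold together" against "sold individually" is what the lift measure captures: lift(A => B) = confidence(A => B) / support(B), where values above 1 mean the items co-occur more often than independence would suggest. A small worked example in base R, on hypothetical toy baskets:

```r
# Toy baskets (hypothetical), one character vector per transaction
baskets <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("C"),
  c("B", "C")
)
n <- length(baskets)

# Share of baskets that contain every item in `items`
support <- function(items) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1))) / n
}

supp_ab    <- support(c("A", "B"))       # A and B together
confidence <- supp_ab / support("A")     # P(B | A)
lift       <- confidence / support("B")  # P(B | A) / P(B)

c(support = supp_ab, confidence = confidence, lift = lift)
```

Here A and B each appear in 3 of 5 baskets and together in 2, so the lift is (2/3) / (3/5), roughly 1.11: a mild association, far weaker than the values in the summary above.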