The dataset is the Retail Rocket e-commerce dataset, and the aim is to build a recommender system that predicts the transaction and event pattern of a visitor. The main source for this is the events.csv file, which records each event (view, add to cart or transaction) that a visitor performs on a particular item at a particular time. The ‘transactionid’ column has a value only when the event is a ‘transaction’; otherwise it is N/A.
Next, we load the files into RStudio using the following commands:
getwd()
## [1] "C:/Users/surtabhi/Desktop"
category_tree <- read.csv("category_tree.csv", header = TRUE)
events <- read.csv("events.csv", header = TRUE)
item_properties1 <- read.csv("item_properties_part1.csv", header = TRUE)
item_properties2 <- read.csv("item_properties_part2.csv", header = TRUE)
The data comes in four CSV files. Their column names are as follows:
colnames(category_tree)
## [1] "categoryid" "parentid"
colnames(events)
## [1] "timestamp" "visitorid" "event" "itemid"
## [5] "transactionid"
colnames(item_properties1)
## [1] "timestamp" "itemid" "property" "value"
The number of rows in each data frame is as follows:
nrow(events)
## [1] 2756101
nrow(category_tree)
## [1] 1669
nrow(item_properties1)
## [1] 10999999
nrow(item_properties2)
## [1] 9275903
The file category_tree.csv has 25 N/A values among its 1669 rows. However, both of its columns hold ids, so the missing values cannot be imputed with the mean of the existing values or with a random number. In the events.csv file, N/A values occur in the ‘transactionid’ column. The plausible explanation is that a ‘transactionid’ exists only when event = transaction. So no data preprocessing is required so far.
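The missing-value check described here can be sketched in a few lines of base R. The data frame `category_tree_sample` below is a small stand-in built from six rows of category_tree that appear in this report, and the reading of a missing parentid as a top-level category is an assumption rather than something the data states.

```r
# Toy stand-in for category_tree: six rows taken from this report's head() output
category_tree_sample <- data.frame(
  categoryid = c(1016, 809, 570, 1691, 536, 231),
  parentid   = c(213, 169, 9, 885, 1691, NA)
)

# Number of missing values per column
na_per_column <- colSums(is.na(category_tree_sample))
print(na_per_column)

# Categories without a parent (assumed here to be top-level categories);
# they are kept as they are instead of being imputed with a made-up id
root_categories <- category_tree_sample$categoryid[is.na(category_tree_sample$parentid)]
print(root_categories)
```

Running the same `colSums(is.na(...))` call on the full data frames gives the per-column counts for the real data.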
Next, we look at the head of each file to get to know its structure. The files item_properties_part1.csv and item_properties_part2.csv contain hashed values in the ‘property’ and ‘value’ columns.
head(category_tree) #all id's
## categoryid parentid
## 1 1016 213
## 2 809 169
## 3 570 9
## 4 1691 885
## 5 536 1691
## 6 231 NA
head(events) #No missing timestamp values. Other columns have expected data
## timestamp visitorid event itemid transactionid
## 1 1.433221e+12 257597 view 355908 NA
## 2 1.433224e+12 992329 view 248676 NA
## 3 1.433222e+12 111016 view 318965 NA
## 4 1.433222e+12 483717 view 253185 NA
## 5 1.433221e+12 951259 view 367447 NA
## 6 1.433224e+12 972639 view 22556 NA
In item_properties1, repeated values of a property recorded at different timestamps have already been removed. The entries in ‘property’ and ‘value’ are hashed, except for two properties that can be read directly: categoryid (the id of the item’s category, which links to its parent’s id through the category_tree.csv file) and available (the availability of the item: 0 or 1).
head(item_properties1)
## timestamp itemid property value
## 1 1.435460e+12 460429 categoryid 1338
## 2 1.441508e+12 206783 888 1116713 960601 n277.200
## 3 1.439089e+12 395014 400 n552.000 639502 n720.000 424566
## 4 1.431227e+12 59481 790 n15360.000
## 5 1.431832e+12 156781 917 828513
## 6 1.436065e+12 285026 available 0
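Since only the categoryid and available properties are readable, one possible next step is to separate them from the hashed rows. The sketch below does this on a three-row stand-in (`item_properties_sample`, built from rows printed in this report). The names `readable`, `item_category` and `item_available` are illustrative, and on the full data the two parts would presumably be stacked first with `rbind(item_properties1, item_properties2)`.

```r
# Toy stand-in for item_properties1: three rows taken from this report's head() output
item_properties_sample <- data.frame(
  timestamp = c(1.435460e+12, 1.431227e+12, 1.436065e+12),
  itemid    = c(460429, 59481, 285026),
  property  = c("categoryid", "790", "available"),
  value     = c("1338", "n15360.000", "0"),
  stringsAsFactors = FALSE
)

# Keep only the two properties whose names and values are not hashed
keep <- item_properties_sample$property %in% c("categoryid", "available")
readable <- item_properties_sample[keep, ]

# item -> category lookup
item_category <- readable[readable$property == "categoryid", c("itemid", "value")]
names(item_category)[2] <- "categoryid"
item_category$categoryid <- as.integer(item_category$categoryid)

# item -> availability (0 or 1) at a given timestamp
item_available <- readable[readable$property == "available", c("timestamp", "itemid", "value")]
names(item_available)[3] <- "available"
item_available$available <- as.integer(item_available$available)

print(item_category)
print(item_available)
```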
Once we know the data, we explore it further. To begin with, we look at the similarities between visitors and the items they look at. We will use the ‘similarity’ function from the ‘recommenderlab’ package for that. It accepts its input as a realRatingMatrix. The relevant code is as follows:
library(recommenderlab)
## Warning: package 'recommenderlab' was built under R version 3.4.1
## Loading required package: Matrix
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.4.1
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## Loading required package: proxy
## Warning: package 'proxy' was built under R version 3.4.1
##
## Attaching package: 'proxy'
## The following object is masked from 'package:Matrix':
##
## as.matrix
## The following objects are masked from 'package:stats':
##
## as.dist, dist
## The following object is masked from 'package:base':
##
## as.matrix
## Loading required package: registry
## Warning: package 'registry' was built under R version 3.4.1
events_matrix<- as(events, "matrix")
class(events_matrix)
## [1] "matrix"
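# The data-frame coercion reads the columns in order as user, item and (optional) rating,
# so the visitor and item columns are selected explicitly; repeated pairs add up to counts
events_rating_matrix <- as(events[, c("visitorid", "itemid")], "realRatingMatrix")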
class(events_rating_matrix)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"
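The coercion to a realRatingMatrix takes a data frame whose columns are, in order, user, item and rating, so it can help to prepare that shape explicitly. The sketch below does this in base R on a toy sample. The visitor and item ids come from this report, but the last two rows and the `event_weight` values are invented for illustration and are not part of the original analysis.

```r
# Toy events: the first three rows use ids from this report, the last two are invented
events_sample <- data.frame(
  visitorid = c(257597, 992329, 111016, 257597, 257597),
  itemid    = c(355908, 248676, 318965, 355908, 248676),
  event     = c("view", "view", "view", "addtocart", "view"),
  stringsAsFactors = FALSE
)

# Hypothetical weights (an assumption): stronger purchase signals count more
event_weight <- c(view = 1, addtocart = 3, transaction = 5)
events_sample$rating <- unname(event_weight[events_sample$event])

# One row per (visitor, item) pair, summing the weights of repeated events
ratings_long <- aggregate(rating ~ visitorid + itemid, data = events_sample, FUN = sum)
print(ratings_long)

# With recommenderlab loaded, a data frame in this column order could then be coerced:
# rating_matrix <- as(ratings_long, "realRatingMatrix")
```

Whether weighted events or plain interaction counts work better as the implicit rating is a modelling choice that would need to be evaluated on the test set.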
We divide the given data set into training and test sets as follows:
library(recommenderlab)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(101) # Set the seed so that the same sample can be reproduced later
# Select 75% of the rows (users) of the rating matrix as the training sample
train_idx <- sample.int(n = nrow(events_rating_matrix), size = floor(.75*nrow(events_rating_matrix)), replace = FALSE)
train_data <- events_rating_matrix[train_idx, ]
test_data <- events_rating_matrix[-train_idx, ]
The different kinds of recommender models that can be used are as follows:
recommender_models <- recommenderRegistry$get_entries(dataType =
"realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
Out of these, UBCF (user-based collaborative filtering) will be used. It fits the given data set best, since we need to study each user's past activity and then predict future activity from it. The data frames are converted into matrices because that is the input accepted by the Recommender and related functions. Also, with such a large data set, storing the data as a rating matrix is efficient in terms of memory usage.
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.4.1
library(arules)
# UBCF takes the neighbourhood size as 'nn' (it has no 'n' parameter)
rec_model <- Recommender(data = train_data, method = "UBCF", param = list(normalize = "Z-score", method = "Cosine", nn = 5))
predict_item <- function(visitorid)
{
  # Select this visitor's row; assumes the rows of events_rating_matrix are named by visitorid
  visitor_ratings <- events_rating_matrix[as.character(visitorid), ]
  # perform the prediction: top 3 items for this visitor
  recommend <- predict(rec_model, visitor_ratings, n = 3)
  return(as(recommend, "list"))
}
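To make the idea behind user-based collaborative filtering concrete, here is a minimal base R sketch on a toy matrix. It is not recommenderlab's implementation: it skips the Z-score normalization, and the matrix values and the helper names `cosine_sim` and `recommend_ubcf` are invented for illustration.

```r
# Toy visitor x item interaction counts (values invented for illustration)
ratings <- matrix(
  c(2, 1, 0, 0,
    1, 1, 0, 1,
    0, 0, 3, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("v1", "v2", "v3"), c("itemA", "itemB", "itemC", "itemD"))
)

# Cosine similarity between two rating vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Recommend up to n unseen items to `target`, using its nn most similar visitors
recommend_ubcf <- function(ratings, target, nn = 1, n = 1) {
  others <- setdiff(rownames(ratings), target)
  sims <- sapply(others, function(u) cosine_sim(ratings[target, ], ratings[u, ]))
  neighbours <- names(sort(sims, decreasing = TRUE))[seq_len(min(nn, length(sims)))]
  # Similarity-weighted sum of the neighbours' ratings for every item
  scores <- colSums(ratings[neighbours, , drop = FALSE] * sims[neighbours])
  scores[ratings[target, ] > 0] <- -Inf  # do not recommend items already seen
  ranked <- names(sort(scores, decreasing = TRUE))
  head(ranked[is.finite(scores[ranked]) & scores[ranked] > 0], n)
}

print(recommend_ubcf(ratings, "v1", nn = 1, n = 1))
```

Here v2 is the visitor most similar to v1, and itemD is the only item that v2 interacted with and v1 has not, so it is the one recommended.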
From the given data, we can draw some actionable insights that can be helpful for the organization. The original data is not modified here; instead, new data frames are created as subsets of it.
#subset of data
date_only <- events["timestamp"]
event_only <- events["event"]
events_visitorid <- events["visitorid"]
new_events2 <- events[100:200,c("visitorid","event")]
new_events <- events[1:100,c("visitorid","itemid")]
new_events_matrix <- as(new_events, "matrix")
new_events_rating <- as(new_events, "realRatingMatrix")
slotNames(new_events_rating) #has a data slot
## [1] "data" "normalize"
class(new_events_rating@data)
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(new_events_rating@data)
## [1] 94 98
item_occurrence <- as.vector(new_events_rating@data)
unique(item_occurrence)
## [1] 0 1 2
sum(item_occurrence)
## [1] 100
occurrence_table <- table(item_occurrence)
occurrence_table
## item_occurrence
## 0 1 2
## 9113 98 1
occurrence <- table(new_events$itemid)
occurrence_not1 <- occurrence[occurrence!=1]
unique(new_events2$event)
## [1] view addtocart transaction
## Levels: addtocart transaction view
Mode <- function(event)
{
  # most frequent value of the vector passed in
  event_u <- unique(event)
  event_u[which.max(tabulate(match(event, event_u)))]
}
# number of rows in the subset whose event equals the most frequent event
sk <- sum(new_events2$event == Mode(new_events2$event))
table(event_only)
## event_only
## addtocart transaction view
## 69332 22457 2664312
# a bar plot shows one bar per event type; a histogram of the three counts would not
barplot(table(event_only), col = "red", main = "Number of Events by Type")
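The event counts reported by table(event_only) can also be turned into simple ratios. The sketch below is only arithmetic on those three published counts. The ratios are event-level figures, not per-visitor conversion rates, and the variable names are illustrative.

```r
# Event counts as reported by table(event_only)
event_counts <- c(addtocart = 69332, transaction = 22457, view = 2664312)

# Share of each event type among all events
event_share <- event_counts / sum(event_counts)

# Event-level ratios (not per-visitor conversion rates)
cart_per_view     <- event_counts[["addtocart"]]   / event_counts[["view"]]
purchase_per_view <- event_counts[["transaction"]] / event_counts[["view"]]
purchase_per_cart <- event_counts[["transaction"]] / event_counts[["addtocart"]]

print(round(100 * event_share, 2))
print(round(100 * c(cart_per_view, purchase_per_view, purchase_per_cart), 2))
```

The three counts add up to 2756101, which matches the number of rows in events.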
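# avoid naming variables 'max' and 'min', which would shadow the base functions
max_visitor_events <- max(table(events_visitorid))
min_visitor_events <- min(table(events_visitorid))
barplot(table(events_visitorid), col = "wheat", main = "Occurrences of various visitor IDs")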
val <- date_only$timestamp
date_modified <- as.Date(as.POSIXct(val, origin="1970-01-01"))
date_original <- date_only$timestamp
date_df <- data.frame(date_original, date_modified)
tail(names(sort(table(date_df$date_original))), 5)
## [1] "1435017962693" "1436979796576" "1438630520761" "1439914941023"
## [5] "1441212019542"
These are the five original Unix timestamps (in milliseconds) that have the highest frequency in the given data.
date_occurences<-table(unlist(date_only$timestamp))
date_occurences<-table(unlist(date_df$date_modified))
max(date_occurences)
## [1] 110
tail(names(sort(table(date_df$date_modified))), 5)
## [1] "47495-09-07" "47495-11-28" "47533-08-31" "47496-04-01" "47496-06-23"
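So the most frequent converted date occurs 110 times. The timestamps are in milliseconds, whereas as.POSIXct() interprets its input as seconds; that is why the converted dates fall around the year 47495, and each such ‘date’ covers only 86.4 seconds of real time. The count of 110 is therefore the largest number of events within one 86.4-second window, not within one calendar day. Dividing the timestamps by 1000 before the conversion gives real calendar dates.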
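Because the raw timestamps are in milliseconds, a conversion to calendar dates could look like the following sketch. It uses one of the frequent timestamps listed in this report. The choice of UTC as the time zone is an assumption, since the data does not state one.

```r
# One of the most frequent raw timestamps in this report (milliseconds since the epoch)
ts_ms <- 1435017962693

# Convert milliseconds to seconds before handing the value to as.POSIXct()
ts_posix <- as.POSIXct(ts_ms / 1000, origin = "1970-01-01", tz = "UTC")

# Calendar date in UTC (the time zone is an assumption)
event_date <- as.Date(ts_posix, tz = "UTC")

print(ts_posix)
print(event_date)
```

Applied to the whole column, `table(as.Date(as.POSIXct(events$timestamp / 1000, origin = "1970-01-01", tz = "UTC")))` would give the number of events per calendar day.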
tail(names(sort(table(events$visitorid))), 5)
## [1] "163561" "895999" "152963" "530559" "1150086"
These are the five most active visitors, listed in increasing order of activity.
visitor_occurences<-table(unlist(events$visitorid))
#boxplot(visitor_occurences, col = "blue")
hist(events$itemid, col = "blue")
rug(events$itemid)
max(visitor_occurences)
## [1] 7757
min(visitor_occurences)
## [1] 1
So the maximum number of events performed by a single visitor is 7757 and the minimum is 1.