Recommender System

1 Background Problems
2 Data Collection
3 Exploratory Data Analysis
4 Annotations

1 Background Problems

A recommender system is a system that produces recommendations for users based on the interactions between users and items. It is one of the most common applications of machine learning. Almost every product that has users and items uses a recommender system. But how does it work?

For example, let’s say we have an online marketplace. We have users identified by username, items identified by product, and values given by the ratings that users assign to products. Using these features, we can recommend products based on the similarities between users. This is the basic concept behind product recommendation, although in practice it is much more complicated.
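To make "similarities between users" concrete, here is a minimal sketch in base R. The toy rating matrix, the user and item names, and the choice of cosine similarity are all assumptions for illustration; the source does not state which similarity measure is used.

```r
# Toy user-item rating matrix (hypothetical data); NA means "not rated".
ratings_matrix <- matrix(
  c(5,  4, NA,  1,
    4,  5,  2, NA,
    1, NA,  5,  4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("alice", "bob", "carol"),
                  c("router", "filter", "cable", "mouse"))
)

# Cosine similarity computed only over the items both users have rated.
cosine_sim <- function(a, b) {
  shared <- !is.na(a) & !is.na(b)
  if (!any(shared)) return(NA_real_)
  sum(a[shared] * b[shared]) /
    (sqrt(sum(a[shared]^2)) * sqrt(sum(b[shared]^2)))
}

# Pairwise similarity between all users.
users <- rownames(ratings_matrix)
sim <- sapply(users, function(u) {
  sapply(users, function(v) cosine_sim(ratings_matrix[u, ], ratings_matrix[v, ]))
})

# The most similar other user to "alice"; items that user rated highly
# are candidates to recommend to alice.
others <- sim["alice", setdiff(users, "alice")]
most_similar <- names(which.max(others))
```

In this toy example alice and bob rate the router and the filter similarly, so bob's other ratings would drive alice's recommendations.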

If used appropriately, with good accuracy and variety, a recommendation system can boost sales and user satisfaction. This in turn increases loyalty and reduces the likelihood of churn. According to a study by McKinsey¹, 75% of what consumers watch on Netflix comes from the recommendation system. Netflix executives Carlos A. Gomez-Uribe and Neil Hunt also claim that their recommender system saves the company about $1 billion per year². Amazon also credits its recommendation system for 35% of its revenue³.

The problem is that many recommender system implementations today are far too simple. Given the huge increase in the number of online interactions, a model that is too simple will produce a lot of inappropriate recommendations. This is reflected in the many customers who complain on e-commerce sites or social media about how poor the recommendations they receive are.

Many recommendation algorithms have emerged in recent years. Based on the analysis here, we can compare 8 different models by their root mean squared error (RMSE).

Recommender System Algorithms Comparison
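As a reminder of what the comparison metric measures, RMSE is the square root of the mean squared difference between actual and predicted ratings, so lower is better. A minimal sketch with made-up ratings (the vectors below are hypothetical and are not taken from the comparison):

```r
# Root mean squared error between actual and predicted ratings.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical ratings on a 1-5 scale.
actual    <- c(5, 4, 3, 5, 1)
predicted <- c(4.5, 4, 3.5, 4, 2)

model_rmse <- rmse(actual, predicted)
```

Because the errors are squared before averaging, a few badly wrong predictions raise the RMSE more than many slightly wrong ones.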

Most companies today use user-based collaborative filtering (UBCF) and item-based collaborative filtering (IBCF), combined with a popularity-based recommender (Popular). Some use matrix factorization with SVD. Only a few have taken advantage of gradient descent and deep learning, probably because of the complexity and slow training time.

This is where the hybrid method comes in handy. A hybrid method, which is simply the combination of UBCF/IBCF and Content-Based Filtering, doesn’t require a lot of training time and doesn’t have a complex setup, whilst still being a powerful model.
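One simple way to combine the two kinds of model is a weighted average of their predicted scores. The sketch below assumes that scheme; the scores, the item names, the `hybrid_scores` helper and the weight of 0.7 are all hypothetical, and the source does not specify how the models are actually combined.

```r
# Hypothetical predicted scores for one user from each model (1-5 scale).
cf_scores      <- c(router = 4.6, filter = 3.9, cable = 2.5, mouse = 4.1)
content_scores <- c(router = 4.0, filter = 4.4, cable = 3.0, mouse = 3.2)

# Weighted average of the two score vectors; `weight` is the share
# given to the collaborative filtering model.
hybrid_scores <- function(cf, content, weight = 0.5) {
  stopifnot(identical(names(cf), names(content)), weight >= 0, weight <= 1)
  weight * cf + (1 - weight) * content
}

scores <- hybrid_scores(cf_scores, content_scores, weight = 0.7)

# Top-n recommendation: the n items with the highest combined score.
top_n <- names(sort(scores, decreasing = TRUE))[1:2]
```

The weight would normally be tuned on held-out data rather than fixed by hand.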

In this project, I will use various methods and algorithms to build a recommender system that matches each user's interests as closely as possible. The client can use the dashboard to pick a username and see the top-n items that should be recommended.

This project focuses on transactions on an e-commerce platform, but recommender systems are used in many other fields, as I mentioned earlier. Some of the best-known recommender systems are:

YouTube video recommendation
Spotify songs recommendation
Netflix movie recommendation
etc.

2 Data Collection

The dataset was collected from here, an open Amazon reviews dataset made available by Jianmo Ni from UCSD⁴. The site provides the complete reviews from Amazon users from May 1996 to Oct 2018, with a total of 233.1 million reviews. There is a “smaller” subset called 5-Core, in which every user and every item has at least 5 reviews.

The data I collected is the 5-Core subset of the Electronics category. I have filtered the data so that the reviews span 2017 to 2018.

The data contains 15 columns:

 [1] "overall"      "vote"         "verified"     "reviewTime"   "reviewerID"  
 [6] "asin"         "reviewerName" "reviewText"   "summary"      "category"    
[11] "title"        "image_y"      "brand"        "date"         "price"

reviewerID - ID of the reviewer, e.g. AC6C9VNULPB3T
reviewerName - name of the reviewer
verified - whether the reviewer is verified or not
asin - ID of the product, e.g. 0000013714
title - name of the product
category - list of categories the product belongs to
brand - brand name
price - price in US dollars (at time of crawl)
image_y - url of the product image
date - date when the product was added
reviewTime - time of the review (raw)
summary - summary of the review
reviewText - text of the review
overall - rating of the product
vote - helpful votes of the review

Let’s see what we’ll need for further analysis:

user-item-rating : to build the collaborative filtering model
review text-item-rating : to build a content-based filtering model

We can split the data into two datasets, ratings and reviews. The reviews can also be labeled as positive or negative, where:
1. Ratings of 3 and above are considered positive
2. Ratings below 3 are considered negative
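The split and the labeling rule can be sketched as follows. The toy `raw` data frame is made up, and the mapping of `reviewerID`, `asin`, `overall` and `reviewText` to the `user`, `item`, `rating` and `text` columns is an assumption; only the threshold of 3 comes from the rule above.

```r
# Toy stand-in for the raw data (hypothetical rows, real column names).
raw <- data.frame(
  reviewerID = c("A1", "A2", "A3", "A4"),
  asin       = c("B001", "B001", "B002", "B003"),
  overall    = c(5, 2, 3, 1),
  reviewText = c("Works great", "Stopped working",
                 "Does the job", "Broken on arrival"),
  stringsAsFactors = FALSE
)

# Ratings dataset: one row per user-item-rating triple.
ratings <- data.frame(
  user   = raw$reviewerID,
  item   = raw$asin,
  rating = raw$overall,
  stringsAsFactors = FALSE
)

# Reviews dataset: review text with a Positive/Negative label.
# Ratings of 3 and above are Positive, ratings below 3 are Negative.
reviews <- data.frame(
  item   = raw$asin,
  text   = raw$reviewText,
  rating = raw$overall,
  label  = ifelse(raw$overall >= 3, "Positive", "Negative"),
  stringsAsFactors = FALSE
)
```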

Ratings Dataset

Reviews Dataset

3 Exploratory Data Analysis

1. How many users and how many items are there?

length(unique(ratings$user))

[1] 22708

length(unique(ratings$item))

[1] 1331

There are 22,708 distinct users and 1,331 distinct items.

2. How are the ratings distributed?

Most of the ratings are either 4 or 5.
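A distribution like this can be checked with a frequency table. The sketch below uses a small made-up vector of ratings rather than the real data, so the numbers are illustrative only.

```r
# Hypothetical ratings on a 1-5 scale.
toy_ratings <- c(5, 5, 4, 5, 3, 1, 4, 5, 2, 5)

# Count each rating value, keeping empty levels so all of 1-5 appear.
rating_counts <- table(factor(toy_ratings, levels = 1:5))

# Share of each rating value, and the combined share of 4s and 5s.
rating_share <- prop.table(rating_counts)
high_share   <- sum(rating_share[c("4", "5")])

# Simple bar chart of the counts.
barplot(rating_counts, xlab = "Rating", ylab = "Count",
        main = "Distribution of Ratings")
```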

3. What are the top 10 selling items?

NETGEAR 5-Port Gigabit Ethernet is the best-selling item, followed by Tiffen Protection Filters.

4. Who are the top 10 customers with the highest number of transactions?
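A ranking like this can be produced by counting rows per item and sorting. The sketch below uses a hypothetical `transactions` data frame with one row per transaction; the same pattern applied to a user column gives the top customers.

```r
# Hypothetical transactions, one row per purchase or review.
transactions <- data.frame(
  item = c("router", "filter", "router", "cable", "router", "filter"),
  stringsAsFactors = FALSE
)

# Count transactions per item and sort from most to least frequent.
item_counts <- sort(table(transactions$item), decreasing = TRUE)

# Keep at most the 10 most frequent items.
top_items <- head(item_counts, 10)
```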

As we can see, Mike has the most transactions of all customers, followed by John and Michael.

5. How do positive reviews compare to negative reviews?


  Negative   Positive 
0.09009951 0.90990049

Roughly 91% of the reviews are positive and 9% are negative. Since the classes are imbalanced, we’ll have to adjust the model parameters during modelling so that the model learns both classes in a balanced way.

The following function converts a vector of texts into a plot of the top 10 most frequent words. The text is cleaned by removing numbers, punctuation, and stopwords, then stemmed and tokenized.
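One common way to balance the classes is to weight each class inversely to its frequency. The sketch below applies that idea to the proportions shown above; the weighting formula is an assumption, since the source does not say which adjustment is used.

```r
# Class proportions taken from the output above.
label_share <- c(Negative = 0.09009951, Positive = 0.90990049)

# Inverse-frequency weights: each class contributes equally in total,
# so the rare Negative class gets a much larger per-review weight.
class_weights <- 1 / (length(label_share) * label_share)
```

With these weights a negative review counts roughly ten times as much as a positive one, which offsets the 91% to 9% imbalance.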

library(tm)       # corpus handling and text cleaning
library(slam)     # col_sums() for sparse matrices
library(ggplot2)  # plotting

# Clean a character vector of texts and plot its 10 most frequent terms.
getPlot <- function(data){
  text <- VCorpus(VectorSource(data))
  text <- tm_map(text, content_transformer(tolower))
  text <- tm_map(text, removeNumbers)
  text <- tm_map(text, removePunctuation)
  text <- tm_map(text, removeWords, stopwords("en"))
  text <- tm_map(text, stemDocument)  # requires the SnowballC package
  text <- tm_map(text, stripWhitespace)
  dtm <- DocumentTermMatrix(text)
  # Keep only terms that occur more than 20 times across all documents
  colTotals <- col_sums(dtm)
  dtm <- dtm[, which(colTotals > 20)]
  dtm.matrix <- as.matrix(dtm)
  df <- data.frame(terms = colnames(dtm.matrix), freq = colSums(dtm.matrix), row.names = NULL)
  df <- df[order(df$freq, decreasing = TRUE), ]
  ggplot(head(df, 10), aes(reorder(terms, -freq), freq)) +
    geom_col(fill = "skyblue") +
    theme_minimal() +
    labs(title = "Top 10 Most Frequent Words", x = "Terms", y = "Frequencies")
}

Then we can split the reviews into positive and negative.

library(dplyr)

pos <- reviews %>% filter(label == "Positive")
neg <- reviews %>% filter(label == "Negative")

6. What are the most frequent words in the positive reviews?

getPlot(pos$text)

The words work, great, use, good, etc. are the most frequent words in the positive reviews.

7. What are the most frequent words in the negative reviews?

getPlot(neg$text)

The words work, use, one, get, etc. are the most frequent words in the negative reviews.

4 Annotations

[1] : https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers#

[2] : https://sigmoidal.io/recommender-systems-recommendation-engine/

[3] : https://research.aimultiple.com/recommendation-system/

[4] : Justifying recommendations using distantly-labeled reviews and fine-grained aspects. Jianmo Ni, Jiacheng Li, Julian McAuley. Empirical Methods in Natural Language Processing (EMNLP), 2019
