library(arules)
library(dplyr)
library(tidyr)
library(arulesViz)
library(tidyverse)
library(knitr)

Intro

For the Association Rule Mining project I will focus on movie recommendations. Based on the movies users have seen and the ratings they have given we will be able to determine what kind of movies they might enjoy to watch next. The dataset we are using is called MovieLens, and it’s one of the most popular and up-to-date databases on movies

movies <- read.csv("movies.csv")
ratings <- read.csv("ratings.csv")

show(str(movies))
## 'data.frame':    9742 obs. of  3 variables:
##  $ movieId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ title  : chr  "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
##  $ genres : chr  "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
## NULL
show(head(movies))
##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Data processing

we will filter the movies to obtain a more modern dataset, to make the task easier for my computer. I will choose movies released after the year 2009

movies$year <- as.numeric(substr(movies$title, nchar(movies$title) - 4, nchar(movies$title) - 1))
movies %>% drop_na()
movies <- movies %>% filter(year > 2009)
movies$movieId <- as.factor(movies$movieId)
show(str(ratings))
## 'data.frame':    100836 obs. of  4 variables:
##  $ userId   : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ movieId  : int  1 3 6 47 50 70 101 110 151 157 ...
##  $ rating   : num  4 4 4 5 5 3 5 4 5 5 ...
##  $ timestamp: int  964982703 964981247 964982224 964983815 964982931 964982400 964980868 964982176 964984041 964984100 ...
## NULL
show(head(movies))
##   movieId                   title                         genres year
## 1   73268      Daybreakers (2010)   Action|Drama|Horror|Thriller 2010
## 2   73319        Leap Year (2010)                 Comedy|Romance 2010
## 3   73321 Book of Eli, The (2010)         Action|Adventure|Drama 2010
## 4   73929           Legion (2010) Action|Fantasy|Horror|Thriller 2010
## 5   74154     When in Rome (2010)                 Comedy|Romance 2010
## 6   74324   Temple Grandin (2010)                          Drama 2010

We have timestamp variable in the ratings, but we won’t be analyzing time series here, so we don’t need it. Additionally, our goal is to suggest movies that an individual will enjoy for the most part. That’s why I will modify the dataset to show movies with an average rating above 3.49/5, so 7/10 and above.

ratings$timestamp <- NULL
avg_rating <- ratings %>% group_by(movieId) %>% summarise(avg_rating = round(mean(rating),2))
## `summarise()` ungrouping output (override with `.groups` argument)
ratings <- inner_join(ratings, avg_rating, by = "movieId")
ratings <- ratings %>% filter((avg_rating > 3.49) & (movieId %in% movies$movieId))
ratings$movieId <- as.factor(ratings$movieId)

now we will join the movies and the ratings dataset together to obtain the userid and all the titles seen by the users

full_data <-inner_join(movies, ratings, by = 'movieId')
full_data$title <- substr(full_data$title, 0, nchar(full_data$title)-7)
full_data$genres <- tolower(full_data$genres)

Narrowing our focus

selecting specific genre. This will help us avoid situations in which antecedent movies recommend a consequent movie of completely different topicality. For now the only indicators we base our recommendations on are the ratings and the frequency. This means that you could get Schindler’s List recommended because you watched Shrek. We will focus on the animation genre.

print(unique(unlist(str_split(unique(full_data$genres), "\\|"))))
##  [1] "drama"              "mystery"            "thriller"          
##  [4] "romance"            "war"                "action"            
##  [7] "sci-fi"             "adventure"          "animation"         
## [10] "children"           "fantasy"            "imax"              
## [13] "comedy"             "documentary"        "musical"           
## [16] "crime"              "horror"             "film-noir"         
## [19] "western"            "(no genres listed)"
unique_genres <- unlist(unique(full_data$genres))
genre_matrix <- str_detect("animation", unique_genres)
genre_list <- unique_genres[genre_matrix]
animation_data <- full_data %>% filter(genres %in% genre_list)

Transforming the data for the association rule mining.

now we split the data to display the ‘buckets’ of films particular users have seen, and then we use the arules package to transform our data into association rule mining format that will allow us to perform the apriori algorithm

id_views <- split(animation_data$title, animation_data$userId)
matrix <- as(id_views, "transactions")
#Item frequency allows us to view the support of each movie, in relative - percentage, and absolute way
itemFrequency(matrix, type="relative")[1:5]
##                                                 A Silent Voice 
##                                                    0.015037594 
##                                      Adventures of Tintin, The 
##                                                    0.090225564 
##                                              All-Star Superman 
##                                                    0.015037594 
##                        Alpha and Omega 3: The Great Wolf Games 
##                                                    0.007518797 
## Asterix: The Land of the Gods (Astérix: Le domaine des dieux) 
##                                                    0.015037594
itemFrequency(matrix, type="absolute")[1:5]
##                                                 A Silent Voice 
##                                                              2 
##                                      Adventures of Tintin, The 
##                                                             12 
##                                              All-Star Superman 
##                                                              2 
##                        Alpha and Omega 3: The Great Wolf Games 
##                                                              1 
## Asterix: The Land of the Gods (Astérix: Le domaine des dieux) 
##                                                              2
#here we can look at some sample bundles of movies seen by users
inspect(matrix[1:5])
##     items                                     transactionID
## [1] {Despicable Me,                                        
##      Despicable Me 2,                                      
##      Frozen,                                               
##      Tangled,                                              
##      Tangled Ever After}                                 10
## [2] {Inside Out,                                           
##      Sausage Party,                                        
##      Zootopia}                                           15
## [3] {Toy Story 3}                                        16
## [4] {Toy Story 3}                                        17
## [5] {Batman: The Dark Knight Returns, Part 1,              
##      Batman: The Dark Knight Returns, Part 2,              
##      Inside Out,                                           
##      Toy Story 3}                                        18

we can also look which movies were the most popular

itemFrequencyPlot(matrix, topN = 15)

for the apriori I am choosing rather high support and confidence, which narrows down the data, but it also provides more validity in the recommendations. The minlen and maxlen setting allows for choosing the size of the antecedent. I don’t want recommendations to be based off of a single movie, nor do I want bundles too large, because those can get pretty heavy for computing

movie_rules <- apriori(matrix, parameter = list(support = 0.05, confidence = 0.8, minlen = 3, maxlen = 5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.8    0.1    1 none FALSE            TRUE       5    0.05      3
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[99 item(s), 133 transaction(s)] done [0.00s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(matrix, parameter = list(support = 0.05, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 5 returned!
##  done [0.00s].
## writing ... [499 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
#summary of the Apriori algorithm result  
summary(movie_rules)
## set of 499 rules
## 
## rule length distribution (lhs + rhs):sizes
##   3   4   5 
## 140 243 116 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   3.000   4.000   3.952   4.000   5.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.05263   Min.   :0.8000   Min.   :0.05263   Min.   :1.935  
##  1st Qu.:0.05263   1st Qu.:0.8462   1st Qu.:0.06015   1st Qu.:2.418  
##  Median :0.06015   Median :0.8750   Median :0.06767   Median :2.838  
##  Mean   :0.06229   Mean   :0.8942   Mean   :0.07035   Mean   :2.978  
##  3rd Qu.:0.06767   3rd Qu.:0.9643   3rd Qu.:0.07519   3rd Qu.:3.244  
##  Max.   :0.14286   Max.   :1.0000   Max.   :0.17293   Max.   :8.312  
##      count       
##  Min.   : 7.000  
##  1st Qu.: 7.000  
##  Median : 8.000  
##  Mean   : 8.285  
##  3rd Qu.: 9.000  
##  Max.   :19.000  
## 
## mining info:
##    data ntransactions support confidence
##  matrix           133    0.05        0.8
#sample antecedent - consequent relationships, aka the movies recommendations
inspect(head(movie_rules, 5))
##     lhs                            rhs                  support confidence   coverage     lift count
## [1] {Croods, The,                                                                                   
##      Despicable Me 2}           => {Despicable Me}   0.05263158      0.875 0.06015038 3.062500     7
## [2] {Croods, The,                                                                                   
##      Despicable Me}             => {Despicable Me 2} 0.05263158      1.000 0.05263158 7.000000     7
## [3] {Adventures of Tintin, The,                                                                     
##      Wreck-It Ralph}            => {Big Hero 6}      0.05263158      1.000 0.05263158 3.243902     7
## [4] {Adventures of Tintin, The,                                                                     
##      Big Hero 6}                => {Wreck-It Ralph}  0.05263158      0.875 0.06015038 5.289773     7
## [5] {Adventures of Tintin, The,                                                                     
##      Wreck-It Ralph}            => {Inside Out}      0.05263158      1.000 0.05263158 3.093023     7
#we can inspect which bundles of movie antecedent - consequent relationships are the highest in terms of the
#lift support and count(frequency)
inspect(sort(movie_rules, by = "lift")[1:5])
##     lhs                  rhs                      support confidence   coverage     lift count
## [1] {Despicable Me 2,                                                                         
##      Inside Out,                                                                              
##      Toy Story 3}     => {Monsters University} 0.05263158      1.000 0.05263158 8.312500     7
## [2] {Despicable Me 2,                                                                         
##      Inside Out,                                                                              
##      The Lego Movie,                                                                          
##      Toy Story 3}     => {Monsters University} 0.05263158      1.000 0.05263158 8.312500     7
## [3] {Despicable Me 2,                                                                         
##      Inside Out,                                                                              
##      The Lego Movie}  => {Monsters University} 0.05263158      0.875 0.06015038 7.273438     7
## [4] {Big Hero 6,                                                                              
##      Despicable Me 2,                                                                         
##      Toy Story 3}     => {Monsters University} 0.05263158      0.875 0.06015038 7.273438     7
## [5] {Croods, The,                                                                             
##      Despicable Me}   => {Despicable Me 2}     0.05263158      1.000 0.05263158 7.000000     7
inspect(sort(movie_rules, by = "support")[1:5])
##     lhs                           rhs                          support confidence  coverage     lift count
## [1] {How to Train Your Dragon,                                                                            
##      Toy Story 3}              => {Despicable Me}            0.1428571  0.8260870 0.1729323 2.891304    19
## [2] {Toy Story 3,                                                                                         
##      Zootopia}                 => {Inside Out}               0.1203008  0.8888889 0.1353383 2.749354    16
## [3] {Big Hero 6,                                                                                          
##      Inside Out}               => {Toy Story 3}              0.1203008  0.8000000 0.1503759 1.934545    16
## [4] {Big Hero 6,                                                                                          
##      Despicable Me}            => {How to Train Your Dragon} 0.1203008  0.8000000 0.1503759 2.007547    16
## [5] {Despicable Me,                                                                                       
##      The Lego Movie}           => {How to Train Your Dragon} 0.1127820  0.8823529 0.1278195 2.214206    15
inspect(sort(movie_rules, by = "count")[1:5])
##     lhs                           rhs                          support confidence  coverage     lift count
## [1] {How to Train Your Dragon,                                                                            
##      Toy Story 3}              => {Despicable Me}            0.1428571  0.8260870 0.1729323 2.891304    19
## [2] {Toy Story 3,                                                                                         
##      Zootopia}                 => {Inside Out}               0.1203008  0.8888889 0.1353383 2.749354    16
## [3] {Big Hero 6,                                                                                          
##      Inside Out}               => {Toy Story 3}              0.1203008  0.8000000 0.1503759 1.934545    16
## [4] {Big Hero 6,                                                                                          
##      Despicable Me}            => {How to Train Your Dragon} 0.1203008  0.8000000 0.1503759 2.007547    16
## [5] {Despicable Me,                                                                                       
##      The Lego Movie}           => {How to Train Your Dragon} 0.1127820  0.8823529 0.1278195 2.214206    15

let’s explore a particular movie, zootopia. Here we set the rhs to zootopia, which means we will look at which movies will lead to recommending zootopia as your next movie

zootopia_rules <-apriori(data = matrix, parameter=list(supp=0.05,conf = 0.2, minlen = 3, maxlen = 5), 
appearance=list(default="lhs", rhs="Zootopia"), control=list(verbose=F))

#sample recommendations
inspect(zootopia_rules[1:10])
##      lhs                             rhs           support confidence   coverage     lift count
## [1]  {How to Train Your Dragon,                                                                
##       Rise of the Guardians}      => {Zootopia} 0.05263158  0.6363636 0.08270677 2.644886     7
## [2]  {Finding Dory,                                                                            
##       The Lego Movie}             => {Zootopia} 0.05263158  0.7777778 0.06766917 3.232639     7
## [3]  {Big Hero 6,                                                                              
##       Finding Dory}               => {Zootopia} 0.06015038  0.8888889 0.06766917 3.694444     8
## [4]  {Finding Dory,                                                                            
##       Inside Out}                 => {Zootopia} 0.06015038  0.8888889 0.06766917 3.694444     8
## [5]  {Despicable Me,                                                                           
##       Finding Dory}               => {Zootopia} 0.06015038  0.8888889 0.06766917 3.694444     8
## [6]  {Finding Dory,                                                                            
##       How to Train Your Dragon}   => {Zootopia} 0.06766917  1.0000000 0.06766917 4.156250     9
## [7]  {Monsters University,                                                                     
##       The Lego Movie}             => {Zootopia} 0.05263158  0.7000000 0.07518797 2.909375     7
## [8]  {How to Train Your Dragon 2,                                                              
##       Inside Out}                 => {Zootopia} 0.05263158  0.8750000 0.06015038 3.636719     7
## [9]  {Despicable Me,                                                                           
##       How to Train Your Dragon 2} => {Zootopia} 0.05263158  0.5384615 0.09774436 2.237981     7
## [10] {How to Train Your Dragon,                                                                
##       How to Train Your Dragon 2} => {Zootopia} 0.05263158  0.5384615 0.09774436 2.237981     7
zootopia_rules_by_confidence <- sort(zootopia_rules, by="confidence", decreasing=TRUE)
inspect(head(zootopia_rules_by_confidence, 5))
##     lhs                           rhs           support confidence   coverage    lift count
## [1] {Finding Dory,                                                                         
##      How to Train Your Dragon} => {Zootopia} 0.06766917          1 0.06766917 4.15625     9
## [2] {Finding Dory,                                                                         
##      How to Train Your Dragon,                                                             
##      The Lego Movie}           => {Zootopia} 0.05263158          1 0.05263158 4.15625     7
## [3] {Big Hero 6,                                                                           
##      Finding Dory,                                                                         
##      How to Train Your Dragon} => {Zootopia} 0.05263158          1 0.05263158 4.15625     7
## [4] {Finding Dory,                                                                         
##      How to Train Your Dragon,                                                             
##      Inside Out}               => {Zootopia} 0.05263158          1 0.05263158 4.15625     7
## [5] {Despicable Me,                                                                        
##      Finding Dory,                                                                         
##      How to Train Your Dragon} => {Zootopia} 0.05263158          1 0.05263158 4.15625     7

we can also discover what kind of recommendations a particular bundle of movies will yield, let’s check out Zootpia along with Sausage Party and How to Train Your Dragon

multiple_rules <-apriori(data = matrix, parameter=list(supp=0.02,conf = 0.2, minlen = 3, maxlen = 5), 
appearance=list(default="rhs", 
                lhs=list("Zootopia", "Sausage Party", "How to Train Your Dragon")), 
                  control=list(verbose=F))

inspect(tail(multiple_rules))
##     lhs                           rhs                 support confidence   coverage     lift count
## [1] {How to Train Your Dragon,                                                                    
##      Zootopia}                 => {Inside Out}     0.11278195       0.75 0.15037594 2.319767    15
## [2] {How to Train Your Dragon,                                                                    
##      Zootopia}                 => {Despicable Me}  0.09774436       0.65 0.15037594 2.275000    13
## [3] {How to Train Your Dragon,                                                                    
##      Zootopia}                 => {Toy Story 3}    0.09774436       0.65 0.15037594 1.571818    13
## [4] {How to Train Your Dragon,                                                                    
##      Sausage Party,                                                                               
##      Zootopia}                 => {Wreck-It Ralph} 0.02255639       1.00 0.02255639 6.045455     3
## [5] {How to Train Your Dragon,                                                                    
##      Sausage Party,                                                                               
##      Zootopia}                 => {The Lego Movie} 0.02255639       1.00 0.02255639 4.290323     3
## [6] {How to Train Your Dragon,                                                                    
##      Sausage Party,                                                                               
##      Zootopia}                 => {Big Hero 6}     0.02255639       1.00 0.02255639 3.243902     3

finally, we are also able to explore suggestions where a movie title occurs in the antecedent and the consequent, so how a particular movie affects the entire recommendation ecosystem

frozen_rules <- subset(movie_rules, items %in% "Frozen")
inspect(head(frozen_rules))
##     lhs                                           rhs                       
## [1] {Frozen,Monsters University}               => {Toy Story 3}             
## [2] {Despicable Me 2,Tangled}                  => {Frozen}                  
## [3] {Despicable Me 2,Frozen}                   => {Tangled}                 
## [4] {Despicable Me 2,Frozen}                   => {How to Train Your Dragon}
## [5] {Despicable Me 2,How to Train Your Dragon} => {Frozen}                  
## [6] {Frozen,Megamind}                          => {How to Train Your Dragon}
##     support    confidence coverage   lift     count
## [1] 0.06015038 1.0000000  0.06015038 2.418182 8    
## [2] 0.06015038 0.8888889  0.06766917 4.076628 8    
## [3] 0.06015038 0.8000000  0.07518797 4.433333 8    
## [4] 0.06015038 0.8000000  0.07518797 2.007547 8    
## [5] 0.06015038 0.8000000  0.07518797 3.668966 8    
## [6] 0.05263158 1.0000000  0.05263158 2.509434 7

Visualization

We can display the rules/recommendations in three dimensions. The redder and the higher and more to the right, the stronger the recommendation. We can display the data in a variety of ways.

plot(zootopia_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

if we set the method to ‘grouped’, we can see the significance of particular set of bundles and recommendation rules associated with them

plot(zootopia_rules, method="grouped")

graph method allows us to see a mapping of the recommendations with two dimensions:size and support and lift and the intensity of the color

plot(zootopia_rules, method="graph", max = 10)
## Warning: plot: Too many rules supplied. Only plotting the best 10 rules using
## 'support' (change control parameter max if needed)

paracoord allows us to see sort of a network of the recommendations, and how they follow up to the movie that we are inspecting

plot(zootopia_rules, method="paracoord", max = 10)

thank you.