library(arules)
library(dplyr)
library(tidyr)
library(arulesViz)
library(tidyverse)
library(knitr)
For the Association Rule Mining project I will focus on movie recommendations. Based on the movies users have seen and the ratings they have given we will be able to determine what kind of movies they might enjoy to watch next. The dataset we are using is called MovieLens, and it’s one of the most popular and up-to-date databases on movies
movies <- read.csv("movies.csv")
ratings <- read.csv("ratings.csv")
show(str(movies))
## 'data.frame': 9742 obs. of 3 variables:
## $ movieId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ title : chr "Toy Story (1995)" "Jumanji (1995)" "Grumpier Old Men (1995)" "Waiting to Exhale (1995)" ...
## $ genres : chr "Adventure|Animation|Children|Comedy|Fantasy" "Adventure|Children|Fantasy" "Comedy|Romance" "Comedy|Drama|Romance" ...
## NULL
show(head(movies))
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
we will filter the movies to obtain a more modern dataset, to make the task easier for my computer. I will choose movies released after the year 2009
movies$year <- as.numeric(substr(movies$title, nchar(movies$title) - 4, nchar(movies$title) - 1))
movies %>% drop_na()
movies <- movies %>% filter(year > 2009)
movies$movieId <- as.factor(movies$movieId)
show(str(ratings))
## 'data.frame': 100836 obs. of 4 variables:
## $ userId : int 1 1 1 1 1 1 1 1 1 1 ...
## $ movieId : int 1 3 6 47 50 70 101 110 151 157 ...
## $ rating : num 4 4 4 5 5 3 5 4 5 5 ...
## $ timestamp: int 964982703 964981247 964982224 964983815 964982931 964982400 964980868 964982176 964984041 964984100 ...
## NULL
show(head(movies))
## movieId title genres year
## 1 73268 Daybreakers (2010) Action|Drama|Horror|Thriller 2010
## 2 73319 Leap Year (2010) Comedy|Romance 2010
## 3 73321 Book of Eli, The (2010) Action|Adventure|Drama 2010
## 4 73929 Legion (2010) Action|Fantasy|Horror|Thriller 2010
## 5 74154 When in Rome (2010) Comedy|Romance 2010
## 6 74324 Temple Grandin (2010) Drama 2010
We have timestamp variable in the ratings, but we won’t be analyzing time series here, so we don’t need it. Additionally, our goal is to suggest movies that an individual will enjoy for the most part. That’s why I will modify the dataset to show movies with an average rating above 3.49/5, so 7/10 and above.
ratings$timestamp <- NULL
avg_rating <- ratings %>% group_by(movieId) %>% summarise(avg_rating = round(mean(rating),2))
## `summarise()` ungrouping output (override with `.groups` argument)
ratings <- inner_join(ratings, avg_rating, by = "movieId")
ratings <- ratings %>% filter((avg_rating > 3.49) & (movieId %in% movies$movieId))
ratings$movieId <- as.factor(ratings$movieId)
now we will join the movies and the ratings dataset together to obtain the userid and all the titles seen by the users
full_data <-inner_join(movies, ratings, by = 'movieId')
full_data$title <- substr(full_data$title, 0, nchar(full_data$title)-7)
full_data$genres <- tolower(full_data$genres)
selecting specific genre. This will help us avoid situations in which antecedent movies recommend a consequent movie of completely different topicality. For now the only indicators we base our recommendations on are the ratings and the frequency. This means that you could get Schindler’s List recommended because you watched Shrek. We will focus on the animation genre.
print(unique(unlist(str_split(unique(full_data$genres), "\\|"))))
## [1] "drama" "mystery" "thriller"
## [4] "romance" "war" "action"
## [7] "sci-fi" "adventure" "animation"
## [10] "children" "fantasy" "imax"
## [13] "comedy" "documentary" "musical"
## [16] "crime" "horror" "film-noir"
## [19] "western" "(no genres listed)"
unique_genres <- unlist(unique(full_data$genres))
genre_matrix <- str_detect("animation", unique_genres)
genre_list <- unique_genres[genre_matrix]
animation_data <- full_data %>% filter(genres %in% genre_list)
now we split the data to display the ‘buckets’ of films particular users have seen, and then we use the arules package to transform our data into association rule mining format that will allow us to perform the apriori algorithm
id_views <- split(animation_data$title, animation_data$userId)
matrix <- as(id_views, "transactions")
#Item frequency allows us to view the support of each movie, in relative - percentage, and absolute way
itemFrequency(matrix, type="relative")[1:5]
## A Silent Voice
## 0.015037594
## Adventures of Tintin, The
## 0.090225564
## All-Star Superman
## 0.015037594
## Alpha and Omega 3: The Great Wolf Games
## 0.007518797
## Asterix: The Land of the Gods (Astérix: Le domaine des dieux)
## 0.015037594
itemFrequency(matrix, type="absolute")[1:5]
## A Silent Voice
## 2
## Adventures of Tintin, The
## 12
## All-Star Superman
## 2
## Alpha and Omega 3: The Great Wolf Games
## 1
## Asterix: The Land of the Gods (Astérix: Le domaine des dieux)
## 2
#here we can look at some sample bundles of movies seen by users
inspect(matrix[1:5])
## items transactionID
## [1] {Despicable Me,
## Despicable Me 2,
## Frozen,
## Tangled,
## Tangled Ever After} 10
## [2] {Inside Out,
## Sausage Party,
## Zootopia} 15
## [3] {Toy Story 3} 16
## [4] {Toy Story 3} 17
## [5] {Batman: The Dark Knight Returns, Part 1,
## Batman: The Dark Knight Returns, Part 2,
## Inside Out,
## Toy Story 3} 18
we can also look which movies were the most popular
itemFrequencyPlot(matrix, topN = 15)
for the apriori I am choosing rather high support and confidence, which narrows down the data, but it also provides more validity in the recommendations. The minlen and maxlen setting allows for choosing the size of the antecedent. I don’t want recommendations to be based off of a single movie, nor do I want bundles too large, because those can get pretty heavy for computing
movie_rules <- apriori(matrix, parameter = list(support = 0.05, confidence = 0.8, minlen = 3, maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.05 3
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 6
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[99 item(s), 133 transaction(s)] done [0.00s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(matrix, parameter = list(support = 0.05, confidence = 0.8, :
## Mining stopped (maxlen reached). Only patterns up to a length of 5 returned!
## done [0.00s].
## writing ... [499 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
#summary of the Apriori algorithm result
summary(movie_rules)
## set of 499 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 140 243 116
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.000 4.000 3.952 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05263 Min. :0.8000 Min. :0.05263 Min. :1.935
## 1st Qu.:0.05263 1st Qu.:0.8462 1st Qu.:0.06015 1st Qu.:2.418
## Median :0.06015 Median :0.8750 Median :0.06767 Median :2.838
## Mean :0.06229 Mean :0.8942 Mean :0.07035 Mean :2.978
## 3rd Qu.:0.06767 3rd Qu.:0.9643 3rd Qu.:0.07519 3rd Qu.:3.244
## Max. :0.14286 Max. :1.0000 Max. :0.17293 Max. :8.312
## count
## Min. : 7.000
## 1st Qu.: 7.000
## Median : 8.000
## Mean : 8.285
## 3rd Qu.: 9.000
## Max. :19.000
##
## mining info:
## data ntransactions support confidence
## matrix 133 0.05 0.8
#sample antecedent - consequent relationships, aka the movies recommendations
inspect(head(movie_rules, 5))
## lhs rhs support confidence coverage lift count
## [1] {Croods, The,
## Despicable Me 2} => {Despicable Me} 0.05263158 0.875 0.06015038 3.062500 7
## [2] {Croods, The,
## Despicable Me} => {Despicable Me 2} 0.05263158 1.000 0.05263158 7.000000 7
## [3] {Adventures of Tintin, The,
## Wreck-It Ralph} => {Big Hero 6} 0.05263158 1.000 0.05263158 3.243902 7
## [4] {Adventures of Tintin, The,
## Big Hero 6} => {Wreck-It Ralph} 0.05263158 0.875 0.06015038 5.289773 7
## [5] {Adventures of Tintin, The,
## Wreck-It Ralph} => {Inside Out} 0.05263158 1.000 0.05263158 3.093023 7
#we can inspect which bundles of movie antecedent - consequent relationships are the highest in terms of the
#lift support and count(frequency)
inspect(sort(movie_rules, by = "lift")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Despicable Me 2,
## Inside Out,
## Toy Story 3} => {Monsters University} 0.05263158 1.000 0.05263158 8.312500 7
## [2] {Despicable Me 2,
## Inside Out,
## The Lego Movie,
## Toy Story 3} => {Monsters University} 0.05263158 1.000 0.05263158 8.312500 7
## [3] {Despicable Me 2,
## Inside Out,
## The Lego Movie} => {Monsters University} 0.05263158 0.875 0.06015038 7.273438 7
## [4] {Big Hero 6,
## Despicable Me 2,
## Toy Story 3} => {Monsters University} 0.05263158 0.875 0.06015038 7.273438 7
## [5] {Croods, The,
## Despicable Me} => {Despicable Me 2} 0.05263158 1.000 0.05263158 7.000000 7
inspect(sort(movie_rules, by = "support")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {How to Train Your Dragon,
## Toy Story 3} => {Despicable Me} 0.1428571 0.8260870 0.1729323 2.891304 19
## [2] {Toy Story 3,
## Zootopia} => {Inside Out} 0.1203008 0.8888889 0.1353383 2.749354 16
## [3] {Big Hero 6,
## Inside Out} => {Toy Story 3} 0.1203008 0.8000000 0.1503759 1.934545 16
## [4] {Big Hero 6,
## Despicable Me} => {How to Train Your Dragon} 0.1203008 0.8000000 0.1503759 2.007547 16
## [5] {Despicable Me,
## The Lego Movie} => {How to Train Your Dragon} 0.1127820 0.8823529 0.1278195 2.214206 15
inspect(sort(movie_rules, by = "count")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {How to Train Your Dragon,
## Toy Story 3} => {Despicable Me} 0.1428571 0.8260870 0.1729323 2.891304 19
## [2] {Toy Story 3,
## Zootopia} => {Inside Out} 0.1203008 0.8888889 0.1353383 2.749354 16
## [3] {Big Hero 6,
## Inside Out} => {Toy Story 3} 0.1203008 0.8000000 0.1503759 1.934545 16
## [4] {Big Hero 6,
## Despicable Me} => {How to Train Your Dragon} 0.1203008 0.8000000 0.1503759 2.007547 16
## [5] {Despicable Me,
## The Lego Movie} => {How to Train Your Dragon} 0.1127820 0.8823529 0.1278195 2.214206 15
let’s explore a particular movie, zootopia. Here we set the rhs to zootopia, which means we will look at which movies will lead to recommending zootopia as your next movie
zootopia_rules <-apriori(data = matrix, parameter=list(supp=0.05,conf = 0.2, minlen = 3, maxlen = 5),
appearance=list(default="lhs", rhs="Zootopia"), control=list(verbose=F))
#sample recommendations
inspect(zootopia_rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {How to Train Your Dragon,
## Rise of the Guardians} => {Zootopia} 0.05263158 0.6363636 0.08270677 2.644886 7
## [2] {Finding Dory,
## The Lego Movie} => {Zootopia} 0.05263158 0.7777778 0.06766917 3.232639 7
## [3] {Big Hero 6,
## Finding Dory} => {Zootopia} 0.06015038 0.8888889 0.06766917 3.694444 8
## [4] {Finding Dory,
## Inside Out} => {Zootopia} 0.06015038 0.8888889 0.06766917 3.694444 8
## [5] {Despicable Me,
## Finding Dory} => {Zootopia} 0.06015038 0.8888889 0.06766917 3.694444 8
## [6] {Finding Dory,
## How to Train Your Dragon} => {Zootopia} 0.06766917 1.0000000 0.06766917 4.156250 9
## [7] {Monsters University,
## The Lego Movie} => {Zootopia} 0.05263158 0.7000000 0.07518797 2.909375 7
## [8] {How to Train Your Dragon 2,
## Inside Out} => {Zootopia} 0.05263158 0.8750000 0.06015038 3.636719 7
## [9] {Despicable Me,
## How to Train Your Dragon 2} => {Zootopia} 0.05263158 0.5384615 0.09774436 2.237981 7
## [10] {How to Train Your Dragon,
## How to Train Your Dragon 2} => {Zootopia} 0.05263158 0.5384615 0.09774436 2.237981 7
zootopia_rules_by_confidence <- sort(zootopia_rules, by="confidence", decreasing=TRUE)
inspect(head(zootopia_rules_by_confidence, 5))
## lhs rhs support confidence coverage lift count
## [1] {Finding Dory,
## How to Train Your Dragon} => {Zootopia} 0.06766917 1 0.06766917 4.15625 9
## [2] {Finding Dory,
## How to Train Your Dragon,
## The Lego Movie} => {Zootopia} 0.05263158 1 0.05263158 4.15625 7
## [3] {Big Hero 6,
## Finding Dory,
## How to Train Your Dragon} => {Zootopia} 0.05263158 1 0.05263158 4.15625 7
## [4] {Finding Dory,
## How to Train Your Dragon,
## Inside Out} => {Zootopia} 0.05263158 1 0.05263158 4.15625 7
## [5] {Despicable Me,
## Finding Dory,
## How to Train Your Dragon} => {Zootopia} 0.05263158 1 0.05263158 4.15625 7
we can also discover what kind of recommendations a particular bundle of movies will yield, let’s check out Zootpia along with Sausage Party and How to Train Your Dragon
multiple_rules <-apriori(data = matrix, parameter=list(supp=0.02,conf = 0.2, minlen = 3, maxlen = 5),
appearance=list(default="rhs",
lhs=list("Zootopia", "Sausage Party", "How to Train Your Dragon")),
control=list(verbose=F))
inspect(tail(multiple_rules))
## lhs rhs support confidence coverage lift count
## [1] {How to Train Your Dragon,
## Zootopia} => {Inside Out} 0.11278195 0.75 0.15037594 2.319767 15
## [2] {How to Train Your Dragon,
## Zootopia} => {Despicable Me} 0.09774436 0.65 0.15037594 2.275000 13
## [3] {How to Train Your Dragon,
## Zootopia} => {Toy Story 3} 0.09774436 0.65 0.15037594 1.571818 13
## [4] {How to Train Your Dragon,
## Sausage Party,
## Zootopia} => {Wreck-It Ralph} 0.02255639 1.00 0.02255639 6.045455 3
## [5] {How to Train Your Dragon,
## Sausage Party,
## Zootopia} => {The Lego Movie} 0.02255639 1.00 0.02255639 4.290323 3
## [6] {How to Train Your Dragon,
## Sausage Party,
## Zootopia} => {Big Hero 6} 0.02255639 1.00 0.02255639 3.243902 3
finally, we are also able to explore suggestions where a movie title occurs in the antecedent and the consequent, so how a particular movie affects the entire recommendation ecosystem
frozen_rules <- subset(movie_rules, items %in% "Frozen")
inspect(head(frozen_rules))
## lhs rhs
## [1] {Frozen,Monsters University} => {Toy Story 3}
## [2] {Despicable Me 2,Tangled} => {Frozen}
## [3] {Despicable Me 2,Frozen} => {Tangled}
## [4] {Despicable Me 2,Frozen} => {How to Train Your Dragon}
## [5] {Despicable Me 2,How to Train Your Dragon} => {Frozen}
## [6] {Frozen,Megamind} => {How to Train Your Dragon}
## support confidence coverage lift count
## [1] 0.06015038 1.0000000 0.06015038 2.418182 8
## [2] 0.06015038 0.8888889 0.06766917 4.076628 8
## [3] 0.06015038 0.8000000 0.07518797 4.433333 8
## [4] 0.06015038 0.8000000 0.07518797 2.007547 8
## [5] 0.06015038 0.8000000 0.07518797 3.668966 8
## [6] 0.05263158 1.0000000 0.05263158 2.509434 7
We can display the rules/recommendations in three dimensions. The redder and the higher and more to the right, the stronger the recommendation. We can display the data in a variety of ways.
plot(zootopia_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
if we set the method to ‘grouped’, we can see the significance of particular set of bundles and recommendation rules associated with them
plot(zootopia_rules, method="grouped")
graph method allows us to see a mapping of the recommendations with two dimensions:size and support and lift and the intensity of the color
plot(zootopia_rules, method="graph", max = 10)
## Warning: plot: Too many rules supplied. Only plotting the best 10 rules using
## 'support' (change control parameter max if needed)
paracoord allows us to see sort of a network of the recommendations, and how they follow up to the movie that we are inspecting
plot(zootopia_rules, method="paracoord", max = 10)