Introduction

Market basket analysis is used to determine when objects occur frequently together. The aim of this paper is to discover when objects occur simultaneously, in this case, when two movies from MovieLens web site are recommended by the same reviewers. The data for affinity analysis is often described in the form of a transaction, like a transaction at a store, determining this way when objects (movies) are purchased together

The algorithm used in this paper is going to be the apriori algorithm of association rules to find relationships between the movies viewed by a set of users.

Review of the Dataset

The dataset chosen is the small version published by the GroupLens research lab in the Department of Computer Science and Engineering at the University of Minnesota.

  • Small dataset: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

Preparing the data

#Libraries needed for the whole paper
library(arules)
library(dplyr)
library(dbplyr)
library(ggplot2)
library(stringr)
library(arulesViz)

Loading and showing the header of the data with the support level in a new column:

#Data comes from 2 different files: a list of movies with their id and genres classification plus a rating list of the movies
movies <- read.csv("movies.csv", colClasses = c("integer","character","character"), sep = ",", stringsAsFactors = FALSE)
ratings <- read.csv("ratings.csv", colClasses = c("integer","integer","numeric","character"), sep=",", stringsAsFactors = FALSE)
#Merging together both files and adding the support column
users <- length(unique(ratings$userId))
numberofmovies <- length(unique(movies$movieId))

support <- ratings %>%  group_by(movieId) %>%  summarise(q_movie = n_distinct(userId), support = q_movie/users)
support <- merge(support, movies, by = "movieId", all.x = T)

head(support)
##   movieId q_movie    support                              title
## 1       1     215 0.35245902                   Toy Story (1995)
## 2       2     110 0.18032787                     Jumanji (1995)
## 3       3      52 0.08524590            Grumpier Old Men (1995)
## 4       4       7 0.01147541           Waiting to Exhale (1995)
## 5       5      49 0.08032787 Father of the Bride Part II (1995)
## 6       6     102 0.16721311                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Creating the transaction matrix:

user_item_matrix <- as(split(ratings[,"movieId"], ratings[,"userId"]), "transactions")
user_item_matrix
## transactions in sparse format with
##  610 transactions (rows) and
##  9724 items (columns)
trans <- size(user_item_matrix)
summary(trans)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    20.0    35.0    70.5   165.3   168.0  2698.0

The database have 610 users and 9742 movies, and we define Support of a movie as the number of transactions containing the movie divided by the total number of transactions. To be clearer, a transaction is all the movies watched by each user.

On average each user has watched 165 movies, with a minimum of 20 and a maximum of 2698.

Top 10 movies ordered by their support level:

top_n(support, 10, q_movie) %>% arrange(desc(q_movie))
##    movieId q_movie   support                                     title
## 1      356     329 0.5393443                       Forrest Gump (1994)
## 2      318     317 0.5196721          Shawshank Redemption, The (1994)
## 3      296     307 0.5032787                       Pulp Fiction (1994)
## 4      593     279 0.4573770          Silence of the Lambs, The (1991)
## 5     2571     278 0.4557377                        Matrix, The (1999)
## 6      260     251 0.4114754 Star Wars: Episode IV - A New Hope (1977)
## 7      480     238 0.3901639                      Jurassic Park (1993)
## 8      110     237 0.3885246                         Braveheart (1995)
## 9      589     224 0.3672131         Terminator 2: Judgment Day (1991)
## 10     527     220 0.3606557                   Schindler's List (1993)
##                              genres
## 1          Comedy|Drama|Romance|War
## 2                       Crime|Drama
## 3       Comedy|Crime|Drama|Thriller
## 4             Crime|Horror|Thriller
## 5            Action|Sci-Fi|Thriller
## 6           Action|Adventure|Sci-Fi
## 7  Action|Adventure|Sci-Fi|Thriller
## 8                  Action|Drama|War
## 9                     Action|Sci-Fi
## 10                        Drama|War

The movie Forrest Gump was the most watched movie with a total of 329 distinct users out of a total of 610, its support is 0.54, and in second place is The Shawshank Redemption with 317 users.

Apriori algorithm

The next graph shows the distribution of the size of the transactions. It is concentrated in low values, with half of the users having rated up to 70.5 films.

data.frame(trans) %>%
  ggplot(aes(x = trans)) +
  geom_histogram() +
  labs(title = "Size of the transactions",
       x = "Amount of movies") +
  theme_bw()

Parameters

  • support: minimum support an itemset must have to be considered frequent.
  • minlen: minimum number of items that an itemset must have to be included in the results.
  • maxlen: maximum number of items that an itemset can have to be included in the results.
  • target: type of results to be generated by the algorithm, “frequent itemsets”, “maximally frequent itemsets”, “closed frequent itemsets”, “rules” or “hyperedgesets”.
  • confidence: minimum confidence that a rule must have to be included in the results.
  • maxtime: maximum time the algorithm can search for subsets.

To find significant support we define transactions with at least 71 items (Median).

parameters = list(
  supp = 71 / dim(user_item_matrix)[1],
  conf = 0.7,
  maxlen = 2)

rules <- apriori(user_item_matrix, parameter = parameters)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime   support minlen
##         0.7    0.1    1 none FALSE            TRUE       5 0.1163934      1
##  maxlen target  ext
##       2  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 71 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9724 item(s), 610 transaction(s)] done [0.03s].
## sorting and recoding items ... [252 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(user_item_matrix, parameter = parameters): Mining stopped
## (maxlen reached). Only patterns up to a length of 2 returned!
##  done [0.00s].
## writing ... [1016 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules_table = inspect(head(rules), linebreak = FALSE)
##     lhs      rhs   support   confidence coverage  lift     count
## [1] {300} => {296} 0.1163934 0.8765432  0.1327869 1.741666 71   
## [2] {292} => {165} 0.1163934 0.7029703  0.1655738 2.977860 71   
## [3] {292} => {316} 0.1196721 0.7227723  0.1655738 3.149222 73   
## [4] {292} => {150} 0.1295082 0.7821782  0.1655738 2.373775 79   
## [5] {292} => {380} 0.1344262 0.8118812  0.1655738 2.782289 82   
## [6] {292} => {457} 0.1360656 0.8217822  0.1655738 2.638353 83

Discussion of the model

Summary of the rules found:

summary(rules)
## set of 1016 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2 
## 1016 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2       2       2       2       2       2 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1164   Min.   :0.7000   Min.   :0.1279   Min.   :1.298  
##  1st Qu.:0.1279   1st Qu.:0.7264   1st Qu.:0.1672   1st Qu.:1.535  
##  Median :0.1426   Median :0.7604   Median :0.1885   Median :1.907  
##  Mean   :0.1567   Mean   :0.7688   Mean   :0.2046   Mean   :2.029  
##  3rd Qu.:0.1689   3rd Qu.:0.7993   3rd Qu.:0.2213   3rd Qu.:2.386  
##  Max.   :0.3787   Max.   :0.9690   Max.   :0.5393   Max.   :4.887  
##      count       
##  Min.   : 71.00  
##  1st Qu.: 78.00  
##  Median : 87.00  
##  Mean   : 95.58  
##  3rd Qu.:103.00  
##  Max.   :231.00  
## 
## mining info:
##              data ntransactions   support confidence
##  user_item_matrix           610 0.1163934        0.7

At least 1016 significant rules have been found with a mean value of support of 15.7% and maximum of 37.9%, the confidence mean value is 76.88%. Also the mean value of the lift level is around 2, so, on average, the recommendations found are well associated.

As result we also obtain some levels of confidence and lift:

  • Confidence is the percentage in which the recommended movie is also satisfied upon the particular antecedent, i.e., the proportion of transactions where the title.x results in the presence of title.y, it can be understood also as a measure of centainty or trustworthiness associated with each discovered pattern.

  • The lift level is a measure of control for the support(frequency) of the recommended movies while calculating the conditional probability of occurrence of title.y given title.x. A value of lift greater than 1 vouches for positive associations between the title.y and title.x, lift around 1 implies independent titles, and a value of lift smaller than 1 implies negative associations.

Relation between support and confidence for each rule:

plot(rules, engine="plotly", max=1016)

We can see that the higher the support the lower the lift level, and that the higher levels of lift are around the mean value of confidence 77%.

Relation between support and the lift:

plot(rules, engine="plotly", measure = c("support", "lift"), shading = "confidence", max=1016)

The graph shows us how most of the movies have a support lower than the 20% and lift levels between 2.5 and 1.3.

We change the format of the database to see the name and genre of the movies:

associations <- as(rules,"data.frame")

rules = sapply(associations$rules,function(x){
  x = gsub("[\\{\\}]", "", regmatches(x, gregexpr("\\{.*\\}", x))[[1]])
  x = gsub("=>",",",x)
  x = str_replace_all(x," ","")
  return( x )
})

rules <- as.character(rules)
rules <- str_split(rules,",")

associations$lhs_movie <- sapply(rules, "[[", 1)
associations$rhs_movie <- sapply(rules, "[[", 2)

associations$lhs_movie = as.numeric(associations$lhs_movie)
associations$rhs_movie = as.numeric(associations$rhs_movie)

all_genres <- unique(unlist(str_split(movies$genres,"\\|")))

for(genre in all_genres){
  movies[genre] = 
    ifelse((str_detect(movies$genres,genre) | 
              str_detect(movies$genres,"no genres")) , 1 , 0)
}

movies$genres <- NULL

associations <- associations %>% 
  left_join(movies[, c(1,2)], by=c("lhs_movie" = "movieId"))

associations <- associations %>% 
  left_join(movies[, c(1,2)], by=c("rhs_movie" = "movieId"))

head(movies)
##   movieId                              title Adventure Animation Children
## 1       1                   Toy Story (1995)         1         1        1
## 2       2                     Jumanji (1995)         1         0        1
## 3       3            Grumpier Old Men (1995)         0         0        0
## 4       4           Waiting to Exhale (1995)         0         0        0
## 5       5 Father of the Bride Part II (1995)         0         0        0
## 6       6                        Heat (1995)         0         0        0
##   Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi War
## 1      1       1       0     0      0     0        0      0       0      0   0
## 2      0       1       0     0      0     0        0      0       0      0   0
## 3      1       0       1     0      0     0        0      0       0      0   0
## 4      1       0       1     1      0     0        0      0       0      0   0
## 5      1       0       0     0      0     0        0      0       0      0   0
## 6      0       0       0     0      1     1        1      0       0      0   0
##   Musical Documentary IMAX Western Film-Noir (no genres listed)
## 1       0           0    0       0         0                  0
## 2       0           0    0       0         0                  0
## 3       0           0    0       0         0                  0
## 4       0           0    0       0         0                  0
## 5       0           0    0       0         0                  0
## 6       0           0    0       0         0                  0

Rules that have a higher lift:

associations %>% 
  arrange(desc(lift)) %>% 
  select(title.x, title.y, support, confidence, lift, count) %>% 
  head()
##                                                                                          title.x
## 1                                                Harry Potter and the Prisoner of Azkaban (2004)
## 2                                                 Harry Potter and the Chamber of Secrets (2002)
## 3                                                Harry Potter and the Prisoner of Azkaban (2004)
## 4                                                 Harry Potter and the Chamber of Secrets (2002)
## 5 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 6                                                                       Kill Bill: Vol. 2 (2004)
##                                                                                          title.y
## 1                                                 Harry Potter and the Chamber of Secrets (2002)
## 2                                                Harry Potter and the Prisoner of Azkaban (2004)
## 3 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 4 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 5                                                 Harry Potter and the Chamber of Secrets (2002)
## 6                                                                       Kill Bill: Vol. 1 (2003)
##     support confidence     lift count
## 1 0.1245902  0.8172043 4.887202    76
## 2 0.1245902  0.7450980 4.887202    76
## 3 0.1213115  0.7956989 4.536228    74
## 4 0.1327869  0.7941176 4.527213    81
## 5 0.1327869  0.7570093 4.527213    81
## 6 0.1688525  0.9363636 4.360167   103

As we can see the first rules show us movies form the series of Harry Potter recommended for another movie from the same series. To discard these cases of sequels, sagas or series, we eliminate the rules in which the first word of the title.x is equal to the first word of the title.y.

Now we look again at the first ten rules according to the lift.

assoc_2 <- associations %>% 
  
  filter(substr(title.x, 1, str_locate(title.x, ' ')-1) != substr(title.y, 1, str_locate(title.y, ' ')-1)) %>%
  
  filter(substr(title.x, 1, str_locate(title.x, "[[:punct:]]")-1) != substr(title.y, 1, str_locate(title.y, " ")-1)) %>%
  
  filter(substr(title.x, 1, str_locate(title.x, " ")-1) != substr(title.y, 1, str_locate(title.y, "[[:punct:]]")-1)) %>%
  
  arrange(desc(lift))

head(assoc_2 %>% 
       select(title.x, title.y, support, confidence, lift),10)
##                                                title.x
## 1                                        WALL·E (2008)
## 2                                  Crimson Tide (1995)
## 3                                   Cliffhanger (1993)
## 4  Star Wars: Episode II - Attack of the Clones (2002)
## 5          Indiana Jones and the Temple of Doom (1984)
## 6                           Catch Me If You Can (2002)
## 7                                      Iron Man (2008)
## 8                                       Shrek 2 (2004)
## 9                         2001: A Space Odyssey (1968)
## 10                              Minority Report (2002)
##                                       title.y   support confidence     lift
## 1                                   Up (2009) 0.1196721  0.7019231 4.077839
## 2             Clear and Present Danger (1994) 0.1213115  0.7184466 3.984113
## 3             Clear and Present Danger (1994) 0.1180328  0.7128713 3.953195
## 4                           Spider-Man (2002) 0.1180328  0.7826087 3.913043
## 5  Ghostbusters (a.k.a. Ghost Busters) (1984) 0.1278689  0.7222222 3.671296
## 6                       Ocean's Eleven (2001) 0.1344262  0.7130435 3.655097
## 7                     Dark Knight, The (2008) 0.1344262  0.8723404 3.571327
## 8                       Monsters, Inc. (2001) 0.1163934  0.7717391 3.566370
## 9                         Blade Runner (1982) 0.1278689  0.7155963 3.520272
## 10                          Spider-Man (2002) 0.1377049  0.7000000 3.500000

The rule with the highest lift associates the movies “WALL·E” with “Up” with a support of 12%, a confidence of 70.2% and a lift of 4.1, in second place is the association between “Crimson Tide” and “Clear and Present Danger”.

The following graph shows the top 30 rules according to the lift value:

assoc_lift <- assoc_2[1:30, ]

assoc_lift$title.x <- substr(assoc_lift$title.x, 1, str_locate(assoc_lift$title.x,"\\(")-1)

assoc_lift$title.y <- substr(assoc_lift$title.y, 1, str_locate(assoc_lift$title.y,"\\(")-1)

assoc_lift$title.x <- trimws(assoc_lift$title.x, "both")
assoc_lift$title.y <- trimws(assoc_lift$title.y, "both")

ggplot(assoc_lift, aes(x = title.x, y = title.y, fill = lift)) +
  geom_tile() + 
  labs(title = "Association rules", subtitle = "Top 30 rules according with the lift")+
  theme(axis.text.x=element_text(angle = -45, hjust = 0))

Top 30 rules according to the value of confidence:

assoc_conf <- assoc_2 %>% arrange(desc(confidence))

assoc_conf <- assoc_conf[1:30, ]

assoc_conf$title.x <- substr(assoc_conf$title.x, 1, str_locate(assoc_conf$title.x,"\\(")-1)

assoc_conf$title.y <- substr(assoc_conf$title.y, 1, str_locate(assoc_conf$title.y,"\\(")-1)

assoc_conf$title.x <- trimws(assoc_conf$title.x, "both")
assoc_conf$title.y <- trimws(assoc_conf$title.y, "both")

ggplot(assoc_conf, aes(x = title.x, y = title.y, fill = confidence)) +
  geom_tile() + 
  labs(title = "Association rules", subtitle = "Top 30 rules according with the confidence")+
  theme(axis.text.x=element_text(angle = -45, hjust = 0))

Example as conclusion

Now we could try to check the recommendations of only one movie and focus on how the algorithm works for that movie. Let’s take as example the movie ‘Blade Runner’:

bladerunner1 <- assoc_2 %>% filter(grepl("Blade Runner",title.x)) %>%  arrange(desc(lift))
head(bladerunner1 %>%  select(title.x, title.y, support, confidence, lift, count), 10)
##               title.x
## 1 Blade Runner (1982)
## 2 Blade Runner (1982)
## 3 Blade Runner (1982)
## 4 Blade Runner (1982)
## 5 Blade Runner (1982)
## 6 Blade Runner (1982)
## 7 Blade Runner (1982)
##                                                                          title.y
## 1                          Star Wars: Episode V - The Empire Strikes Back (1980)
## 2 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
## 3                                      Star Wars: Episode IV - A New Hope (1977)
## 4                                                             Matrix, The (1999)
## 5                                               Silence of the Lambs, The (1991)
## 6                                                            Pulp Fiction (1994)
## 7                                                            Forrest Gump (1994)
##     support confidence     lift count
## 1 0.1573770  0.7741935 2.238190    96
## 2 0.1426230  0.7016129 2.139919    87
## 3 0.1573770  0.7741935 1.881506    96
## 4 0.1688525  0.8306452 1.822639   103
## 5 0.1475410  0.7258065 1.586889    90
## 6 0.1508197  0.7419355 1.474204    92
## 7 0.1573770  0.7741935 1.435435    96
bladerunner2 <- assoc_2 %>%  filter(grepl("Blade Runner",title.y)) %>%  arrange(desc(lift))
head(bladerunner2 %>%  select(title.x, title.y, support, confidence, lift, count), 10)
##                        title.x             title.y   support confidence
## 1 2001: A Space Odyssey (1968) Blade Runner (1982) 0.1278689  0.7155963
##       lift count
## 1 3.520272    78

As we can see if the movie watched is Blade Runner, the algorithm will recommend another 7 possible movies being ‘Star Wars: Episode V’ the one with higher lift level 2.24. On the other hand, Blade Runner is not one of the most recommended movies since it’s only a recommendation for the movie ‘2001: A Space Odyssey’, despite of that it is recommended with a very high lift level of 3.52, so they are very well associated.

Sources

  1. Jacek Lewkowicz. Unsupervised Learning - Master in Data Science and Business Analytics - University of Warsaw. Lecture slides
  2. https://rpubs.com/airam/usl_p3
  3. https://rpubs.com/honkalimonka/UL3
  4. Data - grouplens.org: https://grouplens.org/datasets/movielens/latest/ F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872