Market basket analysis is used to determine when objects occur frequently together. The aim of this paper is to discover when objects occur simultaneously, in this case, when two movies from MovieLens web site are recommended by the same reviewers. The data for affinity analysis is often described in the form of a transaction, like a transaction at a store, determining this way when objects (movies) are purchased together
The algorithm used in this paper is going to be the apriori algorithm of association rules to find relationships between the movies viewed by a set of users.
The dataset chosen is the small version published by the GroupLens research lab in the Department of Computer Science and Engineering at the University of Minnesota.
Preparing the data
#Libraries needed for the whole paper
library(arules)
library(dplyr)
library(dbplyr)
library(ggplot2)
library(stringr)
library(arulesViz)
Loading and showing the header of the data with the support level in a new column:
#Data comes from 2 different files: a list of movies with their id and genres classification plus a rating list of the movies
movies <- read.csv("movies.csv", colClasses = c("integer","character","character"), sep = ",", stringsAsFactors = FALSE)
ratings <- read.csv("ratings.csv", colClasses = c("integer","integer","numeric","character"), sep=",", stringsAsFactors = FALSE)
#Merging together both files and adding the support column
users <- length(unique(ratings$userId))
numberofmovies <- length(unique(movies$movieId))
support <- ratings %>% group_by(movieId) %>% summarise(q_movie = n_distinct(userId), support = q_movie/users)
support <- merge(support, movies, by = "movieId", all.x = T)
head(support)
## movieId q_movie support title
## 1 1 215 0.35245902 Toy Story (1995)
## 2 2 110 0.18032787 Jumanji (1995)
## 3 3 52 0.08524590 Grumpier Old Men (1995)
## 4 4 7 0.01147541 Waiting to Exhale (1995)
## 5 5 49 0.08032787 Father of the Bride Part II (1995)
## 6 6 102 0.16721311 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
Creating the transaction matrix:
user_item_matrix <- as(split(ratings[,"movieId"], ratings[,"userId"]), "transactions")
user_item_matrix
## transactions in sparse format with
## 610 transactions (rows) and
## 9724 items (columns)
trans <- size(user_item_matrix)
summary(trans)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 35.0 70.5 165.3 168.0 2698.0
The database have 610 users and 9742 movies, and we define Support of a movie as the number of transactions containing the movie divided by the total number of transactions. To be clearer, a transaction is all the movies watched by each user.
On average each user has watched 165 movies, with a minimum of 20 and a maximum of 2698.
Top 10 movies ordered by their support level:
top_n(support, 10, q_movie) %>% arrange(desc(q_movie))
## movieId q_movie support title
## 1 356 329 0.5393443 Forrest Gump (1994)
## 2 318 317 0.5196721 Shawshank Redemption, The (1994)
## 3 296 307 0.5032787 Pulp Fiction (1994)
## 4 593 279 0.4573770 Silence of the Lambs, The (1991)
## 5 2571 278 0.4557377 Matrix, The (1999)
## 6 260 251 0.4114754 Star Wars: Episode IV - A New Hope (1977)
## 7 480 238 0.3901639 Jurassic Park (1993)
## 8 110 237 0.3885246 Braveheart (1995)
## 9 589 224 0.3672131 Terminator 2: Judgment Day (1991)
## 10 527 220 0.3606557 Schindler's List (1993)
## genres
## 1 Comedy|Drama|Romance|War
## 2 Crime|Drama
## 3 Comedy|Crime|Drama|Thriller
## 4 Crime|Horror|Thriller
## 5 Action|Sci-Fi|Thriller
## 6 Action|Adventure|Sci-Fi
## 7 Action|Adventure|Sci-Fi|Thriller
## 8 Action|Drama|War
## 9 Action|Sci-Fi
## 10 Drama|War
The movie Forrest Gump was the most watched movie with a total of 329 distinct users out of a total of 610, its support is 0.54, and in second place is The Shawshank Redemption with 317 users.
The next graph shows the distribution of the size of the transactions. It is concentrated in low values, with half of the users having rated up to 70.5 films.
data.frame(trans) %>%
ggplot(aes(x = trans)) +
geom_histogram() +
labs(title = "Size of the transactions",
x = "Amount of movies") +
theme_bw()
To find significant support we define transactions with at least 71 items (Median).
parameters = list(
supp = 71 / dim(user_item_matrix)[1],
conf = 0.7,
maxlen = 2)
rules <- apriori(user_item_matrix, parameter = parameters)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.7 0.1 1 none FALSE TRUE 5 0.1163934 1
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 71
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9724 item(s), 610 transaction(s)] done [0.03s].
## sorting and recoding items ... [252 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(user_item_matrix, parameter = parameters): Mining stopped
## (maxlen reached). Only patterns up to a length of 2 returned!
## done [0.00s].
## writing ... [1016 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules_table = inspect(head(rules), linebreak = FALSE)
## lhs rhs support confidence coverage lift count
## [1] {300} => {296} 0.1163934 0.8765432 0.1327869 1.741666 71
## [2] {292} => {165} 0.1163934 0.7029703 0.1655738 2.977860 71
## [3] {292} => {316} 0.1196721 0.7227723 0.1655738 3.149222 73
## [4] {292} => {150} 0.1295082 0.7821782 0.1655738 2.373775 79
## [5] {292} => {380} 0.1344262 0.8118812 0.1655738 2.782289 82
## [6] {292} => {457} 0.1360656 0.8217822 0.1655738 2.638353 83
Summary of the rules found:
summary(rules)
## set of 1016 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 1016
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1164 Min. :0.7000 Min. :0.1279 Min. :1.298
## 1st Qu.:0.1279 1st Qu.:0.7264 1st Qu.:0.1672 1st Qu.:1.535
## Median :0.1426 Median :0.7604 Median :0.1885 Median :1.907
## Mean :0.1567 Mean :0.7688 Mean :0.2046 Mean :2.029
## 3rd Qu.:0.1689 3rd Qu.:0.7993 3rd Qu.:0.2213 3rd Qu.:2.386
## Max. :0.3787 Max. :0.9690 Max. :0.5393 Max. :4.887
## count
## Min. : 71.00
## 1st Qu.: 78.00
## Median : 87.00
## Mean : 95.58
## 3rd Qu.:103.00
## Max. :231.00
##
## mining info:
## data ntransactions support confidence
## user_item_matrix 610 0.1163934 0.7
At least 1016 significant rules have been found with a mean value of support of 15.7% and maximum of 37.9%, the confidence mean value is 76.88%. Also the mean value of the lift level is around 2, so, on average, the recommendations found are well associated.
As result we also obtain some levels of confidence and lift:
Confidence is the percentage in which the recommended movie is also satisfied upon the particular antecedent, i.e., the proportion of transactions where the title.x results in the presence of title.y, it can be understood also as a measure of centainty or trustworthiness associated with each discovered pattern.
The lift level is a measure of control for the support(frequency) of the recommended movies while calculating the conditional probability of occurrence of title.y given title.x. A value of lift greater than 1 vouches for positive associations between the title.y and title.x, lift around 1 implies independent titles, and a value of lift smaller than 1 implies negative associations.
Relation between support and confidence for each rule:
plot(rules, engine="plotly", max=1016)
We can see that the higher the support the lower the lift level, and that the higher levels of lift are around the mean value of confidence 77%.
Relation between support and the lift:
plot(rules, engine="plotly", measure = c("support", "lift"), shading = "confidence", max=1016)
The graph shows us how most of the movies have a support lower than the 20% and lift levels between 2.5 and 1.3.
We change the format of the database to see the name and genre of the movies:
associations <- as(rules,"data.frame")
rules = sapply(associations$rules,function(x){
x = gsub("[\\{\\}]", "", regmatches(x, gregexpr("\\{.*\\}", x))[[1]])
x = gsub("=>",",",x)
x = str_replace_all(x," ","")
return( x )
})
rules <- as.character(rules)
rules <- str_split(rules,",")
associations$lhs_movie <- sapply(rules, "[[", 1)
associations$rhs_movie <- sapply(rules, "[[", 2)
associations$lhs_movie = as.numeric(associations$lhs_movie)
associations$rhs_movie = as.numeric(associations$rhs_movie)
all_genres <- unique(unlist(str_split(movies$genres,"\\|")))
for(genre in all_genres){
movies[genre] =
ifelse((str_detect(movies$genres,genre) |
str_detect(movies$genres,"no genres")) , 1 , 0)
}
movies$genres <- NULL
associations <- associations %>%
left_join(movies[, c(1,2)], by=c("lhs_movie" = "movieId"))
associations <- associations %>%
left_join(movies[, c(1,2)], by=c("rhs_movie" = "movieId"))
head(movies)
## movieId title Adventure Animation Children
## 1 1 Toy Story (1995) 1 1 1
## 2 2 Jumanji (1995) 1 0 1
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 0 0 0
## Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi War
## 1 1 1 0 0 0 0 0 0 0 0 0
## 2 0 1 0 0 0 0 0 0 0 0 0
## 3 1 0 1 0 0 0 0 0 0 0 0
## 4 1 0 1 1 0 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 1 1 1 0 0 0 0
## Musical Documentary IMAX Western Film-Noir (no genres listed)
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
Rules that have a higher lift:
associations %>%
arrange(desc(lift)) %>%
select(title.x, title.y, support, confidence, lift, count) %>%
head()
## title.x
## 1 Harry Potter and the Prisoner of Azkaban (2004)
## 2 Harry Potter and the Chamber of Secrets (2002)
## 3 Harry Potter and the Prisoner of Azkaban (2004)
## 4 Harry Potter and the Chamber of Secrets (2002)
## 5 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 6 Kill Bill: Vol. 2 (2004)
## title.y
## 1 Harry Potter and the Chamber of Secrets (2002)
## 2 Harry Potter and the Prisoner of Azkaban (2004)
## 3 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 4 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)
## 5 Harry Potter and the Chamber of Secrets (2002)
## 6 Kill Bill: Vol. 1 (2003)
## support confidence lift count
## 1 0.1245902 0.8172043 4.887202 76
## 2 0.1245902 0.7450980 4.887202 76
## 3 0.1213115 0.7956989 4.536228 74
## 4 0.1327869 0.7941176 4.527213 81
## 5 0.1327869 0.7570093 4.527213 81
## 6 0.1688525 0.9363636 4.360167 103
As we can see the first rules show us movies form the series of Harry Potter recommended for another movie from the same series. To discard these cases of sequels, sagas or series, we eliminate the rules in which the first word of the title.x is equal to the first word of the title.y.
Now we look again at the first ten rules according to the lift.
assoc_2 <- associations %>%
filter(substr(title.x, 1, str_locate(title.x, ' ')-1) != substr(title.y, 1, str_locate(title.y, ' ')-1)) %>%
filter(substr(title.x, 1, str_locate(title.x, "[[:punct:]]")-1) != substr(title.y, 1, str_locate(title.y, " ")-1)) %>%
filter(substr(title.x, 1, str_locate(title.x, " ")-1) != substr(title.y, 1, str_locate(title.y, "[[:punct:]]")-1)) %>%
arrange(desc(lift))
head(assoc_2 %>%
select(title.x, title.y, support, confidence, lift),10)
## title.x
## 1 WALL·E (2008)
## 2 Crimson Tide (1995)
## 3 Cliffhanger (1993)
## 4 Star Wars: Episode II - Attack of the Clones (2002)
## 5 Indiana Jones and the Temple of Doom (1984)
## 6 Catch Me If You Can (2002)
## 7 Iron Man (2008)
## 8 Shrek 2 (2004)
## 9 2001: A Space Odyssey (1968)
## 10 Minority Report (2002)
## title.y support confidence lift
## 1 Up (2009) 0.1196721 0.7019231 4.077839
## 2 Clear and Present Danger (1994) 0.1213115 0.7184466 3.984113
## 3 Clear and Present Danger (1994) 0.1180328 0.7128713 3.953195
## 4 Spider-Man (2002) 0.1180328 0.7826087 3.913043
## 5 Ghostbusters (a.k.a. Ghost Busters) (1984) 0.1278689 0.7222222 3.671296
## 6 Ocean's Eleven (2001) 0.1344262 0.7130435 3.655097
## 7 Dark Knight, The (2008) 0.1344262 0.8723404 3.571327
## 8 Monsters, Inc. (2001) 0.1163934 0.7717391 3.566370
## 9 Blade Runner (1982) 0.1278689 0.7155963 3.520272
## 10 Spider-Man (2002) 0.1377049 0.7000000 3.500000
The rule with the highest lift associates the movies “WALL·E” with “Up” with a support of 12%, a confidence of 70.2% and a lift of 4.1, in second place is the association between “Crimson Tide” and “Clear and Present Danger”.
The following graph shows the top 30 rules according to the lift value:
assoc_lift <- assoc_2[1:30, ]
assoc_lift$title.x <- substr(assoc_lift$title.x, 1, str_locate(assoc_lift$title.x,"\\(")-1)
assoc_lift$title.y <- substr(assoc_lift$title.y, 1, str_locate(assoc_lift$title.y,"\\(")-1)
assoc_lift$title.x <- trimws(assoc_lift$title.x, "both")
assoc_lift$title.y <- trimws(assoc_lift$title.y, "both")
ggplot(assoc_lift, aes(x = title.x, y = title.y, fill = lift)) +
geom_tile() +
labs(title = "Association rules", subtitle = "Top 30 rules according with the lift")+
theme(axis.text.x=element_text(angle = -45, hjust = 0))
Top 30 rules according to the value of confidence:
assoc_conf <- assoc_2 %>% arrange(desc(confidence))
assoc_conf <- assoc_conf[1:30, ]
assoc_conf$title.x <- substr(assoc_conf$title.x, 1, str_locate(assoc_conf$title.x,"\\(")-1)
assoc_conf$title.y <- substr(assoc_conf$title.y, 1, str_locate(assoc_conf$title.y,"\\(")-1)
assoc_conf$title.x <- trimws(assoc_conf$title.x, "both")
assoc_conf$title.y <- trimws(assoc_conf$title.y, "both")
ggplot(assoc_conf, aes(x = title.x, y = title.y, fill = confidence)) +
geom_tile() +
labs(title = "Association rules", subtitle = "Top 30 rules according with the confidence")+
theme(axis.text.x=element_text(angle = -45, hjust = 0))
Now we could try to check the recommendations of only one movie and focus on how the algorithm works for that movie. Let’s take as example the movie ‘Blade Runner’:
bladerunner1 <- assoc_2 %>% filter(grepl("Blade Runner",title.x)) %>% arrange(desc(lift))
head(bladerunner1 %>% select(title.x, title.y, support, confidence, lift, count), 10)
## title.x
## 1 Blade Runner (1982)
## 2 Blade Runner (1982)
## 3 Blade Runner (1982)
## 4 Blade Runner (1982)
## 5 Blade Runner (1982)
## 6 Blade Runner (1982)
## 7 Blade Runner (1982)
## title.y
## 1 Star Wars: Episode V - The Empire Strikes Back (1980)
## 2 Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
## 3 Star Wars: Episode IV - A New Hope (1977)
## 4 Matrix, The (1999)
## 5 Silence of the Lambs, The (1991)
## 6 Pulp Fiction (1994)
## 7 Forrest Gump (1994)
## support confidence lift count
## 1 0.1573770 0.7741935 2.238190 96
## 2 0.1426230 0.7016129 2.139919 87
## 3 0.1573770 0.7741935 1.881506 96
## 4 0.1688525 0.8306452 1.822639 103
## 5 0.1475410 0.7258065 1.586889 90
## 6 0.1508197 0.7419355 1.474204 92
## 7 0.1573770 0.7741935 1.435435 96
bladerunner2 <- assoc_2 %>% filter(grepl("Blade Runner",title.y)) %>% arrange(desc(lift))
head(bladerunner2 %>% select(title.x, title.y, support, confidence, lift, count), 10)
## title.x title.y support confidence
## 1 2001: A Space Odyssey (1968) Blade Runner (1982) 0.1278689 0.7155963
## lift count
## 1 3.520272 78
As we can see if the movie watched is Blade Runner, the algorithm will recommend another 7 possible movies being ‘Star Wars: Episode V’ the one with higher lift level 2.24. On the other hand, Blade Runner is not one of the most recommended movies since it’s only a recommendation for the movie ‘2001: A Space Odyssey’, despite of that it is recommended with a very high lift level of 3.52, so they are very well associated.