Introduction
Description of data
Data preparation
Association rule - by movie title
Significant rules
Closed items
Similarity and dissimilarity measure
Visualizations
Association rule - by movie genre
Summary
Recommendation systems are now widely used across media platforms - from product advertisements based on your previous web searches, to song and video recommendations based on a user's previous preferences. As the topic is complex, many methods can be applied. One of them is association rule mining, which makes it possible to discover relations between features in a database.
In the following research, the method will be applied in the form of a movie recommendation system. By analyzing a database containing movie ratings of distinct users, we can find interesting rules occurring in the data, for instance: if a user liked movies A and B, they will most likely also want to watch movie C. With such an analysis, platforms offering access to movies and TV series could recommend movies that users would most likely enjoy, based on their previous ratings.
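Throughout the analysis, each rule of the form {A, B} => {C} will be described by three measures: support, confidence and lift. A minimal toy illustration of how they are computed (the counts below are hypothetical, for illustration only):
n_users <- 1000   # number of users (transactions), hypothetical
n_AB    <- 200    # users who liked both A and B
n_ABC   <- 150    # users who liked A, B and C
n_C     <- 300    # users who liked C
support    <- n_ABC / n_users               # 0.15: share of users with the whole itemset
confidence <- n_ABC / n_AB                  # 0.75: P(likes C | liked A and B)
lift       <- confidence / (n_C / n_users)  # 2.5: how much more likely than by chance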
The data used in the following research comes from the Kaggle platform and can be found under the link: https://www.kaggle.com/grouplens/movielens-20m-dataset?select=rating.csv
It contains movie ratings from different users on MovieLens platform.
Two tables from the database will be used:
rating <- read.csv("rating.csv", sep=",", dec=".", header=TRUE)
head(rating)
## userId movieId rating timestamp
## 1 1 2 3.5 2005-04-02 23:53:47
## 2 1 29 3.5 2005-04-02 23:31:16
## 3 1 32 3.5 2005-04-02 23:33:39
## 4 1 47 3.5 2005-04-02 23:32:07
## 5 1 50 3.5 2005-04-02 23:29:40
## 6 1 112 3.5 2004-09-10 03:09:00
movie <- read.csv("movie.csv", sep=",", dec=".", header=TRUE)
head(movie)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
Before the association rule analysis, the data needs to be properly transformed. The two datasets - rating and movie - need to be merged, so that we have users' ratings per movie title rather than per movie ID. This will make interpretation easier.
# Merge ratings with movie metadata, then drop the movieId and timestamp columns
dataset <- merge(rating, movie, by = "movieId")
dataset <- dataset[,-c(1,4)]
head(dataset)
## userId rating title genres
## 1 124152 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 2 93599 4.5 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 3 136201 2.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 4 8863 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 5 4903 4.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 6 28307 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
The data now contains approximately 20 million rows. To reduce the number of computations (and shorten the algorithms' run time), the dataset will be restricted to movies released between 1990 and 2020.
# Extract the release year from the end of each title, e.g. "Toy Story (1995)" -> "1995"
dataset$year <- substr(dataset$title, nchar(dataset$title) - 4, nchar(dataset$title) - 1)
# Keep only movies released between 1990 and 2020
years <- as.character(1990:2020)
dataset <- dataset[dataset$year %in% years,]
dataset <- dataset[order(dataset$userId),]
user_rating <- dataset[,c(1,2,3)]   # keep userId, rating and title
The next step is to choose the rating level, on the 0-5 scale, from which we assume the user liked the movie. Here it is reasonable to keep only movies rated 3 or higher. This way, the association rules will only consider movies that users actually liked, not merely watched by chance.
# Flag movies the user liked (rating of 3 or higher) and keep only those rows
user_rating$like <- ifelse(user_rating$rating >= 3, 1, 0)
user_rating <- user_rating[user_rating$like == 1,-c(2,4)]   # drop the rating and like columns
write.csv(user_rating, file="user_rating.csv")
nrow(user_rating)
## [1] 11514743
After all transformations, the dataset contains approximately 11.5 million observations.
library(arules)
library(arulesViz)
library(arulesCBA)
The first step of the analysis is to read the data in the form of transactions, where each user's transaction is the set of movies they liked. It can be done in the following way:
trans1<-read.transactions("user_rating.csv", format="single", sep=",", cols=c("userId","title"), header=TRUE) # reading the file as transactions
trans1
## transactions in sparse format with
## 138341 transactions (rows) and
## 15703 items (columns)
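Before mining, it is worth taking a quick look at the transaction object itself - a short sketch using standard arules accessors:
summary(trans1)                 # density, most frequent movies, basket size distribution
inspect(head(trans1, 2))        # liked-movie baskets of the first two users
summary(itemFrequency(trans1))  # distribution of per-movie support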
After reading the data, an Eclat analysis can be performed. It will show the most frequent itemsets of movies, from which rules will also be induced.
freq.items<-eclat(trans1, parameter=list(supp=0.25, maxlen=15)) # basic eclat run
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.25 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 34585
##
## create itemset ...
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.24s].
## sorting and recoding items ... [26 item(s)] done [0.04s].
## creating bit matrix ... [26 row(s), 138341 column(s)] done [0.02s].
## writing ... [38 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq.items)
## items support transIdenticalToItemsets count
## [1] {Forrest Gump (1994),
## Fugitive, The (1993)} 0.2505837 34666 34666
## [2] {Fugitive, The (1993),
## Silence of the Lambs, The (1991)} 0.2504970 34654 34654
## [3] {Braveheart (1995),
## Forrest Gump (1994)} 0.2532366 35033 35033
## [4] {Jurassic Park (1993),
## Terminator 2: Judgment Day (1991)} 0.2591278 35848 35848
## [5] {Forrest Gump (1994),
## Jurassic Park (1993)} 0.2859311 39556 39556
## [6] {Jurassic Park (1993),
## Silence of the Lambs, The (1991)} 0.2523547 34911 34911
## [7] {Pulp Fiction (1994),
## Shawshank Redemption, The (1994)} 0.3046241 42142 42142
## [8] {Forrest Gump (1994),
## Shawshank Redemption, The (1994)} 0.2812543 38909 38909
## [9] {Shawshank Redemption, The (1994),
## Silence of the Lambs, The (1991)} 0.2944246 40731 40731
## [10] {Pulp Fiction (1994),
## Silence of the Lambs, The (1991)} 0.3143898 43493 43493
## [11] {Forrest Gump (1994),
## Silence of the Lambs, The (1991)} 0.2862492 39600 39600
## [12] {Forrest Gump (1994),
## Pulp Fiction (1994)} 0.2899430 40111 40111
## [13] {Pulp Fiction (1994)} 0.4529026 62655 62655
## [14] {Forrest Gump (1994)} 0.4431586 61307 61307
## [15] {Silence of the Lambs, The (1991)} 0.4380697 60603 60603
## [16] {Shawshank Redemption, The (1994)} 0.4485438 62052 62052
## [17] {Matrix, The (1999)} 0.3482409 48176 48176
## [18] {Jurassic Park (1993)} 0.3839787 53120 53120
## [19] {Terminator 2: Judgment Day (1991)} 0.3522528 48731 48731
## [20] {Braveheart (1995)} 0.3591343 49683 49683
## [21] {Usual Suspects, The (1995)} 0.3305889 45734 45734
## [22] {Toy Story (1995)} 0.3342827 46245 46245
## [23] {American Beauty (1999)} 0.3044145 42113 42113
## [24] {Seven (a.k.a. Se7en) (1995)} 0.2957692 40917 40917
## [25] {Fugitive, The (1993)} 0.3462242 47897 47897
## [26] {Schindler's List (1993)} 0.3479373 48134 48134
## [27] {Sixth Sense, The (1999)} 0.2660600 36807 36807
## [28] {Twelve Monkeys (a.k.a. 12 Monkeys) (1995)} 0.3047976 42166 42166
## [29] {Fight Club (1999)} 0.2731511 37788 37788
## [30] {Fargo (1996)} 0.2927765 40503 40503
## [31] {Apollo 13 (1995)} 0.3240399 44828 44828
## [32] {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.2504680 34650 34650
## [33] {Speed (1994)} 0.2567569 35520 35520
## [34] {Lion King, The (1994)} 0.2572412 35587 35587
## [35] {Independence Day (a.k.a. ID4) (1996)} 0.2693706 37265 37265
## [36] {Aladdin (1992)} 0.2747342 38007 38007
## [37] {True Lies (1994)} 0.2702019 37380 37380
## [38] {Dances with Wolves (1990)} 0.2875214 39776 39776
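Single movies dominate the lower part of this list. To focus only on movies that co-occur, the itemsets can be filtered by size - a sketch using arules' size():
pairs.only <- freq.items[size(freq.items) >= 2]  # keep itemsets with at least two movies
inspect(sort(pairs.only, by="support", decreasing=TRUE))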
Now, rules can be induced from the frequent itemsets:
## getting rules
freq.rules<-ruleInduction(freq.items, trans1, confidence=0.60)
inspect(freq.rules) # screening the rules
## lhs rhs support confidence lift itemset
## [1] {Fugitive, The (1993)} => {Forrest Gump (1994)} 0.2505837 0.7237614 1.633188 1
## [2] {Fugitive, The (1993)} => {Silence of the Lambs, The (1991)} 0.2504970 0.7235109 1.651588 2
## [3] {Braveheart (1995)} => {Forrest Gump (1994)} 0.2532366 0.7051305 1.591147 3
## [4] {Terminator 2: Judgment Day (1991)} => {Jurassic Park (1993)} 0.2591278 0.7356303 1.915810 4
## [5] {Jurassic Park (1993)} => {Terminator 2: Judgment Day (1991)} 0.2591278 0.6748494 1.915810 4
## [6] {Jurassic Park (1993)} => {Forrest Gump (1994)} 0.2859311 0.7446536 1.680332 5
## [7] {Forrest Gump (1994)} => {Jurassic Park (1993)} 0.2859311 0.6452118 1.680332 5
## [8] {Jurassic Park (1993)} => {Silence of the Lambs, The (1991)} 0.2523547 0.6572101 1.500241 6
## [9] {Shawshank Redemption, The (1994)} => {Pulp Fiction (1994)} 0.3046241 0.6791401 1.499528 7
## [10] {Pulp Fiction (1994)} => {Shawshank Redemption, The (1994)} 0.3046241 0.6726039 1.499528 7
## [11] {Shawshank Redemption, The (1994)} => {Forrest Gump (1994)} 0.2812543 0.6270386 1.414931 8
## [12] {Forrest Gump (1994)} => {Shawshank Redemption, The (1994)} 0.2812543 0.6346584 1.414931 8
## [13] {Silence of the Lambs, The (1991)} => {Shawshank Redemption, The (1994)} 0.2944246 0.6720954 1.498394 9
## [14] {Shawshank Redemption, The (1994)} => {Silence of the Lambs, The (1991)} 0.2944246 0.6564011 1.498394 9
## [15] {Silence of the Lambs, The (1991)} => {Pulp Fiction (1994)} 0.3143898 0.7176707 1.584603 10
## [16] {Pulp Fiction (1994)} => {Silence of the Lambs, The (1991)} 0.3143898 0.6941665 1.584603 10
## [17] {Silence of the Lambs, The (1991)} => {Forrest Gump (1994)} 0.2862492 0.6534330 1.474490 11
## [18] {Forrest Gump (1994)} => {Silence of the Lambs, The (1991)} 0.2862492 0.6459295 1.474490 11
## [19] {Pulp Fiction (1994)} => {Forrest Gump (1994)} 0.2899430 0.6401883 1.444603 12
## [20] {Forrest Gump (1994)} => {Pulp Fiction (1994)} 0.2899430 0.6542646 1.444603 12
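These measures can be verified by hand. For rule [1], {Fugitive, The (1993)} => {Forrest Gump (1994)}, the supports reported by eclat above are sufficient:
# support values copied from the eclat output above
supp_both <- 0.2505837  # {Fugitive, The (1993), Forrest Gump (1994)}
supp_lhs  <- 0.3462242  # {Fugitive, The (1993)}
supp_rhs  <- 0.4431586  # {Forrest Gump (1994)}
supp_both / supp_lhs                # confidence: ~0.7238
(supp_both / supp_lhs) / supp_rhs   # lift: ~1.633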
The Apriori method, on the other hand, creates rules for chosen minimum support and minimum confidence thresholds. The higher both thresholds, the stronger the resulting rules, but also the fewer of them; suitable levels depend on the specifics of the data set and are usually found by trial and error.
Let’s look for rules with support higher than 0.17, and confidence higher than 0.90.
# creating rules - standard settings
rules.trans1<-apriori(trans1, parameter=list(supp=0.17, conf=0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.17 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 23517
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.53s].
## sorting and recoding items ... [60 item(s)] done [0.07s].
## creating transaction tree ... done [0.11s].
## checking subsets of size 1 2 3 4 done [0.20s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.04s].
One can display the rules and sort them by different measures, such as confidence, lift, count and support.
rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
# sorting rules by lift
rules.by.lift<-sort(rules.trans1, by="lift", decreasing=TRUE) # sorting by lift
inspect(head(rules.by.lift))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
## [2] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
# sorting rules by count
rules.by.count<- sort(rules.trans1, by="count", decreasing=TRUE) # sorting by count
inspect(head(rules.by.count))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
# sorting by support
rules.by.supp<-sort(rules.trans1, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
Looking at the previous results, we can draw some interesting conclusions about users' preferences. Yet some of the rules may be redundant, meaning that a more general rule with the same or higher confidence exists. Redundant rules are undesirable: they complicate the interpretation of the results and may confuse the recommendation system (the problem of deciding which rule is better).
For the sake of a clearer analysis, redundant rules will be removed from now on; the pruning pattern is sketched below.
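A minimal sketch of that pruning step, based on arules' is.redundant():
# drop rules for which a more general rule with at least the same confidence exists
rules.pruned <- rules.trans1[!is.redundant(rules.trans1)]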
Apriori rules can be analyzed from two sides: by fixing the right-hand side (RHS) of a rule to a chosen movie, or by fixing its left-hand side (LHS).
In this section, the RHS approach will be applied to three different movies. Comparing the results will show whether the rules reflect users' preferences. The chosen movies are: Harry Potter and the Chamber of Secrets, Finding Nemo and The Dark Knight - each from a different genre. Let's see what the apriori algorithm proposes.
## Harry Potter
rules.HP<-apriori(data=trans1, parameter=list(supp=0.03, conf=0.85), appearance=list(default="lhs", rhs="Harry Potter and the Chamber of Secrets (2002)"), control=list(verbose=F))
# removing redundant rules
rules.HP <- rules.HP[!is.redundant(rules.HP)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.HP, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Goblet of Fire (2005),
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} => {Harry Potter and the Chamber of Secrets (2002)} 0.04192539 0.8700870 0.04818528 9.759889 5800
## [2] {Harry Potter and the Prisoner of Azkaban (2004),
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} => {Harry Potter and the Chamber of Secrets (2002)} 0.05401146 0.8623197 0.06263508 9.672761 7472
rules.Nemo<-apriori(data=trans1, parameter=list(supp=0.06, conf=0.85), appearance=list(default="lhs", rhs="Finding Nemo (2003)"), control=list(verbose=F))
# removing redundant rules
rules.Nemo <- rules.Nemo[!is.redundant(rules.Nemo)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.Nemo, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Incredibles, The (2004),
## Monsters, Inc. (2001),
## Shrek (2001)} => {Finding Nemo (2003)} 0.06359648 0.8715206 0.07297186 5.549180 8798
## [2] {Incredibles, The (2004),
## Lord of the Rings: The Fellowship of the Ring, The (2001),
## Monsters, Inc. (2001)} => {Finding Nemo (2003)} 0.06067616 0.8594246 0.07060091 5.472162 8394
## [3] {Incredibles, The (2004),
## Matrix, The (1999),
## Monsters, Inc. (2001)} => {Finding Nemo (2003)} 0.06151466 0.8551904 0.07193095 5.445202 8510
## [4] {Monsters, Inc. (2001),
## Pirates of the Caribbean: The Curse of the Black Pearl (2003),
## Toy Story (1995)} => {Finding Nemo (2003)} 0.06095084 0.8531822 0.07143941 5.432415 8432
## [5] {Incredibles, The (2004),
## Shrek (2001),
## Toy Story (1995)} => {Finding Nemo (2003)} 0.06129058 0.8521608 0.07192372 5.425911 8479
rules.knight<-apriori(data=trans1, parameter=list(supp=0.03, conf=0.85), appearance=list(default="lhs", rhs="Dark Knight, The (2008)"), control=list(verbose=F))
# removing redundant rules
rules.knight <- rules.knight[!is.redundant(rules.knight)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.knight, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Batman Begins (2005),
## District 9 (2009)} => {Dark Knight, The (2008)} 0.03079347 0.9426864 0.03266566 6.699830 4260
## [2] {Batman Begins (2005),
## Inglourious Basterds (2009)} => {Dark Knight, The (2008)} 0.03436436 0.9425059 0.03646063 6.698547 4754
## [3] {Batman Begins (2005),
## Star Trek (2009)} => {Dark Knight, The (2008)} 0.03272349 0.9324408 0.03509444 6.627012 4527
## [4] {Batman Begins (2005),
## Inception (2010)} => {Dark Knight, The (2008)} 0.04103628 0.9312664 0.04406503 6.618666 5677
## [5] {Batman Begins (2005),
## Up (2009)} => {Dark Knight, The (2008)} 0.03137176 0.9228152 0.03399571 6.558602 4340
## [6] {Batman Begins (2005),
## WALL·E (2008)} => {Dark Knight, The (2008)} 0.04181696 0.9169440 0.04560470 6.516874 5785
The results show that the association rules are in fact reasonable, pointing to similar movies. Even for "Finding Nemo", the algorithm found other animated family movies. Some divergent movies may still appear, however, as the algorithm captures co-occurrence of ratings, not the causality behind the choices.
Fixing the LHS is the second way of analyzing the apriori results. Let's look at the same three movies, but from the LHS perspective.
## Harry Potter
rules.HP<-apriori(data=trans1, parameter=list(supp=0.04,conf = 0.08),
appearance=list(default="rhs",lhs="Harry Potter and the Chamber of Secrets (2002)"), control=list(verbose=F))
# removing redundant rules
rules.HP <- rules.HP[!is.redundant(rules.HP)]
rules.HP.byconf<-sort(rules.HP, by="confidence", decreasing=TRUE)
inspect(head(rules.HP.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Chamber of Secrets (2002)} => {Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} 0.07221287 0.8100219 0.08914928 7.593633 9990
## [2] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.07152616 0.8023190 0.08914928 3.203279 9895
## [3] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Two Towers, The (2002)} 0.06903955 0.7744263 0.08914928 3.416183 9551
## [4] {Harry Potter and the Chamber of Secrets (2002)} => {Matrix, The (1999)} 0.06783239 0.7608854 0.08914928 2.184940 9384
## [5] {Harry Potter and the Chamber of Secrets (2002)} => {Shrek (2001)} 0.06592406 0.7394794 0.08914928 3.516804 9120
## [6] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Return of the King, The (2003)} 0.06265677 0.7028298 0.08914928 3.329345 8668
rules.Nemo<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.7),
appearance=list(default="rhs",lhs="Finding Nemo (2003)"), control=list(verbose=F))
# removing redundant rules
rules.Nemo <- rules.Nemo[!is.redundant(rules.Nemo)]
rules.Nemo.byconf<-sort(rules.Nemo, by="confidence", decreasing=TRUE)
inspect(head(rules.Nemo.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Finding Nemo (2003)} => {Matrix, The (1999)} 0.1175212 0.7482855 0.1570539 2.148758 16258
## [2] {Finding Nemo (2003)} => {Shrek (2001)} 0.1158008 0.7373314 0.1570539 3.506589 16020
## [3] {Finding Nemo (2003)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1138853 0.7251346 0.1570539 2.895118 15755
## [4] {Finding Nemo (2003)} => {Forrest Gump (1994)} 0.1112468 0.7083353 0.1570539 1.598379 15390
## [5] {Finding Nemo (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1107481 0.7051595 0.1570539 3.110630 15321
## The Dark Knight
rules.knight<-apriori(data=trans1, parameter=list(supp=0.04,conf = 0.08),
appearance=list(default="rhs",lhs="Dark Knight, The (2008)"), control=list(verbose=F))
# removing redundant rules
rules.knight <- rules.knight[!is.redundant(rules.knight)]
rules.knight.byconf<-sort(rules.knight, by="confidence", decreasing=TRUE)
inspect(head(rules.knight.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Dark Knight, The (2008)} => {Matrix, The (1999)} 0.10260154 0.7292063 0.140703 2.093971 14194
## [2] {Dark Knight, The (2008)} => {Fight Club (1999)} 0.09956557 0.7076291 0.140703 2.590614 13774
## [3] {Dark Knight, The (2008)} => {Shawshank Redemption, The (1994)} 0.09456343 0.6720781 0.140703 1.498355 13082
## [4] {Dark Knight, The (2008)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.08913482 0.6334960 0.140703 2.529249 12331
## [5] {Dark Knight, The (2008)} => {Lord of the Rings: The Return of the King, The (2003)} 0.08700964 0.6183920 0.140703 2.929358 12037
## [6] {Dark Knight, The (2008)} => {Pulp Fiction (1994)} 0.08690844 0.6176727 0.140703 1.363809 12023
Another way of analyzing association rules is to keep only rules that are significant according to Fisher's exact test. Let's see which significant rules have the highest confidence:
significant <- is.significant(rules.HP, trans1)
inspect(head(sort(rules.HP[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Chamber of Secrets (2002)} => {Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} 0.07221287 0.8100219 0.08914928 7.593633 9990
## [2] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.07152616 0.8023190 0.08914928 3.203279 9895
## [3] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Two Towers, The (2002)} 0.06903955 0.7744263 0.08914928 3.416183 9551
## [4] {Harry Potter and the Chamber of Secrets (2002)} => {Matrix, The (1999)} 0.06783239 0.7608854 0.08914928 2.184940 9384
## [5] {Harry Potter and the Chamber of Secrets (2002)} => {Shrek (2001)} 0.06592406 0.7394794 0.08914928 3.516804 9120
## [6] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Return of the King, The (2003)} 0.06265677 0.7028298 0.08914928 3.329345 8668
significant <- is.significant(rules.Nemo, trans1)
inspect(head(sort(rules.Nemo[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Finding Nemo (2003)} => {Matrix, The (1999)} 0.1175212 0.7482855 0.1570539 2.148758 16258
## [2] {Finding Nemo (2003)} => {Shrek (2001)} 0.1158008 0.7373314 0.1570539 3.506589 16020
## [3] {Finding Nemo (2003)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1138853 0.7251346 0.1570539 2.895118 15755
## [4] {Finding Nemo (2003)} => {Forrest Gump (1994)} 0.1112468 0.7083353 0.1570539 1.598379 15390
## [5] {Finding Nemo (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1107481 0.7051595 0.1570539 3.110630 15321
significant <- is.significant(rules.knight, trans1)
inspect(head(sort(rules.knight[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Dark Knight, The (2008)} => {Matrix, The (1999)} 0.10260154 0.7292063 0.140703 2.093971 14194
## [2] {Dark Knight, The (2008)} => {Fight Club (1999)} 0.09956557 0.7076291 0.140703 2.590614 13774
## [3] {Dark Knight, The (2008)} => {Shawshank Redemption, The (1994)} 0.09456343 0.6720781 0.140703 1.498355 13082
## [4] {Dark Knight, The (2008)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.08913482 0.6334960 0.140703 2.529249 12331
## [5] {Dark Knight, The (2008)} => {Lord of the Rings: The Return of the King, The (2003)} 0.08700964 0.6183920 0.140703 2.929358 12037
## [6] {Dark Knight, The (2008)} => {Pulp Fiction (1994)} 0.08690844 0.6176727 0.140703 1.363809 12023
The results are the same as those obtained before, which implies that all of the previous rules were significant and that conclusions may be drawn from them.
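If needed, the underlying p-values can also be inspected directly - a sketch assuming arules' interestMeasure() with the "fishersExactTest" measure (is.significant() additionally applies a multiple-testing correction):
# raw Fisher's exact test p-values, one per rule
pvals <- interestMeasure(rules.knight, measure="fishersExactTest", transactions=trans1)
summary(pvals)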
Another useful analysis of the obtained results is to find closed frequent itemsets. These are itemsets whose support is strictly higher than the support of any of their supersets, and at least as high as the minimum support fixed at the beginning of the analysis.
Closed frequent itemsets can be found with the apriori algorithm in the following way.
trans1.closed<-apriori(trans1, parameter=list(target="closed frequent itemsets", support=0.1))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 closed frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.53s].
## sorting and recoding items ... [159 item(s)] done [0.13s].
## creating transaction tree ... done [0.12s].
## checking subsets of size 1 2 3 4 5 6 done [2.04s].
## filtering closed item sets ... done [0.00s].
## sorting transactions ... done [0.04s].
## writing ... [4864 set(s)] done [0.00s].
## creating S4 object ... done [0.04s].
inspect(head(sort(trans1.closed, by="support", decreasing=TRUE)))
## items support transIdenticalToItemsets count
## [1] {Pulp Fiction (1994)} 0.4529026 0.0009830780 62655
## [2] {Shawshank Redemption, The (1994)} 0.4485438 0.0032672888 62052
## [3] {Forrest Gump (1994)} 0.4431586 0.0011782480 61307
## [4] {Silence of the Lambs, The (1991)} 0.4380697 0.0010698202 60603
## [5] {Jurassic Park (1993)} 0.3839787 0.0006144238 53120
## [6] {Braveheart (1995)} 0.3591343 0.0006361093 49683
Additionally, one can also obtain the results with the eclat function.
freq.closed<-eclat(trans1, parameter=list(supp=0.15, maxlen=15, target="closed frequent itemsets"))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 1 15 closed frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 20751
##
## create itemset ...
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.33s].
## sorting and recoding items ... [83 item(s)] done [0.09s].
## creating bit matrix ... [83 row(s), 138341 column(s)] done [0.03s].
## writing ... [497 set(s)] done [0.02s].
## Creating S4 object ... done [0.00s].
inspect(head(sort(freq.closed, by="support", decreasing=TRUE)))
## items support transIdenticalToItemsets count
## [1] {Pulp Fiction (1994)} 0.4529026 62655 62655
## [2] {Shawshank Redemption, The (1994)} 0.4485438 62052 62052
## [3] {Forrest Gump (1994)} 0.4431586 61307 61307
## [4] {Silence of the Lambs, The (1991)} 0.4380697 60603 60603
## [5] {Jurassic Park (1993)} 0.3839787 53120 53120
## [6] {Braveheart (1995)} 0.3591343 49683 49683
Checking similarity and dissimilarity measures makes it possible to find movies that rarely appear together in transactions. One such measure is the Jaccard distance; its values are shown in the matrix below. The higher the value, the less two movies overlap across transactions.
# Restrict to the most popular movies (support > 0.36) and compute
# pairwise Jaccard distances between the items
diss<-trans1[,itemFrequency(trans1)>0.36]
d.jac.i<-dissimilarity(diss, which="items")
round(d.jac.i,2)
## Forrest Gump (1994) Jurassic Park (1993)
## Jurassic Park (1993) 0.47
## Pulp Fiction (1994) 0.52 0.58
## Shawshank Redemption, The (1994) 0.54 0.61
## Silence of the Lambs, The (1991) 0.52 0.56
## Pulp Fiction (1994)
## Jurassic Park (1993)
## Pulp Fiction (1994)
## Shawshank Redemption, The (1994) 0.49
## Silence of the Lambs, The (1991) 0.45
## Shawshank Redemption, The (1994)
## Jurassic Park (1993)
## Pulp Fiction (1994)
## Shawshank Redemption, The (1994)
## Silence of the Lambs, The (1991) 0.50
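These entries can be reproduced from the counts reported by eclat earlier. For example, for Forrest Gump and Jurassic Park:
# counts copied from the frequent-itemset output above
n_fg   <- 61307  # users who liked Forrest Gump (1994)
n_jp   <- 53120  # users who liked Jurassic Park (1993)
n_both <- 39556  # users who liked both
1 - n_both/(n_fg + n_jp - n_both)  # Jaccard distance: ~0.47, as in the matrix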
Association rules can also be visualized in various ways.
For instance, one can see the most frequent items in the dataset:
itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")
One can also look at the relations between the rule quality measures - support, lift and confidence:
trans1<-read.transactions("user_rating.csv", format="single", sep=",", cols=c("userId","title"), header=TRUE)
rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.37s].
## sorting and recoding items ... [159 item(s)] done [0.12s].
## creating transaction tree ... done [0.13s].
## checking subsets of size 1 2 3 4 5 6 done [2.70s].
## writing ... [13056 rule(s)] done [0.00s].
## creating S4 object ... done [0.05s].
plot(sort(rules.trans1, by = "confidence", decreasing = TRUE), measure=c("support","lift"),shading="confidence")
plot(sort(rules.trans1, by = "confidence", decreasing = TRUE), measure=c("support","confidence"),shading="confidence")
Moreover, the relation between support and confidence can be enriched with the number of items in each rule, encoded by color (the "order" in the legend).
plot(rules.trans1, shading="order", control=list(main="Two-key plot"))
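arulesViz offers further visualization methods as well - a sketch using its documented "graph" and "grouped" methods:
top.rules <- head(sort(rules.trans1, by="lift", decreasing=TRUE), 20)
plot(top.rules, method="graph")    # strongest rules as a network of items
plot(top.rules, method="grouped")  # grouped matrix: LHS groups vs RHS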
Association rule mining does not have to be performed on individual movie titles. Ratings can also be aggregated at a higher level, i.e. by movie genre. With the following analysis, let's see which movie genres go together.
First, we need to prepare the data and merge the rating table with the movie table, which contains the genres for each title.
# Matching movie titles with their genres
user_rating <- merge(user_rating,movie,by = "title")
user_rating <- user_rating[,-c(3)]   # drop movieId
hierarchy <- user_rating[,c(2,3)]    # keep userId and genres
hierarchy <- hierarchy[!duplicated(hierarchy), ]
hierarchy <- hierarchy[order(hierarchy$userId),]
write.csv(hierarchy, file="hierarchy.csv")
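Note that each item written here is a movie's full genre combination (e.g. Comedy|Romance). If rules between single genres were wanted instead, the pipe-separated strings could be split first - a sketch assuming tidyr's separate_rows() (the output file name is hypothetical):
library(tidyr)
single <- separate_rows(hierarchy, genres, sep="\\|")  # one row per single genre
single <- single[!duplicated(single), ]
write.csv(single, file="hierarchy_single.csv")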
After this transformation, the same apriori algorithm can be applied, but to movie genres instead of movie titles.
trans2<-read.transactions("hierarchy.csv", format="single", sep=",", cols=c("userId","genres"), header=TRUE) # reading the file as transactions
rules.trans2_lev2<-apriori(trans2, parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1035 item(s), 138341 transaction(s)] done [1.52s].
## sorting and recoding items ... [132 item(s)] done [0.13s].
## creating transaction tree ... done [0.12s].
## checking subsets of size 1 2 3 4 done [14.57s].
## writing ... [656078 rule(s)] done [0.09s].
## creating S4 object ... done [0.19s].
rules.by.conf<-sort(rules.trans2_lev2, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {Adventure|Animation|Children|Comedy|Musical,
## Adventure|Drama|IMAX,
## Animation|Children|Fantasy|Musical|Romance|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1272508 0.9996593 0.1272941 1.879478 17604
## [2] {Adventure|Animation|Children|Comedy|Musical,
## Adventure|Animation|Children|Drama|Musical|IMAX,
## Adventure|Drama|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1234341 0.9994732 0.1234992 1.879128 17076
## [3] {Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX,
## Comedy|Crime|Drama|Thriller} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1272580 0.9994323 0.1273303 1.879051 17605
## [4] {Action|Adventure|Comedy|Romance|Thriller,
## Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1100831 0.9994094 0.1101481 1.879008 15229
## [5] {Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX,
## Thriller} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1361852 0.9993635 0.1362720 1.878922 18840
## [6] {Action|Adventure|Comedy|Romance|Thriller,
## Adventure|Animation|Children|Comedy|Musical,
## Adventure|Drama|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1192127 0.9993335 0.1192922 1.878865 16492
To sum up, using association rules to find relations between watched movies brought promising results. Although association analysis is mostly used for product transactions, this research showed that it can be applied in other areas as well. There is still much to discover in the field of association rules, yet even now it yields interesting results.