Introduction
Description of data
Data preparation
Association rule - by movie title
Significant rules
Closed items
Similarity and dissimilarity measure
Visualizations
Association rule - by movie genre
Summary
Recommendation systems are now widely used across media platforms - from product advertisements based on your previous web searches, to song and video recommendations based on a user's previous preferences. As the topic is complex, many methods can be applied. One of them is association rule mining, which makes it possible to discover relations between features in a database.
In the following research, the method will be applied in the form of a movie recommendation system. By analyzing a database containing movie ratings of distinct users, we can find interesting rules occurring in the data, for instance: if a user liked movies A and B, they will most likely also want to watch movie C. With such an analysis, platforms offering access to movies and TV series could recommend movies that users would most likely enjoy, based on their previous ratings.
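Throughout the analysis, each rule of the form {A, B} => {C} will be described by three measures: support, confidence and lift. A minimal toy illustration of how they are computed (the counts below are hypothetical, for illustration only):
n_users <- 1000   # number of users (transactions), hypothetical
n_AB    <- 200    # users who liked both A and B
n_ABC   <- 150    # users who liked A, B and C
n_C     <- 300    # users who liked C
support    <- n_ABC / n_users               # 0.15: share of users with the whole itemset
confidence <- n_ABC / n_AB                  # 0.75: P(likes C | liked A and B)
lift       <- confidence / (n_C / n_users)  # 2.5: how much more likely than by chance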
The data used in the following research comes from the Kaggle platform and can be found under the link: https://www.kaggle.com/grouplens/movielens-20m-dataset?select=rating.csv
It contains movie ratings from different users on MovieLens platform.
Two tables from the database will be used:
rating <- read.csv("rating.csv", sep=",", dec=".", header=TRUE)
head(rating)
## userId movieId rating timestamp
## 1 1 2 3.5 2005-04-02 23:53:47
## 2 1 29 3.5 2005-04-02 23:31:16
## 3 1 32 3.5 2005-04-02 23:33:39
## 4 1 47 3.5 2005-04-02 23:32:07
## 5 1 50 3.5 2005-04-02 23:29:40
## 6 1 112 3.5 2004-09-10 03:09:00
movie <- read.csv("movie.csv", sep=",", dec=".", header=TRUE)
head(movie)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
Before the association rule analysis, the data needs to be properly transformed. The two datasets - rating and movie - need to be merged, so that we have users' ratings per movie title rather than per movie ID. This will make interpretation easier.
# Merge ratings with movie metadata, then drop the movieId and timestamp columns
dataset <- merge(rating, movie, by = "movieId")
dataset <- dataset[,-c(1,4)]
head(dataset)
## userId rating title genres
## 1 124152 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 2 93599 4.5 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 3 136201 2.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 4 8863 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 5 4903 4.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
## 6 28307 5.0 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
The data now contains approximately 20 million rows. To reduce the number of computations (and shorten the algorithms' run time), the dataset will be restricted to movies released between 1990 and 2020.
# Extract the release year from the end of each title, e.g. "Toy Story (1995)" -> "1995"
dataset$year <- substr(dataset$title, nchar(dataset$title) - 4, nchar(dataset$title) - 1)
# Keep only movies released between 1990 and 2020
years <- as.character(1990:2020)
dataset <- dataset[dataset$year %in% years,]
dataset <- dataset[order(dataset$userId),]
user_rating <- dataset[,c(1,2,3)]   # keep userId, rating and title
The next step is to choose the rating level, on the 0-5 scale, from which we assume the user liked the movie. Here it is reasonable to keep only movies rated 3 or higher. This way, the association rules will only consider movies that users actually liked, not merely watched by chance.
# Flag movies the user liked (rating of 3 or higher) and keep only those rows
user_rating$like <- ifelse(user_rating$rating >= 3, 1, 0)
user_rating <- user_rating[user_rating$like == 1,-c(2,4)]   # drop the rating and like columns
write.csv(user_rating, file="user_rating.csv")
nrow(user_rating)
## [1] 11514743
After all transformations, the dataset contains approximately 11.5 million observations.
library(arules)
library(arulesViz)
library(arulesCBA)
The first step of the analysis is to read the data in the form of transactions, where each user's transaction is the set of movies they liked. It can be done in the following way:
trans1<-read.transactions("user_rating.csv", format="single", sep=",", cols=c("userId","title"), header=TRUE) # reading the file as transactions
trans1
## transactions in sparse format with
## 138341 transactions (rows) and
## 15703 items (columns)
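Before mining, it is worth taking a quick look at the transaction object itself - a short sketch using standard arules accessors:
summary(trans1)                 # density, most frequent movies, basket size distribution
inspect(head(trans1, 2))        # liked-movie baskets of the first two users
summary(itemFrequency(trans1))  # distribution of per-movie support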
After reading the data, an Eclat analysis can be performed. It will show the most frequent itemsets of movies, from which rules will also be induced.
freq.items<-eclat(trans1, parameter=list(supp=0.25, maxlen=15)) # basic eclat run
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.25 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 34585
##
## create itemset ...
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.24s].
## sorting and recoding items ... [26 item(s)] done [0.04s].
## creating bit matrix ... [26 row(s), 138341 column(s)] done [0.02s].
## writing ... [38 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(freq.items)
## items support transIdenticalToItemsets count
## [1] {Forrest Gump (1994),
## Fugitive, The (1993)} 0.2505837 34666 34666
## [2] {Fugitive, The (1993),
## Silence of the Lambs, The (1991)} 0.2504970 34654 34654
## [3] {Braveheart (1995),
## Forrest Gump (1994)} 0.2532366 35033 35033
## [4] {Jurassic Park (1993),
## Terminator 2: Judgment Day (1991)} 0.2591278 35848 35848
## [5] {Forrest Gump (1994),
## Jurassic Park (1993)} 0.2859311 39556 39556
## [6] {Jurassic Park (1993),
## Silence of the Lambs, The (1991)} 0.2523547 34911 34911
## [7] {Pulp Fiction (1994),
## Shawshank Redemption, The (1994)} 0.3046241 42142 42142
## [8] {Forrest Gump (1994),
## Shawshank Redemption, The (1994)} 0.2812543 38909 38909
## [9] {Shawshank Redemption, The (1994),
## Silence of the Lambs, The (1991)} 0.2944246 40731 40731
## [10] {Pulp Fiction (1994),
## Silence of the Lambs, The (1991)} 0.3143898 43493 43493
## [11] {Forrest Gump (1994),
## Silence of the Lambs, The (1991)} 0.2862492 39600 39600
## [12] {Forrest Gump (1994),
## Pulp Fiction (1994)} 0.2899430 40111 40111
## [13] {Pulp Fiction (1994)} 0.4529026 62655 62655
## [14] {Forrest Gump (1994)} 0.4431586 61307 61307
## [15] {Silence of the Lambs, The (1991)} 0.4380697 60603 60603
## [16] {Shawshank Redemption, The (1994)} 0.4485438 62052 62052
## [17] {Matrix, The (1999)} 0.3482409 48176 48176
## [18] {Jurassic Park (1993)} 0.3839787 53120 53120
## [19] {Terminator 2: Judgment Day (1991)} 0.3522528 48731 48731
## [20] {Braveheart (1995)} 0.3591343 49683 49683
## [21] {Usual Suspects, The (1995)} 0.3305889 45734 45734
## [22] {Toy Story (1995)} 0.3342827 46245 46245
## [23] {American Beauty (1999)} 0.3044145 42113 42113
## [24] {Seven (a.k.a. Se7en) (1995)} 0.2957692 40917 40917
## [25] {Fugitive, The (1993)} 0.3462242 47897 47897
## [26] {Schindler's List (1993)} 0.3479373 48134 48134
## [27] {Sixth Sense, The (1999)} 0.2660600 36807 36807
## [28] {Twelve Monkeys (a.k.a. 12 Monkeys) (1995)} 0.3047976 42166 42166
## [29] {Fight Club (1999)} 0.2731511 37788 37788
## [30] {Fargo (1996)} 0.2927765 40503 40503
## [31] {Apollo 13 (1995)} 0.3240399 44828 44828
## [32] {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.2504680 34650 34650
## [33] {Speed (1994)} 0.2567569 35520 35520
## [34] {Lion King, The (1994)} 0.2572412 35587 35587
## [35] {Independence Day (a.k.a. ID4) (1996)} 0.2693706 37265 37265
## [36] {Aladdin (1992)} 0.2747342 38007 38007
## [37] {True Lies (1994)} 0.2702019 37380 37380
## [38] {Dances with Wolves (1990)} 0.2875214 39776 39776
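Single movies dominate the lower part of this list. To focus only on movies that co-occur, the itemsets can be filtered by size - a sketch using arules' size():
pairs.only <- freq.items[size(freq.items) >= 2]  # keep itemsets with at least two movies
inspect(sort(pairs.only, by="support", decreasing=TRUE))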
Now, rules can be induced from the frequent itemsets:
## getting rules
freq.rules<-ruleInduction(freq.items, trans1, confidence=0.60)
inspect(freq.rules) # screening the rules
## lhs rhs support confidence lift itemset
## [1] {Fugitive, The (1993)} => {Forrest Gump (1994)} 0.2505837 0.7237614 1.633188 1
## [2] {Fugitive, The (1993)} => {Silence of the Lambs, The (1991)} 0.2504970 0.7235109 1.651588 2
## [3] {Braveheart (1995)} => {Forrest Gump (1994)} 0.2532366 0.7051305 1.591147 3
## [4] {Terminator 2: Judgment Day (1991)} => {Jurassic Park (1993)} 0.2591278 0.7356303 1.915810 4
## [5] {Jurassic Park (1993)} => {Terminator 2: Judgment Day (1991)} 0.2591278 0.6748494 1.915810 4
## [6] {Jurassic Park (1993)} => {Forrest Gump (1994)} 0.2859311 0.7446536 1.680332 5
## [7] {Forrest Gump (1994)} => {Jurassic Park (1993)} 0.2859311 0.6452118 1.680332 5
## [8] {Jurassic Park (1993)} => {Silence of the Lambs, The (1991)} 0.2523547 0.6572101 1.500241 6
## [9] {Shawshank Redemption, The (1994)} => {Pulp Fiction (1994)} 0.3046241 0.6791401 1.499528 7
## [10] {Pulp Fiction (1994)} => {Shawshank Redemption, The (1994)} 0.3046241 0.6726039 1.499528 7
## [11] {Shawshank Redemption, The (1994)} => {Forrest Gump (1994)} 0.2812543 0.6270386 1.414931 8
## [12] {Forrest Gump (1994)} => {Shawshank Redemption, The (1994)} 0.2812543 0.6346584 1.414931 8
## [13] {Silence of the Lambs, The (1991)} => {Shawshank Redemption, The (1994)} 0.2944246 0.6720954 1.498394 9
## [14] {Shawshank Redemption, The (1994)} => {Silence of the Lambs, The (1991)} 0.2944246 0.6564011 1.498394 9
## [15] {Silence of the Lambs, The (1991)} => {Pulp Fiction (1994)} 0.3143898 0.7176707 1.584603 10
## [16] {Pulp Fiction (1994)} => {Silence of the Lambs, The (1991)} 0.3143898 0.6941665 1.584603 10
## [17] {Silence of the Lambs, The (1991)} => {Forrest Gump (1994)} 0.2862492 0.6534330 1.474490 11
## [18] {Forrest Gump (1994)} => {Silence of the Lambs, The (1991)} 0.2862492 0.6459295 1.474490 11
## [19] {Pulp Fiction (1994)} => {Forrest Gump (1994)} 0.2899430 0.6401883 1.444603 12
## [20] {Forrest Gump (1994)} => {Pulp Fiction (1994)} 0.2899430 0.6542646 1.444603 12
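These measures can be verified by hand. For rule [1], {Fugitive, The (1993)} => {Forrest Gump (1994)}, the supports reported by eclat above are sufficient:
# support values copied from the eclat output above
supp_both <- 0.2505837  # {Fugitive, The (1993), Forrest Gump (1994)}
supp_lhs  <- 0.3462242  # {Fugitive, The (1993)}
supp_rhs  <- 0.4431586  # {Forrest Gump (1994)}
supp_both / supp_lhs                # confidence: ~0.7238
(supp_both / supp_lhs) / supp_rhs   # lift: ~1.633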
The Apriori method, on the other hand, creates rules for chosen minimum support and minimum confidence thresholds. The higher both thresholds, the stronger the resulting rules, but also the fewer of them; suitable levels depend on the specifics of the data set and are usually found by trial and error.
Let’s look for rules with support higher than 0.17, and confidence higher than 0.90.
# creating rules - standard settings
rules.trans1<-apriori(trans1, parameter=list(supp=0.17, conf=0.9))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.17 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 23517
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.53s].
## sorting and recoding items ... [60 item(s)] done [0.07s].
## creating transaction tree ... done [0.11s].
## checking subsets of size 1 2 3 4 done [0.20s].
## writing ... [2 rule(s)] done [0.00s].
## creating S4 object ... done [0.04s].
One can display the rules and sort them by different measures, such as confidence, lift, count and support.
rules.by.conf<-sort(rules.trans1, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
# sorting rules by lift
rules.by.lift<-sort(rules.trans1, by="lift", decreasing=TRUE) # sorting by lift
inspect(head(rules.by.lift))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
## [2] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
# sorting rules by count
rules.by.count<- sort(rules.trans1, by="count", decreasing=TRUE) # sorting by count
inspect(head(rules.by.count))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
# sorting by support
rules.by.supp<-sort(rules.trans1, by="support", decreasing=TRUE)
inspect(head(rules.by.supp))
## lhs rhs support confidence coverage lift count
## [1] {Lord of the Rings: The Return of the King, The (2003),
## Lord of the Rings: The Two Towers, The (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1738458 0.9488677 0.1832139 3.788378 24050
## [2] {Lord of the Rings: The Fellowship of the Ring, The (2001),
## Lord of the Rings: The Return of the King, The (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1738458 0.9284280 0.1872475 4.095522 24050
Looking at the previous results, we can draw some interesting conclusions about users' preferences. Yet some of the rules may be redundant, meaning that a more general rule with the same or higher confidence exists. Redundant rules are undesirable: they complicate the interpretation of the results and may confuse the recommendation system (the problem of deciding which rule is better).
For the sake of a clearer analysis, redundant rules will be removed from now on; the pruning pattern is sketched below.
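A minimal sketch of that pruning step, based on arules' is.redundant():
# drop rules for which a more general rule with at least the same confidence exists
rules.pruned <- rules.trans1[!is.redundant(rules.trans1)]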
Apriori rules can be analyzed from two sides: by fixing the right-hand side (RHS) of a rule to a chosen movie, or by fixing its left-hand side (LHS).
In this section, the RHS approach will be applied to three different movies. Comparing the results will show whether the rules reflect users' preferences. The chosen movies are: Harry Potter and the Chamber of Secrets, Finding Nemo and The Dark Knight - each from a different genre. Let's see what the apriori algorithm proposes.
## Harry Potter
rules.HP<-apriori(data=trans1, parameter=list(supp=0.03, conf=0.85), appearance=list(default="lhs", rhs="Harry Potter and the Chamber of Secrets (2002)"), control=list(verbose=F))
# removing redundant rules
rules.HP <- rules.HP[!is.redundant(rules.HP)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.HP, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Goblet of Fire (2005),
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} => {Harry Potter and the Chamber of Secrets (2002)} 0.04192539 0.8700870 0.04818528 9.759889 5800
## [2] {Harry Potter and the Prisoner of Azkaban (2004),
## Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} => {Harry Potter and the Chamber of Secrets (2002)} 0.05401146 0.8623197 0.06263508 9.672761 7472
rules.Nemo<-apriori(data=trans1, parameter=list(supp=0.06, conf=0.85), appearance=list(default="lhs", rhs="Finding Nemo (2003)"), control=list(verbose=F))
# removing redundant rules
rules.Nemo <- rules.Nemo[!is.redundant(rules.Nemo)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.Nemo, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Incredibles, The (2004),
## Monsters, Inc. (2001),
## Shrek (2001)} => {Finding Nemo (2003)} 0.06359648 0.8715206 0.07297186 5.549180 8798
## [2] {Incredibles, The (2004),
## Lord of the Rings: The Fellowship of the Ring, The (2001),
## Monsters, Inc. (2001)} => {Finding Nemo (2003)} 0.06067616 0.8594246 0.07060091 5.472162 8394
## [3] {Incredibles, The (2004),
## Matrix, The (1999),
## Monsters, Inc. (2001)} => {Finding Nemo (2003)} 0.06151466 0.8551904 0.07193095 5.445202 8510
## [4] {Monsters, Inc. (2001),
## Pirates of the Caribbean: The Curse of the Black Pearl (2003),
## Toy Story (1995)} => {Finding Nemo (2003)} 0.06095084 0.8531822 0.07143941 5.432415 8432
## [5] {Incredibles, The (2004),
## Shrek (2001),
## Toy Story (1995)} => {Finding Nemo (2003)} 0.06129058 0.8521608 0.07192372 5.425911 8479
rules.knight<-apriori(data=trans1, parameter=list(supp=0.03, conf=0.85), appearance=list(default="lhs", rhs="Dark Knight, The (2008)"), control=list(verbose=F))
# removing redundant rules
rules.knight <- rules.knight[!is.redundant(rules.knight)]
# sorting and generating rules
rules.rating.byconf<-sort(rules.knight, by="confidence", decreasing=TRUE)
inspect(head(rules.rating.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Batman Begins (2005),
## District 9 (2009)} => {Dark Knight, The (2008)} 0.03079347 0.9426864 0.03266566 6.699830 4260
## [2] {Batman Begins (2005),
## Inglourious Basterds (2009)} => {Dark Knight, The (2008)} 0.03436436 0.9425059 0.03646063 6.698547 4754
## [3] {Batman Begins (2005),
## Star Trek (2009)} => {Dark Knight, The (2008)} 0.03272349 0.9324408 0.03509444 6.627012 4527
## [4] {Batman Begins (2005),
## Inception (2010)} => {Dark Knight, The (2008)} 0.04103628 0.9312664 0.04406503 6.618666 5677
## [5] {Batman Begins (2005),
## Up (2009)} => {Dark Knight, The (2008)} 0.03137176 0.9228152 0.03399571 6.558602 4340
## [6] {Batman Begins (2005),
## WALL·E (2008)} => {Dark Knight, The (2008)} 0.04181696 0.9169440 0.04560470 6.516874 5785
The results show that the association rules are in fact reasonable, pointing to similar movies. Even for "Finding Nemo", the algorithm found other animated family movies. Some divergent movies may still appear, however, as the algorithm captures co-occurrence of ratings, not the causality behind the choices.
Fixing the LHS is the second way of analyzing the apriori results. Let's look at the same three movies, but from the LHS perspective.
## Harry Potter
rules.HP<-apriori(data=trans1, parameter=list(supp=0.04,conf = 0.08),
appearance=list(default="rhs",lhs="Harry Potter and the Chamber of Secrets (2002)"), control=list(verbose=F))
# removing redundant rules
rules.HP <- rules.HP[!is.redundant(rules.HP)]
rules.HP.byconf<-sort(rules.HP, by="confidence", decreasing=TRUE)
inspect(head(rules.HP.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Chamber of Secrets (2002)} => {Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} 0.07221287 0.8100219 0.08914928 7.593633 9990
## [2] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.07152616 0.8023190 0.08914928 3.203279 9895
## [3] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Two Towers, The (2002)} 0.06903955 0.7744263 0.08914928 3.416183 9551
## [4] {Harry Potter and the Chamber of Secrets (2002)} => {Matrix, The (1999)} 0.06783239 0.7608854 0.08914928 2.184940 9384
## [5] {Harry Potter and the Chamber of Secrets (2002)} => {Shrek (2001)} 0.06592406 0.7394794 0.08914928 3.516804 9120
## [6] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Return of the King, The (2003)} 0.06265677 0.7028298 0.08914928 3.329345 8668
rules.Nemo<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.7),
appearance=list(default="rhs",lhs="Finding Nemo (2003)"), control=list(verbose=F))
# removing redundant rules
rules.Nemo <- rules.Nemo[!is.redundant(rules.Nemo)]
rules.Nemo.byconf<-sort(rules.Nemo, by="confidence", decreasing=TRUE)
inspect(head(rules.Nemo.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Finding Nemo (2003)} => {Matrix, The (1999)} 0.1175212 0.7482855 0.1570539 2.148758 16258
## [2] {Finding Nemo (2003)} => {Shrek (2001)} 0.1158008 0.7373314 0.1570539 3.506589 16020
## [3] {Finding Nemo (2003)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1138853 0.7251346 0.1570539 2.895118 15755
## [4] {Finding Nemo (2003)} => {Forrest Gump (1994)} 0.1112468 0.7083353 0.1570539 1.598379 15390
## [5] {Finding Nemo (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1107481 0.7051595 0.1570539 3.110630 15321
## The Dark Knight
rules.knight<-apriori(data=trans1, parameter=list(supp=0.04,conf = 0.08),
appearance=list(default="rhs",lhs="Dark Knight, The (2008)"), control=list(verbose=F))
# removing redundant rules
rules.knight <- rules.knight[!is.redundant(rules.knight)]
rules.knight.byconf<-sort(rules.knight, by="confidence", decreasing=TRUE)
inspect(head(rules.knight.byconf))
## lhs rhs support confidence coverage lift count
## [1] {Dark Knight, The (2008)} => {Matrix, The (1999)} 0.10260154 0.7292063 0.140703 2.093971 14194
## [2] {Dark Knight, The (2008)} => {Fight Club (1999)} 0.09956557 0.7076291 0.140703 2.590614 13774
## [3] {Dark Knight, The (2008)} => {Shawshank Redemption, The (1994)} 0.09456343 0.6720781 0.140703 1.498355 13082
## [4] {Dark Knight, The (2008)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.08913482 0.6334960 0.140703 2.529249 12331
## [5] {Dark Knight, The (2008)} => {Lord of the Rings: The Return of the King, The (2003)} 0.08700964 0.6183920 0.140703 2.929358 12037
## [6] {Dark Knight, The (2008)} => {Pulp Fiction (1994)} 0.08690844 0.6176727 0.140703 1.363809 12023
Another way of analyzing association rules is to keep only rules that are significant according to Fisher's exact test. Let's see which significant rules have the highest confidence:
significant <- is.significant(rules.HP, trans1)
inspect(head(sort(rules.HP[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Harry Potter and the Chamber of Secrets (2002)} => {Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)} 0.07221287 0.8100219 0.08914928 7.593633 9990
## [2] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.07152616 0.8023190 0.08914928 3.203279 9895
## [3] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Two Towers, The (2002)} 0.06903955 0.7744263 0.08914928 3.416183 9551
## [4] {Harry Potter and the Chamber of Secrets (2002)} => {Matrix, The (1999)} 0.06783239 0.7608854 0.08914928 2.184940 9384
## [5] {Harry Potter and the Chamber of Secrets (2002)} => {Shrek (2001)} 0.06592406 0.7394794 0.08914928 3.516804 9120
## [6] {Harry Potter and the Chamber of Secrets (2002)} => {Lord of the Rings: The Return of the King, The (2003)} 0.06265677 0.7028298 0.08914928 3.329345 8668
significant <- is.significant(rules.Nemo, trans1)
inspect(head(sort(rules.Nemo[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Finding Nemo (2003)} => {Matrix, The (1999)} 0.1175212 0.7482855 0.1570539 2.148758 16258
## [2] {Finding Nemo (2003)} => {Shrek (2001)} 0.1158008 0.7373314 0.1570539 3.506589 16020
## [3] {Finding Nemo (2003)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.1138853 0.7251346 0.1570539 2.895118 15755
## [4] {Finding Nemo (2003)} => {Forrest Gump (1994)} 0.1112468 0.7083353 0.1570539 1.598379 15390
## [5] {Finding Nemo (2003)} => {Lord of the Rings: The Two Towers, The (2002)} 0.1107481 0.7051595 0.1570539 3.110630 15321
significant <- is.significant(rules.knight, trans1)
inspect(head(sort(rules.knight[significant == TRUE], by="confidence", decreasing=TRUE)))
## lhs rhs support confidence coverage lift count
## [1] {Dark Knight, The (2008)} => {Matrix, The (1999)} 0.10260154 0.7292063 0.140703 2.093971 14194
## [2] {Dark Knight, The (2008)} => {Fight Club (1999)} 0.09956557 0.7076291 0.140703 2.590614 13774
## [3] {Dark Knight, The (2008)} => {Shawshank Redemption, The (1994)} 0.09456343 0.6720781 0.140703 1.498355 13082
## [4] {Dark Knight, The (2008)} => {Lord of the Rings: The Fellowship of the Ring, The (2001)} 0.08913482 0.6334960 0.140703 2.529249 12331
## [5] {Dark Knight, The (2008)} => {Lord of the Rings: The Return of the King, The (2003)} 0.08700964 0.6183920 0.140703 2.929358 12037
## [6] {Dark Knight, The (2008)} => {Pulp Fiction (1994)} 0.08690844 0.6176727 0.140703 1.363809 12023
The results are the same as those obtained before, which implies that all of the previous rules were significant and that conclusions may be drawn from them.
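If needed, the underlying p-values can also be inspected directly - a sketch assuming arules' interestMeasure() with the "fishersExactTest" measure (is.significant() additionally applies a multiple-testing correction):
# raw Fisher's exact test p-values, one per rule
pvals <- interestMeasure(rules.knight, measure="fishersExactTest", transactions=trans1)
summary(pvals)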
Another useful analysis of the obtained results is to find closed frequent itemsets. These are itemsets whose support is strictly higher than the support of any of their supersets, and at least as high as the minimum support fixed at the beginning of the analysis.
Closed frequent itemsets can be found with the apriori algorithm in the following way.
trans1.closed<-apriori(trans1, parameter=list(target="closed frequent itemsets", support=0.1))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 closed frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.53s].
## sorting and recoding items ... [159 item(s)] done [0.13s].
## creating transaction tree ... done [0.12s].
## checking subsets of size 1 2 3 4 5 6 done [2.04s].
## filtering closed item sets ... done [0.00s].
## sorting transactions ... done [0.04s].
## writing ... [4864 set(s)] done [0.00s].
## creating S4 object ... done [0.04s].
inspect(head(sort(trans1.closed, by="support", decreasing=TRUE)))
## items support transIdenticalToItemsets count
## [1] {Pulp Fiction (1994)} 0.4529026 0.0009830780 62655
## [2] {Shawshank Redemption, The (1994)} 0.4485438 0.0032672888 62052
## [3] {Forrest Gump (1994)} 0.4431586 0.0011782480 61307
## [4] {Silence of the Lambs, The (1991)} 0.4380697 0.0010698202 60603
## [5] {Jurassic Park (1993)} 0.3839787 0.0006144238 53120
## [6] {Braveheart (1995)} 0.3591343 0.0006361093 49683
Additionally, one can also obtain the results with the eclat function.
freq.closed<-eclat(trans1, parameter=list(supp=0.15, maxlen=15, target="closed frequent itemsets"))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 1 15 closed frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 20751
##
## create itemset ...
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.33s].
## sorting and recoding items ... [83 item(s)] done [0.09s].
## creating bit matrix ... [83 row(s), 138341 column(s)] done [0.03s].
## writing ... [497 set(s)] done [0.02s].
## Creating S4 object ... done [0.00s].
inspect(head(sort(freq.closed, by="support", decreasing=TRUE)))
## items support transIdenticalToItemsets count
## [1] {Pulp Fiction (1994)} 0.4529026 62655 62655
## [2] {Shawshank Redemption, The (1994)} 0.4485438 62052 62052
## [3] {Forrest Gump (1994)} 0.4431586 61307 61307
## [4] {Silence of the Lambs, The (1991)} 0.4380697 60603 60603
## [5] {Jurassic Park (1993)} 0.3839787 53120 53120
## [6] {Braveheart (1995)} 0.3591343 49683 49683
Checking similarity and dissimilarity measures makes it possible to find movies that rarely appear together in transactions. One such measure is the Jaccard distance; its values are shown in the matrix below. The higher the value, the less two movies overlap across transactions.
# Restrict to the most popular movies (support > 0.36) and compute
# pairwise Jaccard distances between the items
diss<-trans1[,itemFrequency(trans1)>0.36]
d.jac.i<-dissimilarity(diss, which="items")
round(d.jac.i,2)
## Forrest Gump (1994) Jurassic Park (1993)
## Jurassic Park (1993) 0.47
## Pulp Fiction (1994) 0.52 0.58
## Shawshank Redemption, The (1994) 0.54 0.61
## Silence of the Lambs, The (1991) 0.52 0.56
## Pulp Fiction (1994)
## Jurassic Park (1993)
## Pulp Fiction (1994)
## Shawshank Redemption, The (1994) 0.49
## Silence of the Lambs, The (1991) 0.45
## Shawshank Redemption, The (1994)
## Jurassic Park (1993)
## Pulp Fiction (1994)
## Shawshank Redemption, The (1994)
## Silence of the Lambs, The (1991) 0.50
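These entries can be reproduced from the counts reported by eclat earlier. For example, for Forrest Gump and Jurassic Park:
# counts copied from the frequent-itemset output above
n_fg   <- 61307  # users who liked Forrest Gump (1994)
n_jp   <- 53120  # users who liked Jurassic Park (1993)
n_both <- 39556  # users who liked both
1 - n_both/(n_fg + n_jp - n_both)  # Jaccard distance: ~0.47, as in the matrix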
Association rules can also be visualized in various ways.
For instance, one can see the most frequent items in the dataset:
itemFrequencyPlot(trans1, topN=10, type="relative", main="Item Frequency")
One can also look at the relations between the rule quality measures - support, lift and confidence:
trans1<-read.transactions("user_rating.csv", format="single", sep=",", cols=c("userId","title"), header=TRUE)
rules.trans1<-apriori(trans1, parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15703 item(s), 138341 transaction(s)] done [5.37s].
## sorting and recoding items ... [159 item(s)] done [0.12s].
## creating transaction tree ... done [0.13s].
## checking subsets of size 1 2 3 4 5 6 done [2.70s].
## writing ... [13056 rule(s)] done [0.00s].
## creating S4 object ... done [0.05s].
plot(sort(rules.trans1, by = "confidence", decreasing = TRUE), measure=c("support","lift"),shading="confidence")
plot(sort(rules.trans1, by = "confidence", decreasing = TRUE), measure=c("support","confidence"),shading="confidence")
Moreover, the relation between support and confidence can be enriched with the number of items in each rule, encoded by color (the "order" in the legend).
plot(rules.trans1, shading="order", control=list(main="Two-key plot"))
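arulesViz offers further visualization methods as well - a sketch using its documented "graph" and "grouped" methods:
top.rules <- head(sort(rules.trans1, by="lift", decreasing=TRUE), 20)
plot(top.rules, method="graph")    # strongest rules as a network of items
plot(top.rules, method="grouped")  # grouped matrix: LHS groups vs RHS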
Association rule mining does not have to be performed on individual movie titles. Ratings can also be aggregated at a higher level, i.e. by movie genre. With the following analysis, let's see which movie genres go together.
First, we need to prepare the data and merge the rating table with the movie table, which contains the genres for each title.
# Matching movie titles with their genres
user_rating <- merge(user_rating,movie,by = "title")
user_rating <- user_rating[,-c(3)]   # drop movieId
hierarchy <- user_rating[,c(2,3)]    # keep userId and genres
hierarchy <- hierarchy[!duplicated(hierarchy), ]
hierarchy <- hierarchy[order(hierarchy$userId),]
write.csv(hierarchy, file="hierarchy.csv")
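Note that each item written here is a movie's full genre combination (e.g. Comedy|Romance). If rules between single genres were wanted instead, the pipe-separated strings could be split first - a sketch assuming tidyr's separate_rows() (the output file name is hypothetical):
library(tidyr)
single <- separate_rows(hierarchy, genres, sep="\\|")  # one row per single genre
single <- single[!duplicated(single), ]
write.csv(single, file="hierarchy_single.csv")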
After this transformation, the same apriori algorithm can be applied, but to movie genres instead of movie titles.
trans2<-read.transactions("hierarchy.csv", format="single", sep=",", cols=c("userId","genres"), header=TRUE) # reading the file as transactions
rules.trans2_lev2<-apriori(trans2, parameter=list(supp=0.1, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.1 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 13834
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[1035 item(s), 138341 transaction(s)] done [1.52s].
## sorting and recoding items ... [132 item(s)] done [0.13s].
## creating transaction tree ... done [0.12s].
## checking subsets of size 1 2 3 4 done [14.57s].
## writing ... [656078 rule(s)] done [0.09s].
## creating S4 object ... done [0.19s].
rules.by.conf<-sort(rules.trans2_lev2, by="confidence", decreasing=TRUE)
inspect(head(rules.by.conf))
## lhs rhs support confidence coverage lift count
## [1] {Adventure|Animation|Children|Comedy|Musical,
## Adventure|Drama|IMAX,
## Animation|Children|Fantasy|Musical|Romance|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1272508 0.9996593 0.1272941 1.879478 17604
## [2] {Adventure|Animation|Children|Comedy|Musical,
## Adventure|Animation|Children|Drama|Musical|IMAX,
## Adventure|Drama|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1234341 0.9994732 0.1234992 1.879128 17076
## [3] {Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX,
## Comedy|Crime|Drama|Thriller} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1272580 0.9994323 0.1273303 1.879051 17605
## [4] {Action|Adventure|Comedy|Romance|Thriller,
## Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1100831 0.9994094 0.1101481 1.879008 15229
## [5] {Adventure|Animation|Children|Comedy|Musical,
## Animation|Children|Fantasy|Musical|Romance|IMAX,
## Thriller} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1361852 0.9993635 0.1362720 1.878922 18840
## [6] {Action|Adventure|Comedy|Romance|Thriller,
## Adventure|Animation|Children|Comedy|Musical,
## Adventure|Drama|IMAX} => {Adventure|Animation|Children|Comedy|Fantasy} 0.1192127 0.9993335 0.1192922 1.878865 16492
To sum up, using association rules to find relations between watched movies brought promising results. Although association analysis is mostly used for product transactions, this research showed that it can be applied in other areas as well. There is still much to discover in the field of association rules, yet even now it yields interesting results.