The aim of this paper is to determine possible movie recommendations based on users’ ratings with the help of association rules. Association rule mining is a non-linear, unsupervised learning technique used to examine and establish relations between variables in large data sets, locate patterns in behaviour and more (Djenouri et al., 2018). Its most popular application is market basket analysis, which identifies products that are sold together. Association rules are also often used in recommendation systems, and this is the application we are going to carry out.
Three common measures are used to assess the quality of association rules: support, confidence and lift.
Simply put, support is a measure of how often the joint itemset appears in the database in use.
Confidence is a percentage value showing how frequently the rule’s head appears among all the groups that contain the rule’s body. It indicates how reliable such a rule is (IBM, 2021a). The higher the confidence, the stronger the rule.
X itemset - antecedent
Y itemset - consequent
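Lift is the ratio of the confidence of the rule to its expected confidence. The higher it is, the higher the chance of co-occurrence of X and Y. Lift can take any value between 0 and infinity: a value below 1 indicates a negative dependency between X and Y, a value of 1 indicates independence, and a value above 1 indicates a positive dependency.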
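Written out, the three measures take the following standard form. The notation is an assumption introduced here for illustration: N is the total number of transactions and count(·) is the number of transactions containing a given itemset.

```latex
\begin{align}
\operatorname{supp}(X \cup Y) &= \frac{\operatorname{count}(X \cup Y)}{N} \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
  = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align}
```

The last identity shows why lift is symmetric in X and Y, while confidence is not.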
Now, let’s move on to the dataset preparation.
The data is derived from Kaggle and consists of two datasets: movies and ratings (Karthik, 2020). It covers movies across a wide spectrum of genres, from “Toy Story” to “Inception”, and a broad group of users. Since association rules are well suited to large data sets, let’s use one that can be considered as such: our movies list has over 30,000 titles, and the ratings count surpasses 2 million.
First, let’s investigate the movies dataset.
#Loading all the packages first
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:arules':
##
## intersect, recode, setdiff, setequal, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
library(Matrix)
library(stringr)
library(stringdist)
library(ggplot2)
library(arulesViz)
## Loading required package: grid
movies <- read.csv("movies.csv", stringsAsFactors = FALSE)
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
As we can see, we have 3 variables: movieId, title and genres. First, let us separate the year of release from the actual title for better clarity of the dataset.
movies$year <- as.numeric(str_sub(str_trim(movies$title), start = -5, end = -2))
## Warning: NAs introduced by coercion
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres year
## 1 Adventure|Animation|Children|Comedy|Fantasy 1995
## 2 Adventure|Children|Fantasy 1995
## 3 Comedy|Romance 1995
## 4 Comedy|Drama|Romance 1995
## 5 Comedy 1995
## 6 Action|Crime|Thriller 1995
Now we have a fourth variable, year. However, some movies have no release year in the expected position, so the coercion introduced NA values into the dataset. We will now remove those movies, and thereby the NAs, before the analysis.
movies_without_year <- which(is.na(movies$year))
movies$title[movies_without_year]
## [1] "Babylon 5"
## [2] "Millions Game, The (Das Millionenspiel)"
## [3] "Bicycle, Spoon, Apple (Bicicleta, cullera, poma)"
## [4] "Mona and the Time of Burning Love (Mona ja palavan rakkauden aika) (1983))"
## [5] "Diplomatic Immunity (2009– )"
## [6] "Big Bang Theory, The (2007-)"
## [7] "Brazil: In the Shadow of the Stadiums"
## [8] "Slaying the Badger"
## [9] "Tatort: Im Schmerz geboren"
## [10] "National Theatre Live: Frankenstein"
## [11] "The Court-Martial of Jackie Robinson"
## [12] "In Our Garden"
## [13] "Stephen Fry In America - New World"
## [14] "Two: The Story of Roman & Nyro"
## [15] "Li'l Quinquin"
## [16] "A Year Along the Abandoned Road"
## [17] "Body/Cialo"
## [18] "Polskie gówno"
## [19] "The Third Reich: The Rise & Fall"
## [20] "My Own Man"
## [21] "Moving Alan"
## [22] "Michael Laudrup - en Fodboldspiller"
## [23] "Doli Saja Ke Rakhna"
## [24] "The Dead Lands"
## [25] "C'mon, Let's Live a Little"
## [26] "For a Book of Dollars"
## [27] "Bad Boys 3"
## [28] "Señorita Justice"
## [29] "Red Victoria"
## [30] "Vaastupurush"
## [31] "Sierra Leone's Refugee All Stars"
## [32] "L'uomo della carità"
## [33] "Filmage: The Story of Descendents/All"
## [34] "About Sarah"
## [35] "Swallows and Amazons"
## [36] "Ready Player One"
## [37] "Los tontos y los estúpidos"
## [38] "The Naked Truth (1957) (Your Past Is Showing)"
## [39] "Disaster Playground"
## [40] "Nice Guy"
## [41] "OMG, I'm a Robot!"
## [42] "KillerSaurus"
## [43] "Viva"
## [44] "Ollaan vapaita"
## [45] "Fakta Ladh Mhana"
## [46] "Sentimentalnyy roman"
## [47] "Yedyanchi Jatra"
## [48] "Dhadakebaaz"
## [49] "Ittefaq"
## [50] "Elämältä kaiken sain"
## [51] "Dil Kya Kare"
## [52] "Hogi Pyar Ki Jeet"
## [53] "Monk by Blood"
## [54] "I Am Syd Stone"
## [55] "Alone With People"
## [56] "38 Parrots"
## [57] "The Adventures of Sherlock Holmes and Doctor Watson"
## [58] "The Adventures of Sherlock Holmes and Doctor Watson: The Treasures of Agra"
## [59] "The Republic "
## [60] "A Fare to Remember"
## [61] "The Code"
## [62] "101次求婚"
## [63] "S: Saigo no Keikan - Dakkan: Recovery of Our Future"
## [64] "Vrijdag"
## [65] "Aimy in a Cage"
## [66] "Trophy Kids"
## [67] "Jasne Błękitne Okna"
## [68] "Mr. Kuka's Advice"
## [69] "Hundra"
We have 69 titles without a usable release year. Let’s discard them from the dataset.
movies <- movies[-movies_without_year, ]
#Checking for NAs one more time as a precaution
sum(is.na(movies))
## [1] 0
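Some of the discarded titles do contain a year, just not in the last characters of the string (for example a doubled closing bracket, or an alternative title after the year). As a hedged alternative to the fixed-position extraction, the sketch below searches for the first four-digit number in parentheses. The helper name `extract_year` and the sample titles vector are introduced here for illustration only and are not part of the original analysis.

```r
# Hypothetical helper: return the first "(dddd)" group found in each title,
# or NA when the title has no such group.
extract_year <- function(title) {
  hit <- as.integer(regexpr("\\([0-9]{4}\\)", title))  # start position, -1 if none
  found <- hit > 0
  year <- rep(NA_real_, length(title))
  # Skip the opening bracket and take the four digits that follow it
  year[found] <- as.numeric(substr(title[found], hit[found] + 1, hit[found] + 4))
  year
}

titles <- c("Toy Story (1995)",
            "Babylon 5",
            "The Naked Truth (1957) (Your Past Is Showing)",
            "Big Bang Theory, The (2007-)")
years <- extract_year(titles)
years
```

Open-ended ranges such as "(2007-)" still come back as NA, and a title that itself contains a bracketed four-digit number would be misread, so the result would need a manual check before use.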
Now we have no NAs, and we can strip the year from the title variable so that only the title itself remains.
movies$title <- str_sub(str_trim(movies$title), start = 1, end = -8)
As you can see, the movies dataset also has the genres variable. However, multiple genres are stacked into one cell. Let’s look at the unique genres in the table. But first we need to extract the individual genres from the stacked cells.
uniqueGenres <- unique(unlist(str_split(movies$genres, "\\|")))
uniqueGenres
## [1] "Adventure" "Animation" "Children"
## [4] "Comedy" "Fantasy" "Romance"
## [7] "Drama" "Action" "Crime"
## [10] "Thriller" "Horror" "Mystery"
## [13] "Sci-Fi" "IMAX" "Documentary"
## [16] "War" "Musical" "Western"
## [19] "Film-Noir" "(no genres listed)"
We have 20 different genres. However, we can see that two of them are not really genre-defining: “(no genres listed)” and “IMAX”.
Let’s see how many entries we have with no genres and discard them from the dataset.
movies %>% filter(str_detect(genres, "(no genres listed)")) %>% nrow()
## [1] 1104
movies <- movies[! movies$genres == "(no genres listed)", ]
#Check how many entries without genres we have now
movies %>% filter(str_detect(genres, "(no genres listed)")) %>% nrow()
## [1] 0
uniqueGenres <- uniqueGenres[! uniqueGenres == "IMAX"]
uniqueGenres
## [1] "Adventure" "Animation" "Children"
## [4] "Comedy" "Fantasy" "Romance"
## [7] "Drama" "Action" "Crime"
## [10] "Thriller" "Horror" "Mystery"
## [13] "Sci-Fi" "Documentary" "War"
## [16] "Musical" "Western" "Film-Noir"
## [19] "(no genres listed)"
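Now we have 18 actual genres left. Note that the “(no genres listed)” label is still in uniqueGenres, but since no remaining movie carries it, its indicator column will contain only zeros. We will create a binary variable for each genre, marking whether a movie belongs to it. After that, we can drop the genres variable from the movies data set.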
for(genre in uniqueGenres) {
movies[str_c("genre_", genre)] = ifelse(str_detect(movies$genres, genre), 1, 0)
}
head(movies, 5)
## movieId title
## 1 1 Toy Story
## 2 2 Jumanji
## 3 3 Grumpier Old Men
## 4 4 Waiting to Exhale
## 5 5 Father of the Bride Part II
## genres year genre_Adventure
## 1 Adventure|Animation|Children|Comedy|Fantasy 1995 1
## 2 Adventure|Children|Fantasy 1995 1
## 3 Comedy|Romance 1995 0
## 4 Comedy|Drama|Romance 1995 0
## 5 Comedy 1995 0
## genre_Animation genre_Children genre_Comedy genre_Fantasy genre_Romance
## 1 1 1 1 1 0
## 2 0 1 0 1 0
## 3 0 0 1 0 1
## 4 0 0 1 0 1
## 5 0 0 1 0 0
## genre_Drama genre_Action genre_Crime genre_Thriller genre_Horror
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 1 0 0 0 0
## 5 0 0 0 0 0
## genre_Mystery genre_Sci-Fi genre_Documentary genre_War genre_Musical
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## genre_Western genre_Film-Noir genre_(no genres listed)
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
#Discarding the genres variable
movies <- select(movies, -genres)
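The indicator columns above are built by pattern matching, where the genre name is interpreted as a regular expression (so the parentheses in “(no genres listed)” act as a group rather than as literal characters). A minimal alternative sketch, on a made-up three-row vector, splits each cell on the literal "|" and tests exact membership instead. All object names here are illustrative assumptions.

```r
# Toy genre strings in the same "A|B|C" layout as the movies dataset
genres <- c("Adventure|Animation|Children", "Comedy|Romance", "Comedy")

# Split on the literal pipe character (fixed = TRUE disables regex parsing)
genre_list <- strsplit(genres, "|", fixed = TRUE)
all_genres <- sort(unique(unlist(genre_list)))

# One row per movie, one 0/1 column per genre, using exact matching
one_hot <- t(vapply(genre_list,
                    function(g) as.integer(all_genres %in% g),
                    integer(length(all_genres))))
colnames(one_hot) <- paste0("genre_", all_genres)
one_hot
```

Exact membership also avoids partial matches, should one genre name ever be a substring of another.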
Let’s see the genre distribution now.
genreDist <- colSums(movies[, 4:21])
genreDistDF <- data.frame(genre = names(genreDist),count = genreDist)
genreDistDF$genre <- str_sub(str_trim(genreDistDF$genre), start = 7, end = -1)
genreDistDF
## genre count
## genre_Adventure Adventure 2762
## genre_Animation Animation 1386
## genre_Children Children 1607
## genre_Comedy Comedy 10115
## genre_Fantasy Fantasy 1689
## genre_Romance Romance 4875
## genre_Drama Drama 15765
## genre_Action Action 4441
## genre_Crime Crime 3443
## genre_Thriller Thriller 5297
## genre_Horror Horror 3363
## genre_Mystery Mystery 1836
## genre_Sci-Fi Sci-Fi 2152
## genre_Documentary Documentary 3033
## genre_War War 1345
## genre_Musical Musical 1051
## genre_Western Western 779
## genre_Film-Noir Film-Noir 338
ggplot(genreDistDF, aes(x = reorder(genre, -count), y = count, fill = genre)) + geom_bar(stat = "identity") + ggtitle("Distribution of Genres") + theme(legend.position = "none", axis.text.x = element_text(angle = 90)) + xlab("Genre") + ylab("Count")
Now we can see the distribution of genres in our dataset. The majority of the movies are in the drama or comedy genre. After comedy, there is a substantial decrease in the count of movies.
Since we now have a better understanding of the movies dataset and it is in the correct format, let’s move on to the second dataset: ratings.
After reading it, we can see that the ratings dataset consists of over 2 million observations; let’s reduce it to 1,000,000 observations for smoother work. Then we will drop the unnecessary variables, timestamp and rating, as the rating value itself holds little interest for this study. The act of rating is more important, as it shows an interest in watching a given movie rather than necessarily liking it. For an appeal-based recommendation, we would need another project :).
ratings <- read.csv("ratings.csv")
#Choosing the first 1,000,000 observations
ratings <- ratings[1:1000000, ]
ratings <- select(ratings, userId, movieId)
head(ratings, 5)
## userId movieId
## 1 1 169
## 2 1 2471
## 3 1 48516
## 4 2 2571
## 5 2 109487
Now, let us align this dataset with the movies dataset by removing the ratings of movies that no longer exist in the final movies dataset.
ratings <- ratings %>% filter(! movieId %in% movies)
dim(ratings)
## [1] 1000000 2
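We can see that we still have 1,000,000 observations left. As written, the filter compares movieId against the whole movies data frame rather than its movieId column, and the negation keeps the non-matching rows, so nothing is removed. Keeping only the ratings of the remaining movies would require the condition movieId %in% movies$movieId; the results below are based on the unfiltered ratings.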
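The filter above is meant to keep only the ratings whose movieId is still present in movies. A minimal sketch of that intent on toy data follows; `movies_toy` and `ratings_toy` are made-up frames, and the comparison is against the movieId column rather than the data frame itself.

```r
# Toy data: movie 99 has been removed from the movies table
movies_toy  <- data.frame(movieId = c(1, 2, 3))
ratings_toy <- data.frame(userId  = c(1, 1, 2, 2),
                          movieId = c(1, 99, 3, 2))

# Keep ratings only for movies that still exist
kept <- ratings_toy[ratings_toy$movieId %in% movies_toy$movieId, ]
# dplyr equivalent: ratings_toy %>% filter(movieId %in% movies_toy$movieId)
kept
```

Applied to the real data, this would drop the ratings of the movies discarded earlier, so the transaction and item counts reported below would change somewhat.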
Now we can move on to frequent itemset mining with the Apriori algorithm. First, we need to build a user-item matrix with 1/0 values indicating whether or not a user has seen a movie. To represent the matrix, we will use the transactions class, a sparse format that avoids storing the zeros that would make up most of its elements.
matrix1 <- as(split(ratings[ , "movieId"], ratings[ , "userId"]), "transactions")
matrix1
## transactions in sparse format with
## 10790 transactions (rows) and
## 17292 items (columns)
Now, after establishing the matrix with 10,790 transactions (number of raters) and 17,292 items (number of movies), we can move on to finding frequent pairs of films that users watch.
We can assume that if items X and Y are often viewed together, there ought to be some underlying connection between them that helps to describe a viewer’s choice behaviour. Such a finding can help in recommending movie X if the user has watched item Y, and the other way around.
Now is the time to set the support and confidence values. Let’s set support to 0.01, so that a pair must be watched by about 1% of the 10,790 users (apriori reports an absolute minimum support count of 107), and confidence, the estimated probability that a user who watched film X has also seen film Y, to 75%. Following that, we can run Apriori with these parameters.
ruleParameters <- list(supp = 0.01, conf = 0.75, maxlen = 2)
associationRules <- apriori(matrix1, parameter = ruleParameters)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 107
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[17292 item(s), 10790 transaction(s)] done [0.52s].
## sorting and recoding items ... [1968 item(s)] done [0.03s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2
## Warning in apriori(matrix1, parameter = ruleParameters): Mining stopped (maxlen
## reached). Only patterns up to a length of 2 returned!
## done [0.35s].
## writing ... [5896 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
summary(associationRules)
## set of 5896 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 5896
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.7500 Min. :0.01057 Min. : 2.330
## 1st Qu.:0.01196 1st Qu.:0.7667 1st Qu.:0.01492 1st Qu.: 2.776
## Median :0.01576 Median :0.7875 Median :0.01974 Median : 3.750
## Mean :0.02286 Mean :0.7965 Mean :0.02885 Mean : 4.198
## 3rd Qu.:0.02382 3rd Qu.:0.8174 3rd Qu.:0.02966 3rd Qu.: 4.811
## Max. :0.20816 Max. :0.9561 Max. :0.27711 Max. :54.682
## count
## Min. : 108.0
## 1st Qu.: 129.0
## Median : 170.0
## Mean : 246.7
## 3rd Qu.: 257.0
## Max. :2246.0
##
## mining info:
## data ntransactions support confidence
## matrix1 10790 0.01 0.75
set.seed(240)
plot(associationRules, method = "graph", measure = "support", shading = "lift", main = "Association Rules Graph")
## Warning: plot: Too many rules supplied. Only plotting the best 100 rules using
## 'support' (change control parameter max if needed)
With this chunk of code we created 5,896 rules. From the summary we can also read the lift statistics for all the rules combined. Lift, as a reminder, is the dependency measure that compares how often X and Y occur together with what would be expected if they were independent. Our minimum lift is above 2, so all the rules show a positive dependency.
By looking at the plot, we can see that the rules cover a substantial number of items; however, the strongest associations are between movies 7153, 4993 and 5952.
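To make the support, confidence and lift values in the summary more concrete, here is a small worked example in base R on five made-up baskets. The baskets and the helper `has` are assumptions for illustration; arules computes the same quantities internally.

```r
# Five toy users and the items (movies) each of them rated
baskets <- list(u1 = c("A", "B"),
                u2 = c("A", "B", "C"),
                u3 = c("A"),
                u4 = c("B", "C"),
                u5 = c("C"))

# TRUE for every basket that contains all of the given items
has <- function(items) {
  vapply(baskets, function(basket) all(items %in% basket), logical(1))
}

supp_X  <- mean(has("A"))           # share of baskets with A
supp_Y  <- mean(has("B"))           # share of baskets with B
supp_XY <- mean(has(c("A", "B")))   # share of baskets with both

conf_XY <- supp_XY / supp_X         # confidence of the rule {A} => {B}
lift_XY <- conf_XY / supp_Y         # lift of the rule {A} => {B}
c(support = supp_XY, confidence = conf_XY, lift = lift_XY)
```

Here A and B each appear in 3 of 5 baskets and together in 2, so the lift is only slightly above 1, a much weaker dependency than the values above 2 seen in the movie rules.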
arulesViz::plotly_arules(associationRules, method = "matrix", measure=c("support","confidence"))
## Warning: 'arulesViz::plotly_arules' is deprecated.
## Use 'plot' instead.
## See help("Deprecated")
## Warning: plot: Too many rules supplied. Only plotting the best 1000 rules using
## lift (change parameter max if needed)
Here we can see the associations between particular movies graphically, rule by rule, with an indication of the lift value. Because all our lift values are above 1 and a few of them are very high, only a small number of rules sit at the top of the lift spectrum.
Because we have such a large number of rules, let us keep only those whose lift is at or above the 3rd quartile (4.811).
associationRules <- subset(associationRules, lift >= 4.811)
summary(associationRules)
## set of 1474 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 1474
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01001 Min. :0.7500 Min. :0.01103 Min. : 4.813
## 1st Qu.:0.01140 1st Qu.:0.7680 1st Qu.:0.01446 1st Qu.: 5.191
## Median :0.01423 Median :0.7902 Median :0.01807 Median : 6.190
## Mean :0.01733 Mean :0.7974 Mean :0.02177 Mean : 6.826
## 3rd Qu.:0.01863 3rd Qu.:0.8177 3rd Qu.:0.02345 3rd Qu.: 7.339
## Max. :0.13411 Max. :0.9328 Max. :0.16636 Max. :54.682
## count
## Min. : 108.0
## 1st Qu.: 123.0
## Median : 153.5
## Mean : 186.9
## 3rd Qu.: 201.0
## Max. :1447.0
##
## mining info:
## data ntransactions support confidence
## matrix1 10790 0.01 0.75
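The cutoff of 4.811 used above was copied by hand from the summary. As a hedged sketch, the same kind of threshold can be computed from the data, shown here on a made-up data frame of lift values; with an arules rules object the lift vector would come from `quality(rules)$lift`.

```r
# Toy quality table standing in for the rule quality measures
rule_quality <- data.frame(rule = c("r1", "r2", "r3", "r4", "r5"),
                           lift = c(2.3, 2.8, 3.7, 4.8, 54.7),
                           stringsAsFactors = FALSE)

# Third quartile of lift, computed instead of typed in
cutoff <- quantile(rule_quality$lift, probs = 0.75, names = FALSE)

# Keep the rules at or above the cutoff
top_rules <- rule_quality[rule_quality$lift >= cutoff, ]
top_rules
```

Computing the threshold keeps the filter correct if the support or confidence settings are changed and the lift distribution shifts.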
associationRules <- as(associationRules, "data.frame")
tail(associationRules, 5)
## rules support confidence coverage lift count
## 5753 {2916} => {1580} 0.06830399 0.7790698 0.08767377 5.166664 737
## 5804 {2115} => {1291} 0.07367933 0.8103976 0.09091752 5.908236 795
## 5834 {1200} => {1214} 0.09434662 0.7965571 0.11844300 6.014592 1018
## 5845 {7153} => {5952} 0.13410565 0.8226265 0.16302132 4.944925 1447
## 5846 {5952} => {7153} 0.13410565 0.8061281 0.16635774 4.944925 1447
From the tail of the associationRules data frame, we can see that each rule is stored as a single string containing the movieId values of both sides. So let’s split them up.
rules <- sapply(associationRules$rules, function(x){
x = gsub("[\\{\\}]", "", regmatches(x, gregexpr("\\{.*\\}", x))[[1]])
x = gsub("=>",",",x)
x = str_replace_all(x," ","")
return( x )
})
rules <- as.character(rules)
rules <- str_split(rules, ",")
associationRules$movieLeftSide <- sapply( rules, "[[", 1)
associationRules$movieRightSide <- sapply( rules, "[[", 2)
associationRules$movieLeftSide <- as.numeric(associationRules$movieLeftSide)
associationRules$movieRightSide <- as.numeric(associationRules$movieRightSide)
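Because every rule here has exactly one item on each side (maxlen = 2), the same split can also be sketched more compactly by pulling the runs of digits out of each rule string. The vector `rule_strings` below is a made-up sample in the format shown above.

```r
# Sample rule labels in the "{lhs} => {rhs}" format
rule_strings <- c("{2916} => {1580}", "{7153} => {5952}")

# Extract every run of digits; each rule yields exactly two IDs
ids <- regmatches(rule_strings, gregexpr("[0-9]+", rule_strings))

left_ids  <- as.numeric(vapply(ids, function(x) x[1], character(1)))
right_ids <- as.numeric(vapply(ids, function(x) x[2], character(1)))
data.frame(movieLeftSide = left_ids, movieRightSide = right_ids)
```

This shortcut relies on single-item sides; rules with several antecedent items would need the sides to be separated before extracting the IDs.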
Now, let’s get rid of the rules variable.
associationRules$rules <- NULL
Now, we can join the two data frames: associationRules and movies. This gives us the titles, years and genres of the movies on the left-hand and right-hand sides of each rule.
associationRules <- associationRules %>% left_join(movies, by = c("movieLeftSide" = "movieId"))
associationRules$movieLeftSide <- NULL
columnNames <- colnames(associationRules)
columnNames[5] <- str_c("Left_", columnNames[5])
columnNames[7:25] <- str_c("Left_", columnNames[7:25])
colnames(associationRules) <- columnNames
#Now the same for right hand side movies
associationRules <- associationRules %>% left_join(movies, by = c("movieRightSide" = "movieId"))
associationRules$movieRightSide <- NULL
columnNames <- colnames(associationRules)
columnNames[26:45] <- str_c("Right_", columnNames[26:45])
colnames(associationRules) <- columnNames
colnames(associationRules)
## [1] "support" "confidence"
## [3] "coverage" "lift"
## [5] "Left_count" "Left_title"
## [7] "Left_year" "Left_genre_Adventure"
## [9] "Left_genre_Animation" "Left_genre_Children"
## [11] "Left_genre_Comedy" "Left_genre_Fantasy"
## [13] "Left_genre_Romance" "Left_genre_Drama"
## [15] "Left_genre_Action" "Left_genre_Crime"
## [17] "Left_genre_Thriller" "Left_genre_Horror"
## [19] "Left_genre_Mystery" "Left_genre_Sci-Fi"
## [21] "Left_genre_Documentary" "Left_genre_War"
## [23] "Left_genre_Musical" "Left_genre_Western"
## [25] "genre_Film-Noir.x" "Right_genre_(no genres listed).x"
## [27] "Right_title" "Right_year"
## [29] "Right_genre_Adventure" "Right_genre_Animation"
## [31] "Right_genre_Children" "Right_genre_Comedy"
## [33] "Right_genre_Fantasy" "Right_genre_Romance"
## [35] "Right_genre_Drama" "Right_genre_Action"
## [37] "Right_genre_Crime" "Right_genre_Thriller"
## [39] "Right_genre_Horror" "Right_genre_Mystery"
## [41] "Right_genre_Sci-Fi" "Right_genre_Documentary"
## [43] "Right_genre_War" "Right_genre_Musical"
## [45] "Right_genre_Western" "genre_Film-Noir.y"
## [47] "genre_(no genres listed).y"
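Renaming by column position is fragile: in the output above, count picked up a Left_ prefix and the Film-Noir columns kept their join suffixes. The titles and years used in the next step are labelled correctly, but a name-based approach avoids the issue altogether. The sketch below, on toy frames, prefixes every non-key column before joining; `prefix_cols`, `movies_toy` and `rules_toy` are illustrative assumptions.

```r
# Toy stand-ins for the movies table and the rules data frame
movies_toy <- data.frame(movieId = c(1, 2),
                         title   = c("A", "B"),
                         year    = c(1995, 1996),
                         stringsAsFactors = FALSE)
rules_toy  <- data.frame(lift = 5.1, movieLeftSide = 2, movieRightSide = 1)

# Hypothetical helper: add a prefix to every column except the join key
prefix_cols <- function(df, prefix, key = "movieId") {
  rename <- names(df) != key
  names(df)[rename] <- paste0(prefix, names(df)[rename])
  df
}

# Join each side by name, so no positional renaming is needed afterwards
joined <- merge(rules_toy, prefix_cols(movies_toy, "Left_"),
                by.x = "movieLeftSide", by.y = "movieId", all.x = TRUE)
joined <- merge(joined, prefix_cols(movies_toy, "Right_"),
                by.x = "movieRightSide", by.y = "movieId", all.x = TRUE)
names(joined)
```

Since the prefixes are applied before the joins, no two columns share a name and no .x or .y suffixes appear.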
Now that we have attached the titles and genres to the rules, we can look at what we have achieved. Let’s look at the rules with the highest lift, that is, those with the strongest positive dependency and therefore the strongest recommendations.
associationRules <- arrange(associationRules, desc(lift))
associationRules <- select(associationRules, Left_title, Left_year, Right_title, Right_year, support, confidence, lift)
head(associationRules)
## Left_title Left_year
## 1 Manon of the Spring (Manon des sources) 1986
## 2 Fantastic Four: Rise of the Silver Surfer 2007
## 3 Hobbit: The Desolation of Smaug, The 2013
## 4 Resident Evil: Apocalypse 2004
## 5 Saw II 2005
## 6 Hunger Games: Catching Fire, The 2013
## Right_title Right_year support confidence lift
## 1 Jean de Florette 1986 0.01028730 0.7551020 54.68155
## 2 Fantastic Four 2005 0.01037998 0.7567568 36.13011
## 3 Hobbit: An Unexpected Journey, The 2012 0.01251158 0.7758621 34.88147
## 4 Resident Evil 2002 0.01075070 0.7581699 33.12006
## 5 Saw 2004 0.01028730 0.8283582 30.92728
## 6 Hunger Games, The 2012 0.01399444 0.7905759 29.41488
head(associationRules, 150)
## Left_title Left_year
## 1 Manon of the Spring (Manon des sources) 1986
## 2 Fantastic Four: Rise of the Silver Surfer 2007
## 3 Hobbit: The Desolation of Smaug, The 2013
## 4 Resident Evil: Apocalypse 2004
## 5 Saw II 2005
## 6 Hunger Games: Catching Fire, The 2013
## 7 Three Colors: White (Trzy kolory: Bialy) 1994
## 8 Transformers: Revenge of the Fallen 2009
## 9 Three Colors: White (Trzy kolory: Bialy) 1994
## 10 Captain America: The Winter Soldier 2014
## 11 Iron Man 3 2013
## 12 Captain America: The First Avenger 2011
## 13 Amazing Spider-Man, The 2012
## 14 Captain America: The Winter Soldier 2014
## 15 Scary Movie 2 2001
## 16 X-Men: First Class 2011
## 17 Star Trek Into Darkness 2013
## 18 Birdman: Or (The Unexpected Virtue of Ignorance) 2014
## 19 Edge of Tomorrow 2014
## 20 Pirates of the Caribbean: At World's End 2007
## 21 Beverly Hills Cop II 1987
## 22 Chronicles of Riddick, The 2004
## 23 Harry Potter and the Order of the Phoenix 2007
## 24 Ice Age 2: The Meltdown 2006
## 25 Thunderball 1965
## 26 28 Weeks Later 2007
## 27 Death Proof 2007
## 28 Fantastic Four: Rise of the Silver Surfer 2007
## 29 Lars and the Real Girl 2007
## 30 Shooter 2007
## 31 Terminator Salvation 2009
## 32 Quantum of Solace 2008
## 33 Kiki's Delivery Service (Majo no takkyûbin) 1989
## 34 Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) 1984
## 35 Harry Potter and the Goblet of Fire 2005
## 36 Grave of the Fireflies (Hotaru no haka) 1988
## 37 Dawn of the Dead 1978
## 38 Hellboy II: The Golden Army 2008
## 39 For a Few Dollars More (Per qualche dollaro in più) 1965
## 40 Iron Man 2 2010
## 41 Iron Man 3 2013
## 42 Laputa: Castle in the Sky (Tenkû no shiro Rapyuta) 1986
## 43 Incredible Hulk, The 2008
## 44 Fistful of Dollars, A (Per un pugno di dollari) 1964
## 45 Harry Potter and the Order of the Phoenix 2007
## 46 Mission: Impossible III 2006
## 47 My Neighbor Totoro (Tonari no Totoro) 1988
## 48 Terminator Salvation 2009
## 49 Captain America: The First Avenger 2011
## 50 Shrek the Third 2007
## 51 Wallace & Gromit: A Close Shave 1995
## 52 Broken Flowers 2005
## 53 Fantastic Four: Rise of the Silver Surfer 2007
## 54 Harry Potter and the Half-Blood Prince 2009
## 55 Spider-Man 3 2007
## 56 Thor 2011
## 57 Howl's Moving Castle (Hauru no ugoku shiro) 2004
## 58 Hancock 2008
## 59 Grand Day Out with Wallace and Gromit, A 1989
## 60 Ice Age 2: The Meltdown 2006
## 61 Captain America: The Winter Soldier 2014
## 62 X-Men Origins: Wolverine 2009
## 63 Amazing Spider-Man, The 2012
## 64 Quantum of Solace 2008
## 65 Star Trek Into Darkness 2013
## 66 Just Cause 1995
## 67 Matrix Revolutions, The 2003
## 68 Transformers: Revenge of the Fallen 2009
## 69 Wanted 2008
## 70 X-Men: The Last Stand 2006
## 71 Michael Clayton 2007
## 72 Animatrix, The 2003
## 73 Prometheus 2012
## 74 I Heart Huckabees 2004
## 75 Rise of the Planet of the Apes 2011
## 76 Hulk 2003
## 77 Bananas 1971
## 78 Mission: Impossible - Ghost Protocol 2011
## 79 Tropic Thunder 2008
## 80 Spider-Man 3 2007
## 81 Superman Returns 2006
## 82 Daredevil 2003
## 83 Star Trek: Nemesis 2002
## 84 Gone Baby Gone 2007
## 85 Harry Potter and the Goblet of Fire 2005
## 86 Death Proof 2007
## 87 Planet Terror 2007
## 88 Harry Potter and the Order of the Phoenix 2007
## 89 F/X 1986
## 90 Animatrix, The 2003
## 91 Reign of Fire 2002
## 92 Animatrix, The 2003
## 93 Grindhouse 2007
## 94 Terminator Salvation 2009
## 95 Zodiac 2007
## 96 Town, The 2010
## 97 History of Violence, A 2005
## 98 Ruthless People 1986
## 99 Eastern Promises 2007
## 100 Who's Afraid of Virginia Woolf? 1966
## 101 Underworld 2003
## 102 Star Wars: Episode III - Revenge of the Sith 2005
## 103 Chronicles of Riddick, The 2004
## 104 Star Trek: Nemesis 2002
## 105 Fountain, The 2006
## 106 American Gangster 2007
## 107 Van Helsing 2004
## 108 Sunshine 2007
## 109 Resident Evil: Apocalypse 2004
## 110 Terminator 3: Rise of the Machines 2003
## 111 Mimic 1997
## 112 Harry Potter and the Chamber of Secrets 2002
## 113 Mission: Impossible III 2006
## 114 Harry Potter and the Order of the Phoenix 2007
## 115 Fantastic Four: Rise of the Silver Surfer 2007
## 116 Ruthless People 1986
## 117 Hard Target 1993
## 118 Prometheus 2012
## 119 State and Main 2000
## 120 Harry Potter and the Goblet of Fire 2005
## 121 Cloud Atlas 2012
## 122 Man of Steel 2013
## 123 Looper 2012
## 124 Source Code 2011
## 125 Super 8 2011
## 126 Tron: Legacy 2010
## 127 Airplane II: The Sequel 1982
## 128 Gravity 2013
## 129 Ruthless People 1986
## 130 Planet Terror 2007
## 131 Rise of the Planet of the Apes 2011
## 132 Death Proof 2007
## 133 Scanner Darkly, A 2006
## 134 Bourne Supremacy, The 2004
## 135 Captain America: The Winter Soldier 2014
## 136 Limitless 2011
## 137 Iron Man 3 2013
## 138 Star Trek Into Darkness 2013
## 139 Innerspace 1987
## 140 Superman Returns 2006
## 141 The Lego Movie 2014
## 142 Cabin in the Woods, The 2012
## 143 Mission: Impossible III 2006
## 144 Birdman: Or (The Unexpected Virtue of Ignorance) 2014
## 145 Town, The 2010
## 146 Wild Bunch, The 1969
## 147 Moon 2009
## 148 Primer 2004
## 149 Peggy Sue Got Married 1986
## 150 Moonrise Kingdom 2012
## Right_title
## 1 Jean de Florette
## 2 Fantastic Four
## 3 Hobbit: An Unexpected Journey, The
## 4 Resident Evil
## 5 Saw
## 6 Hunger Games, The
## 7 Three Colors: Blue (Trois couleurs: Bleu)
## 8 Transformers
## 9 Three Colors: Red (Trois couleurs: Rouge)
## 10 Avengers, The
## 11 Avengers, The
## 12 Avengers, The
## 13 Avengers, The
## 14 Dark Knight Rises, The
## 15 Scary Movie
## 16 Avengers, The
## 17 Star Trek
## 18 Interstellar
## 19 Interstellar
## 20 Pirates of the Caribbean: Dead Man's Chest
## 21 Beverly Hills Cop
## 22 I, Robot
## 23 Harry Potter and the Goblet of Fire
## 24 Ice Age
## 25 Goldfinger
## 26 28 Days Later
## 27 No Country for Old Men
## 28 300
## 29 Juno
## 30 Bourne Ultimatum, The
## 31 Avatar
## 32 Casino Royale
## 33 Spirited Away (Sen to Chihiro no kamikakushi)
## 34 Spirited Away (Sen to Chihiro no kamikakushi)
## 35 Harry Potter and the Prisoner of Azkaban
## 36 Spirited Away (Sen to Chihiro no kamikakushi)
## 37 Shaun of the Dead
## 38 Iron Man
## 39 Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il)
## 40 Iron Man
## 41 Iron Man
## 42 Spirited Away (Sen to Chihiro no kamikakushi)
## 43 Iron Man
## 44 Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il)
## 45 Harry Potter and the Prisoner of Azkaban
## 46 Casino Royale
## 47 Spirited Away (Sen to Chihiro no kamikakushi)
## 48 Iron Man
## 49 Iron Man
## 50 Shrek 2
## 51 Wallace & Gromit: The Wrong Trousers
## 52 Lost in Translation
## 53 Iron Man
## 54 Harry Potter and the Prisoner of Azkaban
## 55 Spider-Man 2
## 56 Iron Man
## 57 Spirited Away (Sen to Chihiro no kamikakushi)
## 58 Iron Man
## 59 Wallace & Gromit: The Wrong Trousers
## 60 Shrek 2
## 61 Iron Man
## 62 Iron Man
## 63 Iron Man
## 64 Iron Man
## 65 Iron Man
## 66 Client, The
## 67 Matrix Reloaded, The
## 68 Iron Man
## 69 Iron Man
## 70 X2: X-Men United
## 71 Departed, The
## 72 Matrix Reloaded, The
## 73 Iron Man
## 74 Lost in Translation
## 75 Iron Man
## 76 X2: X-Men United
## 77 Annie Hall
## 78 Iron Man
## 79 Iron Man
## 80 Iron Man
## 81 Spider-Man 2
## 82 X2: X-Men United
## 83 Star Wars: Episode II - Attack of the Clones
## 84 Departed, The
## 85 Harry Potter and the Chamber of Secrets
## 86 Sin City
## 87 Sin City
## 88 Harry Potter and the Chamber of Secrets
## 89 Lethal Weapon
## 90 Sin City
## 91 Star Wars: Episode II - Attack of the Clones
## 92 V for Vendetta
## 93 Sin City
## 94 Matrix Reloaded, The
## 95 Departed, The
## 96 Departed, The
## 97 Sin City
## 98 Lethal Weapon
## 99 Departed, The
## 100 Graduate, The
## 101 Matrix Reloaded, The
## 102 Star Wars: Episode II - Attack of the Clones
## 103 Matrix Reloaded, The
## 104 Matrix Reloaded, The
## 105 V for Vendetta
## 106 Departed, The
## 107 Matrix Reloaded, The
## 108 V for Vendetta
## 109 Matrix Reloaded, The
## 110 Matrix Reloaded, The
## 111 Face/Off
## 112 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)
## 113 V for Vendetta
## 114 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)
## 115 V for Vendetta
## 116 Fish Called Wanda, A
## 117 Demolition Man
## 118 Inception
## 119 O Brother, Where Art Thou?
## 120 Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone)
## 121 Inception
## 122 Inception
## 123 Inception
## 124 Inception
## 125 Inception
## 126 Inception
## 127 Airplane!
## 128 Inception
## 129 Big
## 130 Kill Bill: Vol. 2
## 131 Inception
## 132 Kill Bill: Vol. 2
## 133 Donnie Darko
## 134 Bourne Identity, The
## 135 Inception
## 136 Inception
## 137 Inception
## 138 Inception
## 139 Back to the Future Part II
## 140 Batman Begins
## 141 Inception
## 142 Inception
## 143 Batman Begins
## 144 Inception
## 145 Inception
## 146 Psycho
## 147 Inception
## 148 Donnie Darko
## 149 When Harry Met Sally...
## 150 Inception
## Right_year support confidence lift
## 1 1986 0.01028730 0.7551020 54.681550
## 2 2005 0.01037998 0.7567568 36.130112
## 3 2012 0.01251158 0.7758621 34.881466
## 4 2002 0.01075070 0.7581699 33.120055
## 5 2004 0.01028730 0.8283582 30.927284
## 6 2012 0.01399444 0.7905759 29.414876
## 7 1993 0.01770158 0.7764228 26.851287
## 8 2007 0.01037998 0.7724138 26.208632
## 9 1994 0.01742354 0.7642276 22.905601
## 10 2012 0.01260426 0.8343558 21.034344
## 11 2012 0.01390176 0.8333333 21.008567
## 12 2012 0.01399444 0.8074866 20.356964
## 13 2012 0.01241891 0.7976190 20.108200
## 14 2012 0.01149212 0.7607362 19.405068
## 15 2000 0.01269694 0.7828571 19.285453
## 16 2012 0.01909175 0.7545788 19.023142
## 17 2009 0.01436515 0.8288770 18.104419
## 18 2014 0.01149212 0.7948718 17.980433
## 19 2014 0.01696015 0.7530864 17.035225
## 20 2006 0.02187210 0.7892977 16.633832
## 21 1984 0.01380908 0.7720207 16.495255
## 22 2004 0.01232623 0.7556818 15.987857
## 23 2005 0.02659870 0.7798913 15.847509
## 24 2002 0.01121409 0.8120805 15.345620
## 25 1964 0.01529194 0.7932692 15.284598
## 26 2002 0.01612604 0.8365385 15.195707
## 27 2007 0.01047266 0.7533333 14.161092
## 28 2007 0.01028730 0.7500000 13.880789
## 29 2007 0.01010195 0.8384615 13.854518
## 30 2007 0.01000927 0.8000000 13.466459
## 31 2009 0.01019462 0.7746479 13.438024
## 32 2006 0.01612604 0.8613861 13.164811
## 33 2001 0.01075070 0.8345324 12.808825
## 34 2001 0.01380908 0.8324022 12.776131
## 35 2004 0.04096386 0.8323917 12.757822
## 36 2001 0.01093605 0.8251748 12.665201
## 37 2004 0.01010195 0.7569444 12.662683
## 38 2008 0.01037998 0.8682171 12.540913
## 39 1966 0.01705283 0.7965368 12.528618
## 40 2008 0.02511585 0.8658147 12.506212
## 41 2008 0.01436515 0.8611111 12.438272
## 42 2001 0.01399444 0.8074866 12.393714
## 43 2008 0.01436515 0.8563536 12.369552
## 44 1966 0.02168675 0.7826087 12.309545
## 45 2004 0.02715477 0.7961957 12.203056
## 46 2006 0.01223355 0.7951807 12.152975
## 47 2001 0.02307692 0.7830189 12.018170
## 48 2008 0.01084337 0.8239437 11.901408
## 49 2008 0.01427247 0.8235294 11.895425
## 50 2004 0.01028730 0.7872340 11.880077
## 51 1993 0.04096386 0.7906977 11.767763
## 52 2003 0.01102873 0.8150685 11.741774
## 53 2008 0.01112141 0.8108108 11.711712
## 54 2004 0.02177943 0.7605178 11.656232
## 55 2004 0.01992586 0.8113208 11.625698
## 56 2008 0.01575533 0.8018868 11.582809
## 57 2001 0.02224282 0.7523511 11.547466
## 58 2008 0.01455051 0.7969543 11.511562
## 59 1993 0.02845227 0.7713568 11.479917
## 60 2004 0.01047266 0.7583893 11.444783
## 61 2008 0.01195551 0.7914110 11.431493
## 62 2008 0.01529194 0.7894737 11.403509
## 63 2008 0.01223355 0.7857143 11.349206
## 64 2008 0.01464319 0.7821782 11.298130
## 65 2008 0.01353105 0.7807487 11.277481
## 66 1994 0.01260426 0.7555556 11.198413
## 67 2003 0.04559778 0.8395904 11.184174
## 68 2008 0.01037998 0.7724138 11.157088
## 69 2008 0.01130677 0.7672956 11.083159
## 70 2003 0.02446710 0.7719298 11.002804
## 71 2006 0.01075070 0.8169014 10.976795
## 72 2003 0.01084337 0.8239437 10.975743
## 73 2008 0.01177016 0.7559524 10.919312
## 74 2003 0.01130677 0.7577640 10.916253
## 75 2008 0.01316033 0.7553191 10.910165
## 76 2003 0.01751622 0.7651822 10.906626
## 77 1977 0.01010195 0.7622378 10.893438
## 78 2008 0.01158480 0.7530120 10.876841
## 79 2008 0.01371640 0.7512690 10.851664
## 80 2008 0.01844300 0.7509434 10.846960
## 81 2004 0.01464319 0.7523810 10.781129
## 82 2003 0.01714551 0.7551020 10.762947
## 83 2002 0.01102873 0.7933333 10.740360
## 84 2006 0.01139944 0.7987013 10.732238
## 85 2002 0.03707136 0.7532957 10.694816
## 86 2005 0.01093605 0.7866667 10.676897
## 87 2005 0.01084337 0.7852349 10.657465
## 88 2002 0.02557924 0.7500000 10.648026
## 89 1987 0.01112141 0.7741935 10.627924
## 90 2005 0.01028730 0.7816901 10.609354
## 91 2002 0.01010195 0.7730496 10.465754
## 92 2006 0.01047266 0.7957746 10.445752
## 93 2005 0.01445783 0.7684729 10.429966
## 94 2003 0.01028730 0.7816901 10.412885
## 95 2006 0.01816497 0.7747036 10.409778
## 96 2006 0.01019462 0.7746479 10.409030
## 97 2005 0.01362373 0.7656250 10.391313
## 98 1987 0.01010195 0.7517241 10.319470
## 99 2006 0.01260426 0.7640449 10.266557
## 100 1967 0.01121409 0.7610063 10.264072
## 101 2003 0.01501390 0.7641509 10.179245
## 102 2002 0.04365153 0.7511962 10.169895
## 103 2003 0.01241891 0.7613636 10.142116
## 104 2003 0.01056534 0.7600000 10.123951
## 105 2006 0.01028730 0.7708333 10.118360
## 106 2006 0.01974050 0.7526502 10.113444
## 107 2003 0.01575533 0.7555556 10.064746
## 108 2006 0.01000927 0.7659574 10.054356
## 109 2003 0.01065802 0.7516340 10.012507
## 110 2003 0.02826691 0.7512315 10.007146
## 111 1997 0.01130677 0.7770701 9.899157
## 112 2001 0.05514365 0.7828947 9.891609
## 113 2006 0.01158480 0.7530120 9.884428
## 114 2001 0.02659870 0.7798913 9.853662
## 115 2006 0.01028730 0.7500000 9.844891
## 116 1988 0.01056534 0.7862069 9.795811
## 117 1993 0.01084337 0.7500000 9.680024
## 118 2010 0.01417980 0.9107143 9.643383
## 119 2000 0.01000927 0.8244275 9.627243
## 120 2001 0.03744208 0.7608286 9.612811
## 121 2010 0.01186284 0.9078014 9.612539
## 122 2010 0.01075070 0.8992248 9.521723
## 123 2010 0.01974050 0.8987342 9.516528
## 124 2010 0.02113068 0.8976378 9.504918
## 125 2010 0.01047266 0.8897638 9.421542
## 126 2010 0.01019462 0.8870968 9.393301
## 127 1980 0.01445783 0.7878788 9.362568
## 128 2010 0.01964782 0.8796680 9.314640
## 129 1988 0.01019462 0.7586207 9.312306
## 130 2004 0.01130677 0.8187919 9.251063
## 131 2010 0.01519926 0.8723404 9.237049
## 132 2004 0.01130677 0.8133333 9.189389
## 133 2001 0.01204819 0.8227848 9.180815
## 134 2002 0.05468026 0.8477011 9.174218
## 135 2010 0.01306766 0.8650307 9.159648
## 136 2010 0.01631140 0.8627451 9.135446
## 137 2010 0.01436515 0.8611111 9.118144
## 138 2010 0.01492122 0.8609626 9.116571
## 139 1989 0.01056534 0.7651007 9.111961
## 140 2005 0.01640408 0.8428571 9.094429
## 141 2010 0.01010195 0.8582677 9.088036
## 142 2010 0.01112141 0.8571429 9.076125
## 143 2005 0.01288230 0.8373494 9.035000
## 144 2010 0.01232623 0.8525641 9.027641
## 145 2010 0.01121409 0.8521127 9.022861
## 146 1960 0.01084337 0.7597403 9.018259
## 147 2010 0.02437442 0.8511327 9.012484
## 148 2001 0.01167748 0.8076923 9.012410
## 149 1989 0.01102873 0.8095238 8.995635
## 150 2010 0.01612604 0.8487805 8.987577
Comment: We can now see that the rules with the highest lift connect sequels, prequels or movies from the same universe. That was rather predictable, as series are very often watched as a whole. However, we also went through the first 150 observations to look for other kinds of association.
A great example of a successful association would be composed of movies from different universes, series, directors etc.
Some other associations:
Now that we have established the basic association rules, we can apply many variations to refine and explore the recommendations further, since going one by one through a dataset of almost 1500 rules would be highly tedious.
For example, we can check whether there are rules in which older movies led the audience to newer titles.
associationRules %>%
filter(Left_year < 1990 & Right_year > 2000) %>%
arrange(desc(lift)) %>%
head(25)
## Left_title Left_year
## 1 Kiki's Delivery Service (Majo no takkyûbin) 1989
## 2 Nausicaä of the Valley of the Wind (Kaze no tani no Naushika) 1984
## 3 Grave of the Fireflies (Hotaru no haka) 1988
## 4 Dawn of the Dead 1978
## 5 Laputa: Castle in the Sky (Tenkû no shiro Rapyuta) 1986
## 6 My Neighbor Totoro (Tonari no Totoro) 1988
## 7 WarGames 1983
## 8 Bourne Identity, The 1988
## 9 Grave of the Fireflies (Hotaru no haka) 1988
## Right_title Right_year support
## 1 Spirited Away (Sen to Chihiro no kamikakushi) 2001 0.01075070
## 2 Spirited Away (Sen to Chihiro no kamikakushi) 2001 0.01380908
## 3 Spirited Away (Sen to Chihiro no kamikakushi) 2001 0.01093605
## 4 Shaun of the Dead 2004 0.01010195
## 5 Spirited Away (Sen to Chihiro no kamikakushi) 2001 0.01399444
## 6 Spirited Away (Sen to Chihiro no kamikakushi) 2001 0.02307692
## 7 Minority Report 2002 0.01566265
## 8 Dark Knight, The 2008 0.01065802
## 9 Lord of the Rings: The Two Towers, The 2002 0.01065802
## confidence lift
## 1 0.8345324 12.808825
## 2 0.8324022 12.776131
## 3 0.8251748 12.665201
## 4 0.7569444 12.662683
## 5 0.8074866 12.393714
## 6 0.7830189 12.018170
## 7 0.7824074 7.106209
## 8 0.7986111 6.679856
## 9 0.8041958 4.834135
As we can see, the majority of the titles on this short list are of Japanese origin. My thought is that Japanese animation has enough power to draw viewers into its world, no matter what the year of production is. Let’s try the other way around now.
associationRules %>%
filter(Left_year > 2000 & Right_year < 1990) %>%
arrange(desc(lift)) %>%
head(25)
## Left_title Left_year Right_title Right_year
## 1 Terminator 3: Rise of the Machines 2003 Terminator, The 1984
## 2 Terminator Salvation 2009 Terminator, The 1984
## 3 Star Trek: Nemesis 2002 Terminator, The 1984
## support confidence lift
## 1 0.03012048 0.8004926 5.667530
## 2 0.01019462 0.7746479 5.484548
## 3 0.01047266 0.7533333 5.333640
Here we also have only a few rules, all from the science fiction genre, leading our viewers back to the original Terminator. Since the year-based filters return so few rules in our dataset, let’s try one last modification.
We can also use association rules to recommend a movie based on a given title. Let’s explore the connections of the amazing science fiction movie “Inception”, directed by the great Christopher Nolan.
InceptionLeft <- associationRules %>%
filter(str_detect(Left_title, "Inception")) %>%
head(20)
InceptionRight <- associationRules %>%
filter(str_detect(Right_title, "Inception"))
InceptionLeft
## [1] Left_title Left_year Right_title Right_year support confidence
## [7] lift
## <0 rows> (or 0-length row.names)
InceptionRight
## Left_title Left_year
## 1 Prometheus 2012
## 2 Cloud Atlas 2012
## 3 Man of Steel 2013
## 4 Looper 2012
## 5 Source Code 2011
## 6 Super 8 2011
## 7 Tron: Legacy 2010
## 8 Gravity 2013
## 9 Rise of the Planet of the Apes 2011
## 10 Captain America: The Winter Soldier 2014
## 11 Limitless 2011
## 12 Iron Man 3 2013
## 13 Star Trek Into Darkness 2013
## 14 The Lego Movie 2014
## 15 Cabin in the Woods, The 2012
## 16 Birdman: Or (The Unexpected Virtue of Ignorance) 2014
## 17 Town, The 2010
## 18 Moon 2009
## 19 Moonrise Kingdom 2012
## 20 Shutter Island 2010
## 21 Adjustment Bureau, The 2011
## 22 Dark Knight Rises, The 2012
## 23 X-Men: First Class 2011
## 24 X-Men: Days of Future Past 2014
## 25 Argo 2012
## 26 Kick-Ass 2010
## 27 Hugo 2011
## 28 Hobbit: An Unexpected Journey, The 2012
## 29 Skyfall 2012
## 30 Thor 2011
## 31 Scott Pilgrim vs. the World 2010
## 32 Wolf of Wall Street, The 2013
## 33 Mission: Impossible - Ghost Protocol 2011
## 34 Fighter, The 2010
## 35 Hunger Games: Catching Fire, The 2013
## 36 Hobbit: The Desolation of Smaug, The 2013
## 37 Sherlock Holmes: A Game of Shadows 2011
## 38 Drive 2011
## 39 Black Swan 2010
## 40 True Grit 2010
## 41 Dallas Buyers Club 2013
## 42 Moneyball 2011
## 43 127 Hours 2010
## 44 Social Network, The 2010
## 45 Life of Pi 2012
## 46 Sherlock Holmes 2009
## 47 Edge of Tomorrow 2014
## 48 Captain America: The First Avenger 2011
## 49 Ex Machina 2015
## 50 Grand Budapest Hotel, The 2014
## 51 Her 2013
## 52 Girl with the Dragon Tattoo, The (Män som hatar kvinnor) 2009
## 53 Amazing Spider-Man, The 2012
## 54 Silver Linings Playbook 2012
## 55 Guardians of the Galaxy 2014
## 56 Midnight in Paris 2011
## 57 50/50 2011
## 58 Fantastic Mr. Fox 2009
## 59 Hunger Games, The 2012
## 60 Iron Man 2 2010
## 61 Hurt Locker, The 2008
## 62 Django Unchained 2012
## 63 Mad Max: Fury Road 2015
## 64 Up in the Air 2009
## 65 Despicable Me 2010
## 66 Avengers, The 2012
## 67 Girl with the Dragon Tattoo, The 2011
## 68 Toy Story 3 2010
## 69 District 9 2009
## Right_title Right_year support confidence lift
## 1 Inception 2010 0.01417980 0.9107143 9.643383
## 2 Inception 2010 0.01186284 0.9078014 9.612539
## 3 Inception 2010 0.01075070 0.8992248 9.521723
## 4 Inception 2010 0.01974050 0.8987342 9.516528
## 5 Inception 2010 0.02113068 0.8976378 9.504918
## 6 Inception 2010 0.01047266 0.8897638 9.421542
## 7 Inception 2010 0.01019462 0.8870968 9.393301
## 8 Inception 2010 0.01964782 0.8796680 9.314640
## 9 Inception 2010 0.01519926 0.8723404 9.237049
## 10 Inception 2010 0.01306766 0.8650307 9.159648
## 11 Inception 2010 0.01631140 0.8627451 9.135446
## 12 Inception 2010 0.01436515 0.8611111 9.118144
## 13 Inception 2010 0.01492122 0.8609626 9.116571
## 14 Inception 2010 0.01010195 0.8582677 9.088036
## 15 Inception 2010 0.01112141 0.8571429 9.076125
## 16 Inception 2010 0.01232623 0.8525641 9.027641
## 17 Inception 2010 0.01121409 0.8521127 9.022861
## 18 Inception 2010 0.02437442 0.8511327 9.012484
## 19 Inception 2010 0.01612604 0.8487805 8.987577
## 20 Inception 2010 0.03632994 0.8484848 8.984447
## 21 Inception 2010 0.01075070 0.8467153 8.965710
## 22 Inception 2010 0.03317887 0.8463357 8.961690
## 23 Inception 2010 0.02131603 0.8424908 8.920978
## 24 Inception 2010 0.01519926 0.8410256 8.905463
## 25 Inception 2010 0.01658943 0.8403756 8.898580
## 26 Inception 2010 0.02381835 0.8371336 8.864250
## 27 Inception 2010 0.01019462 0.8333333 8.824010
## 28 Inception 2010 0.01853568 0.8333333 8.824010
## 29 Inception 2010 0.01835032 0.8319328 8.809180
## 30 Inception 2010 0.01631140 0.8301887 8.790712
## 31 Inception 2010 0.01807229 0.8297872 8.786461
## 32 Inception 2010 0.01936979 0.8260870 8.747280
## 33 Inception 2010 0.01269694 0.8253012 8.738960
## 34 Inception 2010 0.01037998 0.8235294 8.720199
## 35 Inception 2010 0.01455051 0.8219895 8.703893
## 36 Inception 2010 0.01325301 0.8218391 8.702300
## 37 Inception 2010 0.01603336 0.8199052 8.681823
## 38 Inception 2010 0.01603336 0.8199052 8.681823
## 39 Inception 2010 0.02780352 0.8174387 8.655705
## 40 Inception 2010 0.01603336 0.8160377 8.640871
## 41 Inception 2010 0.01121409 0.8120805 8.598969
## 42 Inception 2010 0.01288230 0.8081395 8.557238
## 43 Inception 2010 0.01167748 0.8076923 8.552502
## 44 Inception 2010 0.02826691 0.8047493 8.521340
## 45 Inception 2010 0.01334569 0.8044693 8.518374
## 46 Inception 2010 0.02808156 0.8037135 8.510372
## 47 Inception 2010 0.01807229 0.8024691 8.497195
## 48 Inception 2010 0.01390176 0.8021390 8.493700
## 49 Inception 2010 0.01501390 0.8019802 8.492018
## 50 Inception 2010 0.02131603 0.8013937 8.485808
## 51 Inception 2010 0.01631140 0.8000000 8.471050
## 52 Inception 2010 0.01556997 0.8000000 8.471050
## 53 Inception 2010 0.01241891 0.7976190 8.445839
## 54 Inception 2010 0.01640408 0.7972973 8.442432
## 55 Inception 2010 0.02150139 0.7972509 8.441940
## 56 Inception 2010 0.01353105 0.7934783 8.401993
## 57 Inception 2010 0.01102873 0.7933333 8.400458
## 58 Inception 2010 0.01177016 0.7888199 8.352666
## 59 Inception 2010 0.02113068 0.7862069 8.324997
## 60 Inception 2010 0.02279889 0.7859425 8.322198
## 61 Inception 2010 0.01723818 0.7848101 8.310207
## 62 Inception 2010 0.03113994 0.7832168 8.293336
## 63 Inception 2010 0.01594069 0.7818182 8.278526
## 64 Inception 2010 0.01427247 0.7777778 8.235743
## 65 Inception 2010 0.01482854 0.7766990 8.224320
## 66 Inception 2010 0.03067655 0.7733645 8.189012
## 67 Inception 2010 0.01399444 0.7704082 8.157708
## 68 Inception 2010 0.02789620 0.7678571 8.130695
## 69 Inception 2010 0.03438369 0.7540650 7.984653
Interestingly, Inception does not appear on the left-hand side of any rule. On the right-hand side, however, because Inception is a very popular movie, we have a much larger set to work with: 69 rules, to be exact. Moreover, every lift value is close to or above 8 (the lowest is about 7.98), which indicates a strong dependency between the two movies in each rule.
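Since lift is the confidence of a rule divided by the support of its consequent, the figures in the output can be cross-checked by hand. The following is a minimal base R sketch, not part of the original analysis: it copies the confidence and lift of rules 1, 51 and 69 from the InceptionRight output, and the variable names are illustrative assumptions.

```r
# Confidence and lift copied from rules 1, 51 and 69 of the InceptionRight output
confidence <- c(0.9107143, 0.8000000, 0.7540650)
lift       <- c(9.643383, 8.471050, 7.984653)

# lift = confidence / support(consequent), therefore
# support(consequent) = confidence / lift
implied_support <- confidence / lift

# All three rules share the consequent "Inception", so the values should agree
print(round(implied_support, 4))

# Confidence cannot exceed 1, so this is the largest lift such a rule could reach
max_possible_lift <- 1 / mean(implied_support)
print(round(max_possible_lift, 2))
```

All three ratios come out at roughly 0.094, which suggests that Inception appears in about 9.4% of the transactions. This also explains why the lift values for these rules sit in such a narrow band: with the consequent fixed, lift is just confidence rescaled by a constant.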
Notably, we can spot a few different tendencies in how Inception is associated with other titles. For example, movies such as Cloud Atlas, Source Code or Limitless are connected to it through plot, themes and genre. Looper and Shutter Island are interesting cases that combine several links: Inception and Shutter Island both deal with mental illness and both star Leonardo DiCaprio, while Inception and Looper both revolve around fight scenes and action-packed sequences and both feature Joseph Gordon-Levitt. We also see an association by director in the pair with The Dark Knight Rises (Batman saga), which was also directed by Christopher Nolan.
It is fascinating how many kinds of connection one movie can create, providing something worthwhile for everyone.
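The title-based lookup used for Inception can be wrapped in a small reusable helper. The sketch below is only an illustration under stated assumptions: `rules` is a toy stand-in for the associationRules data frame with four rows copied from the outputs above, `recommend_for` is a hypothetical function name, and base R is used instead of dplyr and stringr so that the block runs on its own.

```r
# Toy stand-in for associationRules (four rows copied from the outputs above)
rules <- data.frame(
  Left_title  = c("Prometheus", "Shutter Island", "District 9",
                  "Terminator Salvation"),
  Right_title = c("Inception", "Inception", "Inception", "Terminator, The"),
  confidence  = c(0.9107143, 0.8484848, 0.7540650, 0.7746479),
  lift        = c(9.643383, 8.984447, 7.984653, 5.484548),
  stringsAsFactors = FALSE
)

# Hypothetical helper: given a title the user has rated, return the
# consequents of the matching rules, strongest lift first
recommend_for <- function(rules, watched, n = 5) {
  # fixed = TRUE treats the title as plain text, not as a regular expression
  hits <- rules[grepl(watched, rules$Left_title, fixed = TRUE), ]
  hits <- hits[order(-hits$lift), ]
  head(hits[, c("Right_title", "confidence", "lift")], n)
}

print(recommend_for(rules, "Shutter Island"))
print(recommend_for(rules, "Inception"))
```

The direction of a rule matters here. Because Inception only appears as a consequent, the rules can recommend Inception to a viewer of Shutter Island, but the second call returns no rows: the rule set says nothing about what to suggest to someone who starts from Inception.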
Association rules can be extremely useful for making recommendations or establishing behaviour patterns. They are a great technique for extracting useful information from a dataset that is hard to read or too vast in its dimensions. The algorithm transforms the provided dataset into readable and useful output, comprehensible even for a beginner. This can be a great advantage in a business environment, where presenting conclusions and outputs to someone who is not familiar with how such algorithms work can sometimes be a tedious procedure. Association rules rely on basic probability and statistics, so almost anyone can understand the results.
The method can be modified and implemented in many ways, depending on the user’s interest. A deeper look into the outputs can establish additional rules for a more detailed analysis.
Applying association rules to the dataset of movie ratings produced a set of very interesting rules, where the dependency between two movies can stem from different factors or a combination of a few of them. Although the presented technique is not a sophisticated method for establishing a general recommendation pattern, it revealed underlying relationships between the movies. Such an approach can also be used in many other activities, for instance in behaviour analysis, product suggestion or marketing campaigns.
Djenouri, Y., Belhadi, A., Fournier-Viger, P., & Fujita, H. (2018). Mining diversified association rules in big datasets: A cluster/GPU/genetic approach. Information Sciences, 459, 117-134. Retrieved from https://www.sciencedirect.com/science/article/abs/pii/S0020025518303980
Fjällström, P. (2016). A way to compare measures in association rule mining. Retrieved from https://www.diva-portal.org/smash/get/diva2:956424/FULLTEXT01.pdf
IBM. (2021a). Confidence in an association rule. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPGG_9.7.0/com.ibm.im.model.doc/c_confidence_in_an_association_rule.html
IBM. (2021b). Lift in an association rule. Retrieved from https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.1.0/com.ibm.im.model.doc/c_lift_in_an_association_rule.html
Karthik, B. (2020). Movie Recommendation System. Retrieved from https://www.kaggle.com/bandikarthik/movie-recommendation-system?select=ratings.csv