Building a recommendation system for Netflix that recommends movies based on user behavior on the Netflix platform, using Association Rules.
Association Rules is a popular unsupervised learning technique used for discovering interesting relationships between variables or items in a dataset.
The goal of Association Rules is to identify interesting patterns or associations that are not immediately apparent in the raw data. This is useful for discovering relationships between variables in a wide range of applications, such as:
* market basket analysis
* recommendation systems
* customer behavior analysis
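The strength of an association rule is usually described by its support, confidence and lift. The following is a minimal base R sketch of those three measures for a rule {A} => {B}. The five transactions and the item names A, B and C are hypothetical and are not taken from the Netflix data.

```r
# Toy transactions (hypothetical items, not taken from the Netflix data)
baskets <- list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("B", "C"),
  c("C")
)

# Fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_a  <- support("A")          # 3 of 5 transactions
supp_b  <- support("B")          # 3 of 5 transactions
supp_ab <- support(c("A", "B"))  # 2 of 5 transactions

# Confidence of {A} => {B}: share of A-transactions that also contain B
confidence_a_b <- supp_ab / supp_a

# Lift: observed co-occurrence relative to what independence would give
lift_a_b <- supp_ab / (supp_a * supp_b)

print(c(support = supp_ab, confidence = confidence_a_b, lift = lift_a_b))
```

A lift above 1 suggests the two items occur together more often than they would if they were independent.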
The dataset covers user behaviour on Netflix from users in the UK who opted in to have their anonymized browsing activity tracked. It only includes desktop and laptop activity (which Netflix estimates is around 25% of global traffic) and covers a fixed window of time (January 2017 to June 2019, inclusive). It documents each time someone in the tracked UK panel clicked on a Netflix.com/watch URL for a movie.
Source: Kaggle. URL: https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies
library(ggplot2)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
library(cluster)
library(mclust)
## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
##
## Attaching package: 'mclust'
##
## The following object is masked from 'package:purrr':
##
## map
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
##
## Attaching package: 'arules'
##
## The following object is masked from 'package:tm':
##
## inspect
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
full_data <- read.csv("netflix.csv",header=T, sep =",")
head(full_data,10)
## X datetime duration title
## 1 58773 2017-01-01 01:15:09 0 Angus, Thongs and Perfect Snogging
## 2 58774 2017-01-01 13:56:02 0 The Curse of Sleeping Beauty
## 3 58775 2017-01-01 15:17:47 10530 London Has Fallen
## 4 58776 2017-01-01 16:04:13 49 Vendetta
## 5 58777 2017-01-01 19:16:37 0 The SpongeBob SquarePants Movie
## 6 58778 2017-01-01 19:21:37 0 London Has Fallen
## 7 58779 2017-01-01 19:43:06 4903 The Water Diviner
## 8 58780 2017-01-01 19:44:38 0 Angel of Christmas
## 9 58781 2017-01-01 19:46:24 3845 Ratter
## 10 58782 2017-01-01 20:27:04 0 The Book of Life
## genres release_date
## 1 Comedy, Drama, Romance 2008-07-25
## 2 Fantasy, Horror, Mystery, Thriller 2016-06-02
## 3 Action, Thriller 2016-03-04
## 4 Action, Drama 2015-06-12
## 5 Animation, Action, Adventure, Comedy, Family, Fantasy 2004-11-19
## 6 Action, Thriller 2016-03-04
## 7 Drama, History, War 2014-12-26
## 8 Comedy, Romance 2015-11-29
## 9 Drama, Horror, Thriller 2016-02-12
## 10 Animation, Adventure, Comedy, Family, Fantasy, Musical, Romance 2014-10-17
## movie_id user_id
## 1 26bd5987e8 1dea19f6fe
## 2 f26ed2675e 544dcbc510
## 3 f77e500e7a 7cbcc791bf
## 4 c74aec7673 ebf43c36b6
## 5 a80d6fc2aa a57c992287
## 6 f77e500e7a c5bf4f3f57
## 7 7165c2fc94 8e1be40e32
## 8 b2f02f2689 892a51dee1
## 9 c39aae36c3 cff8ea652a
## 10 97183b9136 bf53608c70
str(full_data)
## 'data.frame': 671736 obs. of 8 variables:
## $ X : int 58773 58774 58775 58776 58777 58778 58779 58780 58781 58782 ...
## $ datetime : chr "2017-01-01 01:15:09" "2017-01-01 13:56:02" "2017-01-01 15:17:47" "2017-01-01 16:04:13" ...
## $ duration : num 0 0 10530 49 0 ...
## $ title : chr "Angus, Thongs and Perfect Snogging" "The Curse of Sleeping Beauty" "London Has Fallen" "Vendetta" ...
## $ genres : chr "Comedy, Drama, Romance" "Fantasy, Horror, Mystery, Thriller" "Action, Thriller" "Action, Drama" ...
## $ release_date: chr "2008-07-25" "2016-06-02" "2016-03-04" "2015-06-12" ...
## $ movie_id : chr "26bd5987e8" "f26ed2675e" "f77e500e7a" "c74aec7673" ...
## $ user_id : chr "1dea19f6fe" "544dcbc510" "7cbcc791bf" "ebf43c36b6" ...
# check the summary statistics of the dataset and look for columns with missing values
summary(full_data)
## X datetime duration title
## Min. : 58773 Length:671736 Min. : -1 Length:671736
## 1st Qu.:226707 Class :character 1st Qu.: 0 Class :character
## Median :394641 Mode :character Median : 14 Mode :character
## Mean :394641 Mean : 33476
## 3rd Qu.:562574 3rd Qu.: 6672
## Max. :730508 Max. :18237253
## genres release_date movie_id user_id
## Length:671736 Length:671736 Length:671736 Length:671736
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
# Counting number of missing values
sum(is.na(full_data))
## [1] 0
# Dimension
dim(full_data)
## [1] 671736 8
# to show the numbers in normal format
options(scipen = 999)
ggplot(full_data, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(full_data$duration,
xlab = "Duration (seconds)",
ylab = "Count",
col = 8,
boxlty = 1,
whisklty = 2,
whisklwd = 1.5,
staplelwd = 1.5,
horizontal = TRUE)
data_subset <- full_data[full_data$duration >360 , ]
#check the dimension
dim(data_subset)
## [1] 300049 8
ggplot(data_subset, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_subset$duration,
xlab = "Duration (seconds)",
ylab = "Count",
col = 8,
boxlty = 1,
whisklty = 2,
whisklwd = 1.5,
staplelwd = 1.5,
horizontal = TRUE)
data_subset <- full_data[full_data$duration >360 & full_data$duration < 100000 , ]
#check the dimension
dim(data_subset)
## [1] 257318 8
ggplot(data_subset, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_subset$duration,
xlab = "Duration (seconds)",
ylab = "Count",
col = 8,
boxlty = 1,
whisklty = 2,
whisklwd = 1.5,
staplelwd = 1.5,
horizontal = TRUE)
data_subset$duration_hrs = data_subset$duration / (60*60)
str(data_subset)
## 'data.frame': 257318 obs. of 9 variables:
## $ X : int 58775 58779 58781 58784 58786 58787 58792 58793 58796 58800 ...
## $ datetime : chr "2017-01-01 15:17:47" "2017-01-01 19:43:06" "2017-01-01 19:46:24" "2017-01-01 20:55:46" ...
## $ duration : num 10530 4903 3845 6175 38120 ...
## $ title : chr "London Has Fallen" "The Water Diviner" "Ratter" "28 Days" ...
## $ genres : chr "Action, Thriller" "Drama, History, War" "Drama, Horror, Thriller" "Comedy, Drama" ...
## $ release_date: chr "2016-03-04" "2014-12-26" "2016-02-12" "2000-04-14" ...
## $ movie_id : chr "f77e500e7a" "7165c2fc94" "c39aae36c3" "584bffaf5f" ...
## $ user_id : chr "7cbcc791bf" "8e1be40e32" "cff8ea652a" "759ae2eac9" ...
## $ duration_hrs: num 2.92 1.36 1.07 1.72 10.59 ...
head(data_subset,10)
## X datetime duration title
## 3 58775 2017-01-01 15:17:47 10530 London Has Fallen
## 7 58779 2017-01-01 19:43:06 4903 The Water Diviner
## 9 58781 2017-01-01 19:46:24 3845 Ratter
## 12 58784 2017-01-01 20:55:46 6175 28 Days
## 14 58786 2017-01-01 21:33:26 38120 The SpongeBob SquarePants Movie
## 15 58787 2017-01-01 21:37:41 7799 Beasts of No Nation
## 20 58792 2017-01-01 00:19:40 54195 About Last Night
## 21 58793 2017-01-01 00:49:03 44413 Fight Club
## 24 58796 2017-01-01 11:05:46 621 Joe and Caspar Hit the Road
## 28 58800 2017-01-01 16:05:02 581 Vendetta
## genres release_date
## 3 Action, Thriller 2016-03-04
## 7 Drama, History, War 2014-12-26
## 9 Drama, Horror, Thriller 2016-02-12
## 12 Comedy, Drama 2000-04-14
## 14 Animation, Action, Adventure, Comedy, Family, Fantasy 2004-11-19
## 15 Drama, War 2015-10-16
## 20 Comedy, Romance 2014-02-14
## 21 Drama 1999-10-15
## 24 Documentary, Adventure, Comedy 2015-11-23
## 28 Action, Drama 2015-06-12
## movie_id user_id duration_hrs
## 3 f77e500e7a 7cbcc791bf 2.9250000
## 7 7165c2fc94 8e1be40e32 1.3619444
## 9 c39aae36c3 cff8ea652a 1.0680556
## 12 584bffaf5f 759ae2eac9 1.7152778
## 14 a80d6fc2aa 5b1727dc12 10.5888889
## 15 c57e11da52 3142b4c730 2.1663889
## 20 f7d088d208 78cdb81c4f 15.0541667
## 21 338abadc17 ac30a85c52 12.3369444
## 24 416464eaad 7726b5615e 0.1725000
## 28 c74aec7673 ebf43c36b6 0.1613889
data_subset <- data_subset[data_subset$duration_hrs >= 0.5 & data_subset$duration_hrs <= 4 , ]
dim(data_subset)
## [1] 129762 9
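Because 0.5 hours is 1,800 seconds and 4 hours is 14,400 seconds, the final bounds are stricter than the earlier `> 360` and `< 100000` cut-offs, so the successive filters reduce to a single condition on `duration`. The sketch below checks this on a small hypothetical data frame whose values were chosen to sit on each boundary.

```r
# Hypothetical durations in seconds, chosen to hit each filter boundary
toy <- data.frame(duration = c(0, 400, 1799, 1800, 7200, 14400, 14401, 200000))

# Step-by-step filtering, mirroring the analysis
step <- toy[toy$duration > 360 & toy$duration < 100000, , drop = FALSE]
step$duration_hrs <- step$duration / (60 * 60)
step <- step[step$duration_hrs >= 0.5 & step$duration_hrs <= 4, , drop = FALSE]

# Equivalent single condition expressed in seconds
one_step <- toy[toy$duration >= 1800 & toy$duration <= 14400, , drop = FALSE]

# Both approaches keep the same rows
same_rows <- identical(step$duration, one_step$duration)
print(same_rows)
```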
ggplot(data_subset, aes((duration_hrs))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration in Hours Distribution')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
boxplot(data_subset$duration_hrs,
xlab = "Duration in Hours",
ylab = "Count",
col = 8,
boxlty = 1,
whisklty = 2,
whisklwd = 1.5,
staplelwd = 1.5,
horizontal = TRUE)
head(data_subset)
## X datetime duration title
## 3 58775 2017-01-01 15:17:47 10530 London Has Fallen
## 7 58779 2017-01-01 19:43:06 4903 The Water Diviner
## 9 58781 2017-01-01 19:46:24 3845 Ratter
## 12 58784 2017-01-01 20:55:46 6175 28 Days
## 15 58787 2017-01-01 21:37:41 7799 Beasts of No Nation
## 32 58804 2017-01-01 17:21:04 2400 Clueless
## genres release_date movie_id user_id duration_hrs
## 3 Action, Thriller 2016-03-04 f77e500e7a 7cbcc791bf 2.9250000
## 7 Drama, History, War 2014-12-26 7165c2fc94 8e1be40e32 1.3619444
## 9 Drama, Horror, Thriller 2016-02-12 c39aae36c3 cff8ea652a 1.0680556
## 12 Comedy, Drama 2000-04-14 584bffaf5f 759ae2eac9 1.7152778
## 15 Drama, War 2015-10-16 c57e11da52 3142b4c730 2.1663889
## 32 Comedy, Romance 1995-07-19 fc8d2d5cbc 614abddbe8 0.6666667
genresCategories <- unique(unlist(str_split(data_subset$genres, "\\, ")))
length(genresCategories)
## [1] 27
genresCategories
## [1] "Action" "Thriller" "Drama" "History"
## [5] "War" "Horror" "Comedy" "Romance"
## [9] "Animation" "Adventure" "Family" "Fantasy"
## [13] "Documentary" "Biography" "Western" "Mystery"
## [17] "Music" "Sci-Fi" "Crime" "Sport"
## [21] "NOT AVAILABLE" "Musical" "News" "Short"
## [25] "Film-Noir" "Reality-TV" "Talk-Show"
# moviesCount is the number of viewing records per genre combination
genre_groups = data_subset %>% group_by(genres) %>%
summarise(moviesCount = n()) %>%
arrange(desc(moviesCount))
str(data_subset)
## 'data.frame': 129762 obs. of 9 variables:
## $ X : int 58775 58779 58781 58784 58787 58804 58806 58809 58812 58813 ...
## $ datetime : chr "2017-01-01 15:17:47" "2017-01-01 19:43:06" "2017-01-01 19:46:24" "2017-01-01 20:55:46" ...
## $ duration : num 10530 4903 3845 6175 7799 ...
## $ title : chr "London Has Fallen" "The Water Diviner" "Ratter" "28 Days" ...
## $ genres : chr "Action, Thriller" "Drama, History, War" "Drama, Horror, Thriller" "Comedy, Drama" ...
## $ release_date: chr "2016-03-04" "2014-12-26" "2016-02-12" "2000-04-14" ...
## $ movie_id : chr "f77e500e7a" "7165c2fc94" "c39aae36c3" "584bffaf5f" ...
## $ user_id : chr "7cbcc791bf" "8e1be40e32" "cff8ea652a" "759ae2eac9" ...
## $ duration_hrs: num 2.92 1.36 1.07 1.72 2.17 ...
genre_groups
## # A tibble: 972 × 2
## genres moviesCount
## <chr> <int>
## 1 Comedy 5914
## 2 Documentary 5416
## 3 Comedy, Romance 5390
## 4 Comedy, Drama, Romance 5318
## 5 Action, Adventure, Sci-Fi 3748
## 6 NOT AVAILABLE 3735
## 7 Comedy, Drama 2977
## 8 Drama, Romance 2547
## 9 Action, Thriller 2002
## 10 Drama 1921
## # … with 962 more rows
ggplot(genre_groups, aes(x = genres, y = moviesCount, fill = moviesCount)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("genres") +
ylab("Count") +
ggtitle("Count of movies per each genre") +
theme(axis.text.x = element_text(angle = 0, hjust = 1))
genre_groups_top20 = genre_groups %>% slice_max(n = 20, order_by = moviesCount)
ggplot(genre_groups_top20, aes(x = genres, y = moviesCount, fill = moviesCount)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("genres") +
ylab("Count") +
ggtitle("Top 20 genre combinations by view count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
movies_groups = data_subset %>% group_by(title)%>%
summarise(userCount = n()) %>%
arrange(desc(userCount))
movies_groups
## # A tibble: 5,683 × 2
## title userCount
## <chr> <int>
## 1 Black Mirror: Bandersnatch 1174
## 2 Bright 679
## 3 Bird Box 539
## 4 Annihilation 500
## 5 The Hitman's Bodyguard 486
## 6 FYRE: The Greatest Party That Never Happened 483
## 7 Avengers: Age of Ultron 481
## 8 To All the Boys I've Loved Before 447
## 9 Deadpool 439
## 10 Hot Fuzz 424
## # … with 5,673 more rows
movies_groups_top10_by_users = movies_groups %>% slice_max(n = 20, order_by = userCount)
movies_groups_top10_by_users
## # A tibble: 21 × 2
## title userCount
## <chr> <int>
## 1 Black Mirror: Bandersnatch 1174
## 2 Bright 679
## 3 Bird Box 539
## 4 Annihilation 500
## 5 The Hitman's Bodyguard 486
## 6 FYRE: The Greatest Party That Never Happened 483
## 7 Avengers: Age of Ultron 481
## 8 To All the Boys I've Loved Before 447
## 9 Deadpool 439
## 10 Hot Fuzz 424
## # … with 11 more rows
ggplot(movies_groups_top10_by_users, aes(x = title, y = userCount, fill = userCount)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("title") +
ylab("Count") +
ggtitle("Most viewed movies by view count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
genre_groups_by_duration = data_subset %>% group_by(genres) %>%
summarise(TotalDuration = sum(duration_hrs))
genre_groups_by_duration
## # A tibble: 972 × 2
## genres TotalDuration
## <chr> <dbl>
## 1 Action 260.
## 2 Action, Adventure 96.9
## 3 Action, Adventure, Biography, Drama, History 50.6
## 4 Action, Adventure, Biography, Drama, Romance, Thriller 71.2
## 5 Action, Adventure, Biography, Drama, Thriller 80.0
## 6 Action, Adventure, Comedy 896.
## 7 Action, Adventure, Comedy, Crime 352.
## 8 Action, Adventure, Comedy, Crime, Drama 0.667
## 9 Action, Adventure, Comedy, Crime, Family, Mystery 10.9
## 10 Action, Adventure, Comedy, Crime, Family, Romance, Thriller 18.7
## # … with 962 more rows
ggplot(genre_groups_by_duration, aes(x = genres, y = TotalDuration, fill = TotalDuration)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("genres") +
ylab("TotalDuration") +
ggtitle("Total Duration per each genre") +
theme(axis.text.x = element_text(angle = 0, hjust = 1))
genre_groups_top10_by_duration = genre_groups_by_duration %>% slice_max(n = 20, order_by = TotalDuration)
ggplot(genre_groups_top10_by_duration, aes(x = genres, y = TotalDuration, fill = TotalDuration)) +
geom_bar(stat = "identity", position = "dodge") +
xlab("genres") +
ylab("TotalDuration") +
ggtitle("Top 20 genre combinations by total duration (hours)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
If a user spends time on two movies, X and Y, there may be a connection or similarity between those two items. This could imply that users who spend time on movie X are also likely to check out movie Y, and vice versa.
Based on this assumption, we can use this information to help establish a user’s choice behavior. For example, if a user has watched movie X, we can recommend movie Y to them, assuming that they are likely to enjoy it. Similarly, if a user has watched movie Y, we can recommend movie X to them.
However, it’s worth noting that there could be other factors influencing user behavior and preferences that are not captured by this approach. For instance, a user may watch movie X and Y together simply because they were recommended together or because they belong to the same genre, rather than because of an underlying similarity between the two items. Therefore, while this assumption can be useful in making recommendations, it should be combined with other techniques and strategies to ensure that the recommendations are accurate and relevant to the user’s interests.
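The idea of recommending Y to a user who watched X can be sketched as a simple lookup over a rule table. Everything here is an assumption made for illustration: the `recommend` helper, the `min_conf` cut-off, the titles X, Y and Z, and the confidence values are all hypothetical and are not results from this dataset.

```r
# Hypothetical rule table; titles and confidences are illustrative only
rules <- data.frame(
  lhs        = c("X", "Y", "X", "Z"),
  rhs        = c("Y", "X", "Z", "X"),
  confidence = c(0.40, 0.25, 0.10, 0.30),
  stringsAsFactors = FALSE
)

# Return the consequents of rules whose antecedent is the watched title,
# keeping only rules at or above min_conf, highest confidence first
recommend <- function(watched, rules, min_conf = 0.2) {
  hits <- rules[rules$lhs == watched & rules$confidence >= min_conf, ]
  hits$rhs[order(hits$confidence, decreasing = TRUE)]
}

print(recommend("X", rules))               # default confidence cut-off
print(recommend("X", rules, min_conf = 0)) # no cut-off
```

Note that the two directions of a pair carry different confidences in this table, so recommending Y after X and X after Y are treated as separate rules.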
# get the viewed movies
viewed_movies <- data_subset %>%
group_by(user_id) %>%
summarise(movies = as.vector(list(title)))
# compute transactions
transactions <- as(viewed_movies$movies, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
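The warning indicates that at least one user has the same title more than once in their list, and that the duplicates are dropped during the coercion. The same deduplication can be made explicit beforehand. The sketch below uses base R on a small hypothetical viewing log; the user ids are placeholders.

```r
# Hypothetical viewing log: user u1 watched "Shrek" twice
views <- data.frame(
  user_id = c("u1", "u1", "u1", "u2", "u2"),
  title   = c("Shrek", "Shrek 2", "Shrek", "Bright", "Bird Box"),
  stringsAsFactors = FALSE
)

# One character vector of titles per user, keeping each title only once
baskets <- lapply(split(views$title, views$user_id), unique)

print(baskets)
```

Each list element then plays the role of one transaction, so the size of a transaction is the number of distinct titles a user viewed.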
Let’s first analyse the number of movies viewed per user.
hist(size(transactions), breaks = 0:100, xaxt="n", ylim=c(0,5000),
main = "", xlab = "#Movies")
axis(1, at=seq(0,160,by=10), cex.axis=1)
mtext(paste("Total:", length(transactions), "Users who viewed movies,", sum(size(transactions)), "Movies"))
Next, let’s determine which movies are frequent. We set the support threshold to 0.01, which means a movie is considered frequent if and only if at least 1 percent of all users viewed it. With about 61,088 users in the transaction set (a figure inferred from the rule counts and supports reported further below, for example 91 / 0.0014896543), a movie is frequent if it was viewed by more than about 611 users.
movies_frequencies <- itemFrequency(transactions, type="a")
support <- 0.01
freq_movies <- sort(movies_frequencies, decreasing = F)
freq_movies <- freq_movies[freq_movies>support*length(transactions)]
par(mar=c(2,10,2,2)); options(scipen=5)
barplot(freq_movies, horiz=T, las=3, main="Frequent movies", cex.names=.8, xlim=c(0,1000))
mtext(paste("support:",support), padj = .5)
abline(v=support*length(transactions), col="red")
The ranking shows that two movies (Bright and Black Mirror: Bandersnatch) were viewed by more than about 611 users, which is 1 percent of the roughly 61,088 users.
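The conversion between a relative support threshold and a number of users can be checked with a short calculation. The user total is an assumption: it is not printed directly in this analysis, but is inferred here from a rule reported later with count 91 and support 0.0014896543.

```r
# Assumption: number of users inferred from a reported rule (count / support)
n_users <- round(91 / 0.0014896543)

# Relative support thresholds used in this analysis
rel_support <- c(0.01, 0.0005, 0.00005)

# Corresponding minimum number of users
min_count <- rel_support * n_users

print(data.frame(rel_support, min_count))
```

A threshold of 0.01 therefore corresponds to roughly 611 users, and 0.0005 to roughly 31 users.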
Now, let’s compute the frequent sets of movies. We decrease the support threshold to account for the small probability of observing a frequent set of at least two movies for each user.
support <- 0.0005
movieSets <- apriori(transactions, parameter = list(target = "frequent itemsets", supp=support, minlen=2), control = list(verbose = FALSE))
par(mar=c(5,18,2,2)+.1)
sets_order_supp <- DATAFRAME(sort(movieSets, by="support", decreasing = F))
barplot(sets_order_supp$support, names.arg=sets_order_supp$items, horiz = T, las = 2, cex.names = .6, main = "Frequent Viewed Movies")
mtext(paste("support:",support), padj = .8)
First, with a support threshold of 0.0005 (~30 users), we observe frequent pairs only. Second, users appear to prefer viewing movies that have more than one part, such as (Iron Man, Iron Man 2), (Kill Bill: Vol. 1, Kill Bill: Vol. 2) and (Shrek, Shrek 2).
* I assume that a user who viewed a movie that has more than one part will most likely want to view the other parts.
Let’s generate association rules. First, we use a low support threshold and a high confidence to generate strong rules even for movies that are less frequent.
rules1 <- apriori(transactions, parameter = list(supp = 0.00005, conf = 0.6, maxlen=3), control = list(verbose = FALSE))
summary(quality(rules1))
## support confidence coverage lift
## Min. :0.00006548 Min. :0.6250 Min. :0.00006548 Min. : 36.89
## 1st Qu.:0.00006548 1st Qu.:0.6667 1st Qu.:0.00008185 1st Qu.: 210.65
## Median :0.00006548 Median :0.8000 Median :0.00008185 Median : 465.43
## Mean :0.00007396 Mean :0.8045 Mean :0.00009569 Mean : 822.23
## 3rd Qu.:0.00008185 3rd Qu.:1.0000 3rd Qu.:0.00009822 3rd Qu.: 896.93
## Max. :0.00019644 Max. :1.0000 Max. :0.00029466 Max. :8145.07
## count
## Min. : 4.000
## 1st Qu.: 4.000
## Median : 4.000
## Mean : 4.518
## 3rd Qu.: 5.000
## Max. :12.000
plot(rules1)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
There are some rules with a very high lift, indicating a strong association between the movies. Let’s investigate those rules further.
inspect(sort(rules1, by="lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Mystery Science Theater 3000: Future War,
## Mystery Science Theater 3000: I Accuse My Parents} => {Mystery Science Theater 3000: Werewolf} 0.00006547931 0.8000000 0.00008184914 8145.067 4
## [2] {Mystery Science Theater 3000: Future War,
## Mystery Science Theater 3000: Werewolf} => {Mystery Science Theater 3000: I Accuse My Parents} 0.00006547931 1.0000000 0.00006547931 5553.455 4
## [3] {Mystery Science Theater 3000: Laserblast} => {Mystery Science Theater 3000: I Accuse My Parents} 0.00011458879 0.8750000 0.00013095862 4859.273 7
## [4] {Mystery Science Theater 3000: I Accuse My Parents} => {Mystery Science Theater 3000: Laserblast} 0.00011458879 0.6363636 0.00018006810 4859.273 7
## [5] {Mystery Science Theater 3000: Werewolf} => {Mystery Science Theater 3000: I Accuse My Parents} 0.00008184914 0.8333333 0.00009821896 4627.879 5
## [6] {Mystery Science Theater 3000: I Accuse My Parents,
## Mystery Science Theater 3000: Werewolf} => {Mystery Science Theater 3000: Future War} 0.00006547931 0.8000000 0.00008184914 3490.743 4
## [7] {Mystery Science Theater 3000: Werewolf} => {Mystery Science Theater 3000: Future War} 0.00006547931 0.6666667 0.00009821896 2908.952 4
## [8] {Fat, Sick & Nearly Dead 2} => {Fat, Sick and Nearly Dead} 0.00006547931 0.6666667 0.00009821896 2545.333 4
## [9] {A Christmas Prince,
## A Wish For Christmas} => {Once Upon a Holiday} 0.00006547931 0.6666667 0.00009821896 1696.889 4
## [10] {Extraction,
## Gods of Egypt} => {London Heist} 0.00006547931 0.8000000 0.00008184914 1629.013 4
inspect(sort(rules1, by="confidence")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Mystery Science Theater 3000: Future War,
## Mystery Science Theater 3000: Werewolf} => {Mystery Science Theater 3000: I Accuse My Parents} 0.00006547931 1 0.00006547931 5553.4545 4
## [2] {Bean: The Ultimate Disaster Movie,
## Kung Fu Panda 2} => {Mr. Bean's Holiday} 0.00006547931 1 0.00006547931 985.2903 4
## [3] {Scary Movie 3,
## Scary Movie 5} => {Scary Movie 2} 0.00006547931 1 0.00006547931 744.9756 4
## [4] {Scary Movie 2,
## Scary Movie 5} => {Scary Movie 3} 0.00006547931 1 0.00006547931 872.6857 4
## [5] {Mr. Bean's Holiday,
## Rush Hour 2} => {Rush Hour 3} 0.00006547931 1 0.00006547931 526.6207 4
## [6] {Kung Fu Panda 2,
## Rush Hour 2} => {Rush Hour 3} 0.00006547931 1 0.00006547931 526.6207 4
## [7] {Kung Fu Panda 2,
## Rush Hour 3} => {Rush Hour 2} 0.00006547931 1 0.00006547931 1388.3636 4
## [8] {Kung Fu Panda,
## Rush Hour 2} => {Rush Hour 3} 0.00006547931 1 0.00006547931 526.6207 4
## [9] {Kung Fu Panda 2,
## Rush Hour 2} => {Kung Fu Panda} 0.00006547931 1 0.00006547931 481.0079 4
## [10] {Kung Fu Panda,
## Rush Hour 2} => {Kung Fu Panda 2} 0.00006547931 1 0.00006547931 581.7905 4
rules2 <- apriori(transactions, parameter = list(supp = 0.0001, conf = 0.4, maxlen=3), control = list(verbose = FALSE))
summary(quality(rules2))
## support confidence coverage lift
## Min. :0.0001146 Min. :0.4118 Min. :0.0001310 Min. : 113.2
## 1st Qu.:0.0001310 1st Qu.:0.4387 1st Qu.:0.0002701 1st Qu.: 291.0
## Median :0.0001801 Median :0.5000 Median :0.0003110 Median : 347.1
## Mean :0.0001684 Mean :0.5168 Mean :0.0003363 Mean : 585.8
## 3rd Qu.:0.0001964 3rd Qu.:0.5639 3rd Qu.:0.0004092 3rd Qu.: 399.3
## Max. :0.0002128 Max. :0.8750 Max. :0.0004911 Max. :4859.3
## count
## Min. : 7.00
## 1st Qu.: 8.00
## Median :11.00
## Mean :10.29
## 3rd Qu.:12.00
## Max. :13.00
plot(rules2)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(sort(rules2, by="lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Mystery Science Theater 3000: Laserblast} => {Mystery Science Theater 3000: I Accuse My Parents} 0.0001145888 0.8750000 0.0001309586 4859.2727 7
## [2] {Mystery Science Theater 3000: I Accuse My Parents} => {Mystery Science Theater 3000: Laserblast} 0.0001145888 0.6363636 0.0001800681 4859.2727 7
## [3] {The Twilight Saga: Breaking Dawn: Part 2,
## The Twilight Saga: New Moon} => {The Twilight Saga: Breaking Dawn: Part 1} 0.0001964379 0.7058824 0.0002782871 479.1216 12
## [4] {The Twilight Saga: Breaking Dawn: Part 2,
## The Twilight Saga: New Moon} => {The Twilight Saga: Eclipse} 0.0001636983 0.5882353 0.0002782871 438.2209 10
## [5] {The Twilight Saga: Breaking Dawn: Part 2,
## Twilight} => {The Twilight Saga: Breaking Dawn: Part 1} 0.0001473284 0.6428571 0.0002291776 436.3429 9
## [6] {13TH: A Conversation with Oprah Winfrey & Ava DuVernay} => {13TH} 0.0002128078 0.5652174 0.0003765060 426.2716 13
## [7] {The Hangover,
## The Hangover Part III} => {The Hangover Part II} 0.0001800681 0.6875000 0.0002619172 424.2222 11
## [8] {The Twilight Saga: Breaking Dawn: Part 1,
## The Twilight Saga: Eclipse} => {The Twilight Saga: New Moon} 0.0002128078 0.5200000 0.0004092457 412.5423 13
## [9] {Madagascar,
## Madagascar 3: Europe's Most Wanted} => {Madagascar: Escape 2 Africa} 0.0001964379 0.5000000 0.0003928759 401.8947 12
## [10] {The Twilight Saga: Breaking Dawn: Part 2,
## Twilight} => {The Twilight Saga: New Moon} 0.0001145888 0.5000000 0.0002291776 396.6753 7
inspect(sort(rules2, by="confidence")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Mystery Science Theater 3000: Laserblast} => {Mystery Science Theater 3000: I Accuse My Parents} 0.0001145888 0.8750000 0.0001309586 4859.2727 7
## [2] {The Twilight Saga: Breaking Dawn: Part 2,
## The Twilight Saga: New Moon} => {The Twilight Saga: Breaking Dawn: Part 1} 0.0001964379 0.7058824 0.0002782871 479.1216 12
## [3] {The Hangover,
## The Hangover Part III} => {The Hangover Part II} 0.0001800681 0.6875000 0.0002619172 424.2222 11
## [4] {Madagascar 3: Europe's Most Wanted,
## Madagascar: Escape 2 Africa} => {Madagascar} 0.0001964379 0.6666667 0.0002946569 288.8322 12
## [5] {The Twilight Saga: Breaking Dawn: Part 2,
## Twilight} => {The Twilight Saga: Breaking Dawn: Part 1} 0.0001473284 0.6428571 0.0002291776 436.3429 9
## [6] {Mystery Science Theater 3000: I Accuse My Parents} => {Mystery Science Theater 3000: Laserblast} 0.0001145888 0.6363636 0.0001800681 4859.2727 7
## [7] {The Twilight Saga: Breaking Dawn: Part 2,
## The Twilight Saga: New Moon} => {The Twilight Saga: Eclipse} 0.0001636983 0.5882353 0.0002782871 438.2209 10
## [8] {13TH: A Conversation with Oprah Winfrey & Ava DuVernay} => {13TH} 0.0002128078 0.5652174 0.0003765060 426.2716 13
## [9] {The Twilight Saga: Breaking Dawn: Part 2,
## The Twilight Saga: Eclipse} => {The Twilight Saga: Breaking Dawn: Part 1} 0.0002128078 0.5652174 0.0003765060 383.6444 13
## [10] {The Twilight Saga: Breaking Dawn: Part 1,
## Twilight} => {The Twilight Saga: Breaking Dawn: Part 2} 0.0001473284 0.5625000 0.0002619172 343.6200 9
rules3 <- apriori(transactions, parameter = list(supp = 0.0005, conf = 0.1, maxlen=3), control = list(verbose = FALSE))
summary(quality(rules3))
## support confidence coverage lift
## Min. :0.0005075 Min. :0.1101 Min. :0.001572 Min. : 10.76
## 1st Qu.:0.0005238 1st Qu.:0.1716 1st Qu.:0.002181 1st Qu.: 35.29
## Median :0.0005566 Median :0.2088 Median :0.002636 Median : 74.76
## Mean :0.0006521 Mean :0.2081 Mean :0.003517 Mean : 75.45
## 3rd Qu.:0.0006343 3rd Qu.:0.2481 3rd Qu.:0.004252 3rd Qu.:118.58
## Max. :0.0014897 Max. :0.3333 Max. :0.008169 Max. :138.52
## count
## Min. :31.00
## 1st Qu.:32.00
## Median :34.00
## Mean :39.83
## 3rd Qu.:38.75
## Max. :91.00
plot(rules3)
inspect(sort(rules3, by="lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Kill Bill: Vol. 2} => {Kill Bill: Vol. 1} 0.0005238345 0.3333333 0.001571503 138.52154 32
## [2] {Kill Bill: Vol. 1} => {Kill Bill: Vol. 2} 0.0005238345 0.2176871 0.002406365 138.52154 32
## [3] {Iron Man} => {Iron Man 2} 0.0005402043 0.2426471 0.002226296 118.58259 33
## [4] {Iron Man 2} => {Iron Man} 0.0005402043 0.2640000 0.002046228 118.58259 33
## [5] {The Lord of the Rings: The Two Towers} => {The Lord of the Rings: The Fellowship of the Ring} 0.0005729439 0.2482270 0.002308146 86.64965 35
## [6] {The Lord of the Rings: The Fellowship of the Ring} => {The Lord of the Rings: The Two Towers} 0.0005729439 0.2000000 0.002864720 86.64965 35
## [7] {Shrek the Third} => {Shrek 2} 0.0005074646 0.2480000 0.002046228 62.86234 31
## [8] {Shrek 2} => {Shrek the Third} 0.0005074646 0.1286307 0.003945128 62.86234 31
## [9] {Shrek 2} => {Shrek} 0.0007202724 0.1825726 0.003945128 35.29429 44
## [10] {Shrek} => {Shrek 2} 0.0007202724 0.1392405 0.005172865 35.29429 44
inspect(sort(rules3, by="confidence")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {Kill Bill: Vol. 2} => {Kill Bill: Vol. 1} 0.0005238345 0.3333333 0.001571503 138.52154 32
## [2] {Iron Man 2} => {Iron Man} 0.0005402043 0.2640000 0.002046228 118.58259 33
## [3] {The Lord of the Rings: The Two Towers} => {The Lord of the Rings: The Fellowship of the Ring} 0.0005729439 0.2482270 0.002308146 86.64965 35
## [4] {Shrek the Third} => {Shrek 2} 0.0005074646 0.2480000 0.002046228 62.86234 31
## [5] {Iron Man} => {Iron Man 2} 0.0005402043 0.2426471 0.002226296 118.58259 33
## [6] {Kill Bill: Vol. 1} => {Kill Bill: Vol. 2} 0.0005238345 0.2176871 0.002406365 138.52154 32
## [7] {The Lord of the Rings: The Fellowship of the Ring} => {The Lord of the Rings: The Two Towers} 0.0005729439 0.2000000 0.002864720 86.64965 35
## [8] {Shrek 2} => {Shrek} 0.0007202724 0.1825726 0.003945128 35.29429 44
## [9] {Bird Box} => {Black Mirror: Bandersnatch} 0.0014896543 0.1823647 0.008168544 10.76357 91
## [10] {Shrek} => {Shrek 2} 0.0007202724 0.1392405 0.005172865 35.29429 44
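The two Kill Bill rules share the same lift but have different confidences. A short calculation using only the reported values shows why: lift is symmetric in the two sides of a rule, while confidence depends on which movie is the antecedent. The user total of 61,088 is an assumption, inferred as count / support (32 / 0.0005238345).

```r
# Values copied from the rules3 output; n_users is inferred, not reported
n_users   <- 61088
count_xy  <- 32          # users who viewed both Kill Bill volumes
conf_2to1 <- 0.3333333   # {Kill Bill: Vol. 2} => {Kill Bill: Vol. 1}
conf_1to2 <- 0.2176871   # {Kill Bill: Vol. 1} => {Kill Bill: Vol. 2}

# confidence = count(both) / count(lhs), so count(lhs) = count(both) / confidence
count_vol2 <- round(count_xy / conf_2to1)
count_vol1 <- round(count_xy / conf_1to2)

# lift = P(both) / (P(lhs) * P(rhs)); swapping lhs and rhs gives the same value
lift <- (count_xy / n_users) / ((count_vol1 / n_users) * (count_vol2 / n_users))

print(c(count_vol1 = count_vol1, count_vol2 = count_vol2, lift = lift))
```

Fewer users viewed Vol. 2 than Vol. 1, so the same 32 shared viewers make up a larger share of the Vol. 2 audience, which is why the rule starting from Vol. 2 has the higher confidence.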
Movies that have more than one part are also called sequels or franchises. They continue the story or characters from a previous movie. Some examples of movies that have more than one part are Iron Man, Shrek, Kill Bill and The Lord of the Rings.
Some possible reasons why these movies are most likely viewed together are: