Building a recommendation system for Netflix

Name: Hamed Ahmed Hamed Ahmed

Student Id: 454827

Problem Statement

Build a recommendation system for Netflix that recommends movies based on user behaviour on the Netflix platform, using association rules.

What are Association Rules:

Association rule mining is a popular unsupervised learning technique for discovering interesting relationships between variables or items in a dataset.

Goal of Association Rules:

The goal of association rules in unsupervised learning is to identify interesting patterns or associations that are not immediately apparent in the raw data. This is useful for discovering relationships between variables in a wide range of applications, such as:

* market basket analysis
* recommendation systems
* customer behaviour analysis
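The strength of a rule X ⇒ Y is usually described by three standard measures, which are also the quantities reported by the `arules` package used later. As a short reference, with T denoting the set of all transactions (here, one transaction per user):

```latex
\[
\mathrm{supp}(X) = \frac{|\{t \in T : X \subseteq t\}|}{|T|}, \qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad
\mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{conf}(X \Rightarrow Y)}{\mathrm{supp}(Y)}
\]
```

A lift above 1 means that X and Y appear together more often than would be expected if they were independent.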

Dataset description:

The dataset covers user behaviour on Netflix for users in the UK who opted in to have their anonymized browsing activity tracked. It only includes desktop and laptop activity (which Netflix estimates is around 25% of global traffic) and covers a fixed window of time (January 2017 to June 2019, inclusive). It documents each time someone in the tracked UK panel clicked on a Netflix.com/watch URL for a movie.

Dataset source:

From: Kaggle URL: https://www.kaggle.com/datasets/vodclickstream/netflix-audience-behaviour-uk-movies

Dataset Structure

X : RowId
datetime : Date and time of the click
duration : Time between this click and the user’s next tracked click on Netflix.com, in seconds
title : Movie title
genres : Movie genre(s)
release_date: Movie’s original theatrical release date (not when it first appeared on Netflix)
movie_id : title ID
user_id : user ID

Load Libraries

library(ggplot2)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 1.0.0 
## ✔ purrr   1.0.1      
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tm)

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

library(cluster)
library(mclust)

## Package 'mclust' version 6.0.0
## Type 'citation("mclust")' for citing this R package in publications.
## 
## Attaching package: 'mclust'
## 
## The following object is masked from 'package:purrr':
## 
##     map

library(arules)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## 
## Attaching package: 'arules'
## 
## The following object is masked from 'package:tm':
## 
##     inspect
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)
library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Load the dataset

full_data <- read.csv("netflix.csv",header=T, sep =",")

Quick look at the data

head(full_data,10)

##        X            datetime duration                              title
## 1  58773 2017-01-01 01:15:09        0 Angus, Thongs and Perfect Snogging
## 2  58774 2017-01-01 13:56:02        0       The Curse of Sleeping Beauty
## 3  58775 2017-01-01 15:17:47    10530                  London Has Fallen
## 4  58776 2017-01-01 16:04:13       49                           Vendetta
## 5  58777 2017-01-01 19:16:37        0    The SpongeBob SquarePants Movie
## 6  58778 2017-01-01 19:21:37        0                  London Has Fallen
## 7  58779 2017-01-01 19:43:06     4903                  The Water Diviner
## 8  58780 2017-01-01 19:44:38        0                 Angel of Christmas
## 9  58781 2017-01-01 19:46:24     3845                             Ratter
## 10 58782 2017-01-01 20:27:04        0                   The Book of Life
##                                                             genres release_date
## 1                                           Comedy, Drama, Romance   2008-07-25
## 2                               Fantasy, Horror, Mystery, Thriller   2016-06-02
## 3                                                 Action, Thriller   2016-03-04
## 4                                                    Action, Drama   2015-06-12
## 5            Animation, Action, Adventure, Comedy, Family, Fantasy   2004-11-19
## 6                                                 Action, Thriller   2016-03-04
## 7                                              Drama, History, War   2014-12-26
## 8                                                  Comedy, Romance   2015-11-29
## 9                                          Drama, Horror, Thriller   2016-02-12
## 10 Animation, Adventure, Comedy, Family, Fantasy, Musical, Romance   2014-10-17
##      movie_id    user_id
## 1  26bd5987e8 1dea19f6fe
## 2  f26ed2675e 544dcbc510
## 3  f77e500e7a 7cbcc791bf
## 4  c74aec7673 ebf43c36b6
## 5  a80d6fc2aa a57c992287
## 6  f77e500e7a c5bf4f3f57
## 7  7165c2fc94 8e1be40e32
## 8  b2f02f2689 892a51dee1
## 9  c39aae36c3 cff8ea652a
## 10 97183b9136 bf53608c70

Structure of the data

str(full_data)

## 'data.frame':    671736 obs. of  8 variables:
##  $ X           : int  58773 58774 58775 58776 58777 58778 58779 58780 58781 58782 ...
##  $ datetime    : chr  "2017-01-01 01:15:09" "2017-01-01 13:56:02" "2017-01-01 15:17:47" "2017-01-01 16:04:13" ...
##  $ duration    : num  0 0 10530 49 0 ...
##  $ title       : chr  "Angus, Thongs and Perfect Snogging" "The Curse of Sleeping Beauty" "London Has Fallen" "Vendetta" ...
##  $ genres      : chr  "Comedy, Drama, Romance" "Fantasy, Horror, Mystery, Thriller" "Action, Thriller" "Action, Drama" ...
##  $ release_date: chr  "2008-07-25" "2016-06-02" "2016-03-04" "2015-06-12" ...
##  $ movie_id    : chr  "26bd5987e8" "f26ed2675e" "f77e500e7a" "c74aec7673" ...
##  $ user_id     : chr  "1dea19f6fe" "544dcbc510" "7cbcc791bf" "ebf43c36b6" ...

Data Quality Check and EDA

# check the summary statistics of the dataset and look for columns with missing values
summary(full_data)

##        X            datetime            duration           title          
##  Min.   : 58773   Length:671736      Min.   :      -1   Length:671736     
##  1st Qu.:226707   Class :character   1st Qu.:       0   Class :character  
##  Median :394641   Mode  :character   Median :      14   Mode  :character  
##  Mean   :394641                      Mean   :   33476                     
##  3rd Qu.:562574                      3rd Qu.:    6672                     
##  Max.   :730508                      Max.   :18237253                     
##     genres          release_date         movie_id           user_id         
##  Length:671736      Length:671736      Length:671736      Length:671736     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##

# Counting number of missing values 
sum(is.na(full_data))

## [1] 0

# Dimension
dim(full_data)

## [1] 671736      8

Comment

There are no missing values in the dataset.
The mean duration after a click is 33,476 seconds, while the median is only 14 seconds, so let’s check the distribution of the duration variable.
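A quick numeric check can complement the plots. The sketch below uses made-up durations (not values from the dataset) and compares the mean with the median, then computes a simple moment-based skewness; the variable names are illustrative only.

```r
# Toy durations in seconds (hypothetical values, not taken from the dataset)
duration <- c(0, 0, 0, 14, 49, 600, 4903, 10530, 250000)

# A mean far above the median signals a long right tail
mean_dur <- mean(duration)
median_dur <- median(duration)

# Moment-based sample skewness: positive values indicate right skew
centered <- duration - mean_dur
skewness <- mean(centered^3) / mean(centered^2)^1.5

print(c(mean = mean_dur, median = median_dur, skewness = skewness))
```

Applied to `full_data$duration`, the same lines would quantify the skew that the histogram and boxplot show visually.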

# to show the numbers in normal format
options(scipen = 999)

ggplot(full_data, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(full_data$duration, 
     xlab = "Duration (seconds)",
     col = 8,
     boxlty = 1,
     whisklty = 2,
     whisklwd = 1.5,
     staplelwd = 1.5,
     horizontal = TRUE)

Comment

The distribution is strongly skewed to the right: most values are concentrated on the left-hand side, and there are outliers.
Let’s take a subset of the data to clean it up.

Doing data cleaning

data_subset <- full_data[full_data$duration >360 , ]

#check the dimension
dim(data_subset)

## [1] 300049      8

ggplot(data_subset, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_subset$duration, 
     xlab = "Duration (seconds)",
     col = 8,
     boxlty = 1,
     whisklty = 2,
     whisklwd = 1.5,
     staplelwd = 1.5,
     horizontal = TRUE)

Comment

Interestingly, keeping only the clicks with a duration above 360 seconds discards more than half of the data (300,049 of 671,736 rows remain). Most clicks are therefore very short: many users open a title and leave almost immediately.
There are still some outliers, so let’s take another subset and also cap the duration at 100,000 seconds.

data_subset <- full_data[full_data$duration >360 & full_data$duration < 100000 , ]

#check the dimension
dim(data_subset)

## [1] 257318      8

ggplot(data_subset, aes((duration))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration Distribution')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_subset$duration, 
     xlab = "Duration (seconds)",
     col = 8,
     boxlty = 1,
     whisklty = 2,
     whisklwd = 1.5,
     staplelwd = 1.5,
     horizontal = TRUE)

Comment

Now the data is cleaner. It is still highly skewed to the right, but much better than before.
Let’s add another column with the duration in hours.
Some outliers remain, so we will then restrict the duration to the range from 0.5 to 4 hours.

data_subset$duration_hrs = data_subset$duration / (60*60)
str(data_subset)

## 'data.frame':    257318 obs. of  9 variables:
##  $ X           : int  58775 58779 58781 58784 58786 58787 58792 58793 58796 58800 ...
##  $ datetime    : chr  "2017-01-01 15:17:47" "2017-01-01 19:43:06" "2017-01-01 19:46:24" "2017-01-01 20:55:46" ...
##  $ duration    : num  10530 4903 3845 6175 38120 ...
##  $ title       : chr  "London Has Fallen" "The Water Diviner" "Ratter" "28 Days" ...
##  $ genres      : chr  "Action, Thriller" "Drama, History, War" "Drama, Horror, Thriller" "Comedy, Drama" ...
##  $ release_date: chr  "2016-03-04" "2014-12-26" "2016-02-12" "2000-04-14" ...
##  $ movie_id    : chr  "f77e500e7a" "7165c2fc94" "c39aae36c3" "584bffaf5f" ...
##  $ user_id     : chr  "7cbcc791bf" "8e1be40e32" "cff8ea652a" "759ae2eac9" ...
##  $ duration_hrs: num  2.92 1.36 1.07 1.72 10.59 ...

Look at the new column

Let’s keep only the rows with a duration between 0.5 and 4 hours (0.5 <= duration_hrs <= 4).

head(data_subset,10)

##        X            datetime duration                           title
## 3  58775 2017-01-01 15:17:47    10530               London Has Fallen
## 7  58779 2017-01-01 19:43:06     4903               The Water Diviner
## 9  58781 2017-01-01 19:46:24     3845                          Ratter
## 12 58784 2017-01-01 20:55:46     6175                         28 Days
## 14 58786 2017-01-01 21:33:26    38120 The SpongeBob SquarePants Movie
## 15 58787 2017-01-01 21:37:41     7799             Beasts of No Nation
## 20 58792 2017-01-01 00:19:40    54195                About Last Night
## 21 58793 2017-01-01 00:49:03    44413                      Fight Club
## 24 58796 2017-01-01 11:05:46      621     Joe and Caspar Hit the Road
## 28 58800 2017-01-01 16:05:02      581                        Vendetta
##                                                   genres release_date
## 3                                       Action, Thriller   2016-03-04
## 7                                    Drama, History, War   2014-12-26
## 9                                Drama, Horror, Thriller   2016-02-12
## 12                                         Comedy, Drama   2000-04-14
## 14 Animation, Action, Adventure, Comedy, Family, Fantasy   2004-11-19
## 15                                            Drama, War   2015-10-16
## 20                                       Comedy, Romance   2014-02-14
## 21                                                 Drama   1999-10-15
## 24                        Documentary, Adventure, Comedy   2015-11-23
## 28                                         Action, Drama   2015-06-12
##      movie_id    user_id duration_hrs
## 3  f77e500e7a 7cbcc791bf    2.9250000
## 7  7165c2fc94 8e1be40e32    1.3619444
## 9  c39aae36c3 cff8ea652a    1.0680556
## 12 584bffaf5f 759ae2eac9    1.7152778
## 14 a80d6fc2aa 5b1727dc12   10.5888889
## 15 c57e11da52 3142b4c730    2.1663889
## 20 f7d088d208 78cdb81c4f   15.0541667
## 21 338abadc17 ac30a85c52   12.3369444
## 24 416464eaad 7726b5615e    0.1725000
## 28 c74aec7673 ebf43c36b6    0.1613889

data_subset <- data_subset[data_subset$duration_hrs >= 0.5 & data_subset$duration_hrs <= 4 , ]

dim(data_subset)

## [1] 129762      9
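The cut-offs used above were chosen by inspecting the plots. A common data-driven alternative, which is not the method used in this report, is Tukey's IQR rule: values further than 1.5 times the interquartile range from the quartiles are treated as outliers. A minimal sketch on made-up durations in hours:

```r
# Toy durations in hours (hypothetical values, not taken from the dataset)
duration_hrs <- c(0.2, 0.6, 1.1, 1.4, 1.7, 2.2, 2.9, 10.6, 15.1)

# Quartiles and interquartile range
q <- unname(quantile(duration_hrs, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Tukey fences: anything outside [lower, upper] is flagged as an outlier
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

kept <- duration_hrs[duration_hrs >= lower & duration_hrs <= upper]
print(kept)
```

For heavily skewed data such as this one, the fences can still be wide, so a domain-based limit (a plausible movie length) may remain the more sensible choice.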

ggplot(data_subset, aes((duration_hrs))) + geom_histogram(colour="darkblue", fill="lightblue") + theme_classic() + ggtitle('Duration in Hours Distribution')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

boxplot(data_subset$duration_hrs, 
     xlab = "Duration in Hours",
     col = 8,
     boxlty = 1,
     whisklty = 2,
     whisklwd = 1.5,
     staplelwd = 1.5,
     horizontal = TRUE)

head(data_subset)

##        X            datetime duration               title
## 3  58775 2017-01-01 15:17:47    10530   London Has Fallen
## 7  58779 2017-01-01 19:43:06     4903   The Water Diviner
## 9  58781 2017-01-01 19:46:24     3845              Ratter
## 12 58784 2017-01-01 20:55:46     6175             28 Days
## 15 58787 2017-01-01 21:37:41     7799 Beasts of No Nation
## 32 58804 2017-01-01 17:21:04     2400            Clueless
##                     genres release_date   movie_id    user_id duration_hrs
## 3         Action, Thriller   2016-03-04 f77e500e7a 7cbcc791bf    2.9250000
## 7      Drama, History, War   2014-12-26 7165c2fc94 8e1be40e32    1.3619444
## 9  Drama, Horror, Thriller   2016-02-12 c39aae36c3 cff8ea652a    1.0680556
## 12           Comedy, Drama   2000-04-14 584bffaf5f 759ae2eac9    1.7152778
## 15              Drama, War   2015-10-16 c57e11da52 3142b4c730    2.1663889
## 32         Comedy, Romance   1995-07-19 fc8d2d5cbc 614abddbe8    0.6666667

check the Genres categories

genresCategories <- unique(unlist(str_split(data_subset$genres, "\\, ")))
length(genresCategories)

## [1] 27

genresCategories

##  [1] "Action"        "Thriller"      "Drama"         "History"      
##  [5] "War"           "Horror"        "Comedy"        "Romance"      
##  [9] "Animation"     "Adventure"     "Family"        "Fantasy"      
## [13] "Documentary"   "Biography"     "Western"       "Mystery"      
## [17] "Music"         "Sci-Fi"        "Crime"         "Sport"        
## [21] "NOT AVAILABLE" "Musical"       "News"          "Short"        
## [25] "Film-Noir"     "Reality-TV"    "Talk-Show"

There are 27 distinct genre labels, including the placeholder "NOT AVAILABLE".
Let’s group the clicks by genre combination.

genre_groups = data_subset %>% group_by(genres)%>% arrange(duration_hrs)%>%
  summarise(moviesCount = n())  %>%
  arrange(desc(moviesCount))
str(data_subset)

## 'data.frame':    129762 obs. of  9 variables:
##  $ X           : int  58775 58779 58781 58784 58787 58804 58806 58809 58812 58813 ...
##  $ datetime    : chr  "2017-01-01 15:17:47" "2017-01-01 19:43:06" "2017-01-01 19:46:24" "2017-01-01 20:55:46" ...
##  $ duration    : num  10530 4903 3845 6175 7799 ...
##  $ title       : chr  "London Has Fallen" "The Water Diviner" "Ratter" "28 Days" ...
##  $ genres      : chr  "Action, Thriller" "Drama, History, War" "Drama, Horror, Thriller" "Comedy, Drama" ...
##  $ release_date: chr  "2016-03-04" "2014-12-26" "2016-02-12" "2000-04-14" ...
##  $ movie_id    : chr  "f77e500e7a" "7165c2fc94" "c39aae36c3" "584bffaf5f" ...
##  $ user_id     : chr  "7cbcc791bf" "8e1be40e32" "cff8ea652a" "759ae2eac9" ...
##  $ duration_hrs: num  2.92 1.36 1.07 1.72 2.17 ...

genre_groups

## # A tibble: 972 × 2
##    genres                    moviesCount
##    <chr>                           <int>
##  1 Comedy                           5914
##  2 Documentary                      5416
##  3 Comedy, Romance                  5390
##  4 Comedy, Drama, Romance           5318
##  5 Action, Adventure, Sci-Fi        3748
##  6 NOT AVAILABLE                    3735
##  7 Comedy, Drama                    2977
##  8 Drama, Romance                   2547
##  9 Action, Thriller                 2002
## 10 Drama                            1921
## # … with 962 more rows
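The table above counts each distinct combination of genres (972 of them) rather than each single genre. To count single genres, the comma-separated strings can be split first. A minimal base-R sketch on a few made-up rows:

```r
# Toy genre strings in the same "A, B, C" format as the genres column
genres <- c("Comedy, Drama, Romance", "Action, Thriller", "Comedy", "Drama")

# Split every string on ", " and flatten into one vector of single genres
single_genres <- unlist(strsplit(genres, ", ", fixed = TRUE))

# Count how often each single genre occurs, most frequent first
genre_counts <- sort(table(single_genres), decreasing = TRUE)
print(genre_counts)
```

On the real data, `genres` would be replaced by `data_subset$genres`.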

Visualize the genre distribution

ggplot(genre_groups, aes(x = genres, y = moviesCount, fill = moviesCount)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  xlab("genres") + 
  ylab("Count") +
  ggtitle("Count of movies per each genre") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

Show the top 20 genre combinations by movie count

genre_groups_top10 = genre_groups %>% slice_max(n = 20, order_by = moviesCount)

ggplot(genre_groups_top10, aes(x = genres, y = moviesCount, fill = moviesCount)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  xlab("genres") + 
  ylab("Count") +
  ggtitle("Count of movies per each genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Comment

As we can see, a large number of clicks (3,735) belong to movies with no genre available.
Most of the activity in this dataset involves Comedy, Drama, Romance and Documentary titles.

Show the top 20 movies by number of views

Let’s group the data by movie title and count the views.

movies_groups = data_subset %>% group_by(title)%>%
  summarise(userCount = n())  %>%
  arrange(desc(userCount))

movies_groups

## # A tibble: 5,683 × 2
##    title                                        userCount
##    <chr>                                            <int>
##  1 Black Mirror: Bandersnatch                        1174
##  2 Bright                                             679
##  3 Bird Box                                           539
##  4 Annihilation                                       500
##  5 The Hitman's Bodyguard                             486
##  6 FYRE: The Greatest Party That Never Happened       483
##  7 Avengers: Age of Ultron                            481
##  8 To All the Boys I've Loved Before                  447
##  9 Deadpool                                           439
## 10 Hot Fuzz                                           424
## # … with 5,673 more rows

movies_groups_top10_by_users = movies_groups %>% slice_max(n = 20, order_by = userCount)

movies_groups_top10_by_users

## # A tibble: 21 × 2
##    title                                        userCount
##    <chr>                                            <int>
##  1 Black Mirror: Bandersnatch                        1174
##  2 Bright                                             679
##  3 Bird Box                                           539
##  4 Annihilation                                       500
##  5 The Hitman's Bodyguard                             486
##  6 FYRE: The Greatest Party That Never Happened       483
##  7 Avengers: Age of Ultron                            481
##  8 To All the Boys I've Loved Before                  447
##  9 Deadpool                                           439
## 10 Hot Fuzz                                           424
## # … with 11 more rows

ggplot(movies_groups_top10_by_users, aes(x = title, y = userCount, fill = userCount)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  xlab("title") + 
  ylab("Count") +
  ggtitle("Number of views per movie") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Comment

As we can see, these are the five most viewed movies:

Black Mirror: Bandersnatch
Bright
Bird Box
Annihilation
The Hitman's Bodyguard

Let’s check which genres are the most viewed in terms of total viewing duration.

genre_groups_by_duration = data_subset %>% group_by(genres)%>% arrange(duration_hrs)%>%
  summarise(TotalDuration = sum(duration_hrs))  

genre_groups_by_duration

## # A tibble: 972 × 2
##    genres                                                      TotalDuration
##    <chr>                                                               <dbl>
##  1 Action                                                            260.   
##  2 Action, Adventure                                                  96.9  
##  3 Action, Adventure, Biography, Drama, History                       50.6  
##  4 Action, Adventure, Biography, Drama, Romance, Thriller             71.2  
##  5 Action, Adventure, Biography, Drama, Thriller                      80.0  
##  6 Action, Adventure, Comedy                                         896.   
##  7 Action, Adventure, Comedy, Crime                                  352.   
##  8 Action, Adventure, Comedy, Crime, Drama                             0.667
##  9 Action, Adventure, Comedy, Crime, Family, Mystery                  10.9  
## 10 Action, Adventure, Comedy, Crime, Family, Romance, Thriller        18.7  
## # … with 962 more rows

ggplot(genre_groups_by_duration, aes(x = genres, y = TotalDuration, fill = TotalDuration)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  xlab("genres") + 
  ylab("TotalDuration") +
  ggtitle("Total Duration  per each genre") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1))

Show the top 20 genre combinations by total duration

genre_groups_top10_by_duration = genre_groups_by_duration %>% slice_max(n = 20, order_by = TotalDuration)

ggplot(genre_groups_top10_by_duration, aes(x = genres, y = TotalDuration, fill = TotalDuration)) + 
  geom_bar(stat = "identity", position = "dodge") + 
  xlab("genres") + 
  ylab("Total duration (hours)") +
  ggtitle("Total duration per genre") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Comment

Grouping the genres by total viewing duration gives much the same picture as grouping them by movie count.

Let’s Apply Association Rules using the Apriori Algorithm

Assumption

If a user spends time on two movies, X and Y, there may be a connection or similarity between them. This could imply that users who spend time on movie X are also likely to check out movie Y, and vice versa.

Based on this assumption, we can use this information to help establish a user’s choice behavior. For example, if a user has watched movie X, we can recommend movie Y to them, assuming that they are likely to enjoy it. Similarly, if a user has watched movie Y, we can recommend movie X to them.

However, it’s worth noting that there could be other factors influencing user behavior and preferences that are not captured by this approach. For instance, a user may watch movie X and Y together simply because they were recommended together or because they belong to the same genre, rather than because of an underlying similarity between the two items. Therefore, while this assumption can be useful in making recommendations, it should be combined with other techniques and strategies to ensure that the recommendations are accurate and relevant to the user’s interests.

To Apply Apriori we should follow these steps:

Prepare the data by converting it into a transaction format.
Set minimum support threshold.
Generate frequent itemsets.
Generate association rules.
Evaluate and interpret the results.
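The steps above can be sketched end to end on a toy example before running them on the real data. The users, titles and thresholds below are made up for illustration; the functions are the same `arules` calls used in this report, plus `subset()` to look up rules for one watched title.

```r
library(arules)

# Step 1: toy transactions, one basket of viewed titles per (hypothetical) user
toy <- list(
  u1 = c("Shrek", "Shrek 2"),
  u2 = c("Shrek", "Shrek 2", "Bright"),
  u3 = c("Bright"),
  u4 = c("Shrek", "Bright")
)
toy_trans <- as(toy, "transactions")

# Steps 2-4: set thresholds; apriori() finds frequent itemsets and derives rules
toy_rules <- apriori(
  toy_trans,
  parameter = list(supp = 0.5, conf = 0.6, minlen = 2),
  control = list(verbose = FALSE)
)

# Step 5: rank the rules by lift and inspect them
inspect(sort(toy_rules, by = "lift"))

# Candidate recommendations for a user who watched "Shrek":
# keep the rules with "Shrek" on the left-hand side and collect the right-hand sides
shrek_rules <- subset(toy_rules, subset = lhs %in% "Shrek")
recommended <- unique(unlist(as(rhs(shrek_rules), "list")))
print(recommended)
```

The thresholds are deliberately high because the toy set has only four users; the real data needs much lower support values, as discussed in the following sections.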

# get the viewed movies
viewed_movies <- data_subset %>% 
  group_by(user_id) %>%
  summarise(movies = as.vector(list(title)))

# compute transactions
transactions <- as(viewed_movies$movies, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions
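The warning appears because some users clicked the same title more than once, and a transaction can contain each item only once. A small base-R sketch with made-up clicks shows the equivalent deduplication done explicitly:

```r
# Toy click log (hypothetical users and titles); u1 clicked "Bright" twice
clicks <- data.frame(
  user_id = c("u1", "u1", "u1", "u2"),
  title   = c("Bright", "Bright", "Bird Box", "Bright"),
  stringsAsFactors = FALSE
)

# One basket per user, keeping each title only once
baskets <- lapply(split(clicks$title, clicks$user_id), unique)

# Number of distinct titles per user
basket_sizes <- lengths(baskets)
print(basket_sizes)
```

Deduplicating up front in this way would avoid the warning without changing the resulting transactions.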

Analyzing the Movies

Let’s first analyse the number of movies viewed per user.

hist(size(transactions), breaks = 0:100, xaxt="n", ylim=c(0,5000), 
     main = "", xlab = "#Movies")
axis(1, at=seq(0,160,by=10), cex.axis=1)
mtext(paste("Total:", length(transactions), "Users who viewed movies,", sum(size(transactions)), "Movies"))

Next, let’s determine which movies are frequent. We set the support threshold to 0.01, which means a movie is considered frequent if and only if at least 1 percent of all users viewed it. With roughly 61,000 users in the transaction set, a movie is therefore frequent if it was viewed by more than about 611 users.

movies_frequencies <- itemFrequency(transactions, type="a")
support <- 0.01
freq_movies <- sort(movies_frequencies, decreasing = F)
freq_movies <- freq_movies[freq_movies>support*length(transactions)]

par(mar=c(2,10,2,2)); options(scipen=5)

barplot(freq_movies, horiz=T, las=3, main="Frequent movies", cex.names=.8, xlim=c(0,1000))
mtext(paste("support:",support), padj = .5)
abline(v=support*length(transactions), col="red")

Comment

The ranking shows that only two movies (Bright and Black Mirror: Bandersnatch) were viewed by more than about 611 users.
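The relation between a relative support threshold and an absolute number of users is simple arithmetic. In the sketch below, the number of users is an assumption: it is back-calculated from the rule output later in this report, where a count of 4 corresponds to a support of 0.00006547931.

```r
# Assumption: number of transactions (users), inferred as 4 / 0.00006547931
n_users <- 61088

# Relative support threshold used for single movies
support <- 0.01

# Minimum number of users a movie needs to be considered frequent
min_users <- support * n_users
print(round(min_users))

# The same conversion for the much lower threshold used for movie pairs
pair_support <- 0.0005
min_users_pairs <- pair_support * n_users
print(round(min_users_pairs, 1))
```

In the report's own code, `length(transactions)` plays the role of `n_users`.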

Frequent Movies

Now let’s compute the frequent itemsets. We decrease the support threshold to account for the small probability of observing a frequent set of at least two movies per user.

support <- 0.0005
movieSets <- apriori(transactions, parameter = list(target = "frequent itemsets", supp=support, minlen=2), control = list(verbose = FALSE))

par(mar=c(5,18,2,2)+.1)
sets_order_supp <- DATAFRAME(sort(movieSets, by="support", decreasing = F))

barplot(sets_order_supp$support, names.arg=sets_order_supp$items, horiz = T, las = 2, cex.names = .6, main = "Frequent Viewed Movies")

mtext(paste("support:",support), padj = .8)

Comment

First, with a support threshold of 0.0005 (~30 users), we observe frequent pairs only. Second, users tend to view movies that belong to a series, such as (Iron Man, Iron Man 2), (Kill Bill Vol 1, Kill Bill Vol 2) and (Shrek, Shrek 2). I assume that a user who viewed one part of a series will most likely want to view the other parts.

Association Rules

Let’s generate association rules. First, we use a low support threshold and a high confidence threshold to obtain strong rules even for movies that are viewed less frequently.

rules1 <- apriori(transactions, parameter = list(supp = 0.00005, conf = 0.6, maxlen=3), control = list(verbose = FALSE)) 
summary(quality(rules1))

##     support             confidence        coverage               lift        
##  Min.   :0.00006548   Min.   :0.6250   Min.   :0.00006548   Min.   :  36.89  
##  1st Qu.:0.00006548   1st Qu.:0.6667   1st Qu.:0.00008185   1st Qu.: 210.65  
##  Median :0.00006548   Median :0.8000   Median :0.00008185   Median : 465.43  
##  Mean   :0.00007396   Mean   :0.8045   Mean   :0.00009569   Mean   : 822.23  
##  3rd Qu.:0.00008185   3rd Qu.:1.0000   3rd Qu.:0.00009822   3rd Qu.: 896.93  
##  Max.   :0.00019644   Max.   :1.0000   Max.   :0.00029466   Max.   :8145.07  
##      count       
##  Min.   : 4.000  
##  1st Qu.: 4.000  
##  Median : 4.000  
##  Mean   : 4.518  
##  3rd Qu.: 5.000  
##  Max.   :12.000

plot(rules1)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

Comment

Some rules have a very high lift, indicating a strong association between the movies. Let’s investigate those rules further.

inspect(sort(rules1, by="lift")[1:10])

##      lhs                                                     rhs                                                       support confidence      coverage     lift count
## [1]  {Mystery Science Theater 3000: Future War,                                                                                                                       
##       Mystery Science Theater 3000: I Accuse My Parents}  => {Mystery Science Theater 3000: Werewolf}            0.00006547931  0.8000000 0.00008184914 8145.067     4
## [2]  {Mystery Science Theater 3000: Future War,                                                                                                                       
##       Mystery Science Theater 3000: Werewolf}             => {Mystery Science Theater 3000: I Accuse My Parents} 0.00006547931  1.0000000 0.00006547931 5553.455     4
## [3]  {Mystery Science Theater 3000: Laserblast}           => {Mystery Science Theater 3000: I Accuse My Parents} 0.00011458879  0.8750000 0.00013095862 4859.273     7
## [4]  {Mystery Science Theater 3000: I Accuse My Parents}  => {Mystery Science Theater 3000: Laserblast}          0.00011458879  0.6363636 0.00018006810 4859.273     7
## [5]  {Mystery Science Theater 3000: Werewolf}             => {Mystery Science Theater 3000: I Accuse My Parents} 0.00008184914  0.8333333 0.00009821896 4627.879     5
## [6]  {Mystery Science Theater 3000: I Accuse My Parents,                                                                                                              
##       Mystery Science Theater 3000: Werewolf}             => {Mystery Science Theater 3000: Future War}          0.00006547931  0.8000000 0.00008184914 3490.743     4
## [7]  {Mystery Science Theater 3000: Werewolf}             => {Mystery Science Theater 3000: Future War}          0.00006547931  0.6666667 0.00009821896 2908.952     4
## [8]  {Fat, Sick & Nearly Dead 2}                          => {Fat, Sick and Nearly Dead}                         0.00006547931  0.6666667 0.00009821896 2545.333     4
## [9]  {A Christmas Prince,                                                                                                                                             
##       A Wish For Christmas}                               => {Once Upon a Holiday}                               0.00006547931  0.6666667 0.00009821896 1696.889     4
## [10] {Extraction,                                                                                                                                                     
##       Gods of Egypt}                                      => {London Heist}                                      0.00006547931  0.8000000 0.00008184914 1629.013     4

inspect(sort(rules1, by="confidence")[1:10])

##      lhs                                            rhs                                                       support confidence      coverage      lift count
## [1]  {Mystery Science Theater 3000: Future War,                                                                                                               
##       Mystery Science Theater 3000: Werewolf}    => {Mystery Science Theater 3000: I Accuse My Parents} 0.00006547931          1 0.00006547931 5553.4545     4
## [2]  {Bean: The Ultimate Disaster Movie,                                                                                                                      
##       Kung Fu Panda 2}                           => {Mr. Bean's Holiday}                                0.00006547931          1 0.00006547931  985.2903     4
## [3]  {Scary Movie 3,                                                                                                                                          
##       Scary Movie 5}                             => {Scary Movie 2}                                     0.00006547931          1 0.00006547931  744.9756     4
## [4]  {Scary Movie 2,                                                                                                                                          
##       Scary Movie 5}                             => {Scary Movie 3}                                     0.00006547931          1 0.00006547931  872.6857     4
## [5]  {Mr. Bean's Holiday,                                                                                                                                     
##       Rush Hour 2}                               => {Rush Hour 3}                                       0.00006547931          1 0.00006547931  526.6207     4
## [6]  {Kung Fu Panda 2,                                                                                                                                        
##       Rush Hour 2}                               => {Rush Hour 3}                                       0.00006547931          1 0.00006547931  526.6207     4
## [7]  {Kung Fu Panda 2,                                                                                                                                        
##       Rush Hour 3}                               => {Rush Hour 2}                                       0.00006547931          1 0.00006547931 1388.3636     4
## [8]  {Kung Fu Panda,                                                                                                                                          
##       Rush Hour 2}                               => {Rush Hour 3}                                       0.00006547931          1 0.00006547931  526.6207     4
## [9]  {Kung Fu Panda 2,                                                                                                                                        
##       Rush Hour 2}                               => {Kung Fu Panda}                                     0.00006547931          1 0.00006547931  481.0079     4
## [10] {Kung Fu Panda,                                                                                                                                          
##       Rush Hour 2}                               => {Kung Fu Panda 2}                                   0.00006547931          1 0.00006547931  581.7905     4

comment

These rules mostly link closely related titles (sequels, or episodes of the same series) that were watched together, not bought together. Every rule listed above is backed by only 4 transactions, so a confidence of 1 should be read with caution.

rules2 <- apriori(transactions, parameter = list(supp = 0.0001, conf = 0.4, maxlen=3), control = list(verbose = FALSE)) 
summary(quality(rules2))

##     support            confidence        coverage              lift       
##  Min.   :0.0001146   Min.   :0.4118   Min.   :0.0001310   Min.   : 113.2  
##  1st Qu.:0.0001310   1st Qu.:0.4387   1st Qu.:0.0002701   1st Qu.: 291.0  
##  Median :0.0001801   Median :0.5000   Median :0.0003110   Median : 347.1  
##  Mean   :0.0001684   Mean   :0.5168   Mean   :0.0003363   Mean   : 585.8  
##  3rd Qu.:0.0001964   3rd Qu.:0.5639   3rd Qu.:0.0004092   3rd Qu.: 399.3  
##  Max.   :0.0002128   Max.   :0.8750   Max.   :0.0004911   Max.   :4859.3  
##      count      
##  Min.   : 7.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.29  
##  3rd Qu.:12.00  
##  Max.   :13.00

plot(rules2)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

inspect(sort(rules2, by="lift")[1:10])

##      lhs                                                         rhs                                                      support confidence     coverage      lift count
## [1]  {Mystery Science Theater 3000: Laserblast}               => {Mystery Science Theater 3000: I Accuse My Parents} 0.0001145888  0.8750000 0.0001309586 4859.2727     7
## [2]  {Mystery Science Theater 3000: I Accuse My Parents}      => {Mystery Science Theater 3000: Laserblast}          0.0001145888  0.6363636 0.0001800681 4859.2727     7
## [3]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       The Twilight Saga: New Moon}                            => {The Twilight Saga: Breaking Dawn: Part 1}          0.0001964379  0.7058824 0.0002782871  479.1216    12
## [4]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       The Twilight Saga: New Moon}                            => {The Twilight Saga: Eclipse}                        0.0001636983  0.5882353 0.0002782871  438.2209    10
## [5]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       Twilight}                                               => {The Twilight Saga: Breaking Dawn: Part 1}          0.0001473284  0.6428571 0.0002291776  436.3429     9
## [6]  {13TH: A Conversation with Oprah Winfrey & Ava DuVernay} => {13TH}                                              0.0002128078  0.5652174 0.0003765060  426.2716    13
## [7]  {The Hangover,                                                                                                                                                      
##       The Hangover Part III}                                  => {The Hangover Part II}                              0.0001800681  0.6875000 0.0002619172  424.2222    11
## [8]  {The Twilight Saga: Breaking Dawn: Part 1,                                                                                                                          
##       The Twilight Saga: Eclipse}                             => {The Twilight Saga: New Moon}                       0.0002128078  0.5200000 0.0004092457  412.5423    13
## [9]  {Madagascar,                                                                                                                                                        
##       Madagascar 3: Europe's Most Wanted}                     => {Madagascar: Escape 2 Africa}                       0.0001964379  0.5000000 0.0003928759  401.8947    12
## [10] {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       Twilight}                                               => {The Twilight Saga: New Moon}                       0.0001145888  0.5000000 0.0002291776  396.6753     7

inspect(sort(rules2, by="confidence")[1:10])

##      lhs                                                         rhs                                                      support confidence     coverage      lift count
## [1]  {Mystery Science Theater 3000: Laserblast}               => {Mystery Science Theater 3000: I Accuse My Parents} 0.0001145888  0.8750000 0.0001309586 4859.2727     7
## [2]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       The Twilight Saga: New Moon}                            => {The Twilight Saga: Breaking Dawn: Part 1}          0.0001964379  0.7058824 0.0002782871  479.1216    12
## [3]  {The Hangover,                                                                                                                                                      
##       The Hangover Part III}                                  => {The Hangover Part II}                              0.0001800681  0.6875000 0.0002619172  424.2222    11
## [4]  {Madagascar 3: Europe's Most Wanted,                                                                                                                                
##       Madagascar: Escape 2 Africa}                            => {Madagascar}                                        0.0001964379  0.6666667 0.0002946569  288.8322    12
## [5]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       Twilight}                                               => {The Twilight Saga: Breaking Dawn: Part 1}          0.0001473284  0.6428571 0.0002291776  436.3429     9
## [6]  {Mystery Science Theater 3000: I Accuse My Parents}      => {Mystery Science Theater 3000: Laserblast}          0.0001145888  0.6363636 0.0001800681 4859.2727     7
## [7]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       The Twilight Saga: New Moon}                            => {The Twilight Saga: Eclipse}                        0.0001636983  0.5882353 0.0002782871  438.2209    10
## [8]  {13TH: A Conversation with Oprah Winfrey & Ava DuVernay} => {13TH}                                              0.0002128078  0.5652174 0.0003765060  426.2716    13
## [9]  {The Twilight Saga: Breaking Dawn: Part 2,                                                                                                                          
##       The Twilight Saga: Eclipse}                             => {The Twilight Saga: Breaking Dawn: Part 1}          0.0002128078  0.5652174 0.0003765060  383.6444    13
## [10] {The Twilight Saga: Breaking Dawn: Part 1,                                                                                                                          
##       Twilight}                                               => {The Twilight Saga: Breaking Dawn: Part 2}          0.0001473284  0.5625000 0.0002619172  343.6200     9

comment

Now we see that movies with more than one part are often viewed together.
Finally, let's raise the minimum support and lower the minimum confidence.

rules3 <- apriori(transactions, parameter = list(supp = 0.0005, conf = 0.1, maxlen=3), control = list(verbose = FALSE)) 
summary(quality(rules3))

##     support            confidence        coverage             lift       
##  Min.   :0.0005075   Min.   :0.1101   Min.   :0.001572   Min.   : 10.76  
##  1st Qu.:0.0005238   1st Qu.:0.1716   1st Qu.:0.002181   1st Qu.: 35.29  
##  Median :0.0005566   Median :0.2088   Median :0.002636   Median : 74.76  
##  Mean   :0.0006521   Mean   :0.2081   Mean   :0.003517   Mean   : 75.45  
##  3rd Qu.:0.0006343   3rd Qu.:0.2481   3rd Qu.:0.004252   3rd Qu.:118.58  
##  Max.   :0.0014897   Max.   :0.3333   Max.   :0.008169   Max.   :138.52  
##      count      
##  Min.   :31.00  
##  1st Qu.:32.00  
##  Median :34.00  
##  Mean   :39.83  
##  3rd Qu.:38.75  
##  Max.   :91.00
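
The minimum count in this summary follows directly from the support threshold: an itemset has to appear in at least support × n transactions. The sketch below reproduces that arithmetic. The number of transactions is not printed in this section, so `n_transactions` is inferred from one rule's count and support and should be treated as an assumption; `min_count()` is a hypothetical helper, not part of arules.

```r
# Number of transactions, inferred from the printed output as count / support
# (Kill Bill rule: count = 32, support = 0.0005238345). This is an assumption.
n_transactions <- round(32 / 0.0005238345)

# Smallest whole number of transactions that satisfies a minimum support
# (hypothetical helper for illustration).
min_count <- function(min_support, n) ceiling(min_support * n)

min_count(0.0005, n_transactions)  # threshold used for rules3
min_count(0.0001, n_transactions)  # threshold used for rules2
```

Under this assumption the two calls give 31 and 7, which agree with the minimum counts reported in the rules3 and rules2 summaries.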

plot(rules3)

inspect(sort(rules3, by="lift")[1:10])

##      lhs                                                    rhs                                                      support confidence    coverage      lift count
## [1]  {Kill Bill: Vol. 2}                                 => {Kill Bill: Vol. 1}                                 0.0005238345  0.3333333 0.001571503 138.52154    32
## [2]  {Kill Bill: Vol. 1}                                 => {Kill Bill: Vol. 2}                                 0.0005238345  0.2176871 0.002406365 138.52154    32
## [3]  {Iron Man}                                          => {Iron Man 2}                                        0.0005402043  0.2426471 0.002226296 118.58259    33
## [4]  {Iron Man 2}                                        => {Iron Man}                                          0.0005402043  0.2640000 0.002046228 118.58259    33
## [5]  {The Lord of the Rings: The Two Towers}             => {The Lord of the Rings: The Fellowship of the Ring} 0.0005729439  0.2482270 0.002308146  86.64965    35
## [6]  {The Lord of the Rings: The Fellowship of the Ring} => {The Lord of the Rings: The Two Towers}             0.0005729439  0.2000000 0.002864720  86.64965    35
## [7]  {Shrek the Third}                                   => {Shrek 2}                                           0.0005074646  0.2480000 0.002046228  62.86234    31
## [8]  {Shrek 2}                                           => {Shrek the Third}                                   0.0005074646  0.1286307 0.003945128  62.86234    31
## [9]  {Shrek 2}                                           => {Shrek}                                             0.0007202724  0.1825726 0.003945128  35.29429    44
## [10] {Shrek}                                             => {Shrek 2}                                           0.0007202724  0.1392405 0.005172865  35.29429    44

inspect(sort(rules3, by="confidence")[1:10])

##      lhs                                                    rhs                                                      support confidence    coverage      lift count
## [1]  {Kill Bill: Vol. 2}                                 => {Kill Bill: Vol. 1}                                 0.0005238345  0.3333333 0.001571503 138.52154    32
## [2]  {Iron Man 2}                                        => {Iron Man}                                          0.0005402043  0.2640000 0.002046228 118.58259    33
## [3]  {The Lord of the Rings: The Two Towers}             => {The Lord of the Rings: The Fellowship of the Ring} 0.0005729439  0.2482270 0.002308146  86.64965    35
## [4]  {Shrek the Third}                                   => {Shrek 2}                                           0.0005074646  0.2480000 0.002046228  62.86234    31
## [5]  {Iron Man}                                          => {Iron Man 2}                                        0.0005402043  0.2426471 0.002226296 118.58259    33
## [6]  {Kill Bill: Vol. 1}                                 => {Kill Bill: Vol. 2}                                 0.0005238345  0.2176871 0.002406365 138.52154    32
## [7]  {The Lord of the Rings: The Fellowship of the Ring} => {The Lord of the Rings: The Two Towers}             0.0005729439  0.2000000 0.002864720  86.64965    35
## [8]  {Shrek 2}                                           => {Shrek}                                             0.0007202724  0.1825726 0.003945128  35.29429    44
## [9]  {Bird Box}                                          => {Black Mirror: Bandersnatch}                        0.0014896543  0.1823647 0.008168544  10.76357    91
## [10] {Shrek}                                             => {Shrek 2}                                           0.0007202724  0.1392405 0.005172865  35.29429    44
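
To make the columns easier to read, the sketch below recomputes confidence and lift for the first rule by hand, using only numbers copied from the output above. It relies on the standard definitions: confidence = support(lhs and rhs) / support(lhs), and lift = confidence / support(rhs). The variable names are illustrative.

```r
# Values copied from the printed rules3 output.
support_rule <- 0.0005238345  # support of {Kill Bill: Vol. 2, Kill Bill: Vol. 1}
support_lhs  <- 0.001571503   # coverage of rule [1], i.e. support of {Kill Bill: Vol. 2}
support_rhs  <- 0.002406365   # coverage of the reverse rule, i.e. support of {Kill Bill: Vol. 1}

# Confidence: share of lhs transactions that also contain the rhs.
confidence <- support_rule / support_lhs

# Lift: how much more often the rhs appears with the lhs than it does overall.
lift <- confidence / support_rhs

round(confidence, 4)
round(lift, 1)
```

Lift is symmetric in lhs and rhs, which is why each rule and its reverse share the same lift in the table while their confidences differ.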

Final Conclusion

Movies with more than one part are sequels or entries in a franchise: they continue the story or characters of an earlier movie. Examples from the rules above include Iron Man, Shrek, Kill Bill and The Lord of the Rings.

Some possible reasons why these movies tend to be viewed together:

They have a loyal fan base that follows the series and wants to see how it ends.
They have cliffhangers or unresolved plot points that make viewers curious about what happens next.
They have consistent quality, genre, style or themes that appeal to a certain audience.
They have popular actors, directors or writers that attract viewers.
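
As a minimal sketch of how such rules could drive recommendations, the example below stores a few single-antecedent rules from the rules3 output in a plain data frame and looks up suggestions for a watched title. `rules_df` and `recommend()` are hypothetical names introduced for illustration; the original analysis stops at inspecting the rules.

```r
# A few single-antecedent rules copied from the rules3 output.
rules_df <- data.frame(
  lhs = c("Kill Bill: Vol. 2", "Iron Man", "Shrek 2", "Shrek 2", "Bird Box"),
  rhs = c("Kill Bill: Vol. 1", "Iron Man 2", "Shrek the Third", "Shrek",
          "Black Mirror: Bandersnatch"),
  confidence = c(0.3333333, 0.2426471, 0.1286307, 0.1825726, 0.1823647),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return up to top_n titles implied by the watched ones,
# ordered by decreasing confidence and excluding titles already watched.
recommend <- function(watched, rules, top_n = 3) {
  hits <- rules[rules$lhs %in% watched & !(rules$rhs %in% watched), ]
  hits <- hits[order(-hits$confidence), ]
  head(unique(hits$rhs), top_n)
}

recommend("Shrek 2", rules_df)
```

Ranking by confidence is one possible choice; ranking by lift would favour less popular titles that are strongly tied to the watched one.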