Introduction

The entertainment industry has always been driven by the preferences and interests of its consumers. In recent years, movie recommendation engines have become a crucial tool in enhancing the user experience by suggesting films that match a user’s viewing history and preferences. In this report, we explore the use of the Apriori algorithm, a data mining technique commonly used in Market Basket Analysis, to build a movie recommendation engine.

By treating each user's set of rated movies as a transaction and each movie as an item, we applied the Apriori algorithm to the MovieLens 20M Dataset from Kaggle, which contains millions of user ratings across thousands of movies. The data actually loaded in this report comprises 100,836 ratings from 610 users. Our objective was to mine the relationships among films and generate relevant movie recommendations for users.

In this report, we describe our methodology in detail, including the pre-processing of the dataset and the application of the Apriori algorithm. We present the results of our analysis and evaluate the effectiveness of our movie recommendation engine. Additionally, we discuss the limitations of the approach and suggest future research directions.

Overall, our study highlights the potential of data mining algorithms in improving the user experience of movie enthusiasts. By generating personalized movie recommendations, we aim to assist users in discovering new films and enhancing their enjoyment of the movie-watching experience.

library(arules)
library(arulesViz)
library(rattle)
library(dplyr)
library(ggplot2)

After downloading the MovieLens 20M dataset from Kaggle, we extracted the compressed file into the “ml-20m” directory. Our analysis uses two data files from that directory: “movies.csv” and “ratings.csv”.

The “movies.csv” file contains information about the movies in the dataset: an identifier, the title (which includes the release year in parentheses), and the genres. The “ratings.csv” file contains one row per rating, with the user, the movie, the rating, and a timestamp.

ratings <- read.csv("ratings.csv")

movies <- read.csv("movies.csv")

# Make the column types explicit; since R 4.0, read.csv() no longer
# converts strings to factors by default
movies <- as.data.frame(movies) %>%
  mutate(movieId = as.numeric(movieId),
         title = as.character(title),
         genres = as.character(genres))

movie_subset <- left_join(ratings, movies, by = "movieId")
# Have a glimpse at the dataset
movie_subset %>% glimpse()
## Rows: 100,836
## Columns: 6
## $ userId    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ movieId   <dbl> 1, 3, 6, 47, 50, 70, 101, 110, 151, 157, 163, 216, 223, 231,~
## $ rating    <dbl> 4, 4, 4, 5, 5, 3, 5, 4, 5, 5, 5, 5, 3, 5, 4, 5, 3, 3, 5, 4, ~
## $ timestamp <int> 964982703, 964981247, 964982224, 964983815, 964982931, 96498~
## $ title     <chr> "Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"~
## $ genres    <chr> "Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Roman~

Each row corresponds to a single user rating of a movie. The “userId” column is the unique identifier of the user who gave the rating, while the “movieId” column identifies the movie that was rated. The “rating” column contains the numerical rating given by the user; MovieLens ratings run from 0.5 to 5 stars in half-star increments.

The “timestamp” column records when the user gave the rating, stored as seconds since the Unix epoch. The “title” column provides the name of the movie that was rated, and the “genres” column lists the genres associated with that movie, separated by “|”.
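As a small illustration, a raw timestamp can be converted into a readable date with base R. This is only a sketch: it assumes the values are seconds since the Unix epoch in UTC, and the variable names are ours rather than part of the analysis.

```r
# First timestamp shown in the glimpse() output above
raw_ts <- 964982703

# Assumption: seconds since 1970-01-01 00:00:00 UTC
rated_at <- as.POSIXct(raw_ts, origin = "1970-01-01", tz = "UTC")

# Human-readable form of the rating time
format(rated_at, "%Y-%m-%d %H:%M:%S")
```

Under that assumption, the first rating in the table dates from July 2000.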

# Count the number of distinct users
n_distinct(movie_subset$userId)
## [1] 610

The dataset contains 610 unique users. Each user will become one transaction in the analysis that follows.
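The grouping of ratings into one basket per user can be pictured on a toy table. The data and the names `toy` and `baskets` below are hypothetical; the sketch mirrors only the `split()` step, not the conversion to an arules transactions object.

```r
# Toy ratings table (hypothetical data, same shape as the userId/title columns)
toy <- data.frame(
  userId = c(1, 1, 2, 2, 2, 3),
  title  = c("Movie A", "Movie B", "Movie A", "Movie B", "Movie C", "Movie A"),
  stringsAsFactors = FALSE
)

# One list element per user, holding the titles that user rated
baskets <- split(toy$title, toy$userId)

# Number of movies in each user's basket
lengths(baskets)
```

Each list element plays the role of a shopping basket in Market Basket Analysis, with movie titles in place of products.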

# Group the movie titles by user: one list element per user
data_list = split(movie_subset$title,
                  movie_subset$userId)
# Transform data into a transactional dataset
movie_trx = as(data_list, "transactions")
## Warning in asMethod(object): removing duplicated items in transactions
# Plot the absolute item frequency plot
itemFrequencyPlot(movie_trx,
                  type = "absolute",
                  topN = 10,
                  horiz = TRUE,
                  main = 'Absolute item frequency')

At the top of the list is “Forrest Gump (1994)”, rated by 329 users, making it the most frequently rated movie in the dataset. “Shawshank Redemption, The (1994)” is second with 317 ratings, followed closely by “Pulp Fiction (1994)” with 307. “Silence of the Lambs, The (1991)” and “Matrix, The (1999)” round out the top five with 279 and 278 ratings respectively. Note that these counts measure how many users rated a movie, not how highly they rated it.

Other popular movies in the top 10 list include “Star Wars: Episode IV - A New Hope (1977)”, “Jurassic Park (1993)”, “Braveheart (1995)”, “Terminator 2: Judgment Day (1991)”, and “Schindler’s List (1993)”. These movies have received a significant number of ratings from users in the dataset, indicating their popularity and potential appeal to a broad audience.

# Stack the two plots vertically
par(mfrow=c(2,1))

# Plot the relative and absolute item frequencies
itemFrequencyPlot(movie_trx,
                  type = "relative",
                  topN = 10,
                  horiz = TRUE,
                  main = 'Relative item frequency')

itemFrequencyPlot(movie_trx,
                  type = "absolute",
                  topN = 10,
                  horiz = TRUE,
                  main = 'Absolute item frequency')

# Widen the left margin for long titles and return to a single panel
par(mar=c(2,30,2,2), mfrow=c(1,1))

# Plot the 10 least popular items
barplot(sort(table(unlist(LIST(movie_trx))))[1:10],
        horiz = TRUE,
        las = 1,
        main = 'Least popular items')

The plot shows that each of the ten least-rated movies received only one rating, so they are not widely watched among the users in the dataset. Many titles tie at a single rating, so the ten shown are most likely just the first ten in sort order rather than a meaningful bottom ten. They include a mix of genres, such as drama, horror, comedy, and action, and range in release date from as early as 1981 to as recent as 2016.

Some of them are lesser-known films, such as “’71 (2014)”, “’night Mother (1986)”, and “…All the Marbles (1981)”, which may have limited appeal to a general audience. Others, such as “’Hellboy’: The Seeds of Creation (2004)”, “’Salem’s Lot (2004)”, and “00 Schneider - Jagd auf Nihil Baxter (1994)”, are adaptations of or tie-ins to existing books, comic books, or TV shows, and are likely to have a more niche audience.
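One detail of the `sort(table(...))` idiom is worth keeping in mind when reading this plot: tied counts stay in their original order, which for a table is the sorted order of the names. The toy vector below is hypothetical and only illustrates that behaviour.

```r
# Hypothetical titles: "B" is rated three times, every other title once
rated <- c("D", "B", "A", "B", "B", "C")

# table() orders names alphabetically; sort() keeps ties in that order
counts <- sort(table(rated))

# Titles from least to most rated
names(counts)
```

The three titles with a single rating come out in alphabetical order, ahead of the title with three ratings. That is consistent with the bars above all starting with punctuation or digits.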

# Extract the set of most frequent itemsets
itemsets = apriori(movie_trx,
                   parameter = list(support = 0.4,
                                    target = 'frequent'
                   ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.4      1
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 244 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9719 item(s), 610 transaction(s)] done [0.12s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [6 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Inspect the five most frequent itemsets
inspect(sort(itemsets, by = 'support', decreasing = TRUE)[1:5])
##     items                              support   count
## [1] {Forrest Gump (1994)}              0.5393443 329  
## [2] {Shawshank Redemption, The (1994)} 0.5196721 317  
## [3] {Pulp Fiction (1994)}              0.5032787 307  
## [4] {Silence of the Lambs, The (1991)} 0.4573770 279  
## [5] {Matrix, The (1999)}               0.4557377 278

The algorithm was run with a minimum support of 0.4, which corresponds to an absolute minimum support count of 244 (0.4 × 610 transactions). Only itemsets that appear in at least 244 transactions are therefore reported as frequent, and six itemsets meet this threshold.

The output table shows the five most frequent of them. In general an itemset is a set of movies that appear together in a transaction, but at this threshold every frequent itemset consists of a single movie. For example, the first itemset {Forrest Gump (1994)} has a support value of 0.5393443, which means that this movie appears in 53.93% of all transactions in the dataset. The count column shows that this itemset appears in 329 transactions.

Similarly, the second itemset {Shawshank Redemption, The (1994)} has a support value of 0.5196721, meaning that this movie appears in 51.97% of all transactions. The count column shows that this itemset appears in 317 transactions.

The third, fourth, and fifth itemsets are {Pulp Fiction (1994)}, {Silence of the Lambs, The (1991)}, and {Matrix, The (1999)}, respectively. These itemsets have support values of 0.5032787, 0.4573770, and 0.4557377, indicating that they are also popular among the users in the dataset.
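The relationship between support and count can be checked by hand. The sketch below re-derives the figures reported above from the counts in the table and the 610 transactions; the variable names are ours.

```r
n_users <- 610   # number of transactions reported by apriori()

# Relative threshold of this run and the matching absolute count
min_support <- 0.4
round(min_support * n_users)

# Counts copied from the inspect() table above
counts <- c(
  "Forrest Gump (1994)"              = 329,
  "Shawshank Redemption, The (1994)" = 317,
  "Pulp Fiction (1994)"              = 307,
  "Silence of the Lambs, The (1991)" = 279,
  "Matrix, The (1999)"               = 278
)

# Support is the share of transactions that contain the itemset
support <- counts / n_users
round(support, 7)
```

The rounded values agree with the support column printed by `inspect()`.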

# Extract the frequent itemsets that contain at least two movies
itemsets_minlen2 = apriori(movie_trx, parameter = 
                           list(support = 0.3,
                                minlen = 2,
                                target = 'frequent'
                            ))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5     0.3      2
##  maxlen            target  ext
##      10 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 183 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[9719 item(s), 610 transaction(s)] done [0.11s].
## sorting and recoding items ... [28 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [11 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Inspect the five most frequent itemsets of two or more movies
inspect(sort(itemsets_minlen2,
             by = 'support', decreasing = TRUE)[1:5])
##     items                                 support count
## [1] {Forrest Gump (1994),                              
##      Shawshank Redemption, The (1994)}  0.3786885   231
## [2] {Forrest Gump (1994),                              
##      Pulp Fiction (1994)}               0.3770492   230
## [3] {Pulp Fiction (1994),                              
##      Shawshank Redemption, The (1994)}  0.3639344   222
## [4] {Pulp Fiction (1994),                              
##      Silence of the Lambs, The (1991)}  0.3393443   207
## [5] {Shawshank Redemption, The (1994),                 
##      Silence of the Lambs, The (1991)}  0.3262295   199

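This run used a minimum support of 0.3 (an absolute count of 183 transactions) and minlen = 2, so only itemsets containing at least two movies are returned; 11 such itemsets were found. Each row in the table shows an itemset consisting of two movies, together with the support and count for that itemset.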
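The count of a two-movie itemset is simply the number of users who rated both movies. A minimal base R sketch on hypothetical baskets is given here; it counts every pair by brute force, which is not how arules implements Apriori (Apriori prunes candidates level by level), and all names are illustrative.

```r
# Hypothetical baskets: one character vector of titles per user
baskets <- list(
  u1 = c("A", "B"),
  u2 = c("A", "B", "C"),
  u3 = c("A"),
  u4 = c("B", "C")
)
items <- sort(unique(unlist(baskets)))

# Users x items incidence matrix: 1 if the user rated the item, else 0
incidence <- t(sapply(baskets, function(b) as.integer(items %in% b)))
colnames(incidence) <- items

# Entry [i, j] counts the users who rated both i and j;
# the diagonal holds the single-item counts
pair_counts <- crossprod(incidence)
pair_support <- pair_counts / length(baskets)

pair_counts["A", "B"]
pair_support["A", "B"]
```

In this toy example two of the four users rated both A and B, so the pair has a count of 2 and a support of 0.5.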

The first row shows that the combination of “Forrest Gump (1994)” and “Shawshank Redemption, The (1994)” was found in 37.87% of the transactions, a count of 231. Similarly, the second row shows that “Forrest Gump (1994)” and “Pulp Fiction (1994)” were found together in 37.70% of the transactions, with a count of 230.

The third row shows the itemset of “Pulp Fiction (1994)” and “Shawshank Redemption, The (1994)” with a support of 36.39% and a count of 222. The fourth row shows the itemset of “Pulp Fiction (1994)” and “Silence of the Lambs, The (1991)” with a support of 33.93% and a count of 207. Finally, the fifth row shows the itemset of “Shawshank Redemption, The (1994)” and “Silence of the Lambs, The (1991)” with a support of 32.62% and a count of 199.
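Frequent pairs become recommendations once they are read as rules of the form "users who rated X also rated Y". The report does not compute rule metrics, so the following is only a sketch: it applies the standard definitions of confidence and lift to counts already shown in the two tables, with variable names of our own choosing.

```r
n_users <- 610

# Counts taken from the two inspect() tables in this report
count_fg    <- 329   # Forrest Gump (1994)
count_sr    <- 317   # Shawshank Redemption, The (1994)
count_fg_sr <- 231   # both movies

# Share of all users who rated both movies
support_pair <- count_fg_sr / n_users

# Share of Forrest Gump raters who also rated Shawshank Redemption
confidence <- count_fg_sr / count_fg

# Confidence relative to the overall share of Shawshank Redemption raters
lift <- confidence / (count_sr / n_users)

round(c(support = support_pair, confidence = confidence, lift = lift), 3)
```

For the rule {Forrest Gump (1994)} => {Shawshank Redemption, The (1994)}, this gives a confidence of about 0.70 and a lift of about 1.35. A lift above 1 indicates that the two movies are rated by the same users more often than would be expected if they were independent.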

Conclusion

In conclusion, this report has explored the use of the Apriori algorithm, a data mining technique commonly used in Market Basket Analysis, as the basis for a movie recommendation engine. Treating each user as a transaction and each rated movie as an item, we mined frequent itemsets from MovieLens rating data in order to uncover relationships among films that can drive recommendations.

Our analysis shows that the Apriori algorithm readily surfaces groups of films that are frequently rated by the same users, such as “Forrest Gump (1994)” together with “Shawshank Redemption, The (1994)”. Such associations can form the basis of recommendations: a user who has rated one film in a frequent pair can be offered the other. Because the most frequent itemsets are dominated by widely rated titles, these associations reflect overall popularity as much as individual taste. Overall, our study highlights the potential of data mining algorithms in enhancing the user experience of movie enthusiasts and assisting them in discovering new films.