The goal of this project is to use a database that contains movies, their genre and rating to perform an analysis using apriori algorithm. The movie genres of every single movie and the rating that they received will be gathered and analyzed in order to obtain some association rules. By performing such analysis, I hope to receive results that will showcase which movie genres are often enjoyed together by users and which movie genres are enjoyed by different groups of people.
library(dplyr)
library(readr)
library(tidyr)
library(wesanderson)
library(arules)
library(arulesViz)
getwd()
setwd("C:/Users/Lukasz/Desktop/unsupervised learning")
#load datasets
movies_names <- read_csv("dane/movies.csv", show_col_types = FALSE)
movie_ratings <- read.csv("dane/ratings.csv")
movies_names <- as.data.frame(movies_names)
head(movie_ratings)
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
head(movies_names)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
Since one dataset contains information about movie IDâs and their ratings, and the other contains information about their titles, it is vital to merge them to obtain a dataset that contains all the valuable information. I merged the datasets using movieID variable since it was common for both of them
merged_movies <- merge(movie_ratings, movies_names, by = "movieId")
head(merged_movies)
## movieId userId rating timestamp title
## 1 1 103 3.0 978554606 Toy Story (1995)
## 2 1 328 5.0 1237950888 Toy Story (1995)
## 3 1 453 4.0 859302647 Toy Story (1995)
## 4 1 590 3.5 1243424645 Toy Story (1995)
## 5 1 419 4.0 1447594754 Toy Story (1995)
## 6 1 180 3.5 1180846101 Toy Story (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Animation|Children|Comedy|Fantasy
## 3 Adventure|Animation|Children|Comedy|Fantasy
## 4 Adventure|Animation|Children|Comedy|Fantasy
## 5 Adventure|Animation|Children|Comedy|Fantasy
## 6 Adventure|Animation|Children|Comedy|Fantasy
#check for missing values
colSums(is.na(merged_movies))
## movieId userId rating timestamp title genres
## 0 0 0 0 0 0
There are no missing data in the dataset (at least not explicitly). The next step is to separate the main genre from all the listed sub genres of each movie and drop unnecessary columns:
#Separate movie genres
merged_movies <- merged_movies %>%
separate(genres, into = c("main_genre", "other_genre"), sep = "\\|")
#drop other genres and time stamp columns since they will not be needed for the analysis
merged_movies$other_genres<-NULL
merged_movies$timestamp<-NULL
After more thorough observation it is clear that there are some rows where no movie genre is specified. Therefore, those observations were deleted
#Get rid of observations where no movie genre is listed
merged_movies <- merged_movies %>%
filter(main_genre != "(no genres listed)")
A look at all of the unique movie genres after cleaning the data:
unique(merged_movies$main_genre)
## [1] "Adventure" "Comedy" "Action" "Drama" "Crime"
## [6] "Children" "Mystery" "Animation" "Documentary" "Thriller"
## [11] "Horror" "Fantasy" "Western" "Film-Noir" "Romance"
## [16] "Sci-Fi" "Musical" "War"
To get a better idea about the distribution of the movie ratings or the frequency of genres, a short EDA was performed
#get mean ratings for movies of each genre
mean_ratings <- merged_movies%>%
group_by(main_genre) %>%
summarise(mean_rating = mean(rating))
mean_ratings
## # A tibble: 18 Ă 2
## main_genre mean_rating
## <chr> <dbl>
## 1 Action 3.45
## 2 Adventure 3.58
## 3 Animation 3.58
## 4 Children 3.25
## 5 Comedy 3.44
## 6 Crime 3.79
## 7 Documentary 3.65
## 8 Drama 3.64
## 9 Fantasy 3.37
## 10 Film-Noir 4.02
## 11 Horror 3.10
## 12 Musical 3.62
## 13 Mystery 3.88
## 14 Romance 3.26
## 15 Sci-Fi 3.24
## 16 Thriller 3.47
## 17 War 3.61
## 18 Western 3.5
pal <- wes_palette(18, name = "GrandBudapest1", type = "continuous")
#Comparison of rating distribution between movies of different genres
boxplot(rating~main_genre,data = merged_movies, las = 2, xlab = "", col = pal,
main = "Distribution of ratings per genre")
As we can see, majority of the genres have a fairly similar distribution of ratings. Interestingly enough, they are way more very positive ratings then very low ratings. This can be a useful fact in the latter part of the project. It is also visible that there are some genres like Film-noir, musical, war or mystery, that have higher average ratings than the rest of genres. Could this be due to smaller amount of overall ratings? Let us check:
#checking the number of occurrences of each genre
no_of_genres <- merged_movies %>%
count(main_genre) %>%
arrange(desc(n))
no_of_genres
## main_genre n
## 1 Action 31205
## 2 Comedy 25147
## 3 Drama 20488
## 4 Adventure 9741
## 5 Crime 7645
## 6 Horror 2806
## 7 Animation 2155
## 8 Children 1953
## 9 Documentary 1011
## 10 Mystery 1002
## 11 Thriller 835
## 12 Fantasy 355
## 13 Sci-Fi 285
## 14 Western 201
## 15 Film-Noir 169
## 16 Musical 165
## 17 Romance 147
## 18 War 22
barplot(no_of_genres$n, names.arg = no_of_genres$main_genre, las = 2, col = pal,
main = "Number of ratings per genre in a descending order")
By looking at this barplot we can clearly see that all of the movie genres that had an above average rating are the ones with the least amount of rating. There are also some movie genres such as action comedy and drama that are by far the most popular among all genres. This may seem weird that for example genre such as science-fiction is so unpopular but it may be due to the fact that in the original version of the dataset it was rarely listed as a main genre but only as a second or third sub genre
The first part of preparation for the Apriori algorithm is creating a new dataset that only consists of movies that received a positive rating. As presented in the EDA part of the project, it is clear that the mean ratings are relatively high for all the genres that is why I will only consider movies that have a rating of 4.5 or higher to have a positive rating.
liked_movies <- merged_movies%>%
filter(rating >= 4.5) %>% # I will assume that ratings that are 4.5 or higher mean that person liked the movie
select(main_genre, userId) #For this algorithm i will only focus on movie genres and not titles
head(liked_movies)
## main_genre userId
## 1 Adventure 328
## 2 Adventure 286
## 3 Adventure 627
## 4 Adventure 387
## 5 Adventure 552
## 6 Adventure 589
Now we are left with only positive movie rankings and genres of those ratings.
The next step is to change the data to transaction format so that the algorithm can run properly.
#Changing the dataset intro transaction format
movies_transact <- split(liked_movies$main_genre, liked_movies$userId)
movies_transact <- as(movies_transact, "transactions")
Checking frequency once again on the liked movies dataset. There are some genres that appear very rarely, therefore a threshold of 0.05 will be set so that only movies that appear relatively frequently will be taken int consideration while running the Apriori algorithm
genre_frequency <- itemFrequency(movies_transact, type = "relative")
genre_frequency
## Action Adventure Animation Children Comedy Crime
## 0.862329803 0.697428139 0.281391831 0.246596067 0.880484115 0.738275340
## Documentary Drama Fantasy Film-Noir Horror Musical
## 0.201210287 0.872919818 0.089258699 0.063540091 0.249621785 0.055975794
## Mystery Romance Sci-Fi Thriller War Western
## 0.325264750 0.025718608 0.051437216 0.167927383 0.006051437 0.048411498
# remove genres that appear very infrequently
only_frequent <- names(genre_frequency[genre_frequency > 0.05])
only_frequent
## [1] "Action" "Adventure" "Animation" "Children" "Comedy"
## [6] "Crime" "Documentary" "Drama" "Fantasy" "Film-Noir"
## [11] "Horror" "Musical" "Mystery" "Sci-Fi" "Thriller"
movies_only_frequent_transact <- movies_transact[, only_frequent]
movies_only_frequent_transact
## transactions in sparse format with
## 661 transactions (rows) and
## 15 items (columns)
Now that only movies with appearance rate of over 5% are left, and they are presented in transaction format, Apriori algorithm can be run. Since I am more interested in finding the best rules rather than finding all rules I will be strict with setting the support and confidence parameters. Setting them at respectively 0.15 and 0.6 will allow me to find and identify the rules of most frequent transactions with the highest confidence value.
all_rules <- apriori(movies_only_frequent_transact, parameter = list(supp = 0.08, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.08 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 52
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15 item(s), 661 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [1567 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
Here the rules are ordered from best to worst using the confidence metric
best_rules <- sort(all_rules, by = "confidence", decreasing = TRUE)
summary(best_rules)
## set of 1567 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8
## 5 55 223 439 476 282 79 8
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.613 5.000 8.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.08018 Min. :0.6000 Min. :0.08018 Min. :0.9908
## 1st Qu.:0.09834 1st Qu.:0.9204 1st Qu.:0.10287 1st Qu.:1.1228
## Median :0.11346 Median :0.9683 Median :0.13011 Median :1.1456
## Mean :0.15737 Mean :0.9280 Mean :0.17308 Mean :1.2885
## 3rd Qu.:0.17322 3rd Qu.:0.9867 3rd Qu.:0.20121 3rd Qu.:1.3101
## Max. :0.88048 Max. :1.0000 Max. :1.00000 Max. :2.6707
## count
## Min. : 53.0
## 1st Qu.: 65.0
## Median : 75.0
## Mean :104.0
## 3rd Qu.:114.5
## Max. :582.0
##
## mining info:
## data ntransactions support confidence
## movies_only_frequent_transact 661 0.08 0.6
## call
## apriori(data = movies_only_frequent_transact, parameter = list(supp = 0.08, conf = 0.6))
inspect(head(best_rules))
## lhs rhs support confidence coverage
## [1] {Children, Documentary} => {Drama} 0.08472012 1 0.08472012
## [2] {Children, Documentary} => {Comedy} 0.08472012 1 0.08472012
## [3] {Documentary, Horror} => {Drama} 0.09379728 1 0.09379728
## [4] {Animation, Documentary} => {Drama} 0.10590015 1 0.10590015
## [5] {Documentary, Mystery} => {Drama} 0.10741301 1 0.10741301
## [6] {Documentary, Mystery} => {Comedy} 0.10741301 1 0.10741301
## lift count
## [1] 1.145581 56
## [2] 1.135739 56
## [3] 1.145581 62
## [4] 1.145581 70
## [5] 1.145581 71
## [6] 1.135739 71
The best rules all have confidence 1 so let us look at the best rules that have value of confidence smaller than one. They all have one thing in common again. Their support values are small what suggests that they are relatively rare again. So let us see what happens if we look at rules with higher support values:
rules_filtered <- subset(best_rules,support > 0.25)
inspect(head(rules_filtered))
## lhs rhs support confidence
## [1] {Action, Adventure, Mystery} => {Comedy} 0.2571861 0.9883721
## [2] {Action, Adventure, Drama, Mystery} => {Comedy} 0.2526475 0.9881657
## [3] {Action, Animation} => {Comedy} 0.2662632 0.9832402
## [4] {Adventure, Drama, Mystery} => {Comedy} 0.2571861 0.9826590
## [5] {Action, Adventure, Mystery} => {Drama} 0.2556732 0.9825581
## [6] {Action, Adventure, Comedy, Mystery} => {Drama} 0.2526475 0.9823529
## coverage lift count
## [1] 0.2602118 1.122533 170
## [2] 0.2556732 1.122298 167
## [3] 0.2708018 1.116704 176
## [4] 0.2617247 1.116044 170
## [5] 0.2602118 1.125600 169
## [6] 0.2571861 1.125364 167
After setting the support value to 0.25, we can see that there are no rules with confidence 1. Sure most of them still have high confidence, however as I mentioned before, ratings in this datasets are very high and majority of users left high ratings for movies of each genre.
It is also worth checking which movie genres people are less likely to enjoy. This is presented by the rules below:
rules_filtered <- subset(best_rules)
inspect(tail(rules_filtered))
## lhs rhs support
## [1] {Comedy, Crime, Horror, Mystery} => {Animation} 0.08472012
## [2] {Comedy, Crime, Drama, Horror, Mystery} => {Animation} 0.08472012
## [3] {Action, Adventure, Documentary} => {Animation} 0.09833585
## [4] {Comedy, Crime, Thriller} => {Mystery} 0.08169440
## [5] {Action, Comedy, Horror, Mystery} => {Animation} 0.08623298
## [6] {Action, Comedy, Drama, Horror, Mystery} => {Animation} 0.08623298
## confidence coverage lift count
## [1] 0.6021505 0.1406959 2.139901 56
## [2] 0.6021505 0.1406959 2.139901 56
## [3] 0.6018519 0.1633888 2.138839 65
## [4] 0.6000000 0.1361573 1.844651 54
## [5] 0.6000000 0.1437216 2.132258 57
## [6] 0.6000000 0.1437216 2.132258 57
While the lowest confidence value is still relatively high (0.6, the threshold set while running the apriori algorithm ), there are some interesting observations to be made. People who enjoy Action and adventure movies are slightly less likely to enjoy crime movies. People who enjoy most of the popular genres such as comedy, action, horror or drama are not very likely to enjoy Animations as well as Mystery movies
Now let us take a look at how rules behave for certain genres.
rules_Horror <- subset(all_rules, lhs %in% "Horror")
rules_Horror <- sort(rules_Horror, by = "support")
inspect(head(rules_Horror))
## lhs rhs support confidence coverage lift count
## [1] {Horror} => {Comedy} 0.2450832 0.9818182 0.2496218 1.115089 162
## [2] {Horror} => {Drama} 0.2435703 0.9757576 0.2496218 1.117809 161
## [3] {Horror} => {Action} 0.2405446 0.9636364 0.2496218 1.117480 159
## [4] {Drama, Horror} => {Comedy} 0.2390318 0.9813665 0.2435703 1.114576 158
## [5] {Comedy, Horror} => {Drama} 0.2390318 0.9753086 0.2450832 1.117295 158
## [6] {Action, Horror} => {Comedy} 0.2375189 0.9874214 0.2405446 1.121453 157
While analyzing the rules for people that enjoyed horror movies it is clear that they will most likely enjoy Comedies and Dramas as well as Action movies. By sorting the results by support value I made sure that only the rules that are frequent enough appeared.
Now let us switch to a genre that is completely different - Animation. Again I wanted to sort the rules by support to show how rules that are most common will behave. The results are similar as in the instance of Horror movie fans. People who enjoy Animations, also enjoy comedies action movies and dramas
rules_Animation <- subset(all_rules, lhs %in% "Animation")
rules_Animation <- sort(rules_Animation, by = "support")
inspect(head(rules_Animation))
## lhs rhs support confidence coverage lift
## [1] {Animation} => {Comedy} 0.2738275 0.9731183 0.2813918 1.105208
## [2] {Animation} => {Action} 0.2708018 0.9623656 0.2813918 1.116006
## [3] {Action, Animation} => {Comedy} 0.2662632 0.9832402 0.2708018 1.116704
## [4] {Animation, Comedy} => {Action} 0.2662632 0.9723757 0.2738275 1.127615
## [5] {Animation} => {Drama} 0.2632375 0.9354839 0.2813918 1.071672
## [6] {Animation, Drama} => {Comedy} 0.2571861 0.9770115 0.2632375 1.109630
## count
## [1] 181
## [2] 179
## [3] 176
## [4] 176
## [5] 174
## [6] 170
For movies for children, a similar trend follows, but with addition of the fact that people who enjoy such movies also seem to be fans of Adventure movies.
rules_children_movies <- subset(all_rules, lhs %in% "Children")
rules_children_movies <- sort(rules_children_movies, by = "support")
inspect(head(rules_children_movies))
## lhs rhs support confidence coverage lift
## [1] {Children} => {Drama} 0.2375189 0.9631902 0.2465961 1.103412
## [2] {Children} => {Comedy} 0.2344932 0.9509202 0.2465961 1.079997
## [3] {Children} => {Action} 0.2269289 0.9202454 0.2465961 1.067162
## [4] {Children, Drama} => {Comedy} 0.2269289 0.9554140 0.2375189 1.085101
## [5] {Children, Comedy} => {Drama} 0.2269289 0.9677419 0.2344932 1.108626
## [6] {Children} => {Adventure} 0.2223903 0.9018405 0.2465961 1.293094
## count
## [1] 157
## [2] 155
## [3] 150
## [4] 150
## [5] 150
## [6] 147
The last movie genre that I analyzed was Fantasy and it produced similar results to the previous genres with movies like comedy, drama, adventure and action on the right hand side of a rule.
rules_fantasy <- subset(all_rules, lhs %in% "Fantasy")
rules_fantasy <- sort(rules_fantasy, by = "support")
inspect(head(rules_fantasy))
## lhs rhs support confidence coverage lift
## [1] {Fantasy} => {Comedy} 0.08774584 0.9830508 0.08925870 1.116489
## [2] {Fantasy} => {Drama} 0.08623298 0.9661017 0.08925870 1.106747
## [3] {Drama, Fantasy} => {Comedy} 0.08472012 0.9824561 0.08623298 1.115814
## [4] {Comedy, Fantasy} => {Drama} 0.08472012 0.9655172 0.08774584 1.106078
## [5] {Fantasy} => {Adventure} 0.08320726 0.9322034 0.08925870 1.336630
## [6] {Fantasy} => {Action} 0.08320726 0.9322034 0.08925870 1.081029
## count
## [1] 58
## [2] 57
## [3] 56
## [4] 56
## [5] 55
## [6] 55
Now I wanted to go over the visualization of rules for a sample genre of horror. For some of these plots only 10 best rules (with the highest value of support metric) are presented for the sake of clarity of a plot.
rules_top_Horror <- head(rules_Horror,10)
rules_top_animation <- head(rules_Animation,10)
On this plot we can see the rules with Genre Horror included in the left hand side of a rule. The rules have a support value varying from less then 0.1 to above 0.25. Genres like comedy and Action almost always appear on the right hand side of a rule. Those genres also have the lowest lift value that is very close to 1 what might suggest low association. Movie genres like Animation and Mystery appear more rarely but have a much higher lift value along with lower support
plot(rules_Horror, method = "grouped")
The next plot displays the likelihood of genres being liked by fans of horror movies. We can again notice that mystery and animation movies are towards the bottom of the graph, suggesting that those genres are generally enjoyed a bit less by horror fans.
plot(rules_Horror, engine = "plotly")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
This is a visual representation of the net created by top ten rules that have genre horror contained on the left hand side. Three of those rules have drama and action on the right hand side and four of them have comedy
plot(rules_top_Horror, method = "graph", engine = "htmlwidget")
On the last graph we can see the the difference of behavior of a rule depending on the place which horror has on the left hand side of a rule:
plot(rules_top_Horror, method="paracoord")
This project was an analysis of movie ratings and the general enjoyment of certain movie genres by movie watchers. The idea was to check which movie genres are enjoyed together by the same people and which genres are rarely enjoyed by the same group. By implementing apriori algorithm, analyzing metrics such as lift, support and confidence, as well as studying the results of the visual representation of the created rules, I was able to achieve a valuable insight to the difference of enjoyment of different movie genres as well as