2. Loading required libraries and Importing the data

library(dplyr)
library(readr)
library(tidyr)
library(wesanderson)
library(arules)
library(arulesViz)

getwd()
setwd("C:/Users/Lukasz/Desktop/unsupervised learning")

#load datasets
movies_names <- read_csv("dane/movies.csv", show_col_types = FALSE)
movie_ratings <- read.csv("dane/ratings.csv")

2.1Structure of datasets

movies_names <- as.data.frame(movies_names)
head(movie_ratings)

##   userId movieId rating  timestamp
## 1      1      16    4.0 1217897793
## 2      1      24    1.5 1217895807
## 3      1      32    4.0 1217896246
## 4      1      47    4.0 1217896556
## 5      1      50    4.0 1217896523
## 6      1     110    4.0 1217896150

head(movies_names)

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

2.2Merging the datasets

Since one dataset contains information about movie ID’s and their ratings, and the other contains information about their titles, it is vital to merge them to obtain a dataset that contains all the valuable information. I merged the datasets using movieID variable since it was common for both of them

merged_movies <- merge(movie_ratings, movies_names, by = "movieId")
head(merged_movies)

##   movieId userId rating  timestamp            title
## 1       1    103    3.0  978554606 Toy Story (1995)
## 2       1    328    5.0 1237950888 Toy Story (1995)
## 3       1    453    4.0  859302647 Toy Story (1995)
## 4       1    590    3.5 1243424645 Toy Story (1995)
## 5       1    419    4.0 1447594754 Toy Story (1995)
## 6       1    180    3.5 1180846101 Toy Story (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Animation|Children|Comedy|Fantasy
## 3 Adventure|Animation|Children|Comedy|Fantasy
## 4 Adventure|Animation|Children|Comedy|Fantasy
## 5 Adventure|Animation|Children|Comedy|Fantasy
## 6 Adventure|Animation|Children|Comedy|Fantasy

3. Cleaning the data

#check for missing values 
colSums(is.na(merged_movies))

##   movieId    userId    rating timestamp     title    genres 
##         0         0         0         0         0         0

There are no missing data in the dataset (at least not explicitly). The next step is to separate the main genre from all the listed sub genres of each movie and drop unnecessary columns:

#Separate movie genres
merged_movies <- merged_movies %>%
  separate(genres, into = c("main_genre", "other_genre"), sep = "\\|")

#drop other genres and time stamp columns since they will not be needed for the analysis
merged_movies$other_genres<-NULL
merged_movies$timestamp<-NULL

After more thorough observation it is clear that there are some rows where no movie genre is specified. Therefore, those observations were deleted

#Get rid of observations where no movie genre is listed
merged_movies <- merged_movies %>%
  filter(main_genre != "(no genres listed)")

A look at all of the unique movie genres after cleaning the data:

unique(merged_movies$main_genre)

##  [1] "Adventure"   "Comedy"      "Action"      "Drama"       "Crime"      
##  [6] "Children"    "Mystery"     "Animation"   "Documentary" "Thriller"   
## [11] "Horror"      "Fantasy"     "Western"     "Film-Noir"   "Romance"    
## [16] "Sci-Fi"      "Musical"     "War"

4.Exploratory Data Analysis

To get a better idea about the distribution of the movie ratings or the frequency of genres, a short EDA was performed

#get mean ratings for movies of each genre
mean_ratings <- merged_movies%>%
  group_by(main_genre) %>%
  summarise(mean_rating = mean(rating))
mean_ratings

## # A tibble: 18 × 2
##    main_genre  mean_rating
##    <chr>             <dbl>
##  1 Action             3.45
##  2 Adventure          3.58
##  3 Animation          3.58
##  4 Children           3.25
##  5 Comedy             3.44
##  6 Crime              3.79
##  7 Documentary        3.65
##  8 Drama              3.64
##  9 Fantasy            3.37
## 10 Film-Noir          4.02
## 11 Horror             3.10
## 12 Musical            3.62
## 13 Mystery            3.88
## 14 Romance            3.26
## 15 Sci-Fi             3.24
## 16 Thriller           3.47
## 17 War                3.61
## 18 Western            3.5

pal <- wes_palette(18, name = "GrandBudapest1", type = "continuous")

#Comparison of rating distribution between movies of different genres
boxplot(rating~main_genre,data = merged_movies, las = 2, xlab = "", col = pal, 
        main = "Distribution of ratings per genre")

As we can see, majority of the genres have a fairly similar distribution of ratings. Interestingly enough, they are way more very positive ratings then very low ratings. This can be a useful fact in the latter part of the project. It is also visible that there are some genres like Film-noir, musical, war or mystery, that have higher average ratings than the rest of genres. Could this be due to smaller amount of overall ratings? Let us check:

#checking the number of occurrences of each genre
no_of_genres <- merged_movies %>% 
  count(main_genre) %>% 
  arrange(desc(n))
no_of_genres

##     main_genre     n
## 1       Action 31205
## 2       Comedy 25147
## 3        Drama 20488
## 4    Adventure  9741
## 5        Crime  7645
## 6       Horror  2806
## 7    Animation  2155
## 8     Children  1953
## 9  Documentary  1011
## 10     Mystery  1002
## 11    Thriller   835
## 12     Fantasy   355
## 13      Sci-Fi   285
## 14     Western   201
## 15   Film-Noir   169
## 16     Musical   165
## 17     Romance   147
## 18         War    22

barplot(no_of_genres$n, names.arg = no_of_genres$main_genre, las = 2, col = pal, 
        main = "Number of ratings per genre in a descending order")

By looking at this barplot we can clearly see that all of the movie genres that had an above average rating are the ones with the least amount of rating. There are also some movie genres such as action comedy and drama that are by far the most popular among all genres. This may seem weird that for example genre such as science-fiction is so unpopular but it may be due to the fact that in the original version of the dataset it was rarely listed as a main genre but only as a second or third sub genre

5. Preparing the data to apriori algorithm

The first part of preparation for the Apriori algorithm is creating a new dataset that only consists of movies that received a positive rating. As presented in the EDA part of the project, it is clear that the mean ratings are relatively high for all the genres that is why I will only consider movies that have a rating of 4.5 or higher to have a positive rating.

liked_movies <- merged_movies%>%
  filter(rating >= 4.5) %>%  # I will assume that ratings that are 4.5 or higher mean that person liked the movie
  select(main_genre, userId) #For this algorithm i will only focus on movie genres and not titles
head(liked_movies)

##   main_genre userId
## 1  Adventure    328
## 2  Adventure    286
## 3  Adventure    627
## 4  Adventure    387
## 5  Adventure    552
## 6  Adventure    589

Now we are left with only positive movie rankings and genres of those ratings.

The next step is to change the data to transaction format so that the algorithm can run properly.

#Changing the dataset intro transaction format
movies_transact <- split(liked_movies$main_genre, liked_movies$userId)
movies_transact <- as(movies_transact, "transactions")

Checking frequency once again on the liked movies dataset. There are some genres that appear very rarely, therefore a threshold of 0.05 will be set so that only movies that appear relatively frequently will be taken int consideration while running the Apriori algorithm

genre_frequency <- itemFrequency(movies_transact, type = "relative") 
genre_frequency

##      Action   Adventure   Animation    Children      Comedy       Crime 
## 0.862329803 0.697428139 0.281391831 0.246596067 0.880484115 0.738275340 
## Documentary       Drama     Fantasy   Film-Noir      Horror     Musical 
## 0.201210287 0.872919818 0.089258699 0.063540091 0.249621785 0.055975794 
##     Mystery     Romance      Sci-Fi    Thriller         War     Western 
## 0.325264750 0.025718608 0.051437216 0.167927383 0.006051437 0.048411498

# remove genres that appear very infrequently
only_frequent <- names(genre_frequency[genre_frequency > 0.05])
only_frequent

##  [1] "Action"      "Adventure"   "Animation"   "Children"    "Comedy"     
##  [6] "Crime"       "Documentary" "Drama"       "Fantasy"     "Film-Noir"  
## [11] "Horror"      "Musical"     "Mystery"     "Sci-Fi"      "Thriller"

movies_only_frequent_transact <- movies_transact[, only_frequent]
movies_only_frequent_transact

## transactions in sparse format with
##  661 transactions (rows) and
##  15 items (columns)

6. Implementing the apriori algorithm

Now that only movies with appearance rate of over 5% are left, and they are presented in transaction format, Apriori algorithm can be run. Since I am more interested in finding the best rules rather than finding all rules I will be strict with setting the support and confidence parameters. Setting them at respectively 0.15 and 0.6 will allow me to find and identify the rules of most frequent transactions with the highest confidence value.

all_rules <- apriori(movies_only_frequent_transact, parameter = list(supp = 0.08, conf = 0.6))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.08      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 52 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[15 item(s), 661 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [1567 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

Here the rules are ordered from best to worst using the confidence metric

best_rules <- sort(all_rules, by = "confidence", decreasing = TRUE)
summary(best_rules)

## set of 1567 rules
## 
## rule length distribution (lhs + rhs):sizes
##   1   2   3   4   5   6   7   8 
##   5  55 223 439 476 282  79   8 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   4.000   5.000   4.613   5.000   8.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.08018   Min.   :0.6000   Min.   :0.08018   Min.   :0.9908  
##  1st Qu.:0.09834   1st Qu.:0.9204   1st Qu.:0.10287   1st Qu.:1.1228  
##  Median :0.11346   Median :0.9683   Median :0.13011   Median :1.1456  
##  Mean   :0.15737   Mean   :0.9280   Mean   :0.17308   Mean   :1.2885  
##  3rd Qu.:0.17322   3rd Qu.:0.9867   3rd Qu.:0.20121   3rd Qu.:1.3101  
##  Max.   :0.88048   Max.   :1.0000   Max.   :1.00000   Max.   :2.6707  
##      count      
##  Min.   : 53.0  
##  1st Qu.: 65.0  
##  Median : 75.0  
##  Mean   :104.0  
##  3rd Qu.:114.5  
##  Max.   :582.0  
## 
## mining info:
##                           data ntransactions support confidence
##  movies_only_frequent_transact           661    0.08        0.6
##                                                                                      call
##  apriori(data = movies_only_frequent_transact, parameter = list(supp = 0.08, conf = 0.6))

inspect(head(best_rules))

##     lhs                         rhs      support    confidence coverage  
## [1] {Children, Documentary}  => {Drama}  0.08472012 1          0.08472012
## [2] {Children, Documentary}  => {Comedy} 0.08472012 1          0.08472012
## [3] {Documentary, Horror}    => {Drama}  0.09379728 1          0.09379728
## [4] {Animation, Documentary} => {Drama}  0.10590015 1          0.10590015
## [5] {Documentary, Mystery}   => {Drama}  0.10741301 1          0.10741301
## [6] {Documentary, Mystery}   => {Comedy} 0.10741301 1          0.10741301
##     lift     count
## [1] 1.145581 56   
## [2] 1.135739 56   
## [3] 1.145581 62   
## [4] 1.145581 70   
## [5] 1.145581 71   
## [6] 1.135739 71

The best rules all have confidence 1 so let us look at the best rules that have value of confidence smaller than one. They all have one thing in common again. Their support values are small what suggests that they are relatively rare again. So let us see what happens if we look at rules with higher support values:

rules_filtered <- subset(best_rules,support > 0.25)
inspect(head(rules_filtered))

##     lhs                                     rhs      support   confidence
## [1] {Action, Adventure, Mystery}         => {Comedy} 0.2571861 0.9883721 
## [2] {Action, Adventure, Drama, Mystery}  => {Comedy} 0.2526475 0.9881657 
## [3] {Action, Animation}                  => {Comedy} 0.2662632 0.9832402 
## [4] {Adventure, Drama, Mystery}          => {Comedy} 0.2571861 0.9826590 
## [5] {Action, Adventure, Mystery}         => {Drama}  0.2556732 0.9825581 
## [6] {Action, Adventure, Comedy, Mystery} => {Drama}  0.2526475 0.9823529 
##     coverage  lift     count
## [1] 0.2602118 1.122533 170  
## [2] 0.2556732 1.122298 167  
## [3] 0.2708018 1.116704 176  
## [4] 0.2617247 1.116044 170  
## [5] 0.2602118 1.125600 169  
## [6] 0.2571861 1.125364 167

After setting the support value to 0.25, we can see that there are no rules with confidence 1. Sure most of them still have high confidence, however as I mentioned before, ratings in this datasets are very high and majority of users left high ratings for movies of each genre.

It is also worth checking which movie genres people are less likely to enjoy. This is presented by the rules below:

rules_filtered <- subset(best_rules)
inspect(tail(rules_filtered))

##     lhs                                         rhs         support   
## [1] {Comedy, Crime, Horror, Mystery}         => {Animation} 0.08472012
## [2] {Comedy, Crime, Drama, Horror, Mystery}  => {Animation} 0.08472012
## [3] {Action, Adventure, Documentary}         => {Animation} 0.09833585
## [4] {Comedy, Crime, Thriller}                => {Mystery}   0.08169440
## [5] {Action, Comedy, Horror, Mystery}        => {Animation} 0.08623298
## [6] {Action, Comedy, Drama, Horror, Mystery} => {Animation} 0.08623298
##     confidence coverage  lift     count
## [1] 0.6021505  0.1406959 2.139901 56   
## [2] 0.6021505  0.1406959 2.139901 56   
## [3] 0.6018519  0.1633888 2.138839 65   
## [4] 0.6000000  0.1361573 1.844651 54   
## [5] 0.6000000  0.1437216 2.132258 57   
## [6] 0.6000000  0.1437216 2.132258 57

While the lowest confidence value is still relatively high (0.6, the threshold set while running the apriori algorithm ), there are some interesting observations to be made. People who enjoy Action and adventure movies are slightly less likely to enjoy crime movies. People who enjoy most of the popular genres such as comedy, action, horror or drama are not very likely to enjoy Animations as well as Mystery movies

Now let us take a look at how rules behave for certain genres.

rules_Horror <- subset(all_rules, lhs %in% "Horror")
rules_Horror <- sort(rules_Horror, by = "support")
inspect(head(rules_Horror))

##     lhs                 rhs      support   confidence coverage  lift     count
## [1] {Horror}         => {Comedy} 0.2450832 0.9818182  0.2496218 1.115089 162  
## [2] {Horror}         => {Drama}  0.2435703 0.9757576  0.2496218 1.117809 161  
## [3] {Horror}         => {Action} 0.2405446 0.9636364  0.2496218 1.117480 159  
## [4] {Drama, Horror}  => {Comedy} 0.2390318 0.9813665  0.2435703 1.114576 158  
## [5] {Comedy, Horror} => {Drama}  0.2390318 0.9753086  0.2450832 1.117295 158  
## [6] {Action, Horror} => {Comedy} 0.2375189 0.9874214  0.2405446 1.121453 157

While analyzing the rules for people that enjoyed horror movies it is clear that they will most likely enjoy Comedies and Dramas as well as Action movies. By sorting the results by support value I made sure that only the rules that are frequent enough appeared.

Now let us switch to a genre that is completely different - Animation. Again I wanted to sort the rules by support to show how rules that are most common will behave. The results are similar as in the instance of Horror movie fans. People who enjoy Animations, also enjoy comedies action movies and dramas

rules_Animation <- subset(all_rules, lhs %in% "Animation")
rules_Animation <- sort(rules_Animation, by = "support")
inspect(head(rules_Animation))

##     lhs                    rhs      support   confidence coverage  lift    
## [1] {Animation}         => {Comedy} 0.2738275 0.9731183  0.2813918 1.105208
## [2] {Animation}         => {Action} 0.2708018 0.9623656  0.2813918 1.116006
## [3] {Action, Animation} => {Comedy} 0.2662632 0.9832402  0.2708018 1.116704
## [4] {Animation, Comedy} => {Action} 0.2662632 0.9723757  0.2738275 1.127615
## [5] {Animation}         => {Drama}  0.2632375 0.9354839  0.2813918 1.071672
## [6] {Animation, Drama}  => {Comedy} 0.2571861 0.9770115  0.2632375 1.109630
##     count
## [1] 181  
## [2] 179  
## [3] 176  
## [4] 176  
## [5] 174  
## [6] 170

For movies for children, a similar trend follows, but with addition of the fact that people who enjoy such movies also seem to be fans of Adventure movies.

rules_children_movies <- subset(all_rules, lhs %in% "Children")
rules_children_movies <- sort(rules_children_movies, by = "support")
inspect(head(rules_children_movies))

##     lhs                   rhs         support   confidence coverage  lift    
## [1] {Children}         => {Drama}     0.2375189 0.9631902  0.2465961 1.103412
## [2] {Children}         => {Comedy}    0.2344932 0.9509202  0.2465961 1.079997
## [3] {Children}         => {Action}    0.2269289 0.9202454  0.2465961 1.067162
## [4] {Children, Drama}  => {Comedy}    0.2269289 0.9554140  0.2375189 1.085101
## [5] {Children, Comedy} => {Drama}     0.2269289 0.9677419  0.2344932 1.108626
## [6] {Children}         => {Adventure} 0.2223903 0.9018405  0.2465961 1.293094
##     count
## [1] 157  
## [2] 155  
## [3] 150  
## [4] 150  
## [5] 150  
## [6] 147

The last movie genre that I analyzed was Fantasy and it produced similar results to the previous genres with movies like comedy, drama, adventure and action on the right hand side of a rule.

rules_fantasy <- subset(all_rules, lhs %in% "Fantasy")
rules_fantasy <- sort(rules_fantasy, by = "support")
inspect(head(rules_fantasy))

##     lhs                  rhs         support    confidence coverage   lift    
## [1] {Fantasy}         => {Comedy}    0.08774584 0.9830508  0.08925870 1.116489
## [2] {Fantasy}         => {Drama}     0.08623298 0.9661017  0.08925870 1.106747
## [3] {Drama, Fantasy}  => {Comedy}    0.08472012 0.9824561  0.08623298 1.115814
## [4] {Comedy, Fantasy} => {Drama}     0.08472012 0.9655172  0.08774584 1.106078
## [5] {Fantasy}         => {Adventure} 0.08320726 0.9322034  0.08925870 1.336630
## [6] {Fantasy}         => {Action}    0.08320726 0.9322034  0.08925870 1.081029
##     count
## [1] 58   
## [2] 57   
## [3] 56   
## [4] 56   
## [5] 55   
## [6] 55

6. Visualizing the rules

Now I wanted to go over the visualization of rules for a sample genre of horror. For some of these plots only 10 best rules (with the highest value of support metric) are presented for the sake of clarity of a plot.

rules_top_Horror <- head(rules_Horror,10)
rules_top_animation <- head(rules_Animation,10)

On this plot we can see the rules with Genre Horror included in the left hand side of a rule. The rules have a support value varying from less then 0.1 to above 0.25. Genres like comedy and Action almost always appear on the right hand side of a rule. Those genres also have the lowest lift value that is very close to 1 what might suggest low association. Movie genres like Animation and Mystery appear more rarely but have a much higher lift value along with lower support

plot(rules_Horror, method = "grouped")

The next plot displays the likelihood of genres being liked by fans of horror movies. We can again notice that mystery and animation movies are towards the bottom of the graph, suggesting that those genres are generally enjoyed a bit less by horror fans.

plot(rules_Horror, engine = "plotly")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

This is a visual representation of the net created by top ten rules that have genre horror contained on the left hand side. Three of those rules have drama and action on the right hand side and four of them have comedy

plot(rules_top_Horror, method = "graph",  engine = "htmlwidget")

On the last graph we can see the the difference of behavior of a rule depending on the place which horror has on the left hand side of a rule:

plot(rules_top_Horror, method="paracoord")

Movie Genre Association Rule Analysis

Łukasz Brzoska

2025-02-08

1. Introduction