Association Rules Analysis of 1000 best-rated movies on IMDB

Cezary Kuźmowicz

Watching movies is part of almost everyone’s life. Over the decades, the way we consume them has changed dramatically. In the previous millennium, the only way to watch a film was to visit a cinema. Later, movies started to appear on TV. A sign of our times is the most flexible way of consuming film content: VOD platforms like Netflix, MAX, Amazon Prime Video or Disney Plus, which allow people from every part of the globe to watch almost any movie.

The topic of my research is an association rules analysis of the 1000 best-rated movies on IMDB, the world’s best-known platform for film enthusiasts. I’ll create both more and less detailed rules to get as many insights as possible. The Apriori algorithm will be used.

Link to dataset: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows

Data Preparation

Installing necessary packages

First, I need to load all the packages needed for the analysis (they have to be installed beforehand).

library(arules)

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(arulesViz)

With this done, we can move on to loading the data.

Loading data

My dataset comes from Kaggle and is in .csv format, which is very convenient to read in R.

raw_data <- read.csv("imdb_top_1000.csv", sep = ",", header = TRUE)

Deleting unnecessary variables

Some variables present in the original dataset are not crucial from the perspective of association analysis. I’ll delete them.

data <- raw_data |> 
  select(-Poster_Link, -Overview, -Series_Title, -Meta_score, -Certificate)

Variables preparation

The variables come in very different forms. To conduct the analysis properly, I will convert most of them to categorical data. I’ll use dplyr notation because I came to really like it during the “Intro to R” course :)

Release Year

To obtain categorical periods, I have to convert the current data into numbers and then assign them to predefined intervals.

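data <- data |> 
  # runs before the conversion, so it only drops values that are already NA
  filter(!is.na(Released_Year)) |> 
  # non-numeric entries become NA here (hence the warning below)
  mutate(Released_Year = as.numeric(Released_Year)) |> 
  # intervals are right-closed: e.g. (1950, 1960] is labelled "50s"
  mutate(Released_Year = cut(Released_Year,
                             breaks = c(1920, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020),
                             labels = c("world_wars", "50s", "60s", "70s", "80s", "90s", "2000s", "2010s")))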

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Released_Year = as.numeric(Released_Year)`.
## Caused by warning:
## ! NAs introduced by coercion

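The warning above appears because at least one entry in Released_Year is not a number; it is converted to NA and therefore gets no period. Apart from that, the information about the year of release is now ready for further calculations.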
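To see how `cut()` assigns years to periods, here is a small check on a few hand-picked years (the years are illustrative, not taken from the dataset). By default the intervals are right-closed, so a boundary year such as 1950 lands in the earlier period, and a value equal to the lowest break gets no category at all. The left-closed variant is shown only for comparison; it is not what this report uses.

```r
# Same breaks and labels as in the step above
breaks <- c(1920, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
labels <- c("world_wars", "50s", "60s", "70s", "80s", "90s", "2000s", "2010s")

# Illustrative years only
years <- c(1920, 1950, 1951, 1994, 2020)

# Default behaviour: right-closed intervals, e.g. (1950, 1960]
default_cut <- cut(years, breaks = breaks, labels = labels)

# Alternative (not used in this report): left-closed intervals, e.g. [1950, 1960)
left_closed <- cut(years, breaks = breaks, labels = labels,
                   right = FALSE, include.lowest = TRUE)

data.frame(years, default_cut, left_closed)
```

With the default settings, 1920 becomes NA and 1950 is labelled "world_wars"; with the left-closed variant, 1920 is "world_wars" and 1950 is "50s".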

Length of the movie

In the original dataset, runtime is stored as text such as “120 min”. I have to remove the “ min” part and then convert the rest into numbers. The final step is creating new length tags.

data <- data |> 
  mutate(Runtime = substr(Runtime, 1, nchar(Runtime) - 4)) |> 
  mutate(Runtime = as.numeric(Runtime)) |> 
  mutate(Runtime = cut(Runtime, breaks = c(40, 60, 100, 120, 180, 500),
                       labels = c("short movie", "medium movie", "feature movie",
                                  "long movie", "extra long movie")))
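The `substr()` call relies on every value ending in exactly four characters (" min"). A slightly more defensive variant, sketched here on made-up values, strips the suffix with a regular expression instead; the breaks and labels are the same as above.

```r
# Made-up runtime strings in the same format as the dataset
runtime_raw <- c("142 min", "96 min", "201 min")

# Remove the trailing " min" and convert what is left to a number
runtime_num <- as.numeric(sub(" min$", "", runtime_raw))

# Assign the numeric runtimes to the length categories
runtime_cat <- cut(runtime_num, breaks = c(40, 60, 100, 120, 180, 500),
                   labels = c("short movie", "medium movie", "feature movie",
                              "long movie", "extra long movie"))
runtime_cat
```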

Movie’s IMDB rating

The next step is gathering ratings into groups. IMDB ratings range from 1 to 10. As we analyze the 1000 best-rated movies, the minimal rating is “only” 7.5. This case is very easy: the ratings just need to be assigned to groups.

data <- data |> 
  mutate(IMDB_Rating = cut(IMDB_Rating,
                       breaks = c(7.5, 7.75, 8, 8.5, 9, 10 ),
                       labels = c("7.5-7.75 rating", "7.75-8 rating", "8-8.5 rating", 
                                  "8.5-9 rating", "9+ rating")))

Number of votes

Every rating comes from many community votes. Here, I will group the vote counts into intervals.

data <- data |> 
  mutate(No_of_Votes = cut(No_of_Votes,
                       breaks = c(25000, 50000, 100000, 200000, 500000, 1000000, 100e6),
                       labels = c("25-50k votes", "50-100k votes", "100-200k votes", "200-500k votes",
                                  "500k-1mln votes", "1mln+ votes")))

Gross revenue

The movie industry is a huge one, and with it comes huge money! An important feature of each movie is its earnings. Here, some modifications were needed. Where there was no information about earnings, I used the median gross revenue (it is less sensitive to extreme values than the mean).

data <- data |> 
  mutate(Gross = as.numeric(gsub(",","",Gross))) |> 
  mutate(Gross = ifelse(is.na(Gross), median(Gross, na.rm = TRUE), Gross)) |> 
  mutate(Gross = cut(Gross,
                           breaks = c(0, 10e6, 50e6, 100e6, 500e6, 9999999e6),
                           labels = c("Low rev. (<$10M)", "Decent rev. ($10-50M)",
                                      "Moderate rev. ($50-100M)", "High rev. ($100-500M)",
                                      "Blockbusters ($500M+)")))

As a result, we’ve got 5 earnings intervals.
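The imputation step can be traced on a tiny example. The values below are illustrative, and the sketch assumes that missing revenue is stored as an empty string, which `as.numeric()` turns into NA. Note the side effect: every movie with missing revenue ends up in whichever interval contains the median.

```r
# Illustrative revenue strings; "" stands for a missing value (an assumption)
gross_raw <- c("28,341,469", "", "534,858,444", "4,360,000")

# Drop the thousands separators and convert to numbers
gross <- as.numeric(gsub(",", "", gross_raw))

# Replace missing values with the median of the known ones
gross_filled <- ifelse(is.na(gross), median(gross, na.rm = TRUE), gross)

# Same intervals and labels as in the step above
gross_cat <- cut(gross_filled,
                 breaks = c(0, 10e6, 50e6, 100e6, 500e6, 9999999e6),
                 labels = c("Low rev. (<$10M)", "Decent rev. ($10-50M)",
                            "Moderate rev. ($50-100M)", "High rev. ($100-500M)",
                            "Blockbusters ($500M+)"))
table(gross_cat)
```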

Actors and Directors

Here, the situation is a little more complicated. The four top-billed stars of each movie are given. To prepare the data for further analysis, I have to create a new variable containing all the actors. When it comes to directors, I will add the prefix “dir_by_” to every director to differentiate them from actors during rule evaluation.

data <- data |> 
  mutate(Actors = paste(Star1, Star2, Star3, Star4, sep = ", ")) |> 
  select(-Star1, -Star2, -Star3, -Star4) |> 
  mutate(Director = paste0("dir_by_", Director))

Creating transaction column for association rules

For association rules to work, my data should be of the transactions data type. In this step I will merge all the needed columns into one. In the next chapter I will convert it to transactions.

data <- data |> 
  mutate(Transaction = paste(Genre, Director, Actors, Released_Year, Runtime, 
                             IMDB_Rating, No_of_Votes, Gross, sep = ", ")) |> 
  mutate(Transaction = strsplit(Transaction, ", "))

head(data$Transaction)

## [[1]]
##  [1] "Drama"                 "dir_by_Frank Darabont" "Tim Robbins"          
##  [4] "Morgan Freeman"        "Bob Gunton"            "William Sadler"       
##  [7] "90s"                   "long movie"            "9+ rating"            
## [10] "1mln+ votes"           "Decent rev. ($10-50M)"
## 
## [[2]]
##  [1] "Crime"                       "Drama"                      
##  [3] "dir_by_Francis Ford Coppola" "Marlon Brando"              
##  [5] "Al Pacino"                   "James Caan"                 
##  [7] "Diane Keaton"                "70s"                        
##  [9] "long movie"                  "9+ rating"                  
## [11] "1mln+ votes"                 "High rev. ($100-500M)"      
## 
## [[3]]
##  [1] "Action"                   "Crime"                   
##  [3] "Drama"                    "dir_by_Christopher Nolan"
##  [5] "Christian Bale"           "Heath Ledger"            
##  [7] "Aaron Eckhart"            "Michael Caine"           
##  [9] "2000s"                    "long movie"              
## [11] "8.5-9 rating"             "1mln+ votes"             
## [13] "Blockbusters ($500M+)"   
## 
## [[4]]
##  [1] "Crime"                       "Drama"                      
##  [3] "dir_by_Francis Ford Coppola" "Al Pacino"                  
##  [5] "Robert De Niro"              "Robert Duvall"              
##  [7] "Diane Keaton"                "70s"                        
##  [9] "extra long movie"            "8.5-9 rating"               
## [11] "1mln+ votes"                 "Moderate rev. ($50-100M)"   
## 
## [[5]]
##  [1] "Crime"               "Drama"               "dir_by_Sidney Lumet"
##  [4] "Henry Fonda"         "Lee J. Cobb"         "Martin Balsam"      
##  [7] "John Fiedler"        "50s"                 "medium movie"       
## [10] "8.5-9 rating"        "500k-1mln votes"     "Low rev. (<$10M)"   
## 
## [[6]]
##  [1] "Action"                "Adventure"             "Drama"                
##  [4] "dir_by_Peter Jackson"  "Elijah Wood"           "Viggo Mortensen"      
##  [7] "Ian McKellen"          "Orlando Bloom"         "2000s"                
## [10] "extra long movie"      "8.5-9 rating"          "1mln+ votes"          
## [13] "High rev. ($100-500M)"

We can see what the inside of this column looks like. It’s a list with all the needed information about each movie.
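One side effect of building transactions with `paste()` is worth knowing: a missing value is pasted as the literal text "NA" and would then be treated as an item. A minimal sketch on made-up rows, with an optional clean-up step that is not part of this report:

```r
# Two made-up movies; the second one has no release period
genre  <- c("Crime, Drama", "Drama")
period <- factor(c("70s", NA))

# Same approach as above: paste the columns, then split on ", "
transaction <- strsplit(paste(genre, period, sep = ", "), ", ")
transaction[[2]]   # the missing period shows up as the string "NA"

# Optional clean-up: drop the "NA" strings before converting to transactions
clean <- lapply(transaction, function(items) items[items != "NA"])
clean[[2]]
```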


Association rules analysis

Preparing to analysis

As mentioned before, I’ll extract the “Transaction” column and convert it into the transactions data type. This will allow me to continue my work.

trans <- as(data$Transaction, "transactions")

## Warning in asMethod(object): removing duplicated items in transactions

summary(trans)

## transactions as itemMatrix in sparse format with
##  1000 rows (elements/itemsets/transactions) and
##  3308 columns (items) and a density of 0.003789903 
## 
## most frequent items:
##                 Drama            long movie         7.75-8 rating 
##                   724                   437                   398 
## Decent rev. ($10-50M)      Low rev. (<$10M)               (Other) 
##                   382                   321                 10275 
## 
## element (itemset/transaction) length distribution:
## sizes
##  11  12  13 
## 106 251 643 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   12.00   13.00   12.54   13.00   13.00 
## 
## includes extended item information - examples:
##           labels
## 1 100-200k votes
## 2    1mln+ votes
## 3 200-500k votes

From the summary we can see that the most frequent items are “Drama”, “long movie” and “7.75-8 rating”. The first one has a huge advantage over the rest: over 70% of the 1000 best movies are dramas!

When it comes to sizes, we see that each row contains from 11 to 13 elements. The rows differ in the number of genres listed for a movie (from one to three); the remaining ten elements (director, four actors, period, length, rating, votes and revenue) are present for every movie.

Now we can see what the information about each movie looks like.

inspect(head(trans,3))

##     items                         
## [1] {1mln+ votes,                 
##      9+ rating,                   
##      90s,                         
##      Bob Gunton,                  
##      Decent rev. ($10-50M),       
##      dir_by_Frank Darabont,       
##      Drama,                       
##      long movie,                  
##      Morgan Freeman,              
##      Tim Robbins,                 
##      William Sadler}              
## [2] {1mln+ votes,                 
##      70s,                         
##      9+ rating,                   
##      Al Pacino,                   
##      Crime,                       
##      Diane Keaton,                
##      dir_by_Francis Ford Coppola, 
##      Drama,                       
##      High rev. ($100-500M),       
##      James Caan,                  
##      long movie,                  
##      Marlon Brando}               
## [3] {1mln+ votes,                 
##      2000s,                       
##      8.5-9 rating,                
##      Aaron Eckhart,               
##      Action,                      
##      Blockbusters ($500M+),       
##      Christian Bale,              
##      Crime,                       
##      dir_by_Christopher Nolan,    
##      Drama,                       
##      Heath Ledger,                
##      long movie,                  
##      Michael Caine}

length(trans)

## [1] 1000

What’s more, we got confirmation that our dataset still has 1000 observations.

Most frequent items

sort(itemFrequency(trans[, 1:10], type="relative"), decreasing = TRUE)

##           2000s           2010s  200-500k votes   50-100k votes    25-50k votes 
##           0.241           0.225           0.224           0.215           0.207 
##  100-200k votes 500k-1mln votes             60s             50s     1mln+ votes 
##           0.171           0.141           0.065           0.062           0.042

sort(itemFrequency(trans[, 1:10], type="absolute"), decreasing = TRUE)

##           2000s           2010s  200-500k votes   50-100k votes    25-50k votes 
##             241             225             224             215             207 
##  100-200k votes 500k-1mln votes             60s             50s     1mln+ votes 
##             171             141              65              62              42

To present the most frequent items better, I will show a dedicated plot.
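The code behind the plot is not shown in this report. A minimal sketch of how such a ranking and bar chart could be produced with arules follows; the toy transactions and `topN = 3` are assumptions made only to keep the example self-contained.

```r
library(arules)

# Toy transactions built from a few items shown earlier (illustrative only)
toy <- as(list(
  c("Drama", "long movie", "90s"),
  c("Crime", "Drama", "long movie", "70s"),
  c("Action", "Crime", "Drama", "2000s")
), "transactions")

# Relative frequency of every item, most frequent first
freq <- sort(itemFrequency(toy, type = "relative"), decreasing = TRUE)
head(freq, 3)

# Bar chart of the most frequent items; topN = 3 is an arbitrary choice here
itemFrequencyPlot(toy, topN = 3, type = "relative")
```

On the real data, calling `itemFrequency(trans)` without the `[, 1:10]` subset ranks all items rather than only the first ten columns.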

We already know the TOP 3 most frequent items. The next positions are taken by items like “Low rev.”, “2000s” or “Comedy”. We can see which characteristics most of the best movies share ;)

The last thing to do before moving to the best part is checking sparsity. It refers to the proportion of empty (zero) cells in the transaction matrix, that is, how many of all possible items are absent from each transaction.
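Sparsity is simply one minus the density that `summary()` reports (about 0.0038 for this dataset, so roughly 99.6% of the cells are empty). A small sketch on toy transactions, which are an assumption for illustration, shows the calculation and the `image()` call that draws the matrix:

```r
library(arules)

# Toy transactions (illustrative only): 3 movies, 6 distinct items
toy <- as(list(
  c("Drama", "90s"),
  c("Crime", "Drama", "70s"),
  c("Action", "2000s")
), "transactions")

# Logical matrix: one row per transaction, one column per item
m <- as(toy, "matrix")

# Density = share of filled cells; sparsity = share of empty cells
density  <- sum(m) / length(m)
sparsity <- 1 - density
c(density = density, sparsity = sparsity)

# Visual check: filled cells are drawn dark, empty cells stay white
image(toy)
```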

Most of the image is white, so the data is sparse. In a perfect world I should reduce the number of unique categories. I knowingly won’t do it, but I’m aware of the problem.

Creating rules

General rules

I’ll begin by creating general rules, without filtering. I’ve decided to go with support = 0.05 and confidence = 0.2. Why? I’ve tested plenty of options, and this combination of parameters gave me the most useful output.

movie_rules_pre <- apriori(trans, parameter = list(support = 0.05, confidence = 0.2, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 50 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3308 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [458 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

movie_rules_pre

## set of 458 rules

As a result we got almost 460 rules. Before exploring them further, I will remove redundant rules. They provide no new information because a more general rule with the same or higher confidence already exists.

redundant_idx <- is.redundant(movie_rules_pre)
sum(redundant_idx)

## [1] 85

movie_rules <- movie_rules_pre[!redundant_idx]
movie_rules

## set of 373 rules

With this done, we can dive into them!

Exploring rules

First, I’ll start with a summary of our rules.

summary(movie_rules)

## set of 373 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 272 101 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.271   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.0500   Min.   :0.2000   Min.   :0.0560   Min.   :0.4581  
##  1st Qu.:0.0580   1st Qu.:0.2653   1st Qu.:0.1610   1st Qu.:1.0127  
##  Median :0.0730   Median :0.3550   Median :0.2150   Median :1.1135  
##  Mean   :0.0866   Mean   :0.4115   Mean   :0.2398   Mean   :1.1817  
##  3rd Qu.:0.0960   3rd Qu.:0.4771   3rd Qu.:0.2890   3rd Qu.:1.2604  
##  Max.   :0.3490   Max.   :0.9701   Max.   :0.7240   Max.   :3.2354  
##      count      
##  Min.   : 50.0  
##  1st Qu.: 58.0  
##  Median : 73.0  
##  Mean   : 86.6  
##  3rd Qu.: 96.0  
##  Max.   :349.0  
## 
## mining info:
##   data ntransactions support confidence
##  trans          1000    0.05        0.2
##                                                                                   call
##  apriori(data = trans, parameter = list(support = 0.05, confidence = 0.2, minlen = 2))

From the summary we can see that the quality measures of our rules aren’t high. Mean support equals 8.7%, while mean confidence is 41%. Coverage isn’t high either: its mean is 0.24, so the LHS of a typical rule appears in only about a quarter of the movies (coverage is the support of the LHS, so it can never exceed 1). With lift the situation is a little better: the mean and median values are above 1, which points to a positive, though mostly weak, association between LHS and RHS.

Best rules by quality measure
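The code for the rankings in this part is not shown. A minimal sketch of how such "TOP 5" listings could be produced is given here on toy transactions; the toy data and its thresholds are assumptions, and the sketch assumes an arules version in which `apriori()` reports coverage, as in the summary above.

```r
library(arules)

# Toy transactions (illustrative only, loosely based on the movies shown earlier)
toy <- as(list(
  c("Drama", "long movie", "90s"),
  c("Crime", "Drama", "long movie", "70s"),
  c("Action", "Crime", "Drama", "long movie", "2000s"),
  c("Crime", "Drama", "extra long movie", "70s"),
  c("Crime", "Drama", "medium movie", "50s"),
  c("Action", "Adventure", "Drama", "extra long movie", "2000s")
), "transactions")

# Thresholds chosen for the toy data only
rules <- apriori(toy,
                 parameter = list(support = 0.3, confidence = 0.5, minlen = 2),
                 control   = list(verbose = FALSE))

# Print the five best rules for each quality measure
for (measure in c("support", "confidence", "lift", "coverage")) {
  print(paste("TOP 5 rules regarding", measure))
  inspect(head(sort(rules, by = measure), 5))
}
```

In the report itself, `movie_rules` would take the place of `rules`.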

## [1] "TOP 5 rules regarding support"

##     lhs                        rhs                     support confidence
## [1] {long movie}            => {Drama}                 0.349   0.7986270 
## [2] {Drama}                 => {long movie}            0.349   0.4820442 
## [3] {Decent rev. ($10-50M)} => {Drama}                 0.290   0.7591623 
## [4] {Drama}                 => {Decent rev. ($10-50M)} 0.290   0.4005525 
## [5] {7.75-8 rating}         => {Drama}                 0.283   0.7110553 
##     coverage lift      count
## [1] 0.437    1.1030760 349  
## [2] 0.724    1.1030760 349  
## [3] 0.382    1.0485667 290  
## [4] 0.724    1.0485667 290  
## [5] 0.398    0.9821205 283

## [1] "TOP 5 rules regarding confidence"

##     lhs                             rhs     support confidence coverage
## [1] {Biography, long movie}      => {Drama} 0.065   0.9701493  0.067   
## [2] {History}                    => {Drama} 0.054   0.9642857  0.056   
## [3] {2000s, Low rev. (<$10M)}    => {Drama} 0.080   0.9523810  0.084   
## [4] {Biography}                  => {Drama} 0.103   0.9449541  0.109   
## [5] {100-200k votes, long movie} => {Drama} 0.058   0.9062500  0.064   
##     lift     count
## [1] 1.339985  65  
## [2] 1.331886  54  
## [3] 1.315443  80  
## [4] 1.305185 103  
## [5] 1.251727  58

## [1] "TOP 5 rules regarding lift"

##     lhs                        rhs                     support confidence
## [1] {Adventure}             => {Animation}             0.052   0.2653061 
## [2] {Animation}             => {Adventure}             0.052   0.6341463 
## [3] {High rev. ($100-500M)} => {500k-1mln votes}       0.077   0.4325843 
## [4] {500k-1mln votes}       => {High rev. ($100-500M)} 0.077   0.5460993 
## [5] {Adventure, long movie} => {Action}                0.050   0.5681818 
##     coverage lift     count
## [1] 0.196    3.235441 52   
## [2] 0.082    3.235441 52   
## [3] 0.178    3.067974 77   
## [4] 0.141    3.067974 77   
## [5] 0.088    3.006253 50

## [1] "TOP 5 rules regarding coverage"

##     lhs        rhs              support confidence coverage lift      count
## [1] {Drama} => {25-50k votes}   0.167   0.2306630  0.724    1.1143139 167  
## [2] {Drama} => {Crime}          0.160   0.2209945  0.724    1.0573898 160  
## [3] {Drama} => {2010s}          0.171   0.2361878  0.724    1.0497238 171  
## [4] {Drama} => {50-100k votes}  0.176   0.2430939  0.724    1.1306694 176  
## [5] {Drama} => {200-500k votes} 0.158   0.2182320  0.724    0.9742502 158

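One thing to note first: some rules appear in pairs, such as {long movie} => {Drama} and {Drama} => {long movie}. The redundancy filter does not remove them, because they are different rules built from the same items with the direction reversed. Support and lift are identical in both directions, while confidence and coverage differ.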

I’ll start with the best rules regarding support. Support measures how often an item set appears in the dataset. Here the winner is the pair “long movie” and “Drama”, two extremely frequent items. The second pair is “Decent rev.” and, again, “Drama”. For example, from the first rule we can read that if a movie is “long”, in almost 80% of cases it is a drama! These results are somewhat inflated by the extremely high frequency of “Drama” in the dataset, which the lift of only 1.10 confirms.

Next, confidence! It measures how often the RHS occurs when the LHS is present. Here, we have some nice rules. For example, if a movie belongs to the “Biography” genre and is “long”, in 97% of cases it is also a “Drama”! I won’t interpret the other ones because they once again point to “Drama”.

It’s time for lift. It measures how much more often the LHS and RHS occur together than they would if they were independent. Values well above 1 indicate a strong association. Here, the strongest rule is the pair of “Adventure” and “Animation” genres. Their lift is over 3, which shows a very strong association.

Last but not least, coverage. It shows how often the LHS appears in transactions, that is, how widely applicable the rule is in the dataset. As expected, all of the top rules are connected with “Drama”. Ehh….
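The four measures can be recomputed by hand from the counts in the tables above. The sketch below does this for {Adventure} => {Animation}; the item counts are read off the coverage and count columns (196 Adventure movies, 82 Animation movies, 52 that are both).

```r
n           <- 1000  # number of transactions (movies)
n_adventure <- 196   # coverage 0.196 of the rules with {Adventure} on the LHS
n_animation <- 82    # coverage 0.082 of the rules with {Animation} on the LHS
n_both      <- 52    # count of the Adventure/Animation rules

support    <- n_both / n                  # share of movies with both items
confidence <- n_both / n_adventure        # share of Adventure movies that are Animation
coverage   <- n_adventure / n             # share of movies containing the LHS
lift       <- support / ((n_adventure / n) * (n_animation / n))

round(c(support = support, confidence = confidence,
        coverage = coverage, lift = lift), 4)
```

The results (0.052, 0.2653, 0.196 and 3.2354) match the first row of the lift table. Swapping LHS and RHS changes confidence to 52 / 82 = 0.6341 but leaves support and lift untouched.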

I’ll show a plot of all 3 key measures. The x-axis shows support, the y-axis shows confidence, and the color shading reflects lift.
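A plot of this kind can be drawn with the `plot()` method from arulesViz. The sketch below uses toy transactions and toy thresholds, which are assumptions for illustration; in the report the same call would be applied to `movie_rules`.

```r
library(arules)
library(arulesViz)

# Toy transactions (illustrative only)
toy <- as(list(
  c("Drama", "long movie", "90s"),
  c("Crime", "Drama", "long movie", "70s"),
  c("Action", "Crime", "Drama", "long movie", "2000s"),
  c("Crime", "Drama", "extra long movie", "70s"),
  c("Crime", "Drama", "medium movie", "50s"),
  c("Action", "Adventure", "Drama", "extra long movie", "2000s")
), "transactions")

rules <- apriori(toy,
                 parameter = list(support = 0.3, confidence = 0.5, minlen = 2),
                 control   = list(verbose = FALSE))

# Scatter plot: support on x, confidence on y, lift as color shading
plot(rules, measure = c("support", "confidence"), shading = "lift")
```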

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

From the plot we can see that most of our rules have low confidence and support. That isn’t a good sign.

The last chart in this section presents the distribution of the confidence of our rules.

We can observe that most rules have confidence up to 0.4. The earlier summary suggested the same. The interesting part is the rules with confidence over 0.8 - we will dive into them!

Targeted rules

In this part I will create rules for certain items. Additional visualisations will also be provided. In each section I will use different support and confidence parameters. I know it’s kind of against the logic, but I want to present more than 1 or 2 rules :)
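The code for the targeted rules is not shown. One way to restrict rules to chosen items is the `appearance` argument of `apriori()`, sketched here on toy transactions. The toy data and thresholds are assumptions; the actual parameters used in each section are not stated in this report.

```r
library(arules)

# Toy transactions (illustrative only, loosely based on the movies shown earlier)
toy <- as(list(
  c("Drama", "dir_by_Frank Darabont", "long movie", "Decent rev. ($10-50M)"),
  c("Crime", "Drama", "dir_by_Francis Ford Coppola", "long movie",
    "High rev. ($100-500M)"),
  c("Action", "Crime", "Drama", "dir_by_Christopher Nolan", "long movie",
    "Blockbusters ($500M+)"),
  c("Crime", "Drama", "dir_by_Francis Ford Coppola", "extra long movie",
    "Moderate rev. ($50-100M)"),
  c("Action", "Adventure", "Drama", "dir_by_Peter Jackson", "extra long movie",
    "High rev. ($100-500M)")
), "transactions")

# Rules that end in a chosen item: fix the RHS, allow any other item on the LHS
rules_rhs <- apriori(toy,
  parameter  = list(support = 0.1, confidence = 0.5, minlen = 2),
  appearance = list(default = "lhs", rhs = "High rev. ($100-500M)"),
  control    = list(verbose = FALSE))

# Rules that start from a chosen item: fix the LHS, allow any other item on the RHS
rules_lhs <- apriori(toy,
  parameter  = list(support = 0.1, confidence = 0.5, minlen = 2),
  appearance = list(default = "rhs", lhs = "dir_by_Francis Ford Coppola"),
  control    = list(verbose = FALSE))

inspect(head(sort(rules_rhs, by = "confidence"), 3))
inspect(head(sort(rules_lhs, by = "confidence"), 3))
```

The first pattern fits the sections that target a revenue or rating class, the second fits the sections on directors and actors.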

Rules for High-Revenue Movies

##     lhs                   rhs                     support confidence coverage     lift count
## [1] {500k-1mln votes,                                                                       
##      Adventure}        => {High rev. ($100-500M)}   0.043  0.7962963    0.054 4.473575    43
## [2] {500k-1mln votes,                                                                       
##      long movie}       => {High rev. ($100-500M)}   0.041  0.5774648    0.071 3.244184    41
## [3] {500k-1mln votes}  => {High rev. ($100-500M)}   0.077  0.5460993    0.141 3.067974    77
## [4] {Action,                                                                                
##      Adventure}        => {High rev. ($100-500M)}   0.044  0.5301205    0.083 2.978205    44
## [5] {Adventure,                                                                             
##      long movie}       => {High rev. ($100-500M)}   0.046  0.5227273    0.088 2.936670    46
## [6] {Adventure}        => {High rev. ($100-500M)}   0.083  0.4234694    0.196 2.379042    83

First interesting result! From our rules we can see that the “Adventure” genre is the item most often associated with high film revenue. It’s present in four of our six rules; the other two are built around a high number of votes. So here comes a tip for directors - if you want a lot of money, make adventure films ;)

Rules for best directors

I’ve chosen four directors, my favorite ones: Hitchcock, Spielberg, Kubrick and Nolan. I’ll create rules for them.

##      lhs                           rhs                     support confidence
## [1]  {dir_by_Christopher Nolan} => {1mln+ votes}           0.007   0.8750000 
## [2]  {dir_by_Stanley Kubrick}   => {8-8.5 rating}          0.007   0.7777778 
## [3]  {dir_by_Stanley Kubrick}   => {Drama}                 0.007   0.7777778 
## [4]  {dir_by_Christopher Nolan} => {long movie}            0.006   0.7500000 
## [5]  {dir_by_Steven Spielberg}  => {High rev. ($100-500M)} 0.009   0.6923077 
## [6]  {dir_by_Stanley Kubrick}   => {Decent rev. ($10-50M)} 0.006   0.6666667 
## [7]  {dir_by_Christopher Nolan} => {High rev. ($100-500M)} 0.005   0.6250000 
## [8]  {dir_by_Christopher Nolan} => {Action}                0.005   0.6250000 
## [9]  {dir_by_Steven Spielberg}  => {500k-1mln votes}       0.006   0.4615385 
## [10] {dir_by_Steven Spielberg}  => {80s}                   0.005   0.3846154 
##      coverage lift      count
## [1]  0.008    20.833333 7    
## [2]  0.009     2.691273 7    
## [3]  0.009     1.074279 7    
## [4]  0.008     1.716247 6    
## [5]  0.013     3.889369 9    
## [6]  0.009     1.745201 6    
## [7]  0.008     3.511236 5    
## [8]  0.008     3.306878 5    
## [9]  0.013     3.273322 6    
## [10] 0.013     4.321521 5

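From our rules we can see that Nolan practically guarantees plenty of community votes (7 of his 8 movies here have over 1 million votes), and Kubrick’s movies are mostly rated 8-8.5. We can also see the directors’ favorite genres: “Drama” for Kubrick and “Action” for Nolan. No rule for Hitchcock appears in the output, and each rule is based on only 5 to 9 movies, so these results should be read with caution.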

Rules for top actors

##      lhs                    rhs                     support confidence coverage
## [1]  {Al Pacino}         => {Drama}                 0.013   1.0000000  0.013   
## [2]  {Robert De Niro}    => {Drama}                 0.017   1.0000000  0.017   
## [3]  {Tom Hanks}         => {High rev. ($100-500M)} 0.012   0.8571429  0.014   
## [4]  {Al Pacino}         => {Crime}                 0.011   0.8461538  0.013   
## [5]  {Brad Pitt}         => {long movie}            0.010   0.8333333  0.012   
## [6]  {Leonardo DiCaprio} => {long movie}            0.009   0.8181818  0.011   
## [7]  {Leonardo DiCaprio} => {Drama}                 0.009   0.8181818  0.011   
## [8]  {Christian Bale}    => {Drama}                 0.009   0.8181818  0.011   
## [9]  {Al Pacino}         => {long movie}            0.010   0.7692308  0.013   
## [10] {Brad Pitt}         => {Drama}                 0.009   0.7500000  0.012   
## [11] {Leonardo DiCaprio} => {High rev. ($100-500M)} 0.008   0.7272727  0.011   
## [12] {Christian Bale}    => {long movie}            0.008   0.7272727  0.011   
## [13] {Robert De Niro}    => {Crime}                 0.012   0.7058824  0.017   
## [14] {Clint Eastwood}    => {long movie}            0.008   0.6666667  0.012   
## [15] {Tom Hanks}         => {Drama}                 0.009   0.6428571  0.014   
## [16] {Robert De Niro}    => {long movie}            0.010   0.5882353  0.017   
## [17] {Tom Hanks}         => {long movie}            0.008   0.5714286  0.014   
## [18] {Al Pacino}         => {Decent rev. ($10-50M)} 0.007   0.5384615  0.013   
## [19] {Tom Hanks}         => {90s}                   0.007   0.5000000  0.014   
## [20] {Tom Hanks}         => {Adventure}             0.007   0.5000000  0.014   
## [21] {Robert De Niro}    => {8-8.5 rating}          0.007   0.4117647  0.017   
## [22] {Robert De Niro}    => {Decent rev. ($10-50M)} 0.007   0.4117647  0.017   
##      lift      count
## [1]  1.3812155 13   
## [2]  1.3812155 17   
## [3]  4.8154093 12   
## [4]  4.0485830 11   
## [5]  1.9069413 10   
## [6]  1.8722696  9   
## [7]  1.1300854  9   
## [8]  1.1300854  9   
## [9]  1.7602535 10   
## [10] 1.0359116  9   
## [11] 4.0858018  8   
## [12] 1.6642397  8   
## [13] 3.3774275 12   
## [14] 1.5255530  8   
## [15] 0.8879242  9   
## [16] 1.3460762 10   
## [17] 1.3076169  8   
## [18] 1.4095852  7   
## [19] 3.1055901  7   
## [20] 2.5510204  7   
## [21] 1.4247914  7   
## [22] 1.0779181  7

From this part we also got nice insights! We see that Al Pacino and De Niro always play in dramas (in this dataset, the confidence equals 1). What’s more, 86% of the movies with Tom Hanks earned between $100 and $500 million.

Rules for best rated movies

##      lhs                         rhs            support confidence coverage     lift count
## [1]  {1mln+ votes,                                                                        
##       90s,                                                                                
##       High rev. ($100-500M)}  => {8.5-9 rating}   0.007  0.7777778    0.009 25.08961     7
## [2]  {1mln+ votes,                                                                        
##       Drama,                                                                              
##       High rev. ($100-500M)}  => {8.5-9 rating}   0.010  0.5882353    0.017 18.97533    10
## [3]  {1mln+ votes,                                                                        
##       Crime}                  => {8.5-9 rating}   0.007  0.5833333    0.012 18.81720     7
## [4]  {1mln+ votes,                                                                        
##       Action,                                                                             
##       High rev. ($100-500M)}  => {8.5-9 rating}   0.007  0.5833333    0.012 18.81720     7
## [5]  {1mln+ votes,                                                                        
##       Crime,                                                                              
##       Drama}                  => {8.5-9 rating}   0.007  0.5833333    0.012 18.81720     7
## [6]  {1mln+ votes,                                                                        
##       90s,                                                                                
##       Drama}                  => {8.5-9 rating}   0.008  0.5714286    0.014 18.43318     8
## [7]  {1mln+ votes,                                                                        
##       90s}                    => {8.5-9 rating}   0.009  0.5625000    0.016 18.14516     9
## [8]  {1mln+ votes,                                                                        
##       High rev. ($100-500M)}  => {8.5-9 rating}   0.014  0.5185185    0.027 16.72640    14
## [9]  {1mln+ votes,                                                                        
##       Drama}                  => {8.5-9 rating}   0.015  0.5172414    0.029 16.68521    15
## [10] {1mln+ votes,                                                                        
##       Adventure,                                                                          
##       High rev. ($100-500M)}  => {8.5-9 rating}   0.007  0.5000000    0.014 16.12903     7
## [11] {1mln+ votes,                                                                        
##       Drama,                                                                              
##       long movie}             => {8.5-9 rating}   0.010  0.5000000    0.020 16.12903    10
## [12] {1mln+ votes,                                                                        
##       Drama,                                                                              
##       High rev. ($100-500M),                                                              
##       long movie}             => {8.5-9 rating}   0.007  0.5000000    0.014 16.12903     7

Here we have the most advanced rules! We can see that a very high number of votes, often combined with high earnings, “predicts” a great rating from the community! The most common genres there are drama, crime, action and adventure.

Rules for animation movies

##     lhs            rhs                     support confidence coverage
## [1] {Animation} => {Adventure}             0.052   0.6341463  0.082   
## [2] {Animation} => {medium movie}          0.048   0.5853659  0.082   
## [3] {Animation} => {7.75-8 rating}         0.032   0.3902439  0.082   
## [4] {Animation} => {High rev. ($100-500M)} 0.030   0.3658537  0.082   
## [5] {Animation} => {Comedy}                0.028   0.3414634  0.082   
## [6] {Animation} => {2000s}                 0.026   0.3170732  0.082   
## [7] {Animation} => {feature movie}         0.026   0.3170732  0.082   
## [8] {Animation} => {7.5-7.75 rating}       0.025   0.3048780  0.082   
## [9] {Animation} => {Low rev. (<$10M)}      0.025   0.3048780  0.082   
##     lift      count
## [1] 3.2354405 52   
## [2] 2.8278544 48   
## [3] 0.9805123 32   
## [4] 2.0553576 30   
## [5] 1.4655082 28   
## [6] 1.3156563 26   
## [7] 1.0065815 26   
## [8] 1.0888502 25   
## [9] 0.9497759 25

For animated movies we can find really interesting insights! 63% of animated movies are also adventure movies! They tend to be shorter ones. In almost 37% of cases they earned over $100 million!

Summary

My project presented the use of the Apriori method for association rules. I looked for interesting insights and plenty of them were found! They can be found in the corresponding tabs. I think this project is my biggest success, and if I had more time I would dig even deeper!