Cezary Kuźmowicz
Watching movies is part of almost everyone's life. Over the decades, the way we consume them has changed dramatically. Originally, the only way to do it was to visit a cinema. Later, movies started to appear on TV. The sign of our times is the most flexible way of consuming film content: VOD platforms like Netflix, MAX, Amazon Prime Video or Disney Plus have allowed people from every part of the globe to watch almost any movie.
The topic of my research is association rule analysis performed on the 1000 best-rated movies on IMDB, the world's best-known platform for film enthusiasts. I'll create both more and less detailed rules to get as many insights as possible. The Apriori method will be used.
link to dataset: https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows
First, I need to install and load all the packages needed for the analysis.
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::recode() masks arules::recode()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
With that done, we can move on to loading the data.
My dataset comes from Kaggle and is in .csv format, which is very convenient to read in R.
Some variables present in the original dataset are not crucial from the perspective of association analysis, so I'll delete them.
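The report does not show the code for these two steps, so here is a minimal sketch of how they could look. The small CSV written to a temporary file only stands in for the Kaggle file, and the list of dropped columns is an assumption made for illustration, not the author's actual choice.

```r
library(dplyr)

# Stand-in for the Kaggle file so the sketch runs on its own;
# with the real data, point read.csv() at the downloaded CSV instead.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Series_Title,Released_Year,Runtime,Genre,IMDB_Rating,Overview",
  "\"The Shawshank Redemption\",1994,\"142 min\",\"Drama\",9.3,\"text\"",
  "\"The Godfather\",1972,\"175 min\",\"Crime, Drama\",9.2,\"text\""
), csv_path)

data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Hypothetical list of columns that are not useful as items;
# any_of() silently skips names that are not present.
drop_cols <- c("Poster_Link", "Series_Title", "Certificate", "Overview", "Meta_score")
data <- data |> select(-any_of(drop_cols))

names(data)
```

Using `any_of()` keeps the call from failing if one of the listed columns is missing from the file.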
The variables come in very different forms. To conduct the analysis properly, I will convert most of them to categorical data. I'll use dplyr notation because, thanks to the “Intro to R” course, I really like it :)
To get categorical periods, I have to convert the release year into numbers and then assign it to predefined intervals.
data <- data |>
  # Note: this filter runs before the conversion, so it only drops values that are already NA
  filter(!is.na(Released_Year)) |>
  # A non-numeric entry becomes NA here, which triggers the warning below
  mutate(Released_Year = as.numeric(Released_Year)) |>
  mutate(Released_Year = cut(Released_Year,
                             breaks = c(1920, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020),
                             labels = c("world_wars", "50s", "60s", "70s", "80s", "90s", "2000s", "2010s")))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Released_Year = as.numeric(Released_Year)`.
## Caused by warning:
## ! NAs introduced by coercion
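One detail of `cut()` is worth keeping in mind here: by default its intervals are open on the left and closed on the right. The sketch below applies the same breaks and labels to a few hand-picked years (the years are chosen only for illustration) to show where boundary values land.

```r
breaks <- c(1920, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020)
labels <- c("world_wars", "50s", "60s", "70s", "80s", "90s", "2000s", "2010s")

# Hand-picked years, including values that sit exactly on a break
years  <- c(1920, 1950, 1951, 1994, 2000, 2020)
decade <- cut(years, breaks = breaks, labels = labels)

# Intervals are (1920, 1950], (1950, 1960], ... so a round year such as 2000
# falls into the earlier bin, and 1920 itself is outside the first interval.
data.frame(years, decade)
```

If round years should belong to the decade they start, `right = FALSE` switches `cut()` to left-closed intervals; that would be a change to the binning used in this report, not a description of it.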
Now the information about the year of release is ready for further calculations.
In the original dataset, runtime is stored as “120 min”. I have to remove the “ min” part and then convert the rest into numbers. The final step is creating new length tags.
data <- data |>
  # Drop the trailing " min" (4 characters) and convert the rest to a number
  mutate(Runtime = substr(Runtime, 1, nchar(Runtime) - 4)) |>
  mutate(Runtime = as.numeric(Runtime)) |>
  # Bin the runtime in minutes into length tags
  mutate(Runtime = cut(Runtime, breaks = c(40, 60, 100, 120, 180, 500),
                       labels = c("short movie", "medium movie", "feature movie",
                                  "long movie", "extra long movie")))
The next step is gathering ratings into groups. IMDB ratings range from 1 to 10. As we analyze the 1000 best-rated movies, the minimal rating is “only” 7.5. The case here is very easy - just assigning to groups.
data <- data |>
  # Bin the rating into five groups
  mutate(IMDB_Rating = cut(IMDB_Rating,
                           breaks = c(7.5, 7.75, 8, 8.5, 9, 10),
                           labels = c("7.5-7.75 rating", "7.75-8 rating", "8-8.5 rating",
                                      "8.5-9 rating", "9+ rating")))
Every rating comes from many community votes. Here, I will gather them into intervals.
data <- data |>
  # Bin the number of votes; the last break is just a very large upper bound
  mutate(No_of_Votes = cut(No_of_Votes,
                           breaks = c(25000, 50000, 100000, 200000, 500000, 1000000, 100e6),
                           labels = c("25-50k votes", "50-100k votes", "100-200k votes", "200-500k votes",
                                      "500k-1mln votes", "1mln+ votes")))
The movie industry is a huge one, and with it comes huge money! An important feature of each movie is its earnings. Here, some modifications were needed. Where there was no information about earnings, I used the median gross revenue (it is more robust to outliers than the mean).
data <- data |>
  # Remove thousands separators and convert to a number
  mutate(Gross = as.numeric(gsub(",", "", Gross))) |>
  # Replace missing revenue with the median of the known values
  mutate(Gross = ifelse(is.na(Gross), median(Gross, na.rm = TRUE), Gross)) |>
  mutate(Gross = cut(Gross,
                     breaks = c(0, 10e6, 50e6, 100e6, 500e6, 9999999e6),
                     labels = c("Low rev. (<$10M)", "Decent rev. ($10-50M)",
                                "Moderate rev. ($50-100M)", "High rev. ($100-500M)",
                                "Blockbusters ($500M+)")))
As a result we've got 5 intervals of earnings.
Here, the situation is a little bit complicated. The four top stars of each movie are given in separate columns. To prepare the data for further analysis, I have to create a new variable with all actors. When it comes to directors, I will add the prefix “dir_by_” to every director to differentiate them from actors during rule evaluation.
data <- data |>
  # Collapse the four star columns into one comma-separated string
  mutate(Actors = paste(Star1, Star2, Star3, Star4, sep = ", ")) |>
  select(-Star1, -Star2, -Star3, -Star4) |>
  # Prefix directors so they cannot be confused with actors
  mutate(Director = paste0("dir_by_", Director))
For association rules to work, my data has to be in the transactions data type. In this step I will merge all the needed columns into one. In the next chapter I will extract it as transactions.
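data <- data |>
  # Paste all item columns into one string per movie...
  mutate(Transaction = paste(Genre, Director, Actors, Released_Year, Runtime,
                             IMDB_Rating, No_of_Votes, Gross, sep = ", ")) |>
  # ...and split it into a character vector of single items
  mutate(Transaction = strsplit(Transaction, ", "))
head(data$Transaction)
## [[1]]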
## [1] "Drama" "dir_by_Frank Darabont" "Tim Robbins"
## [4] "Morgan Freeman" "Bob Gunton" "William Sadler"
## [7] "90s" "long movie" "9+ rating"
## [10] "1mln+ votes" "Decent rev. ($10-50M)"
##
## [[2]]
## [1] "Crime" "Drama"
## [3] "dir_by_Francis Ford Coppola" "Marlon Brando"
## [5] "Al Pacino" "James Caan"
## [7] "Diane Keaton" "70s"
## [9] "long movie" "9+ rating"
## [11] "1mln+ votes" "High rev. ($100-500M)"
##
## [[3]]
## [1] "Action" "Crime"
## [3] "Drama" "dir_by_Christopher Nolan"
## [5] "Christian Bale" "Heath Ledger"
## [7] "Aaron Eckhart" "Michael Caine"
## [9] "2000s" "long movie"
## [11] "8.5-9 rating" "1mln+ votes"
## [13] "Blockbusters ($500M+)"
##
## [[4]]
## [1] "Crime" "Drama"
## [3] "dir_by_Francis Ford Coppola" "Al Pacino"
## [5] "Robert De Niro" "Robert Duvall"
## [7] "Diane Keaton" "70s"
## [9] "extra long movie" "8.5-9 rating"
## [11] "1mln+ votes" "Moderate rev. ($50-100M)"
##
## [[5]]
## [1] "Crime" "Drama" "dir_by_Sidney Lumet"
## [4] "Henry Fonda" "Lee J. Cobb" "Martin Balsam"
## [7] "John Fiedler" "50s" "medium movie"
## [10] "8.5-9 rating" "500k-1mln votes" "Low rev. (<$10M)"
##
## [[6]]
## [1] "Action" "Adventure" "Drama"
## [4] "dir_by_Peter Jackson" "Elijah Wood" "Viggo Mortensen"
## [7] "Ian McKellen" "Orlando Bloom" "2000s"
## [10] "extra long movie" "8.5-9 rating" "1mln+ votes"
## [13] "High rev. ($100-500M)"
We can see what the inside of this column looks like: a list with all the needed information about each movie.
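The paste-then-split step can be tried on a single hand-written row. This is only a sketch in base R with values copied from the fifth movie in the output above; it also shows one thing to watch for, namely that `paste()` turns a missing value into the text "NA", which would then become an item of its own.

```r
# Values copied from one movie in the printed list above
genre  <- "Crime, Drama"
actors <- "Henry Fonda, Lee J. Cobb, Martin Balsam, John Fiedler"

# Paste everything into one string, then split on the same separator
row   <- paste(genre, "dir_by_Sidney Lumet", actors, "50s", sep = ", ")
items <- strsplit(row, ", ")[[1]]

items
length(items)   # 2 genres + 1 director + 4 actors + 1 decade tag

# paste() converts a missing value to the literal string "NA"
paste("Drama", NA, sep = ", ")
```

Because the genre string uses the same ", " separator, splitting also breaks multi-genre entries into separate items, which is what the rules later rely on.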
some vis
As mentioned before, I'll extract the “Transaction” column and convert it into the transactions data type. This will allow me to continue my work.
## Warning in asMethod(object): removing duplicated items in transactions
## transactions as itemMatrix in sparse format with
## 1000 rows (elements/itemsets/transactions) and
## 3308 columns (items) and a density of 0.003789903
##
## most frequent items:
## Drama long movie 7.75-8 rating
## 724 437 398
## Decent rev. ($10-50M) Low rev. (<$10M) (Other)
## 382 321 10275
##
## element (itemset/transaction) length distribution:
## sizes
## 11 12 13
## 106 251 643
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 12.00 13.00 12.54 13.00 13.00
##
## includes extended item information - examples:
## labels
## 1 100-200k votes
## 2 1mln+ votes
## 3 200-500k votes
From the summary we can see that the most frequent items are “Drama”, “long movie” and “7.75-8 rating”. The first one has a huge advantage over the rest - over 70% of the best 1000 movies are dramas!
When it comes to sizes, each row contains from 11 to 13 items. Judging by the examples printed earlier, the difference comes from the number of genres listed (one to three): every movie there has one director, four stars and five category tags.
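The conversion itself is not shown in the report. A minimal sketch with arules could look like this, using three short hand-made item lists as a stand-in for `data$Transaction` (the object name `trans_demo` is made up for the example).

```r
library(arules)

# Toy item lists standing in for data$Transaction
items <- list(
  c("Drama", "90s", "long movie"),
  c("Crime", "Drama", "70s", "long movie"),
  c("Action", "Crime", "Drama", "2000s", "long movie")
)

# Coerce the list of character vectors to the transactions class
trans_demo <- as(items, "transactions")

summary(trans_demo)   # most frequent items, density and size distribution
size(trans_demo)      # number of distinct items in each transaction
```

On the real data the same coercion would be applied to the whole list column, and `summary()` would print a report like the one shown above.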
Now we can see what the information about each movie looks like.
## items
## [1] {1mln+ votes,
## 9+ rating,
## 90s,
## Bob Gunton,
## Decent rev. ($10-50M),
## dir_by_Frank Darabont,
## Drama,
## long movie,
## Morgan Freeman,
## Tim Robbins,
## William Sadler}
## [2] {1mln+ votes,
## 70s,
## 9+ rating,
## Al Pacino,
## Crime,
## Diane Keaton,
## dir_by_Francis Ford Coppola,
## Drama,
## High rev. ($100-500M),
## James Caan,
## long movie,
## Marlon Brando}
## [3] {1mln+ votes,
## 2000s,
## 8.5-9 rating,
## Aaron Eckhart,
## Action,
## Blockbusters ($500M+),
## Christian Bale,
## Crime,
## dir_by_Christopher Nolan,
## Drama,
## Heath Ledger,
## long movie,
## Michael Caine}
## [1] 1000
What's more, we got confirmation that our dataset still has 1000 observations.
most frequent items
## 2000s 2010s 200-500k votes 50-100k votes 25-50k votes
## 0.241 0.225 0.224 0.215 0.207
## 100-200k votes 500k-1mln votes 60s 50s 1mln+ votes
## 0.171 0.141 0.065 0.062 0.042
## 2000s 2010s 200-500k votes 50-100k votes 25-50k votes
## 241 225 224 215 207
## 100-200k votes 500k-1mln votes 60s 50s 1mln+ votes
## 171 141 65 62 42
To better present the most frequent items, I will use a dedicated plot.
We already know the TOP 3 frequent items. Further positions are taken by items like “Low rev.”, “2000s” or “Comedy”. We can see which characteristics most of the best movies share ;)
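Frequencies like the ones above can be computed with `itemFrequency()` and drawn with `itemFrequencyPlot()`, both from arules. The sketch below uses four toy transactions and an arbitrary `topN`; the exact calls and settings behind the report's figures are not shown, so treat these as assumptions.

```r
library(arules)

# Toy transactions, only for illustration
trans_demo <- as(list(
  c("Drama", "long movie"),
  c("Drama", "Crime", "long movie"),
  c("Comedy", "medium movie"),
  c("Drama", "Crime")
), "transactions")

# Share of transactions containing each item, and the raw counts
rel_freq <- itemFrequency(trans_demo, type = "relative")
abs_freq <- itemFrequency(trans_demo, type = "absolute")

sort(rel_freq, decreasing = TRUE)

# Bar chart of the most frequent items
itemFrequencyPlot(trans_demo, topN = 3, type = "absolute")
```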
The last thing to do before moving to the best part is checking sparsity. It refers to the proportion of empty (zero) values in a dataset, indicating how many items are missing or unused in transactions relative to the total possible items.
Most of the image is white, so the data is sparse. In a perfect world I should reduce the number of unique categories. I am aware of the problem, but I deliberately won't do it here.
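The sparsity can also be read as a number. The density printed in the transaction summary earlier (0.003789903) is the share of filled cells in the 1000 x 3308 item matrix, and it can be reproduced by hand from the size distribution in the same summary. This is just a worked check in base R.

```r
# Size distribution from the summary: 106 rows with 11 items, 251 with 12, 643 with 13
sizes <- c(`11` = 106, `12` = 251, `13` = 643)

# Total number of filled cells in the item matrix
filled_cells <- sum(as.numeric(names(sizes)) * sizes)

# Density = filled cells / (transactions * distinct items)
density <- filled_cells / (1000 * 3308)

filled_cells
density
1 - density   # share of empty cells, i.e. the sparsity
```

So fewer than 0.4% of the cells are filled, which matches the mostly white image.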
To begin with, I'll create general rules, without filtering. I've decided to go with support = 0.05 and confidence = 0.2. Why? I've tested plenty of options, and this combination of parameters gave me the most useful output.
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 50
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[3308 item(s), 1000 transaction(s)] done [0.00s].
## sorting and recoding items ... [38 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [458 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 458 rules
As a result we got 458 rules. Before exploring them further, I will remove redundant rules. A rule is redundant if a more general rule with the same or higher confidence already exists, so it provides no new information.
## [1] 85
## set of 373 rules
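For reference, here is a small sketch of the mine-then-prune workflow with arules. It runs on six toy transactions with thresholds picked to suit such a tiny set, so the object names and parameter values are illustrative only; on the real data the call recorded in the mining info is `apriori(data = trans, parameter = list(support = 0.05, confidence = 0.2, minlen = 2))`.

```r
library(arules)

# Toy transactions, only for illustration
trans_demo <- as(list(
  c("Drama", "long movie", "Biography"),
  c("Drama", "long movie"),
  c("Drama", "long movie", "Biography"),
  c("Drama", "Crime"),
  c("Comedy", "medium movie"),
  c("Drama", "long movie", "Crime")
), "transactions")

# Mine rules; thresholds are higher than in the report because the toy set is tiny
rules_demo <- apriori(trans_demo,
                      parameter = list(support = 0.3, confidence = 0.5, minlen = 2),
                      control = list(verbose = FALSE))

# Flag rules for which a more general rule has the same or higher confidence
redundant <- is.redundant(rules_demo)
sum(redundant)

# Keep only the non-redundant rules and look at the strongest ones by lift
rules_pruned <- rules_demo[!redundant]
inspect(head(sort(rules_pruned, by = "lift"), 5))
```

The same `sort(..., by = ...)` call with `"support"`, `"confidence"` or `"coverage"` gives rankings by the other measures.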
With that done, we can dive into them!
First, I'll start with a summary of our rules.
## set of 373 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 272 101
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.271 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.0500 Min. :0.2000 Min. :0.0560 Min. :0.4581
## 1st Qu.:0.0580 1st Qu.:0.2653 1st Qu.:0.1610 1st Qu.:1.0127
## Median :0.0730 Median :0.3550 Median :0.2150 Median :1.1135
## Mean :0.0866 Mean :0.4115 Mean :0.2398 Mean :1.1817
## 3rd Qu.:0.0960 3rd Qu.:0.4771 3rd Qu.:0.2890 3rd Qu.:1.2604
## Max. :0.3490 Max. :0.9701 Max. :0.7240 Max. :3.2354
## count
## Min. : 50.0
## 1st Qu.: 58.0
## Median : 73.0
## Mean : 86.6
## 3rd Qu.: 96.0
## Max. :349.0
##
## mining info:
## data ntransactions support confidence
## trans 1000 0.05 0.2
## call
## apriori(data = trans, parameter = list(support = 0.05, confidence = 0.2, minlen = 2))
From the summary we can see that the statistics of our rules aren't high. Mean support equals 8.7%, while mean confidence is 41%. Coverage, the share of transactions that contain the LHS, averages 0.24, so a typical rule applies to roughly a quarter of the movies (coverage is a proportion, so it cannot exceed 1). With lift the situation is a little bit better - the mean and median values are above 1, which indicates a positive association between LHS and RHS, although at 1.18 and 1.11 it is mostly a weak one.
Best rules by each measure
## [1] "TOP 5 rules regarding support"
## lhs rhs support confidence
## [1] {long movie} => {Drama} 0.349 0.7986270
## [2] {Drama} => {long movie} 0.349 0.4820442
## [3] {Decent rev. ($10-50M)} => {Drama} 0.290 0.7591623
## [4] {Drama} => {Decent rev. ($10-50M)} 0.290 0.4005525
## [5] {7.75-8 rating} => {Drama} 0.283 0.7110553
## coverage lift count
## [1] 0.437 1.1030760 349
## [2] 0.724 1.1030760 349
## [3] 0.382 1.0485667 290
## [4] 0.724 1.0485667 290
## [5] 0.398 0.9821205 283
## [1] "TOP 5 rules regarding confidence"
## lhs rhs support confidence coverage
## [1] {Biography, long movie} => {Drama} 0.065 0.9701493 0.067
## [2] {History} => {Drama} 0.054 0.9642857 0.056
## [3] {2000s, Low rev. (<$10M)} => {Drama} 0.080 0.9523810 0.084
## [4] {Biography} => {Drama} 0.103 0.9449541 0.109
## [5] {100-200k votes, long movie} => {Drama} 0.058 0.9062500 0.064
## lift count
## [1] 1.339985 65
## [2] 1.331886 54
## [3] 1.315443 80
## [4] 1.305185 103
## [5] 1.251727 58
## [1] "TOP 5 rules regarding lift"
## lhs rhs support confidence
## [1] {Adventure} => {Animation} 0.052 0.2653061
## [2] {Animation} => {Adventure} 0.052 0.6341463
## [3] {High rev. ($100-500M)} => {500k-1mln votes} 0.077 0.4325843
## [4] {500k-1mln votes} => {High rev. ($100-500M)} 0.077 0.5460993
## [5] {Adventure, long movie} => {Action} 0.050 0.5681818
## coverage lift count
## [1] 0.196 3.235441 52
## [2] 0.082 3.235441 52
## [3] 0.178 3.067974 77
## [4] 0.141 3.067974 77
## [5] 0.088 3.006253 50
## [1] "TOP 5 rules regarding coverage"
## lhs rhs support confidence coverage lift count
## [1] {Drama} => {25-50k votes} 0.167 0.2306630 0.724 1.1143139 167
## [2] {Drama} => {Crime} 0.160 0.2209945 0.724 1.0573898 160
## [3] {Drama} => {2010s} 0.171 0.2361878 0.724 1.0497238 171
## [4] {Drama} => {50-100k votes} 0.176 0.2430939 0.724 1.1306694 176
## [5] {Drama} => {200-500k votes} 0.158 0.2182320 0.724 0.9742502 158
One thing I would like to add first - some rules show up in pairs with LHS and RHS swapped (A => B and B => A). The redundancy filter does not remove them, because neither rule is a more general version of the other. That's why they look doubled.
I'll start with the best rules regarding support. Support measures how often an item set appears in the dataset. Here the winner is the pair “long movie” and “Drama”, two extremely frequent items. The second pair is “Decent rev.” and again “Drama”. For example, from the first rule we can read that if a movie is “long”, in almost 80% of cases it is a drama! Those results are a little bit biased by the extremely high frequency of “Drama” in the dataset: a lift of 1.10 means this is only slightly more than the 72% we would expect anyway.
Next, confidence! It measures how often the RHS occurs given that the LHS is present. Here, we have some nice rules. For example, if a movie is of the “Biography” genre and is “long”, in 97% of cases it is also a “Drama”! I won't interpret the other ones because they once again point to “Drama”.
It's time for lift. It measures how much more likely the LHS and RHS are to occur together than they would be if they were independent. Values well above 1 indicate a strong association. Here, the strongest rule is the pair of “Adventure” and “Animation” genres. Their lift is over 3, so they appear together about three times more often than independence would suggest.
Last but not least, coverage. It shows how often the LHS appears in transactions, i.e. the rule's applicability in the dataset. As expected, all of the top rules have “Drama” as the LHS. Ehh….
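All four measures can be recomputed by hand for one rule. The worked check below takes {long movie} => {Drama} and uses only counts that appear in the outputs above: 1000 movies, 437 long movies, 724 dramas and 349 movies that are both.

```r
n          <- 1000   # transactions
n_lhs      <- 437    # movies tagged "long movie"
n_rhs      <- 724    # movies tagged "Drama"
n_both     <- 349    # movies tagged with both

support    <- n_both / n                    # P(LHS and RHS)
coverage   <- n_lhs / n                     # P(LHS)
confidence <- n_both / n_lhs                # P(RHS | LHS)
lift       <- confidence / (n_rhs / n)      # confidence relative to P(RHS)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

The values agree with the first row of the support table, and they show why a confidence of almost 80% is less impressive than it looks: 72.4% of all movies are dramas to begin with.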
I'll show a plot of the three key measures. The x-axis shows support and the y-axis confidence, while the color shading represents lift.
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
From the plot we can see that most of our rules have low confidence and support. That isn't a good sign.
The last chart in this section presents the distribution of confidence of our rules.
We can observe that most rules have confidence up to 0.4, which agrees with the previous summary. The interesting part is the rules with confidence over 0.8 - we will dive into them!
In this part I will create rules for specific items. Additional visualisations will also be provided. In each section I will use different support and confidence parameters. I know it's kind of against the logic, but I want to present more than 1 or 2 rules :)
## lhs rhs support confidence coverage lift count
## [1] {500k-1mln votes,
## Adventure} => {High rev. ($100-500M)} 0.043 0.7962963 0.054 4.473575 43
## [2] {500k-1mln votes,
## long movie} => {High rev. ($100-500M)} 0.041 0.5774648 0.071 3.244184 41
## [3] {500k-1mln votes} => {High rev. ($100-500M)} 0.077 0.5460993 0.141 3.067974 77
## [4] {Action,
## Adventure} => {High rev. ($100-500M)} 0.044 0.5301205 0.083 2.978205 44
## [5] {Adventure,
## long movie} => {High rev. ($100-500M)} 0.046 0.5227273 0.088 2.936670 46
## [6] {Adventure} => {High rev. ($100-500M)} 0.083 0.4234694 0.196 2.379042 83
First interesting result! From our rules we can see that the “Adventure” genre is the most influential genre for film revenues: it appears in four of the six rules, while the remaining two rely on a high number of votes. So here comes a tip for directors - if you want a lot of money, make adventure films ;)
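Rules with a fixed right-hand side, like the revenue rules above, can be mined with the `appearance` argument of `apriori()`. The sketch below runs on toy transactions with a shortened item label and made-up thresholds; the parameter values actually used for this section are not stated in the report.

```r
library(arules)

# Toy transactions, only for illustration ("High rev." is a shortened label)
trans_demo <- as(list(
  c("Adventure", "Action", "High rev."),
  c("Adventure", "High rev."),
  c("Drama", "Low rev."),
  c("Adventure", "Drama", "Low rev."),
  c("Drama", "Crime", "Low rev."),
  c("Action", "Adventure", "High rev.")
), "transactions")

# Allow only "High rev." on the right-hand side; every other item may appear on the left
rules_rev <- apriori(trans_demo,
                     parameter = list(support = 0.3, confidence = 0.5, minlen = 2),
                     appearance = list(rhs = "High rev.", default = "lhs"),
                     control = list(verbose = FALSE))

inspect(sort(rules_rev, by = "confidence"))
```

For the director and actor sections the restriction would go the other way round, for example `appearance = list(lhs = c(...), default = "rhs")` with the chosen names on the left-hand side.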
I've chosen four directors - my favorite ones: Hitchcock, Spielberg, Kubrick and Nolan. I'll create rules for them.
## lhs rhs support confidence
## [1] {dir_by_Christopher Nolan} => {1mln+ votes} 0.007 0.8750000
## [2] {dir_by_Stanley Kubrick} => {8-8.5 rating} 0.007 0.7777778
## [3] {dir_by_Stanley Kubrick} => {Drama} 0.007 0.7777778
## [4] {dir_by_Christopher Nolan} => {long movie} 0.006 0.7500000
## [5] {dir_by_Steven Spielberg} => {High rev. ($100-500M)} 0.009 0.6923077
## [6] {dir_by_Stanley Kubrick} => {Decent rev. ($10-50M)} 0.006 0.6666667
## [7] {dir_by_Christopher Nolan} => {High rev. ($100-500M)} 0.005 0.6250000
## [8] {dir_by_Christopher Nolan} => {Action} 0.005 0.6250000
## [9] {dir_by_Steven Spielberg} => {500k-1mln votes} 0.006 0.4615385
## [10] {dir_by_Steven Spielberg} => {80s} 0.005 0.3846154
## coverage lift count
## [1] 0.008 20.833333 7
## [2] 0.009 2.691273 7
## [3] 0.009 1.074279 7
## [4] 0.008 1.716247 6
## [5] 0.013 3.889369 9
## [6] 0.009 1.745201 6
## [7] 0.008 3.511236 5
## [8] 0.008 3.306878 5
## [9] 0.013 3.273322 6
## [10] 0.013 4.321521 5
From our rules we can see that Nolan practically guarantees plenty of community votes (his films are popular): 7 of his 8 movies in the list have over 1 million votes. Kubrick's movies mostly land in the 8-8.5 rating group (78%). We also see the directors' favorite genres: “Drama” for Kubrick and “Action” for Nolan.
## lhs rhs support confidence coverage
## [1] {Al Pacino} => {Drama} 0.013 1.0000000 0.013
## [2] {Robert De Niro} => {Drama} 0.017 1.0000000 0.017
## [3] {Tom Hanks} => {High rev. ($100-500M)} 0.012 0.8571429 0.014
## [4] {Al Pacino} => {Crime} 0.011 0.8461538 0.013
## [5] {Brad Pitt} => {long movie} 0.010 0.8333333 0.012
## [6] {Leonardo DiCaprio} => {long movie} 0.009 0.8181818 0.011
## [7] {Leonardo DiCaprio} => {Drama} 0.009 0.8181818 0.011
## [8] {Christian Bale} => {Drama} 0.009 0.8181818 0.011
## [9] {Al Pacino} => {long movie} 0.010 0.7692308 0.013
## [10] {Brad Pitt} => {Drama} 0.009 0.7500000 0.012
## [11] {Leonardo DiCaprio} => {High rev. ($100-500M)} 0.008 0.7272727 0.011
## [12] {Christian Bale} => {long movie} 0.008 0.7272727 0.011
## [13] {Robert De Niro} => {Crime} 0.012 0.7058824 0.017
## [14] {Clint Eastwood} => {long movie} 0.008 0.6666667 0.012
## [15] {Tom Hanks} => {Drama} 0.009 0.6428571 0.014
## [16] {Robert De Niro} => {long movie} 0.010 0.5882353 0.017
## [17] {Tom Hanks} => {long movie} 0.008 0.5714286 0.014
## [18] {Al Pacino} => {Decent rev. ($10-50M)} 0.007 0.5384615 0.013
## [19] {Tom Hanks} => {90s} 0.007 0.5000000 0.014
## [20] {Tom Hanks} => {Adventure} 0.007 0.5000000 0.014
## [21] {Robert De Niro} => {8-8.5 rating} 0.007 0.4117647 0.017
## [22] {Robert De Niro} => {Decent rev. ($10-50M)} 0.007 0.4117647 0.017
## lift count
## [1] 1.3812155 13
## [2] 1.3812155 17
## [3] 4.8154093 12
## [4] 4.0485830 11
## [5] 1.9069413 10
## [6] 1.8722696 9
## [7] 1.1300854 9
## [8] 1.1300854 9
## [9] 1.7602535 10
## [10] 1.0359116 9
## [11] 4.0858018 8
## [12] 1.6642397 8
## [13] 3.3774275 12
## [14] 1.5255530 8
## [15] 0.8879242 9
## [16] 1.3460762 10
## [17] 1.3076169 8
## [18] 1.4095852 7
## [19] 3.1055901 7
## [20] 2.5510204 7
## [21] 1.4247914 7
## [22] 1.0779181 7
This part also gave us nice insights! We see that in this dataset Al Pacino and Robert De Niro appear only in dramas (confidence of 1). What's more, 86% of Tom Hanks's movies in the list fall into the $100-500M revenue group.
## lhs rhs support confidence coverage lift count
## [1] {1mln+ votes,
## 90s,
## High rev. ($100-500M)} => {8.5-9 rating} 0.007 0.7777778 0.009 25.08961 7
## [2] {1mln+ votes,
## Drama,
## High rev. ($100-500M)} => {8.5-9 rating} 0.010 0.5882353 0.017 18.97533 10
## [3] {1mln+ votes,
## Crime} => {8.5-9 rating} 0.007 0.5833333 0.012 18.81720 7
## [4] {1mln+ votes,
## Action,
## High rev. ($100-500M)} => {8.5-9 rating} 0.007 0.5833333 0.012 18.81720 7
## [5] {1mln+ votes,
## Crime,
## Drama} => {8.5-9 rating} 0.007 0.5833333 0.012 18.81720 7
## [6] {1mln+ votes,
## 90s,
## Drama} => {8.5-9 rating} 0.008 0.5714286 0.014 18.43318 8
## [7] {1mln+ votes,
## 90s} => {8.5-9 rating} 0.009 0.5625000 0.016 18.14516 9
## [8] {1mln+ votes,
## High rev. ($100-500M)} => {8.5-9 rating} 0.014 0.5185185 0.027 16.72640 14
## [9] {1mln+ votes,
## Drama} => {8.5-9 rating} 0.015 0.5172414 0.029 16.68521 15
## [10] {1mln+ votes,
## Adventure,
## High rev. ($100-500M)} => {8.5-9 rating} 0.007 0.5000000 0.014 16.12903 7
## [11] {1mln+ votes,
## Drama,
## long movie} => {8.5-9 rating} 0.010 0.5000000 0.020 16.12903 10
## [12] {1mln+ votes,
## Drama,
## High rev. ($100-500M),
## long movie} => {8.5-9 rating} 0.007 0.5000000 0.014 16.12903 7
Here we have the most advanced rules! We can see that a high number of votes and high earnings “predict” a great rating from the community! The most common genres there are drama, crime, action and adventure. Keep in mind that each of these rules is based on only 7 to 15 movies.
## lhs rhs support confidence coverage
## [1] {Animation} => {Adventure} 0.052 0.6341463 0.082
## [2] {Animation} => {medium movie} 0.048 0.5853659 0.082
## [3] {Animation} => {7.75-8 rating} 0.032 0.3902439 0.082
## [4] {Animation} => {High rev. ($100-500M)} 0.030 0.3658537 0.082
## [5] {Animation} => {Comedy} 0.028 0.3414634 0.082
## [6] {Animation} => {2000s} 0.026 0.3170732 0.082
## [7] {Animation} => {feature movie} 0.026 0.3170732 0.082
## [8] {Animation} => {7.5-7.75 rating} 0.025 0.3048780 0.082
## [9] {Animation} => {Low rev. (<$10M)} 0.025 0.3048780 0.082
## lift count
## [1] 3.2354405 52
## [2] 2.8278544 48
## [3] 0.9805123 32
## [4] 2.0553576 30
## [5] 1.4655082 28
## [6] 1.3156563 26
## [7] 1.0065815 26
## [8] 1.0888502 25
## [9] 0.9497759 25
For animated movies we can find really interesting insights! 63% of animated movies are also adventure movies! They tend to be shorter: 59% are of “medium movie” length. Almost 37% of them fall into the $100-500M revenue group!
My project presented the use of the Apriori method for mining association rules. I've looked for interesting insights and plenty of them were found! They can be found in the corresponding tab. I think this project is my biggest success, and if I had more time I would dive even deeper!