This report analyzes a movie dataset using exploratory data analysis and association rule mining. The dataset contains columns such as title, director, genres, rating (vote average), and budget. The analysis examines associations between genres.
The data was obtained from Kaggle: https://www.kaggle.com/datasets/karthiknamboori1/movie-datasets?resource=download
The goal is to discover associations between movies. The genres column is used to extract information about genres that occur together in the same movie. The extracted association rules show which genres tend to co-occur, with metrics such as support (frequency), confidence (likelihood of co-occurrence), and lift (strength of association).
library(arulesViz)
library(ggpubr)
library(plotly)
library(tidyr)
library(arules)
library(ggplot2)
library(readr)
library(dplyr)
library(rCBA)
movies <- read_csv("movie_dataset.csv")
head(movies)
## # A tibble: 6 × 12
## budget genres id original_language popularity release_date revenue runtime
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
## 1 2.37e8 Actio… 19995 en 150. 10-12-2009 2.79e9 162
## 2 3 e8 Adven… 285 en 139. 19-05-2007 9.61e8 169
## 3 2.45e8 Actio… 206647 en 107. 26-10-2015 8.81e8 148
## 4 2.5 e8 Actio… 49026 en 112. 16-07-2012 1.08e9 165
## 5 2.6 e8 Actio… 49529 en 43.9 07-03-2012 2.84e8 132
## 6 2.58e8 Fanta… 559 en 116. 01-05-2007 8.91e8 139
## # ℹ 4 more variables: title <chr>, vote_average <dbl>, vote_count <dbl>,
## # director <chr>
str(movies)
## spc_tbl_ [4,041 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ budget : num [1:4041] 2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.60e+08 2.58e+08 2.60e+08 2.80e+08 2.50e+08 2.50e+08 ...
## $ genres : chr [1:4041] "Action Adventure Fantasy Science-Fiction" "Adventure Fantasy Action" "Action Adventure Crime" "Action Crime Drama Thriller" ...
## $ id : num [1:4041] 19995 285 206647 49026 49529 ...
## $ original_language: chr [1:4041] "en" "en" "en" "en" ...
## $ popularity : num [1:4041] 150.4 139.1 107.4 112.3 43.9 ...
## $ release_date : chr [1:4041] "10-12-2009" "19-05-2007" "26-10-2015" "16-07-2012" ...
## $ revenue : num [1:4041] 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
## $ runtime : num [1:4041] 162 169 148 165 132 139 100 141 153 151 ...
## $ title : chr [1:4041] "Avatar" "Pirates of the Caribbean: At World's End" "Spectre" "The Dark Knight Rises" ...
## $ vote_average : num [1:4041] 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
## $ vote_count : num [1:4041] 11800 4500 4466 9106 2124 ...
## $ director : chr [1:4041] "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
## - attr(*, "spec")=
## .. cols(
## .. budget = col_double(),
## .. genres = col_character(),
## .. id = col_double(),
## .. original_language = col_character(),
## .. popularity = col_double(),
## .. release_date = col_character(),
## .. revenue = col_double(),
## .. runtime = col_double(),
## .. title = col_character(),
## .. vote_average = col_double(),
## .. vote_count = col_double(),
## .. director = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
ggplot(movies, aes(x = vote_average)) +
geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
labs(title = "Distribution of Vote Averages", x = "Vote Average", y = "Frequency") +
theme_minimal()
ggplot(movies, aes(x = runtime)) +
geom_histogram(binwidth = 10, fill = "orange", color = "black") +
labs(title = "Distribution of Runtime", x = "Runtime (minutes)", y = "Frequency") +
theme_minimal()
ggplot(movies, aes(x = popularity)) +
geom_histogram(binwidth = 5, fill = "green", color = "black") +
labs(title = "Distribution of Popularity", x = "Popularity", y = "Frequency") +
theme_minimal()
# Split each movie's space-separated genres and count occurrences per genre
genres_count <- movies %>%
  filter(!is.na(genres)) %>%
  mutate(genres_split = strsplit(as.character(genres), " ")) %>%
  unnest(genres_split) %>%
  count(genres_split, sort = TRUE) %>%
  slice_max(n, n = 10)
ggplot(genres_count, aes(x = reorder(genres_split, n), y = n)) +
geom_bar(stat = "identity", fill = "purple") +
coord_flip() +
labs(title = "Top 10 Genres", x = "Genres", y = "Frequency") +
theme_minimal()
The following parameters are used in this application of the Apriori algorithm: supp = 0.03, conf = 0.6.
Missing values (NA) are first removed from the genres column. The genres column is then converted into a list in which each movie’s genres are split into individual items. Finally, the list is transformed into the transactions format, which is required to apply the Apriori algorithm.
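The preprocessing steps described above can be sketched on a small made-up vector. This is only an illustration in base R: the `genres` values, `genre_list`, and `incidence` are hypothetical names and data, not taken from the dataset.

```r
# Toy stand-in for the genres column (made-up values, not from the dataset).
genres <- c("Action Adventure", NA, "Drama Romance", "Action Thriller")

# 1. Drop missing values.
genres <- genres[!is.na(genres)]

# 2. Split each space-separated string into a character vector of genres.
genre_list <- strsplit(genres, " ", fixed = TRUE)

# 3. Build a logical movie-by-genre incidence matrix: one row per movie,
#    one column per genre, TRUE where the movie carries that genre.
items <- sort(unique(unlist(genre_list)))
incidence <- t(vapply(genre_list, function(g) items %in% g, logical(length(items))))
colnames(incidence) <- items
print(incidence)
```

A transactions object stores the same movie-by-genre information in sparse form, which is why `as(genre_list, "transactions")` can be applied directly to such a list.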
Support measures how frequently an item or itemset appears in the dataset. A support threshold of 0.03 (3%) means that only itemsets that appear in at least 3% of all transactions will be considered.
Confidence measures the likelihood that when item X appears, item Y also appears. A confidence threshold of 0.6 (60%) means that the algorithm will only return rules where at least 60% of transactions containing X also contain Y.
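To make these definitions concrete, the three measures can be computed by hand on a handful of transactions. The sketch below uses five made-up genre sets and a hypothetical helper `support()`; it does not use the movie data or the arules package.

```r
# Five made-up transactions (hypothetical genre sets, not from the dataset).
trans <- list(
  c("Action", "Adventure"),
  c("Action", "Thriller"),
  c("Adventure", "Action", "Fantasy"),
  c("Drama"),
  c("Drama", "Romance")
)

# Fraction of transactions that contain every item in `itemset`.
support <- function(itemset, trans) {
  mean(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

lhs <- "Adventure"
rhs <- "Action"

supp_rule <- support(c(lhs, rhs), trans)      # P(lhs and rhs)
conf_rule <- supp_rule / support(lhs, trans)  # P(rhs | lhs)
lift_rule <- conf_rule / support(rhs, trans)  # P(rhs | lhs) / P(rhs)

cat("support:", supp_rule, "confidence:", conf_rule, "lift:", lift_rule, "\n")
```

In this toy example both Adventure movies are also Action movies, so the confidence is 1, and the lift is above 1 because Action appears in only three of the five transactions.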
# Remove rows with missing genres
movies <- movies %>% filter(!is.na(genres))
# Split each genre string on spaces; genres becomes a list column
movies$genres <- strsplit(as.character(movies$genres), " ")
movies_trans <- as(movies$genres, "transactions")
# Apply the Apriori algorithm with the parameters described above
rules <- apriori(movies_trans, parameter = list(supp = 0.03, conf = 0.6))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.03 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 120
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 8 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 5 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 2.000 2.375 3.000 3.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.03514 Min. :0.6008 Min. :0.04137 Min. :1.414
## 1st Qu.:0.03862 1st Qu.:0.6558 1st Qu.:0.04747 1st Qu.:2.238
## Median :0.04186 Median :0.6939 Median :0.05632 Median :2.434
## Mean :0.05887 Mean :0.7410 Mean :0.08510 Mean :3.101
## 3rd Qu.:0.06317 3rd Qu.:0.8548 3rd Qu.:0.09961 3rd Qu.:3.003
## Max. :0.12036 Max. :0.9000 Max. :0.18216 Max. :7.918
## count
## Min. :141.0
## 1st Qu.:155.0
## Median :168.0
## Mean :236.2
## 3rd Qu.:253.5
## Max. :483.0
##
## mining info:
## data ntransactions support confidence
## movies_trans 4013 0.03 0.6
## call
## apriori(data = movies_trans, parameter = list(supp = 0.03, conf = 0.6))
inspect(head(sort(rules, by="lift")))
## lhs rhs support confidence coverage
## [1] {Animation} => {Family} 0.04335908 0.8405797 0.05158236
## [2] {Adventure, Thriller} => {Action} 0.04036880 0.9000000 0.04485422
## [3] {Adventure, Science-Fiction} => {Action} 0.03513581 0.7268041 0.04834289
## [4] {Mystery} => {Thriller} 0.04859208 0.6610169 0.07351109
## [5] {Action, Crime} => {Thriller} 0.03912285 0.6408163 0.06105158
## [6] {Adventure} => {Action} 0.10690257 0.6008403 0.17792175
## lift count
## [1] 7.918419 174
## [2] 3.509913 162
## [3] 2.834465 141
## [4] 2.472191 195
## [5] 2.396641 157
## [6] 2.343219 429
inspect(sort(rules, by = "support", decreasing = TRUE)[1:5])
## lhs rhs support confidence coverage
## [1] {Romance} => {Drama} 0.12035883 0.6607387 0.18215799
## [2] {Adventure} => {Action} 0.10690257 0.6008403 0.17792175
## [3] {Mystery} => {Thriller} 0.04859208 0.6610169 0.07351109
## [4] {Animation} => {Family} 0.04335908 0.8405797 0.05158236
## [5] {Adventure, Thriller} => {Action} 0.04036880 0.9000000 0.04485422
## lift count
## [1] 1.414157 483
## [2] 2.343219 429
## [3] 2.472191 195
## [4] 7.918419 174
## [5] 3.509913 162
The graph-based visualization helps in understanding the relationships between genres.
#plot the graph
plot(rules, method="graph", engine="htmlwidget")
The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.
#scatter plot of association rules
plot(rules, measure=c("support", "confidence"),
shading="lift", engine="plotly")
For the second run I lowered the support threshold to cover more itemsets. Support measures how frequently an item or itemset appears in the dataset. A support threshold of 0.006 (0.6%) means that only itemsets that appear in at least 0.6% of all transactions will be considered.
I also lowered the confidence threshold. Confidence measures the likelihood that when item X appears, item Y also appears. A confidence threshold of 0.25 (25%) means that the algorithm will only return rules where at least 25% of transactions containing X also contain Y.
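As a quick sanity check, the relative thresholds can be turned into movie counts. The snippet assumes the 4013 transactions reported in the Apriori log; the variable names are illustrative only.

```r
n_transactions <- 4013   # number of transactions reported in the Apriori log
thresholds <- c(first_run = 0.03, second_run = 0.006)

# Thresholds expressed as percentages of all transactions.
print(100 * thresholds)

# Number of movies an itemset must appear in, before any rounding.
min_counts <- thresholds * n_transactions
print(min_counts)
```

The products are about 120.4 and 24.1, which is consistent with the "Absolute minimum support count" lines of 120 and 24 printed in the two Apriori logs.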
rules2 <- apriori(movies_trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.25 0.1 1 none FALSE TRUE 5 0.006 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 24
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules2)
## set of 176 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4
## 51 106 19
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.000 3.000 2.818 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.006230 Min. :0.2507 Min. :0.008473 Min. : 0.5949
## 1st Qu.:0.009469 1st Qu.:0.3480 1st Qu.:0.021119 1st Qu.: 1.4190
## Median :0.016197 Median :0.4545 Median :0.041116 Median : 2.0051
## Mean :0.027459 Mean :0.4795 Mean :0.065970 Mean : 2.5955
## 3rd Qu.:0.035510 3rd Qu.:0.5833 3rd Qu.:0.080738 3rd Qu.: 2.7592
## Max. :0.121356 Max. :0.9123 Max. :0.467232 Max. :11.4651
## count
## Min. : 25.0
## 1st Qu.: 38.0
## Median : 65.0
## Mean :110.2
## 3rd Qu.:142.5
## Max. :487.0
##
## mining info:
## data ntransactions support confidence
## movies_trans 4013 0.006 0.25
## call
## apriori(data = movies_trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
inspect(head(sort(rules2, by="lift")))
## lhs rhs support confidence coverage
## [1] {Adventure, Comedy, Family} => {Animation} 0.01370546 0.5913978 0.02317468
## [2] {Drama, War} => {History} 0.01071518 0.4018692 0.02666334
## [3] {Adventure, Family} => {Animation} 0.02192873 0.4782609 0.04585098
## [4] {War} => {History} 0.01245951 0.3787879 0.03289310
## [5] {History} => {War} 0.01245951 0.3012048 0.04136556
## [6] {Drama, History} => {War} 0.01071518 0.2885906 0.03712933
## lift count
## [1] 11.465119 55
## [2] 9.715066 43
## [3] 9.271792 88
## [4] 9.157083 50
## [5] 9.157083 50
## [6] 8.773592 43
inspect(sort(rules2, by = "support", decreasing = TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {Thriller} => {Action} 0.1213556 0.4538677 0.2673810 1.7700398 487
## [2] {Action} => {Thriller} 0.1213556 0.4732750 0.2564166 1.7700398 487
## [3] {Romance} => {Drama} 0.1203588 0.6607387 0.1821580 1.4141570 483
## [4] {Drama} => {Romance} 0.1203588 0.2576000 0.4672315 1.4141570 483
## [5] {Thriller} => {Drama} 0.1131323 0.4231128 0.2673810 0.9055742 454
The graph-based visualization helps in understanding the relationships between genres.
plot(rules2, method="graph", engine="htmlwidget")
The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.
#scatter plot of association rules
plot(rules2, measure=c("support", "confidence"),
shading="lift", engine="plotly")
The Apriori algorithm discovers associations between movie genres that frequently appear together. The extracted association rules show which genres tend to co-occur, with metrics such as support (frequency), confidence (likelihood of co-occurrence), and lift (strength of association).
For example, the rule {Adventure} → {Action} has a lift of about 2.34, which means that Action appears about 2.3 times as often among Adventure movies as it does across the whole dataset.
ECLAT finds frequent itemsets (combinations of frequently co-occurring genres). Rule Induction extracts association rules (if a movie has Genre A, it’s likely to have Genre B).
Support measures how frequently an itemset appears in the dataset. Confidence measures the probability of one genre appearing given another, and lift tells us whether the association is stronger than chance.
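The way these measures relate to each other can be checked against numbers already printed in this report. The sketch below copies the values for {Romance} => {Drama} and {Drama} => {Romance} from the second Apriori run; the variable names are illustrative, and the checks hold only up to the rounding of the printed output.

```r
# Values copied from the printed table of the second Apriori run.
romance_drama <- c(support = 0.1203588, confidence = 0.6607387,
                   coverage = 0.1821580, lift = 1.4141570)
drama_romance <- c(support = 0.1203588, confidence = 0.2576000,
                   coverage = 0.4672315, lift = 1.4141570)

# support(X => Y) = confidence(X => Y) * coverage(X => Y),
# where coverage is the support of the left-hand side X.
supp_check <- romance_drama[["confidence"]] * romance_drama[["coverage"]]

# lift(X => Y) = confidence(X => Y) / support(Y). The support of {Drama}
# is the coverage of the reverse rule {Drama} => {Romance}.
lift_check <- romance_drama[["confidence"]] / drama_romance[["coverage"]]

cat("support check:", supp_check, "\n")
cat("lift check:   ", lift_check, "\n")
```

Both results agree with the reported support and lift, which also shows why a rule and its reverse share the same lift even though their confidences differ.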
Visualizations (bar plots, network graphs) help in analyzing frequent itemsets and their relationships.
The parameter supp = 0.03 ensures only itemsets appearing in at least 3% of transactions are considered. The confidence = 0.6 parameter ensures that only rules with at least 60% confidence are considered.
library(arules)
library(arulesViz)
# Reuse the movies_trans object built for Apriori. movies$genres is already a
# list column, so splitting it again would produce malformed item labels such
# as c("Drama", (these are visible in the output printed below).
#ECLAT algorithm
freq.items <- eclat(movies_trans, parameter=list(supp=0.03, maxlen=15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.03 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 120
##
## create itemset ...
## set transactions ...[76 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating sparse bit matrix ... [27 row(s), 4013 column(s)] done [0.00s].
## writing ... [37 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
#inspections
inspect(head(sort(freq.items, by="support", decreasing=TRUE), 10))
## items support count
## [1] {c("Drama",} 0.17268876 693
## [2] {"Thriller")} 0.16795415 674
## [3] {c("Action",} 0.16072764 645
## [4] {c("Comedy",} 0.14453028 580
## [5] {"Romance")} 0.13506105 542
## [6] {"Drama",} 0.13331672 535
## [7] {"Comedy",} 0.09294792 373
## [8] {"Drama")} 0.08497384 341
## [9] {"Adventure",} 0.07774732 312
## [10] {Drama} 0.07625218 306
rules <- ruleInduction(freq.items, transactions = movies_trans, confidence = 0.6)
inspect(head(sort(rules, by="lift"), 10))
## lhs rhs support confidence lift itemset
## [1] {"Adventure",} => {c("Action",} 0.05158236 0.6634615 4.127862 2
#visualizations
plot(freq.items, method="graph", engine="igraph", control=list(type="items"))
## Available control parameters (with default values):
## main = Graph for 37 itemsets
## max = 100
## nodeCol = c("#EE0000FF", "#EE0303FF", "#EE0606FF", ..., "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## itemnodeCol = #66CC66FF
## edgeCol = #ABABABFF
## labelCol = #000000B3
## measureLabels = FALSE
## precision = 3
## arrowSize = 0.5
## alpha = 0.5
## cex = 1
## layout = NULL
## layoutParams = list()
## engine = igraph
## plot = TRUE
## plot_options = list()
## verbose = FALSE
plot(freq.items, method="graph", engine="visNetwork")
The ECLAT algorithm discovers frequent genre combinations in movies. Unlike Apriori, which generates rules directly, ECLAT focuses on finding itemsets that frequently appear together.
The extracted frequent itemsets reveal which genres co-occur most often in movies. The rule induction step then derives association rules, showing which genres are strongly linked.
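The idea behind ECLAT's itemset counting can be illustrated with a small base R sketch. It stores, for each genre, the ids of the transactions that contain it (a tid-list) and obtains the support of an itemset by intersecting those lists. The data and the names `tidlists` and `itemset_support()` are made up for illustration; this is a simplified view of the principle, not the implementation used by the arules package.

```r
# Made-up transactions (hypothetical, not from the dataset).
trans <- list(
  c("Action", "Adventure"),
  c("Action", "Thriller"),
  c("Action", "Adventure", "Fantasy"),
  c("Drama"),
  c("Drama", "Romance")
)

# Vertical layout: for each item, the ids of the transactions containing it.
items <- sort(unique(unlist(trans)))
tidlists <- lapply(items, function(it) {
  which(vapply(trans, function(tr) it %in% tr, logical(1)))
})
names(tidlists) <- items

# Support of an itemset = size of the intersection of its items' tid-lists,
# divided by the number of transactions.
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tidlists[itemset])
  length(tids) / length(trans)
}

print(tidlists[["Action"]])
print(itemset_support(c("Action", "Adventure")))
```

Because only tid-lists are intersected, this step yields support values alone; confidence and lift require a separate rule induction step, as done with `ruleInduction()` above.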
For example, a high-support itemset such as {Action, Adventure, Fantasy} would suggest that these genres frequently appear together in movies. A rule such as {Action, Adventure} → {Fantasy} with lift > 1 would indicate that movies labeled Action and Adventure are more likely to also be labeled Fantasy than movies in general.
ECLAT is well suited to finding common genre combinations. Apriori is well suited to describing genre relationships through rules. Note that association rules capture co-occurrence, not causality.
Support (frequency of itemsets) is reported by both Apriori and ECLAT. However, confidence (probability of one genre appearing given another) and lift (strength of association compared to random co-occurrence) come directly from the Apriori rules, whereas ECLAT returns only frequent itemsets and needs a separate rule induction step to obtain them. Although both have advantages, Apriori might be the better choice here, as we are trying to make a deeper analysis of the simultaneous occurrence of movie genres.