Association

Introduction

This report analyzes a movie dataset using exploratory data analysis and association rule mining. The dataset contains columns such as title, director, genres, rating, and budget. In this analysis, the association between genres is examined.

The data was obtained from Kaggle: https://www.kaggle.com/datasets/karthiknamboori1/movie-datasets?resource=download

Goal:

The goal is to discover associations between movie genres. The genres column is used to extract information about simultaneous occurrences of genres in the same movie. The extracted association rules show which genres tend to co-occur, with metrics such as support (frequency), confidence (likelihood of co-occurrence), and lift (strength of association).

Data Loading and Preprocessing

library(arulesViz)
library(ggpubr)
library(plotly)
library(tidyr)
library(arules)
library(ggplot2)
library(readr)
library(dplyr)
library(rCBA)

movies <- read_csv("movie_dataset.csv")
head(movies)

## # A tibble: 6 × 12
##   budget genres     id original_language popularity release_date revenue runtime
##    <dbl> <chr>   <dbl> <chr>                  <dbl> <chr>          <dbl>   <dbl>
## 1 2.37e8 Actio…  19995 en                     150.  10-12-2009    2.79e9     162
## 2 3   e8 Adven…    285 en                     139.  19-05-2007    9.61e8     169
## 3 2.45e8 Actio… 206647 en                     107.  26-10-2015    8.81e8     148
## 4 2.5 e8 Actio…  49026 en                     112.  16-07-2012    1.08e9     165
## 5 2.6 e8 Actio…  49529 en                      43.9 07-03-2012    2.84e8     132
## 6 2.58e8 Fanta…    559 en                     116.  01-05-2007    8.91e8     139
## # ℹ 4 more variables: title <chr>, vote_average <dbl>, vote_count <dbl>,
## #   director <chr>

str(movies)

## spc_tbl_ [4,041 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ budget           : num [1:4041] 2.37e+08 3.00e+08 2.45e+08 2.50e+08 2.60e+08 2.58e+08 2.60e+08 2.80e+08 2.50e+08 2.50e+08 ...
##  $ genres           : chr [1:4041] "Action Adventure Fantasy Science-Fiction" "Adventure Fantasy Action" "Action Adventure Crime" "Action Crime Drama Thriller" ...
##  $ id               : num [1:4041] 19995 285 206647 49026 49529 ...
##  $ original_language: chr [1:4041] "en" "en" "en" "en" ...
##  $ popularity       : num [1:4041] 150.4 139.1 107.4 112.3 43.9 ...
##  $ release_date     : chr [1:4041] "10-12-2009" "19-05-2007" "26-10-2015" "16-07-2012" ...
##  $ revenue          : num [1:4041] 2.79e+09 9.61e+08 8.81e+08 1.08e+09 2.84e+08 ...
##  $ runtime          : num [1:4041] 162 169 148 165 132 139 100 141 153 151 ...
##  $ title            : chr [1:4041] "Avatar" "Pirates of the Caribbean: At World's End" "Spectre" "The Dark Knight Rises" ...
##  $ vote_average     : num [1:4041] 7.2 6.9 6.3 7.6 6.1 5.9 7.4 7.3 7.4 5.7 ...
##  $ vote_count       : num [1:4041] 11800 4500 4466 9106 2124 ...
##  $ director         : chr [1:4041] "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   budget = col_double(),
##   ..   genres = col_character(),
##   ..   id = col_double(),
##   ..   original_language = col_character(),
##   ..   popularity = col_double(),
##   ..   release_date = col_character(),
##   ..   revenue = col_double(),
##   ..   runtime = col_double(),
##   ..   title = col_character(),
##   ..   vote_average = col_double(),
##   ..   vote_count = col_double(),
##   ..   director = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

Distribution of the average vote scores (vote_average) of the movies in the dataset

ggplot(movies, aes(x = vote_average)) +
  geom_histogram(binwidth = 0.5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Vote Averages", x = "Vote Average", y = "Frequency") +
  theme_minimal()

Distribution of the runtime of the movies in the dataset

ggplot(movies, aes(x = runtime)) +
  geom_histogram(binwidth = 10, fill = "orange", color = "black") +
  labs(title = "Distribution of Runtime", x = "Runtime (minutes)", y = "Frequency") +
  theme_minimal()

Distribution of the popularity of the movies in the dataset

ggplot(movies, aes(x = popularity)) +
  geom_histogram(binwidth = 5, fill = "green", color = "black") +
  labs(title = "Distribution of Popularity", x = "Popularity", y = "Frequency") +
  theme_minimal()

The top 10 genres of the movies according to their frequencies in the dataset

# Split the space-separated genres string and count how often each genre occurs
genres_count <- movies %>%
  filter(!is.na(genres)) %>%
  mutate(genres_split = strsplit(as.character(genres), " ")) %>%
  unnest(genres_split) %>%
  count(genres_split, sort = TRUE) %>%
  slice_max(order_by = n, n = 10)

ggplot(genres_count, aes(x = reorder(genres_split, n), y = n)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "Top 10 Genres", x = "Genres", y = "Frequency") +
  theme_minimal()

Apriori Method

The following parameters are used in this first application of the Apriori method: supp = 0.03, conf = 0.6

Remove missing values (NA) from the genres column. Convert the genres column into a list format, where each movie’s genres are split into individual items. Transform the list into a transactions format, which is required to apply the Apriori algorithm.
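The three preprocessing steps above can be sketched on a tiny example. The toy genres vector below is hypothetical and only mimics the space-separated format of the genres column; the logical matrix built in base R is an illustration of what a transactions object represents (one row per movie, one column per genre), not the internal storage used by arules.

```r
# Hypothetical toy data in the same space-separated format as the genres column
genres <- c("Action Adventure", "Drama Romance", NA, "Action Thriller Drama")

# Step 1: drop missing values
genres <- genres[!is.na(genres)]

# Step 2: split each string into a character vector of genres
genre_list <- strsplit(genres, " ", fixed = TRUE)

# Step 3: build a logical incidence matrix (rows = movies, columns = genres)
items <- sort(unique(unlist(genre_list)))
incidence <- t(vapply(genre_list, function(g) items %in% g, logical(length(items))))
colnames(incidence) <- items

print(incidence)
```

Each TRUE marks a genre that is present in the corresponding movie; the Apriori and ECLAT algorithms count how often columns are TRUE together.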

supp (Support = 0.03)

Support measures how frequently an item or itemset appears in the dataset. A support threshold of 0.03 (3%) means that only itemsets that appear in at least 3% of all transactions will be considered.

conf (Confidence = 0.6)

Confidence measures the likelihood that when item X appears, item Y also appears. A confidence threshold of 0.6 (60%) means that the algorithm will only return rules where at least 60% of transactions containing X also contain Y.
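To make the definitions of support and confidence (and lift, mentioned in the goal) concrete, here is a minimal base R sketch that computes the three measures by hand. The toy transactions and the helper function support() are hypothetical and exist only for illustration; they are not part of the arules package or of the movie dataset.

```r
# Hypothetical toy transactions: each element is the genre set of one movie
movies_toy <- list(
  c("Animation", "Family"),
  c("Animation", "Family", "Comedy"),
  c("Family", "Comedy"),
  c("Drama"),
  c("Drama", "Comedy")
)

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

lhs <- "Animation"
rhs <- "Family"

supp_rule <- support(c(lhs, rhs), movies_toy)      # share of movies with both genres
conf_rule <- supp_rule / support(lhs, movies_toy)  # share of lhs movies that also have rhs
lift_rule <- conf_rule / support(rhs, movies_toy)  # confidence relative to the rhs base rate

cat("support:", supp_rule, "confidence:", conf_rule, "lift:", lift_rule, "\n")
```

With thresholds of supp = 0.03 and conf = 0.6, a rule is kept only if its support is at least 0.03 and its confidence is at least 0.6.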

# Remove movies with a missing genres value
movies <- movies %>% filter(!is.na(genres))

# Split the space-separated genres into a list and convert it to transactions
movies$genres <- strsplit(as.character(movies$genres), " ")
movies_trans <- as(movies$genres, "transactions")

# Apply the Apriori algorithm with the parameters described above
rules <- apriori(movies_trans, parameter = list(supp = 0.03, conf = 0.6))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.03      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 120 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [16 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [8 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 8 rules
## 
## rule length distribution (lhs + rhs):sizes
## 2 3 
## 5 3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   2.000   2.375   3.000   3.000 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.03514   Min.   :0.6008   Min.   :0.04137   Min.   :1.414  
##  1st Qu.:0.03862   1st Qu.:0.6558   1st Qu.:0.04747   1st Qu.:2.238  
##  Median :0.04186   Median :0.6939   Median :0.05632   Median :2.434  
##  Mean   :0.05887   Mean   :0.7410   Mean   :0.08510   Mean   :3.101  
##  3rd Qu.:0.06317   3rd Qu.:0.8548   3rd Qu.:0.09961   3rd Qu.:3.003  
##  Max.   :0.12036   Max.   :0.9000   Max.   :0.18216   Max.   :7.918  
##      count      
##  Min.   :141.0  
##  1st Qu.:155.0  
##  Median :168.0  
##  Mean   :236.2  
##  3rd Qu.:253.5  
##  Max.   :483.0  
## 
## mining info:
##          data ntransactions support confidence
##  movies_trans          4013    0.03        0.6
##                                                                     call
##  apriori(data = movies_trans, parameter = list(supp = 0.03, conf = 0.6))

inspect(head(sort(rules, by="lift")))

##     lhs                             rhs        support    confidence coverage  
## [1] {Animation}                  => {Family}   0.04335908 0.8405797  0.05158236
## [2] {Adventure, Thriller}        => {Action}   0.04036880 0.9000000  0.04485422
## [3] {Adventure, Science-Fiction} => {Action}   0.03513581 0.7268041  0.04834289
## [4] {Mystery}                    => {Thriller} 0.04859208 0.6610169  0.07351109
## [5] {Action, Crime}              => {Thriller} 0.03912285 0.6408163  0.06105158
## [6] {Adventure}                  => {Action}   0.10690257 0.6008403  0.17792175
##     lift     count
## [1] 7.918419 174  
## [2] 3.509913 162  
## [3] 2.834465 141  
## [4] 2.472191 195  
## [5] 2.396641 157  
## [6] 2.343219 429

inspect(sort(rules, by = "support", decreasing = TRUE)[1:5])

##     lhs                      rhs        support    confidence coverage  
## [1] {Romance}             => {Drama}    0.12035883 0.6607387  0.18215799
## [2] {Adventure}           => {Action}   0.10690257 0.6008403  0.17792175
## [3] {Mystery}             => {Thriller} 0.04859208 0.6610169  0.07351109
## [4] {Animation}           => {Family}   0.04335908 0.8405797  0.05158236
## [5] {Adventure, Thriller} => {Action}   0.04036880 0.9000000  0.04485422
##     lift     count
## [1] 1.414157 483  
## [2] 2.343219 429  
## [3] 2.472191 195  
## [4] 7.918419 174  
## [5] 3.509913 162

The graph-based visualization helps in understanding the relationships between genres.

#plot the graph
plot(rules, method="graph", engine="htmlwidget")

The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.

#scatter plot of association rules
plot(rules, measure=c("support", "confidence"), 
     shading="lift", engine="plotly")

supp (Support = 0.006)

I decreased this value to be able to cover more data points. Support measures how frequently an item or itemset appears in the dataset. A support threshold of 0.006 (0.6%) means that only itemsets that appear in at least 0.6% of all transactions will be considered.

conf (Confidence = 0.25)

I decreased this value to be able to cover more data points. Confidence measures the likelihood that when item X appears, item Y also appears. A confidence threshold of 0.25 (25%) means that the algorithm will only return rules where at least 25% of transactions containing X also contain Y.
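As a quick check of what the relative thresholds mean in terms of movies, the sketch below converts them into absolute counts. The transaction count of 4013 is taken from the Apriori log; the variable names are illustrative, and the use of floor() is an assumption chosen because it matches the "Absolute minimum support count" lines printed in the logs, not a statement about how arules computes the value internally.

```r
n_transactions <- 4013  # number of transactions reported in the Apriori log

# Relative support thresholds used in the two runs
thresholds <- c(first_run = 0.03, second_run = 0.006)

# Relative threshold expressed as a number of movies
raw_counts <- thresholds * n_transactions

# Truncating to whole movies matches the counts printed in the logs
min_counts <- floor(raw_counts)

print(raw_counts)
print(min_counts)
```

Lowering the support threshold from 0.03 to 0.006 therefore lets genre combinations that occur in only a few dozen movies enter the analysis.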

rules2 <- apriori(movies_trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.25    0.1    1 none FALSE            TRUE       5   0.006      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 24 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[20 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [19 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [176 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules2)

## set of 176 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4 
##  51 106  19 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   2.818   3.000   4.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift        
##  Min.   :0.006230   Min.   :0.2507   Min.   :0.008473   Min.   : 0.5949  
##  1st Qu.:0.009469   1st Qu.:0.3480   1st Qu.:0.021119   1st Qu.: 1.4190  
##  Median :0.016197   Median :0.4545   Median :0.041116   Median : 2.0051  
##  Mean   :0.027459   Mean   :0.4795   Mean   :0.065970   Mean   : 2.5955  
##  3rd Qu.:0.035510   3rd Qu.:0.5833   3rd Qu.:0.080738   3rd Qu.: 2.7592  
##  Max.   :0.121356   Max.   :0.9123   Max.   :0.467232   Max.   :11.4651  
##      count      
##  Min.   : 25.0  
##  1st Qu.: 38.0  
##  Median : 65.0  
##  Mean   :110.2  
##  3rd Qu.:142.5  
##  Max.   :487.0  
## 
## mining info:
##          data ntransactions support confidence
##  movies_trans          4013   0.006       0.25
##                                                                                            call
##  apriori(data = movies_trans, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

inspect(head(sort(rules2, by="lift")))

##     lhs                            rhs         support    confidence coverage  
## [1] {Adventure, Comedy, Family} => {Animation} 0.01370546 0.5913978  0.02317468
## [2] {Drama, War}                => {History}   0.01071518 0.4018692  0.02666334
## [3] {Adventure, Family}         => {Animation} 0.02192873 0.4782609  0.04585098
## [4] {War}                       => {History}   0.01245951 0.3787879  0.03289310
## [5] {History}                   => {War}       0.01245951 0.3012048  0.04136556
## [6] {Drama, History}            => {War}       0.01071518 0.2885906  0.03712933
##     lift      count
## [1] 11.465119 55   
## [2]  9.715066 43   
## [3]  9.271792 88   
## [4]  9.157083 50   
## [5]  9.157083 50   
## [6]  8.773592 43

inspect(sort(rules2, by = "support", decreasing = TRUE)[1:5])

##     lhs           rhs        support   confidence coverage  lift      count
## [1] {Thriller} => {Action}   0.1213556 0.4538677  0.2673810 1.7700398 487  
## [2] {Action}   => {Thriller} 0.1213556 0.4732750  0.2564166 1.7700398 487  
## [3] {Romance}  => {Drama}    0.1203588 0.6607387  0.1821580 1.4141570 483  
## [4] {Drama}    => {Romance}  0.1203588 0.2576000  0.4672315 1.4141570 483  
## [5] {Thriller} => {Drama}    0.1131323 0.4231128  0.2673810 0.9055742 454

The graph-based visualization helps in understanding the relationships between genres.

plot(rules2, method="graph", engine="htmlwidget")

The scatter plot displays the rules with support on the x-axis and confidence on the y-axis, while color represents the lift.

#scatter plot of association rules
plot(rules2, measure=c("support", "confidence"), 
     shading="lift", engine="plotly")

Interpretation

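The Apriori algorithm discovers associations between movie genres that frequently appear together. Support gives the share of movies in which the genres of a rule co-occur, confidence gives the share of movies with the left-hand genres that also carry the right-hand genre, and lift shows how much more often the genres co-occur than would be expected if they were independent.

For example, the rule {Animation} → {Family} from the first run has a confidence of about 0.84 and a lift of about 7.9: roughly 84% of Animation movies are also labeled Family, which is about eight times the rate expected if the two labels were independent. A lift below 1, as for {Thriller} → {Drama} in the second run (about 0.91), indicates that the genres co-occur slightly less often than expected under independence.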
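The lift values can be checked by hand from the reported numbers, since lift equals the confidence of a rule divided by the support of its right-hand side. The sketch below uses values copied from the inspect() output of the second run; the support of {Drama} is read from the coverage column of the rule {Drama} => {Romance}. The variable names are illustrative.

```r
# Values copied from the inspect() output of the second Apriori run
conf_romance_drama  <- 0.6607387  # confidence of {Romance}  => {Drama}
conf_thriller_drama <- 0.4231128  # confidence of {Thriller} => {Drama}
supp_drama          <- 0.4672315  # coverage of {Drama} => {Romance}, i.e. support of {Drama}

# lift = confidence / support of the right-hand side
lift_romance_drama  <- conf_romance_drama / supp_drama
lift_thriller_drama <- conf_thriller_drama / supp_drama

print(round(c(lift_romance_drama, lift_thriller_drama), 4))
```

Because Drama appears in almost half of all movies, even a rule with 66% confidence gains only a modest lift, and a 42% confidence falls below the base rate.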

ECLAT Method

ECLAT finds frequent itemsets (combinations of frequently co-occurring genres). Rule Induction extracts association rules (if a movie has Genre A, it’s likely to have Genre B).

Support measures how frequently an itemset appears in the dataset. Confidence measures the probability of one genre appearing given another, while lift tells us whether the association is stronger than chance.

Visualizations (bar plots, network graphs) help in analyzing frequent itemsets and their relationships.

The parameter supp = 0.03 ensures only itemsets appearing in at least 3% of transactions are considered. The confidence = 0.6 parameter ensures that only rules with at least 60% confidence are considered.
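ECLAT works on a vertical layout of the data: for each item it keeps the list of transaction ids that contain it, and the support of an itemset follows from intersecting those lists. The base R sketch below illustrates this idea on hypothetical toy transactions; the helper itemset_support() is made up for illustration and is not how the eclat() function is implemented.

```r
# Hypothetical toy transactions (one character vector of genres per movie)
movies_toy <- list(
  c("Action", "Adventure"),
  c("Action", "Thriller"),
  c("Action", "Adventure", "Fantasy"),
  c("Drama", "Romance"),
  c("Drama")
)
n <- length(movies_toy)

# Vertical layout: for each item, the ids of the transactions that contain it
items <- sort(unique(unlist(movies_toy)))
tid_lists <- lapply(items, function(item) {
  which(vapply(movies_toy, function(t) item %in% t, logical(1)))
})
names(tid_lists) <- items

# Support of an itemset = size of the intersection of its tid lists / n
itemset_support <- function(itemset) {
  length(Reduce(intersect, tid_lists[itemset])) / n
}

print(itemset_support("Action"))                  # single item
print(itemset_support(c("Action", "Adventure")))  # pair of items
```

An itemset is frequent when this value reaches the support threshold, which is 0.03 in the ECLAT run that follows.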

library(arules)
library(arulesViz)

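# movies_trans from the Apriori section is reused here. movies$genres is already
# a list at this point, so splitting it again would coerce each element to a
# string such as c("Drama", "Thriller") and yield malformed items like {c("Drama",}.
# The output recorded below comes from such a re-split run, hence its item labels.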

#ECLAT algorithm
freq.items <- eclat(movies_trans, parameter=list(supp=0.03, maxlen=15))

## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.03      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 120 
## 
## create itemset ... 
## set transactions ...[76 item(s), 4013 transaction(s)] done [0.00s].
## sorting and recoding items ... [27 item(s)] done [0.00s].
## creating sparse bit matrix ... [27 row(s), 4013 column(s)] done [0.00s].
## writing  ... [37 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

#inspections
inspect(head(sort(freq.items, by="support", decreasing=TRUE), 10))

##      items          support    count
## [1]  {c("Drama",}   0.17268876 693  
## [2]  {"Thriller")}  0.16795415 674  
## [3]  {c("Action",}  0.16072764 645  
## [4]  {c("Comedy",}  0.14453028 580  
## [5]  {"Romance")}   0.13506105 542  
## [6]  {"Drama",}     0.13331672 535  
## [7]  {"Comedy",}    0.09294792 373  
## [8]  {"Drama")}     0.08497384 341  
## [9]  {"Adventure",} 0.07774732 312  
## [10] {Drama}        0.07625218 306

rules <- ruleInduction(freq.items, transactions = movies_trans, confidence = 0.6)
inspect(head(sort(rules, by="lift"), 10))

##     lhs               rhs           support    confidence lift     itemset
## [1] {"Adventure",} => {c("Action",} 0.05158236 0.6634615  4.127862 2

#visualizations
plot(freq.items, method="graph", engine="igraph")

plot(freq.items, method="graph", engine="visNetwork")

Interpretation of ECLAT

The ECLAT algorithm discovers frequent genre combinations in movies. Unlike Apriori, which generates rules directly, ECLAT focuses on finding itemsets that frequently appear together.

The extracted frequent itemsets reveal which genres co-occur most often in movies. The rule induction step then derives association rules, showing which genres are strongly linked.

For example, a high-support itemset {Action, Adventure, Fantasy} suggests that these genres frequently appear together in movies. A strong association rule like: {Action, Adventure} → {Fantasy} (lift > 1) indicates that movies labeled Action and Adventure are significantly more likely to also be labeled Fantasy compared to random chance.

Comparison of Apriori and ECLAT

ECLAT is well suited to finding common genre combinations, while Apriori is convenient for studying relationships between genres because it produces rules directly. Neither method establishes causality; the rules only describe co-occurrence.

Support (frequency of itemsets) is reported by both Apriori and ECLAT. Confidence (probability of one genre appearing given another) and lift (strength of association compared to random co-occurrence) are produced directly by Apriori, whereas with ECLAT they only become available after the additional rule induction step. Although both have advantages, Apriori might be the better choice here, as we are trying to make a deeper analysis of simultaneous occurrences of movie genres.