This is the second of two papers for the Unsupervised Learning winter class of 2021.
In this paper, I use association rules to identify ‘cliques’ in the movie world: sets of artists who frequently make movies together. The analysis includes actors, directors, producers, writers and composers. The working hypothesis is that association rules should be able to identify groups of artists who are frequently billed together, e.g. famous duos and recurring franchise casts.
Identifying the famous duos is straightforward and possible with a single join (sketched below). I focus on identifying sets of at least three artists, which requires data mining tools such as association rules and should provide at least a few surprising results. Nonetheless, the algorithm's ability to identify a few of the duos will be a good benchmark to sanity-check the results.
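A minimal sketch of the single-join approach, assuming a hypothetical credits table with columns tconst (movie ID) and nconst (artist ID), analogous to the principals table loaded later in this paper:

library(dplyr)

# Pair up artists who share a movie and count shared movies per pair.
duos <- credits %>%
  inner_join(credits, by = 'tconst', suffix = c('_a', '_b')) %>%
  filter(nconst_a < nconst_b) %>%          # keep each unordered pair once
  count(nconst_a, nconst_b, sort = TRUE)   # shared movies per pair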
Overall, this paper is intended to document the end-to-end analysis process, including data cleaning, analysis and visualization.
Throughout the analysis, I use data made available by IMDB, documented under this link and published under this link. I use the following tables: title.basics (titles), title.principals (principal cast) and name.basics (artist names).
The principal cast table contains information on the top 10 billed names only; data on the full cast needs to be licensed and accessed via API, or scraped. One could argue this makes the analysis less than comprehensive. However, as we will soon see, given the number of movies in the database, calculating association rules with more than 10 names per movie would not be feasible in a home setting anyway.
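A rough back-of-the-envelope check of why the top-10 limit is actually convenient, assuming, simplistically, that mining cost scales with the number of candidate itemsets a transaction can generate:

# A transaction with k items can contribute up to 2^k - 1 non-empty itemsets.
itemsets_per_movie <- function(k) 2^k - 1
itemsets_per_movie(10)   # 1,023 candidates with top-10 billing
itemsets_per_movie(50)   # ~1.1e15 with a hypothetical full cast of 50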
I start off by loading the datasets. The Titles, Principals and Names tables hold 8.5m, 47.8m and 11.3m records respectively, which poses a challenge for base and tidyverse read functions: they are too slow. I use the fread function from the data.table package instead. For EDA purposes I will at times work with 20% random samples of the records to shorten running times; for the final run I will switch to the full dataset.
library(tidyverse)

# Read the titles table; IMDB encodes missing values as '\N'.
basics <- data.table::fread(
  'title.basics.tsv',
  quote = '',
  na.strings = '\\N',
  encoding = 'UTF-8',
  colClasses = c(
    startYear = 'double',
    runtimeMinutes = 'double'))

# Draw a reproducible 20% sample for EDA and cache it on disk.
set.seed(456)
basics_sample <- slice_sample(basics, prop = 0.2)
save(basics_sample, file = 'basics_sample.RData')
load('basics_sample.RData')
I first investigate the title type distribution. It turns out that 75% of IMDB titles are actually TV episodes. This is significant for my analysis: TV series generally feature the same core cast over multiple episodes or even seasons, which means that rule mining would mostly find the longest-running TV series instead of interesting relationships in the movie world. This is why in the next step I narrow the dataset down to movies only.
basics_sample %>%
  group_by(titleType) %>%
  summarise(n = n()) %>%
  mutate(freq = n/sum(n)) %>%
  ggplot(aes(x = fct_reorder(titleType, freq),
             y = freq)) +
  geom_col(col = 'black', fill = 'coral') +
  coord_flip() +
  geom_text(aes(label = paste0(round(freq*100, 2), '%')), nudge_y = 0.05) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.3))) +
  labs(title = 'Title type proportion',
       x = 'Title type',
       y = 'Share of titles') +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Since the title type I am interested in makes up only 7% of the original dataset, it is safe to switch back to the full dataset, narrowed down to the movie type.
movies <- filter(basics, titleType == 'movie', startYear >= 1920, startYear <= 2021)
rm(basics)
rm(basics_sample)
save(movies, file='movies.RData')
Interestingly, the data is heavily skewed towards movies released after the year 2000. It should be noted that IMDB has existed only since 1990 and was acquired by Amazon in 1998, which coincided with an acceleration in the growth of recorded releases. It is not possible to tell from this data alone whether the trend reflects a boom in the global movie industry, better bookkeeping, or localized explosions in output from the very prolific Nigerian and Indian film industries, known as Nollywood and Bollywood. Nonetheless, it is interesting to see how the pandemic impacted the film industry, which shrank by 15% in terms of the number of releases in 2020 and has still not fully recovered.
load('movies.RData')
nrow(movies)
## [1] 500573
movies %>%
  group_by(startYear) %>%
  summarise(n = n()) %>%
  drop_na() %>%
  ggplot(aes(x = startYear,
             y = n)) +
  labs(title = 'Movies released by year',
       x = 'Release year',
       y = 'Movies released') +
  geom_line(size = 1, col = 'brown2') +
  scale_y_continuous(labels = function(x) format(x, big.mark = ',')) +
  theme_bw()
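The 15% figure can be sanity-checked directly; a quick sketch using the movies table loaded above:

# Year-over-year change in the number of releases around the pandemic.
movies %>%
  count(startYear) %>%
  filter(startYear >= 2018) %>%
  mutate(yoy_change = n / lag(n) - 1)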
I add the information about the principal cast to the movies. The principal cast table has 47.8m records and needs to be removed after the join to conserve memory.
princ <- data.table::fread(
  'title.principals.tsv',
  na.strings = '\\N',
  encoding = 'UTF-8',
  quote = '')

# Keep only principal cast records belonging to the filtered movies.
movies_princ <- princ %>%
  inner_join(select(movies, tconst, titleType, primaryTitle, startYear),
             by = c('tconst'))
rm(princ)   # 47.8m records; free memory as soon as possible
save(movies_princ, file = 'movies_princ.RData')
Noticeably, directors, writers and producers seem to be overrepresented among the cast. This is, of course, an effect of limiting the analysis to the principal cast, where these roles are more likely to be featured among the top 10 billed names. It is thus crucial not to limit the analysis to actors only, as this would put a hard cap on the maximal itemset size that can be found via association mining.
load('movies_princ.RData')
nrow(movies_princ)
## [1] 3813954
movies_princ %>%
  group_by(category) %>%
  summarise(n = n()) %>%
  mutate(freq = paste0(round(n/sum(n)*100, 0), '%')) %>%
  arrange(desc(n))
## # A tibble: 12 x 3
## category n freq
## <chr> <int> <chr>
## 1 actor 1061108 28%
## 2 actress 643069 17%
## 3 director 492261 13%
## 4 writer 375924 10%
## 5 producer 311756 8%
## 6 cinematographer 298464 8%
## 7 composer 268598 7%
## 8 editor 175476 5%
## 9 self 141618 4%
## 10 production_designer 32938 1%
## 11 archive_footage 12689 0%
## 12 archive_sound 53 0%
Another noticeable feature is that some movies have fewer than 10 principal cast members. These are likely to be niche or low-budget movies. The average number of names per title is also decreasing over time, which suggests an increasing inflow of such movies, perhaps driven by access to hardware and the decreasing cost of amateur filmmaking.
In any case, associations produced by these movies will not be as interesting, so for the sake of efficiency I drop the movies with nine or fewer principal cast members. This cuts the number of movies to be analyzed by half.
names_per_movie <- movies_princ %>%
  group_by(tconst) %>%
  summarise(n_names = n())

names_per_movie %>%
  ggplot(aes(x = n_names)) +
  geom_histogram(bins = 10, col = 'black', fill = 'cornflowerblue') +
  labs(title = 'Distribution of names per title') +
  theme_bw()

# Average number of principal names per title by release year.
mprinc_avgs <- movies_princ %>%
  group_by(startYear, tconst) %>%
  summarise(n_names = n()) %>%
  summarise(avg_names = mean(n_names))

mprinc_avgs %>%
  ggplot(aes(x = startYear, y = avg_names)) +
  geom_line(col = 'brown2') +
  labs(title = 'Avg. number of names per title over time') +
  theme_bw()

# Keep only movies with a full set of 10 principal names.
movies <- movies %>%
  inner_join(select(filter(names_per_movie, n_names == 10), tconst), by = c('tconst'))
Switching to the analysis of artists, I notice that roughly two thirds of names are featured in the principal cast of only one movie (these names could of course still appear in the wider cast). Such names are guaranteed not to feature in any association, since any itemset containing them has a support count of at most one, so I can drop them safely. I also drop all names featured in the principal cast of only two movies, as overall I am more interested in rules involving three or more names.
movies_per_name <- movies_princ %>%
  group_by(nconst) %>%
  summarise(n = n())

movies_per_name %>%
  mutate(n_bin = cut(n, breaks = c(1:10, Inf), right = F)) %>%
  group_by(n_bin) %>%
  summarise(n = n()) %>%
  mutate(freq = format(n/sum(n), digits = 1))
## # A tibble: 10 x 3
## n_bin n freq
## <fct> <int> <chr>
## 1 [1,2) 877480 0.665
## 2 [2,3) 174208 0.132
## 3 [3,4) 76513 0.058
## 4 [4,5) 43827 0.033
## 5 [5,6) 27994 0.021
## 6 [6,7) 19707 0.015
## 7 [7,8) 14443 0.011
## 8 [8,9) 11089 0.008
## 9 [9,10) 8640 0.007
## 10 [10,Inf) 65233 0.049
names_universe <- movies_per_name %>%
  filter(n > 2) %>%
  select(nconst)
The last step is to drop all names featured in a title with fewer than 10 principal cast members, and then drop all titles where the entire principal cast featured in one or no other movies. This leaves me with 248k names featured in 240k movies, a considerable improvement over the initial 11.3m names featured in 8.5m titles.
movies_princ <- semi_join(movies_princ, names_universe, by = c('nconst')) %>%
  semi_join(movies, by = c('tconst'))
movies <- semi_join(movies, movies_princ, by = c('tconst'))
nrow(movies)
## [1] 239957
length(unique(movies_princ$nconst))
## [1] 247692
Finally, I convert the nconst identifiers into actual names, so it’s easier to interpret the association mining results.
names <- data.table::fread(
  'name.basics.tsv',
  na.strings = '\\N',
  quote = '',
  encoding = 'UTF-8')

movies_trans <- movies_princ %>%
  select(tconst, nconst) %>%
  left_join(select(movies, tconst, primaryTitle), by = c('tconst')) %>%
  # Prefix titles with the tconst ID to keep transaction labels unique.
  mutate(title = paste(tconst, str_trunc(primaryTitle, 15))) %>%
  left_join(select(names, nconst, primaryName), by = c('nconst')) %>%
  # Prefix names with the last two digits of nconst to disambiguate
  # artists who share the same name.
  mutate(name = paste(substr(nconst, 8, 9), str_trunc(primaryName, 20)))
save(movies_trans, file = 'movies_trans.RData')
For the purposes of rule mining, every movie will be considered a transaction and every name will be considered an item. I am interested in finding ‘cliques’ with the longest possible rules (counting both lhs and rhs), and then analyzing those rules within the context of the film industry.
I conduct the analysis with the help of An R Companion for Introduction to Data Mining (Hahsler, 2021).
I start out by creating the transaction objects. This includes an object for a small sample of transactions, which I will use to show how most standard visualizations fail when a larger-scale problem is tackled.
load('movies_trans.RData')
trans <- arules::transactions(movies_trans, format = 'long', cols = c('title', 'name'))
trans_sample <- arules::transactions(slice_sample(movies_trans, prop = 0.05),
                                     format = 'long', cols = c('title', 'name'))
summary(trans)
summary(trans)
## Length Class Mode
## 239957 transactions S4
The binary matrix is uninformative for a problem of this scale: even for a 5% sample it essentially looks like white noise, and the time required to plot it for the full dataset is prohibitive.
arules::image(trans_sample)
Just looking at the most frequent items, i.e. the names found in principal casts most often, yields a few interesting insights (supplemented with information from the IMDB website):
arules::itemFrequencyPlot(trans, topN = 20)
Investigating the support levels across the entire dataset, we can immediately see that choosing the right level of support will be an inconvenience: there is a small group of very prolific artists, but 85% of artists have been featured in the principal cast of fewer than 10 titles.
supp_item <- tibble(
  support = sort(arules::itemFrequency(trans, type = "absolute"), decreasing = TRUE),
  item = seq_len(ncol(trans)))

p1 <- slice_sample(supp_item, prop = 0.1) %>%
  ggplot(aes(item, support)) +
  geom_line(col = 'blue') +
  theme_bw()

p2 <- supp_item %>%
  ggplot(aes(support)) +
  stat_ecdf(geom = 'step', col = 'blue') +
  labs(title = 'Zoomed in ECDF',
       y = 'Cumulative freq') +
  coord_cartesian(xlim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  theme_bw()

gridExtra::grid.arrange(p1, p2, ncol = 2)
This is where the difference versus the typical use case for transactional data becomes visible. Unlike in basket analysis, 85% of items (i.e. artists) have a very low ceiling on how many transactions (i.e. movies) they feature in, which is best expressed in absolute numbers rather than relative frequencies. This is somewhat inconvenient, since I need to provide support in relative terms to the apriori function. I decide on a minimum support of 19 movies, which precludes some of the franchises stated in the hypothesis, but it seems there are just too many prolific artists.
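Converting the absolute threshold into the relative support expected by apriori is a simple division; length() on a transactions object returns the number of transactions:

min_count <- 19
min_support <- min_count / length(trans)
min_support   # 19 / 239,957 is roughly 7.9e-05; I use 0.00008 below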
Looking at the most frequent itemsets, we can see an interesting group of Japanese artists:
All of them were very active, each with 50-150 titles to their name between 1960 and 2000. In this specific combination, they starred in 32 movies together.
The other most frequent ‘clique’ of actors, sharing 30 movies, is a group of early American movie artists active between 1930 and 1950 (not shown in the output, since it is obscured by combinations involving other Japanese artists from the first group):
its <- arules::apriori(trans,
                       parameter = list(target = 'frequent', support = 0.00008),
                       control = list(verbose = FALSE))

ggplot(tibble(`Itemset size` = factor(arules::size(its))), aes(`Itemset size`)) +
  geom_bar(fill = 'lightblue', col = 'black') +
  theme_bw()

arules::inspect(its[arules::size(its) > 5])
## items support count
## [1] {10 Kiyoshi Atsumi,
## 35 Yoshitaka Asama,
## 62 Chieko Baishô,
## 62 Naozumi Yamamoto,
## 82 Yôji Yamada,
## 90 Tetsuo Takaha} 0.0001333572 32
rules <- arules::apriori(trans,
                         parameter = list(support = 0.00008, confidence = 0.5),
                         control = list(verbose = FALSE))
It is worth noting that analyzing movies rather than typical transactional data requires adjusting the approach to interestingness measures. The most commonly used measure, lift, incorporates so-called null-transactions, in this case movies in which neither A nor B starred. Since most artists are featured in no more than 10 movies while the total number of movies released is relatively large, co-starring in even a single movie generates high lift values. In this case the Kulczynski measure, defined as the average of the two rule confidences, (P(B|A) + P(A|B)) / 2, is preferable, as it is not affected by a high proportion of null-transactions (Wu et al., 2007).
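A toy illustration with hypothetical counts shows why null-transactions do not matter here:

n_a  <- 30   # movies featuring artist A (hypothetical counts)
n_b  <- 25   # movies featuring artist B
n_ab <- 20   # movies featuring both
(n_ab / n_a + n_ab / n_b) / 2   # kulc = 0.73, independent of the total movie count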
Additionally, most rules involve only two names (e.g. the Coen brothers), so I restrict the analyzed rules to those where the total rule size (lhs + rhs) exceeds three names, both to limit their number and to identify more interesting rules.
This helped identify another interesting trio: Arthur Lake, Penny Singleton and Larry Simms, who together brought the Blondie comic strip to life in the 1940s. Interestingly, by inspecting the rules we see that Arthur Lake features on the lhs of all rules with a high Kulczynski value for this trio, but never on the rhs. This implies that while Penny Singleton and Larry Simms were mostly known (or ‘typecast’) as characters from the Blondie movies, Arthur Lake went on to star in multiple other movies. This turns out to be true! Larry Simms was mostly known as a child actor and Penny Singleton's filmography is mostly Blondie-related, while in Arthur Lake's case Blondie comprises less than half of the movies he starred in.
library(arules)

# Keep only rules with more than 3 names in total (lhs + rhs).
rules_sub <- subset(rules, subset = (size(lhs) + size(rhs) > 3))
# Keep an identical copy with the default quality measures for the
# interactive scatter plot further below.
rules_sub2 <- subset(rules, subset = (size(lhs) + size(rhs) > 3))

quality(rules_sub) <- cbind(
  interestMeasure(rules_sub,
                  measure = c('count', 'lift', 'support', 'kulczynski'),
                  trans = trans),
  'size' = size(rules_sub))

inspect(head(rules_sub, by = 'kulczynski', decreasing = TRUE, n = 5))
## lhs rhs count lift support kulczynski size
## [1] {21 Arthur Lake,
## 25 Penny Singleton,
## 82 Larry Simms} => {82 Chic Young} 28 8569.893 0.0001166876 1.0000000 4
## [2] {21 Arthur Lake,
## 25 Penny Singleton,
## 82 Chic Young} => {82 Larry Simms} 28 8569.893 0.0001166876 1.0000000 4
## [3] {21 Arthur Lake,
## 82 Chic Young,
## 82 Larry Simms} => {25 Penny Singleton} 28 7498.656 0.0001166876 0.9375000 4
## [4] {10 Gene Autry,
## 85 Armand Schaefer,
## 99 William Bradford} => {95 Champion} 25 8240.282 0.0001041853 0.9271978 4
## [5] {54 Sakae Nitta,
## 66 Shôji Sakai,
## 82 Teruo Oka} => {23 Yukio Chiba} 52 3242.662 0.0002167055 0.8513514 4
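The lhs/rhs asymmetry mentioned earlier can also be verified programmatically. A sketch, assuming the item label follows the '21 Arthur Lake' format from the output above:

# On which side of the top high-Kulczynski rules does Arthur Lake appear?
top_rules <- head(rules_sub, by = 'kulczynski', n = 20)
length(subset(top_rules, subset = lhs %in% '21 Arthur Lake'))
length(subset(top_rules, subset = rhs %in% '21 Arthur Lake'))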
The standard scatter plot for visualizing association rules is of limited use for movie data: due to the low absolute supports, most rules are clustered around a confidence of 1, and it is difficult to discern artist ‘cliques’.
library(arulesViz)
plot(rules_sub2, engine = 'html')
The parallel coordinates plot is also of limited use, primarily because of its difficult interpretation and the limited number of items and rules that can fit on a single chart. It can nonetheless be useful in compressing the association information, and it allows one to focus on artists rather than on specific rules. Sorting by size or by association measures and truncating the number of analyzed rules is necessary to retain legibility. On the chart below we can immediately see that there is only one ‘clique’ among the 20 largest rules.
plot(head(rules_sub, by='size', n=20), method = 'paracoord')
A graph is the perfect medium for mapping association rules between artists: it allows one to quickly identify groups of artists who often feature together in movies, explore their personal details, and only then dive into the specific relationships connecting the individuals. All of the artist groups mentioned in earlier paragraphs (i.e. found via manual inspection of rules or other visualization methods) can be easily identified on the graph, either by zooming in on specific clusters or by hovering over the labels. The direction and color of the arrows can be interpreted as the relative diversity of each pair of artists' collaborators. It is definitely the most efficient way of analyzing the associations.
plot(head(rules_sub, by='kulczynski', n=100), method = 'graph', engine = 'html')
In summary, throughout this paper I found the following:
- Association rule mining can identify artist ‘cliques’ at scale, once TV episodes, short-cast movies and rarely featured names are pruned from the data.
- The most frequent ‘cliques’ are a group of Japanese artists around Yôji Yamada (32 shared movies) and a group of early American movie artists active between 1930 and 1950 (30 shared movies); rule inspection also surfaced the Blondie trio of Arthur Lake, Penny Singleton and Larry Simms.
- Absolute support counts and null-transaction-robust measures such as Kulczynski suit movie data better than relative support and lift.
- Among the visualization methods, the interactive graph proved by far the most effective for exploring the associations.