This is the second of two papers for the Unsupervised Learning winter class of 2021.
In this paper, I use association rules to identify ‘cliques’ in the movie world: sets of artists who frequently make movies together. The analysis includes actors, directors, producers, writers and composers. The working hypothesis is that association rules should be able to identify groups of artists who are frequently billed together, e.g. famous duos and recurring franchise casts.
Identifying the famous duos is straightforward and possible with a single join (sketched below). I focus on identifying sets of at least three artists, which requires data mining tools such as association rules and should provide at least a few surprising results. Nonetheless, the algorithm's ability to identify a few of the duos will be a good benchmark to sanity-check the results.
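A minimal sketch of the single-join approach, assuming a hypothetical credits table with columns tconst (movie ID) and nconst (artist ID), analogous to the principals table loaded later in this paper:

library(dplyr)

# Pair up artists who share a movie and count shared movies per pair.
duos <- credits %>%
  inner_join(credits, by = 'tconst', suffix = c('_a', '_b')) %>%
  filter(nconst_a < nconst_b) %>%          # keep each unordered pair once
  count(nconst_a, nconst_b, sort = TRUE)   # shared movies per pair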
Overall, this paper is intended to document the end-to-end analysis process, including data cleaning, analysis and visualization.
Throughout the analysis, I use data made available by IMDB, documented under this link and published under this link. I use the following tables: title.basics (titles), title.principals (principal cast) and name.basics (artist names).
The principal cast table contains information on the top 10 billed names only; data on the full cast needs to be licensed and accessed via API, or scraped. One could argue this makes the analysis less than comprehensive. However, as we will soon see, given the number of movies in the database, calculating association rules with more than 10 names per movie would not be feasible in a home setting anyway.
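A rough back-of-the-envelope check of why the top-10 limit is actually convenient, assuming, simplistically, that mining cost scales with the number of candidate itemsets a transaction can generate:

# A transaction with k items can contribute up to 2^k - 1 non-empty itemsets.
itemsets_per_movie <- function(k) 2^k - 1
itemsets_per_movie(10)   # 1,023 candidates with top-10 billing
itemsets_per_movie(50)   # ~1.1e15 with a hypothetical full cast of 50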
I start off by loading the datasets. The Titles, Principals and Names tables hold 8.5m, 47.8m and 11.3m records respectively, which poses a challenge for base and tidyverse read functions: they are too slow. I use the fread function from the data.table package instead. For EDA purposes I will at times work with 20% random samples of the records to shorten running times; for the final run I will switch to the full dataset.
library(tidyverse)

# Read the titles table; IMDB encodes missing values as '\N'.
basics <- data.table::fread(
  'title.basics.tsv',
  quote = '',
  na.strings = '\\N',
  encoding = 'UTF-8',
  colClasses = c(
    startYear = 'double',
    runtimeMinutes = 'double'))

# Draw a reproducible 20% sample for EDA and cache it on disk.
set.seed(456)
basics_sample <- slice_sample(basics, prop = 0.2)
save(basics_sample, file = 'basics_sample.RData')
load('basics_sample.RData')
I first investigate the title type distribution. It turns out that 75% of IMDB titles are actually TV episodes. This is significant for my analysis: TV series generally feature the same core cast over multiple episodes or even seasons, which means that rule mining would mostly find the longest-running TV series instead of interesting relationships in the movie world. This is why in the next step I narrow the dataset down to movies only.
basics_sample %>%
  group_by(titleType) %>%
  summarise(n = n()) %>%
  mutate(freq = n/sum(n)) %>%
  ggplot(aes(x = fct_reorder(titleType, freq),
             y = freq)) +
  geom_col(col = 'black', fill = 'coral') +
  coord_flip() +
  geom_text(aes(label = paste0(round(freq*100, 2), '%')), nudge_y = 0.05) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.3))) +
  labs(title = 'Title type proportion',
       x = 'Title type',
       y = 'Share of titles') +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Since the title type I am interested in makes up only 7% of the original dataset, it is safe to switch back to the full dataset, narrowed down to the movie type.
movies <- filter(basics, titleType == 'movie', startYear >= 1920, startYear <= 2021)
rm(basics)
rm(basics_sample)
save(movies, file='movies.RData')
Interestingly, the data is heavily skewed towards movies released after the year 2000. It should be noted that IMDB has existed only since 1990 and was acquired by Amazon in 1998, which coincided with an acceleration in the growth of recorded releases. It is not possible to tell from this data alone whether the trend reflects a boom in the global movie industry, better bookkeeping, or localized explosions in output from the very prolific Nigerian and Indian film industries, known as Nollywood and Bollywood. Nonetheless, it is interesting to see how the pandemic impacted the film industry, which shrank by 15% in terms of the number of releases in 2020 and has still not fully recovered.
load('movies.RData')
nrow(movies)
## [1] 500573
movies %>%
  group_by(startYear) %>%
  summarise(n = n()) %>%
  drop_na() %>%
  ggplot(aes(x = startYear,
             y = n)) +
  labs(title = 'Movies released by year',
       x = 'Release year',
       y = 'Movies released') +
  geom_line(size = 1, col = 'brown2') +
  scale_y_continuous(labels = function(x) format(x, big.mark = ',')) +
  theme_bw()
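The 15% figure can be sanity-checked directly; a quick sketch using the movies table loaded above:

# Year-over-year change in the number of releases around the pandemic.
movies %>%
  count(startYear) %>%
  filter(startYear >= 2018) %>%
  mutate(yoy_change = n / lag(n) - 1)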
I add the information about the principal cast to the movies. The principal cast table has 47.8m records and needs to be removed after the join to conserve memory.
princ <- data.table::fread(
  'title.principals.tsv',
  na.strings = '\\N',
  encoding = 'UTF-8',
  quote = '')

# Keep only principal cast records belonging to the filtered movies.
movies_princ <- princ %>%
  inner_join(select(movies, tconst, titleType, primaryTitle, startYear),
             by = c('tconst'))
rm(princ)   # 47.8m records; free memory as soon as possible
save(movies_princ, file = 'movies_princ.RData')
Noticeably, directors, writers and producers seem to be overrepresented among the cast. This is, of course, an effect of limiting the analysis to the principal cast, where these roles are more likely to be featured among the top 10 billed names. It is thus crucial not to limit the analysis to actors only, as this would put a hard cap on the maximal itemset size that can be found via association mining.
load('movies_princ.RData')
nrow(movies_princ)
## [1] 3813954
movies_princ %>%
  group_by(category) %>%
  summarise(n = n()) %>%
  mutate(freq = paste0(round(n/sum(n)*100, 0), '%')) %>%
  arrange(desc(n))
## # A tibble: 12 x 3
## category n freq
## <chr> <int> <chr>
## 1 actor 1061108 28%
## 2 actress 643069 17%
## 3 director 492261 13%
## 4 writer 375924 10%
## 5 producer 311756 8%
## 6 cinematographer 298464 8%
## 7 composer 268598 7%
## 8 editor 175476 5%
## 9 self 141618 4%
## 10 production_designer 32938 1%
## 11 archive_footage 12689 0%
## 12 archive_sound 53 0%
Another noticeable feature is that some movies have fewer than 10 principal cast members. These are likely to be niche or low-budget movies. The average number of names per title is also decreasing over time, which suggests an increasing inflow of such movies, perhaps driven by access to hardware and the decreasing cost of amateur filmmaking.
In any case, associations produced by these movies will not be as interesting, so for the sake of efficiency I drop the movies with nine or fewer principal cast members. This cuts the number of movies to be analyzed by half.
names_per_movie <- movies_princ %>%
  group_by(tconst) %>%
  summarise(n_names = n())

names_per_movie %>%
  ggplot(aes(x = n_names)) +
  geom_histogram(bins = 10, col = 'black', fill = 'cornflowerblue') +
  labs(title = 'Distribution of names per title') +
  theme_bw()

# Average number of principal names per title by release year.
mprinc_avgs <- movies_princ %>%
  group_by(startYear, tconst) %>%
  summarise(n_names = n()) %>%
  summarise(avg_names = mean(n_names))

mprinc_avgs %>%
  ggplot(aes(x = startYear, y = avg_names)) +
  geom_line(col = 'brown2') +
  labs(title = 'Avg. number of names per title over time') +
  theme_bw()

# Keep only movies with a full set of 10 principal names.
movies <- movies %>%
  inner_join(select(filter(names_per_movie, n_names == 10), tconst), by = c('tconst'))
Switching to the analysis of artists, I notice that roughly two thirds of names are featured in the principal cast of only one movie (these names could of course still appear in the wider cast). Such names are guaranteed not to feature in any association, since any itemset containing them has a support count of at most one, so I can drop them safely. I also drop all names featured in the principal cast of only two movies, as overall I am more interested in rules involving three or more names.
movies_per_name <- movies_princ %>%
  group_by(nconst) %>%
  summarise(n = n())

movies_per_name %>%
  mutate(n_bin = cut(n, breaks = c(1:10, Inf), right = F)) %>%
  group_by(n_bin) %>%
  summarise(n = n()) %>%
  mutate(freq = format(n/sum(n), digits = 1))
## # A tibble: 10 x 3
## n_bin n freq
## <fct> <int> <chr>
## 1 [1,2) 877480 0.665
## 2 [2,3) 174208 0.132
## 3 [3,4) 76513 0.058
## 4 [4,5) 43827 0.033
## 5 [5,6) 27994 0.021
## 6 [6,7) 19707 0.015
## 7 [7,8) 14443 0.011
## 8 [8,9) 11089 0.008
## 9 [9,10) 8640 0.007
## 10 [10,Inf) 65233 0.049
names_universe <- movies_per_name %>%
  filter(n > 2) %>%
  select(nconst)
The last step is to drop all names featured in a title with fewer than 10 principal cast members, and then drop all titles where the entire principal cast featured in one or no other movies. This leaves me with 248k names featured in 240k movies, a considerable improvement over the initial 11.3m names featured in 8.5m titles.
movies_princ <- semi_join(movies_princ, names_universe, by = c('nconst')) %>%
  semi_join(movies, by = c('tconst'))
movies <- semi_join(movies, movies_princ, by = c('tconst'))
nrow(movies)
## [1] 239957
length(unique(movies_princ$nconst))
## [1] 247692
Finally, I convert the nconst identifiers into actual names, so it’s easier to interpret the association mining results.
names <- data.table::fread(
  'name.basics.tsv',
  na.strings = '\\N',
  quote = '',
  encoding = 'UTF-8')

movies_trans <- movies_princ %>%
  select(tconst, nconst) %>%
  left_join(select(movies, tconst, primaryTitle), by = c('tconst')) %>%
  # Prefix titles with the tconst ID to keep transaction labels unique.
  mutate(title = paste(tconst, str_trunc(primaryTitle, 15))) %>%
  left_join(select(names, nconst, primaryName), by = c('nconst')) %>%
  # Prefix names with the last two digits of nconst to disambiguate
  # artists who share the same name.
  mutate(name = paste(substr(nconst, 8, 9), str_trunc(primaryName, 20)))
save(movies_trans, file = 'movies_trans.RData')
For the purposes of rule mining, every movie will be considered a transaction and every name will be considered an item. I am interested in finding ‘cliques’ with the longest possible rules (counting both lhs and rhs), and then analyzing those rules within the context of the film industry.
I conduct the analysis with the help of An R Companion for Introduction to Data Mining (Hahsler, 2021).
I start out by creating the transaction objects. This includes an object for a small sample of transactions, which I will use to show how most standard visualizations fail when a larger-scale problem is tackled.
load('movies_trans.RData')
trans <- arules::transactions(movies_trans, format = 'long', cols = c('title', 'name'))
trans_sample <- arules::transactions(slice_sample(movies_trans, prop = 0.05),
                                     format = 'long', cols = c('title', 'name'))
summary(trans)
summary(trans)
## Length Class Mode
## 239957 transactions S4
The binary matrix is uninformative for a problem of this scale: even for a 5% sample it essentially looks like white noise, and the time required to plot it for the full dataset is prohibitive.
arules::image(trans_sample)
Just looking at the most frequent items, i.e. the names found in principal casts most often, yields a few interesting insights (supplemented with information from the IMDB website):
arules::itemFrequencyPlot(trans, topN = 20)
Investigating the support levels across the entire dataset, we can immediately see that choosing the right level of support will be an inconvenience: there is a small group of very prolific artists, but 85% of artists have been featured in the principal cast of fewer than 10 titles.
supp_item <- tibble(
  support = sort(arules::itemFrequency(trans, type = "absolute"), decreasing = TRUE),
  item = seq_len(ncol(trans)))

p1 <- slice_sample(supp_item, prop = 0.1) %>%
  ggplot(aes(item, support)) +
  geom_line(col = 'blue') +
  theme_bw()

p2 <- supp_item %>%
  ggplot(aes(support)) +
  stat_ecdf(geom = 'step', col = 'blue') +
  labs(title = 'Zoomed in ECDF',
       y = 'Cumulative freq') +
  coord_cartesian(xlim = c(0, 50)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  theme_bw()

gridExtra::grid.arrange(p1, p2, ncol = 2)
This is where the difference versus the typical use case for transactional data becomes visible. Unlike in basket analysis, 85% of items (i.e. artists) have a very low ceiling on how many transactions (i.e. movies) they feature in, which is best expressed in absolute numbers rather than relative frequencies. This is somewhat inconvenient, since I need to provide support in relative terms to the apriori function. I decide on a minimum support of 19 movies, which precludes some of the franchises stated in the hypothesis, but it seems there are just too many prolific artists.
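Converting the absolute threshold into the relative support expected by apriori is a simple division; length() on a transactions object returns the number of transactions:

min_count <- 19
min_support <- min_count / length(trans)
min_support   # 19 / 239,957 is roughly 7.9e-05; I use 0.00008 below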
Looking at the most frequent itemsets, we can see an interesting group of Japanese artists:
All of them were very active, each with 50-150 titles to their name between 1960 and 2000. In this specific combination, they starred in 32 movies together.
The other most frequent ‘clique’ of actors, sharing 30 movies, is a group of early American movie artists active between 1930 and 1950 (not shown in the output, since it is obscured by combinations involving other Japanese artists from the first group):
its <- arules::apriori(trans,
                       parameter = list(target = 'frequent', support = 0.00008),
                       control = list(verbose = FALSE))

ggplot(tibble(`Itemset size` = factor(arules::size(its))), aes(`Itemset size`)) +
  geom_bar(fill = 'lightblue', col = 'black') +
  theme_bw()

arules::inspect(its[arules::size(its) > 5])
## items support count
## [1] {10 Kiyoshi Atsumi,
## 35 Yoshitaka Asama,
## 62 Chieko Baishô,
## 62 Naozumi Yamamoto,
## 82 Yôji Yamada,
## 90 Tetsuo Takaha} 0.0001333572 32
rules <- arules::apriori(trans,
                         parameter = list(support = 0.00008, confidence = 0.5),
                         control = list(verbose = FALSE))
It is worth noting that analyzing movies rather than typical transactional data requires adjusting the approach to interestingness measures. The most commonly used measure, lift, incorporates so-called null-transactions, in this case movies in which neither A nor B starred. Since most artists are featured in no more than 10 movies while the total number of movies released is relatively large, co-starring in even a single movie generates high lift values. In this case the Kulczynski measure, defined as the average of the two rule confidences, (P(B|A) + P(A|B)) / 2, is preferable, as it is not affected by a high proportion of null-transactions (Wu et al., 2007).
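A toy illustration with hypothetical counts shows why null-transactions do not matter here:

n_a  <- 30   # movies featuring artist A (hypothetical counts)
n_b  <- 25   # movies featuring artist B
n_ab <- 20   # movies featuring both
(n_ab / n_a + n_ab / n_b) / 2   # kulc = 0.73, independent of the total movie count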
Additionally, most rules involve only two names (e.g. the Coen brothers), so I restrict the analyzed rules to those where the total rule size (lhs + rhs) exceeds three names, both to limit their number and to identify more interesting rules.
This helped identify another interesting trio: Arthur Lake, Penny Singleton and Larry Simms, who together brought the Blondie comic strip to life in the 1940s. Interestingly, by inspecting the rules we see that Arthur Lake features on the lhs of all rules with a high Kulczynski value for this trio, but never on the rhs. This implies that while Penny Singleton and Larry Simms were mostly known (or ‘typecast’) as characters from the Blondie movies, Arthur Lake went on to star in multiple other movies. This turns out to be true! Larry Simms was mostly known as a child actor and Penny Singleton's filmography is mostly Blondie-related, while in Arthur Lake's case Blondie comprises less than half of the movies he starred in.
library(arules)

# Keep only rules with more than 3 names in total (lhs + rhs).
rules_sub <- subset(rules, subset = (size(lhs) + size(rhs) > 3))
# Keep an identical copy with the default quality measures for the
# interactive scatter plot further below.
rules_sub2 <- subset(rules, subset = (size(lhs) + size(rhs) > 3))

quality(rules_sub) <- cbind(
  interestMeasure(rules_sub,
                  measure = c('count', 'lift', 'support', 'kulczynski'),
                  trans = trans),
  'size' = size(rules_sub))

inspect(head(rules_sub, by = 'kulczynski', decreasing = TRUE, n = 5))
## lhs rhs count lift support kulczynski size
## [1] {21 Arthur Lake,
## 25 Penny Singleton,
## 82 Larry Simms} => {82 Chic Young} 28 8569.893 0.0001166876 1.0000000 4
## [2] {21 Arthur Lake,
## 25 Penny Singleton,
## 82 Chic Young} => {82 Larry Simms} 28 8569.893 0.0001166876 1.0000000 4
## [3] {21 Arthur Lake,
## 82 Chic Young,
## 82 Larry Simms} => {25 Penny Singleton} 28 7498.656 0.0001166876 0.9375000 4
## [4] {10 Gene Autry,
## 85 Armand Schaefer,
## 99 William Bradford} => {95 Champion} 25 8240.282 0.0001041853 0.9271978 4
## [5] {54 Sakae Nitta,
## 66 Shôji Sakai,
## 82 Teruo Oka} => {23 Yukio Chiba} 52 3242.662 0.0002167055 0.8513514 4
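The lhs/rhs asymmetry mentioned earlier can also be verified programmatically. A sketch, assuming the item label follows the '21 Arthur Lake' format from the output above:

# On which side of the top high-Kulczynski rules does Arthur Lake appear?
top_rules <- head(rules_sub, by = 'kulczynski', n = 20)
length(subset(top_rules, subset = lhs %in% '21 Arthur Lake'))
length(subset(top_rules, subset = rhs %in% '21 Arthur Lake'))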
The standard scatter plot for visualizing association rules is of limited use for movie data: due to the low absolute supports, most rules are clustered around a confidence of 1, and it is difficult to discern artist ‘cliques’.
library(arulesViz)
plot(rules_sub2, engine = 'html')
The parallel coordinates plot is also of limited use, primarily because of its difficult interpretation and the limited number of items and rules that can fit on a single chart. It can nonetheless be useful in compressing the association information, and it allows one to focus on artists rather than on specific rules. Sorting by size or by association measures and truncating the number of analyzed rules is necessary to retain legibility. On the chart below we can immediately see that there is only one ‘clique’ among the 20 largest rules.
plot(head(rules_sub, by='size', n=20), method = 'paracoord')
A graph is the perfect medium for mapping association rules between artists: it allows one to quickly identify groups of artists who often feature together in movies, explore their personal details, and only then dive into the specific relationships connecting the individuals. All of the artist groups mentioned in earlier paragraphs (i.e. found via manual inspection of rules or other visualization methods) can be easily identified on the graph, either by zooming in on specific clusters or by hovering over the labels. The direction and color of the arrows can be interpreted as the relative diversity of each pair of artists' collaborators. It is definitely the most efficient way of analyzing the associations.
plot(head(rules_sub, by='kulczynski', n=100), method = 'graph', engine = 'html')
In summary, throughout this paper I found the following:
- Association rule mining can identify artist ‘cliques’ at scale, once TV episodes, short-cast movies and rarely featured names are pruned from the data.
- The most frequent ‘cliques’ are a group of Japanese artists around Yôji Yamada (32 shared movies) and a group of early American movie artists active between 1930 and 1950 (30 shared movies); rule inspection also surfaced the Blondie trio of Arthur Lake, Penny Singleton and Larry Simms.
- Absolute support counts and null-transaction-robust measures such as Kulczynski suit movie data better than relative support and lift.
- Among the visualization methods, the interactive graph proved by far the most effective for exploring the associations.