In this post, I use IMDb data to construct a sample of first-time movie directors and collect data on the movies they worked on throughout their careers.
This data is created for a study in my dissertation that explores whether working with highly central people in the network predicts the career survival of first-time movie directors, or more specifically, the number of movies directors go on to make after their first movie. To this end, I first construct the career histories of movie directors who made their first movie between 1980 and 2010, using data downloaded from https://datasets.imdbws.com/. IMDb maintains one of the most complete databases on film-makers worldwide. It contains information on film and TV productions, and the people who worked on them, from the late 1800s to the present.
I choose 1980 as the start of my sample of directors because this time marks the transition of the Hollywood film industry from in-house productions to projects and personal networks. As such, movies made after 1980 are more representative of the film industry today. I choose 2010 as the end of the sample because it allows me to observe the first 10 years of every director’s career.
First, let’s read the IMDb data on movie information and see what years we have data on.
# load the packages used throughout this post
library(tidyverse)
library(knitr)
# read data on movies that have information on year and genres
title_basics_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_basics_2.rds")
# load data on people working on movies
title_principals_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_principals_2.rds")
# get the functions I have written to create tables with a scroll box
source("functions.R")
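The range of years covered can be checked with a short summary. The sketch below uses a toy table, `toy_basics`, as a stand-in for `title_basics_2`. The rows are made up, and it assumes that year is stored as text, which the `as.numeric()` conversions elsewhere in this post suggest.

```r
library(dplyr)

# Toy stand-in for title_basics_2 (hypothetical rows; year stored as text)
toy_basics <- tibble(
  movie_id = c("tt0000001", "tt0000002", "tt0000003"),
  year     = c("1896", "1993", "2028")
)

# Convert year to numeric, then take the earliest and latest values
year_range <- toy_basics %>%
  summarise(earliest = min(as.numeric(year), na.rm = TRUE),
            latest   = max(as.numeric(year), na.rm = TRUE))

year_range
```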
The earliest year we have data on is 1896 and the latest is 2028. It seems that IMDb also lists movies that have not been released yet. Let’s graph the number of movies in the database across time.
# graph the number of movies over time
title_basics_2 %>%
  # convert year to numeric
  mutate(year = as.numeric(year)) %>%
  # count the number of movies made in each year
  count(year, sort = TRUE) %>%
  # create a bar chart with year on the x axis and number of movies on the y axis
  ggplot(aes(year, n)) +
  geom_col(fill = "#8BD8BD") +
  labs(y = "Number of movies",
       x = NULL,
       title = "How many movies were made each year?",
       subtitle = "A lot more movies were made in the 21st century than in previous centuries")
Next, we retain only the movies made between 1980 and 2010. Let’s take a look at the first 20 rows of the data to see what the data looks like now.
# get the movies made between 1980 and 2010
title_basics_2 %>% filter(between(year, 1980, 2010)) -> movie
movie %>% head(20) %>% kbl_2()
| movie_id | year | genres |
|---|---|---|
| tt0015724 | 1993 | Drama,Mystery,Romance |
| tt0035423 | 2001 | Comedy,Fantasy,Romance |
| tt0036606 | 1983 | Drama,War |
| tt0038687 | 1980 | Documentary,War |
| tt0057461 | 1983 | Drama,Fantasy |
| tt0059325 | 1990 | Drama,Romance |
| tt0059900 | 1990 | Drama,Fantasy |
| tt0062181 | 1981 | Drama |
| tt0064820 | 1989 | Comedy |
| tt0065108 | 1989 | Drama,Romance |
| tt0065188 | 1990 | Drama |
| tt0065530 | 1983 | Drama |
| tt0066020 | 1986 | Action,Comedy |
| tt0066151 | 1986 | Action,Adventure |
| tt0067100 | 1981 | Action,Drama,Thriller |
| tt0067460 | 1995 | Documentary,Sport |
| tt0067625 | 1986 | Drama,War |
| tt0067694 | 1987 | Drama,War |
| tt0067754 | 1981 | Documentary |
| tt0068494 | 1990 | Drama |
We now have a list of 146580 movies made between 1980 and 2010, with information on movie id (movie_id), the year the movie was released (year), and the movie’s genres (genres).
Our next step is to get the directors of these movies. To do this, we need to merge the IMDb dataset on titles and the dataset on principals, and exclude movies that do not have information on directors.
movie %>%
  # get the director of these movies by merging movie data with crew data using movie id
  left_join(title_principals_2 %>%
              # get directors of movies
              filter(category == "director") %>%
              # choose movie id and person id
              select(movie_id, person_id),
            # merge with movie id as key
            by = "movie_id") %>%
  rename(director_id = person_id) %>%
  # remove movies without info on director
  filter(!is.na(director_id)) -> movie
This gives us 146973 director-movie pairs, made up of 72305 distinct directors and 132743 distinct movies. There are more pairs than movies because some movies had more than one director.
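Counts like these can be obtained with `n()` and `n_distinct()`. A minimal sketch on a made-up table, `toy_pairs`, standing in for the director-movie data (the ids are hypothetical):

```r
library(dplyr)

# Toy director-movie pairs: movie tt1 has two directors, director nm1 has two movies
toy_pairs <- tibble(
  movie_id    = c("tt1", "tt1", "tt2", "tt3"),
  director_id = c("nm1", "nm2", "nm1", "nm3")
)

# Number of pairs, distinct directors, and distinct movies
pair_summary <- toy_pairs %>%
  summarise(n_pairs     = n(),
            n_directors = n_distinct(director_id),
            n_movies    = n_distinct(movie_id))

pair_summary
```

Here the toy table has four pairs but only three distinct directors and three distinct movies, which mirrors the gap between pairs and movies in the real data.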
Now that we have a list of directors who made at least one movie between 1980 and 2010, we need to get the subset of directors whose first movie was made within this period. To do this, we first get all the movies of the directors in the list.
movie %>%
  # get distinct director id
  distinct(director_id) %>%
  # get all work directed by these directors by merging by person id with crew info
  left_join(title_principals_2 %>%
              filter(category == "director") %>%
              select(movie_id, person_id),
            by = c("director_id" = "person_id")) -> movie
# get the release year of directors' movies by merging by movie id with movie info
movie %>%
  left_join(title_basics_2, by = "movie_id") %>%
  # remove movies of directors that do not have release year
  filter(!is.na(year)) -> movie
Next, we get the first movies of the directors in our list and then only choose the directors whose first movie was made between 1980 and 2010.
movie %>%
  # group by director id
  group_by(director_id) %>%
  # get the movie with earliest year within each director id group
  slice_min(year) %>%
  ungroup() %>%
  # get the movies made between 1980 and 2010
  filter(between(year, 1980, 2010)) -> movie
This gives us a sample of 68281 director-movie pair, made up of 66647 directors directing 60779 movies, with some first-time directors directed more than one movie, and some movies were directed by more than one first-time director.
Let’s see how many directors made their first movie in each year within our observation period.
movie %>%
  distinct(director_id, year) %>%
  # count the number of directors in each year
  count(year, sort = TRUE) %>%
  # create a bar chart with year on the x axis and number of directors on the y axis
  ggplot(aes(factor(year), n, fill = year)) +
  geom_col(show.legend = FALSE) +
  labs(y = "Number of directors",
       x = NULL,
       title = "How many directors made their first movie in each year?") +
  scale_x_discrete(breaks = c(1980, 1985, 1990, 1995, 2000, 2005, 2010))
How many movies did a director typically make in the year they first directed?
movie %>%
  count(director_id, sort = TRUE) %>%
  count(n) %>%
  rename(number_movie = n,
         n = nn) %>%
  kable()
| number_movie | n |
|---|---|
| 1 | 65172 |
| 2 | 1362 |
| 3 | 85 |
| 4 | 18 |
| 5 | 6 |
| 6 | 1 |
| 7 | 2 |
| 8 | 1 |
It looks like most directors directed one movie in their debut year, but a small group of directors made more than one movie the year they started directing. Let’s create a variable for the number of movies a director made the year they started their directing career.
movie %>%
  left_join(movie %>%
              # count the number of movies a first-time director made
              count(director_id) %>%
              rename(number_first_movie = n),
            by = "director_id") -> movie
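The same variable could also be created without a self-join, using `add_count()`. This is an alternative sketch rather than the approach used above, shown on a made-up table `toy_first`:

```r
library(dplyr)

# Toy table of debut movies: nm1 has two debut movies, nm2 has one
toy_first <- tibble(
  director_id = c("nm1", "nm1", "nm2"),
  movie_id    = c("tt1", "tt2", "tt3")
)

# add_count() appends the per-director row count as a new column
toy_first <- toy_first %>%
  add_count(director_id, name = "number_first_movie")

toy_first
```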
Now that we have our list of directors who made their first movies between 1980 and 2010, we can move on to tracing how the careers of these directors turn out. Specifically, let’s gather information on whether they go on to make other movies in the first 10 years of their career. First, let’s get all the movies a director in our list has been involved in and see what the first 20 rows of the data look like.
movie %>%
  rename(first_movie_id = movie_id,
         first_movie_year = year,
         first_movie_genres = genres) %>%
  # merge with data on crew information to find all the projects a director has been involved in
  left_join(title_principals_2,
            by = c("director_id" = "person_id")) -> movie
movie %>%
  # merge with data on movie information
  left_join(title_basics_2, by = "movie_id") %>%
  # remove the projects that are not in the movie data (and thus are not movies, but for example are tv episodes)
  filter(!is.na(year)) %>%
  select(-genres) -> movie
movie %>% head(20) %>% kbl_2()
| director_id | first_movie_id | first_movie_year | first_movie_genres | number_first_movie | movie_id | category | year |
|---|---|---|---|---|---|---|---|
| nm0000083 | tt0969216 | 2007 | Documentary,Family | 1 | tt0170560 | editor | 1998 |
| nm0000083 | tt0969216 | 2007 | Documentary,Family | 1 | tt0424773 | writer | 2002 |
| nm0000083 | tt0969216 | 2007 | Documentary,Family | 1 | tt0969216 | director | 2007 |
| nm0000083 | tt0969216 | 2007 | Documentary,Family | 1 | tt11192552 | writer | 2019 |
| nm0000083 | tt0969216 | 2007 | Documentary,Family | 1 | tt12979838 | editor | 2009 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0084490 | actor | 1982 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0086293 | actor | 1984 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0088443 | actor | 1984 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0088954 | actor | 1985 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0089946 | actor | 1985 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0090668 | actor | 1987 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0091495 | actor | 1986 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0091805 | actor | 1986 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0093412 | actor | 1987 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0093747 | actor | 1988 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0094705 | actor | 1989 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0094822 | actor | 1988 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0095675 | actor | 1988 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0096938 | actor | 1989 |
| nm0000104 | tt0142201 | 1999 | Comedy,Crime,Drama | 1 | tt0098324 | actor | 1989 |
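Each row is now a director-movie observation, containing information on a movie a director has been involved in, regardless of their role in the movie. For each director, there is information on the director’s id (director_id), the id of their first movie (first_movie_id), the year their first movie was released (first_movie_year), the genres of their first movie (first_movie_genres), the number of movies they directed in their first year (number_first_movie), the id of each movie they were involved in (movie_id), their role in that movie (category), and the year that movie was released (year).
For the movies the directors in our list have been involved in, we can also add information on their IMDb ratings and number of votes on IMDb.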
title_ratings <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_ratings.rds")
movie %>%
  # merge movie data with rating data for each movie in a director's career
  left_join(title_ratings, by = c("movie_id" = "tconst")) %>%
  rename(rating = averageRating,
         number_vote = numVotes) %>%
  # merge again to get the rating data of the director's first movie
  left_join(title_ratings, by = c("first_movie_id" = "tconst")) %>%
  rename(first_movie_rating = averageRating,
         first_movie_number_vote = numVotes) -> movie
Let’s graph the distribution of ratings for movies made by first-time directors over time. To do this, we will group the directors based on the decade when they made their first movie - 1980s, 1990s, 2000s, 2010s.
movie %>%
  filter(first_movie_id == movie_id) %>%
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = paste0(decade, "s")) %>%
  ggplot(aes(decade, first_movie_rating, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Movie ratings", title = "Distribution of ratings among movies made by first-time directors") +
  theme(legend.position = "none")
It looks like movies made by first-time directors in the 2000s and 2010s have slightly higher ratings on average compared to movies made by first-time directors in the 1980s and 1990s.
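The decade label is built the same way in several of the plots in this post, so it could be wrapped in a small helper. A minimal sketch, where `decade_label` is a hypothetical name and not part of the original code:

```r
# Hypothetical helper: turn a release year (numeric or text) into a label such as "1990s"
decade_label <- function(year) {
  # integer-divide by 10 to drop the last digit, then scale back up to the decade
  paste0((as.numeric(year) %/% 10) * 10, "s")
}

decade_label(c("1980", "1999", "2007", "2010"))
```

With such a helper, the `mutate()` step in each plot would reduce to `mutate(decade = decade_label(year))`.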
Let’s also graph the distribution of votes (reflecting popularity) for movies made by first-time directors over time.
# turn off scientific notation
options(scipen = 999)
movie %>%
  filter(first_movie_id == movie_id) %>%
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = paste0(decade, "s")) %>%
  ggplot(aes(decade, first_movie_number_vote, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Number of votes (logged)", title = "Distribution of votes among movies made by first-time directors") +
  theme(legend.position = "none") +
  # log transform the y axis for a clearer picture
  scale_y_log10()
Some first-time directors might have worked on other movies before they started directing, where they took on creative non-directing roles (producer, writer, editor, cinematographer, production designer, and composer). This prior experience might influence their career survival as a director. Therefore, we will create two variables to reflect a first-time director’s prior work experience, including whether they have worked on other movies in creative non-directing roles before they directed their first movie, and if so, how many movies and which role they took on in these movies.
movie %>%
  # get the movies a director worked on before they directed,
  # where they took on creative non-directing roles
  filter(year < first_movie_year &
           category %in% c("producer", "writer", "editor",
                           "cinematographer", "production_designer", "composer")) %>%
  # calculate the number of times a director has worked in a particular creative role
  fastDummies::dummy_cols(select_columns = "category") %>%
  group_by(director_id) %>%
  summarise(across(starts_with("category_"), sum)) %>%
  # calculate the number of movies a director worked on before they directed,
  # where they took on creative non-directing roles
  rowwise(director_id) %>%
  mutate(number_previous_movie = sum(c_across(where(is.numeric)))) %>%
  ungroup() %>%
  # create dummy variables reflecting whether a director has worked in a creative non-directing role or not
  mutate(across(starts_with("category_"), ~ ifelse(. > 0, 1, .))) -> prior_experience
movie %>%
  # merge movie data with prior experience data
  left_join(prior_experience, by = "director_id") %>%
  # for directors who have not worked in creative non-directing roles before they directed a movie,
  # code all variables on prior experience as 0
  mutate(across(category_cinematographer:number_previous_movie, ~ ifelse(is.na(.), 0, .))) -> movie
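To make the logic of the prior-experience variables easier to follow, here is a sketch on a made-up credits table. It swaps in `count()` and `pivot_wider()` for the `fastDummies` step, so it is an alternative illustration rather than the code used above. The table `toy_credits` and its ids are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy credits: nm1 wrote twice and produced once before directing in 1995;
# nm3's editing credit comes after their first movie, so it does not count
toy_credits <- tibble(
  director_id      = c("nm1", "nm1", "nm1", "nm2", "nm3"),
  category         = c("writer", "writer", "producer", "director", "editor"),
  year             = c(1990, 1992, 1993, 2000, 2005),
  first_movie_year = c(1995, 1995, 1995, 2000, 2001)
)

creative_roles <- c("producer", "writer", "editor",
                    "cinematographer", "production_designer", "composer")

toy_experience <- toy_credits %>%
  # keep creative non-directing credits earned before the first directed movie
  filter(year < first_movie_year, category %in% creative_roles) %>%
  # number of credits per director and role
  count(director_id, category) %>%
  # total number of prior credits per director
  group_by(director_id) %>%
  mutate(number_previous_movie = sum(n)) %>%
  ungroup() %>%
  # turn the per-role counts into 0/1 indicators, one column per role
  mutate(n = 1L) %>%
  pivot_wider(names_from = category, values_from = n,
              names_prefix = "category_", values_fill = 0L)

toy_experience
```

One difference to keep in mind: this version only creates columns for roles that actually occur in the data, and directors without any prior credits do not appear until the result is joined back and the missing values are set to 0.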
Once we do that, let’s see how many first-time directors in our list have had prior experience working in creative non-directing roles.
movie %>%
  # get distinct pairs of director and number of previous movies
  distinct(director_id, number_previous_movie) %>%
  # count the number of directors with a certain number of previous movies
  count(number_previous_movie) %>%
  kbl_2()
| number_previous_movie | n |
|---|---|
| 0 | 57767 |
| 1 | 4673 |
| 2 | 1630 |
| 3 | 706 |
| 4 | 467 |
| 5 | 292 |
| 6 | 185 |
| 7 | 157 |
| 8 | 118 |
| 9 | 99 |
| 10 | 71 |
| 11 | 52 |
| 12 | 55 |
| 13 | 41 |
| 14 | 32 |
| 15 | 26 |
| 16 | 17 |
| 17 | 18 |
| 18 | 31 |
| 19 | 17 |
| 20 | 17 |
| 21 | 15 |
| 22 | 14 |
| 23 | 7 |
| 24 | 13 |
| 25 | 8 |
| 26 | 2 |
| 27 | 12 |
| 28 | 9 |
| 29 | 3 |
| 30 | 7 |
| 31 | 3 |
| 32 | 7 |
| 33 | 4 |
| 35 | 1 |
| 36 | 2 |
| 37 | 4 |
| 38 | 1 |
| 39 | 3 |
| 40 | 5 |
| 41 | 1 |
| 42 | 2 |
| 43 | 1 |
| 45 | 2 |
| 46 | 6 |
| 47 | 1 |
| 48 | 1 |
| 49 | 2 |
| 50 | 1 |
| 51 | 1 |
| 52 | 1 |
| 53 | 2 |
| 55 | 1 |
| 56 | 4 |
| 57 | 1 |
| 58 | 3 |
| 60 | 4 |
| 62 | 1 |
| 64 | 1 |
| 66 | 1 |
| 67 | 1 |
| 69 | 1 |
| 70 | 1 |
| 74 | 2 |
| 78 | 2 |
| 80 | 1 |
| 86 | 1 |
| 93 | 1 |
| 96 | 1 |
| 103 | 1 |
| 104 | 1 |
| 119 | 1 |
| 128 | 1 |
| 142 | 1 |
| 188 | 1 |
| 342 | 1 |
| 456 | 1 |
It appears that most first-time directors in our list have not worked in other creative roles in movies before they directed their first movie.
Among the first-time directors who did work in creative non-directing roles in past movies, what roles did they usually take on?
movie %>%
  # get dummy variables on creative roles
  select(director_id, starts_with("category_")) %>%
  # get distinct pairs of director and creative roles
  distinct() %>%
  # count the number of directors who have worked in certain creative roles
  summarise(across(starts_with("category_"), sum)) %>%
  # convert wide table to long table for plotting
  tibble::add_column(ID = 1) %>%
  pivot_longer(!ID, names_to = "role", values_to = "n") %>%
  # clean up variable names for plotting
  mutate(role = stringr::str_remove(role, "category_"),
         role = forcats::fct_reorder(role, -n)) %>%
  # create a bar chart with creative roles on the x axis,
  # number of directors on the y axis
  ggplot(aes(role, n, fill = role)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Number of directors",
       title = "How many directors worked in creative non-directing roles \n before they directed their first movie?")
It looks like most directors who have worked on other movies before they started directing were writers and producers.
Next we will count the number of movies each director went on to direct within the first 10 years of their career besides their first movie (i.e., their career survival).
movie %>%
  # get the movies directors in the list worked on in the director role
  filter(category == "director") %>%
  # convert release year to numeric
  mutate(across(c(first_movie_year, year), as.numeric)) %>%
  # get the movies each director directed within the first 10 years of their career
  filter(year <= first_movie_year + 10) %>%
  # count the number of movies each director made
  count(director_id) %>%
  rename(number_movie = n) -> number_movie
movie %>%
  # convert release year to numeric
  mutate(across(c(first_movie_year, year), as.numeric)) %>%
  # merge movie data with career survival data
  left_join(number_movie, by = "director_id") %>%
  # count the number of movies each director went on to direct
  # within the first 10 years of their career besides their first movie
  mutate(number_movie = number_movie - number_first_movie) -> movie
Let’s take a look at the number of movies the directors directed within the first 10 years of their career.
movie %>%
  # get distinct pairs of director and number of movies
  distinct(director_id, number_movie) %>%
  # count the number of directors with a certain number of movies
  count(number_movie, sort = TRUE) %>%
  kbl_2()
| number_movie | n |
|---|---|
| 0 | 39541 |
| 1 | 13023 |
| 2 | 6507 |
| 3 | 2986 |
| 4 | 1782 |
| 5 | 788 |
| 6 | 701 |
| 8 | 262 |
| 7 | 251 |
| 10 | 152 |
| 12 | 117 |
| 9 | 105 |
| 14 | 74 |
| 11 | 48 |
| 18 | 41 |
| 16 | 39 |
| 20 | 26 |
| 15 | 24 |
| 13 | 19 |
| 22 | 15 |
| 30 | 14 |
| 24 | 13 |
| 21 | 10 |
| 26 | 10 |
| 19 | 7 |
| 25 | 6 |
| 28 | 6 |
| 17 | 5 |
| 40 | 5 |
| 42 | 5 |
| 23 | 4 |
| 29 | 4 |
| 32 | 4 |
| 38 | 4 |
| 27 | 3 |
| 31 | 3 |
| 33 | 3 |
| 34 | 3 |
| 52 | 3 |
| 72 | 3 |
| 36 | 2 |
| 44 | 2 |
| 46 | 2 |
| 48 | 2 |
| 49 | 2 |
| 56 | 2 |
| 63 | 2 |
| 69 | 2 |
| 35 | 1 |
| 41 | 1 |
| 45 | 1 |
| 51 | 1 |
| 53 | 1 |
| 58 | 1 |
| 60 | 1 |
| 64 | 1 |
| 78 | 1 |
| 84 | 1 |
| 88 | 1 |
| 123 | 1 |
| 130 | 1 |
| 210 | 1 |
| 220 | 1 |
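In the table above, even counts such as 8, 10 and 12 appear more often than the odd counts next to them. One possible explanation, which has not been checked against the real data, is that directors with more than one debut movie have one row per debut movie, so joining their credits repeats each credit once per debut movie. The sketch below reproduces this on made-up data (`toy_first` and `toy_credits` are hypothetical) and shows how `distinct()` before counting would avoid it.

```r
library(dplyr)

# Hypothetical director nm1 with two debut movies (tt1, tt2) in 1990 and one later movie (tt3)
toy_first <- tibble(
  director_id        = c("nm1", "nm1"),
  first_movie_id     = c("tt1", "tt2"),
  first_movie_year   = c(1990, 1990),
  number_first_movie = c(2L, 2L)
)
toy_credits <- tibble(
  director_id = c("nm1", "nm1", "nm1"),
  movie_id    = c("tt1", "tt2", "tt3"),
  category    = "director",
  year        = c(1990, 1990, 1995)
)

# Each credit is repeated once per debut movie: 2 x 3 = 6 rows
# (recent dplyr versions may warn that this is a many-to-many join)
toy_movie <- left_join(toy_first, toy_credits, by = "director_id")

# Counting rows directly double-counts every directed movie
naive <- toy_movie %>%
  filter(category == "director", year <= first_movie_year + 10) %>%
  count(director_id, name = "number_movie")

# Counting distinct movies gives the number actually directed
deduplicated <- toy_movie %>%
  filter(category == "director", year <= first_movie_year + 10) %>%
  distinct(director_id, movie_id) %>%
  count(director_id, name = "number_movie")

# Movies directed after the debut year: naive vs. deduplicated
naive$number_movie - 2
deduplicated$number_movie - 2
```

In this toy case the director made one movie after their debut year, which is what the deduplicated count returns, while the naive count returns four.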
# create nested data where the detailed career history of each director is stored in a list column
movie %>%
  nest(career_history = movie_id:number_vote) -> movie
Let’s also graph the distribution of number of movies made by directors over time.
movie %>%
  mutate(decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = paste0(decade, "s")) %>%
  ggplot(aes(decade, number_movie, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL,
       y = "Number of movies",
       title = "Distribution of number of movies directors made \n in the first 10 years of their career") +
  theme(legend.position = "none")
This is a little hard to see. Let’s count the number of directors who made 0, 1, 2, and more than 2 movies after their first movie in the first 10 years of their career.
movie %>%
  mutate(number_movie_1 = ifelse(number_movie > 2, 3, number_movie),
         decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = paste0(decade, "s")) %>%
  count(decade, number_movie_1) %>%
  mutate(number_movie_1 = as.factor(number_movie_1)) %>%
  ggplot(aes(decade, n, fill = number_movie_1)) +
  geom_col(position = position_dodge(preserve = "single")) +
  scale_fill_manual(name = "Number of movies",
                    labels = c("0", "1", "2", "More than 2"),
                    values = c("#316879", "#f47a60", "#7fe7dc", "#fbcbc9")) +
  labs(x = NULL,
       y = "Number of directors",
       title = "How many directors go on to make more movies \n after their first movie?")
It looks like, across time, the majority of directors did not go on to direct another movie within 10 years of their first movie. Even among the directors who did, most only directed one or two more movies. This indicates a high rate of career failure among movie directors. As such, it is important to identify the factors that help (or hurt) the career survival of movie directors.
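Raw counts are hard to compare across decades of different sizes, so shares within each decade can help back up a statement like this. A minimal sketch on made-up data (`toy_directors` is hypothetical, and `pmin()` is used here to cap counts at 3, standing for "more than 2"):

```r
library(dplyr)

# Toy data: one row per director, with debut decade and number of later movies
toy_directors <- tibble(
  decade       = c("1980s", "1980s", "1980s", "1980s", "1990s", "1990s"),
  number_movie = c(0, 0, 1, 5, 0, 3)
)

toy_shares <- toy_directors %>%
  # cap the count at 3 so that 3 stands for "more than 2"
  mutate(movies_binned = pmin(number_movie, 3)) %>%
  count(decade, movies_binned) %>%
  # share of directors in each bin within a decade
  group_by(decade) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

toy_shares
```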
Part of my dissertation explores one such factor - the social network of the people a director worked with on their first movie. In other posts, I will construct the collaboration network of film-makers, calculate the network position of a director’s early collaborators, and test whether working with people who are central in the network increases a director’s chance of continuing their directing career.