In this post, I use IMDb data to construct a sample of first-time movie directors and collect data on the movies they worked on throughout their career.

I created this dataset for a study in my dissertation that explores whether working with highly central people in the network predicts the career survival of first-time movie directors, measured as the number of movies directors go on to make after their first movie. To this end, I first construct the career histories of movie directors who made their first movie between 1980 and 2010, using data I downloaded from https://datasets.imdbws.com/. IMDb has one of the most complete databases on film-makers worldwide. It contains information on film and TV productions, and the people who worked on them, from the late 1800s to the present.

I choose 1980 as the start of my sample of directors because this period marks the transition of the Hollywood film industry from in-house productions to projects and personal networks. As such, movies made after 1980 are more representative of the film industry today. I choose 2010 as the end of the sample because it allows me to observe the first 10 years of every director’s career.

Getting movies made between 1980 and 2010

First, let’s read the IMDb data on movie information and see what years we have data on.

# load the packages used throughout this post
library(tidyverse)
library(knitr)

# read data on movies that have information on year and genres
title_basics_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_basics_2.rds")

# load data on people working on movies
title_principals_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_principals_2.rds")

# get the functions I have written to create table with scroll box
source("functions.R")

The earliest year we have data on is 1896 and the latest year we have data on is 2028. Seems like IMDb also has info on movies that have not been finished yet. Let’s graph the number of movies in the database across time.

# graph the number of movies over time
title_basics_2 %>% 
  # convert year to numeric
  mutate(year = as.numeric(year)) %>% 
  # count the number of movies made in each year
  count(year, sort= T) %>% 
  # create a bar chart with year in the x axis and number of movies in the y axis
  ggplot(aes(year, n)) + 
  geom_col(fill = "#8BD8BD") +
  labs(y = "Number of movies", 
       x = NULL, 
       title = "How many movies were made each year?",
       subtitle = "A lot more movies are made in the 21st century than previous centuries")
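
The earliest and latest years quoted above can be checked with a short summary. This is a minimal sketch on a made-up table (`toy_titles` is hypothetical), assuming `year` is stored as text and needs converting, as in the plotting code:

```r
library(dplyr)

# hypothetical stand-in for title_basics_2, with year stored as text
toy_titles <- tibble(
  movie_id = c("tt1", "tt2", "tt3"),
  year     = c("1896", "2001", "2028")
)

toy_year_range <- toy_titles %>%
  # convert year to numeric before summarising
  mutate(year = as.numeric(year)) %>%
  # earliest and latest year, ignoring missing values
  summarise(earliest = min(year, na.rm = TRUE),
            latest   = max(year, na.rm = TRUE))

toy_year_range
```

Running the same `summarise()` on the real movie table should reproduce the range reported in the text.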

Next, we retain only the movies made between 1980 and 2010. Let’s take a look at the first 20 rows of the data to see what the data looks like now.

# get the movies made between 1980 and 2010
title_basics_2 %>% filter(between(year, 1980, 2010)) -> movie

movie %>% head(20) %>% kbl_2()


movie_id	year	genres
tt0015724	1993	Drama,Mystery,Romance
tt0035423	2001	Comedy,Fantasy,Romance
tt0036606	1983	Drama,War
tt0038687	1980	Documentary,War
tt0057461	1983	Drama,Fantasy
tt0059325	1990	Drama,Romance
tt0059900	1990	Drama,Fantasy
tt0062181	1981	Drama
tt0064820	1989	Comedy
tt0065108	1989	Drama,Romance
tt0065188	1990	Drama
tt0065530	1983	Drama
tt0066020	1986	Action,Comedy
tt0066151	1986	Action,Adventure
tt0067100	1981	Action,Drama,Thriller
tt0067460	1995	Documentary,Sport
tt0067625	1986	Drama,War
tt0067694	1987	Drama,War
tt0067754	1981	Documentary
tt0068494	1990	Drama

We now have a list of 146580 movies made between 1980 and 2010, with information on movie id (movie_id), the year the movie was released (year), and the movie’s genres (genres).

Getting directors who made their first movie between 1980 and 2010

Our next step is to get the directors of these movies. To do this, we need to merge the IMDb dataset on titles and the dataset on principals, and exclude movies that do not have information on directors.

movie %>% 
  # get the director of these movies by merging movie data with crew data using movie id
  left_join(title_principals_2 %>% 
              # get directors of movies
              filter(category == "director") %>% 
              # choose movie id and person id
              select(movie_id, person_id), 
            # merge with movie id as key
            by = "movie_id") %>%
  rename(director_id = person_id) %>% 
  # remove movies without info on director
  filter(!is.na(director_id)) -> movie

This gives us 146973 director-movie pairs, made up of 72305 distinct directors and 132743 distinct movies. There are more pairs than movies because some movies had more than one director.
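
Counts like these can be obtained with `n()` and `n_distinct()`. The sketch below uses a made-up table (`toy_pairs` is hypothetical) in which one movie has two directors, so the number of pairs exceeds the number of movies:

```r
library(dplyr)

# hypothetical director-movie pairs; movie m1 has two directors
toy_pairs <- tibble(
  movie_id    = c("m1", "m1", "m2", "m3"),
  director_id = c("d1", "d2", "d1", "d3")
)

toy_summary <- toy_pairs %>%
  summarise(pairs     = n(),                      # rows = director-movie pairs
            directors = n_distinct(director_id),  # distinct directors
            movies    = n_distinct(movie_id))     # distinct movies

toy_summary
```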

Now that we have a list of directors who made at least one movie between 1980 and 2010, we need to get the subset of directors whose first movie was made within this period. To do this, we first get all the movies of the directors in the list.

movie %>% 
  # get distinct director id
  distinct(director_id) %>% 
  # get all work directed by these directors by merging by person id with crew info
  left_join(title_principals_2 %>% 
              filter(category == "director") %>% 
              select(movie_id, person_id), 
            by = c("director_id" = "person_id")) -> movie

# get the release year of directors' movies by merging by movie id with movie info
movie %>% 
  left_join(title_basics_2, by = "movie_id") %>% 
  # remove movies of directors that do not have release year
  filter(!is.na(year)) -> movie

Next, we get the first movies of the directors in our list and then only choose the directors whose first movie was made between 1980 and 2010.

movie %>% 
  # group by director id 
  group_by(director_id) %>% 
  # get the movie with earliest year within each director id group
  slice_min(year) %>%   
  ungroup() %>% 
  # get the movies made between 1980 and 2010
  filter(between(year, 1980, 2010)) -> movie

This gives us a sample of 68281 director-movie pairs, made up of 66647 directors and 60779 movies. The numbers differ because some first-time directors directed more than one movie in their first year, and some movies were directed by more than one first-time director.
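
Two details of this step are easy to miss: `slice_min()` keeps ties by default, and filtering on the first movie's year drops directors who debuted before 1980 even if they made movies inside the window. A minimal sketch on made-up data (`toy_films` is hypothetical):

```r
library(dplyr)

# hypothetical filmographies: d1 debuts with two movies in 1985,
# d2 debuted in 1979 and made another movie in 1995
toy_films <- tibble(
  director_id = c("d1", "d1", "d1", "d2", "d2"),
  movie_id    = c("m1", "m2", "m3", "m4", "m5"),
  year        = c(1985, 1985, 1990, 1979, 1995)
)

toy_first <- toy_films %>%
  group_by(director_id) %>%
  # with_ties = TRUE is the default, so both 1985 movies of d1 are kept
  slice_min(year) %>%
  ungroup() %>%
  # d2's earliest movie is from 1979, so d2 leaves the sample entirely
  filter(between(year, 1980, 2010))

toy_first
```

Here `toy_first` contains two rows, both for `d1`, which is why the real sample has more director-movie pairs than directors.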

Let’s see how many directors made their first movie in each year within our observation period.

movie %>% 
  distinct(director_id, year) %>% 
  # count the number of directors in each year
  count(year, sort= T) %>% 
  # create a bar chart with year in the x axis and number of director in the y axis
  ggplot(aes(factor(year), n, fill = year)) + 
  geom_col(show.legend = F) +
  labs(y = "Number of directors", 
       x = NULL, 
       title = "How many directors made their first movie in each year?") +
  scale_x_discrete(breaks = c(1980, 1985, 1990, 1995, 2000, 2005, 2010))

How many movies did directors typically make in the year they first directed?

movie %>% 
  count(director_id, sort = T) %>% 
  count(n) %>% 
  rename(number_movie = n,
         n = nn) %>% 
  kable()

number_movie	n
1	65172
2	1362
3	85
4	18
5	6
6	1
7	2
8	1

It looks like most directors directed one movie in their debut year, but a small group of directors made more than one movie the year they started directing. Let’s create a variable for the number of movies a director made the year they started their directing career.

movie %>% 
  left_join(movie %>% 
              # count the number of movies a first-time director made
              count(director_id) %>% 
              rename(number_first_movie = n),
            # state the join key explicitly
            by = "director_id") -> movie

Tracing career history of first-time directors

Now that we have our list of directors who made their first movies between 1980 and 2010, we can move on to tracing how their careers turned out. Specifically, let’s gather information on whether they went on to make other movies in the first 10 years of their career. First, let’s get all the movies each director in our list has been involved in and see what the first 20 rows of the data look like.

movie %>% 
  rename(first_movie_id = movie_id, 
         first_movie_year = year,
         first_movie_genres = genres) %>% 
  # merge with data on crew information to find all the projects a director has been involved in
  left_join(title_principals_2,
            by = c("director_id" = "person_id")) -> movie

movie %>% 
  # merge with data on movie information
  left_join(title_basics_2, by = "movie_id") %>% 
  # remove the projects that are not in the movie data (and thus are not movies, but for example are tv episodes)
  filter(!is.na(year)) %>% 
  select(-genres) -> movie

movie %>% head(20) %>% kbl_2()


director_id	first_movie_id	first_movie_year	first_movie_genres	number_first_movie	movie_id	category	year
nm0000083	tt0969216	2007	Documentary,Family	1	tt0170560	editor	1998
nm0000083	tt0969216	2007	Documentary,Family	1	tt0424773	writer	2002
nm0000083	tt0969216	2007	Documentary,Family	1	tt0969216	director	2007
nm0000083	tt0969216	2007	Documentary,Family	1	tt11192552	writer	2019
nm0000083	tt0969216	2007	Documentary,Family	1	tt12979838	editor	2009
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0084490	actor	1982
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0086293	actor	1984
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0088443	actor	1984
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0088954	actor	1985
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0089946	actor	1985
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0090668	actor	1987
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0091495	actor	1986
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0091805	actor	1986
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0093412	actor	1987
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0093747	actor	1988
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0094705	actor	1989
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0094822	actor	1988
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0095675	actor	1988
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0096938	actor	1989
nm0000104	tt0142201	1999	Comedy,Crime,Drama	1	tt0098324	actor	1989

Each row is now a director-movie observation, containing information on a movie a director has been involved in, regardless of their role in the movie. For each director, there is information on:

Director ID (director_id)
ID of the first movie they directed (first_movie_id)
The year they directed their first movie (first_movie_year)
The genres of the first movie they directed (first_movie_genres)
The number of movies they directed the year they started directing (number_first_movie)
ID of movies they have been involved in throughout their career in any role (movie_id)
The role they took on in these movies (category)
The year these movies came out (year).

For the movies the directors in our list have been involved in, we can also add information on their IMDb ratings and number of votes on IMDb.

title_ratings <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_ratings.rds")

movie %>% 
  # merging movie data with rating data
  left_join(title_ratings, by = c("movie_id" = "tconst")) %>% 
  rename(rating = averageRating,
         number_vote = numVotes) %>% 
  left_join(title_ratings, by = c("first_movie_id"  = "tconst")) %>% 
  rename(first_movie_rating = averageRating,
         first_movie_number_vote = numVotes) -> movie

Let’s graph the distribution of ratings for movies made by first-time directors over time. To do this, we will group the directors based on the decade when they made their first movie - 1980s, 1990s, 2000s, 2010s.

movie %>% 
  filter(first_movie_id == movie_id) %>% 
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, first_movie_rating, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Movie ratings", title = "Distribution of ratings among movies made by first-time directors") +
  theme(legend.position = "none")

It looks like movies made by first-time directors in the 2000s and 2010s have slightly higher ratings on average compared to movies made by first-time directors in the 1980s and 1990s.
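
The decade labels in the plot come from integer division. A small worked example with made-up years shows how each debut year maps to a label. Because the sample ends in 2010, the "2010s" group contains only directors who debuted in 2010:

```r
# made-up debut years spanning the sample window
first_movie_year <- c(1980, 1989, 1995, 2004, 2010)

# %/% is integer division: 1989 %/% 10 gives 198, and 198 * 10 gives 1980
decade <- paste0((first_movie_year %/% 10) * 10, "s")

decade
```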

Let’s also graph the distribution of votes (reflecting popularity) for movies made by first-time directors over time.

# turn off scientific notation
options(scipen = 999)

movie %>% 
  filter(first_movie_id == movie_id) %>% 
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, first_movie_number_vote, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Number of votes (logged)", title = "Distribution of votes among movies made by first-time directors") +
  theme(legend.position = "none") +
  # log transform y axis for clearer image
  scale_y_log10()

Director career before their first movie

Some first-time directors might have worked on other movies before they started directing, where they took on creative non-directing roles (producer, writer, editor, cinematographer, production designer, and composer). This prior experience might influence their career survival as a director. Therefore, we will create variables that reflect a first-time director’s prior work experience: whether they worked in each creative non-directing role before they directed their first movie, and how many movies they worked on in these roles.

movie %>% 
  # get the movies a director worked on before they directed, 
  # where they took on creative non-directing roles 
  filter(year < first_movie_year & 
           category %in% c("producer", "writer", "editor", 
                           "cinematographer", "production_designer", "composer")) %>% 
  # calculate the number of times a director has worked on a particular creative role 
  fastDummies::dummy_cols(select_columns = "category") %>%
  group_by(director_id) %>% 
  summarise(across(starts_with("category_"), sum)) %>% 
  # calculate the number of movies a director worked on before they directed, 
  # where they took on creative non-directing roles 
  rowwise(director_id) %>% 
  mutate(number_previous_movie = sum(c_across(where(is.numeric)))) %>% 
  # create dummy variable reflecting whether a director has worked on a creative non-directing role or not
  mutate(across(starts_with("category_"), ~ ifelse(. > 0, 1, .))) -> prior_experience

movie %>% 
  # merge movie data with prior experience data
  left_join(prior_experience, by = "director_id") %>% 
  # for directors who have not worked in creative non-directing roles before they directed a movie, 
  # code all variables on prior experience as 0
  mutate(across(category_cinematographer:number_previous_movie, ~ ifelse(is.na(.), 0, .))) -> movie
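
To make the coding above easier to follow, here is the same idea on a made-up table. This sketch swaps in `count()` and `tidyr::pivot_wider()` instead of `fastDummies`, so it is an alternative illustration rather than the code used for the post, and `toy_credits` is hypothetical:

```r
library(dplyr)
library(tidyr)

# hypothetical pre-directing credits in creative non-directing roles
toy_credits <- tibble(
  director_id = c("d1", "d1", "d1", "d2"),
  category    = c("writer", "writer", "producer", "editor")
)

toy_experience <- toy_credits %>%
  # number of credits per director and role
  count(director_id, category) %>%
  # one column per role, 0 where a director never held that role
  pivot_wider(names_from = category, values_from = n,
              names_prefix = "category_", values_fill = 0) %>%
  # total credits first, then collapse the role counts to 0/1 indicators
  mutate(number_previous_movie = rowSums(across(starts_with("category_"))),
         across(starts_with("category_"), ~ as.integer(.x > 0)))

toy_experience
```

Director `d1` ends up with three prior credits and indicators for writer and producer; `d2` has one credit as editor.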

With prior experience coded, let’s see how many first-time directors in our list have had prior experience working in creative non-directing roles.

movie %>% 
  # get distinct pairs of director and number of previous movie
  distinct(director_id, number_previous_movie) %>% 
  # count the number of directors with certain number of previous movie
  count(number_previous_movie) %>% 
  kbl_2()


number_previous_movie	n
0	57767
1	4673
2	1630
3	706
4	467
5	292
6	185
7	157
8	118
9	99
10	71
11	52
12	55
13	41
14	32
15	26
16	17
17	18
18	31
19	17
20	17
21	15
22	14
23	7
24	13
25	8
26	2
27	12
28	9
29	3
30	7
31	3
32	7
33	4
35	1
36	2
37	4
38	1
39	3
40	5
41	1
42	2
43	1
45	2
46	6
47	1
48	1
49	2
50	1
51	1
52	1
53	2
55	1
56	4
57	1
58	3
60	4
62	1
64	1
66	1
67	1
69	1
70	1
74	2
78	2
80	1
86	1
93	1
96	1
103	1
104	1
119	1
128	1
142	1
188	1
342	1
456	1

It appears that most first-time directors in our list have not worked in other creative roles in movies before they directed their first movie.

Among the first-time directors who did work in creative non-directing roles in past movies, what roles did they usually take on?

movie %>% 
  # get dummy variables on creative roles
  select(director_id, starts_with("category_")) %>% 
  # get distinct pairs of director and creative roles
  distinct() %>% 
  # count the number of directors who have worked on certain creative roles
  summarise(across(starts_with("category_"), sum)) %>% 
  # convert wide table to long table for plotting
  tibble::add_column(ID = 1) %>% 
  pivot_longer(!ID, names_to = "role", values_to = "n") %>% 
  # clean up variable names for plotting
  mutate(role = stringr::str_remove(role, "category_"), 
         role = forcats::fct_reorder(role, -n)) %>% 
  # create a bar chart with creative roles in the x axis, 
  # number of directors in the y axis
  ggplot(aes(role, n, fill = role)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Number of directors",
       title = "How many directors worked in creative non-directing role \n before they directed their first movie?")

It looks like most directors who have worked on other movies before they started directing were writers and producers.

Director career after their first movie

Next we will count the number of movies each director went on to direct within the first 10 years of their career besides their first movie (i.e., their career survival).

movie %>% 
  # get the movies directors in the list work on in the director role
  filter(category == "director") %>% 
  # convert release year to numeric 
  mutate(across(c(first_movie_year, year), as.numeric)) %>% 
  # get the movies each director directed within the first 10 years of their career
  filter(year <= first_movie_year + 10) %>% 
  # count the number of movies each director made 
  count(director_id) %>% 
  rename(number_movie = n) -> number_movie

movie %>% 
  # convert release year to numeric
  mutate(across(c(first_movie_year, year), as.numeric)) %>%
  # merge movie data with career survival data
  left_join(number_movie, by = "director_id") %>% 
  # count the number of movies each director went on to direct 
  # within the first 10 years of their career besides their first movie
  mutate(number_movie = number_movie - number_first_movie) -> movie
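
One thing that may be worth checking at this step: the career data was built by joining every first movie to all of a director's credits, so a director who debuted with several movies may carry one copy of each credit per first movie. A plain row count could then count the same film more than once. The sketch below compares a row count with a count of distinct director-movie pairs on made-up data. `toy_career` is hypothetical and this is not the code that produced the tables in this post:

```r
library(dplyr)

# hypothetical joined data: d1 debuted with m1 and m2 and made nothing else,
# so each of d1's credits appears once per first movie; d2 debuted with m3
# and later directed m4
toy_career <- tibble(
  director_id        = c("d1", "d1", "d1", "d1", "d2", "d2"),
  first_movie_id     = c("m1", "m1", "m2", "m2", "m3", "m3"),
  movie_id           = c("m1", "m2", "m1", "m2", "m3", "m4"),
  number_first_movie = c(2, 2, 2, 2, 1, 1)
)

# counting rows: d1's two films are counted four times
row_based <- toy_career %>%
  count(director_id, number_first_movie) %>%
  mutate(number_movie = n - number_first_movie)

# counting distinct films per director
film_based <- toy_career %>%
  distinct(director_id, movie_id, number_first_movie) %>%
  count(director_id, number_first_movie) %>%
  mutate(number_movie = n - number_first_movie)

row_based
film_based
```

In this toy case the row count credits `d1` with two later movies, while the distinct count gives zero. The two approaches agree for `d2`, who has a single first movie.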

Let’s take a look at the number of movies the directors directed within the first 10 years of their career, besides their first movie.

movie %>% 
  # get distinct director-number of movie pair
  distinct(director_id, number_movie) %>% 
  # count the number of directors with certain number of movie
  count(number_movie, sort = T) %>% 
  kbl_2()


number_movie	n
0	39541
1	13023
2	6507
3	2986
4	1782
5	788
6	701
8	262
7	251
10	152
12	117
9	105
14	74
11	48
18	41
16	39
20	26
15	24
13	19
22	15
30	14
24	13
21	10
26	10
19	7
25	6
28	6
17	5
40	5
42	5
23	4
29	4
32	4
38	4
27	3
31	3
33	3
34	3
52	3
72	3
36	2
44	2
46	2
48	2
49	2
56	2
63	2
69	2
35	1
41	1
45	1
51	1
53	1
58	1
60	1
64	1
78	1
84	1
88	1
123	1
130	1
210	1
220	1

# create nested data where each director's detailed career history is stored in a list column
movie %>% 
  nest(career_history = movie_id:number_vote) -> movie

Let’s also graph the distribution of number of movies made by directors over time.

movie %>% 
  mutate(decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, number_movie, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, 
       y = "Number of movies",
       title = "Distribution of number of movies directors made \n in the first 10 years of their career") +
  theme(legend.position = "none")

This is a little hard to see. Let’s count the number of directors who made 0, 1, 2, and more than 2 movies in the first 10 years of their career.

movie %>% 
  mutate(number_movie_1 = ifelse(number_movie >2, 3, number_movie),
         decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  count(decade, number_movie_1) %>% 
  mutate(number_movie_1 = as.factor(number_movie_1)) %>% 
  ggplot(aes(decade, n, fill = number_movie_1)) +
  geom_col(position = position_dodge(preserve = "single")) +
  scale_fill_manual(name = "Number of movies",
                    labels=c("0","1","2", "More than 2"),
                    values=c("#316879", "#f47a60", "#7fe7dc", "#fbcbc9")) +
  labs(x = NULL,
       y = "Number of directors",
       title = "How many directors go on to make more movies \n after their first movie?")

It looks like across time, the majority of directors did not go on to direct another movie after their first movie. Even among the directors who did, most only directed one or two movies after their first movie. This indicates high career failure among movie directors. As such, it is important to identify the factors that help (or hurt) career survival of movie directors.

Part of my dissertation explores one such factor: the social network of the people a director worked with on their first movie. In other posts, I will construct the collaboration network of film-makers, calculate the network position of a director’s early collaborators, and test whether working with people who are central in the network increases a director’s chance of continuing their directing career.

Tracing career histories of film directors from IMDb data

Julie Nguyen (Personal website, Email)

March 13, 2022
