In this post, I use IMDb data to construct a sample of first-time movie directors and collect data on the movies they worked on throughout their careers.

This data was created for a study in my dissertation that explores whether working with highly central people in the network predicts the career survival of first-time movie directors, or more specifically, the number of movies directors go on to make after their first movie. To this end, I first construct the career histories of movie directors who made their first movie between 1980 and 2010, using the data I downloaded from https://datasets.imdbws.com/. IMDb has the most complete database on film-makers worldwide. It contains information on film and TV productions, and the people who worked on them, from the late 1800s to the present.

I choose 1980 as the start of my sample of directors because this time marks the transition of the Hollywood film industry from in-house productions to projects and personal networks. As such, movies made after 1980 are more representative of the film industry today. I choose 2010 as the end of the sample because it allows me to observe the first 10 years of every director’s career.

Getting movies made between 1980 and 2010

First, let’s read the IMDb data on movie information and see what years we have data on.

# load the packages used throughout this post
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)

# read data on movies that have information on year and genres 
title_basics_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_basics_2.rds")

# load data on people working on movies
title_principals_2 <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_principals_2.rds")

# get the functions I have written to create table with scroll box
source("functions.R")
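A quick way to see which years are covered is to convert year to numeric and take its minimum and maximum. The sketch below does this on a small made-up table: toy_basics and its values are hypothetical stand-ins for title_basics_2, and I assume year is stored as character.

```r
library(dplyr)

# Hypothetical toy stand-in for title_basics_2 (values are made up);
# year is assumed to be stored as character
toy_basics <- tibble(
  movie_id = c("tt001", "tt002", "tt003", "tt004"),
  year     = c("1896", "1985", "2010", "2028"),
  genres   = c("Short", "Drama", "Comedy", "Action")
)

# convert year to numeric, then take the earliest and latest year
year_range <- toy_basics %>%
  mutate(year = as.numeric(year)) %>%
  summarise(earliest = min(year, na.rm = TRUE),
            latest   = max(year, na.rm = TRUE))

year_range
```

The same two lines of mutate() and summarise() should work on the real table as long as its year column can be converted to numeric.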

The earliest year we have data on is 1896 and the latest year we have data on is 2028. Seems like IMDb also has info on movies that have not been finished yet. Let’s graph the number of movies in the database across time.

# graph the number of movies over time
title_basics_2 %>% 
  # convert year to numeric
  mutate(year = as.numeric(year)) %>% 
  # count the number of movies made in each year
  count(year, sort= T) %>% 
  # create a bar chart with year in the x axis and number of movies in the y axis
  ggplot(aes(year, n)) + 
  geom_col(fill = "#8BD8BD") +
  labs(y = "Number of movies", 
       x = NULL, 
       title = "How many movies were made each year?",
       subtitle = "A lot more movies are made in the 21st century than previous centuries")

Next, we retain only the movies made between 1980 and 2010. Let’s take a look at the first 20 rows of the data to see what the data looks like now.

# get the movies made between 1980 and 2010
title_basics_2 %>% filter(between(year, 1980, 2010)) -> movie

movie %>% head(20) %>% kbl_2()
movie_id year genres
tt0015724 1993 Drama,Mystery,Romance
tt0035423 2001 Comedy,Fantasy,Romance
tt0036606 1983 Drama,War
tt0038687 1980 Documentary,War
tt0057461 1983 Drama,Fantasy
tt0059325 1990 Drama,Romance
tt0059900 1990 Drama,Fantasy
tt0062181 1981 Drama
tt0064820 1989 Comedy
tt0065108 1989 Drama,Romance
tt0065188 1990 Drama
tt0065530 1983 Drama
tt0066020 1986 Action,Comedy
tt0066151 1986 Action,Adventure
tt0067100 1981 Action,Drama,Thriller
tt0067460 1995 Documentary,Sport
tt0067625 1986 Drama,War
tt0067694 1987 Drama,War
tt0067754 1981 Documentary
tt0068494 1990 Drama

We now have a list of 146580 movies made between 1980 and 2010, with information on movie id (movie_id), the year the movie was released (year), and the movie’s genres (genres).

Getting directors who made their first movie between 1980 and 2010

Our next step is to get the directors of these movies. To do this, we need to merge the IMDb dataset on titles and the dataset on principals, and exclude movies that do not have information on directors.

movie %>% 
  # get the director of these movies by merging movie data with crew data using movie id
  left_join(title_principals_2 %>% 
              # get directors of movies
              filter(category == "director") %>% 
              # choose movie id and person id
              select(movie_id, person_id), 
            # merge with movie id as key
            by = "movie_id") %>%
  rename(director_id = person_id) %>% 
  # remove movies without info on director
  filter(!is.na(director_id)) -> movie

This gives us 146973 director-movie pairs, which are made up of 72305 distinct directors and 132743 distinct movies, since some movies had more than one director.
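Counts like these can be obtained with n() and n_distinct(). Here is a minimal sketch on made-up data (toy_pairs and its ids are hypothetical), where one movie has two directors and one director has two movies.

```r
library(dplyr)

# Hypothetical director-movie pairs: tt001 has two directors,
# and nm01 directed two movies
toy_pairs <- tibble(
  movie_id    = c("tt001", "tt001", "tt002", "tt003"),
  director_id = c("nm01", "nm02", "nm01", "nm03")
)

# number of pairs, distinct directors, and distinct movies
pair_summary <- toy_pairs %>%
  summarise(n_pairs     = n(),
            n_directors = n_distinct(director_id),
            n_movies    = n_distinct(movie_id))

pair_summary
```

The number of pairs exceeds both the number of directors and the number of movies whenever co-directed movies or repeat directors are present.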

Now that we have a list of directors who made at least one movie between 1980 and 2010, we need to get the subset of directors whose first movie was made within this period. To do this, we first get all the movies of the directors in the list.

movie %>% 
  # get distinct director id
  distinct(director_id) %>% 
  # get all work directed by these directors by merging by person id with crew info
  left_join(title_principals_2 %>% 
              filter(category == "director") %>% 
              select(movie_id, person_id), 
            by = c("director_id" = "person_id")) -> movie

# get the release year of directors' movies by merging by movie id with movie info
movie %>% 
  left_join(title_basics_2, by = "movie_id") %>% 
  # remove movies of directors that do not have release year
  filter(!is.na(year)) -> movie

Next, we get the first movies of the directors in our list and then only choose the directors whose first movie was made between 1980 and 2010.

movie %>% 
  # group by director id 
  group_by(director_id) %>% 
  # get the movie with earliest year within each director id group
  slice_min(year) %>%   
  ungroup() %>% 
  # get the movies made between 1980 and 2010
  filter(between(year, 1980, 2010)) -> movie

This gives us a sample of 68281 director-movie pairs, made up of 66647 directors and 60779 movies. The numbers differ because some first-time directors directed more than one movie in their first year, and some movies were directed by more than one first-time director.
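Two details of this step are easy to miss: slice_min() keeps ties by default, and the period filter is applied after the earliest movie is picked. The sketch below illustrates both on made-up data (toy_career and its ids are hypothetical).

```r
library(dplyr)

# Hypothetical directing credits:
# nm01 directed two movies in their first year (1985);
# nm02 first directed in 1979, before the observation period
toy_career <- tibble(
  director_id = c("nm01", "nm01", "nm01", "nm02", "nm02"),
  movie_id    = c("tt001", "tt002", "tt003", "tt004", "tt005"),
  year        = c(1985, 1985, 1990, 1979, 1995)
)

first_movies <- toy_career %>%
  group_by(director_id) %>%
  # with_ties = TRUE (the default) keeps every movie from the earliest year
  slice_min(order_by = year, n = 1, with_ties = TRUE) %>%
  ungroup() %>%
  # keep only directors whose earliest movie falls in the period
  filter(between(year, 1980, 2010))

first_movies
```

Both of nm01's 1985 movies are kept, which is why a director can have more than one first movie. nm02 is dropped entirely, even though they also directed a movie in 1995, because their earliest movie predates 1980.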

Let’s see how many directors made their first movie in each year within our observation period.

movie %>% 
  distinct(director_id, year) %>% 
  # count the number of directors in each year
  count(year, sort= T) %>% 
  # create a bar chart with year in the x axis and number of director in the y axis
  ggplot(aes(factor(year), n, fill = year)) + 
  geom_col(show.legend = F) +
  labs(y = "Number of directors", 
       x = NULL, 
       title = "How many directors made their first movie in each year?") +
  scale_x_discrete(breaks = c(1980, 1985, 1990, 1995, 2000, 2005, 2010))

How many movies did a director usually make in the year they first directed?

movie %>% 
  # count the number of first movies per director
  count(director_id, name = "number_movie") %>% 
  # count how many directors have each number of first movies
  count(number_movie) %>% 
  kable()
number_movie n
1 65172
2 1362
3 85
4 18
5 6
6 1
7 2
8 1

It looks like most directors directed one movie in the year they started directing, but a small group of directors made more than one. Let’s create a variable for the number of movies a director made the year they started their directing career.

movie %>% 
  left_join(movie %>% 
              # count the number of movies a first-time director made
              count(director_id) %>% 
              rename(number_first_movie = n),
            # merge with director id as key
            by = "director_id") -> movie

Tracing career history of first-time directors

Now that we have our list of directors who made their first movies between 1980 and 2010, we can move on to tracing how the careers of these directors turned out. Specifically, let’s gather information on whether they went on to make other movies in the first 10 years of their career. First, let’s get all the movies a director in our list has been involved in and see what the first 20 rows of the data look like.

movie %>% 
  rename(first_movie_id = movie_id, 
         first_movie_year = year,
         first_movie_genres = genres) %>% 
  # merge with data on crew information to find all the projects a director has been involved in
  left_join(title_principals_2,
            by = c("director_id" = "person_id")) -> movie

movie %>% 
  # merge with data on movie information
  left_join(title_basics_2, by = "movie_id") %>% 
  # remove the projects that are not in the movie data (and thus are not movies, but for example are tv episodes)
  filter(!is.na(year)) %>% 
  select(-genres) -> movie

movie %>% head(20) %>% kbl_2()
director_id first_movie_id first_movie_year first_movie_genres number_first_movie movie_id category year
nm0000083 tt0969216 2007 Documentary,Family 1 tt0170560 editor 1998
nm0000083 tt0969216 2007 Documentary,Family 1 tt0424773 writer 2002
nm0000083 tt0969216 2007 Documentary,Family 1 tt0969216 director 2007
nm0000083 tt0969216 2007 Documentary,Family 1 tt11192552 writer 2019
nm0000083 tt0969216 2007 Documentary,Family 1 tt12979838 editor 2009
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0084490 actor 1982
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0086293 actor 1984
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0088443 actor 1984
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0088954 actor 1985
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0089946 actor 1985
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0090668 actor 1987
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0091495 actor 1986
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0091805 actor 1986
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0093412 actor 1987
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0093747 actor 1988
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0094705 actor 1989
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0094822 actor 1988
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0095675 actor 1988
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0096938 actor 1989
nm0000104 tt0142201 1999 Comedy,Crime,Drama 1 tt0098324 actor 1989

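Each row is now a director-movie observation, containing information on a movie a director has been involved in, regardless of their role in the movie. For each director, there is information on their first movie (first_movie_id, first_movie_year, first_movie_genres, number_first_movie) and on each movie they worked on (movie_id, category, year).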

For the movies the directors in our list have been involved in, we can also add information on their IMDb ratings and number of votes on IMDb.

title_ratings <- readRDS("C:/Users/nnguye79/OneDrive - McGill University/Work/Projects/Social cap and gender/Film-maker-network/Data/Raw data/title_ratings.rds")

movie %>% 
  # merging movie data with rating data
  left_join(title_ratings, by = c("movie_id" = "tconst")) %>% 
  rename(rating = averageRating,
         number_vote = numVotes) %>% 
  left_join(title_ratings, by = c("first_movie_id"  = "tconst")) %>% 
  rename(first_movie_rating = averageRating,
         first_movie_number_vote = numVotes) -> movie

Let’s graph the distribution of ratings for movies made by first-time directors over time. To do this, we will group the directors based on the decade when they made their first movie - 1980s, 1990s, 2000s, 2010s.

movie %>% 
  filter(first_movie_id == movie_id) %>% 
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, first_movie_rating, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Movie ratings", title = "Distribution of ratings among movies made by first-time directors") +
  theme(legend.position = "none")

It looks like movies made by first-time directors in the 2000s and 2010s have slightly higher ratings on average compared to movies made by first-time directors in the 1980s and 1990s.
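The decade labels in the plot come from integer division of the year by 10. A minimal sketch on made-up years (toy_years is hypothetical, with year assumed to be stored as character):

```r
library(dplyr)

# Hypothetical first-movie years, assumed to be stored as character
toy_years <- tibble(year = c("1980", "1989", "1990", "2009", "2010"))

toy_decades <- toy_years %>%
  # %/% is integer division: 1989 %/% 10 is 198, times 10 gives 1980
  mutate(decade = paste0((as.numeric(year) %/% 10) * 10, "s"))

toy_decades
```

Since the sample ends in 2010, the "2010s" group contains only directors who debuted in 2010, so it is based on a single year rather than a full decade.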

Let’s also graph the distribution of votes (reflecting popularity) for movies made by first-time directors over time.

# turn off scientific notation
options(scipen = 999)

movie %>% 
  filter(first_movie_id == movie_id) %>% 
  mutate(decade = (as.numeric(year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, first_movie_number_vote, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, y = "Number of votes (logged)", title = "Distribution of votes among movies made by first-time directors") +
  theme(legend.position = "none") +
  # log transform y axis for clearer image
  scale_y_log10()

Director career before their first movie

Some first-time directors might have worked on other movies before they started directing, where they took on creative non-directing roles (producer, writer, editor, cinematographer, production designer, and composer). This prior experience might influence their career survival as a director. Therefore, we will create variables that reflect a first-time director’s prior work experience: the number of movies they worked on in creative non-directing roles before they directed their first movie, and a set of dummy variables indicating which roles they took on in these movies.

movie %>% 
  # get the movies a director worked on before they directed, 
  # where they took on creative non-directing roles 
  filter(year < first_movie_year & 
           category %in% c("producer", "writer", "editor", 
                           "cinematographer", "production_designer", "composer")) %>% 
  # calculate the number of times a director has worked on a particular creative role 
  fastDummies::dummy_cols(select_columns = "category") %>%
  group_by(director_id) %>% 
  summarise(across(starts_with("category_"), sum)) %>% 
  # calculate the number of movies a director worked on before they directed, 
  # where they took on creative non-directing roles 
  rowwise(director_id) %>% 
  mutate(number_previous_movie = sum(c_across(where(is.numeric)))) %>% 
  # create dummy variable reflecting whether a director has worked on a creative non-directing role or not
  mutate(across(starts_with("category_"), ~ ifelse(. > 0, 1, .))) -> prior_experience

movie %>% 
  # merge movie data with prior experience data
  left_join(prior_experience, by = "director_id") %>% 
  # for directors who have not worked in creative non-directing roles before they directed a movie, 
  # code all variables on prior experience as 0
  mutate(across(category_cinematographer:number_previous_movie, ~ ifelse(is.na(.), 0, .))) -> movie
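To make the logic of these two steps easier to follow, here is a minimal sketch of the same idea using only dplyr, without fastDummies. Everything in it is made up for illustration (toy_credits, its ids, and the choice to show only two role dummies). As in the code above, the count is a count of credits, so it equals the number of movies only if a person has one creative credit per movie.

```r
library(dplyr)

# Hypothetical credits: nm01 wrote two movies before directing in 1995;
# nm02 only acted before directing in 2000
toy_credits <- tibble(
  director_id      = c("nm01", "nm01", "nm01", "nm02", "nm02"),
  category         = c("writer", "writer", "director", "actor", "director"),
  year             = c(1990, 1992, 1995, 1998, 2000),
  first_movie_year = c(1995, 1995, 1995, 2000, 2000)
)

creative_roles <- c("producer", "writer", "editor",
                    "cinematographer", "production_designer", "composer")

# credits in creative non-directing roles before the first directed movie
prior <- toy_credits %>%
  filter(year < first_movie_year, category %in% creative_roles) %>%
  group_by(director_id) %>%
  summarise(number_previous_movie = n(),
            category_writer   = as.integer(any(category == "writer")),
            category_producer = as.integer(any(category == "producer")))

# directors with no prior creative credits get 0 instead of NA
toy_prior <- toy_credits %>%
  distinct(director_id) %>%
  left_join(prior, by = "director_id") %>%
  mutate(across(-director_id, ~ coalesce(.x, 0L)))

toy_prior
```

Here nm01 ends up with two prior credits and a writer dummy of 1, while nm02 gets zeros everywhere because acting is not in the list of creative roles.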

Once we do that, let’s see how many first-time directors in our list have had prior experience working in creative non-directing roles.

movie %>% 
  # get distinct pairs of director and number of previous movie
  distinct(director_id, number_previous_movie) %>% 
  # count the number of directors with certain number of previous movie
  count(number_previous_movie) %>% 
  kbl_2()
number_previous_movie n
0 57767
1 4673
2 1630
3 706
4 467
5 292
6 185
7 157
8 118
9 99
10 71
11 52
12 55
13 41
14 32
15 26
16 17
17 18
18 31
19 17
20 17
21 15
22 14
23 7
24 13
25 8
26 2
27 12
28 9
29 3
30 7
31 3
32 7
33 4
35 1
36 2
37 4
38 1
39 3
40 5
41 1
42 2
43 1
45 2
46 6
47 1
48 1
49 2
50 1
51 1
52 1
53 2
55 1
56 4
57 1
58 3
60 4
62 1
64 1
66 1
67 1
69 1
70 1
74 2
78 2
80 1
86 1
93 1
96 1
103 1
104 1
119 1
128 1
142 1
188 1
342 1
456 1

It appears that most first-time directors in our list have not worked in other creative roles in movies before they directed their first movie.

Among the first-time directors who did work in creative non-directing roles on past movies, what roles did they usually take on?

movie %>% 
  # get dummy variables on creative roles
  select(director_id, starts_with("category_")) %>% 
  # get distinct pairs of director and creative roles
  distinct() %>% 
  # count the number of directors who have worked on certain creative roles
  summarise(across(starts_with("category_"), sum)) %>% 
  # convert wide table to long table for plotting
  tibble::add_column(ID = 1) %>% 
  pivot_longer(!ID, names_to = "role", values_to = "n") %>% 
  # clean up variable names for plotting
  mutate(role = stringr::str_remove(role, "category_"), 
         role = forcats::fct_reorder(role, -n)) %>% 
  # create a bar chart with creative roles in the x axis, 
  # number of directors in the y axis
  ggplot(aes(role, n, fill = role)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "Number of directors",
       title = "How many directors worked in creative non-directing role \n before they directed their first movie?")

It looks like most directors who have worked on other movies before they started directing were writers and producers.

Director career after their first movie

Next we will count the number of movies each director went on to direct within the first 10 years of their career besides their first movie (i.e., their career survival).

movie %>% 
  # get the movies directors in the list work on in the director role
  filter(category == "director") %>% 
  # convert release year to numeric 
  mutate(across(c(first_movie_year, year), as.numeric)) %>% 
  # get the movies each director directed within the first 10 years of their career
  filter(year <= first_movie_year + 10) %>% 
  # count the number of movies each director made 
  count(director_id) %>% 
  rename(number_movie = n) -> number_movie

movie %>% 
  # convert release year to numeric
  mutate(across(c(first_movie_year, year), as.numeric)) %>%
  # merge movie data with career survival data
  left_join(number_movie, by = "director_id") %>% 
  # count the number of movies each director went on to direct 
  # within the first 10 years of their career besides their first movie
  mutate(number_movie = number_movie - number_first_movie) -> movie
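The sketch below walks through the same calculation on made-up data (toy_directed and its values are hypothetical) to show how the window and the subtraction behave. Note that the filter year <= first_movie_year + 10 includes the endpoint, so a movie released exactly 10 years after the first one still counts.

```r
library(dplyr)

# Hypothetical directing credits: nm01 debuted in 1995 and directed again in
# 1998, 2005, and 2006; nm02 debuted in 2000 and never directed again
toy_directed <- tibble(
  director_id        = c("nm01", "nm01", "nm01", "nm01", "nm02"),
  year               = c(1995, 1998, 2005, 2006, 2000),
  first_movie_year   = c(1995, 1995, 1995, 1995, 2000),
  number_first_movie = c(1, 1, 1, 1, 1)
)

survival <- toy_directed %>%
  # keep movies up to and including first_movie_year + 10
  filter(year <= first_movie_year + 10) %>%
  # count directed movies in the window for each director
  count(director_id, number_first_movie, name = "number_directed") %>%
  # subtract the first movie(s) to get the movies made afterwards
  mutate(number_movie = number_directed - number_first_movie)

survival
```

For nm01, the 2005 movie is counted and the 2006 movie is not, giving two later movies; nm02 gets 0.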

Let’s take a look at the number of movies the directors went on to direct within the first 10 years of their career.

movie %>% 
  # get distinct director-number of movie pair
  distinct(director_id, number_movie) %>% 
  # count the number of directors with certain number of movie
  count(number_movie, sort = T) %>% 
  kbl_2()
number_movie n
0 39541
1 13023
2 6507
3 2986
4 1782
5 788
6 701
8 262
7 251
10 152
12 117
9 105
14 74
11 48
18 41
16 39
20 26
15 24
13 19
22 15
30 14
24 13
21 10
26 10
19 7
25 6
28 6
17 5
40 5
42 5
23 4
29 4
32 4
38 4
27 3
31 3
33 3
34 3
52 3
72 3
36 2
44 2
46 2
48 2
49 2
56 2
63 2
69 2
35 1
41 1
45 1
51 1
53 1
58 1
60 1
64 1
78 1
84 1
88 1
123 1
130 1
210 1
220 1

# nest the detailed career history of each director in a list column
movie %>% 
  nest(career_history = movie_id:number_vote) -> movie
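To illustrate what nesting does, here is a small sketch on made-up data (toy_long and its values are hypothetical). It uses the named form nest(career_history = ...), which, as far as I know, is the interface tidyr has recommended since version 1.0.0.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per director-movie observation
toy_long <- tibble(
  director_id      = c("nm01", "nm01", "nm02"),
  first_movie_year = c(1995, 1995, 2000),
  movie_id         = c("tt001", "tt002", "tt003"),
  category         = c("writer", "director", "director"),
  year             = c(1990, 1995, 2000)
)

# collapse the movie-level columns into one list column per director
toy_nested <- toy_long %>%
  nest(career_history = movie_id:year)

# unnest() restores the original long format
toy_unnested <- toy_nested %>%
  unnest(career_history)

toy_nested
```

After nesting there is one row per director, and each element of career_history is a small data frame holding that director's movies.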

Let’s also graph the distribution of the number of movies made by directors over time.

movie %>% 
  mutate(decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  ggplot(aes(decade, number_movie, color = decade)) +
  geom_boxplot(outlier.colour = NA) +
  geom_jitter(alpha = 0.1, width = 0.15) +
  labs(x = NULL, 
       y = "Number of movies",
       title = "Distribution of number of movies directors made \n in the first 10 years of their career") +
  theme(legend.position = "none") 

This is a little hard to see. Let’s count the number of directors who made 0, 1, 2, and more than 2 movies in the first 10 years of their career after their first movie.

movie %>% 
  mutate(number_movie_1 = ifelse(number_movie >2, 3, number_movie),
         decade = (as.numeric(first_movie_year) %/% 10) * 10,
         decade = factor(decade), 
         decade = paste(decade, "s", sep = "")) %>% 
  count(decade, number_movie_1) %>% 
  mutate(number_movie_1 = as.factor(number_movie_1)) %>% 
  ggplot(aes(decade, n, fill = number_movie_1)) +
  geom_col(position = position_dodge(preserve = "single")) +
  scale_fill_manual(name = "Number of movies",
                    labels=c("0","1","2", "More than 2"),
                    values=c("#316879", "#f47a60", "#7fe7dc", "#fbcbc9")) +
  labs(x = NULL,
       y = "Number of directors",
       title = "How many directors go on to make more movies \n after their first movie?")

It looks like across time, the majority of directors did not go on to direct another movie after their first movie. Even among the directors who did, most only directed one or two movies after their first movie. This indicates high career failure among movie directors. As such, it is important to identify the factors that help (or hurt) career survival of movie directors.
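Raw counts depend on how many directors debuted in each decade, so within-decade shares may be easier to compare across time. A minimal sketch on made-up data (toy_directors and its values are hypothetical and do not reflect the real sample):

```r
library(dplyr)

# Hypothetical directors with their debut year and number of later movies
toy_directors <- tibble(
  first_movie_year = c(1985, 1988, 1992, 1999, 2003, 2007),
  number_movie     = c(0, 2, 0, 0, 1, 5)
)

toy_share <- toy_directors %>%
  mutate(decade = paste0((first_movie_year %/% 10) * 10, "s"),
         # group everything above 2 into a single "more than 2" category
         number_movie_1 = ifelse(number_movie > 2, 3, number_movie)) %>%
  count(decade, number_movie_1) %>%
  # share of directors in each category within their debut decade
  group_by(decade) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

toy_share
```

Plotting share instead of n in the bar chart would put the decades on a common scale.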

Part of my dissertation explores one such factor: the social network of the people a director worked with on their first movie. In other posts, I will construct the collaboration network of film-makers, calculate the network position of a director’s early collaborators, and test whether working with people who are central in the network increases a director’s chance of continuing their directing career.