Introduction

In this project, I investigate the connections that exist among IMDb’s highest-rated movies of all time in terms of directors, genres, ratings, and release decade. With this information, I can identify the most central movies and directors and the links between them. I can also make inferences about how popular movies change over time based on genres and ratings.

The data comes from the official IMDb website [1] and is used to interpret the connections between the various movies.

In this work, I iterated over the data to extract movie titles, directors, years, genres, and ratings and assign them to variables. During this pass, edges are created between movies that share a director, between directors who worked together on a movie, and between all movies released in the same decade. Once the edge list is created, analyses such as community detection are performed and the data is plotted. The key result from the plotted data is that a small number of directors and movies are central in this dataset.

Part 1

Data

Data Source

The data for this analysis comes from the official IMDb website [1]. The dataset contains 1001 movies, which are used for the three different analyses. From the data, the columns “Title”, “IMDb Rating”, “Year”, “Genres”, and “Directors” were used.

As this data is open for use and distribution, there are no further legal implications for the collection, storage or dissemination of the data.

Wrangling, Cleaning, Imputing, and Encoding

I began this analysis by exporting the 1001 movies along with their additional data to a CSV file. The file contains many fields, but this project focuses on the movie titles, IMDb ratings, years, genres, and directors. None of the columns used had missing data, so no imputation was needed. From here, I wrote code that creates edges between movies that share a director, between directors who worked on the same movie, and between movies released in the same decade.
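As a minimal sketch of the edge-construction step, assuming a comma-separated “Directors” column like the one in the export, the snippet below builds co-director pairs for a tiny made-up data frame. The data frame `toy` and the helper `director_pairs` are hypothetical names used only for illustration; the code actually used for the project is in Appendix B.

```r
# Toy stand-in for the exported CSV (hypothetical rows, for illustration only)
toy <- data.frame(
  Title = c("Movie A", "Movie B", "Movie C"),
  Directors = c("Director X, Director Y",
                "Director Z",
                "Director X, Director Y, Director Z"),
  stringsAsFactors = FALSE
)

# Return a two-column matrix with one row per pair of co-directors
director_pairs <- function(directors_column) {
  split_names <- strsplit(directors_column, ",\\s*")
  pair_list <- lapply(split_names, function(names_in_movie) {
    if (length(names_in_movie) < 2) return(NULL)  # a solo director creates no edge
    t(combn(names_in_movie, 2))
  })
  do.call(rbind, pair_list)
}

toy_edges <- director_pairs(toy$Directors)
print(toy_edges)
```

The same pattern applies to the other two networks: group the rows by the shared attribute, then take every pair within each group.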

Exploratory Analysis

The following plot shows the relationships between directors who were a part of the same movie.

Here we see that most directors have only one edge. However, a few components have a hub node, a director who shared one movie with one director and another movie with a different director. In addition, one component is much larger than all of the others.

The plot below uses the same dataset as the plot above but shows the relationships between all movies made by the same director. The size of each node corresponds to the movie’s IMDb rating.

The most noticeable part of this plot is that there is significant clustering and transitivity. There also seem to be a few genres that are most popular in this dataset.

This plot uses the same dataset as the two above but creates edges between all movies that came out in the same decade. The size of each node corresponds to the movie’s IMDb rating.

In this plot, we can see that some components have only a couple nodes while others have many.

Issues during Exploratory Analysis

The first two networks went fairly smoothly; however, I ran into an issue with the third. When I attempted to plot the network with edges between every pair of nodes that share a decade, the program kept crashing because too many edges (4.4 million) were created during the exploratory process. I fixed this by using only one tenth of the nodes from each decade for my analysis. This allowed the program to run, but a side effect is that some genres may be missing from the third plot entirely: because the nodes are sampled randomly on each run, genres with few movies may not be represented.
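To see why connecting every pair of movies within a decade becomes expensive, note that a fully connected group of n movies produces n(n-1)/2 edges, so keeping one tenth of the movies cuts the edge count by roughly a factor of one hundred. The sketch below uses hypothetical decade sizes (not the real counts from the dataset) to compare the two cases. The fixed seed at the end is an assumption added here to show how a random sample could be made repeatable; the project code does not set one.

```r
# Hypothetical number of movies per decade (illustrative, not the real data)
decade_sizes <- c("1980" = 150, "1990" = 200, "2000" = 250)

# A fully connected group of n movies has choose(n, 2) = n * (n - 1) / 2 edges
full_edges <- choose(decade_sizes, 2)

# Keep one tenth of each decade, but never fewer than one movie
sampled_sizes <- pmax(1, floor(decade_sizes / 10))
sampled_edges <- choose(sampled_sizes, 2)

print(sum(full_edges))     # edges when every movie is kept
print(sum(sampled_edges))  # edges after sampling

# Fixing the seed makes the random subset the same on every run
set.seed(42)
first_draw <- sample(seq_len(150), size = 15)
set.seed(42)
second_draw <- sample(seq_len(150), size = 15)
print(identical(first_draw, second_draw))
```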

Part 2

Analysis and Results

Methods

The methods used to complete the analysis include creating edge lists, finding communities, and using rating and genre as visual elements when plotting the resulting data. I chose this approach because color-coding the nodes by the community or genre they are a part of adds another dimension to the data, which could lead to more insight into the connections between movies and directors. Representing rating as node size makes it easier to see which genres, directors, and decades could be considered the “most successful”.

Results

The results from this project concern which movies, directors, genres, and decades are most important. For the first plot, which is about directors, I used this table for my results:

Using this table, I found that the director with the highest degree centrality is “Bill Roberts”, with a value of 0.200. The director with the highest betweenness centrality is “David Hand”, with a value of 25.590. Many directors had a closeness centrality of 1. The highest PageRank belongs to “Lee Unkrich”, with a value of 0.013.
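Centrality values of this kind can be computed with igraph. The following is a minimal sketch on a small made-up graph, not the code that produced the table; the choice of which measures to normalize is an assumption, and `toy_net` and `centrality_table` are hypothetical names.

```r
library(igraph)

# Small hypothetical co-director graph in which "B" links several others
toy_edges <- data.frame(
  from = c("A", "B", "B", "D"),
  to   = c("B", "C", "D", "E")
)
toy_net <- graph_from_data_frame(toy_edges, directed = FALSE)

centrality_table <- data.frame(
  degree      = degree(toy_net, normalized = TRUE),    # share of possible neighbours
  betweenness = betweenness(toy_net),                  # shortest paths passing through a node
  closeness   = closeness(toy_net, normalized = TRUE), # inverse of mean distance to others
  page_rank   = page_rank(toy_net)$vector
)

# Name of the top-scoring node for each measure
top_nodes <- sapply(centrality_table, function(scores) {
  rownames(centrality_table)[which.max(scores)]
})

print(round(centrality_table, 3))
print(top_nodes)
```

In this toy graph a single node leads every measure; in the real networks the leaders differ by measure, which is why each is reported separately.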

For the second plot, which creates connections between movies that share a director, I used this table:

Using this table, I found that several movies tie for the highest degree centrality, including “Raiders of the Lost Ark”, with a value of 0.382. The movie with the highest betweenness centrality is “Toy Story 3”, with a value of 1. Eight movies are tied for the highest closeness centrality, each with a value of 1. The highest PageRank also belongs to “Toy Story 3”, at around 0.003.

For the third plot, which focuses on connections between movies made in the same decade, I used the following table:

Using this table, I found that the nodes with the highest degree centrality are all of the movies made in the 2000’s, with a value of 0.221. All nodes had a betweenness centrality of 0. The nodes with the highest closeness centrality are the two 1920’s movies, “Faust: Eine deutsche Volkssage” and “The Cameraman”, each with a value of 1. The highest PageRank belongs to all nodes from the 1980’s, at 0.010.
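Some of the decade values follow from how that network was built: each decade is a fully connected group, so no movie lies on a shortest path between two others, and a movie’s normalized degree depends only on the size of its decade. A minimal sketch with two hypothetical decades (four and three movies) illustrates this; the sizes are made up.

```r
library(igraph)

# Two hypothetical decades as disjoint, fully connected groups
decade_a <- make_full_graph(4)
decade_b <- make_full_graph(3)
toy_decades <- disjoint_union(decade_a, decade_b)

# Every shortest path inside a clique is a direct edge, so nothing passes "through" a node
all_betweenness <- betweenness(toy_decades)

# Normalized degree is (group size - 1) / (total nodes - 1)
normalized_degree <- degree(toy_decades, normalized = TRUE)

print(all_betweenness)
print(normalized_degree)
```

Under this construction the largest decade always has the highest degree centrality, which matches the reading that the 2000’s contribute the most sampled movies.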

Interpretation

Based on these results, we now know which directors, movies, genres, and decades are the most important in this study. The most important directors are “Steven Spielberg”, “Lee Unkrich”, “David Hand”, and “Bill Roberts”. “Steven Spielberg” is considered one of the most important directors because he has the most movies on this list, as seen by his movies having the highest degree centrality in the Movie-Director network. “Bill Roberts” has the most connections with other directors in the Director-Director network, which puts him on this list. “David Hand” is important for making connections between different directors because he has the highest betweenness centrality in the Director-Director network. “Lee Unkrich” is an important director because he has the highest PageRank in the Director-Director network, with important movies such as “Toy Story 3”, which has the highest betweenness centrality and PageRank in the Movie-Director network.

The two most important decades in this dataset seem to be the 2000’s and the 1980’s. This is because the 2000’s have the highest degree centrality in the Movie-Decade network and the 1980’s have the highest PageRank. The 2000’s having the highest degree centrality means that this decade has the most movies on the list. The 1980’s having the highest PageRank means that movies made in this period have the most control over the flow of information compared to movies from the other decades.

Although these networks may never be “under attack”, it is still important to analyze their robustness and resilience. The Director-Director network and the Movie-Director network are the least robust because most of their components contain very few nodes, so a single attack on most components would be detrimental. The Movie-Decade network is more robust because of how interconnected it is, so an attack on this network would be less devastating. The conclusion for resilience is similar: because of its high degree, the Movie-Decade network would do much better than the other two at building back.
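One simple way to make the robustness comparison concrete is to remove the best-connected node from a component and count how many pieces remain. The sketch below does this for two hypothetical components; treating an “attack” as the removal of the highest-degree node is an assumption made for illustration, and `components_after_attack` is a hypothetical helper.

```r
library(igraph)

# Hypothetical sparse component: a hub "H" holding three directors together
sparse_net <- graph_from_literal(H - A, H - B, H - C)

# Hypothetical dense component: four movies from one decade, all connected
dense_net <- make_full_graph(4)

# Remove the highest-degree node and count the pieces left behind
components_after_attack <- function(g) {
  target <- unname(which.max(degree(g)))
  count_components(delete_vertices(g, target))
}

print(components_after_attack(sparse_net))
print(components_after_attack(dense_net))
```

The hub-based component falls apart into isolated nodes, while the fully connected one stays in a single piece.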

Conclusions

The goal of this project was to find the most important directors and movies of all time and to identify which genres and directors have the largest impact on what makes a successful movie.

From this data and the processes used to plot and interpret it, I have learned which nodes are the most important in each plot. From the analyses, I learned that Steven Spielberg appears to be one of the most important directors, if not the most important. I came to this conclusion because he has more movies on this list than any other director, as seen from his movies having the highest degree centrality in the Movie-Director network. Bill Roberts, David Hand, and Lee Unkrich are also very important directors, and they all happen to work in animation. They are considered important because each of them had the highest value for one centrality measure in the Director-Director network. The 2000’s and 1980’s are also important periods for movies because 2000’s movies had the highest degree centrality and 1980’s movies had the highest PageRank.

We have learned that the movie industry is not exempt from change. As popular directors and genres change, new insights can be drawn about the film industry. At this moment in time, Steven Spielberg continues to show his dominance in film, and who knows when this title will be passed on to someone else? Animated movies have long been a staple as well. Will this ever change? Drama, Comedy, and Crime have been extremely popular genres, while Westerns and Mysteries seem to fade away. As this dataset was made in 2015 and therefore doesn’t have movies from the second half of the 2010’s, we still have more to learn about whether movies will continue to improve or whether the trend will resemble a bell curve, with fewer and fewer great movies each decade as originality becomes more difficult.

An area of further study that future researchers should consider is how successful genres change over time. “Successful” could mean the number of movies on the list or the average IMDb rating. This would give even more insight into what has made movies successful and how that could change in the future. In addition, if data on the main actors for each film could be collected, adding it to the analysis to find the most successful actors could be interesting.
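As a starting point for that follow-up, the two definitions of success could be tabulated per decade and genre with base R. The sketch below runs on made-up rows that mimic the columns used in this project; the values are hypothetical, and using only the first listed genre is an assumption carried over from the plotting code.

```r
# Hypothetical rows mimicking the columns used in this project
toy <- data.frame(
  Year = c(1984, 1987, 1995, 1999, 2003, 2008),
  Genres = c("Drama, Crime", "Drama", "Drama", "Crime, Drama", "Comedy, Drama", "Drama"),
  IMDb.Rating = c(8.1, 7.7, 8.4, 8.0, 7.8, 8.6),
  stringsAsFactors = FALSE
)

# Derive the decade and keep only the first listed genre
toy$Decade <- (toy$Year %/% 10) * 10
toy$FirstGenre <- trimws(sapply(strsplit(toy$Genres, ","), `[`, 1))

# Number of movies and mean rating for each decade and genre
genre_counts <- aggregate(IMDb.Rating ~ Decade + FirstGenre, data = toy, FUN = length)
genre_means  <- aggregate(IMDb.Rating ~ Decade + FirstGenre, data = toy, FUN = mean)

print(genre_counts)
print(genre_means)
```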

References

[1] TopTenner, “2015 edition: Top10ner’s 1001 ’greatest’ movies of all time.” IMDb, 2015. Available: https://www.imdb.com/list/ls074674014/
[2] R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available: https://www.R-project.org/

Appendix A: Plots and Tables

The following table displays the key topographical measures of the three graphs that were a part of this project.

Topographical Measures of Graphs

Measure                           Directors       Movies      Decades
Vertex Count                            156          478           96
Edge Count                              269         7178          571
Diameter                                 18           11           41
Density                           0.0222498    0.0629633    0.1252193
Component Count                           3            3            3
Transitivity                      0.1034483    0.2322702    0.1959184
Centralization Degree             0.1777502    0.3185881    0.0958333
Centralization Closeness                NaN          NaN          NaN
Centralization Betweenness        0.1563690    0.0840537    0.3514768
Centralization Eigenvalue         1.0007007    0.9767101    0.8177795
Cohesion                                  0            0            0
Compactness                             NaN          NaN          NaN
Global Clustering Coefficient     0.1034483    0.2322702    0.1959184

These topographical measures give us some important insight into the three networks. First, the Decades network has the highest density because every node within a component is connected to every other node in that component and there are fewer components. The transitivity and global clustering coefficient are highest in the Movies network because most of the components (directors) seem to have around three movies on the list. The Directors network has the highest eigenvalue centralization because of the one component that has far more nodes and connections than any of the others. The nodes in that component also seem to be ones that could be considered important.
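Several of the measures in the table can be obtained directly from igraph. The sketch below computes a few of them for a small hypothetical network (one triangle plus one separate pair); it is not the code that generated the table, and `graph_measures` is a hypothetical name. Density here is 2m / (n(n-1)) for n vertices and m edges, which agrees with the tabulated values (for example, 2 × 269 / (156 × 155) ≈ 0.0222 for the Directors network).

```r
library(igraph)

# Hypothetical network: one triangle plus one separate pair
toy_net <- disjoint_union(make_full_graph(3), make_full_graph(2))

graph_measures <- c(
  vertex_count    = vcount(toy_net),
  edge_count      = ecount(toy_net),
  density         = edge_density(toy_net),
  component_count = count_components(toy_net),
  transitivity    = transitivity(toy_net, type = "global"),
  degree_central  = centr_degree(toy_net)$centralization
)

print(round(graph_measures, 4))
```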

Appendix B: Code/syntax

library(igraph)
library(network)
library(kableExtra)
library(RColorBrewer)

data <- read.csv("Movie Dataset.csv", header = TRUE, as.is = TRUE)

# Split the Directors column into individual directors; the pattern ",\\s*"
# also drops the space after each comma so names match across movies
all_directors <- strsplit(as.character(data$Directors), ",\\s*")

# Flatten the list and get unique directors
unique_directors <- unique(unlist(all_directors))

# Initialize variables
num_pairs <- 0
edge_list <- matrix(nrow = 0, ncol = 2)

# Create pairs of directors that are a part of the same movie
for (directors_list in all_directors) {
  if (length(directors_list) > 1) {
    director_pairs <- combn(directors_list, 2)
    num_pairs_article <- ncol(director_pairs)
    edge_list <- rbind(edge_list, t(director_pairs))
    num_pairs <- num_pairs + num_pairs_article
  }
}

# Create an edgelist
edge_list_df <- as.data.frame(edge_list)
colnames(edge_list_df) <- c("director1", "director2")

# Build an undirected graph from the edge list
net <- graph_from_data_frame(edge_list_df, directed = FALSE)

# Find communities on a simplified copy of the graph (no multi-edges or loops)
netS <- simplify(net)
communities <- cluster_edge_betweenness(netS)
community_ids <- membership(communities)
V(net)$community <- as.integer(community_ids[V(net)$name])

#Find the degree and assign colors to different communities
V(net)$degree <- igraph::degree(net)
Community <- length(unique(V(net)$community))
color_pal <- rainbow(Community)

# Plot the network, coloring nodes by community (node size is fixed)
plot(net,
     vertex.label = NA,
     vertex.color = color_pal[as.numeric(factor(V(net)$community))],
     vertex.size = 5,
     main = "Director-Director Network"
)

# Create an empty object to store the movie-movie edges
edge_listM <- c()

# Add edges for shared directors.
# Note: %in% compares the whole Directors string, so a co-directed movie only
# matches when its full string equals one of the names in directors1. Pairs are
# formed among the other matching movies (movie1 itself is excluded), so the
# same pair can be added on more than one iteration.
for (i in seq_len(nrow(data))) {
  movie1 <- data$Title[i]
  directors1 <- strsplit(as.character(data$Directors[i]), ", ")[[1]]

  shared_movies <- data[data$Directors %in% directors1 & data$Title != movie1, "Title"]

  if (length(shared_movies) > 1) {
    movie_pairs <- combn(shared_movies, 2)
    edge_listM <- rbind(edge_listM, t(movie_pairs))
  }
}

edge_listM_df <- as.data.frame(edge_listM)
colnames(edge_listM_df) <- c("movie1", "movie2")

netM <- graph_from_data_frame(edge_listM_df, directed = FALSE)

# Get node names from the graph
node_names <- V(netM)$name


# Create a lookup table from original_dataset
lookup_table <- setNames(data$IMDb.Rating, data$Title)

# Iterate through each node in the graph
for (node_name in node_names) {
  # Find corresponding title in the lookup table
  title <- node_name
  if (title %in% names(lookup_table)) {
    # Update IMDb rating attribute in the graph
    netM <- set_vertex_attr(netM, "IMDb.Rating", index = node_name, value = lookup_table[[title]])
  }
}
# Create a lookup table from original_dataset
lookup_table2 <- setNames(data$Genres, data$Title)

# Iterate through each node in the graph
for (node_name in node_names) {
  # Find corresponding title in the lookup table
  title <- node_name
  if (title %in% names(lookup_table2)) {
    # Get genres for the movie from the lookup table
    genres <- lookup_table2[[title]]
    
    # Ensure genres is of character type
    genres <- as.character(genres)

    # Split genres string by comma and take only the first genre
    genre_part <- strsplit(genres, ",")[[1]]
    first_genre <- trimws(genre_part[1]) 
    
    # Update Genre attribute in the graph with the first genre
    netM <- set_vertex_attr(netM, "Genre", index = node_name, value = first_genre)
  }
}

# Define the breaks for different size groups based on IMDb ratings
breaks <- c(0, 7.5, 8, 8.5, 9, 9.5, 10)  # Adjust the breaks as needed

# Create the size groups based on IMDb ratings
size_groups <- cut(V(netM)$IMDb.Rating, breaks = breaks, labels = FALSE)
color_pal <- rainbow(length(unique(V(netM)$Genre))) 

# Plot the network
plot(netM,
     vertex.label = NA,
     vertex.color = color_pal[factor(V(netM)$Genre)],
     vertex.size = size_groups * 3,
     main = "Movie-Director Network"
)

legend("left",
       legend = levels(factor(V(netM)$Genre)),
       fill = color_pal,
       title = "Genre",
       cex=0.7
)

# Create an empty object to store the movie-decade edges
edge_listD <- c()

unique_decades <- unique((data$Year %/% 10) * 10)


for (decade in unique_decades) {
  # Subset the data for the current decade
  decade_data <- data[(data$Year %/% 10) * 10 == decade, ]

  # Randomly keep one tenth of the decade's movies (at least one); no seed is
  # set, so the sample differs on every run
  sample_size <- max(1, floor(nrow(decade_data) / 10))
  sampled_movies <- sample(decade_data$Title, size = sample_size, replace = FALSE)

  # Connect every pair of sampled movies
  if (length(sampled_movies) > 1) {
    movie_pairs <- combn(sampled_movies, 2)
    edge_listD <- rbind(edge_listD, t(movie_pairs))
  }
}


edge_listD_df <- as.data.frame(edge_listD)
colnames(edge_listD_df) <- c("movie1", "movie2")

netD <- graph_from_data_frame(edge_listD_df, directed = FALSE)

# Get node names from the graph
node_names <- V(netD)$name


# Create a lookup table from original_dataset
lookup_table <- setNames(data$IMDb.Rating, data$Title)

# Iterate through each node in the graph
for (node_name in node_names) {
  # Find corresponding title in the lookup table
  title <- node_name
  if (title %in% names(lookup_table)) {
    # Update IMDb rating attribute in the graph
    netD <- set_vertex_attr(netD, "IMDb.Rating", index = node_name, value = lookup_table[[title]])
  }
}
# Create a lookup table from original_dataset
lookup_table2 <- setNames(data$Genres, data$Title)

# Iterate through each node in the graph
for (node_name in node_names) {
  # Find corresponding title in the lookup table
  title <- node_name
  if (title %in% names(lookup_table2)) {
    # Get genres for the movie from the lookup table
    genres <- lookup_table2[[title]]

    # Ensure genres is of character type
    genres <- as.character(genres)

    # Split genres string by comma and take only the first genre
    genre_part <- strsplit(genres, ",")[[1]]
    first_genre <- trimws(genre_part[1])

    # Update Genre attribute in the graph with the first genre
    netD <- set_vertex_attr(netD, "Genre", index = node_name, value = first_genre)
  }
}

# Define the breaks for different size groups based on IMDb ratings
breaks <- c(0, 7.5, 8, 8.5, 9, 9.5, 10)  # Adjust the breaks as needed

# Create the size groups based on IMDb ratings
size_groups <- cut(V(netD)$IMDb.Rating, breaks = breaks, labels = FALSE)
color_pal <- rainbow(length(unique(V(netD)$Genre)))

# Plot the network
plot(netD,
     vertex.label = NA,
     vertex.color = color_pal[factor(V(netD)$Genre)],
     vertex.size = size_groups * 3,
     main = "Movie-Decade Network"
)

legend("left",
       legend = levels(factor(V(netD)$Genre)),
       fill = color_pal,
       title = "Genre",
       cex=0.7
)

Appendix C: Acknowledgements

The R statistical language was used in the development of this article [2].


  1. Oregon Institute of Technology