Streaming platforms like Netflix have drastically changed media consumption by offering a wide range of films and TV shows built on collaborations between actors, directors, and production teams. These collaborations form a complex professional network that reveals how the industry is structured. Viewing the entertainment industry as a network can provide insight into how actors are connected and whether certain groups dominate. Exploring these networks also gives us a better understanding of social network theory applied to a real-world setting. The research question guiding this project was: How connected is the actor collaboration network in Netflix movies since 2015, and what does this reveal about collaboration trends in the platform’s productions?
The dataset used in this project was published on Kaggle by Shivam Bansal, who collected the data by scraping Netflix’s public catalog using automated tools. (The dataset is available here: https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download). It includes more than 8,000 Netflix movies and TV shows. I decided to focus on movies to limit clutter and make the data wrangling easier. Each vertex represents an actor, and an edge represents a collaboration between two actors who appeared in the same movie. I used the tidyverse, igraph, viridis, and ggraph packages.
# libraries I used for the project
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:lubridate':
##
## %--%, union
##
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
##
## The following objects are masked from 'package:purrr':
##
## compose, simplify
##
## The following object is masked from 'package:tidyr':
##
## crossing
##
## The following object is masked from 'package:tibble':
##
## as_data_frame
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
library(ggraph)
library(viridis) # this Stack Overflow answer introduced me to the viridis color palette, which I thought might be useful: https://stackoverflow.com/questions/45663162/map-values-to-viridis-colours-in-r
## Loading required package: viridisLite
# reading the Netflix dataset
netflix_data <- read.csv("netflix_titles.csv")
# cleaning the data:
# removing rows with missing actor info (read.csv keeps a blank cast as "", not NA)
# keeping titles from 2015 and up
# keeping titles that have between 2 and 5 actors
filtered_data <- netflix_data %>%
  filter(!is.na(cast), cast != "", release_year >= 2015) %>%
  mutate(num_actors = str_count(cast, ",") + 1) %>%
  filter(num_actors >= 2, num_actors <= 5)
# breaking the cast column so that it's one row per actor
long_cast <- filtered_data %>%
  select(title, cast, type) %>%
  separate_rows(cast, sep = ", ")
# making actor pairs for each movie
# I found this part a bit difficult, but this Stack Overflow answer was helpful for the unnesting: https://stackoverflow.com/questions/78350703/tidyverse-dplyr-solution-to-assigning-values-to-column-names-extracted-from-a-ne
actor_pairs <- long_cast %>%
  filter(type == "Movie") %>%
  group_by(title) %>%
  summarise(pairs = list(t(combn(cast, 2)))) %>%
  unnest_longer(pairs) %>%
  unnest_wider(pairs, names_sep = "_") %>%
  distinct() %>%
  # graph_from_data_frame() reads the first two columns as the edge endpoints,
  # so the title is moved to the end and the two actor columns come first
  relocate(title, .after = last_col())
# making the actor pairs into a network graph (undirected since collaboration goes both ways)
graph <- graph_from_data_frame(actor_pairs, directed = FALSE)
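# how many people an actor worked with
# (stored as actor_degree so the variable does not share a name with igraph's degree() function)
actor_degree <- degree(graph)
# making sure to keep actors who worked with more than 5 people
popular_actors <- names(actor_degree[actor_degree > 5])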
# updating the graph to only include more-connected actors
graph <- induced_subgraph(graph, vids = popular_actors)
# actors who worked with more than 35 others
really_popular <- names(degree(graph)[degree(graph) > 35])
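The pair-building step above can be checked on a small example. This is a minimal sketch in base R, assuming a made-up cast of three actors for one title (the names `toy_cast` and `toy_pairs` and the actor labels are hypothetical and not part of the dataset). It shows how `combn()` turns one cast list into the actor pairs that become edges.

```r
# Toy cast for one hypothetical title (names are made up for illustration)
toy_cast <- c("Actor A", "Actor B", "Actor C")

# combn() returns one pair per column; t() flips it to one pair per row
toy_pairs <- t(combn(toy_cast, 2))
colnames(toy_pairs) <- c("actor_1", "actor_2")

# Each row is one edge: A-B, A-C, B-C
print(toy_pairs)

# A cast of k actors yields choose(k, 2) pairs
nrow(toy_pairs) == choose(length(toy_cast), 2)
```

Since the report keeps titles with 2 to 5 actors, each title contributes between 1 and 10 pairs.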
The plot below shows a network of actors in Netflix movies and the connections between them based on shared appearances in the same movies. Each node represents a Netflix actor and each edge shows that those two actors appeared in the same film.
The color and size of the nodes are based on degree centrality (how many connections each actor has). More connected actors are shown larger and in brighter colors. I used the “plasma” color scale from the viridis package so that nodes with different degrees are easy to tell apart. I also used the “dh” (Davidson-Harel) layout to space out the nodes cleanly and reduce overlap. The edges are grey to keep the focus on the nodes. Overall, this visualization shows that a few actors clearly dominate in terms of connections, while others have fewer links.
# plotting the network of actors and their collaborations
# color and size based on how connected the actor is
# this source gave me information on how to use the Davidson-Harel layout: https://igraph.org/r/doc/layout_with_dh.html
ggraph(graph, layout = "dh") +
  geom_edge_link(alpha = 0.15, color = "darkgrey") + # grey edges
  geom_node_point(aes(size = degree(graph), color = degree(graph))) +
  scale_size_continuous(range = c(1, 5)) + # keep nodes small so it’s not messy
  scale_color_viridis_c(option = "plasma") + # nice color scale
  geom_node_text(aes(label = ifelse(name %in% really_popular, name, "")),
                 repel = TRUE, size = 3) + # only show names for really connected people
  theme_void() +
  labs(title = "Netflix Movie Actor Collaboration Network",
       size = "Degree (Actor Connections)",
       color = "Degree (Actor Connections)")
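The actor collaboration network in Netflix movies since 2015 is sparse even among the most connected actors, with a density of 0.0043. On average each actor is linked to about 2 others (average degree 1.97), and yet the average path length is only 1.82. This short distance needs a careful reading: igraph averages path lengths only over pairs of actors that can reach each other, and a graph with 457 vertices and 450 edges cannot be fully connected, since that would take at least 456 edges. The figure therefore shows that actors within the same cluster are only a step or two away from one another, which is suggestive of a small-world structure inside clusters rather than across the whole network.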
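The density and average degree quoted here follow directly from the vertex and edge counts. As a minimal sketch of that arithmetic in base R, using the 457 actors and 450 edges printed in this report (the variable names are illustrative only):

```r
# Counts as reported by vcount() and ecount() in this report
n_actors <- 457
n_edges  <- 450

# Density of an undirected graph: edges present / edges possible
possible_edges <- n_actors * (n_actors - 1) / 2   # 104196
density_check  <- n_edges / possible_edges
round(density_check, 4)                           # 0.0043

# Average degree: every edge adds one to the degree of two vertices
round(2 * n_edges / n_actors, 2)                  # 1.97

# A connected graph on n vertices needs at least n - 1 edges
n_edges >= n_actors - 1                           # FALSE
```

The last line is a quick sanity check on connectivity: with fewer than 456 edges, the 457 actors must be split across several separate components.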
# printing out some stats to describe the graph
# number of actors in the graph
cat("Number of actors in the graph:", vcount(graph), "\n")
## Number of actors in the graph: 457
# number of connections between them
cat("Number of collaborations (edges):", ecount(graph), "\n")
## Number of collaborations (edges): 450
# on average how many people did each actor work with?
avg_deg <- mean(degree(graph))
cat("Average degree (connections per actor):", round(avg_deg, 2), "\n")
## Average degree (connections per actor): 1.97
# how dense is the network? (1 is super connected, 0 is barely)
graph_density <- edge_density(graph)
cat("Network density:", round(graph_density, 4), "\n")
## Network density: 0.0043
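# how far apart are actors on average? (only pairs that can reach each other are counted)
# mean_distance() is the current igraph name for average.path.length()
avg_path <- mean_distance(graph)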
cat("Average path length:", round(avg_path, 2), "\n")
## Average path length: 1.82
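Because the average path length only counts pairs of vertices that can reach each other, it helps to look at connected components alongside it. The sketch below uses a made-up toy graph of two separate triangles (the vertex names and the object `toy` are hypothetical); it was not run on the report's graph, where calling `components(graph)` would be the analogous step.

```r
library(igraph)

# Toy graph: two separate triangles (vertex names are made up)
toy <- graph_from_literal(A - B, B - C, C - A, D - E, E - F, F - D)

# Number of connected components: the two triangles never touch
components(toy)$no

# Mean distance over reachable pairs only (unconnected = TRUE is the default)
mean_distance(toy, unconnected = TRUE)

# Share of vertices that sit in the largest component
max(components(toy)$csize) / vcount(toy)
```

In this toy graph every reachable pair is adjacent, so the mean distance is 1 even though half of the vertices can never reach the other half. A short average path on its own does not show that a network is connected as a whole.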
Overall, this project explored how actors collaborate in Netflix movies using network analysis. By examining connections between actors who appeared together since 2015, I found that the network is sparse overall, yet actors who can reach each other are typically only a couple of connections apart. This suggests that even in a selective casting environment, there are key individuals or clusters that keep parts of the network cohesive. This analysis shows how network theory can help us understand real-world collaboration patterns in the film industry and uncover hidden structures in streaming content. A limitation of this study is that it does not consider TV shows or look at individual actors by name, since it focused on movies and on the structure of the network rather than on specific connections. Future work could visualize and analyze connections between actors in TV shows and identify the most connected actors across both movies and TV shows. Even so, focusing on movies alone yielded many insights.