What’s your favorite movie?
This timeless question, asked everywhere from awkward dates to boardroom icebreakers, is wildly subjective. Some evaluate films based on artistic merit, meticulously analyzing the shot selection in the shower scene of “Psycho” or the way Daniel Day-Lewis triumphantly declares “I drink your milkshake” in “There Will Be Blood.” Some prefer the laughs provided by Ferris Bueller skipping school, and some would rather cry as a 1940s Ryan Gosling writes in a notebook. Some would rather just re-watch “Cars” because that Rascal Flatts song is just too damn catchy.
With movie tastes as diverse as the world around us, it’s a miracle that IMDB was able to successfully aggregate a cohesive, continuously evolving list of the best 250 films of all time, based on IMDB user scores. Given these scores, we can attempt to glean insights into what the world prioritizes when evaluating a film.
By evaluating the frequency of films on the list in relation to their release year, rating, cast, and genre, we can develop a hypothesis about what viewers value most when evaluating a film.
IMDB (Internet Movie Database) is one of the largest online databases for movies and television shows, providing comprehensive information about movies, including ratings and reviews from its vast user base. The IMDB ratings are widely used as a benchmark for the popularity and success of movies. To see the most up-to-date IMDB Top 250, please reference https://www.imdb.com/chart/top/.
The dataset used in this analysis was published by “Chidambara Raju G” on Kaggle. You may reference the original dataset at https://www.kaggle.com/datasets/rajugc/imdb-top-250-movies-dataset?resource=download.
Despite a range of 101 years (1921 to 2022), the median release year is 1994, suggesting more recent films are more likely to be on the list. The ratings are also pretty tightly grouped: with a max of 9.3 (out of 10) and a min of 8.0, we can confirm that most of these movies scored pretty close to each other, with only marginal differences in quality.
Interestingly, there is a massive spread in box office revenue and budget. Granted, some of this is likely due to inflation, but the point stands that you don’t need a big-budget blockbuster to make an iconic film. It’s definitely common, though: the mean budget was about $53M and the mean box office revenue was about $238M, although a handful of very large values pull those averages well above the medians of $15M and $74M.
summary(df)
## rank name year rating
## Min. : 1.00 Length:250 Min. :1921 Min. :8.000
## 1st Qu.: 63.25 Class :character 1st Qu.:1966 1st Qu.:8.100
## Median :125.50 Mode :character Median :1994 Median :8.200
## Mean :125.50 Mean :1986 Mean :8.307
## 3rd Qu.:187.75 3rd Qu.:2006 3rd Qu.:8.400
## Max. :250.00 Max. :2022 Max. :9.300
##
## genre certificate run_time budget
## Length:250 Length:250 Length:250 Min. :1.330e+05
## Class :character Class :character Class :character 1st Qu.:3.000e+06
## Mode :character Mode :character Mode :character Median :1.500e+07
## Mean :5.291e+07
## 3rd Qu.:5.100e+07
## Max. :2.400e+09
## NA's :44
## box_office casts directors writers
## Min. :6.700e+01 Length:250 Length:250 Length:250
## 1st Qu.:8.574e+06 Class :character Class :character Class :character
## Median :7.404e+07 Mode :character Mode :character Mode :character
## Mean :2.382e+08
## 3rd Qu.:3.218e+08
## Max. :2.799e+09
## NA's :33
## year1
## Min. :1921
## 1st Qu.:1966
## Median :1994
## Mean :1986
## 3rd Qu.:2006
## Max. :2022
##
str(df)
## Classes 'data.table' and 'data.frame': 250 obs. of 13 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "The Shawshank Redemption" "The Godfather" "The Dark Knight" "The Godfather Part II" ...
## $ year : int 1994 1972 2008 1974 1957 1993 2003 1994 2001 1966 ...
## $ rating : num 9.3 9.2 9 9 9 9 9 8.9 8.8 8.8 ...
## $ genre : chr "Drama" "Crime,Drama" "Action,Crime,Drama" "Crime,Drama" ...
## $ certificate: chr "R" "R" "PG-13" "R" ...
## $ run_time : chr "2h 22m" "2h 55m" "2h 32m" "3h 22m" ...
## $ budget : num 2.50e+07 6.00e+06 1.85e+08 1.30e+07 3.50e+05 2.20e+07 9.40e+07 8.00e+06 9.30e+07 1.20e+06 ...
## $ box_office : num 2.89e+07 2.50e+08 1.01e+09 4.80e+07 9.55e+02 ...
## $ casts : chr "Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,Clancy Brown,Gil Bellows,Mark Rolston,James Whitmore,Jeffr"| __truncated__ "Marlon Brando,Al Pacino,James Caan,Diane Keaton,Richard S. Castellano,Robert Duvall,Sterling Hayden,John Marley"| __truncated__ "Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,Maggie Gyllenhaal,Gary Oldman,Morgan Freeman,Monique Ga"| __truncated__ "Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,John Cazale,Talia Shire,Lee Strasberg,Michael V. Gazzo,G.D."| __truncated__ ...
## $ directors : chr "Frank Darabont" "Francis Ford Coppola" "Christopher Nolan" "Francis Ford Coppola" ...
## $ writers : chr "Stephen King,Frank Darabont" "Mario Puzo,Francis Ford Coppola" "Jonathan Nolan,Christopher Nolan,David S. Goyer" "Francis Ford Coppola,Mario Puzo" ...
## $ year1 : num 1994 1972 2008 1974 1957 ...
## - attr(*, ".internal.selfref")=<externalptr>
nrow(df)
## [1] 250
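The gap between the means and medians quoted above is easy to check by hand. Here is a minimal sketch in base R, assuming a toy data frame built from the first four rows shown in the str() output plus one hypothetical row with missing values (the `toy` name and that fifth row are mine, not part of the dataset):

```r
# Toy subset: the first four films from the str() output, plus one
# hypothetical row whose budget and box office are missing (NA)
toy <- data.frame(
  name       = c("The Shawshank Redemption", "The Godfather",
                 "The Dark Knight", "The Godfather Part II",
                 "Hypothetical Film"),
  year       = c(1994, 1972, 2008, 1974, 1950),
  budget     = c(2.50e7, 6.00e6, 1.85e8, 1.30e7, NA),
  box_office = c(2.89e7, 2.50e8, 1.01e9, 4.80e7, NA),
  stringsAsFactors = FALSE
)

# Span of release years: newest minus oldest
year_span <- diff(range(toy$year))

# Mean and median side by side; na.rm = TRUE skips the missing values
money_summary <- sapply(toy[c("budget", "box_office")], function(x) {
  c(mean = mean(x, na.rm = TRUE), median = median(x, na.rm = TRUE))
})

print(year_span)
print(money_summary)
```

Even in this tiny sample, one blockbuster is enough to drag the mean far above the median, which is why I’d lean on the median when describing a “typical” film on the list.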
This chart shows a left-skewed distribution for release years. It definitely makes sense that there are more recent films on the list, given that they’re more likely to have been seen and there are simply more newer films in general, but it’s interesting that the chart doesn’t rise continuously. In fact, there are significant drop-offs after the early 2000s.
Pop culture definitely reveres the era of “The Shawshank Redemption” and “Forrest Gump,” but is this based on real merit, or are we just nostalgic for the ’90s and ’00s? I wonder if the Nintendo 64, Bop It, and Furby would rank in the top 250 toys of all time.
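library(dplyr)
library(ggplot2)
library(ggthemes)  # provides theme_wsj()

# Keep only the columns needed for the release-year chart
MovieYears_df <- data.frame(
  name = df$name,
  year = df$year,
  stringsAsFactors = FALSE)

# Count how many Top 250 films were released in each year
ReleaseYearCount <- MovieYears_df %>%
  dplyr::group_by(year) %>%
  dplyr::summarize(count = n()) %>%
  data.frame()

# Line chart of films per release year; the color is set outside aes()
# so the line is drawn in red rather than mapped to a legend entry
ggplot(ReleaseYearCount, aes(x = year, y = count)) +
  geom_line(color = 'red', linewidth = 1) +
  labs(title = "Film Releases by Year", x = "Year", y = "Count of Films") +
  theme_wsj() +
  theme(axis.title = element_text(size = 12)) +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = 'none')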
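A year-by-year line can be noisy, so one way to sanity-check the drop-off is to bin the release years by decade. This is a minimal sketch in base R that uses only the ten release years visible in the str() output (the `years` vector is typed in by hand for illustration; on the full data you would use df$year instead):

```r
# The ten release years shown in the str() output for the top 10 films
years <- c(1994, 1972, 2008, 1974, 1957, 1993, 2003, 1994, 2001, 1966)

# Integer-divide by 10 to drop the final digit, then scale back up
decade <- (years %/% 10) * 10

# Number of films per decade
decade_counts <- table(decade)
print(decade_counts)
```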
Although we don’t have demographic data on the scorers, we may be able to make some assumptions based on the certificates (age ratings) of the films on the next tab…
Although we don’t know the ages of the individuals scoring these films, the fact that there are more movies on the list rated “R” than “G”, “PG”, and “PG-13” suggests that either it’s primarily adults submitting scores, or movie theaters have gotten way too relaxed about who they admit.
While we can’t truly assume the age of the viewer, we can hypothesize that viewers are more likely to favor films that contain mature content, whether that be crude humor, violence, sexual themes, or just mature topics in general.
library(dplyr)
library(plotly)

# Display order for the certificate categories
x <- c('G', 'PG', 'PG-13', 'R', 'Not Rated', 'Other')

Ratings_df <- df %>%
  # Count films per certificate
  dplyr::group_by(certificate) %>%
  dplyr::summarise(count = n()) %>%
  # Lump certificates with fewer than 17 films into "Other"
  dplyr::mutate(certificate = ifelse(count < 17, "Other", certificate)) %>%
  # Re-aggregate so "Other" becomes a single row
  dplyr::group_by(certificate) %>%
  dplyr::summarise(count = sum(count)) %>%
  # Order the rows by the display order defined above
  dplyr::mutate(certificate = factor(certificate, levels = x)) %>%
  dplyr::arrange(certificate) %>%
  data.frame()

# Donut chart of film counts by certificate
plot_ly(Ratings_df, labels = ~certificate, values = ~count) %>%
  add_pie(hole = 0.6) %>%
  layout(xaxis = list(categoryorder = "array",
                      categoryarray = x),
         title = "Count of Films by Rating")
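The “lump the small categories into Other” step in the code above is the part most likely to trip someone up, so here is the same idea as a minimal base R sketch. The `certificate` vector and the `min_count` threshold of 2 are made up for illustration (the analysis itself uses a threshold of 17 on the real data):

```r
# Hypothetical certificates for eight films (illustrative only)
certificate <- c("R", "R", "PG-13", "R", "Approved", "Passed", "PG-13", "G")

# Any certificate with fewer films than this is folded into "Other"
min_count <- 2

# Count films per certificate and find the rare ones
counts <- table(certificate)
rare <- names(counts)[counts < min_count]

# Relabel the rare certificates, then recount
lumped <- ifelse(certificate %in% rare, "Other", certificate)
lumped_counts <- table(lumped)

# Share of the list held by each category, in percent
lumped_share <- round(100 * prop.table(lumped_counts), 1)

print(lumped_counts)
print(lumped_share)
```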
This provides some insight into the maturity of the films, but what about the people actually in them? Let’s see if we can find any trends within the cast on the next tab…
This chart is sorted by how often an actor is cast in a Top 250 film.
What catches my eye first is that the top 15 are all men. What catches my eye next is the fact that Mark Ruffalo somehow snuck his way onto this list. I’m a fan of the guy, but I’ve never considered the MCU star to be in the same league as De Niro.
What this suggests to me is that actors don’t necessarily make it on this list because they’re the most talented. It might just be because they managed to secure a spot in one or more high-grossing franchises (looking at you, Harrison Ford).
Using the tooltip to see which films these actors are in, we can start to see some trends. First, almost all of these films are R- or PG-13-rated Action films or Dramas. Second, most of the films come from the roughly 30-year span from the mid-1980s to the mid-2010s. Finally, close to 90 of the 250 films (roughly a third) feature at least one of the actors in this chart. That’s a pretty significant number.
So, does the man make the movie or does the movie make the man? It’s hard to deny the equity some of these folks bring to the table: De Niro, Freeman, DiCaprio, etc. are synonymous with modern high-brow cinema.
library(dplyr)
library(tidyr)   # provides separate_rows()
library(plotly)

# Split the comma-separated cast string into one row per (film, actor) pair
cast_df <- df %>%
  separate_rows(casts, sep = ",")

# Count appearances per actor and build the hover text in the same pass:
# movies2 lists that actor's films, one per line (<br> is a line break)
CastCount <- cast_df %>%
  group_by(casts) %>%
  summarise(count = n(),
            movies2 = paste(name, collapse = "<br>")) %>%
  arrange(desc(count)) %>%
  head(15)
plot_ly(data = CastCount, x = ~count, y = ~casts,
hovertext = ~movies2,
type = 'bar',
orientation = 'h',
marker = list(color='darkgreen')) %>%
layout(title = 'Top 15 Actors',
xaxis = list(title = 'Number of Movies',
zeroline = FALSE,
gridwidth =2),
yaxis = list(title = '',
tickfont = list(size = 10),
categoryorder = "total ascending",
automargin = TRUE))
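The “close to 90 of the 250 films” figure is the kind of number worth recomputing rather than eyeballing from tooltips. Below is a minimal sketch in base R of how that share could be calculated. It assumes a toy data frame holding the first few cast names visible in the str() output, plus one hypothetical film so that not every film matches (`toy`, `top_n`, “Actor A” and “Actor B” are all illustrative):

```r
# First cast names visible in the str() output, plus one hypothetical film
toy <- data.frame(
  name  = c("The Shawshank Redemption", "The Godfather",
            "The Dark Knight", "The Godfather Part II",
            "Hypothetical Film"),
  casts = c("Tim Robbins,Morgan Freeman,Bob Gunton",
            "Marlon Brando,Al Pacino,James Caan",
            "Christian Bale,Heath Ledger,Morgan Freeman",
            "Al Pacino,Robert De Niro,Robert Duvall",
            "Actor A,Actor B"),
  stringsAsFactors = FALSE
)

# One character vector of actors per film
cast_list <- strsplit(toy$casts, ",", fixed = TRUE)

# Appearances per actor across all films
appearances <- table(unlist(cast_list))

# Names of the top_n most frequent actors (15 in the real chart)
top_n <- 2
top_actors <- names(sort(appearances, decreasing = TRUE))[seq_len(top_n)]

# TRUE for each film that features at least one of those actors
has_top_actor <- vapply(cast_list,
                        function(actors) any(actors %in% top_actors),
                        logical(1))

films_with_top <- sum(has_top_actor)   # number of matching films
share_with_top <- mean(has_top_actor)  # proportion of all films

print(top_actors)
print(films_with_top)
print(share_with_top)
```

Counting films rather than cast rows matters here: a film with two top actors in it should still only count once.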
If these actors are such a draw for the films they’re in, then maybe the genres they act in are just more likely to be seen, and thus rated favorably? While we don’t have a viewership count, we do have box office revenue on the next tab…
Let’s review what we know (or think we know) so far. We know there are more new films rated highly than old ones, we know there are a lot of R-rated movies on the list, and we know a third of these films feature at least one of 15 specific men (and one of them is Mark Ruffalo, which I still can’t believe).
This heatmap helps put some of these puzzle pieces together to tell a clearer story. First, we see that although R-rated films have more coverage on the map, the most revenue is actually concentrated in PG-13, which makes sense, as there’s a larger customer base. Second, we see that the most money is in PG and PG-13 “Action” and “Adventure” films, with third place going to R-rated Horror (no one likes PG-13 horror, let’s be real).
So, the hottest parts of the map line up with the kinds of films our top 15 actors appear in.
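library(ggplot2)
library(scales)  # provides dollar_format()

# Box office in billions of dollars, rounded to two decimals
df$box_officeB <- round(df$box_office / 1000000000, 2)
# Collapse the less common certificates into a single "Other" bucket
df$certificate_new <- ifelse(df$certificate %in% c("X","Unrated","TV-PG","TV-MA","Passed","Not Rated",
                                                   "Not Available","GP","Approved","18+","13+"),
                             "Other", df$certificate)
# Treat the first listed genre as the film's primary genre
df$primarygenre <- gsub(",.*", "", df$genre)

ggplot(df, aes(x = certificate_new, y = primarygenre, fill = box_officeB)) +
  geom_tile(colour = 'white', linewidth = 0.25) +
  labs(title = "Heatmap: Box Office by Genre + Rating",
       x = "Rating",
       y = "Genre",
       fill = "Box Office $") +
  theme(panel.background = element_blank()) +
  # primarygenre is a character column, so levels() would return NULL;
  # reversing the sorted unique values lists genres A to Z from the top
  scale_y_discrete(limits = rev(sort(unique(df$primarygenre))), expand = c(0, 0)) +
  scale_fill_continuous(
    low = 'lightblue',
    high = 'blue',
    labels = dollar_format(prefix = "$", suffix = "B"))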
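One caveat with the tile plot: several films can share the same genre and rating cell, and as far as I understand geom_tile(), it draws one tile per row, so overlapping films cover each other rather than adding up. If the goal is total revenue per cell, the data could be aggregated first. Here is a minimal sketch in base R, assuming a toy data frame built from the four films shown in the str() output (`toy` and `cell_totals` are illustrative names):

```r
# Genre, certificate, and box office for the first four films in str()
toy <- data.frame(
  genre       = c("Drama", "Crime,Drama", "Action,Crime,Drama", "Crime,Drama"),
  certificate = c("R", "R", "PG-13", "R"),
  box_office  = c(2.89e7, 2.50e8, 1.01e9, 4.80e7),
  stringsAsFactors = FALSE
)

# Primary genre is everything before the first comma
toy$primarygenre <- sub(",.*", "", toy$genre)

# Sum box office within each (certificate, primary genre) cell;
# the formula interface drops rows where box_office is NA by default
cell_totals <- aggregate(box_office ~ certificate + primarygenre,
                         data = toy, FUN = sum)

print(cell_totals)
```

The resulting data frame has one row per cell, so passing it to ggplot() with geom_tile() would color each tile by its summed revenue. Swapping `sum` for `median` would show the typical film in each cell instead.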
This view helps us visualize how the top 250 scores were distributed across genre, rating, and decade. Most notably, we can see that the ’90s had a huge surge in Dramas and Crime films. The drama trend carried through the 2010s, but crime dropped off, supplanted by more traditional action flicks. I suppose it was hard to top Pulp Fiction.
The wave of green from the 1970s onwards reinforces how frequently R-rated films make the list. Despite being introduced in the 1980s (thanks, Gremlins), PG-13 doesn’t really start to make waves until the 2000s (led by Lord of the Rings, Nolan’s Batman trilogy, etc.), and it holds strong through the 2010s as the Marvel Cinematic Universe grows in popularity.
Is it a coincidence that we start to see a spike in action films once PG-13 went mainstream? My theory is that once Hollywood had a template for content that was more mature than a kids’ film but didn’t restrict its audience to adults, the floodgates opened, drawing more talented folks to these projects and producing higher-quality, better-rated action films.
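library(ggplot2)
library(scales)  # provides comma

# df_top7 and the decade column are not defined in the code shown here;
# they need to exist before this chart will run.
# Bars stack each film's rating, so bar length reflects how many films
# fall in a genre as well as how highly they scored.
ggplot(df[df$primarygenre %in% df_top7$primarygenre, ],
       aes(x = primarygenre, y = rating, fill = certificate_new)) +
  geom_bar(stat = "identity", position = "stack") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  labs(title = "Top 7 Genres by Rating by Decade",
       x = "Genre",
       y = "Rating",
       fill = "Certificate") +
  scale_fill_brewer(palette = "Set2") +
  facet_wrap(~decade, ncol = 4, nrow = 3)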
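The chart code above relies on a decade column and a df_top7 table that aren’t shown in this write-up. Here is a minimal sketch in base R of one way they might be built. This is my assumption about their shape (a numeric decade such as 1990, and a one-column data frame of the most common primary genres), not the original code, and `toy` stands in for the full data:

```r
# Year and genre for the first four films shown in the str() output
toy <- data.frame(
  year  = c(1994, 1972, 2008, 1974),
  genre = c("Drama", "Crime,Drama", "Action,Crime,Drama", "Crime,Drama"),
  stringsAsFactors = FALSE
)

# Primary genre is everything before the first comma
toy$primarygenre <- sub(",.*", "", toy$genre)

# Decade of release, e.g. 1994 becomes 1990
toy$decade <- (toy$year %/% 10) * 10

# Primary genres ordered from most to least frequent
genre_counts <- sort(table(toy$primarygenre), decreasing = TRUE)

# Keep up to seven of the most common genres (assumed meaning of "top 7")
df_top7 <- data.frame(
  primarygenre = head(names(genre_counts), 7),
  stringsAsFactors = FALSE
)

print(toy[c("year", "decade", "primarygenre")])
print(df_top7)
```

With only three distinct genres in the toy data, all three are kept. On the full list the same code would trim the table to seven.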
What’s your favorite movie?
I’d wager it’s an R-rated drama or crime film from the 1990s. Bonus points if it has De Niro.
Through this analysis, there are a few key takeaways: 1) the films on the list skew newer, 2) despite not being the biggest box office wins, R-rated films are the most common in the top 250, 3) star power matters: there is a high concentration of the same actors in the top 250, and 4) PG-13 has enabled a new era of high-grossing, high-quality films that are competing with dramas for spots on the chart.
Martin Scorsese recently made comments that he feels that movies like Avengers: Endgame aren’t “real cinema,” and instead are more like an amusement attraction. I have to imagine that this kind of comment, made by one of the greatest filmmakers of all time, has to be driven by some level of resentment (dare I say, jealousy?) that his action/crime/drama masterclasses like “The Departed” and “Goodfellas” are getting sidelined by a new-age phenom.
Although I am sure that there are plenty of analysts working for Universal and Warner Bros. who already know what resonates most with viewers, if this study were to continue, I’d recommend that demographic and psychographic data on movie-goers be considered. We can build hypotheses based on the castings and genres, but establishing causation would require further study.