Final Project

knitr::opts_chunk$set(warning = FALSE, message = FALSE)

The Dataset

General Overview of the Raw Dataset

For my dataset, I chose a movies dataset from Kaggle containing 7,668 movies released between 1980 and 2020. The dataset was collated by scraping the top ~200 movies by popularity for each year from the IMDb website.

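library(tidyverse)
library(lubridate) # provides mdy() and floor_date(); not attached by tidyverse versions before 2.0.0
movies_raw <- read.csv("movies.csv")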

Before performing any tidying and data manipulation, we see that each movie has the following 15 attributes recorded alongside it:

budget: the budget of a movie. Some movies don’t have this, so it appears as 0
company: the production company
country: country of origin
director: the director
genre: main genre of the movie.
gross: revenue of the movie
name: name of the movie
rating: rating of the movie (R, PG, etc.)
released: release date (YYYY-MM-DD) and release country
runtime: duration of the movie
score: IMDb user rating
votes: number of user votes
star: main actor/actress
writer: writer of the movie
year: year of release

Raw Dataset Cleaning

While the dataset is mostly clean in terms of its format, some variables need to be mutated or reinterpreted to better support our future analysis. First, we should mutate string variables which are actually categorical to instead be factors. Additionally, the “released” variable should really be two variables – one indicating the date of the release and one indicating the release country.

movies1 <- movies_raw %>%
  mutate(rating = factor(rating), genre = factor(genre), country = factor(country)) %>% # make categorical vars factors
  separate(released, into = c("release_date", "release_country"), sep = " \\(|\\)", extra = "drop") %>% # split released into date and country; drop the empty piece after ")"
  mutate(release_country = factor(release_country), release_date = mdy(release_date)) %>% # mutate to correct types
  mutate(release_date = if_else(is.na(release_date), mdy(paste("January 1,", year)), release_date)) # account for missing days/months
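To make the parsing step concrete, here is a minimal sketch on two made-up `released` strings, one with a full date and one with only a year. The toy values and the name `toy_parsed` are assumptions for illustration; the real column may contain other shapes that this does not cover.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy rows shaped like the `released` column (illustrative values only)
toy <- data.frame(
  released = c("June 13, 1980 (United States)", "1980 (United States)"),
  year = c(1980, 1980)
)

toy_parsed <- toy %>%
  # split "date (country)" on " (" and ")"; drop the empty piece after ")"
  separate(released, into = c("release_date", "release_country"),
           sep = " \\(|\\)", extra = "drop") %>%
  # mdy() gives NA (with a warning) when the day and month are missing
  mutate(release_date = suppressWarnings(mdy(release_date))) %>%
  # fall back to January 1 of the recorded year for those rows
  mutate(release_date = if_else(is.na(release_date),
                                mdy(paste("January 1,", year)),
                                release_date))
```

The fallback date is a placeholder, so any month-level analysis of those rows should be treated with care.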

Next, note that budget and gross revenue are given in terms of their values at the time of release; that is, the data has not been adjusted for inflation. This limits any analysis of budget or gross over time, because older figures will seem artificially smaller than present-day ones. To account for this, we should adjust both budget and gross for inflation. FRED (https://fred.stlouisfed.org/series/CPIAUCSL) records the Consumer Price Index (CPI) every month from 1947 onwards. The CPI measures how the prices of goods and services have changed over time, so an increase in CPI indicates inflation. As expected, CPI has generally increased over time:

library(quantmod)
library(lubridate)
library(ggplot2)
library(scales)
getSymbols("CPIAUCSL", src='FRED')

[1] "CPIAUCSL"

cpi_data <- data.frame(approx_date = index(CPIAUCSL), CPI = coredata(CPIAUCSL)) %>%
  rename(CPI = CPIAUCSL)
ggplot(cpi_data, aes(x = approx_date, y = CPI)) +
  geom_line(color = "blue", size = 1.5) + 
  labs(x = "Date", y = "CPI") +
  ggtitle("Inflation as Measured by CPI Over Time") +
  theme_minimal()

To adjust our numbers for inflation, we can join the CPI dataframe from FRED with our movies dataframe by month. I chose to baseline at 2020 since that is the year of the most recent movie in our dataset. Then, to adjust for inflation, we multiply each movie's budget (or gross) by the 2020 CPI and divide by the CPI for the month of the movie's release. After the adjustment, the ratio of adjusted budget to budget, and of adjusted gross to gross, equals the baseline CPI divided by the release-month CPI, which confirms that the adjustment was applied correctly:

movies2 <- movies1 %>%
  mutate(approx_date = floor_date(release_date, "month")) %>%
  merge(cpi_data, by="approx_date", all.x = TRUE)
cur_cpi = tail(movies2$CPI, n=1)
movies3 <- movies2 %>%
  mutate(adj_budget = budget * cur_cpi / CPI,
         adj_gross = gross * cur_cpi / CPI)
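As a worked illustration of the adjustment, the sketch below applies adjusted = nominal * (baseline CPI / release-month CPI) to made-up numbers. The CPI values and budgets are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Hypothetical CPI values and nominal budgets, for illustration only
cpi_release <- c(100, 200, 250)   # CPI in each movie's release month
cpi_base    <- 250                # CPI in the baseline month
nominal     <- c(1e6, 1e6, 1e6)   # nominal budgets in dollars

# adjusted = nominal * (baseline CPI / release-month CPI)
adjusted <- nominal * cpi_base / cpi_release

# sanity check: the adjusted-to-nominal ratio is baseline CPI / release CPI
ratio <- adjusted / nominal
```

A movie released when CPI was 100 has its budget scaled by 2.5, while one released in the baseline month is left unchanged.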

After adjusting for inflation, we can plot budget against adjusted budget, and gross against adjusted gross, colored by release date to visualize the effect of the adjustment. I could not figure out how to get the legend ticks to show actual dates, so note that blue represents the oldest movies and green the most recent.

ggplot(movies3, aes(x = budget, y = adj_budget, color = release_date)) +
  geom_point(alpha=0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(x = "Budget", y = "Adjusted Budget", color = "Date") +
  ggtitle("Scatter Plot of Budget vs Adjusted Budget") +
  scale_color_gradient(low = "blue", high = "green")

ggplot(movies3, aes(x = gross, y = adj_gross, color = release_date)) +
  geom_point(alpha=0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(x = "Gross", y = "Adjusted Gross", color = "Date") +
  ggtitle("Scatter Plot of Gross vs Adjusted Gross") +
  scale_color_gradient(low = "blue", high = "green")
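One possible way to get date labels on the color legend is ggplot2's `scale_colour_date()`, which takes the same `low` and `high` colors as `scale_color_gradient()`. This is a sketch on toy data rather than the code that produced the plots; the data frame `toy` and its values are made up.

```r
library(ggplot2)

# Toy data standing in for movies3 (illustrative values only)
toy <- data.frame(
  budget = c(1e6, 5e6, 2e7),
  adj_budget = c(3e6, 9e6, 2e7),
  release_date = as.Date(c("1980-06-13", "2000-05-01", "2020-01-15"))
)

p <- ggplot(toy, aes(x = budget, y = adj_budget, color = release_date)) +
  geom_point(alpha = 0.5) +
  # a date-aware gradient: same blue-to-green ramp, but the legend is
  # labelled with dates instead of raw day counts
  scale_colour_date(low = "blue", high = "green") +
  labs(x = "Budget", y = "Adjusted Budget", color = "Date")
```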

We observe that, as expected, each movie's figures are scaled according to its age, with older movies having their adjusted budgets and grosses increased more than newer ones. Additionally, all movies have their figures adjusted upwards, since we are baselining to the most recent date in the dataset, by which point inflation has pushed CPI higher than ever.

General Overview of the Cleaned Dataset

Now that the movies dataset has been sufficiently cleaned, we can provide a general overview of the dataset to get a feel for the data. Our dataset now has a total of 15 primary columns of interest all with their appropriate data types:

name: the title of the movie (string)
rating: the rating of the movie (R, PG, etc.) (factor)
genre: the main genre of the movie (factor)
year: year of release (double)
release_date: the date the movie was first released (date)
score: IMDb user rating (double)
votes: number of user votes (double)
director: the director (string)
writer: writer of the movie (string)
star: main actor/actress (string)
country: the country of origin (factor)
adj_budget: the inflation adjusted budget of a movie (double)
adj_gross: the inflation adjusted box office revenue of the movie in the US (double)
company: the production company (string)
runtime: duration of the movie (double)

There are 4 different types of data captured in this dataset: time, string, numeric, and categorical. To break down the general patterns of the dataset, let’s look at the distribution of each type. First, we can visualize the distribution of the categorical variables (movie rating, genre, and country) using bar charts:

library(gridExtra)

ratings <- movies3 %>% group_by(rating) %>% summarise(count = n()) %>% mutate(percentage = round(count / sum(count) * 100, 2))
plot1 <- ggplot(ratings, aes(x = rating, y = percentage)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Rating",
       x = "Rating",
       y = "Percentage")
genres <- movies3 %>% group_by(genre) %>% summarise(count = n()) %>% mutate(percentage = round(count / sum(count) * 100, 2))
plot2 <- ggplot(genres, aes(x = reorder(genre, -percentage), y = percentage)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Genre",
       x = "Genre",
       y = "Percentage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
countries <- movies3 %>% group_by(country) %>% summarise(count = n()) %>%
  mutate(percentage = round(count / sum(count) * 100, 2)) %>% arrange(desc(count))
plot3 <- ggplot(head(countries, 20), aes(x = reorder(country, -percentage), y = percentage)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Country, Top 20",
       x = "Country",
       y = "Percentage") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(plot1, plot2, plot3, ncol = 1)

Across all categorical variables, the data is distributed highly unevenly. For rating, 92% of all movies are rated “PG”, “PG-13”, or “R”. For genre, comedy, action, and drama dominate, together holding 71% of the total distribution. Lastly, country has the most lopsided distribution, with the United States alone accounting for 71% of all countries of origin. This is a testament to how much Hollywood dominates the film industry, with the US having 7x the share of popular movie production of the next closest country, Great Britain.
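Combined shares like these can be computed directly rather than read off the charts. The sketch below uses a made-up genre vector, and the helper name `top_k_share` is hypothetical.

```r
# Toy genre vector (illustrative values, not the real data)
genre <- c("Comedy", "Comedy", "Comedy", "Action", "Action",
           "Drama", "Drama", "Horror", "Crime", "Comedy")

# combined percentage share of the k most common levels
top_k_share <- function(x, k) {
  shares <- sort(prop.table(table(x)), decreasing = TRUE)
  sum(head(shares, k)) * 100
}

top3 <- top_k_share(genre, 3)
```

Here Comedy, Action, and Drama cover 8 of the 10 toy rows, so `top3` is 80.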

Next, we can look at the numerical variables associated with the dataset. Namely, these consist of movie score, adjusted budget, adjusted gross, and runtime. To construct a high level overview of each variable, we can create a table comparing the summary statistics, distribution, and percent missing for each:

# inspired by Linus Jen final project: https://dacss.github.io/DACSS_601_Summer2023_Sec1/posts/LinusJen_FinalProject.html
cat_cols = movies3[, c("score", "adj_budget", "adj_gross", "runtime")]
print(summarytools::dfSummary(cat_cols,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

cat_cols

Dimensions: 7668 x 4
Duplicates: 12

Variable     Mean (sd)              Min      Median     Max          IQR (CV)         Distinct  Missing
score        6.4 (1)                1.9      6.5        9.3          1.3 (0.2)        72        3 (0.0%)
adj_budget   49356607 (50471516)    5053.8   33390382   380210878    48810522 (1)     4785      2171 (28.3%)
adj_gross    106885864 (207963377)  501.4    33228567   3565566383   100407325 (1.9)  7478      189 (2.5%)
runtime      107.3 (18.6)           55       104        366          21 (0.2)         138       4 (0.1%)

Generated by summarytools 1.0.1 (R version 4.3.2)
2024-01-31

Analyzing movie score first, we see that the average movie scored a 6.4 with a relatively tight standard deviation of 1. Moving on, we observe that the average movie had an adjusted budget of 49 million dollars and adjusted gross of 107 million dollars, but the standard deviation for both is quite large. Specifically, the histogram shows that movie budgets and movie grosses are heavily skewed right. This makes sense, as only a handful of movies with blockbuster budgets get produced each year, but our dataset captures the most popular 200 movies from each year. It is also important to note that 28.3% of movie budgets are missing, so it will be important to account for those missing entries before performing any sort of analysis on budget. Lastly, we see that the typical movie had a runtime of 107 minutes, with the shortest movie being just 55 minutes and the longest a whopping 366 minutes.

We can briefly look at the range of times for the movie dataset to get a feel for how well represented various times are:

year_counts <- movies3 %>% group_by(year) %>% summarise(count = n())
ggplot(year_counts, aes(x = year, y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Year",
       x = "Year",
       y = "Count")

With the exception of the first 5 years and the last year, every year has a full 200 movies scraped. There likely was not enough data on IMDb to be scraped during the first 5 years of movie releases, and the data was scraped toward the beginning of 2020 so this year is not fully captured.

Lastly, let’s take a big picture look at the string data associated with the dataset:

str_cols = movies3[, c("name", "director", "writer", "star", "company")]
print(summarytools::dfSummary(str_cols,
                              varnumbers=FALSE,
                              plain.ascii=FALSE,
                              style="grid",
                              graph.magnif = 0.70,
                              valid.col=FALSE),
      method="render",
      table.classes="table-condensed")

Data Frame Summary

str_cols

Dimensions: 7668 x 5
Duplicates: 0

name [character], missing: 0 (0.0%)

Value                          Freq   Percent
1. Anna                        3      0.0%
2. Fever Pitch                 3      0.0%
3. Hamlet                      3      0.0%
4. Hercules                    3      0.0%
5. Nobody's Fool               3      0.0%
6. Pulse                       3      0.0%
7. Venom                       3      0.0%
8. A Nightmare on Elm Street   2      0.0%
9. After the Wedding           2      0.0%
10. Aladdin                    2      0.0%
[ 7502 others ]                7641   99.6%

director [character], missing: 0 (0.0%)

Value                          Freq   Percent
1. Woody Allen                 38     0.5%
2. Clint Eastwood              31     0.4%
3. Directors                   28     0.4%
4. Steven Spielberg            27     0.4%
5. Ron Howard                  24     0.3%
6. Ridley Scott                23     0.3%
7. Steven Soderbergh           23     0.3%
8. Joel Schumacher             22     0.3%
9. Barry Levinson              20     0.3%
10. Martin Scorsese            19     0.2%
[ 2939 others ]                7413   96.7%

writer [character], missing: 0 (0.0%)

Value                          Freq   Percent
1. Woody Allen                 37     0.5%
2. Stephen King                31     0.4%
3. Luc Besson                  26     0.3%
4. John Hughes                 25     0.3%
5. David Mamet                 15     0.2%
6. William Shakespeare         15     0.2%
7. Joel Coen                   13     0.2%
8. Pedro Almodóvar             13     0.2%
9. Michael Crichton            12     0.2%
10. Wes Craven                 12     0.2%
[ 4526 others ]                7469   97.4%

star [character], missing: 0 (0.0%)

Value                          Freq   Percent
1. Nicolas Cage                43     0.6%
2. Robert De Niro              41     0.5%
3. Tom Hanks                   41     0.5%
4. Denzel Washington           37     0.5%
5. Bruce Willis                34     0.4%
6. Tom Cruise                  34     0.4%
7. Johnny Depp                 33     0.4%
8. Sylvester Stallone          32     0.4%
9. John Travolta               31     0.4%
10. Kevin Costner              29     0.4%
[ 2805 others ]                7313   95.4%

company [character], missing: 0 (0.0%)

Value                          Freq   Percent
1. Universal Pictures          377    4.9%
2. Warner Bros.                334    4.4%
3. Columbia Pictures           332    4.3%
4. Paramount Pictures          320    4.2%
5. Twentieth Century Fox       240    3.1%
6. New Line Cinema             174    2.3%
7. Touchstone Pictures         132    1.7%
8. Metro-Goldwyn-Mayer (MGM)   125    1.6%
9. Walt Disney Pictures        123    1.6%
10. TriStar Pictures           94     1.2%
[ 2376 others ]                5417   70.6%


We see that there is a huge range of values for all string columns, which makes sense because the space of possible names is large. There are 7512 distinct movie names, which is 156 lower than the number of rows in the dataset. This is because some movie names clash; “Hamlet”, for example, is originally a famous play and thus may have been produced by different studios 3 times under the same name. Shifting focus to director and writer, we see that Woody Allen takes the crown as the most frequent in both categories. There are significantly more distinct writers than directors, however, with only 2949 distinct directors captured by the dataset compared to 4536 distinct writers.

Looking at the most frequently occurring stars, we see many movie celebrities, as would be expected. Nicolas Cage, Robert De Niro, and Tom Hanks make up the top 3, and every name in the top 10 is easily recognizable to an avid moviegoer. However, the distribution of stars is still spread quite wide, with a total of 2815 distinct stars. Finally, movie company is the most top-heavy variable of the bunch, with Universal Pictures having a 4.9% share of all movies, closely followed by Warner Bros., Columbia Pictures, and Paramount Pictures. However, these top 4 still account for only 18% of movies produced, with 2382 other studios producing the remaining 82%.
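The distinct counts quoted here can be read off the summary (the 10 listed values plus the "others" count) or computed directly. A minimal sketch on toy data, where the data frame and its values are made up for illustration:

```r
# Toy string columns (illustrative values only)
toy <- data.frame(
  name = c("Hamlet", "Hamlet", "Venom", "Aladdin"),
  director = c("A", "B", "C", "C")
)

# number of distinct values in each column
n_distinct_by_col <- vapply(toy, function(col) length(unique(col)), integer(1))

# rows minus distinct names = how many rows reuse an existing title
n_repeated_names <- nrow(toy) - n_distinct_by_col[["name"]]
```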

Research Questions

1. How Have the Economics of Movie Making Changed Over Time?

As technology continues to advance, movie making inevitably has advanced along with it. For instance, streaming services like Netflix have become increasingly popular, making it easier and easier for movie watchers to enjoy a film from the comfort of their home rather than out at a theater. Alternatively, we could look to computer advancements making CGI better than ever, altering the final product that directors are able to create. As the environment that movies are produced in has changed over time, how have the economics of popular movies changed in turn?

To start, let’s take a look at movie budget and gross over time by plotting the median adjusted budget and median adjusted gross for each year:

movies_clean <- movies3 %>% filter(!is.na(adj_budget) & !is.na(adj_gross)) # remove movies where the budget or gross is not recorded
median_economics <- movies_clean %>% group_by(year) %>% summarise(median_budget = median(adj_budget), median_gross = median(adj_gross))
median_economics <- median_economics[median_economics$year != 2020, ] # 2020 is an outlier due to low sample size, so we remove it

ggplot(data = median_economics) +
  geom_point(aes(x = year, y = median_budget, color = "Median Budget"), alpha = 0.7) +
  geom_point(aes(x = year, y = median_gross, color = "Median Gross"), alpha = 0.7) +
  labs(x = "Year", y = "Median Value", title = "Median Budget and Gross Over Time") +
  scale_color_manual(values = c("Median Budget" = "red", "Median Gross" = "darkgreen")) +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  theme_minimal() + 
  guides(color = guide_legend(title = "Legend"))

We see that, somewhat surprisingly, median movie budget has not simply trended up each year. Rather, movie budget generally rose from 1980 to its peak in 1999, and then began a general downward trend to just above its original levels. On the other hand, median gross has been mostly rising from 1985 onwards, drastically outpacing the rate at which median budget rose. To drive home the comparison, let’s compute the box office profit of each movie as its adjusted gross minus its adjusted budget, and create a scatter plot vs time:

movies_clean$adj_profit <- movies_clean$adj_gross - movies_clean$adj_budget
median_profits <- movies_clean %>% group_by(year) %>% summarise(median_profit = median(adj_profit))
median_profits <- median_profits[median_profits$year != 2020, ] # 2020 is an outlier due to low sample size, so we remove it

ggplot(data = median_profits, aes(x = year, y = median_profit, color = median_profit)) +
  geom_point() +
  scale_color_gradient(low = "darkgreen", high = "lightgreen") +
  labs(x = "Year", y = "Profit", title = "Movie Profit Over Time") +
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  theme_minimal() +
  guides(color = "none")

We see that a typical popular movie has in fact been generally increasing in box office profitability over time, even when accounting for inflation. Median profit has steadily risen since 2000, when the median movie was rather close to not making a profit at all, up to an impressive 63 million dollars in 2019. Since the median profit of popular movies has steadily risen over the past 20 years, we can cast some doubt on the claim that the rise of streaming services is killing the theater industry.

Another interesting aspect of economics to analyze is how the market share of top movie production companies has changed over time. From our exploratory statistics, we know that Universal Pictures, Warner Bros., and Columbia Pictures are the top 3 most frequent companies, holding ~4% of all movies recorded in the dataset each. With this in mind, how has the total gross and market share of these 3 companies changed over time?

company_year_gross <- movies3 %>%
  group_by(year, company) %>%
  summarize(total_gross = sum(adj_gross, na.rm=TRUE))

yearly_total_gross <- movies3 %>%
  group_by(year) %>%
  summarize(yearly_total_gross = sum(adj_gross, na.rm=TRUE))

market_share <- company_year_gross %>%
  left_join(yearly_total_gross, by = "year") %>%
  mutate(market_share = total_gross / yearly_total_gross)
market_share <- market_share[market_share$year != 2020, ]

company_frequency <- movies3 %>%
  group_by(company) %>%
  summarise(frequency = n()) %>% 
  arrange(desc(frequency))
top_3_companies <- company_frequency$company[1:3]
top_3_market_share <- market_share %>%
  filter(company %in% top_3_companies)
top_3_market_share$market_share = top_3_market_share$market_share * 100

plot1 <- ggplot(top_3_market_share, aes(x = year, y = total_gross, fill = company)) +
  geom_bar(position="stack", stat = "identity") +
  labs(title = "Total Adjusted Gross of Top 3 Companies by Year",
       x = "Year",
       y = "Total Adjusted Gross") + 
  scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6))
plot2 <- ggplot(top_3_market_share, aes(x = year, y = market_share, fill = company)) +
  geom_bar(position="stack", stat = "identity") +
  labs(title = "Market Share of Top 3 Companies by Year",
       x = "Year",
       y = "Percentage Market Share")
grid.arrange(plot1, plot2, nrow = 2)

While the top 3 companies have generally increased their total gross over time, their combined market share has remained relatively stable by comparison. These 3 companies peaked in market share in 1982, holding over 40% of the total gross. However, that figure generally decreased from 2005 onwards, to only 20% in 2019. This indicates that while these 3 companies still control a significant portion of the movie making economy, they have much less combined dominance over the movie market than they did in the early 2000s.

We could also visualize the relative share of the top 5 movie making companies to each other over time, in order to see how much parity there is in movie production among the top companies year by year. In this graph, each company is ranked on how they performed in the year, and then we can compare their relative market share to each other:

top_5_companies <- market_share %>% group_by(year) %>% top_n(5, wt = market_share) %>%
  group_by(year) %>%
  mutate(rank = factor(rank(desc(market_share))))
top_5_companies$rank <- factor(top_5_companies$rank, levels = rev(levels(top_5_companies$rank)))
top_5_companies <- top_5_companies %>%
  group_by(year) %>%
  mutate(sum_market_share = sum(market_share)) %>%
  mutate(normalized_market_share = market_share / sum_market_share)

ggplot(top_5_companies, aes(x = year, y = normalized_market_share, fill = rank)) +
  geom_bar(position="stack", stat = "identity") +
  labs(title = "Market Share of Top 5 Companies by Year",
       x = "Year",
       y = "Relative Percentage Market Share")

What we observe is that, while relative levels of market share among the top 5 ranked companies have fluctuated some over time, there does not appear to be a significant general trend at any rank. While there is certainly parity among the top 5 (it is not the case that the best performing company completely dominates the market each year), the top company accrues roughly 25% or more of the relative market share each year, while the 5th ranked company accrues only ~15% or less.

2. What Factors Impact a Movie’s Score?

Whether you’re a director or a viewer seeing the final product, everyone wants a movie to be good. This raises the question: what factors most impact how good a movie is?

In our dataset, the “quality” of a movie is captured by its score which is the average IMDb user rating on a scale of 1 to 10. To start out, let’s compare score to budget, gross, profit, and runtime. However, first we can note that budget and gross vary somewhat exponentially, and both have natural floors at 0. Thus, they are good choices to apply a logarithm to in order to make patterns more visible. Note we can’t do the same for profit since it can be negative.

movies_clean$log_budget <- log(movies_clean$adj_budget)
movies_clean$log_gross <- log(movies_clean$adj_gross)

plot1 <- ggplot(data = movies_clean, aes(x = log_budget, y = score)) +
  geom_point(color="red", alpha=0.4) +
  labs(x = "Log Budget", y = "Score", title = "Movie Score vs Log Budget") +
  theme_minimal() +
  guides(color = "none") +
  coord_cartesian(ylim = c(1, 10))
plot2 <- ggplot(data = movies_clean, aes(x = log_gross, y = score)) +
  geom_point(color="black", alpha=0.4) +
  labs(x = "Log Gross", y = "Score", title = "Movie Score vs Log Gross") +
  theme_minimal() +
  guides(color = "none") +
  coord_cartesian(ylim = c(1, 10))
plot3 <- ggplot(data = movies_clean, aes(x = adj_profit, y = score)) +
  geom_point(color="blue", alpha=0.2) +
  labs(x = "Profit", y = "Score", title = "Movie Score vs Profit") +
  theme_minimal() +
  scale_x_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  guides(color = "none") +
  coord_cartesian(ylim = c(1, 10))
plot4 <- ggplot(data = movies_clean, aes(x = runtime, y = score)) +
  geom_point(color="purple", alpha=0.2) +
  labs(x = "Runtime in Minutes", y = "Score", title = "Movie Score vs Runtime") +
  theme_minimal() +
  guides(color = "none") +
  coord_cartesian(xlim = c(0, NA), ylim = c(1, 10))

grid.arrange(plot1, plot2, plot3, plot4, nrow = 2, ncol = 2)

While budget does not seem to have a very strong correlation with score based on the scatter plot, there does appear to be a general upward trend with respect to gross. That is, there appears to be a weak trend associating higher-grossing movies with higher scores. Shifting our attention to profit, the association becomes stronger. Movies which made a massive profit always scored above a 7.5, and in general highly profitable movies seem to receive better scores. This makes intuitive sense, as one driver of movie profit is the “quality” of a movie, or how much an audience will enjoy it. Lastly, perhaps the strongest association with score is runtime. Almost all true “bombs” are on the shorter side (under 125 minutes), and as runtime increases so does score in general, especially toward the extreme of length.
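These visual impressions could be backed by a rank correlation, which does not assume a linear relationship. The sketch below uses made-up vectors; on the real data, one might pass `use = "complete.obs"` to `cor()` to skip rows with missing values. This is an illustration of the idea, not a result from the dataset.

```r
# Toy values (illustrative only)
score   <- c(5.1, 6.0, 6.4, 7.2, 8.0)
runtime <- c(88, 95, 104, 130, 150)
budget  <- c(30, 5, 60, 10, 40)   # in millions

# Spearman's rho compares the rank orderings of the two variables
rho_runtime <- cor(score, runtime, method = "spearman")
rho_budget  <- cor(score, budget,  method = "spearman")
```

In this toy example runtime rises in lockstep with score (rho of 1), while budget is only loosely ordered with it (rho of 0.3).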

To illustrate the point further, we can create a graph of our two variables which appear more strongly associated with score (profit and runtime) and color by score:

ggplot(data = movies_clean, aes(x = adj_profit, y = runtime, color = score)) +
  geom_point(alpha=0.25) +
  scale_color_gradient(low = "black", high = "red") +
  labs(x = "Adjusted Profit", y = "Runtime in Minutes", title = "Profit vs Runtime Colored by Score") +
  scale_x_continuous(labels = unit_format(unit = "M", scale = 1e-6)) +
  theme_minimal() +
  coord_cartesian(ylim = c(0, NA))

Upon doing so, we see a clear pattern that the darkest colors which correspond to low scores are concentrated near low profit, and as we move further away from this point score tends to increase. While the data is ultimately still quite noisy, it does appear that score correlates positively with profit and runtime.

Next, we can try to quantify the impact of movie genre on score. Since most movies fall into just 3 genres, we can reduce the space of possible genres by recoding everything outside the top 3 to be classified as “other”:

movies_clean2 <- movies_clean %>% mutate(genre = recode(genre, "Comedy" = "Comedy", "Action" = "Action", "Drama" = "Drama", .default = "Other"))
genre_prop <- movies_clean2 %>% group_by(genre) %>% summarise(count = n())
ggplot(genre_prop, aes(x = reorder(genre, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Primary Genre",
       x = "Genre",
       y = "Count")
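An alternative to spelling out each genre in `recode()` is `fct_lump_n()` from forcats, which keeps the n most frequent levels and lumps the rest together. This is a sketch on a toy factor, not the code used for the plot above; with tied counts it may keep more than n levels.

```r
library(forcats)

# Toy genre factor (illustrative values only)
genre <- factor(c("Comedy", "Comedy", "Comedy", "Action", "Action",
                  "Drama", "Drama", "Horror", "Crime"))

# keep the 3 most frequent genres; everything else becomes "Other"
genre_top3 <- fct_lump_n(genre, n = 3, other_level = "Other")
genre_counts <- table(genre_top3)
```

The advantage is that the top genres are chosen from the data rather than hard-coded.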

Using this new formulation, we can take a second look at log gross vs score by faceting by genre:

ggplot(data = movies_clean2, aes(x = log_gross, y = score)) +
  geom_point(color="black", alpha=0.4) +
  labs(x = "Log Gross", y = "Score", title = "Movie Score vs Log Gross") +
  theme_minimal() +
  guides(color = "none") +
  coord_cartesian(ylim = c(1, 10)) +
  facet_wrap(~ genre)

It does not appear that genre has a substantial impact on the score distribution of movies. In fact, breaking down the score quartiles by genre shows just how slight the differences are:

movies_clean2 %>% group_by(genre) %>% summarise(as_tibble_row(quantile(score)))

# A tibble: 4 × 6
  genre   `0%` `25%` `50%` `75%` `100%`
  <fct>  <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Action   2.1   5.7   6.3   6.9    9  
2 Other    2.4   6     6.6   7.2    8.9
3 Comedy   1.9   5.7   6.3   6.8    8.6
4 Drama    2.3   6.2   6.8   7.3    9.3

Regardless of genre, the median score falls between 6.3 and 6.8, and the middle 50% of scores (25% to 75%) spans only about 1.1 to 1.2 points.
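The spread within each genre can also be summarized as a single number per group. A minimal base R sketch on toy scores (the values are made up for illustration):

```r
# Toy scores for two genres (illustrative values only)
score <- c(5.7, 6.3, 6.9, 6.2, 6.8, 7.3)
genre <- c("Action", "Action", "Action", "Drama", "Drama", "Drama")

# interquartile range (75th minus 25th percentile) and median per genre
iqr_by_genre    <- tapply(score, genre, IQR)
median_by_genre <- tapply(score, genre, median)
```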

Conclusion

In my research, I explored the patterns of movie economics over time and the relation of various factors to movie score. While the median popular movie has in general been getting cheaper to produce since 2000, the gross of such a movie has been increasing over that same period. As a result, movie profit has steadily risen since 2000, making the movie industry seem stronger than ever rather than threatened by the advent of streaming services. Movie score is not strongly associated with any single variable, but it does tend to increase with gross, profit, and runtime (the association with budget is the weakest). There is no strong association with genre in terms of predicting movie score.

Limitations

The dataset used in this analysis is not without some significant limitations. Data scraped includes only the top 200 movies from each year, although this does not control for the total number of movies produced in a year. Since more movies are produced today than in the 80s, the top 200 movies of a recent year are a more selective subset than the top 200 of an earlier year. Additionally, the dataset had to undergo significant modification to be readily usable: we had to throw out ~28% of movies for much of the analysis since they did not have budgets recorded. If the movies without budgets are not randomly distributed, then removing them would affect the conclusions we draw from our analysis.
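One simple way to probe whether missing budgets are spread evenly is to compute the share of missing values per year. The sketch below uses a toy data frame with made-up values; it shows the check only and says nothing about what the real data would show.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  year   = c(1980, 1980, 1980, 1980, 2019, 2019, 2019, 2019),
  budget = c(NA, NA, 1e6, 2e6, 5e6, NA, 7e6, 8e6)
)

# fraction of rows with a missing budget in each year
missing_by_year <- tapply(is.na(toy$budget), toy$year, mean)
```

Large differences between years would suggest that dropping those rows biases any year-over-year comparison.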

References

Baumer, Benjamin S., et al. Modern Data Science with R. CRC Press, Taylor & Francis Group, 2021.

“Consumer Price Index for All Urban Consumers: All Items in U.S. City Average.” FRED, 11 Jan. 2024, fred.stlouisfed.org/series/CPIAUCSL.

Grijalva, Daniel. “Movie Industry.” Kaggle, 23 July 2021, www.kaggle.com/datasets/danielgrijalvas/movies/data.

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham, Hadley, et al. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly, 2023.