HW1

A Note

Sorry that I wasn’t able to get the rough draft of the final project as far along as I would have wanted, I had a lot come up this week. The overall direction of my project is here – the investigation of the research questions etc. The project definitely still needs more fleshing out in terms of providing supporting writing and overall direction, and I’ll be sure to take care of that in the coming days and maybe ask for another review if that is possible. All feedback is still greatly appreciated!

The Dataset: Movies

Importing the Dataset

For my dataset, I chose to use a movies dataset from Kaggle containing 7,688 movies from 1980 to 2020.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
movies_raw = read.csv("movies.csv")
# head(movies_raw, 10)

Cleaning the Dataset

The dataset is mostly clean, however a few columns could be represented in a better format. Rating, genre, and country should all be factors as they are categorical options rather than a continuous spectrum.

movies <- mutate(movies_raw, rating = factor(rating),
                 genre = factor(genre),
                 rating = factor(rating))
# head(movies, 10)

Next, the column “released” currently has data which is a combination of a date and a country, like June 13, 1980 (United States). In order to tidy the data, we should split this column into a release_date column and a release_country column.

movies <- separate(movies, released, into = c("release_date", "release_country"), sep = " \\(|\\)") %>% mutate(release_country = factor(release_country), release_date = mdy(release_date))
Warning: Expected 2 pieces. Additional pieces discarded in 7666 rows [1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2 rows [5729,
5731].
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `release_date = mdy(release_date)`.
Caused by warning:
!  10 failed to parse.
# head(movies, 10)

We note that there are two instances where released is empty, and then 10 instances where mdy fails to parse the date because only the year or nothing is given:

movies_raw[is.na(movies$release_date), ][5]
                 released
202  1981 (United States)
313          1982 (Japan)
787         1985 (Taiwan)
801  1985 (United States)
1174 1987 (United States)
1821 1990 (United States)
1826        1990 (Canada)
2817          1995 (Iran)
4188 2019 (United States)
5729                     
5731                     
6414 2013 (United States)

To account for this, we can simply make the release date be equal to the release year starting at January 1st which is given to us as the column “year”.

movies$release_date <- if_else(is.na(movies$release_date), mdy(paste("January 1,", movies$year)), movies$release_date)
# head(movies, 10)

Since budgets and grosses are not inflation adjusted, it is difficult to compare their trends over time. So, we should adjust these numbers for inflation, base lining them for what their value would have been in 2020.

library(quantmod)
Loading required package: xts
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric

######################### Warning from 'xts' package ##########################
#                                                                             #
# The dplyr lag() function breaks how base R's lag() function is supposed to  #
# work, which breaks lag(my_xts). Calls to lag(my_xts) that you type or       #
# source() into this session won't work correctly.                            #
#                                                                             #
# Use stats::lag() to make sure you're not using dplyr::lag(), or you can add #
# conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop           #
# dplyr from breaking base R's lag() function.                                #
#                                                                             #
# Code in packages is not affected. It's protected by R's namespace mechanism #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning.  #
#                                                                             #
###############################################################################

Attaching package: 'xts'
The following objects are masked from 'package:dplyr':

    first, last
Loading required package: TTR
Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 
library(lubridate)
getSymbols("CPIAUCSL", src='FRED')
[1] "CPIAUCSL"
cpi_data <- data.frame(approx_date = index(CPIAUCSL), CPI = coredata(CPIAUCSL)) %>%
  rename(CPI = CPIAUCSL)

movies <- movies %>%
  mutate(approx_date = floor_date(release_date, "month")) %>%
  merge(cpi_data, by="approx_date", all.x = TRUE)
cur_cpi = tail(movies$CPI, n=1)
movies <- movies %>%
  mutate(budget = budget * cur_cpi / CPI,
         gross = gross * cur_cpi / CPI)
movies$approx_date <- NULL
movies$CPI <- NULL
# movies

High Level Overview of the Cleaned Dataset

This dataset is a list of 7,688 movies released from 1980 to 2020. For each year, the most popular 200 (with some exceptions) movies from that year are recorded. For each movie, there are 16 attributes recorded:

  • name: the title of the movie (string)

  • rating: the rating of the movie (R, PG, etc.) (factor)

  • genre: the main genre of the movie (factor)

  • year: year of release (double)

  • release_date: the date the movie was first released (date)

  • release_country: the country the movie was first released in (factor)

  • score: IMDb user rating (double)

  • votes: number of user votes (double)

  • director: the director (string)

  • writer: writer of the movie (string)

  • star: main actor/actress (string)

  • country: the country of origin (factor)

  • budget: the inflation adjusted budget of a movie. Some movies don’t have this, so it appears as 0 (double)

  • gross: the inflation adjusted box office revenue of the movie in the US (double)

  • company: the production company (string)

  • runtime: duration of the movie (double)

General Descriptive Analysis

First, we can calculate the distribution of ratings:

rating_prop <- movies %>% group_by(rating) %>% summarise(count = n()) %>% mutate(percentage = round(count / sum(count) * 100, 2))
print(rating_prop)
# A tibble: 13 × 3
   rating      count percentage
   <fct>       <int>      <dbl>
 1 ""             77       1   
 2 "Approved"      1       0.01
 3 "G"           153       2   
 4 "NC-17"        23       0.3 
 5 "Not Rated"   283       3.69
 6 "PG"         1252      16.3 
 7 "PG-13"      2112      27.5 
 8 "R"          3697      48.2 
 9 "TV-14"         1       0.01
10 "TV-MA"         9       0.12
11 "TV-PG"         5       0.07
12 "Unrated"      52       0.68
13 "X"             3       0.04
ggplot(rating_prop, aes(x = reorder(rating, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Rating",
       x = "Rating",
       y = "Count")

We see that the most common ratings are “R”, “PG-13”, and “PG”. Additionally, we see that there are 77 movies where the rating was not recorded.

year_counts <- movies %>% group_by(year) %>% summarise(count = n())
ggplot(year_counts, aes(x = year, y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Year",
       x = "Year",
       y = "Count")

We can see that the earliest years and 2020 do not have a full 200 movies recorded, likely due to a lack of availability of data to scrape.

mean(movies$votes, na.rm = TRUE)
[1] 88108.5
quantile(movies$votes, na.rm = TRUE)
     0%     25%     50%     75%    100% 
      7    9100   33000   93000 2400000 
sd(movies$votes, na.rm = TRUE)
[1] 163323.8

The mean number of votes cast for a movie was 88 thousand, which is significantly higher than the median votes cast of 33 thousand. The fewest votes cast was 7, while the maximum votes cast was 2,400,000. The standard deviation was quite considerable at 163,323, indicating that the data is not well concentrated.

length(unique(movies$director))
[1] 2949
string_counts <- table(movies$director)
string_counts[which.max(string_counts)]
Woody Allen 
         38 

We see that 2949 unique directors are captured by the dataset, and the most common director was Woody Allen with a total of 38 movies.

length(unique(movies$writer))
[1] 4536
string_counts <- table(movies$writer)
string_counts[which.max(string_counts)]
Woody Allen 
         37 

We see that 4536 unique writes are captured by the dataset, and the most common writer was again Woody Allen with a total of 37 movies.

length(unique(movies$star))
[1] 2815
string_counts <- table(movies$star)
string_counts[which.max(string_counts)]
Nicolas Cage 
          43 

We see that 2815 unique stars are captured by the dataset, and the most common star was Nicolas Cage with a total of 43 movies.

options(scipen = 999)
mean(movies$budget, na.rm = TRUE)
[1] 49356607
quantile(movies$budget, na.rm = TRUE)
           0%           25%           50%           75%          100% 
     5053.752  15587910.360  33390382.282  64398432.203 380210877.609 
sd(movies$budget, na.rm = TRUE)
[1] 50471516

We see that the average budget of a movie is $49,356,607 while the median budget is significantly lower at $20,500,000. The smallest movie budget was $5053 while the largest was $380,210,877.

mean(movies$gross, na.rm = TRUE)
[1] 106885865
quantile(movies$gross, na.rm = TRUE)
             0%             25%             50%             75%            100% 
       501.3549    7414694.0743   33228566.8326  107822019.0615 3565566383.3266 
sd(movies$gross, na.rm = TRUE)
[1] 207963377

We see that the average gross of a movie is $106,885,865 while the median gross is significantly lower at $33,228,566. The smallest movie gross was $501 while the largest was $3,565,566,383.

mean(movies$runtime, na.rm = TRUE)
[1] 107.2616
quantile(movies$runtime, na.rm = TRUE)
  0%  25%  50%  75% 100% 
  55   95  104  116  366 
sd(movies$runtime, na.rm = TRUE)
[1] 18.58125

The average length of a movie was 107.2616 minutes, while the median was lower at 104 minutes. The shortest movie was 55 minutes, and the longest was 366.

country_counts <- movies %>% group_by(country) %>% summarise(count = n()) %>%
  mutate(percentage = round(count / sum(count) * 100, 2)) %>% arrange(desc(count))
print(country_counts)
# A tibble: 60 × 3
   country        count percentage
   <chr>          <int>      <dbl>
 1 United States   5475      71.4 
 2 United Kingdom   816      10.6 
 3 France           279       3.64
 4 Canada           190       2.48
 5 Germany          117       1.53
 6 Australia         92       1.2 
 7 Japan             81       1.06
 8 India             62       0.81
 9 Italy             61       0.8 
10 Spain             47       0.61
# ℹ 50 more rows
top_countries = head(country_counts, 15)
ggplot(top_countries, aes(x = reorder(country, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Primary Genre",
       x = "Genre",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

We see that while there are a total of 60 different countries of origin captured by the dataset, by far the majority are from a few different countries. The US alone accounts for 71% of the movies, and in fact just the top 5 countries account for 90% of the total.

Question Driven Analysis

1. The Economics of Movie Making Over Time

Research Question: How have the economics of movie making changed over time? Are online streaming services like Netflix detracting from the appeal of seeing a movie at theaters, and thereby reducing the box office returns of movies?

Let’s explore how budget has changed over time. First, we can try simply creating a scatterplot of budget vs time.

ggplot(movies, aes(x = release_date, y = budget)) +
  geom_point() +
  labs(title = "Scatterplot of Release Date vs Budget",
       x = "Release Date",
       y = "Budget")
Warning: Removed 2171 rows containing missing values (`geom_point()`).

While the data does seem to generally trend up, the data is not cleanly captured. Instead, let’s bucket the movies by year and graph the median budget for each year:

movies2 <- movies[!is.na(movies$budget), ]
med_budgets <- movies2 %>% group_by(year) %>% summarise(median_budget = median(budget))
ggplot(med_budgets, aes(x = year, y = median_budget)) +
  geom_point() +
  labs(title = "Scatterplot of Release Year vs Median Budget",
       x = "Release Year",
       y = "Median Budget")

What we actually see is that when adjusting for inflation, the median budget of the top 200 movies (where budget is recorded) steadily rose from 1980 to its peak in 1999, after which it began to decrease again. Let’s overlay the trend in median gross on top of this data to see if we notice anything:

movies2 <- movies[!is.na(movies$gross), ]
movies2 <- movies2[!is.na(movies2$budget), ]
movies2 <- filter(movies2, year != 2020)
med_budg_gross <- movies2 %>% group_by(year) %>% summarise(median_budget = median(budget), median_gross = median(gross))

ggplot(med_budg_gross, aes(x = year)) +
  geom_line(aes(y = median_budget, color = "Budget"), size = 2) +
  geom_line(aes(y = median_gross, color = "Gross"), size = 2) +
  labs(title = "Scatter Plot of Movie Budget and Gross vs Time",
       x = "Release Date",
       y = "Value",
       color = "Legend") +
  scale_color_manual(values = c("Budget" = "blue", "Gross" = "red"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Now we can see that generally movies are becoming much more profitable, as the inflation adjusted gross of top movies has become significantly greater over time. Specifically, we see that gross has steadily rose since 1990 at a rapid rate, while budget has remained relatively steady and in fact decreased since 1990.

In conclusion, it appears that the movie industry is not struggling due to the advent of more streaming services. Rather, both in absolute value and value relative to the budget of the movie, US box office returns are seeing all time highs at present, meaning that as of 2020 movies were more profitable than ever in theaters.

2. Influence of Various Factors on Movie Score

Research Question: What factors tend to make a movie well received? Are there associations with budget, profit, or movie length?

First, let’s analyze the impact of budget on movie score.

clean_movies <- movies %>% filter(!is.na(gross) & !is.na(budget) & !is.na(score))

ggplot(clean_movies, aes(x = budget, y = score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatter Plot of Movie Score vs Budget",
       x = "Budget",
       y = "Score")
`geom_smooth()` using formula = 'y ~ x'

Fitting a linear to the data, we see a slight general upward trend correlating increasing budget with increasing movie score. While there are significantly more low budget movies that are flops than high budget, there are also simply more low budget movies in general. We also see that the highest scoring movies don’t necessarily require a high budget.

Next, let’s shift our focus instead to the correlation of the movie score with profit, as we may profit to be a better indicator of how well received a movie was.

clean_movies$profit <- clean_movies$gross - clean_movies$budget
clean_movies2 <- clean_movies %>% filter(profit <= 2e+09) # remove outliers for better fit

ggplot(clean_movies2, aes(x = profit, y = score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Scatter Plot of Movie Score vs Profit",
       x = "Profit",
       y = "Score")
`geom_smooth()` using formula = 'y ~ x'

Now, we see that the upward trend is significantly more apparent. Almost all movies that were true flops, receiving scores under 4, also had a small gross under 250 million dollars. Furthermore, the highest scoring movies almost always had scores greater than 6.

To further our analysis, we can increase the dimensional of our graph by coloring movies by genre. Looking at the overall distribution of genres, we see that action, comedy, and drama are by far the most common.

genre_prop <- movies %>% group_by(genre) %>% summarise(count = n()) %>% mutate(percentage = round(count / sum(count) * 100, 2))
ggplot(genre_prop, aes(x = reorder(genre, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Primary Genre",
       x = "Genre",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

To increase the efficacy of our analysis, let’s reduce the dimensions to only label the big 3 genres, and classify all the others under the umbrella of “other”.

clean_movies2 <- clean_movies %>% mutate(genre = recode(genre, "Comedy" = "Comedy", "Action" = "Action", "Drama" = "Drama", .default = "Other"))
genre_prop <- clean_movies2 %>% group_by(genre) %>% summarise(count = n())
ggplot(genre_prop, aes(x = reorder(genre, -count), y = count)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(title = "Frequency of Movies by Primary Genre",
       x = "Genre",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Now, we can separate our scatterplot of profit vs score based on the genre

ggplot(clean_movies2, aes(x = profit, y = score)) +
  geom_point(alpha = 0.25) +
  labs(title = "Scatter Plot of Movie Score vs Gross",
       x = "Profit",
       y = "Score") + 
  facet_wrap(~genre, scales = "free_y")

We see that action and other genres have a much greater spread to higher profits (big hits) than comedy and drama, which are more well concentrated below $100,000,000. This may be due to action having a higher mass appeal than comedies or drama, so the ceiling of profit for an action movie is much higher. We can also simply breakdown the scores of movies by genre using the interquantile range:

clean_movies2 %>% group_by(genre) %>% summarise(as_tibble_row(quantile(score)))
# A tibble: 4 × 6
  genre   `0%` `25%` `50%` `75%` `100%`
  <fct>  <dbl> <dbl> <dbl> <dbl>  <dbl>
1 Action   2.1   5.7   6.3   6.9    9  
2 Other    2.4   6     6.6   7.2    8.9
3 Comedy   1.9   5.7   6.3   6.8    8.6
4 Drama    2.3   6.2   6.8   7.3    9.3

We observe that genre does not have a strongn association with score — movies regardless of genre tend to average around a 6.5, with this varying by just +/-1 from 25% to 75%.

We could also test to see if the length of a movie impacts its score. Let’s create a graph of runtime vs score to see how strong a correlation exists:

clean_movies <- movies %>% filter(!is.na(score) & !is.na(runtime))

ggplot(movies, aes(x = runtime, y = score)) +
  geom_point(alpha = 0.25) +
  labs(title = "Scatter Plot of Runtime vs Score",
       x = "Runtime",
       y = "Score")
Warning: Removed 7 rows containing missing values (`geom_point()`).

We observe that there is some positive association that as runtime increases the odds a movie truly bombs decreases. In particular, movie scores become more tightly concentrated at a higher score as runtime increases from 60 to 150 minutes, with the frequency of movies bombing greatly decreasing.

In general, we have observed that as movie profit and movie runtime increases, movie score also generally increases. Movies which make more money and have a greater runtime are substantially less likely to have been flops than movies with low profit or short runtime. Genre did not have a large impact on movie score.

Bibliography

Baumer, Benjamin S., et al. Modern Data Science with R. CRC Press, Taylor & Francis Group, 2021.

Grijalva, Daniel. “Movie Industry.” Kaggle, 23 July 2021, www.kaggle.com/datasets/danielgrijalvas/movies/data.

R Core Team (2023). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.

Wickham, Hadley, et al. R for Data Science: Import, Tidy, Transform, Visualie, and Model Data. O’Reilly, 2023.