Hello everyone. It’s been over a year since COVID-19 emerged, and since then it has affected every one of our lives. The effects can be seen almost everywhere: restaurants, schools, businesses, and even the movie industry. In this project, I wanted to see the impact that the pandemic has had on the movie industry specifically. To do so, I looked at the total movie profit (US gross income - movie budget) for each year. The end result was quite shocking.

Loading in the data

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

## ✓ ggplot2 3.3.4     ✓ purrr   0.3.4
## ✓ tibble  3.1.2     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(tinytex)
library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

setwd("/Users/justinpark/Desktop/DATA 110 IMDb Movies Data") 
movieslist <- read.csv("IMDb movies.csv")

Observing the structure of the dataset

As you can see, the dataset has numerous variables: everything from the movie title, language, actors, and directors to budgets. In this project I only wanted to look at certain years, the budgets, and the US gross income.

str(movieslist)

## 'data.frame':    85855 obs. of  22 variables:
##  $ imdb_title_id        : chr  "tt0000009" "tt0000574" "tt0001892" "tt0002101" ...
##  $ title                : chr  "Miss Jerry" "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" ...
##  $ original_title       : chr  "Miss Jerry" "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" ...
##  $ year                 : chr  "1894" "1906" "1911" "1912" ...
##  $ date_published       : chr  "1894-10-09" "1906-12-26" "1911-08-19" "1912-11-13" ...
##  $ genre                : chr  "Romance" "Biography, Crime, Drama" "Drama" "Drama, History" ...
##  $ duration             : int  45 70 53 100 68 60 85 120 120 55 ...
##  $ country              : chr  "USA" "Australia" "Germany, Denmark" "USA" ...
##  $ language             : chr  "None" "None" "" "English" ...
##  $ director             : chr  "Alexander Black" "Charles Tait" "Urban Gad" "Charles L. Gaskill" ...
##  $ writer               : chr  "Alexander Black" "Charles Tait" "Urban Gad, Gebhard Schätzler-Perasini" "Victorien Sardou" ...
##  $ production_company   : chr  "Alexander Black Photoplays" "J. and N. Tait" "Fotorama" "Helen Gardner Picture Players" ...
##  $ actors               : chr  "Blanche Bayliss, William Courtenay, Chauncey Depew" "Elizabeth Tait, John Tait, Norman Campbell, Bella Cola, Will Coyne, Sam Crewes, Jack Ennis, John Forde, Vera Li"| __truncated__ "Asta Nielsen, Valdemar Psilander, Gunnar Helsengreen, Emil Albes, Hugo Flink, Mary Hagen" "Helen Gardner, Pearl Sindelar, Miss Fielding, Miss Robson, Helene Costello, Charles Sindelar, Mr. Howard, James"| __truncated__ ...
##  $ description          : chr  "The adventures of a female reporter in the 1890s." "True story of notorious Australian outlaw Ned Kelly (1855-80)." "Two men of high rank are both wooing the beautiful and famous equestrian acrobat Stella. While Stella ignores t"| __truncated__ "The fabled queen of Egypt's affair with Roman general Marc Antony is ultimately disastrous for both of them." ...
##  $ avg_vote             : num  5.9 6.1 5.8 5.2 7 5.7 6.8 6.2 6.7 5.5 ...
##  $ votes                : int  154 589 188 446 2237 484 753 273 198 225 ...
##  $ budget               : chr  "" "$ 2250" "" "$ 45000" ...
##  $ usa_gross_income     : chr  "" "" "" "" ...
##  $ worlwide_gross_income: chr  "" "" "" "" ...
##  $ metascore            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ reviews_from_users   : num  1 7 5 25 31 13 12 7 4 8 ...
##  $ reviews_from_critics : num  2 7 2 3 14 5 9 5 1 1 ...

Filtering only the necessary categories

I decided that I would only look at movies from the US released from 2016-2020 (the filter below keeps years after 2015).

There was an extraneous category of “TV Movie 2019”. However, there was only 1 movie in this category, so I decided to exclude it as well.

movies <- movieslist %>%
  filter(country == "USA" & year > 2015 & year != "TV Movie 2019")
# movies
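
Because `year` is stored as a character column, the comparison `year > 2015` is made on strings rather than numbers. A minimal sketch of an alternative, assuming a small made-up data frame (`toy` and `year_num` are hypothetical names, not part of the IMDb data), converts the year to an integer first so that a non-numeric entry such as "TV Movie 2019" becomes NA and is dropped by the filter:

```r
library(dplyr)

# Hypothetical toy data standing in for movieslist
toy <- data.frame(
  title   = c("A", "B", "C", "D"),
  country = c("USA", "USA", "France", "USA"),
  year    = c("2015", "2018", "2019", "TV Movie 2019"),
  stringsAsFactors = FALSE
)

toy_recent <- toy %>%
  # Non-numeric years become NA; suppressWarnings hides the coercion warning
  mutate(year_num = suppressWarnings(as.integer(year))) %>%
  # filter() drops rows where the condition is NA, so "TV Movie 2019" is excluded
  filter(country == "USA", year_num > 2015)

toy_recent
```

With this approach the odd category does not need to be named explicitly, although the original filter gives the same rows for this dataset as described above.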

Cleaning the data

As you saw earlier, the “budget” and “usa_gross_income” variables were stored as characters. They also contained a dollar symbol ($) and spaces, which are unnecessary in this project. Therefore, they were removed.

movies$budget <- gsub('[[:punct:][:blank:]]', '', movies$budget)
movies$usa_gross_income <- gsub('[[:punct:][:blank:]]', '', movies$usa_gross_income)

# movies

Now that the unnecessary symbols have been removed, the character strings have to be changed into a numeric type.

movies$budget <- as.numeric(movies$budget)

## Warning: NAs introduced by coercion

movies$usa_gross_income <- as.numeric(movies$usa_gross_income)

# movies
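
To make the two cleaning steps concrete, here is a small sketch on made-up strings in the style of the budget column. The values in `raw` are assumptions for illustration, and the "EUR" entry is only one plausible kind of value that would still fail to convert; the output above does not show which entries actually triggered the coercion warning.

```r
# Hypothetical raw values in the style of the budget column
raw <- c("$ 2250", "$ 45000", "", "EUR 1000000")

# Strip punctuation and blanks, as in the cleaning step above
stripped <- gsub("[[:punct:][:blank:]]", "", raw)
stripped

# Empty strings and strings that still contain letters become NA
cleaned <- suppressWarnings(as.numeric(stripped))
cleaned
```

Note that stripping all punctuation would also remove a decimal point, so this pattern is only safe when the amounts are whole numbers.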

Before looking at the profit, however, I wanted to see how many movies were released each year.

Therefore, the only necessary variables for this are the title and the year.

totalreleases <- movies %>%
  select(title, year)
# totalreleases
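
If the counts themselves are wanted as a table rather than only as bar heights, one option is `dplyr::count()`. This is a sketch on made-up rows (`toy_releases` and `releases_per_year` are hypothetical names), not the actual IMDb counts:

```r
library(dplyr)

# Hypothetical toy data standing in for totalreleases
toy_releases <- data.frame(
  title = c("A", "B", "C", "D", "E"),
  year  = c("2016", "2016", "2017", "2020", "2016"),
  stringsAsFactors = FALSE
)

# One row per year, with the number of titles in column n
releases_per_year <- toy_releases %>%
  count(year)

releases_per_year
```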

Creating a bar graph for the number of movies released per year.

p1 <- ggplot(data = totalreleases, aes(x = year, fill = year)) +
  geom_bar() +
  ggtitle("Number of Movies Released in the USA from 2016-2020") +
  theme(plot.title = element_text(hjust = 0.50)) +
  xlab("Year") +
  ylab("Number of Movies") +
  labs(fill = "Year")

p1

Already the effect of COVID-19 can be seen in this graph alone. While the number of movies released each year in 2016-2018 was about the same (give or take a few for 2019), the number of movies released in 2020 barely surpasses 250. This is not even a third of the number released in each of the years 2016-2018.

Determining the total profit for each year.

This time, the only necessary variables are year, title, budget, and usa_gross_income. Therefore, only these variables were selected.

budgetincome <- movies %>%
  select(year, title, budget, usa_gross_income)
# budgetincome

Calculating total profit

Observing the table above, one can clearly see that there are 1) too many NAs and 2) no variable for profit.

Therefore, rows with at least one NA value were removed, as a movie with data for only the budget or only the US gross income would skew the totals. Then the total profit was calculated for each year by first grouping the data by year (group_by(year)) and then calculating the profit manually.

budgetincome1 <- na.omit(budgetincome)

# budgetincome1

totalprofit <- budgetincome1 %>%
  group_by(year) %>%
  summarize(profit = sum(usa_gross_income) - sum(budget))
totalprofit

## # A tibble: 5 x 2
##   year      profit
##   <chr>      <dbl>
## 1 2016  1914216858
## 2 2017  1857405815
## 3 2018  3211215513
## 4 2019  2519533491
## 5 2020  -153516390
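
The same calculation can be traced by hand on a few made-up rows. In this sketch (all names and numbers are hypothetical), the profit is computed per movie and then summed, which gives the same yearly totals as subtracting the summed budget from the summed income:

```r
library(dplyr)

# Hypothetical toy data standing in for budgetincome
toy_budget <- data.frame(
  year             = c("2019", "2019", "2020", "2020"),
  title            = c("A", "B", "C", "D"),
  budget           = c(100, 50, 200, NA),
  usa_gross_income = c(300, NA, 150, 80),
  stringsAsFactors = FALSE
)

toy_profit <- toy_budget %>%
  na.omit() %>%                                  # drops B and D (one NA each)
  mutate(profit = usa_gross_income - budget) %>% # per-movie profit
  group_by(year) %>%
  summarize(profit = sum(profit))                # yearly total

toy_profit
```

Here only movies A and C survive `na.omit()`, so 2019 ends with a profit of 200 and 2020 with a loss of 50.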

Creating the main visualization: a bar graph of the total profit per year.

p2 <- ggplot() +
  geom_bar(data = totalprofit, aes(x = year, y = profit, fill = year),
             position = "dodge", stat = "identity") +
  ggtitle("Total Movie Profit in the USA in Dollars from 2016-2020") +
  theme(plot.title = element_text(hjust = 0.50)) +
  xlab("Year") +
  ylab("Dollars") +
  labs(fill = "Year")

p2

Once again, there is a clear difference between 2016-2019 and 2020. All the previous years had positive profit values, while 2020 is the only year with a negative value, indicating a net loss.

Quick check of the top 20 movies

Just for a quick reference, I wanted to look at the top 20 movies from 2016-2020 with the highest US gross income. Popular titles such as the Marvel movies, Star Wars, and Disney/Pixar movies stood out. However, not a single movie was from 2020.

visual <- budgetincome1 %>%
  arrange(desc(usa_gross_income)) %>%
  head(20)

visual

##    year                               title   budget usa_gross_income
## 1  2019                   Avengers: Endgame 3.56e+08        858373000
## 2  2018                       Black Panther 2.00e+08        700426566
## 3  2018              Avengers: Infinity War 3.21e+08        678815482
## 4  2017         Star Wars - Gli ultimi Jedi 3.17e+08        620181382
## 5  2018                   Gli Incredibili 2 2.00e+08        608581744
## 6  2016        Rogue One: A Star Wars Story 2.00e+08        532177324
## 7  2019    Star Wars: L'ascesa di Skywalker 2.75e+08        515202542
## 8  2017                La bella e la bestia 1.60e+08        504481165
## 9  2016                Alla ricerca di Dory 2.00e+08        486295561
## 10 2019 Frozen II - Il segreto di Arendelle 1.50e+08        477373578
## 11 2019                         Toy Story 4 2.00e+08        434038008
## 12 2018 Jurassic World - Il regno distrutto 1.70e+08        417719760
## 13 2016          Captain America: Civil War 2.50e+08        408084349
## 14 2019           Spider-Man: Far from Home 1.60e+08        390532085
## 15 2017     Guardiani della Galassia Vol. 2 2.00e+08        389813101
## 16 2016                            Deadpool 5.80e+07        363070709
## 17 2016                         Zootropolis 1.50e+08        341268248
## 18 2017              Spider-Man: Homecoming 1.75e+08        334201140
## 19 2016  Batman v Superman: Dawn of Justice 2.50e+08        330360194
## 20 2016                       Suicide Squad 1.75e+08        325100054

And just for fun, let's look at a scatterplot that displays the US gross income based on budget.

p3 <- ggplot(data = budgetincome, aes(x = budget, y = usa_gross_income)) +
  geom_point(aes(color = year)) +
  geom_smooth(aes(color = year), method = "lm", se = FALSE) +
  ggtitle("US Movie Gross Income Based on Budget") +
  theme(plot.title = element_text(hjust = 0.50)) +
  labs(
    x = "Budget",
    y = "US Gross Income")

p3

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 3289 rows containing non-finite values (stat_smooth).

## Warning: Removed 3289 rows containing missing values (geom_point).

Even in this graph, 2020 is noticeably different from the other years. Its line of best fit has a much lower slope than those of the other years, indicating that even movies with relatively high budgets did not earn much income.

In Conclusion

It’s been over a year since COVID-19 emerged, and it is fair to say that it has affected every one of our lives. The effects can be seen almost everywhere: restaurants, schools, businesses, and even the movie industry. In this project, I wanted to see the impact that the pandemic has had on the movie industry.

Overall, the dataset included numerous variables: everything from the movie title, language, production country, actors, directors, budget, and US/worldwide income to the movie summary. However, many of these variables were not useful for this project. I decided to look at the total movie profit in the US (US gross income - movie budget) within each year, because I believed that the pandemic would have a significant impact on the profits made.

Because of this, I only had to select the few variables that were necessary for my goal. First, only movies from the US released in 2016-2020 were chosen. However, much of the data under the budget and US gross income variables was blank, included dollar symbols ($), and was stored as characters rather than numbers. Because of this, I cleaned the data by removing all the dollar symbols and spaces so that only the numbers remained, using gsub('[[:punct:][:blank:]]', '', ...). Specifically, this removes all punctuation and spaces in the specified variable. Afterward, I converted the two variables to numeric using as.numeric (blank entries became NA) so that the necessary calculations could be made later.

With the cleaning done, I made two different visualizations. The first shows the number of movies released each year from 2016-2020. Even here the effects of the pandemic can be seen, as the number of movies released in 2020 barely surpasses 250, which is not even a third of the number released in each of the years 2016-2018.

The second visualization is the main graph. It displays the total profit made by movies each year. However, two problems occurred: 1) there were too many NA values and 2) there was no variable for profit. Therefore, rows with at least one NA value were removed (na.omit), as a movie with data for only the budget or only the US gross income would skew the totals. Then the total profit was calculated for each year by first grouping the data by year (group_by(year)) and then calculating the profit manually (sum(usa_gross_income) - sum(budget)).

Furthermore, I would have liked to find the most popular genres according to their rating (avg_vote) and how many were released each year. However, most of the movies were categorized with multiple genres - sometimes reaching even 5 different genres. Because of this, it was nearly impossible to categorize the movies under one specific genre.
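
One possible way around the multi-genre problem, sketched here on made-up rows rather than the real data, is to split the comma-separated `genre` string into one row per genre with `tidyr::separate_rows()` and then summarize. The data frame `toy_genres` and the result columns `movies` and `mean_vote` are hypothetical names:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column names as the IMDb dataset
toy_genres <- data.frame(
  title    = c("A", "B"),
  year     = c("2019", "2020"),
  genre    = c("Biography, Crime, Drama", "Drama"),
  avg_vote = c(6.0, 8.0),
  stringsAsFactors = FALSE
)

genre_summary <- toy_genres %>%
  # One row per (movie, genre) pair
  separate_rows(genre, sep = ", ") %>%
  group_by(genre) %>%
  summarize(movies = n(), mean_vote = mean(avg_vote))

genre_summary
```

The trade-off is that a movie with several genres is counted once under each of them, so the per-genre counts add up to more than the number of movies.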

Very clearly, the COVID-19 pandemic had a very large impact on the movie industry. A net loss appeared for the first time in the five years examined. At first, I thought I would have to normalize the data by finding the percent profit (usa_gross_income/budget) of each year in order to make a good comparison. But the graph displaying the total profits alone was enough to make a good comparison.
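
For reference, the normalization mentioned here could look something like the following sketch. It was not part of the analysis above; the data frame `toy_years` and the column names `income_to_budget` and `percent_profit` are hypothetical, and the numbers are made up:

```r
library(dplyr)

# Hypothetical toy data; rows with NAs are assumed to be removed already
toy_years <- data.frame(
  year             = c("2019", "2019", "2020"),
  budget           = c(100, 100, 200),
  usa_gross_income = c(300, 100, 150),
  stringsAsFactors = FALSE
)

toy_ratio <- toy_years %>%
  group_by(year) %>%
  summarize(
    # Dollars earned in the US per dollar of budget
    income_to_budget = sum(usa_gross_income) / sum(budget),
    # Profit as a percentage of the total budget
    percent_profit = 100 * (sum(usa_gross_income) - sum(budget)) / sum(budget)
  )

toy_ratio
```

A ratio below 1 (or a negative percentage) marks a year in which the movies earned less in the US than they cost, regardless of how many movies were released.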

On a side note, interestingly enough, 2019 actually had a greater profit than 2016 and 2017 even though there was a noticeable drop in movies released. This may be because of the large number of big-name movies such as Avengers: Endgame, Frozen II, Toy Story 4, and Star Wars: The Rise of Skywalker. These movies alone made huge profits compared to their budgets. And because these movies were released, many smaller movies may have been held back, as they could not compete on equal terms.

Ultimately, it can be said that COVID-19 had a significant impact on the US movie industry. In 2020, the movies’ total US gross income fell short of their total budget, so the year ended in a net loss. It was a very rough time for the movie industry, but with things looking better recently, hopefully the mood will change not just for moviegoers, but for everyone.

Movie Profit Over the Years According to IMDb

Justin Park

6/21/2021