Hello everyone. It has been over a year since COVID-19 emerged, and it has affected all of our lives. Its effects can be seen almost everywhere: restaurants, schools, businesses, and even the movie industry. In this project, I wanted to see the impact that the pandemic has had on the movie industry specifically, so I looked at the total movie profit (gross income - movie budget) within each year. The end result was quite shocking.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.4 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(tinytex)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
setwd("/Users/justinpark/Desktop/DATA 110 IMDb Movies Data")
movieslist <- read.csv("IMDb movies.csv")
str(movieslist)
## 'data.frame': 85855 obs. of 22 variables:
## $ imdb_title_id : chr "tt0000009" "tt0000574" "tt0001892" "tt0002101" ...
## $ title : chr "Miss Jerry" "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" ...
## $ original_title : chr "Miss Jerry" "The Story of the Kelly Gang" "Den sorte drøm" "Cleopatra" ...
## $ year : chr "1894" "1906" "1911" "1912" ...
## $ date_published : chr "1894-10-09" "1906-12-26" "1911-08-19" "1912-11-13" ...
## $ genre : chr "Romance" "Biography, Crime, Drama" "Drama" "Drama, History" ...
## $ duration : int 45 70 53 100 68 60 85 120 120 55 ...
## $ country : chr "USA" "Australia" "Germany, Denmark" "USA" ...
## $ language : chr "None" "None" "" "English" ...
## $ director : chr "Alexander Black" "Charles Tait" "Urban Gad" "Charles L. Gaskill" ...
## $ writer : chr "Alexander Black" "Charles Tait" "Urban Gad, Gebhard Schätzler-Perasini" "Victorien Sardou" ...
## $ production_company : chr "Alexander Black Photoplays" "J. and N. Tait" "Fotorama" "Helen Gardner Picture Players" ...
## $ actors : chr "Blanche Bayliss, William Courtenay, Chauncey Depew" "Elizabeth Tait, John Tait, Norman Campbell, Bella Cola, Will Coyne, Sam Crewes, Jack Ennis, John Forde, Vera Li"| __truncated__ "Asta Nielsen, Valdemar Psilander, Gunnar Helsengreen, Emil Albes, Hugo Flink, Mary Hagen" "Helen Gardner, Pearl Sindelar, Miss Fielding, Miss Robson, Helene Costello, Charles Sindelar, Mr. Howard, James"| __truncated__ ...
## $ description : chr "The adventures of a female reporter in the 1890s." "True story of notorious Australian outlaw Ned Kelly (1855-80)." "Two men of high rank are both wooing the beautiful and famous equestrian acrobat Stella. While Stella ignores t"| __truncated__ "The fabled queen of Egypt's affair with Roman general Marc Antony is ultimately disastrous for both of them." ...
## $ avg_vote : num 5.9 6.1 5.8 5.2 7 5.7 6.8 6.2 6.7 5.5 ...
## $ votes : int 154 589 188 446 2237 484 753 273 198 225 ...
## $ budget : chr "" "$ 2250" "" "$ 45000" ...
## $ usa_gross_income : chr "" "" "" "" ...
## $ worlwide_gross_income: chr "" "" "" "" ...
## $ metascore : num NA NA NA NA NA NA NA NA NA NA ...
## $ reviews_from_users : num 1 7 5 25 31 13 12 7 4 8 ...
## $ reviews_from_critics : num 2 7 2 3 14 5 9 5 1 1 ...
The year column contained one extraneous value, “TV Movie 2019”. Since only one movie had this value, I excluded it when filtering for US movies released after 2015.
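movies <- movieslist %>%
  # country == "USA" keeps only movies whose country is exactly "USA".
  # year is stored as character, so year > 2015 is a string comparison;
  # "TV Movie 2019" passes that comparison and has to be excluded explicitly.
  filter(country == "USA" & year > 2015 & year != "TV Movie 2019")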
# movies
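To see why the extra `year != "TV Movie 2019"` condition is needed, here is a minimal sketch on a made-up `year` vector (the values are stand-ins, not rows from the dataset). It compares the string comparison used above with an alternative that converts to integers first; the conversion approach is only an illustration, not what the analysis does.

```r
# Toy stand-in for the `year` column, which read.csv() loaded as character.
year <- c("2014", "2016", "2020", "TV Movie 2019")

# Comparing character to numeric coerces the number to a string,
# so the comparison is alphabetical rather than numeric.
lexical <- year > 2015

# Alternative: convert first. The malformed entry becomes NA
# (as.integer() would warn about it, hence suppressWarnings()).
year_num <- suppressWarnings(as.integer(year))
numeric_keep <- !is.na(year_num) & year_num > 2015

print(data.frame(year, lexical, numeric_keep))
```

With the string comparison the malformed value slips through, which is why it is filtered out by name.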
As you saw earlier, the “budget” and “usa_gross_income” variables were stored as characters. Their values also included a dollar symbol ($) and a space, which have to go before the values can be treated as numbers, so I removed them.
movies$budget <- gsub('[[:punct:][:blank:]]', '', movies$budget)
movies$usa_gross_income <- gsub('[[:punct:][:blank:]]', '', movies$usa_gross_income)
# movies
movies$budget <- as.numeric(movies$budget)
## Warning: NAs introduced by coercion
movies$usa_gross_income <- as.numeric(movies$usa_gross_income)
# movies
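As a small illustration of what this cleaning does, the sketch below runs the same `gsub()` pattern and `as.numeric()` conversion on a few made-up budget strings. The `"ITL 3000000"` entry is a hypothetical example of a non-dollar currency code; I am assuming values of that shape are what trigger the coercion warning above.

```r
# Hypothetical budget strings in the style of the dataset.
raw <- c("", "$ 2250", "$ 45000", "ITL 3000000")

# Strip punctuation and blanks, as in the cleaning step above.
stripped <- gsub("[[:punct:][:blank:]]", "", raw)

# "" and "ITL3000000" cannot be parsed as numbers, so they become NA.
cleaned <- suppressWarnings(as.numeric(stripped))
print(cleaned)
```

Blank entries and anything that still contains letters end up as NA, which is why the NA rows have to be dealt with before computing profit.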
totalreleases <- movies %>%
  select(title, year)
# totalreleases
p1 <- ggplot(data = totalreleases, aes(x = year, fill = year)) +
  geom_bar() +
  ggtitle("Number of Movies Released in the USA from 2016-2020") +
  theme(plot.title = element_text(hjust = 0.50)) +
  xlab("Year") +
  ylab("Number of Movies") +
  labs(fill = "Year")
p1
The effect of COVID-19 is already visible in this graph alone. The number of movies released was roughly the same in each of 2016-2018 (with 2019 off by a small amount), while the number released in 2020 barely surpasses 250. That is not even a third of the number released in each of 2016-2018.
budgetincome <- movies %>%
  select(year, title, budget, usa_gross_income)
# budgetincome
Many movies were missing a budget, a US gross income, or both. Rows with at least one NA value were therefore removed, since a movie with data for only one of budget or US gross income would distort the totals. The total profit was then calculated for each year by first grouping the data by year (group_by(year)) and then calculating the profit manually.
budgetincome1 <- na.omit(budgetincome)
# budgetincome1
totalprofit <- budgetincome1 %>%
  group_by(year) %>%
  summarize(profit = sum(usa_gross_income) - sum(budget))
totalprofit
## # A tibble: 5 x 2
## year profit
## <chr> <dbl>
## 1 2016 1914216858
## 2 2017 1857405815
## 3 2018 3211215513
## 4 2019 2519533491
## 5 2020 -153516390
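The yearly profit above is the summed gross minus the summed budget, which is the same as adding up each movie's individual profit. A minimal base R sketch with made-up numbers (the `toy` data frame and its `gross` column are illustrative names, not part of the dataset):

```r
# Toy data (made-up numbers), one row per movie.
toy <- data.frame(
  year   = c("2019", "2019", "2020", "2020"),
  budget = c(100, 50, 80, 40),
  gross  = c(300, 70, 30, 20)
)

# Per-movie profit, then summed within each year.
toy$profit <- toy$gross - toy$budget
by_year <- aggregate(profit ~ year, data = toy, FUN = sum)

# Same result computed as sum(gross) - sum(budget) per year.
check <- tapply(toy$gross, toy$year, sum) - tapply(toy$budget, toy$year, sum)

print(by_year)
```

Both routes give the same yearly totals, so a negative value means the movies of that year together grossed less in the US than they cost.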
p2 <- ggplot() +
  geom_bar(data = totalprofit, aes(x = year, y = profit, fill = year),
           position = "dodge", stat = "identity") +
  ggtitle("Total Movie Profit in the USA in Dollars from 2016-2020") +
  theme(plot.title = element_text(hjust = 0.50)) +
  xlab("Year") +
  ylab("Dollars") +
  labs(fill = "Year")
p2
Once again, there is a clear difference between 2016-2019 and 2020. All of the earlier years had a positive total profit, while 2020 is the only year with a negative value, indicating a net loss.
For quick reference, I also looked at the 20 movies from 2016-2020 with the highest US gross income. Popular titles such as the Marvel movies, Star Wars, and Disney/Pixar movies stand out. However, not a single movie is from 2020. (Several titles appear in Italian because that is how they are stored in the title column; the dataset also has an original_title column.)
visual <- budgetincome1 %>%
  arrange(desc(usa_gross_income)) %>%
  head(20)
visual
## year title budget usa_gross_income
## 1 2019 Avengers: Endgame 3.56e+08 858373000
## 2 2018 Black Panther 2.00e+08 700426566
## 3 2018 Avengers: Infinity War 3.21e+08 678815482
## 4 2017 Star Wars - Gli ultimi Jedi 3.17e+08 620181382
## 5 2018 Gli Incredibili 2 2.00e+08 608581744
## 6 2016 Rogue One: A Star Wars Story 2.00e+08 532177324
## 7 2019 Star Wars: L'ascesa di Skywalker 2.75e+08 515202542
## 8 2017 La bella e la bestia 1.60e+08 504481165
## 9 2016 Alla ricerca di Dory 2.00e+08 486295561
## 10 2019 Frozen II - Il segreto di Arendelle 1.50e+08 477373578
## 11 2019 Toy Story 4 2.00e+08 434038008
## 12 2018 Jurassic World - Il regno distrutto 1.70e+08 417719760
## 13 2016 Captain America: Civil War 2.50e+08 408084349
## 14 2019 Spider-Man: Far from Home 1.60e+08 390532085
## 15 2017 Guardiani della Galassia Vol. 2 2.00e+08 389813101
## 16 2016 Deadpool 5.80e+07 363070709
## 17 2016 Zootropolis 1.50e+08 341268248
## 18 2017 Spider-Man: Homecoming 1.75e+08 334201140
## 19 2016 Batman v Superman: Dawn of Justice 2.50e+08 330360194
## 20 2016 Suicide Squad 1.75e+08 325100054
And just for fun, let's look at a scatterplot of US gross income against budget.
p3 <- ggplot(data = budgetincome, aes(x = budget, y = usa_gross_income)) +
  geom_point(aes(color = year)) +
  geom_smooth(aes(color = year), method = "lm", se = FALSE) +
  ggtitle("US Movie Gross Income Based on Budget") +
  theme(plot.title = element_text(hjust = 0.50)) +
  labs(
    x = "Budget",
    y = "US Gross Income")
p3
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3289 rows containing non-finite values (stat_smooth).
## Warning: Removed 3289 rows containing missing values (geom_point).
Even in this graph, 2020 stands apart from the other years. Its line of best fit has a noticeably lower slope than the others, indicating that even movies with relatively high budgets did not earn much.
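The slopes that `geom_smooth(method = "lm")` draws can also be read off as numbers by fitting one linear model per year. The sketch below does this on made-up data; the `toy` data frame, its `gross` column, and all the values are invented for illustration and say nothing about the real slopes.

```r
# Toy data: two years with the same budgets but very different grosses.
toy <- data.frame(
  year   = rep(c("2019", "2020"), each = 4),
  budget = rep(c(10, 20, 30, 40), times = 2),
  gross  = c(30, 60, 90, 120, 5, 10, 15, 20)
)

# Fit gross ~ budget separately for each year and keep the slope.
slopes <- sapply(split(toy, toy$year), function(d) {
  coef(lm(gross ~ budget, data = d))[["budget"]]
})
print(slopes)
```

A slope below 1 means that each extra dollar of budget is associated with less than a dollar of additional US gross.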
It has been over a year since COVID-19 emerged, and it is fair to say that it has affected all of our lives. Its effects can be seen almost everywhere: restaurants, schools, businesses, and even the movie industry. In this project, I wanted to see the impact that the pandemic has had on the movie industry.
Overall, the dataset included a large number of variables: the movie title, language, production country, actors, directors, budget, US and worldwide income, and even a summary of each movie. However, many of these variables were not useful for this project. I decided to look at the total movie profit in the US (gross income - movie budget) within each year, since I believed that the pandemic would have a significant impact on the profits made.
Because of this, I selected only the few variables needed for that goal. First, only movies from the US released in 2016-2020 were kept. However, much of the data under the budget and US gross income variables was blank, the values had dollar symbols ($), and both variables were stored as characters rather than numbers. I cleaned the data by removing the dollar symbols and spaces with gsub('[[:punct:][:blank:]]', '', ...), which strips all punctuation and blanks from the specified variable. Afterward, I converted the two variables to numeric with as.numeric (blank entries become NA) so that the necessary calculations could be made later.
With the cleaning done, I made two different visualizations. The first shows the number of movies released each year from 2016-2020. Even here the effects of the pandemic can be seen: the number of movies released in 2020 barely surpasses 250, which is not even a third of the number released in each of 2016-2018.
The second visualization is the one I was after: the total profit made by movies each year. However, two problems occurred: 1) there were too many NA values and 2) there was no variable for profit. Rows with at least one NA value were therefore removed (na.omit), since a movie with data for only one of budget or US gross income would distort the totals. The total profit was then calculated for each year by first grouping the data by year (group_by(year)) and then calculating the profit manually (sum(usa_gross_income) - sum(budget)).
I would also have liked to find the most popular genres according to their rating (avg_vote) and how many movies of each genre were released each year. However, most of the movies were listed under multiple genres, sometimes as many as 5. Because of this, it was nearly impossible to assign each movie to one specific genre.
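One possible way around the multi-genre problem, which I did not use in this project, is to split the comma-separated genre string so that each movie gets one row per genre. The sketch below uses made-up titles and assumes the genres are always separated by a comma and a space, as in the str() output above.

```r
# Toy data in the shape of the title, genre, and avg_vote columns.
toy <- data.frame(
  title    = c("A", "B", "C"),
  genre    = c("Romance", "Biography, Crime, Drama", "Drama, History"),
  avg_vote = c(5.9, 6.1, 5.2),
  stringsAsFactors = FALSE
)

# Split each genre string, then repeat the other columns once per genre.
parts <- strsplit(toy$genre, ", ", fixed = TRUE)
long <- data.frame(
  title    = rep(toy$title, lengths(parts)),
  genre    = unlist(parts),
  avg_vote = rep(toy$avg_vote, lengths(parts))
)

# Average rating per genre.
genre_rating <- aggregate(avg_vote ~ genre, data = long, FUN = mean)
print(genre_rating)
```

The trade-off is that a movie with three genres is counted three times, so the per-genre counts no longer add up to the number of movies.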
Very clearly, the COVID-19 pandemic had a very large impact on the movie industry. 2020 was the only year in the 2016-2020 data with a net loss. At first, I thought I would have to normalize the data by finding the percent profit (usa_gross_income/budget) of each year in order to make a fair comparison, but the graph of total profits alone was enough.
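For reference, the normalization mentioned above could look something like the following sketch, which divides each year's total gross by its total budget. The numbers are made up, and `toy`, `gross`, and `ratio` are illustrative names only.

```r
# Toy data (made-up numbers), one row per movie.
toy <- data.frame(
  year   = c("2019", "2019", "2020", "2020"),
  budget = c(100, 50, 80, 40),
  gross  = c(300, 70, 30, 20)
)

# Total budget and total gross per year, then gross as a multiple of budget.
totals <- aggregate(cbind(budget, gross) ~ year, data = toy, FUN = sum)
totals$ratio <- totals$gross / totals$budget
print(totals)
```

A ratio above 1 means the year's movies grossed more than they cost, and a ratio below 1 means a net loss, regardless of how many movies were released.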
On a side note, 2019 actually had a greater total profit than 2016 and 2017 even though there was a noticeable drop in movies released. This may be due to the large number of big-name movies such as Avengers: Endgame, Frozen II, Toy Story 4, and Star Wars: The Rise of Skywalker, which grossed far more than their budgets. And because these movies were being released, many smaller movies may have held back their own releases rather than compete with them.
Ultimately, it can be said that COVID-19 had a significant impact on the US movie industry. In 2020, the movies’ total US gross income did not cover their total budget, leaving a net loss. It was a very rough time for the movie industry, but with things looking better recently, hopefully the mood will change not just for moviegoers, but for everyone.