Data 607 Final Project

Natalie Kalukeerthie, Anna Moy, Bishoy Sokkar

Introduction

During the year of 2020, the world went into lock down, many people were forced to stay home, thus having the time to consume more media than ever before.

We could see movies being a popular source of entertainment during both the peak of lock down (Spring/Summer 2020) where people watched many movies in their homes and once restrictions were eased and people were allowed to finally congregate in public spaces again.

As the movie and entertainment industry have a large impact on today’s pop culture, we’re intrigued to see how successful certain movies were over the years, especially in 2019 and 2020. We will also be delving into the gross income of a variety of movies in order to find which movie was the most profitable. Comparing the 2019 and 2020 data to determine if the pandemic impacted the total profits theses movies received.

Our datasets will be coming from Kaggle: Movie Industry and through web scraping on Box Office Mojo.

With the Box Office Mojo dataset we had to conduct data transformation to make the data usable as all the data was in one column. In the Kaggle Movie dataset we did some cleaning for our analysis. We joined the Box Office Mojo and the Kaggle Movie dataset to find the profitability for both 2019 and 2020 movies. The data was broken down into groupings on genre to determine which genre made the most profit.

Our team used slack, discord and github in order to collaborate effectively.

Data Collection

#load library
library(tidyverse)
library(rvest)
library(lubridate)
library(sf)
library(ggrepel)
library(countrycode)
library(rnaturalearth)
#read csv file from kaggle
movie <- read_csv("https://raw.githubusercontent.com/AnnaMoy/Data-607-Project/main/movies.csv")
movie
## # A tibble: 7,668 × 15
##    name   rating genre  year released score  votes director writer star  country
##    <chr>  <chr>  <chr> <dbl> <chr>    <dbl>  <dbl> <chr>    <chr>  <chr> <chr>  
##  1 The S… R      Drama  1980 June 13…   8.4 9.27e5 Stanley… Steph… Jack… United…
##  2 The B… R      Adve…  1980 July 2,…   5.8 6.5 e4 Randal … Henry… Broo… United…
##  3 Star … PG     Acti…  1980 June 20…   8.7 1.20e6 Irvin K… Leigh… Mark… United…
##  4 Airpl… PG     Come…  1980 July 2,…   7.7 2.21e5 Jim Abr… Jim A… Robe… United…
##  5 Caddy… R      Come…  1980 July 25…   7.3 1.08e5 Harold … Brian… Chev… United…
##  6 Frida… R      Horr…  1980 May 9, …   6.4 1.23e5 Sean S.… Victo… Bets… United…
##  7 The B… R      Acti…  1980 June 20…   7.9 1.88e5 John La… Dan A… John… United…
##  8 Ragin… R      Biog…  1980 Decembe…   8.2 3.30e5 Martin … Jake … Robe… United…
##  9 Super… PG     Acti…  1980 June 19…   6.8 1.01e5 Richard… Jerry… Gene… United…
## 10 The L… R      Biog…  1980 May 16,…   7   1   e4 Walter … Bill … Davi… United…
## # ℹ 7,658 more rows
## # ℹ 4 more variables: budget <dbl>, gross <dbl>, company <chr>, runtime <dbl>
# Webscraping from Boxoffice Mojo website
# link to the website
url <- "https://www.boxofficemojo.com/year/2020/"

# read url link 
webpage <- read_html(url)

# pull in movie title, gross, total gross and release date from 2020 movie information 
gross_2020 <- html_nodes(webpage, ".mojo-field-type-date , .mojo-estimatable , .mojo-cell-wide") %>%
  html_text()

# put the information into a dataframe
df <- data.frame(gross = gross_2020)

head(df,20)
##                      gross
## 1    Release\n            
## 2                    Gross
## 3              Total Gross
## 4             Release Date
## 5        Bad Boys for Life
## 6             $204,417,855
## 7             $206,305,244
## 8                   Jan 17
## 9                     1917
## 10            $157,901,466
## 11            $159,227,644
## 12                  Dec 25
## 13      Sonic the Hedgehog
## 14            $146,066,470
## 15            $148,974,665
## 16                  Feb 14
## 17 Jumanji: The Next Level
## 18            $124,736,710
## 19            $320,314,960
## 20                  Dec 13

We decided to get our dataset through web-scraping. By using the function html_nodes(), we pulled in gross movie data for 2020 from https://www.boxofficemojo.com/year/2020/

Data Transformation and Cleaning

The originally scraped data is not in a usable format, so we must transform it into wide data so each column has it’s own observations in each column.

# All the data was in one column therefore we split up the data into their own columns and provided new column names
df <- df %>%
  mutate(ind = rep(c("title", "gross_data", "total_gross","date"),length.out = n())) %>%
  group_by(ind) %>%
  mutate(id = row_number()) %>%
  spread(ind, gross) %>%
  select(-id)

#remove line 1 which initially has the old titles
moviedf <- df[-1,]
# Transform the release date in the kaggle movie dataset
movie<- movie %>% 
  separate_wider_delim(released, delim =" (", names= c("date","location"))

#convert release date into date format
movie$date <- mdy(movie$date)
## Warning: 10 failed to parse.
#add in year and quarter
movie<-movie %>%
  mutate(qtr= quarter(date,with_year = T))
#joining the data for kaggle movie and webscraped data from Box Office Mojo
movie_df <-inner_join(movie, moviedf, by=c("name" = "title"))
#creating a subset and filter for 2020 movies
movie2020 <- movie_df %>%
  filter(year %in% c("2020"))

movie2020_numeric <- movie2020

#converting gross income into a numeric value for analysis
movie2020_numeric$total_gross_numeric <- movie2020_numeric$total_gross
movie2020_numeric$total_gross_numeric <- as.character(gsub("\\$", "", movie2020_numeric$total_gross_numeric))
movie2020_numeric$total_gross_numeric <- as.numeric(gsub(",", "", movie2020_numeric$total_gross_numeric))

#displaying highest gross incomes first
movie2020_numeric <- movie2020_numeric[order(movie2020_numeric$total_gross_numeric, decreasing = TRUE),]
movie2020_numeric
## # A tibble: 11 × 21
##    name      rating genre  year date.x     location score  votes director writer
##    <chr>     <chr>  <chr> <dbl> <date>     <chr>    <dbl>  <dbl> <chr>    <chr> 
##  1 Bad Boys… R      Acti…  2020 2020-01-17 United …   6.6 140000 Adil El… Peter…
##  2 Sonic th… PG     Acti…  2020 2020-02-14 United …   6.5 102000 Jeff Fo… Pat C…
##  3 Birds of… R      Acti…  2020 2020-02-07 United …   6.1 190000 Cathy Y… Chris…
##  4 Dolittle  PG     Adve…  2020 2020-01-17 United …   5.6  53000 Stephen… Steph…
##  5 The Invi… R      Drama  2020 2020-02-28 United …   7.1 186000 Leigh W… Leigh…
##  6 The Call… PG     Adve…  2020 2020-02-21 United …   6.8  42000 Chris S… Micha…
##  7 Onward    PG     Anim…  2020 2020-03-06 United …   7.4 120000 Dan Sca… Dan S…
##  8 Tenet     PG-13  Acti…  2020 2020-09-03 United …   7.4 387000 Christo… Chris…
##  9 Wonder W… PG-13  Acti…  2020 2020-12-25 United …   5.4 217000 Patty J… Patty…
## 10 Tulsa     PG-13  Come…  2020 2020-06-03 United …   5      294 Scott P… Scott…
## 11 The Eigh… Not R… Acti…  2020 2020-08-28 United …   6.8   3700 Hu Guan  Hu Gu…
## # ℹ 11 more variables: star <chr>, country <chr>, budget <dbl>, gross <dbl>,
## #   company <chr>, runtime <dbl>, qtr <dbl>, date.y <chr>, gross_data <chr>,
## #   total_gross <chr>, total_gross_numeric <dbl>
#creating a subset and filter for 2019 movies
movie2019 <- movie_df %>%
  filter(year %in% c("2019"))

movie2019_numeric <- movie2019

#converting gross income into a numeric value for analysis
movie2019_numeric$total_gross_numeric <- movie2019_numeric$total_gross
movie2019_numeric$total_gross_numeric <- as.character(gsub("\\$", "", movie2019_numeric$total_gross_numeric))
movie2019_numeric$total_gross_numeric <- as.numeric(gsub(",", "", movie2019_numeric$total_gross_numeric))

#displaying highest gross incomes first
movie2019_numeric <- movie2019_numeric[order(movie2019_numeric$total_gross_numeric, decreasing = TRUE),]
movie2019_numeric
## # A tibble: 17 × 21
##    name      rating genre  year date.x     location score  votes director writer
##    <chr>     <chr>  <chr> <dbl> <date>     <chr>    <dbl>  <dbl> <chr>    <chr> 
##  1 Star War… PG-13  Acti…  2019 2019-12-20 United …   6.5 3.94e5 J.J. Ab… Chris…
##  2 Frozen II PG     Anim…  2019 2019-11-22 United …   6.8 1.48e5 Chris B… Jenni…
##  3 Joker     R      Crime  2019 2019-10-04 United …   8.4 1   e6 Todd Ph… Todd …
##  4 Jumanji:… PG-13  Acti…  2019 2019-12-13 United …   6.6 2.04e5 Jake Ka… Jake …
##  5 Knives O… PG-13  Come…  2019 2019-11-27 United …   7.9 5.05e5 Rian Jo… Rian …
##  6 1917      R      Drama  2019 2020-01-10 United …   8.3 4.73e5 Sam Men… Sam M…
##  7 Ford v F… PG-13  Acti…  2019 2019-11-15 United …   8.1 3.23e5 James M… Jez B…
##  8 Malefice… PG     Adve…  2019 2019-10-18 United …   6.6 9.1 e4 Joachim… Linda…
##  9 Little W… PG     Drama  2019 2019-12-25 United …   7.8 1.59e5 Greta G… Greta…
## 10 The Adda… PG     Anim…  2019 2019-10-11 United …   5.8 3.10e4 Greg Ti… Matt …
## 11 Zombiela… R      Acti…  2019 2019-10-18 United …   6.7 1.51e5 Ruben F… Rhett…
## 12 A Beauti… PG     Biog…  2019 2019-11-22 United …   7.3 6.70e4 Mariell… Micah…
## 13 Parasite  R      Come…  2019 2019-11-08 United …   8.6 6.31e5 Bong Jo… Bong …
## 14 Jojo Rab… PG-13  Come…  2019 2019-11-08 United …   7.9 3.28e5 Taika W… Chris…
## 15 Weatheri… PG-13  Anim…  2019 2020-01-17 United …   7.5 2.8 e4 Makoto … Makot…
## 16 Ip Man 4… Not R… Acti…  2019 2019-12-20 China)     7   2.70e4 Wilson … Edmon…
## 17 Portrait… R      Drama  2019 2020-02-14 United …   8.1 7.4 e4 Céline … Célin…
## # ℹ 11 more variables: star <chr>, country <chr>, budget <dbl>, gross <dbl>,
## #   company <chr>, runtime <dbl>, qtr <dbl>, date.y <chr>, gross_data <chr>,
## #   total_gross <chr>, total_gross_numeric <dbl>

Analysis

Using our tidied data, we will use it to discover some insights on the movies released in 2020, such as:

Which movie in 2019 and 2020 was the most profitable?

To define a movie’s success, I will measure it by profit, where if a movie makes more than what they spend for their budget, they will be considered successful, and the movie with the largest profit will be the most successful.

## Warning: Removed 3 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

Using a scatterplot to compare budget against gross_income of each movie, we can that there isn’t a general trend here, however, there are a few movies that had similar budgets but varied in their overall gross.

Now that we see the trend, let’s see which movies made a profit:

## Warning: Removed 3 rows containing missing values (`geom_point()`).
## Warning: Removed 1 rows containing missing values (`geom_point()`).

From this visual we can identify which movies had a total gross that surpassed their budget, thus making a profit. We can also see that a majority of movies did not make a profit.

# Sort the dataset by profit in descending order
movie2020_numeric_sorted <- movie2020_numeric[order(-movie2020_numeric$profit), ]

# We'll look at the top 5 movies
top_5_profitable <- head(movie2020_numeric_sorted, 5)

print(top_5_profitable[, c("name", "budget", "total_gross", "profit")])
## # A tibble: 5 × 4
##   name                    budget total_gross     profit
##   <chr>                    <dbl> <chr>            <dbl>
## 1 Bad Boys for Life     90000000 $206,305,244 116305244
## 2 Sonic the Hedgehog    85000000 $148,974,665  63974665
## 3 The Invisible Man      7000000 $70,410,000   63410000
## 4 Birds of Prey         84500000 $84,158,461    -341539
## 5 The Call of the Wild 135000000 $62,342,368  -72657632

From listing the top 5 movies, we can see that Bad Boys for Life was the most successful movie in 2020.

We can also compare the profits between 2019 and 2020 in our analysis.

#finding difference
movie2019_numeric$profit <- movie2019_numeric$total_gross_numeric - movie2019_numeric$budget

# Create a new column indicating whether the movie made a profit or not
movie2019_numeric$profitable <- ifelse(movie2019_numeric$profit > 0, "Profitable", "Not Profitable")


# Sort the dataset by profit in descending order
movie2019_numeric_sorted <- movie2019_numeric[order(-movie2019_numeric$profit), ]

# We'll look at the top 5 movies
top_5_profitable_2019 <- head(movie2019_numeric_sorted, 5)

print(top_5_profitable_2019[, c("name", "budget", "total_gross", "profit")])
## # A tibble: 5 × 4
##   name                                             budget total_gross     profit
##   <chr>                                             <dbl> <chr>            <dbl>
## 1 Frozen II                                     150000000 $477,373,578 327373578
## 2 Joker                                          55000000 $335,451,311 280451311
## 3 Star Wars: Episode IX - The Rise of Skywalker 275000000 $515,202,542 240202542
## 4 Jumanji: The Next Level                       125000000 $320,314,960 195314960
## 5 Knives Out                                     40000000 $165,363,234 125363234

From listing the top 5 movies, we can see that Frozen II was the most successful movie in 2019.

Another analysis we conducted was:

Which genre was the most profitable in 2019 and 2020?

genre2019 <- movie2019_numeric %>%
  mutate(profit = total_gross_numeric-budget) %>%
  group_by(genre) %>%
  summarize(avg_profit = mean(profit)) %>%
  arrange(desc(avg_profit))

genre2019
## # A tibble: 7 × 2
##   genre     avg_profit
##   <chr>          <dbl>
## 1 Crime     280451311 
## 2 Animation 133458373 
## 3 Action     87724194.
## 4 Comedy     62234630.
## 5 Biography  36704055 
## 6 Adventure -71070395 
## 7 Drama            NA

Crime and Animation genre movies has the highest profit in 2019. Drama genre movies has the highest profit in 2020

genre2020 <- movie2020_numeric %>%
  mutate(profit = total_gross_numeric-budget) %>%
  group_by(genre) %>%
  summarize(avg_profit = mean(profit)) %>%
  arrange(desc(avg_profit))

genre2020
## # A tibble: 5 × 2
##   genre     avg_profit
##   <chr>          <dbl>
## 1 Drama      63410000 
## 2 Adventure -85305284.
## 3 Action           NA 
## 4 Animation        NA 
## 5 Comedy           NA

Data Visualization New Feature To Create Map

Our team wanted to visualize the consumption of movies across the world before and after the pandemic 2019 vs 2020. Using ggplot along with the packages loaded below, our team was able to create a map showing total movie revenue by country. To create a map we used dplyr to filter the relevant columns, including the country. We then grouped by country summing the revenue from each movie. We then used the Iso package to find a common code for each country to be easily identifiable. We then used the sf package to load a map and plot the data on.

#First we create a data frame with the relevent information needed for the map. 
map_data <- movie %>%   
  select(year, country, budget, gross) %>% 
  mutate( profit = gross-budget) %>%  #created a new variable named profit, which calculates the net profit per movie.
  na.omit(gross) %>% #removed empty variables 
  filter(year == 2019) %>% #filtered for year 2019
  group_by(country) %>% #grouped by country 
  summarise(total_income = sum(profit)) 


#Add ISO which is a 3 letters code for each country  for Join function to work. This uses the countrycode package's countrycode function to convert country names in the "country" column into ISO 3166-1 alpha-3 country codes.

data_with_iso <- map_data %>%
  mutate(Iso3 = countrycode::countrycode(
    sourcevar = country,
    origin = "country.name",
    destination = "iso3c"))
#The following code uses the ne_countries function from the rnaturalearth package to retrieve a dataset containing geometrical information (e.g., boundaries) of countries worldwide.

world <- ne_countries(scale = "small", returnclass = "sf") 

#The following code combining the data frame we created above with data with ISO which will give the co ordinations of each country.
countries_movie_consumption <- world %>%
  select(geometry, name, iso_a3) %>%
  left_join(data_with_iso, by = c("iso_a3" = "Iso3"))

#The following code generates a map visualization showing the distribution of movie revenue by country for 2019, with darker shades representing higher revenue.
world %>% 
  st_transform(crs = "+proj=robin") %>% #This first transforms the geometrical data in the "world" dataset to use the Robinson projection. The Robinson projection is a pseudo-cylindrical projection commonly used for world maps.
  ggplot() +
  geom_sf(color = "darkgray") +
  geom_sf(data = countries_movie_consumption, aes(fill = total_income)) +
  theme_minimal() +
   labs(
    title = "World Movie Revenue by Country for 2019",
    subtitle = "Total Gross Income from Movies in million",
    fill = "Total Gross Income (USD)"
  )

From the plot above we can see that the USA is a lead in movie revenue. The USA movie industry not only generates substantial revenue but also serves as a key ambassador of American culture and values worldwide.

map_data_2020 <- movie %>%
  select(year, country, budget, gross) %>%
  mutate( profit = gross-budget) %>%
  na.omit(gross) %>%
  filter(year == 2020) %>%
  group_by(country) %>%
  summarise(total_gross_income = sum(gross))

data_with_iso <- map_data_2020 %>%
  mutate(Iso3 = countrycode::countrycode(
    sourcevar = country,
    origin = "country.name",
    destination = "iso3c"))


world_2020 <- ne_countries(scale = "small", returnclass = "sf") 

countries_movie_consumption_2020 <- world_2020 %>%
  select(geometry, name, iso_a3) %>%
  left_join(data_with_iso, by = c("iso_a3" = "Iso3"))


world_2020 %>% 
  st_transform(crs = "+proj=robin") %>%
  ggplot() +
  geom_sf(color = "darkgray") +
  geom_sf(data = countries_movie_consumption_2020, aes(fill = total_gross_income)) +
  theme_minimal() +
   labs(
    title = "World Movie Revenue by Country for 2020",
    subtitle = "Total Gross Income from Movies in million",
    fill = "Total Gross Income (USD)"
  )

We can see from the plot above that in 2020 the movie revenue for three countries: USA, Canada and China. The plot also shows a significant decline in Europe profits through the movie industry, compared to the year prior.

Statistical Analysis

Null Hypothesis: There is no difference between 2019 and 2020 profit

Alternative Hypothesis: There is a difference between 2019 and 2020 profit

Reject the Null Hypothesis since p-value is less than .05.

Based on the t-test it can be concluded there is a difference between the 2019 and 2020 profit.

#creating a subset and filter for 2019 and 2020 movies
movieall <- movie_df %>%
  filter(year %in% c("2019", "2020"))


#converting gross income into a numeric value for analysis
movieall$total_gross_numeric <- movieall$total_gross
movieall$total_gross_numeric <- as.character(gsub("\\$", "", movieall$total_gross_numeric))
movieall$total_gross_numeric <- as.numeric(gsub(",", "", movieall$total_gross_numeric))

#displaying highest gross incomes first
movieall2 <- movieall[order(movieall$total_gross_numeric, decreasing = TRUE),]


# 2019 and 2020 movie gross data
movieall3 <- movieall2 %>%
  mutate(profit = (total_gross_numeric- budget)) 

#T Test 
t.test(profit ~ year, data = movieall3)
## 
##  Welch Two Sample t-test
## 
## data:  profit by year
## t = 2.432, df = 17.118, p-value = 0.02628
## alternative hypothesis: true difference in means between group 2019 and group 2020 is not equal to 0
## 95 percent confidence interval:
##   14213909 199646652
## sample estimates:
## mean in group 2019 mean in group 2020 
##           87757113          -19173167
#Make a copy of the year column
movieall3$yearchar = movieall3$year 
# change yearchar to categorical 
movieall3$yearchar <- as.factor(movieall3$yearchar)


#boxplot for year and profits
ggplot(movieall3, aes(yearchar, profit)) +
  geom_boxplot() +
  labs(
    title = "Movie Profit for 2019 and 2020",
    x = "Years",
    y = "Profit")
## Warning: Removed 4 rows containing non-finite values (`stat_boxplot()`).

Conclusion

We can see through our analysis that movies had a large impact on views, especially during the pandemic, where it became one of the main source of entertainment. The profits in movies in 2020 were significantly lower than those in 2019 due to the pandemic a lot less people went to the movies. Movie genre in Crimes made more profit in 2019 and in 2020 the Drama genre made the most profit.

We did come across a few challenges in the making of this project, one of them being choosing two datasets that have a relationship and can be used in analysis. It took us quite some time to find two datasets, especially since the two datasets could not be from the same source. Despite this difficulty, taking the time to discover and compare multiple datasets helped us think and interpret data in a relational sense. We encountered a challenge while generating the map where the fill did not function properly for countries with similar names. To address this issue, we incorporated the ISO3 dataset and merged it with our original dataset. This allowed us to produce a visually appealing map with accurate fill coloring.