Subsets and Visualizations: Movie Mojo

For this homework we are using the data set mojo from the classdata package. mojo constains data on box office revenue for movies based on the website https://www.boxofficemojo.com .

Run the following two commands to install the newest version of the package from github:

library(devtools) # error? then run install.packages("devtools")
devtools::install_github("heike/classdata")
library(dplyr)

Check what is in the data set:

library(classdata)
head(mojo)
##   TW LW                         Title Studio Weekend Gross % Change
## 1  1  1             Crazy Rich Asians     WB      21964345    -11.5
## 2  2  2                       The Meg     WB      10535459    -17.8
## 3  3  4 Mission: Impossible - Fallout   Par.       7032499    -13.0
## 4  4 22                     Searching   SGem       6066463   1460.4
## 5  5 NA              Operation Finale    MGM       6022758       NA
## 6  6  6    Disney's Christopher Robin     BV       5277955    -15.7
##   Theater Count Theater Change Average Total Gross Budget (in Million)
## 1          3865            339    5683   110691733                  30
## 2          3761           -270    2801   120521875                 130
## 3          2639           -413    2665   204379028                 178
## 4          1207           1198    5026     6574943                  NA
## 5          1818             NA    3313     7749853                  24
## 6          2925           -469    1804    85686823                  NA
##   Week                     Weekend Year WeekNo
## 1    3 August 31-September 2, 2018 2018     35
## 2    4 August 31-September 2, 2018 2018     35
## 3    6 August 31-September 2, 2018 2018     35
## 4    2 August 31-September 2, 2018 2018     35
## 5    1 August 31-September 2, 2018 2018     35
## 6    5 August 31-September 2, 2018 2018     35
  1. Download the RMarkdown file with these homework instructions to use as a template for your work. Make sure to replace “Your Name” in the YAML with your name.

2.

What is the difference between the variables Week and WeekNo? Describe in your words.

str(mojo)
## 'data.frame':    31718 obs. of  15 variables:
##  $ TW                 : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ LW                 : num  1 2 4 22 NA 6 7 3 8 5 ...
##  $ Title              : chr  "Crazy Rich Asians" "The Meg" "Mission: Impossible - Fallout" "Searching" ...
##  $ Studio             : chr  "WB" "WB" "Par." "SGem" ...
##  $ Weekend Gross      : num  21964345 10535459 7032499 6066463 6022758 ...
##  $ % Change           : num  -11.5 -17.8 -13 1460.4 NA ...
##  $ Theater Count      : num  3865 3761 2639 1207 1818 ...
##  $ Theater Change     : num  339 -270 -413 1198 NA ...
##  $ Average            : num  5683 2801 2665 5026 3313 ...
##  $ Total Gross        : num  1.11e+08 1.21e+08 2.04e+08 6.57e+06 7.75e+06 ...
##  $ Budget (in Million): num  30 130 178 NA 24 NA 51 40 15 50 ...
##  $ Week               : num  3 4 6 2 1 5 3 2 4 3 ...
##  $ Weekend            : chr  "August 31-September 2, 2018" "August 31-September 2, 2018" "August 31-September 2, 2018" "August 31-September 2, 2018" ...
##  $ Year               : num  2018 2018 2018 2018 2018 ...
##  $ WeekNo             : num  35 35 35 35 35 35 35 35 35 35 ...
summary(mojo$Week)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -48.00    3.00    6.00   23.72   12.00  873.00
summary(mojo$WeekNo)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   14.00   26.00   26.14   38.00   53.00

Using the funcitons str and summary for variables week and weekno, week is a numeric variable, but weekno seems to be a factor variable.

3.

Use ggplot2 to plot total gross (Total Gross) against week number (WeekNo). Facet by Year. Interpret the result. Which movie had the highest total gross over the time frame? How many weeks was that movie on rank 1? How long was it in box offices overall?

library(tidyverse)
## -- Attaching packages --------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
mojo%>%
  ggplot(aes(x = WeekNo, y = `Total Gross`))+
  geom_point()+
  facet_wrap(~Year, scales = "free")

  which.max(mojo$`Total Gross`)
## [1] 12881
mojo[12881, "Title"]
## [1] "Star Wars: The Force Awakens"
mojo%>%
  filter(Title=="Star Wars: The Force Awakens" )-> mojo.starwars
sum(mojo.starwars$TW=="1")
## [1] 5
length(mojo.starwars$Week)
## [1] 25

Over time, the total gross of the movies over weeks have increased considerably. There is one top- selling movie in 2016 with the highest total gross among the movies of the years 2013-2018. This movie is the “Star wars: The Force Awakens”. It was on rank one of the box office for five weeks. This movie was overall in box offices for 25 weeks.

4.

Pick two movies that were in box office some time between 2013 and 2018 and find the corresponding data in the mojo data. How does total gross of the two movies compare? Draw a plot and comment on the result.

mojo%>%
  filter(Title=="The Equalizer 2" | Title=="Hacksaw Ridge" )->fav.movies
fav.movies%>%
  ggplot(aes(x = Week, y = `Total Gross`, colour=Title)) +
  geom_line(linetype = "dashed")+
  geom_point()

For this part of the exercise, I pick two movies “The Equalizer 2” and “Hacksaw Ridge”. If we plot their total gross, we would have two increasing trends because the total gross of the movies is cumlative over the weeks the movie was on box office. Therefore, I make the total gross per week and plot them versus week number. Looking at the first plot, The equalizer 2 had a larger total gross in the first weeks and it ended up with a higher total gross comparing to Hacksa Ridge which was in box office for 18 weeks.

fav.movies%>%
  filter(Title=="The Equalizer 2")->the.equalizer

fav.movies%>%
  filter(Title=="Hacksaw Ridge")->hacksaw.ridge

the.equalizer[, "Total Gross"]<-rev(c(the.equalizer[7, "Total Gross"],abs(diff(sort(the.equalizer$`Total Gross`)))))

hacksaw.ridge[, "Total Gross"]<-rev(c(hacksaw.ridge[18, "Total Gross"],abs(diff(sort(hacksaw.ridge$`Total Gross`)))))

fav.movies[, "Total Gross"]<- c(the.equalizer$`Total Gross`, hacksaw.ridge$`Total Gross`)

fav.movies%>%
  ggplot(aes(x = Week, y = `Total Gross`, colour=Title)) +
  geom_line(linetype = "dashed")+
  geom_point()

In the second plot, it is evident that they both lost popularity over time however there was an increase in the total gross of Hacksa Ridge on the second week.

5. a)

Hit or Flop? The variable Budget (in Millions) contains estimated budget numbers for some movies. For how many movies is this information available (careful! trick question - look at what the function unique does)?

na.budget<-which(is.na(mojo$`Budget (in Million)`))
mojo.new<-mojo[-na.budget, ]
#table(mojo$Title)
length(unique(mojo.new$Title))
## [1] 658

There are 658 movies which have the info of Budget (in Million).

5. b)

Studios would like to see their budget returned by the opening weekend. What is the percentage of movies for which that happened? How many movies did not have their budget matched in total gross by the third weekend?

mojo.new%>%
  filter(Week==1)%>%
  filter(`Weekend Gross`/10^6 > `Budget (in Million)`)->returned.budget
nrow(returned.budget)
## [1] 103

There are only 103 movies that chould return their budget on the first opening week.

mojo.new%>%
  filter(Week==3)%>%
  filter(`Total Gross`/10^6 > `Budget (in Million)`)->week.3
nrow(week.3)
## [1] 308

308 studios with available budget value earned back their budget by the third week.

5. c)

For each of these two questions describe your ‘plan of attack’, i.e. lay out how you go about finding an answer to the question.

First remove the NAs from the variable Budget (in million), then find the unique movies with their titles because after removing NAs from Budget, there are still some movies with duplicated titles, but different information in other variables. Then the length of the vector shows the number of unique movies with available Budget information. For the second part of the problem, I filtered the variable week==1 for the data with available information of their Budget. Then filter again using a logical vector to get the number of movies which earned back their budget on the opening week.

5.

Identify one movie, that did not match its budget by week 3. Plot the incurred loss over time.

mojo.new%>%
  filter(Week==3)%>%
  filter(`Total Gross`/10^6 < `Budget (in Million)`)->week.3
head(week.3)
##   TW LW                 Title   Studio Weekend Gross % Change
## 1  7  7                 Alpha Studio 8       4548627    -24.2
## 2 10  5               Mile 22      STX       3765869    -40.8
## 3  2  2               The Meg       WB      12812615    -39.4
## 4 13  6 The Spy Who Dumped Me      LGF       2617298    -59.5
## 5 21 17     Death of a Nation       QF        311193    -68.5
## 6 22 13     The Darkest Minds      Fox        255173    -88.1
##   Theater Count Theater Change Average Total Gross Budget (in Million)
## 1          2881            162    1579    27470847                  51
## 2          2950           -570    1277    31952203                  50
## 3          4031            -87    3179   105083261                 130
## 4          2409           -702    1086    29998313                  40
## 5           354           -471     879     5312420                   6
## 6           448          -2679     570    12289659                  34
##   Week                     Weekend Year WeekNo
## 1    3 August 31-September 2, 2018 2018     35
## 2    3 August 31-September 2, 2018 2018     35
## 3    3          August 24-26, 2018 2018     34
## 4    3          August 17-19, 2018 2018     33
## 5    3          August 17-19, 2018 2018     33
## 6    3          August 17-19, 2018 2018     33
mojo.new%>%
  filter(Title=="The Meg")%>%
  ggplot(aes(x=Week, y = `Weekend Gross`))+
  geom_point(colour="Blue")+
  geom_line(linetype = "dashed", colour="Red")