For this homework we are using the data set mojo from the classdata package. mojo constains data on box office revenue for movies based on the website https://www.boxofficemojo.com .
Run the following two commands to install the newest version of the package from github:
library(devtools) # error? then run install.packages("devtools")
devtools::install_github("heike/classdata")
library(dplyr)
Check what is in the data set:
library(classdata)
head(mojo)
## TW LW Title Studio Weekend Gross % Change
## 1 1 1 Crazy Rich Asians WB 21964345 -11.5
## 2 2 2 The Meg WB 10535459 -17.8
## 3 3 4 Mission: Impossible - Fallout Par. 7032499 -13.0
## 4 4 22 Searching SGem 6066463 1460.4
## 5 5 NA Operation Finale MGM 6022758 NA
## 6 6 6 Disney's Christopher Robin BV 5277955 -15.7
## Theater Count Theater Change Average Total Gross Budget (in Million)
## 1 3865 339 5683 110691733 30
## 2 3761 -270 2801 120521875 130
## 3 2639 -413 2665 204379028 178
## 4 1207 1198 5026 6574943 NA
## 5 1818 NA 3313 7749853 24
## 6 2925 -469 1804 85686823 NA
## Week Weekend Year WeekNo
## 1 3 August 31-September 2, 2018 2018 35
## 2 4 August 31-September 2, 2018 2018 35
## 3 6 August 31-September 2, 2018 2018 35
## 4 2 August 31-September 2, 2018 2018 35
## 5 1 August 31-September 2, 2018 2018 35
## 6 5 August 31-September 2, 2018 2018 35
What is the difference between the variables Week and WeekNo? Describe in your words.
str(mojo)
## 'data.frame': 31718 obs. of 15 variables:
## $ TW : num 1 2 3 4 5 6 7 8 9 10 ...
## $ LW : num 1 2 4 22 NA 6 7 3 8 5 ...
## $ Title : chr "Crazy Rich Asians" "The Meg" "Mission: Impossible - Fallout" "Searching" ...
## $ Studio : chr "WB" "WB" "Par." "SGem" ...
## $ Weekend Gross : num 21964345 10535459 7032499 6066463 6022758 ...
## $ % Change : num -11.5 -17.8 -13 1460.4 NA ...
## $ Theater Count : num 3865 3761 2639 1207 1818 ...
## $ Theater Change : num 339 -270 -413 1198 NA ...
## $ Average : num 5683 2801 2665 5026 3313 ...
## $ Total Gross : num 1.11e+08 1.21e+08 2.04e+08 6.57e+06 7.75e+06 ...
## $ Budget (in Million): num 30 130 178 NA 24 NA 51 40 15 50 ...
## $ Week : num 3 4 6 2 1 5 3 2 4 3 ...
## $ Weekend : chr "August 31-September 2, 2018" "August 31-September 2, 2018" "August 31-September 2, 2018" "August 31-September 2, 2018" ...
## $ Year : num 2018 2018 2018 2018 2018 ...
## $ WeekNo : num 35 35 35 35 35 35 35 35 35 35 ...
summary(mojo$Week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -48.00 3.00 6.00 23.72 12.00 873.00
summary(mojo$WeekNo)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 14.00 26.00 26.14 38.00 53.00
Using the funcitons str and summary for variables week and weekno, week is a numeric variable, but weekno seems to be a factor variable.
Use ggplot2 to plot total gross (Total Gross) against week number (WeekNo). Facet by Year. Interpret the result. Which movie had the highest total gross over the time frame? How many weeks was that movie on rank 1? How long was it in box offices overall?
library(tidyverse)
## -- Attaching packages --------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
mojo%>%
ggplot(aes(x = WeekNo, y = `Total Gross`))+
geom_point()+
facet_wrap(~Year, scales = "free")
which.max(mojo$`Total Gross`)
## [1] 12881
mojo[12881, "Title"]
## [1] "Star Wars: The Force Awakens"
mojo%>%
filter(Title=="Star Wars: The Force Awakens" )-> mojo.starwars
sum(mojo.starwars$TW=="1")
## [1] 5
length(mojo.starwars$Week)
## [1] 25
Over time, the total gross of the movies over weeks have increased considerably. There is one top- selling movie in 2016 with the highest total gross among the movies of the years 2013-2018. This movie is the “Star wars: The Force Awakens”. It was on rank one of the box office for five weeks. This movie was overall in box offices for 25 weeks.
Pick two movies that were in box office some time between 2013 and 2018 and find the corresponding data in the mojo data. How does total gross of the two movies compare? Draw a plot and comment on the result.
mojo%>%
filter(Title=="The Equalizer 2" | Title=="Hacksaw Ridge" )->fav.movies
fav.movies%>%
ggplot(aes(x = Week, y = `Total Gross`, colour=Title)) +
geom_line(linetype = "dashed")+
geom_point()
For this part of the exercise, I pick two movies “The Equalizer 2” and “Hacksaw Ridge”. If we plot their total gross, we would have two increasing trends because the total gross of the movies is cumlative over the weeks the movie was on box office. Therefore, I make the total gross per week and plot them versus week number. Looking at the first plot, The equalizer 2 had a larger total gross in the first weeks and it ended up with a higher total gross comparing to Hacksa Ridge which was in box office for 18 weeks.
fav.movies%>%
filter(Title=="The Equalizer 2")->the.equalizer
fav.movies%>%
filter(Title=="Hacksaw Ridge")->hacksaw.ridge
the.equalizer[, "Total Gross"]<-rev(c(the.equalizer[7, "Total Gross"],abs(diff(sort(the.equalizer$`Total Gross`)))))
hacksaw.ridge[, "Total Gross"]<-rev(c(hacksaw.ridge[18, "Total Gross"],abs(diff(sort(hacksaw.ridge$`Total Gross`)))))
fav.movies[, "Total Gross"]<- c(the.equalizer$`Total Gross`, hacksaw.ridge$`Total Gross`)
fav.movies%>%
ggplot(aes(x = Week, y = `Total Gross`, colour=Title)) +
geom_line(linetype = "dashed")+
geom_point()
In the second plot, it is evident that they both lost popularity over time however there was an increase in the total gross of Hacksa Ridge on the second week.
Hit or Flop? The variable Budget (in Millions) contains estimated budget numbers for some movies. For how many movies is this information available (careful! trick question - look at what the function unique does)?
na.budget<-which(is.na(mojo$`Budget (in Million)`))
mojo.new<-mojo[-na.budget, ]
#table(mojo$Title)
length(unique(mojo.new$Title))
## [1] 658
There are 658 movies which have the info of Budget (in Million).
Studios would like to see their budget returned by the opening weekend. What is the percentage of movies for which that happened? How many movies did not have their budget matched in total gross by the third weekend?
mojo.new%>%
filter(Week==1)%>%
filter(`Weekend Gross`/10^6 > `Budget (in Million)`)->returned.budget
nrow(returned.budget)
## [1] 103
There are only 103 movies that chould return their budget on the first opening week.
mojo.new%>%
filter(Week==3)%>%
filter(`Total Gross`/10^6 > `Budget (in Million)`)->week.3
nrow(week.3)
## [1] 308
308 studios with available budget value earned back their budget by the third week.
For each of these two questions describe your ‘plan of attack’, i.e. lay out how you go about finding an answer to the question.
First remove the NAs from the variable Budget (in million), then find the unique movies with their titles because after removing NAs from Budget, there are still some movies with duplicated titles, but different information in other variables. Then the length of the vector shows the number of unique movies with available Budget information. For the second part of the problem, I filtered the variable week==1 for the data with available information of their Budget. Then filter again using a logical vector to get the number of movies which earned back their budget on the opening week.
Identify one movie, that did not match its budget by week 3. Plot the incurred loss over time.
mojo.new%>%
filter(Week==3)%>%
filter(`Total Gross`/10^6 < `Budget (in Million)`)->week.3
head(week.3)
## TW LW Title Studio Weekend Gross % Change
## 1 7 7 Alpha Studio 8 4548627 -24.2
## 2 10 5 Mile 22 STX 3765869 -40.8
## 3 2 2 The Meg WB 12812615 -39.4
## 4 13 6 The Spy Who Dumped Me LGF 2617298 -59.5
## 5 21 17 Death of a Nation QF 311193 -68.5
## 6 22 13 The Darkest Minds Fox 255173 -88.1
## Theater Count Theater Change Average Total Gross Budget (in Million)
## 1 2881 162 1579 27470847 51
## 2 2950 -570 1277 31952203 50
## 3 4031 -87 3179 105083261 130
## 4 2409 -702 1086 29998313 40
## 5 354 -471 879 5312420 6
## 6 448 -2679 570 12289659 34
## Week Weekend Year WeekNo
## 1 3 August 31-September 2, 2018 2018 35
## 2 3 August 31-September 2, 2018 2018 35
## 3 3 August 24-26, 2018 2018 34
## 4 3 August 17-19, 2018 2018 33
## 5 3 August 17-19, 2018 2018 33
## 6 3 August 17-19, 2018 2018 33
mojo.new%>%
filter(Title=="The Meg")%>%
ggplot(aes(x=Week, y = `Weekend Gross`))+
geom_point(colour="Blue")+
geom_line(linetype = "dashed", colour="Red")