We’ll pull data about movie sales from the “boxoffice” data source (for which we installed the boxoffice package above). First we have to decide what time frame we want the data for.

Load the libraries
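The original shows no code for this step. A minimal sketch, assuming the packages below are already installed; the list is inferred from the functions called later in this section and is not confirmed by the source:

```r
# Packages inferred from the functions used later in this section (an assumption)
library(boxoffice)  # boxoffice()
library(dplyr)      # %>%, mutate(), group_by(), arrange(), filter(), select()
library(ggplot2)    # ggplot(), geom_line(), coord_trans(), theme_classic()
library(gridExtra)  # grid.arrange()
library(knitr)      # kable()
```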

Define the time periods for which to collect data

In this case, we use the last day of each year from 2015 to 2020.

date.seq <- paste(2015:2020,"-12-31",sep="")
date.seq
## [1] "2015-12-31" "2016-12-31" "2017-12-31" "2018-12-31" "2019-12-31"
## [6] "2020-12-31"

Fetch the data

movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
 
dim(movies) # what is the size of the data frame
## [1] 242   9
names(movies) 
## [1] "movie"          "distributor"    "gross"          "percent_change"
## [5] "theaters"       "per_theater"    "total_gross"    "days"          
## [9] "date"
kable(head(movies))
| movie | distributor | gross | percent_change | theaters | per_theater | total_gross | days | date |
|:--|:--|--:|--:|--:|--:|--:|--:|:--|
| Star Wars Ep. VII: The Fo | Walt Disney | 22932686 | -18 | 4134 | 5547 | 651967269 | 14 | 2015-12-31 |
| Daddys Home | Paramount Pi | 5881997 | -12 | 3271 | 1798 | 64684278 | 7 | 2015-12-31 |
| Joy | 20th Century | 3213634 | 23 | 2896 | 1110 | 28310094 | 7 | 2015-12-31 |
| Alvin and the Chipmunks: | 20th Century | 3208787 | -24 | 3705 | 866 | 55575427 | 14 | 2015-12-31 |
| The Hateful Eight | Weinstein Co. | 3151247 | -11 | 2266 | 1391 | 13339210 | 7 | 2015-12-31 |
| Sisters | Universal | 2280740 | -28 | 2962 | 770 | 49123380 | 14 | 2015-12-31 |

Extract the Year

movies <- movies %>% na.omit() %>% mutate(Year =  as.numeric(format(as.Date(date), "%Y"))) 
head(movies)
##                       movie   distributor    gross percent_change theaters
## 1 Star Wars Ep. VII: The Fo   Walt Disney 22932686            -18     4134
## 2               Daddys Home  Paramount Pi  5881997            -12     3271
## 3                       Joy  20th Century  3213634             23     2896
## 4 Alvin and the Chipmunks:   20th Century  3208787            -24     3705
## 5         The Hateful Eight Weinstein Co.  3151247            -11     2266
## 6                   Sisters     Universal  2280740            -28     2962
##   per_theater total_gross days       date Year
## 1        5547   651967269   14 2015-12-31 2015
## 2        1798    64684278    7 2015-12-31 2015
## 3        1110    28310094    7 2015-12-31 2015
## 4         866    55575427   14 2015-12-31 2015
## 5        1391    13339210    7 2015-12-31 2015
## 6         770    49123380   14 2015-12-31 2015
# na.omit() drops the rows that contain NA values; mutate() then creates a new column, Year, holding the year ("%Y") extracted from date
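To see the two steps in isolation, here is a small sketch in base R on made-up rows (the data frame `toy` and its titles are hypothetical, not box office data). It drops the row with a missing date, then pulls the year out of the remaining dates with `format()`:

```r
# Hypothetical toy data: three rows, one with a missing date
toy <- data.frame(
  title = c("A", "B", "C"),
  date  = as.Date(c("2015-12-31", NA, "2016-12-31"))
)

# na.omit() removes every row that has an NA in any column
clean <- na.omit(toy)

# format(..., "%Y") returns the year as text; as.numeric() converts it to a number
clean$Year <- as.numeric(format(clean$date, "%Y"))
clean
```

Storing Year as a number rather than text makes it easy to sort and group by later.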

Rank by Sales within each Year

movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>%  mutate(rank=row_number())

head(movies)
## # A tibble: 6 x 11
## # Groups:   Year [4]
##   movie distributor  gross percent_change theaters per_theater total_gross  days
##   <chr> <chr>        <dbl>          <dbl>    <dbl>       <dbl>       <dbl> <dbl>
## 1 Star~ Walt Disney 2.29e7            -18     4134        5547   651967269    14
## 2 Star~ Walt Disney 1.36e7            -32     4232        3206   517218368    17
## 3 Froz~ Walt Disney 4.04e6             -9     3265        1237   430144682    40
## 4 Rogu~ Walt Disney 1.46e7            -20     4157        3520   408235850    16
## 5 Star~ Walt Disney 1.32e7            -14     4406        3000   390706234    12
## 6 Joker Warner Bro~ 2.50e4            -29      120         208   333772511    89
## # ... with 3 more variables: date <date>, Year <dbl>, rank <int>
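The ranking works because `arrange()` sorts the whole table by `total_gross`, while `row_number()` inside a grouped `mutate()` restarts its count for each Year. A small sketch on made-up numbers (the data frame `toy` is hypothetical) shows the pattern:

```r
library(dplyr)

# Hypothetical toy data: two years with a few titles each
toy <- data.frame(
  Year        = c(2019, 2019, 2020, 2020, 2020),
  title       = c("A", "B", "C", "D", "E"),
  total_gross = c(10, 30, 5, 50, 20)
)

ranked <- toy %>%
  group_by(Year) %>%                 # rank separately within each year
  arrange(desc(total_gross)) %>%     # sort rows from highest to lowest gross
  mutate(rank = row_number()) %>%    # position within the year after sorting
  ungroup()

ranked
```

Because the sort is not done per group, the printed rows mix years; the `rank` column is still computed within each Year.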

Visualizations of box office sales

Now let’s take a look at the data. You can look at the data in tabular form (let’s do that in the RStudio interface), but it is more insightful to construct visualizations. Let’s start by plotting total gross revenue (total_gross) against rank, with one line per year.

p1 <- ggplot(data=movies, aes(x=rank,y=total_gross)) + geom_line(aes(color=as.factor(Year))) + theme_classic()

p2 <- p1 + coord_trans(y = "log10") # convert y axis to log scale

grid.arrange(p1, p2, ncol=2) # arrange both plots side by side, in two columns
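As an alternative to `coord_trans()`, ggplot2 also provides `scale_y_log10()`, which transforms the scale itself and places axis breaks on the log scale. The sketch below uses made-up values (the data frame `toy` is hypothetical), so it illustrates the option rather than the author's plot:

```r
library(ggplot2)

# Hypothetical toy data standing in for the ranked movies table
toy <- data.frame(
  rank        = rep(1:5, times = 2),
  total_gross = c(500, 200, 100, 50, 10, 300, 150, 80, 20, 5) * 1e6,
  Year        = rep(c(2019, 2020), each = 5)
)

# Same layers as p1, but with a log10 y scale instead of a transformed coordinate system
p_scale <- ggplot(data = toy, aes(x = rank, y = total_gross)) +
  geom_line(aes(color = as.factor(Year))) +
  scale_y_log10() +
  theme_classic()

p_scale
```

For a plain line plot the curves look the same under either approach; the visible difference is mainly in where the axis ticks and labels fall.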

What are the top movies of the year

How much are they grossing? To make the question (and the answers) more meaningful, let’s limit the analysis to the top 10 movies each year. To get a sense of the differences in sales, let’s take a quick look at the #1 and #10 ranked movies each year.

movies.top10 <- movies %>% filter(rank %in% c(1,10)) %>% group_by(Year) %>% arrange(rank)
movies.top10
## # A tibble: 12 x 11
## # Groups:   Year [6]
##    movie distributor  gross percent_change theaters per_theater total_gross
##    <chr> <chr>        <dbl>          <dbl>    <dbl>       <dbl>       <dbl>
##  1 Star~ Walt Disney 2.29e7            -18     4134        5547   651967269
##  2 Star~ Walt Disney 1.36e7            -32     4232        3206   517218368
##  3 Froz~ Walt Disney 4.04e6             -9     3265        1237   430144682
##  4 Rogu~ Walt Disney 1.46e7            -20     4157        3520   408235850
##  5 Dr. ~ Universal   8.23e5            -27     2555         322   266280410
##  6 The ~ Universal   4.78e5            -10     1726         277    32334280
##  7 Cree~ MGM         3.00e5            -37     1068         281   112448520
##  8 Arri~ Paramount ~ 3.86e5            -16      545         709    91676099
##  9 Zomb~ Sony Pictu~ 8.88e3            -31       90          99    72930156
## 10 A Ba~ STX Entert~ 3.25e4            -51      214         152    71891988
## 11 Dadd~ Paramount ~ 5.88e6            -12     3271        1798    64684278
## 12 Prom~ Focus Feat~ 1.14e5             -9     1310          87     1208180
## # ... with 4 more variables: days <dbl>, date <date>, Year <dbl>, rank <int>

How much are they grossing?

movies.top10 %>% select(movie, Year, rank, total_gross) %>% arrange(Year)
## # A tibble: 12 x 4
## # Groups:   Year [6]
##    movie                         Year  rank total_gross
##    <chr>                        <dbl> <int>       <dbl>
##  1 Star Wars Ep. VII: The Fo     2015     1   651967269
##  2 Daddys Home                   2015    10    64684278
##  3 Rogue One: A Star Wars Story  2016     1   408235850
##  4 Arrival                       2016    10    91676099
##  5 Star Wars Ep. VIII: The L     2017     1   517218368
##  6 A Bad Moms Christmas          2017    10    71891988
##  7 Dr. Seuss The Grinch          2018     1   266280410
##  8 Creed II                      2018    10   112448520
##  9 Frozen II                     2019     1   430144682
## 10 Zombieland: Double Tap        2019    10    72930156
## 11 The Croods: A New Age         2020     1    32334280
## 12 Promising Young Woman         2020    10     1208180
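One way to summarize the gap between the #1 and #10 movies is the ratio of their total gross in each year. The sketch below retypes the `total_gross` values from the table above into a small data frame (the names `top`, `first`, `tenth`, and `ratio` are illustrative) so that it runs without fetching any data:

```r
# total_gross values copied from the table above, ordered by Year, then rank 1 and rank 10
top <- data.frame(
  Year = rep(2015:2020, each = 2),
  rank = rep(c(1, 10), times = 6),
  total_gross = c(651967269,  64684278,
                  408235850,  91676099,
                  517218368,  71891988,
                  266280410, 112448520,
                  430144682,  72930156,
                   32334280,   1208180)
)

first <- top$total_gross[top$rank == 1]    # #1 movie in each year
tenth <- top$total_gross[top$rank == 10]   # #10 movie in each year

# How many times larger the #1 movie's total gross is than the #10 movie's
ratio <- data.frame(Year = 2015:2020, ratio = round(first / tenth, 1))
ratio
```

A ratio well above 1 in every year indicates that sales are concentrated in the top-ranked title.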