We’ll pull data about movie sales from the “boxoffice” data source (for which we installed the boxoffice package above). First we have to decide what time frame we want the data for.
In this case, we use the last day of each year from 2015 to 2020.
date.seq <- paste(2015:2020,"-12-31",sep="")
date.seq
## [1] "2015-12-31" "2016-12-31" "2017-12-31" "2018-12-31" "2019-12-31"
## [6] "2020-12-31"
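The same year-end dates can also be built as Date objects directly, which avoids the later `as.Date()` conversion. This is only a sketch of an alternative; `date.seq2` is a name introduced here for illustration and is not used elsewhere in the analysis.

```r
# Alternative sketch: generate the year-end dates with seq() on Date objects.
# date.seq2 is a hypothetical name, not part of the original analysis.
date.seq2 <- seq(as.Date("2015-12-31"), as.Date("2020-12-31"), by = "year")

# format() turns the dates back into the same strings as date.seq
format(date.seq2)
```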
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
dim(movies) # what is the size of the data frame
## [1] 242 9
names(movies)
## [1] "movie" "distributor" "gross" "percent_change"
## [5] "theaters" "per_theater" "total_gross" "days"
## [9] "date"
kable(head(movies))
| movie | distributor | gross | percent_change | theaters | per_theater | total_gross | days | date |
|---|---|---|---|---|---|---|---|---|
| Star Wars Ep. VII: The Fo | Walt Disney | 22932686 | -18 | 4134 | 5547 | 651967269 | 14 | 2015-12-31 |
| Daddys Home | Paramount Pi | 5881997 | -12 | 3271 | 1798 | 64684278 | 7 | 2015-12-31 |
| Joy | 20th Century | 3213634 | 23 | 2896 | 1110 | 28310094 | 7 | 2015-12-31 |
| Alvin and the Chipmunks: | 20th Century | 3208787 | -24 | 3705 | 866 | 55575427 | 14 | 2015-12-31 |
| The Hateful Eight | Weinstein Co. | 3151247 | -11 | 2266 | 1391 | 13339210 | 7 | 2015-12-31 |
| Sisters | Universal | 2280740 | -28 | 2962 | 770 | 49123380 | 14 | 2015-12-31 |
movies <- movies %>% na.omit() %>% mutate(Year = as.numeric(format(as.Date(date), "%Y")))
head(movies)
## movie distributor gross percent_change theaters
## 1 Star Wars Ep. VII: The Fo Walt Disney 22932686 -18 4134
## 2 Daddys Home Paramount Pi 5881997 -12 3271
## 3 Joy 20th Century 3213634 23 2896
## 4 Alvin and the Chipmunks: 20th Century 3208787 -24 3705
## 5 The Hateful Eight Weinstein Co. 3151247 -11 2266
## 6 Sisters Universal 2280740 -28 2962
## per_theater total_gross days date Year
## 1 5547 651967269 14 2015-12-31 2015
## 2 1798 64684278 7 2015-12-31 2015
## 3 1110 28310094 7 2015-12-31 2015
## 4 866 55575427 14 2015-12-31 2015
## 5 1391 13339210 7 2015-12-31 2015
## 6 770 49123380 14 2015-12-31 2015
# na.omit() drops the rows that contain NA values; mutate() then adds a new column Year, extracted from date with the "%Y" (four-digit year) format
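To see the two steps in isolation, here is a minimal base R sketch on a made-up data frame. The `toy` data and its values are invented for illustration and have nothing to do with the boxoffice results.

```r
# Toy data frame (made-up values) with one missing gross figure
toy <- data.frame(
  movie = c("A", "B", "C"),
  gross = c(100, NA, 300),
  date  = as.Date(c("2015-12-31", "2016-12-31", "2017-12-31"))
)

toy <- na.omit(toy)                             # drops the row for movie "B"
toy$Year <- as.numeric(format(toy$date, "%Y"))  # "2015" (character) becomes 2015 (numeric)
toy
```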
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>% mutate(rank=row_number())
head(movies)
## # A tibble: 6 x 11
## # Groups: Year [4]
## movie distributor gross percent_change theaters per_theater total_gross days
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Star~ Walt Disney 2.29e7 -18 4134 5547 651967269 14
## 2 Star~ Walt Disney 1.36e7 -32 4232 3206 517218368 17
## 3 Froz~ Walt Disney 4.04e6 -9 3265 1237 430144682 40
## 4 Rogu~ Walt Disney 1.46e7 -20 4157 3520 408235850 16
## 5 Star~ Walt Disney 1.32e7 -14 4406 3000 390706234 12
## 6 Joker Warner Bro~ 2.50e4 -29 120 208 333772511 89
## # ... with 3 more variables: date <date>, Year <dbl>, rank <int>
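The first six rows above come from four different years, which can look surprising after `group_by(Year)`. By default `arrange()` ignores the grouping and sorts the whole table, while `row_number()` inside `mutate()` counts rows within each group, so `rank` is still a within-year rank. A small sketch on made-up data (the `toy` values and the `toy.ranked` name are assumptions for illustration):

```r
library(dplyr)

# Made-up data: two years, five movies
toy <- data.frame(
  Year        = c(2015, 2016, 2015, 2016, 2015),
  movie       = c("A", "B", "C", "D", "E"),
  total_gross = c(50, 400, 300, 100, 200)
)

toy.ranked <- toy %>%
  group_by(Year) %>%
  arrange(desc(total_gross)) %>%   # sorts all rows together; groups are ignored here
  mutate(rank = row_number()) %>%  # numbers the rows separately within each Year
  ungroup()

toy.ranked
```

The rows stay interleaved across years, but each year gets its own ranks starting at 1.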
Now let’s take a look at the data. You can look at the data in tabular form (let’s do that in the RStudio interface). But it will be more insightful to construct visualizations of the data. Let’s start by looking at total_gross revenues for each rank within each year.
p1 <- ggplot(data=movies, aes(x=rank,y=total_gross)) + geom_line(aes(color=as.factor(Year))) + theme_classic()
p2 <- p1 + coord_trans(y = "log10") # convert y axis to log scale
grid.arrange(p1, p2, ncol=2) # arrange both plots side by side, in two columns
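As a rough illustration of why the log scale helps, the sketch below takes the six `total_gross` values from the `kable(head(movies))` table above and compares their spread on the raw and the log10 scale. The vector name `g` is introduced here for illustration only.

```r
# total_gross values copied from the first six rows shown earlier
g <- c(651967269, 64684278, 28310094, 55575427, 13339210, 49123380)

max(g) / min(g)   # on the raw scale the largest value dwarfs the smallest
range(log10(g))   # on the log10 scale all six values fall between 7 and 9
```

On the raw axis the top movie stretches the plot so far that the lower ranks are hard to tell apart; the log axis spreads them out.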
How much are these movies grossing? To make the question (and its answers) more meaningful, let’s limit the analysis to the top 10 movies each year. To get a sense of the differences in sales, let’s take a quick look at the #1 and #10 ranked movies each year.
movies.top10 <- movies %>% filter(rank %in% c(1,10)) %>% group_by(Year) %>% arrange(rank) # keep only the #1 and #10 ranked movies in each year
movies.top10
## # A tibble: 12 x 11
## # Groups: Year [6]
## movie distributor gross percent_change theaters per_theater total_gross
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Star~ Walt Disney 2.29e7 -18 4134 5547 651967269
## 2 Star~ Walt Disney 1.36e7 -32 4232 3206 517218368
## 3 Froz~ Walt Disney 4.04e6 -9 3265 1237 430144682
## 4 Rogu~ Walt Disney 1.46e7 -20 4157 3520 408235850
## 5 Dr. ~ Universal 8.23e5 -27 2555 322 266280410
## 6 The ~ Universal 4.78e5 -10 1726 277 32334280
## 7 Cree~ MGM 3.00e5 -37 1068 281 112448520
## 8 Arri~ Paramount ~ 3.86e5 -16 545 709 91676099
## 9 Zomb~ Sony Pictu~ 8.88e3 -31 90 99 72930156
## 10 A Ba~ STX Entert~ 3.25e4 -51 214 152 71891988
## 11 Dadd~ Paramount ~ 5.88e6 -12 3271 1798 64684278
## 12 Prom~ Focus Feat~ 1.14e5 -9 1310 87 1208180
## # ... with 4 more variables: days <dbl>, date <date>, Year <dbl>, rank <int>
movies.top10 %>% select(movie, Year, rank, total_gross) %>% arrange(Year)
## # A tibble: 12 x 4
## # Groups: Year [6]
## movie Year rank total_gross
## <chr> <dbl> <int> <dbl>
## 1 Star Wars Ep. VII: The Fo 2015 1 651967269
## 2 Daddys Home 2015 10 64684278
## 3 Rogue One: A Star Wars Story 2016 1 408235850
## 4 Arrival 2016 10 91676099
## 5 Star Wars Ep. VIII: The L 2017 1 517218368
## 6 A Bad Moms Christmas 2017 10 71891988
## 7 Dr. Seuss The Grinch 2018 1 266280410
## 8 Creed II 2018 10 112448520
## 9 Frozen II 2019 1 430144682
## 10 Zombieland: Double Tap 2019 10 72930156
## 11 The Croods: A New Age 2020 1 32334280
## 12 Promising Young Woman 2020 10 1208180
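One way to put a number on the difference is the ratio of the #1 to the #10 `total_gross` in each year. The sketch below copies the values from the table above into a small data frame; the names `top`, `rank1`, `rank10` and `ratio` are introduced here for illustration.

```r
# total_gross values copied from the table above (rank 1 and rank 10 per year)
top <- data.frame(
  Year   = 2015:2020,
  rank1  = c(651967269, 408235850, 517218368, 266280410, 430144682, 32334280),
  rank10 = c(64684278, 91676099, 71891988, 112448520, 72930156, 1208180)
)

# How many times larger is the #1 movie's total_gross than the #10 movie's?
top$ratio <- round(top$rank1 / top$rank10, 1)
top
```

Keep in mind that these figures are the running totals reported on December 31 of each year, so the ratio describes that snapshot and not the final lifetime gross.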