
First we’ll declare some useful configuration settings. Don’t worry if you don’t understand why.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
Next, load a few handy packages. They must be installed first. If they’re not already on your system, you should see a note/yellow banner above from RStudio asking whether it should install these packages for you. Silently mutter “Oh thanks RStudio” and click Yes.
# install.packages("knitr")
library(knitr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice)
Now there is one special package that is not installed in the usual way: “boxoffice”, through which we’ll get movie sales data. To install it, run this next chunk once (by ensuring it says “eval=TRUE”), then switch it to “eval=FALSE”.
install.packages("devtools")
devtools::install_github("jacobkap/boxoffice")
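If flipping eval back and forth feels fiddly, one possible alternative is to install only when the package is missing. This is a minimal sketch, not part of the original setup: the helper names is_missing() and install_boxoffice_if_missing() are made up for illustration, while requireNamespace(), install.packages() and devtools::install_github() are the standard calls.

```r
# Hypothetical helper: TRUE when a package is not installed yet
is_missing <- function(pkg) !requireNamespace(pkg, quietly = TRUE)

# Hypothetical helper: install devtools and boxoffice only if they are absent.
# Defining the function does not install anything; it runs only when called.
install_boxoffice_if_missing <- function() {
  if (is_missing("devtools")) install.packages("devtools")
  if (is_missing("boxoffice")) devtools::install_github("jacobkap/boxoffice")
  invisible(!is_missing("boxoffice"))
}

# "stats" ships with R, so it is never missing
is_missing("stats")
```

Call install_boxoffice_if_missing() once from the console while you are online. After that, the chunk can stay in the document without reinstalling the package every time you knit.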
Session 2: Let’s get our hands dirty
Our first text (this) and code chunk (below).
2+2 # add two numbers
[1] 4
1:5 # generate a sequence of integers
[1] 1 2 3 4 5
vector.1 <- 1:5 # assign the name vector.1 to this sequence
vector.1 + 5 # add 5 to every element of the vector
[1] 6 7 8 9 10
vector.2 <- vector.1 + 5:6 # add 5 to the 1st element, 6 to the 2nd, then reuse 5, 6, 5 for the remaining elements
vector.2 # print vector.2
[1] 6 8 8 10 10
mean(vector.1)
[1] 3
c(sum(vector.1), sum(vector.2), mean(vector.2), min(vector.2), max(vector.2))
[1] 15.0 42.0 8.4 6.0 10.0
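The vector.2 result above comes from R “recycling” the shorter vector until it is as long as the longer one. Here is a small sketch that spells the recycling out by hand. The names short, long and recycled are made up for this illustration.

```r
long  <- 1:5       # same values as vector.1
short <- c(5, 6)   # same values as 5:6

# Repeat the short vector until it has as many elements as the long one
recycled <- rep(short, length.out = length(long))
recycled           # 5 6 5 6 5

# Element-by-element addition now matches vector.2
long + recycled    # 6 8 8 10 10
```

When the longer length is not a multiple of the shorter one (as here, 5 versus 2), R still recycles but issues a warning. We don’t see it in this document because the setup chunk sets warning=FALSE.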
The movies data
Ok, now that we’ve done some basic things let’s move on to something useful. We’ll pull data about movie sales from the “boxoffice” data source (for which we installed the boxoffice package above). First we have to decide what time frame we want the data for.
# Let's define time periods for which to collect data
date.seq <- paste(2000:2009,"-12-31",sep="")
# date.seq <- c(as.Date("2013-12-31"),as.Date("2014-12-31"),as.Date("2015-12-31"), as.Date("2016-12-31"),as.Date("2017-12-31"),as.Date("2018-12-31"),as.Date("2019-12-31"))
# Fetch the data
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
dim(movies) # what is the size of the data frame
[1] 189 9
names(movies) # or, movies %>% names # names of the columns of the data frame
[1] "movie" "distributor" "gross" "percent_change"
[5] "theaters" "per_theater" "total_gross" "days"
[9] "date"
kable(head(movies))
| movie | distributor | gross | percent_change | theaters | per_theater | total_gross | days | date |
|:--|:--|--:|--:|--:|--:|--:|--:|:--|
| Cast Away | 20th Century | 7938594 | -32 | 2927 | 2712 | 100628594 | 10 | 2000-12-31 |
| What Women Want | Paramount Pi | 4955561 | -40 | 3046 | 1627 | 110187561 | 17 | 2000-12-31 |
| The Family Man | Universal | 3010330 | -40 | 2395 | 1257 | 39170330 | 10 | 2000-12-31 |
| The Emperor’s New Groove | Walt Disney | 2814336 | -29 | 2887 | 975 | 47465336 | 17 | 2000-12-31 |
| Miss Congeniality | Warner Bros. | 2142573 | -64 | 2668 | 803 | 40784573 | 10 | 2000-12-31 |
| How the Grinch Stole Chri | Universal | 1524105 | -42 | 3170 | 481 | 251629105 | 45 | 2000-12-31 |
A few commands featured above include 1) assignment to an object and 2) selection of a subset of data from a data frame.
We can modify or extend the data. For instance, we’ll want to isolate the Year (from the date field). Also, it will be useful to rank movies by sales (within each year), and create a new rank variable.
# na.omit() drops the rows with NA values; mutate() creates a new column Year, extracted from the date
movies <- movies %>% na.omit() %>% mutate(Year = as.numeric(format(as.Date(date), "%Y")))
# Rank movies by sales within each Year
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>% mutate(rank=row_number())
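To see what the year extraction and within-year ranking do, it can help to try the same steps on a tiny made-up table. This is only a sketch: the toy data frame and its values are invented for illustration and are not box office data.

```r
library(dplyr)

# Invented toy data: two movies per year
toy <- data.frame(
  movie       = c("A", "B", "C", "D"),
  date        = c("2000-12-31", "2000-12-31", "2001-12-31", "2001-12-31"),
  total_gross = c(10, 30, 50, 20)
)

toy_ranked <- toy %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y"))) %>% # pull the year out of the date
  group_by(Year) %>%                 # later steps work year by year
  arrange(desc(total_gross)) %>%     # sort all rows, biggest sales first
  mutate(rank = row_number()) %>%    # number the rows within each year
  ungroup()

toy_ranked
```

Because the rows are sorted by sales before they are numbered, row_number() gives rank 1 to the best seller of each year: B in 2000 and C in 2001.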
Visualizations of box office sales
Now let’s take a look at the data. You can look at the data in tabular form (let’s do that in the RStudio interface). But it will be more insightful to construct visualizations of the data. Let’s start by looking at total_gross revenues for each rank within each year.
p1 <- ggplot(data=movies, aes(x=rank,y=total_gross)) + geom_line(aes(color=as.factor(Year))) + theme_classic()
p2 <- p1 + coord_trans(y = "log10") # convert y axis to log scale
grid.arrange(p1, p2, ncol=2) # arrange both plots side by side, in two columns

What are the top movies of the year, and how much are they grossing? To make the question (and the answers) more meaningful, let’s limit the analysis to the top 10 movies each year. To get a sense of the differences in sales, let’s take a quick look at the #1 and #10 ranked movies each year.
movies.top10 <- movies %>% filter(rank %in% c(1,10)) %>% group_by(Year) %>% arrange(rank)
kable(movies.top10 %>% select(movie, Year, rank, total_gross) %>% arrange(Year))
| movie | Year | rank | total_gross |
|:--|--:|--:|--:|
| How the Grinch Stole Chri | 2000 | 1 | 251629105 |
| All the Pretty Horses | 2000 | 10 | 7640564 |
| Harry Potter and the Sorc | 2001 | 1 | 288493000 |
| A Beautiful Mind | 2001 | 10 | 15949000 |
| Harry Potter and the Cham | 2002 | 1 | 243855000 |
| The Hot Chick | 2002 | 10 | 24021000 |
| The Lord of the Rings: Th | 2003 | 1 | 249400000 |
| Peter Pan | 2003 | 10 | 22000000 |
| The Polar Express | 2004 | 1 | 151623383 |
| Harry Potter and the Gobl | 2005 | 1 | 273281180 |
| The Ringer | 2005 | 10 | 17265628 |
| The Polar Express | 2006 | 1 | 176454984 |
| Rocky Balboa | 2006 | 10 | 47940632 |
| I am Legend | 2007 | 1 | 199345154 |
| Mr. Magorium’s Wonder Emp | 2007 | 10 | 31049456 |
| The Dark Knight | 2008 | 1 | 530924926 |
| Yes Man | 2008 | 10 | 60029690 |
| The Twilight Saga: New Moon | 2009 | 1 | 284512392 |
| Paranormal Activity | 2009 | 10 | 107792845 |
Looking at the numbers, the #1 and #10 ranked movies have hugely different sales. Putting all of them (and everything between these ranks) into the same chart makes the differences very hard to see. In such cases it is useful to apply a log transformation, which brings the numbers closer together and makes them easier to compare.
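Here is a quick sketch of what a log transformation does to numbers that differ by orders of magnitude. The sales figures are round, made-up values chosen so the arithmetic is easy to follow.

```r
# Made-up sales: each value is 10 times the previous one
sales <- c(1e6, 1e7, 1e8)

# On the raw scale the gaps grow enormously
raw_gaps <- diff(sales)
raw_gaps   # 9 million, then 90 million

# On the log10 scale every tenfold jump becomes a step of 1
log_sales <- log10(sales)
log_sales  # 6 7 8

log_gaps <- diff(log_sales)
log_gaps   # 1 1
```

On a log axis, equal distances mean equal ratios rather than equal differences, so a movie that sells ten times more sits one step higher no matter how big the numbers are.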
The graph we’ll produce has the rank on the x (horizontal) axis and gross revenues on the y (vertical) axis. We’ll identify each movie by placing a dot (bullet) at its (x,y) value, and write the name of the movie as close to the bullet as possible.
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5), margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  geom_text_repel(aes(label = movie), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = 'log10')

Who’s making the winning movies? This is identified by the “distributor” column. So, this time we’ll write the distributor’s name rather than the movie name.
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5), margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  geom_text_repel(aes(label = distributor), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = 'log10')

# ggsave()
