---
title: "Session 2"
output: html_notebook
---

**Please click all the tabs (in sequence) to get the entire set of information in these pages.**

**You can download all the code by clicking "Code" as shown in this picture.**

```{r fig.align="center", out.width="50%", echo=FALSE}
knitr::include_graphics("Images/session2-code.png")
```

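First we'll declare a few configuration settings. Don't worry if you don't yet understand them: the first line controls how code chunks appear in the knitted output (show the code, hide warnings and messages), and the two `options()` calls make R print large numbers without scientific notation and with about three significant digits.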

```{r setup}
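# Show the code of every chunk, but hide warnings and messages in the output
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
# A large scipen penalty makes R prefer fixed notation (100000000) over scientific (1e+08)
options(scipen = 10000000)
# Print numbers with about 3 significant digits
options(digits = 3)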
```
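
To see what the two `options()` settings change, here is a small sketch that is not part of the original setup chunk. It uses `scipen = 999`, an illustrative value chosen for this example rather than the value set above, and it restores the previous options at the end.

```r
# Save the current options and start from R's defaults for a fair comparison
old <- options(scipen = 0, digits = 7)

default_big <- format(100000000)  # "1e+08": scientific notation is shorter
default_pi <- format(pi)          # "3.141593": 7 significant digits

# Illustrative settings in the spirit of the setup chunk
options(scipen = 999, digits = 3)

fixed_big <- format(100000000)    # "100000000": fixed notation is now preferred
short_pi <- format(pi)            # "3.14": 3 significant digits

print(c(default_big, fixed_big, default_pi, short_pi))

# Restore whatever was set before this example ran
options(old)
```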

Next, load a few handy packages. They must be installed before they can be loaded. If any are missing from your system, RStudio should show a yellow banner above the editor offering to install them for you. Silently mutter "Oh thanks RStudio" and click Yes.

```{r packages}
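# install.packages("knitr")
library(knitr)      # kable() for nicely formatted tables

library(dplyr)      # data manipulation (also attached by tidyverse)
library(tidyverse)
library(ggplot2)    # plotting (also attached by tidyverse)
library(gridExtra)  # grid.arrange() to place plots side by side
library(ggrepel)    # geom_text_repel() for non-overlapping labels
library(boxoffice)  # boxoffice() to fetch movie sales data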
```

One package, boxoffice (through which we'll get movie sales data), is installed from GitHub rather than in the usual way. If `library(boxoffice)` failed above, run the next chunk once (set `eval=TRUE` in its header), then switch it back to `eval=FALSE` so the install does not rerun every time you knit.

```{r onetime, warning=FALSE, message=FALSE, eval=FALSE} 
install.packages("devtools")
devtools::install_github("jacobkap/boxoffice")
```
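
If you are unsure which packages are already on your machine, a check along these lines may help. It is only a sketch: the vector `needed` simply lists the packages loaded above, and `missing_pkgs` is a name made up for this example. It reports what is missing and installs nothing.

```r
# Packages this notebook loads
needed <- c("knitr", "tidyverse", "gridExtra", "ggrepel", "boxoffice")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- needed[!installed]
if (length(missing_pkgs) == 0) {
  message("All packages are installed.")
} else {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
}
```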


## Session 2: Let's get our hands dirty {.tabset}

Our first text (this) and code chunk (below). 

```{r example}
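2 + 2 # basic arithmetic
1:5 # generate a sequence of integers
vector.1 <- 1:5 # assign the name vector.1 to this sequence
vector.1 + 5 # add 5 to every element of the vector
# 5:6 is shorter than vector.1, so R recycles it: add 5 to the 1st element,
# 6 to the 2nd, 5 to the 3rd, and so on (R warns that the lengths do not divide evenly)
vector.2 <- vector.1 + 5:6
vector.2 # print vector.2
mean(vector.1) # average of the elements
c(sum(vector.1), sum(vector.2), mean(vector.2), min(vector.2), max(vector.2)) # combine several summaries into one vector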
```
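
The `vector.1 + 5:6` line relies on R's recycling rule, which is easy to miss. The sketch below spells out the recycling explicitly with `rep()`; the names `long`, `short`, and `recycled` are made up for this illustration.

```r
long <- 1:5
short <- 5:6

# Repeat the shorter vector until it is as long as the longer one
recycled <- rep(short, length.out = length(long))
print(recycled)         # 5 6 5 6 5

# Element-by-element addition with the explicitly recycled vector
explicit <- long + recycled
print(explicit)         # 6 8 8 10 10

# R does the same thing implicitly, with a warning because 5 is not a multiple of 2
implicit <- suppressWarnings(long + short)
identical(explicit, implicit)
```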
 
## The movies data 
 
OK, now that we've covered some basics, let's move on to something useful. We'll pull data about movie sales using the boxoffice package we installed above. First we have to decide which dates we want the data for; here we take the last day of each year from 2000 to 2009.
 
```{r boxoffice}
# Define the dates for which to collect data: December 31 of each year, 2000 to 2009
date.seq <- paste0(2000:2009, "-12-31")

# Alternative: the years 2013 to 2019
# date.seq <- paste0(2013:2019, "-12-31")

# Fetch up to the top 50 movies for each date
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)

dim(movies) # size of the data frame: number of rows, number of columns
names(movies) # column names of the data frame; equivalently, movies %>% names
kable(head(movies)) # show the first six rows as a formatted table
```
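
The date vector can be built in more than one way. As a sketch, the block below compares the `paste`-based approach from the chunk above with base R's `seq()` on dates; `dates_pasted` and `dates_seq` are names invented for this comparison.

```r
# Approach used above: paste the year onto a fixed month and day, then convert
dates_pasted <- as.Date(paste0(2000:2009, "-12-31"))

# Alternative sketch: step forward one year at a time from the first date
dates_seq <- seq(as.Date("2000-12-31"), by = "year", length.out = 10)

print(dates_pasted)

# Both give December 31 of 2000 through 2009
same_dates <- all(dates_pasted == dates_seq)
print(same_dates)
```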


A few commands featured above include 1) assignment to an object (`<-`) and 2) selection of a subset of data from a data frame (for example, `head()` returns only the first six rows).
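
Here is a minimal sketch of both ideas on a hypothetical four-row data frame (the values are invented and are not real box office figures). Besides `head()`, it shows two other common ways to take a subset in base R.

```r
# Hypothetical toy data, not the real box office figures
toy <- data.frame(
  movie = c("A", "B", "C", "D"),
  total_gross = c(40, 10, 30, 20)
)

first_two <- head(toy, 2)                # the first two rows
big_ones <- toy[toy$total_gross > 15, ]  # rows that meet a condition
titles <- toy$movie                      # a single column, as a vector

print(first_two)
print(big_ones)
print(titles)
```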

We can modify or extend the data. For instance, we'll want to isolate the Year (from the date field). Also, it will be useful to rank movies by sales (within each year), and create a new rank variable. 

```{r movies.extend}

# Drop rows with missing (NA) values, then create a new column Year
# by extracting the four-digit year (%Y) from the date
movies <- movies %>%
  na.omit() %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y")))

# Rank movies by total gross within each year (rank 1 = highest)
movies <- movies %>%
  group_by(Year) %>%
  arrange(desc(total_gross)) %>%
  mutate(rank = row_number())

```
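
To see how the year extraction and the within-year ranking behave, it can help to run the same steps on a tiny table. The data below are hypothetical (five invented movies across two years); the final `ungroup()` and `arrange(Year, rank)` are additions for readable printing and are not in the chunk above.

```r
library(dplyr)

# Hypothetical toy data, not the real box office figures
toy <- tibble(
  date = c("2000-12-31", "2000-12-31", "2001-12-31", "2001-12-31", "2001-12-31"),
  movie = c("A", "B", "C", "D", "E"),
  total_gross = c(10, 30, 5, 50, 20)
)

ranked <- toy %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y"))) %>%  # "2000-12-31" becomes 2000
  group_by(Year) %>%                 # rank separately within each year
  arrange(desc(total_gross)) %>%     # sort the whole table, largest gross first
  mutate(rank = row_number()) %>%    # number the rows of each year in that order
  ungroup() %>%
  arrange(Year, rank)

print(ranked)
```

In this toy table, B and D come out as rank 1 for 2000 and 2001 respectively, because each has the largest `total_gross` in its year.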

### Visualizations of box office sales

Now let's take a look at the data. You can look at the data in tabular form (let's do that in the RStudio interface). But it will be more insightful to construct visualizations of the data. Let's start by looking at total_gross revenues for each rank within each year. 

```{r movies.rank.plot,fig.width=14,fig.height=3}
# Total gross by rank, one line per year
p1 <- ggplot(data = movies, aes(x = rank, y = total_gross)) +
  geom_line(aes(color = as.factor(Year))) +
  theme_classic()

p2 <- p1 + coord_trans(y = "log10") # same plot with the y axis on a log10 scale

grid.arrange(p1, p2, ncol = 2) # arrange both plots side by side, in two columns
```


What are the top movies of the year, and how much are they grossing? To make the question (and the answers) more meaningful, let's limit the analysis to the top 10 movies each year. To get a sense of the differences in sales, let's take a quick look at the #1 and #10 ranked movies each year.

```{r top10}
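# Keep only the #1 and #10 ranked movies of each year
movies.top10 <- movies %>% filter(rank %in% c(1,10)) %>% group_by(Year) %>% arrange(rank)
# Show the selected columns as a table, ordered by year
kable(movies.top10 %>% select(movie, Year, rank, total_gross) %>% arrange(Year))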
```

Looking at the numbers, the #1 and #10 ranked movies have hugely different sales. Putting all of them (and everything between these ranks) on the same chart makes it very hard to see the differences among the smaller values. In such cases it is useful to apply a log transformation, which brings the numbers closer together and makes them easier to compare.
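
As a quick arithmetic sketch of why the log helps, take two hypothetical grosses that differ by a factor of ten (these figures are invented for illustration):

```r
# Hypothetical grosses for a #1 and a #10 movie, a factor of 10 apart
gross <- c(top1 = 500000000, top10 = 50000000)

raw_gap <- unname(gross["top1"] - gross["top10"])
print(raw_gap)        # 450 million on the raw scale

log_gross <- log10(gross)
print(log_gross)      # roughly 8.7 and 7.7

# A tenfold difference becomes a gap of about one unit on the log10 scale
log_gap <- unname(log_gross["top1"] - log_gross["top10"])
print(log_gap)
```

On a log10 axis, equal distances correspond to equal ratios rather than equal dollar amounts, which is why movies far down the ranking remain visible next to the blockbusters.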

The graph we'll produce has the rank on the x (horizontal) axis and gross revenues on the y (vertical) axis. We'll mark each movie with a dot (bullet) at its (x, y) position and write the name of the movie as close to the dot as possible.

```{r movies.ranks, fig.width=15,fig.height=7}
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5),
                                    margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  # label each point with the movie name, pushed away from overlaps
  geom_text_repel(aes(label = movie), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = "log10") # log10 scale on the y axis
```
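
The labelling idea is easier to see on a stripped-down plot. The sketch below uses hypothetical toy data and, as a deliberate substitution, `scale_y_log10()` in place of the `coord_trans(y = "log10")` used above: the scale transforms the data before drawing, whereas `coord_trans()` transforms the coordinate system. For a simple scatter of points the two should look very similar, though the axis breaks may differ.

```r
library(ggplot2)
library(ggrepel)

# Hypothetical toy data, not the real box office figures
toy <- data.frame(
  rank = 1:4,
  total_gross = c(500, 120, 40, 9) * 1e6,
  movie = c("A", "B", "C", "D")
)

p <- ggplot(toy, aes(x = rank, y = total_gross)) +
  geom_point() +
  geom_text_repel(aes(label = movie)) +  # labels placed near points, pushed apart if they collide
  scale_y_log10() +                      # log10 y axis via a scale instead of coord_trans()
  theme_classic()

print(p)
```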

Who's making the winning movies? This is identified by the "distributor" column. So, this time we'll write the distributor's name rather than the movie name. 

```{r movies.ranks.distributor,  fig.width=15,fig.height=7}
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5),
                                    margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  # label each point with the distributor instead of the movie name
  geom_text_repel(aes(label = distributor), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = "log10") # log10 scale on the y axis

# ggsave() # would save the last plot to a file; it needs a filename argument
```





