knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
library(knitr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice)
We’ll get everyone set up (quickly, as long as you completed the Prerequisites and Preparation Steps above), get familiar with RStudio, and learn how to extend it by installing and loading new packages. Then we’ll look at some good examples of gaining useful insights or making tough decisions with the help of data analysis, and finally see how to get large amounts of data from external websites.
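As a rough sketch of the install-and-load step, the snippet below checks which of the packages used in this session are already installed. The package list is copied from the library() calls at the top; the variable names (needed, installed, missing_pkgs) are made up for illustration, and the install line is left commented out on the assumption that you want to review the list first.

```r
# Packages this session relies on (taken from the library() calls above).
needed <- c("knitr", "tidyverse", "gridExtra", "ggrepel", "boxoffice")

# requireNamespace() returns TRUE when a package is installed,
# without attaching it to the search path.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Show what still needs installing (empty if everything is present).
print(missing_pkgs)

# Uncomment to install the missing packages from your configured repository.
# install.packages(missing_pkgs)

# Once installed, library() attaches a package for the current session, e.g.:
# library(tidyverse)
```

Installing is needed only once per machine, while library() must be called in every new R session.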
After the basic steps of learning our way around RStudio, we’ll do something fun! We’ll write some code to download movie sales data, process it, and visualize it.
For the remaining material and further details, see https://rpubs.com/hkb/DAX-Session1.
There are many sites that supply box office sales data. We’ll pick one for now (boxofficemojo.com) and do some analyses, but later you will repeat and run your own analyses with other data sources. Here’s the code and a sample of the data retrieved.
# Let's define time periods for which to collect data
date.seq <- paste(2010:2019,"-12-31",sep="")
# date.seq <- c(as.Date("2013-12-31"),as.Date("2014-12-31"),as.Date("2015-12-31"), as.Date("2016-12-31"),as.Date("2017-12-31"),as.Date("2018-12-31"),as.Date("2019-12-31"))
# Fetch the data
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
dim(movies) # what is the size of the data frame
names(movies) # or, movies %>% names # names of the columns of the data frame
kable(head(movies))
A few commands featured above include 1) assignment to an object (with <-) and 2) selection of a subset of data from a data frame (with head()).
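To make those two ideas concrete, here is a minimal sketch on a tiny data frame. The data frame toy and its titles and numbers are hypothetical, made up purely for illustration; only the column name total_gross mirrors the real data.

```r
# A small hypothetical data frame; titles and numbers are invented.
toy <- data.frame(
  movie = c("Film A", "Film B", "Film C", "Film D"),
  total_gross = c(500, 120, 300, 80)
)

# 1) Assignment: <- stores a value under a name.
n_rows <- nrow(toy)

# 2) Subsetting: three common ways to pick part of a data frame.
first_two <- head(toy, 2)                  # the first two rows
big_hits  <- toy[toy$total_gross > 200, ]  # rows that meet a condition
titles    <- toy$movie                     # a single column, as a vector

print(first_two)
print(big_hits)
```

Typing the name of any of these objects in the RStudio console prints its contents.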
We can modify or extend the data. For instance, we’ll want to isolate the Year (from the date field). Also, it will be useful to rank movies by sales (within each year), and create a new rank variable.
# Drop incomplete rows, then extract the Year from the date field
movies <- movies %>% na.omit() %>%
  mutate(Year = as.numeric(format(as.Date(date), "%Y")))
# Rank movies by total gross within each Year (1 = highest)
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>%
  mutate(rank = row_number())
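To see what ranking within each year does, it can help to run the same pattern on a few rows where the answer is easy to check by eye. The sketch below uses a hypothetical data frame (toy, with invented titles and sales). It is a variant of the code above rather than a copy: it passes .by_group = TRUE to arrange() so the output is sorted year by year, and it calls ungroup() at the end.

```r
library(dplyr)

# Hypothetical data: three movies in 2018, two in 2019.
toy <- data.frame(
  movie = c("A", "B", "C", "D", "E"),
  Year = c(2018, 2018, 2018, 2019, 2019),
  total_gross = c(100, 300, 200, 50, 400)
)

ranked <- toy %>%
  group_by(Year) %>%                                # treat each year separately
  arrange(desc(total_gross), .by_group = TRUE) %>%  # highest sales first, per year
  mutate(rank = row_number()) %>%                   # 1, 2, 3, ... restarting each year
  ungroup()

print(ranked)
```

Because mutate() runs inside each group, the rank restarts at 1 for every year instead of running across the whole table.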
Now let’s take a look at the data. You can look at the data in tabular form (let’s do that in the RStudio interface). But it will be more insightful to construct visualizations of the data. Let’s start by looking at total_gross revenues for each rank within each year.
p1 <- ggplot(movies, aes(x=rank,y=total_gross)) + geom_line(aes(color=as.factor(Year))) + theme_classic()
p2 <- p1 + coord_trans(y = "log10")
grid.arrange(p1, p2, ncol=2)
What are the top movies of the year, and how much are they grossing? To make the question (and the answers) more meaningful, let’s limit the analysis to the top 10 movies each year. To get a sense of the differences in sales, let’s take a quick look at the #1 and #10 ranked movies each year.
movies.top10 <- movies %>% filter(rank %in% c(1,10)) %>% group_by(Year) %>% arrange(rank)
kable(movies.top10 %>% select(movie, Year, rank, total_gross) %>% arrange(Year))
plot(1:10000, log(1:10000),type="l")
Looking at the numbers, the #1 and #10 movies have hugely different sales. Putting all of them (and everything between these ranks) on the same linear chart makes the differences among the lower-ranked movies very hard to see. In such cases it is useful to apply a log transformation, which compresses large values more than small ones, bringing the numbers closer together and making them easier to compare.
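A quick numerical sketch of that compression, using hypothetical sales figures (the values in gross are invented, not taken from the box office data):

```r
# Hypothetical gross revenues in dollars, from a blockbuster to a small release.
gross <- c(900e6, 400e6, 150e6, 40e6, 5e6)

# On the raw scale the largest value is 180 times the smallest.
ratio_raw <- max(gross) / min(gross)

# On the log10 scale each unit is a factor of 10, so the same range
# spans only log10(180), a little over two units.
log_gross <- log10(gross)
spread_log <- max(log_gross) - min(log_gross)

print(ratio_raw)
print(round(log_gross, 2))
print(spread_log)
```

This is why the log-scaled charts keep both the top-ranked and the lower-ranked movies readable in the same panel.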
The graph we’ll produce has the rank on the x (horizontal) axis and gross revenues on the y (vertical) axis. We’ll mark each movie with a dot (bullet) at its (x,y) position, and write the name of the movie as close to the bullet as possible.
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5),
                                    margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  geom_text_repel(aes(label = movie), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = "log10")
Who’s making the winning movies? The “distributor” column tells us, so this time we’ll label each point with the distributor’s name rather than the movie name.
ggplot(data = movies %>% filter(rank < 11),
       aes(x = rank, y = total_gross, color = factor(Year))) +
  geom_point() +
  theme_classic() +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = rel(1.5)),
        axis.title.y = element_text(size = rel(1.5),
                                    margin = margin(t = 0, r = 20, b = 0, l = 10))) +
  geom_text_repel(aes(label = distributor), nudge_y = 1, force = 6,
                  box.padding = unit(0.75, "lines"), segment.color = "gray") +
  coord_trans(y = "log10")
# ggsave()