** Please click all the tabs (in sequence) to get the entire set of information in these pages. **
** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **
knitr::opts_chunk$set(echo = TRUE, warning=FALSE, message=FALSE)
options(scipen=10000000)
options(digits=3)
# install.packages("knitr")
library(knitr)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
library(ggrepel)
library(boxoffice) # because the package is already installed
Life as a Data Analyst - Sonic Prabhudesai
Storytelling with Data - from FindHotel.com, https://blog.findhotel.net/2020/06/data-story-of-the-travel-market-recovery/
The Movie Business: Creating new Stories with the Tools you’ve Learnt
Lets load the data by making a call to boxofficemojo.com through the boxoffice() library. If, for some reason, you have not yet installed the package look through Session 2 notes and do it.
date.seq <- paste(2000:2019,"-12-31",sep="")
# Fetch the data
movies <- boxoffice(date = as.Date(date.seq), top_n = 50)
We’ll extend the data frame by adding - for each movie in the database - Year, and Rank within Year based on gross revenues.
movies <- movies %>% na.omit() %>% mutate(Year = as.numeric(format(as.Date(date), "%Y"))) # na.omit() omits the rows with NA values; create new column Year. which extracts the Y (year) from the date
# Extract the Year, then Rank by Sales
movies <- movies %>% group_by(Year) %>% arrange(desc(total_gross)) %>% mutate(rank=row_number())
But what if you want to know how many times a distributor is in the top 50 (or top.n for that year) one year at a time.
leading.distributors <- movies %>% select(distributor, Year) %>% group_by(distributor,Year) %>% mutate(count = n())
ggplot(leading.distributors %>% filter(Year > 2009), aes(x=Year, y=distributor, size=count)) + geom_point() + theme_classic()
count.movies <- data.frame(table(movies$Year))
names(count.movies) = c("Year", "Count")
ggplot(count.movies, aes(x=Year, y=Count)) + geom_point() + expand_limits(y=0) + theme_classic()
So, there’s more data in recent years, and fewer movies per year in past years.
sort(table((movies %>% filter(rank < 11, Year > 2009))$distributor), decreasing=TRUE)
Warner Bros. Walt Disney 20th Century
23 21 14
Sony Pictures Paramount Pi Universal
13 10 7
Lionsgate STX Entertai MGM
6 2 1
Relativity Summit Enter Weinstein Co.
1 1 1
hits.per.year <- as.data.frame(table(movies$distributor, movies$Year))
names(hits.per.year) = c("distributor", "Year", "count")
hits.per.year <- hits.per.year %>% group_by(distributor) %>% mutate(total = sum(count)) %>% arrange(desc(total))
movies.top10 <- movies %>% filter(rank < 11) # %>% group_by(Year) %>% arrange(rank)
kable(movies.top10 %>% filter(rank %in% c(1,10)) %>% select(movie, Year, rank, total_gross) %>% arrange(rank))
movie | Year | rank | total_gross |
---|---|---|---|
Star Wars Ep. VII: The Fo | 2015 | 1 | 651967269 |
The Dark Knight | 2008 | 1 | 530924926 |
Star Wars Ep. VIII: The L | 2017 | 1 | 517218368 |
Frozen II | 2019 | 1 | 430144682 |
Rogue One: A Star Wars Story | 2016 | 1 | 408235850 |
The Hunger Games: Catchin | 2013 | 1 | 395526705 |
The Hunger Games: Mocking | 2014 | 1 | 313282914 |
Inception | 2010 | 1 | 292558188 |
Skyfall | 2012 | 1 | 290904271 |
Harry Potter and the Sorc | 2001 | 1 | 288493000 |
The Twilight Saga: New Moon | 2009 | 1 | 284512392 |
The Twilight Saga: Breaki | 2011 | 1 | 274841954 |
Harry Potter and the Gobl | 2005 | 1 | 273281180 |
Dr. Seuss The Grinch | 2018 | 1 | 266280410 |
How the Grinch Stole Chri | 2000 | 1 | 251629105 |
The Lord of the Rings: Th | 2003 | 1 | 249400000 |
Harry Potter and the Cham | 2002 | 1 | 243855000 |
I am Legend | 2007 | 1 | 199345154 |
The Polar Express | 2006 | 1 | 176454984 |
The Polar Express | 2004 | 1 | 151623383 |
Lincoln | 2012 | 10 | 134189097 |
Creed II | 2018 | 10 | 112448520 |
Paranormal Activity | 2009 | 10 | 107792845 |
Jackass Presents: Bad Gra | 2013 | 10 | 101337759 |
The Social Network | 2010 | 10 | 92876282 |
Arrival | 2016 | 10 | 91676099 |
Tower Heist | 2011 | 10 | 76650420 |
Zombieland: Double Tap | 2019 | 10 | 72930156 |
A Bad Moms Christmas | 2017 | 10 | 71891988 |
Night at the Museum: Secr | 2014 | 10 | 69354582 |
Daddys Home | 2015 | 10 | 64684278 |
Yes Man | 2008 | 10 | 60029690 |
Rocky Balboa | 2006 | 10 | 47940632 |
Mr. Magorium’s Wonder Emp | 2007 | 10 | 31049456 |
The Hot Chick | 2002 | 10 | 24021000 |
Peter Pan | 2003 | 10 | 22000000 |
The Ringer | 2005 | 10 | 17265628 |
A Beautiful Mind | 2001 | 10 | 15949000 |
All the Pretty Horses | 2000 | 10 | 7640564 |
p1 <- ggplot(data=movies, aes(x=rank,y=total_gross)) + geom_line(aes(color=as.factor(Year))) + theme_classic()
p2 <- p1 + coord_trans(y = "log10") # convert y axis to log scale
grid.arrange(p1, p2, ncol=2) # arrange both plots side by side, in two columns
Let’s flip what we put on the horizontal axis vs color. So, let’s make Year the horizontal axis and display the n-th highest grossing movie each year, using color for Rank
p1 <- ggplot(data= movies.top10 %>% filter(rank < 41), aes(x=Year,y=total_gross)) + geom_line(aes(color=as.factor(rank))) + theme_classic()
p2 <- p1 + coord_trans(y = "log10") # convert y axis to log scale
grid.arrange(p1, p2, ncol=2) # arrange both plots side by side, in two columns
But what if you want to know how many times a distributor is in the top 50 (or top.n for that year) one year at a time.
leading.movies <- movies.top10 %>% select(movie, Year, total_gross)
ggplot(movies.top10, aes(x=Year, y=rank, size=total_gross)) + geom_point() + theme_classic()
movies.top10 <- movies.top10 %>% mutate(Year.new = Year-1999, rank.new=rev(rank))
lm <- lm(total_gross ~ rank.new * Year.new, data = movies.top10)
summary(lm)
Call:
lm(formula = total_gross ~ rank.new * Year.new, data = movies.top10)
Residuals:
Min 1Q Median 3Q Max
-98080938 -34321252 -2413607 25734341 338186276
Coefficients:
Estimate Std. Error t value
(Intercept) -44050046 17349403 -2.54
rank.new 20728716 2830252 7.32
Year.new 4205876 1444418 2.91
rank.new:Year.new 520312 234292 2.22
Pr(>|t|)
(Intercept) 0.012 *
rank.new 0.0000000000063 ***
Year.new 0.004 **
rank.new:Year.new 0.028 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 54400000 on 194 degrees of freedom
Multiple R-squared: 0.72, Adjusted R-squared: 0.715
F-statistic: 166 on 3 and 194 DF, p-value: <0.0000000000000002