Week 6 Assignment

By Brian Weinfeld

March 24, 2018

Set Up

I began by creating a function that handles each call to the NY Times movie reviews API. The function adds my API key to the query parameters it is given and sends the request to the reviews search URL.

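library(tidyverse)  # dplyr, purrr, stringr, readr, ggplot2, tibble
library(httr)       # GET, content
library(jsonlite)   # fromJSON
library(RCurl)      # getURL
library(XML)        # htmlParse, xpathSApply, xmlValue
library(knitr)      # kable
library(ggrepel)    # geom_label_repel
library(scales)     # comma

url <- 'https://api.nytimes.com/svc/movies/v2/reviews/search.json'
api.key <- read_table('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\assignments\\week6assignmentKey.txt', col_names=FALSE) %>% 
  unlist() %>%
  as.vector()

# Send one search request and return the matching reviews as a tibble
API.Query <- function(params){
  Sys.sleep(1)  # wait one second between calls
  GET(url, query=c('api-key'=api.key, params)) %>%
    content(as='text') %>%
    fromJSON(flatten=TRUE) %>%
    .[[5]] %>%  # the fifth element of the parsed response holds the reviews
    as_tibble()
}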
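
To make the parameter handling concrete, here is a minimal sketch of how `c()` merges the key with the caller's parameters before they are handed to `GET`. The helper name `build.query` and the placeholder key are assumptions made for illustration; they are not part of the assignment's code.

```r
# Placeholder key for illustration only (assumption); the real key is read from a file.
demo.key <- 'DEMO-KEY'

# Hypothetical helper mirroring the c('api-key'=..., params) step inside API.Query
build.query <- function(params, key = demo.key) {
  c('api-key' = key, params)
}

query <- build.query(list(query = 'Wonder Woman'))
str(query)  # a named list with two elements: `api-key` and `query`
```

Because `params` is a list, `c()` returns a list, so every parameter keeps its name when the request is built.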

I then scraped Box Office Mojo, a website that lists movie box office receipts, for the 100 top-grossing domestic movies of 2017 along with several other useful fields.

movie.data.html <- getURL('http://www.boxofficemojo.com/yearly/chart/?yr=2017&p=.htm') %>% 
  htmlParse()

movie.data.headers <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  .[c(7:9, 11:14)] %>%
  str_split(' / ') %>% 
  unlist() %>%
  str_extract('\\w+') %>%
  unique()

movie.data.frame <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  .[15:914] %>%
  matrix(ncol=9, byrow=TRUE) %>%
  as.data.frame() %>%
  select(-7) %>%
  setNames(movie.data.headers) %>%
  mutate(Movie=str_replace(Movie, '(.*?)( \\(2017\\))$', '\\1')) %>%
  select(1:4)
movie.data.frame %>% head(20) %>% kable()

Rank	Movie	Studio	Total
1	Star Wars: The Last Jedi	BV	$619,896,809
2	Beauty and the Beast	BV	$504,014,165
3	Wonder Woman	WB	$412,563,408
4	Jumanji: Welcome to the Jungle	Sony	$400,818,450
5	Guardians of the Galaxy Vol. 2	BV	$389,813,101
6	Spider-Man: Homecoming	Sony	$334,201,140
7	It	WB (NL)	$327,481,748
8	Thor: Ragnarok	BV	$315,058,289
9	Despicable Me 3	Uni.	$264,624,300
10	Justice League	WB	$229,024,295
11	Logan	Fox	$226,277,068
12	The Fate of the Furious	Uni.	$226,008,385
13	Coco	BV	$209,271,528
14	Dunkirk	WB	$188,045,546
15	Get Out	Uni.	$176,040,665
16	The LEGO Batman Movie	WB	$175,750,384
17	The Boss Baby	Fox	$175,003,033
18	Pirates of the Caribbean: Dead Men Tell No Tales	BV	$172,558,876
19	The Greatest Showman	Fox	$170,448,688
20	Kong: Skull Island	WB	$168,052,812
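
The reshaping step may be easier to follow on a small scale. The sketch below uses base R only and a shortened, three-column stand-in for the scraped cells (an assumption; the real page yields nine columns per row), then strips a trailing ` (2017)` from a title in the same way as the `str_replace` call.

```r
# Stand-in for the flat vector of <td> values (assumption: 3 columns instead of 9,
# and the ' (2017)' suffix is added here purely to show the clean-up step).
cells <- c('1', 'Star Wars: The Last Jedi', 'BV',
           '2', 'Beauty and the Beast (2017)', 'BV',
           '3', 'Wonder Woman', 'WB')

# Fill the matrix row by row so every three consecutive cells become one movie
movies <- as.data.frame(matrix(cells, ncol = 3, byrow = TRUE),
                        stringsAsFactors = FALSE)
names(movies) <- c('Rank', 'Movie', 'Studio')

# Drop a trailing ' (2017)' so titles match the ones returned by the API
movies$Movie <- sub(' \\(2017\\)$', '', movies$Movie)
print(movies)
```

Setting `byrow = TRUE` matters here: with the default column-wise fill, the cells of one movie would be spread across different rows.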

I made 100 calls to the NY Times API, sending in one movie title at a time. Every movie that had a review was stored in a data frame.

review.data.frame <- movie.data.frame$Movie %>%
  map_df(~API.Query(list('query'=as.character(.)))) %>% 
  filter(publication_date %>% startsWith('2017')) %>%
  filter(display_title %in% movie.data.frame$Movie) %>%
  select(1:3) %>%
  unique()
review.data.frame %>% head() %>% kable()

display_title	mpaa_rating	critics_pick
Star Wars: The Last Jedi	PG-13	1
Beauty and the Beast	PG	1
Wonder Woman	PG-13	1
Jumanji: Welcome to the Jungle	PG-13	0
Guardians of the Galaxy Vol. 2	PG-13	0
Spider-Man: Homecoming	PG-13	0
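
A title search can return reviews from other years or for similarly named movies, which is why the two filters are needed. The following is a rough base R sketch of the same filtering logic; the titles and dates are made up for illustration and are not API output.

```r
# Made-up search results (assumption), standing in for rows returned by the API
reviews <- data.frame(
  display_title    = c('Alpha', 'Alpha', 'Beta', 'Alpha Returns'),
  publication_date = c('2017-09-06', '1990-11-16', '2017-03-02', '2017-08-17'),
  stringsAsFactors = FALSE
)
wanted <- c('Alpha', 'Beta')  # hypothetical list of box office titles

# Keep reviews published in 2017 whose title exactly matches a wanted title
keep <- startsWith(reviews$publication_date, '2017') &
  reviews$display_title %in% wanted
filtered <- reviews[keep, ]
print(filtered)
```

Only the 2017 review of `Alpha` and the review of `Beta` survive; the 1990 review and the near-match `Alpha Returns` are dropped.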

I joined the two data frames, keeping only the relevant columns. Rank and Total are converted to character before being parsed as numbers, so the original values are kept even if the scraped columns were stored as factors.

combined.frame <- movie.data.frame %>%
  inner_join(review.data.frame, by=c('Movie'='display_title')) %>%
  mutate(Rank = Rank %>% as.character() %>% as.numeric(),
         Total = Total %>% as.character() %>% parse_number(),
         critics_pick = critics_pick %>% as.factor()
         ) %>%
  distinct()
combined.frame %>% head(10) %>% kable()

Rank	Movie	Studio	Total	mpaa_rating	critics_pick
1	Star Wars: The Last Jedi	BV	619896809	PG-13	1
2	Beauty and the Beast	BV	504014165	PG	1
3	Wonder Woman	WB	412563408	PG-13	1
4	Jumanji: Welcome to the Jungle	Sony	400818450	PG-13	0
5	Guardians of the Galaxy Vol. 2	BV	389813101	PG-13	0
6	Spider-Man: Homecoming	Sony	334201140	PG-13	0
8	Thor: Ragnarok	BV	315058289	PG-13	0
9	Despicable Me 3	Uni.	264624300	PG	0
10	Justice League	WB	229024295	PG-13	0
11	Logan	Fox	226277068	R	0
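
One thing to watch when converting a rank column to numbers: if the column is stored as a factor (which can happen when text passes through `as.data.frame()`, depending on the R version and the `stringsAsFactors` setting), `as.numeric()` returns the internal level codes rather than the printed values. A small base R illustration with made-up ranks:

```r
ranks <- factor(c('1', '2', '3', '10', '11'))

# Levels are sorted as text: "1" "10" "11" "2" "3"
as.numeric(ranks)                 # 1 4 5 2 3  (level codes, not the ranks)
as.numeric(as.character(ranks))   # 1 2 3 10 11  (the ranks themselves)
```

Converting through `as.character()` first is safe whether or not the column is a factor.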

Analysis

This first graph shows the distribution of movies by rating. It highlights that most top-grossing movies are rated either PG-13 or R.

ggplot(combined.frame) +
  geom_bar(aes(x=mpaa_rating, fill=Studio)) +
  labs(x='MPAA Rating',
       y='Count',
       title='PG-13 and R Movies are Biggest Sellers'
       )

This graph plots each movie's total gross by rating. The top 3 grossing movies in each rating group are labeled, and a horizontal line marks the highest-grossing movie in each group. It shows that although PG-13 and R movies have roughly equal representation in the top 100, PG-13 movies are the bigger earners.

ggplot(combined.frame) + 
  geom_point(aes(x=mpaa_rating %>% as.factor(), y=Total)) +
  # horizontal line at the highest total gross within each rating group
  geom_hline(yintercept=combined.frame %>% 
                          group_by(mpaa_rating) %>% 
                          top_n(1, Total) %>% 
                          .$Total
             ) +
  # label the three highest-grossing movies within each rating group
  geom_label_repel(data=. %>% 
                          group_by(mpaa_rating) %>% 
                          top_n(3, Total),
                   aes(x=mpaa_rating, y=Total, label=Movie, color=mpaa_rating)
                  ) +
  scale_y_continuous(limits=range(combined.frame$Total),
                     labels=comma
  ) + 
  labs(x='MPAA Rating',
       title='PG-13 Movies Are Highest Earners'
       ) +
  theme(legend.position='none')
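
The per-group selection behind the lines and labels can be checked without plotting. Below is a minimal base R sketch on made-up totals (an assumption, not the scraped figures) that finds the maximum per rating and the two largest rows per rating.

```r
# Made-up totals for illustration (assumption); not taken from the scraped data.
gross <- data.frame(
  rating = c('PG', 'PG', 'PG', 'PG-13', 'PG-13', 'R', 'R'),
  total  = c(300, 250, 100, 600, 400, 320, 170),
  stringsAsFactors = FALSE
)

# Highest total within each rating group: one value per horizontal line
top.by.rating <- tapply(gross$total, gross$rating, max)
print(top.by.rating)

# The two largest totals within each rating group: the rows that would be labeled
top.two <- do.call(rbind, lapply(split(gross, gross$rating), function(g) {
  head(g[order(-g$total), ], 2)
}))
print(top.two)
```

This mirrors what `group_by()` followed by `top_n()` does in the plotting code, with 2 in place of 3 to keep the example short.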

For the next analysis I limited the data to the five most represented movie studios.

best.studios <- combined.frame %>%
  group_by(Studio) %>%
  summarize(count=n()) %>%
  arrange(count %>% desc()) %>%
  top_n(5, count)
best.studios %>% kable()

Studio	count
Fox	12
Uni.	11
WB	9
BV	7
Par.	7

This graph shows the five most represented movie studios, with each studio's movies separated by whether or not the NY Times recommended them. It shows that Buena Vista (Disney) is the most liked studio while Universal is the least liked.

combined.frame %>%
  filter(Studio %in% best.studios$Studio) %>%
  ggplot() +
  geom_bar(aes(x=critics_pick)) + 
  facet_wrap(~Studio) +
  theme(axis.title.y=element_blank()) +
  labs(x='Critics Picks', 
       title='Top 100 Grossing Movies of 2017 by Studio and Recommendations') +
  scale_x_discrete(labels=c('No', 'Yes')) +
  scale_y_continuous(labels=0:10, breaks=0:10)
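
Because the studios have different numbers of movies in the top 100, the comparison can also be read as the share of each studio's movies that were recommended. A minimal base R sketch, using made-up studios `A` and `B` and assuming the pick is stored as a numeric 0/1 value (both are assumptions, not data from this analysis):

```r
# Made-up picks for illustration (assumption); 1 = recommended, 0 = not.
picks <- data.frame(
  studio       = c('A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'),
  critics_pick = c(1, 1, 0, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)

# The mean of a 0/1 column is the proportion of ones within each studio
pick.share <- tapply(picks$critics_pick, picks$studio, mean)
print(pick.share)
```

In `combined.frame` the pick is a factor, so it would first need converting with `as.numeric(as.character(critics_pick))`.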

PG-13 movies are the highest-grossing, while R movies have strong, though lower-grossing, representation in the top 100. G movies rarely appear.

Buena Vista had the best-received blockbuster movies in 2017, while Universal had the least well received.