Week 6 Assignment

By Brian Weinfeld

March 24, 2018

I began by creating a function that handles each call to the NY Times movie reviews API. The function sends the query parameters I supply, together with my API key, to the appropriate URL and returns the results as a tibble.

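# Packages used throughout this assignment
library(tidyverse)  # readr, dplyr, purrr, stringr, tibble, ggplot2
library(httr)       # GET, content
library(jsonlite)   # fromJSON
library(RCurl)      # getURL
library(XML)        # htmlParse, xpathSApply
library(knitr)      # kable
library(ggrepel)    # geom_label_repel
library(scales)     # comma

url <- 'https://api.nytimes.com/svc/movies/v2/reviews/search.json'
# Read the key from a local file; as.vector() drops the column name
api.key <- read_table('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\assignments\\week6assignmentKey.txt', col_names=FALSE) %>% 
  unlist() %>%
  as.vector()

API.Query <- function(params){
  Sys.sleep(1)  # wait one second between calls
  GET(url, query=c('api-key'=api.key, params)) %>%
    content(as='text') %>%
    fromJSON(flatten=TRUE) %>%
    .[[5]] %>%  # fifth element of the parsed response holds the reviews
    as_tibble()
}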
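
To make the parameter handling concrete, here is a minimal sketch in base R of how the key and the per-call parameters combine into one named list before the request is sent. The key value and the build.query helper are placeholders made up for illustration; they are not part of the assignment code.

```r
# Hypothetical placeholder key; the real one is read from a local file
demo.key <- 'abc123'

# Hypothetical helper mirroring c('api-key'=api.key, params) in API.Query
build.query <- function(key, params){
  c('api-key'=key, params)
}

# Combining a character value with a list yields a named list
demo.query <- build.query(demo.key, list('query'='Wonder Woman'))
names(demo.query)
```

Each element of the resulting list becomes one query-string parameter, which is why the key has to be an unnamed vector: a named one would change the parameter name.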

I connected to a website that lists movie box office receipts (Box Office Mojo) and scraped the top 100 domestic grossing movies of 2017, along with several other helpful pieces of data.

movie.data.html <- getURL('http://www.boxofficemojo.com/yearly/chart/?yr=2017&p=.htm') %>% 
  htmlParse()

movie.data.headers <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  .[c(7:9, 11:14)] %>%
  str_split(' / ') %>% 
  unlist() %>%
  str_extract('\\w+') %>%
  unique()

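movie.data.frame <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  # 900 table cells = 100 movies x 9 columns
  .[15:914] %>%
  matrix(ncol=9, byrow=TRUE) %>%
  as.data.frame() %>%
  # Drop the seventh column so the remaining eight match the scraped headers
  select(-7) %>%
  setNames(movie.data.headers) %>%
  # Remove a trailing ' (2017)' from movie titles
  mutate(Movie=str_replace(Movie, '(.*?)( \\(2017\\))$', '\\1')) %>%
  select(1:4)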
movie.data.frame %>% head(20) %>% kable()
| Rank | Movie | Studio | Total |
|---|---|---|---|
| 1 | Star Wars: The Last Jedi | BV | $619,896,809 |
| 2 | Beauty and the Beast | BV | $504,014,165 |
| 3 | Wonder Woman | WB | $412,563,408 |
| 4 | Jumanji: Welcome to the Jungle | Sony | $400,818,450 |
| 5 | Guardians of the Galaxy Vol. 2 | BV | $389,813,101 |
| 6 | Spider-Man: Homecoming | Sony | $334,201,140 |
| 7 | It | WB (NL) | $327,481,748 |
| 8 | Thor: Ragnarok | BV | $315,058,289 |
| 9 | Despicable Me 3 | Uni. | $264,624,300 |
| 10 | Justice League | WB | $229,024,295 |
| 11 | Logan | Fox | $226,277,068 |
| 12 | The Fate of the Furious | Uni. | $226,008,385 |
| 13 | Coco | BV | $209,271,528 |
| 14 | Dunkirk | WB | $188,045,546 |
| 15 | Get Out | Uni. | $176,040,665 |
| 16 | The LEGO Batman Movie | WB | $175,750,384 |
| 17 | The Boss Baby | Fox | $175,003,033 |
| 18 | Pirates of the Caribbean: Dead Men Tell No Tales | BV | $172,558,876 |
| 19 | The Greatest Showman | Fox | $170,448,688 |
| 20 | Kong: Skull Island | WB | $168,052,812 |
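
The reshaping step can be easier to follow on a small example. The sketch below uses base R and made-up cells (three columns rather than nine) to show how a flat vector of scraped cells becomes one row per movie and how a trailing year is stripped from a title. The values and the demo.frame name are illustrative only.

```r
# Made-up cells in the order a scraper would return them: row by row
cells <- c('1', 'Movie A (2017)', '$300',
           '2', 'Movie B',        '$200',
           '3', 'Movie C (2017)', '$100')

# byrow=TRUE fills each row before moving on, so every 3 cells form one record
demo.frame <- as.data.frame(matrix(cells, ncol=3, byrow=TRUE),
                            stringsAsFactors=FALSE)
names(demo.frame) <- c('Rank', 'Movie', 'Total')

# Strip a trailing ' (2017)' from titles; other titles are left unchanged
demo.frame$Movie <- sub(' \\(2017\\)$', '', demo.frame$Movie)
demo.frame
```

Without byrow=TRUE the matrix would be filled column by column, and ranks, titles, and totals would end up mixed within each column.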

I made 100 calls to the NY Times API, sending in each movie title one at a time. Every review that was published in 2017 and matched one of the scraped titles was stored in a data frame.

review.data.frame <- movie.data.frame$Movie %>%
  map_df(~API.Query(list('query'=as.character(.)))) %>% 
  filter(publication_date %>% startsWith('2017')) %>%
  filter(display_title %in% movie.data.frame$Movie) %>%
  select(1:3) %>%
  unique()
review.data.frame %>% head() %>% kable()
| display_title | mpaa_rating | critics_pick |
|---|---|---|
| Star Wars: The Last Jedi | PG-13 | 1 |
| Beauty and the Beast | PG | 1 |
| Wonder Woman | PG-13 | 1 |
| Jumanji: Welcome to the Jungle | PG-13 | 0 |
| Guardians of the Galaxy Vol. 2 | PG-13 | 0 |
| Spider-Man: Homecoming | PG-13 | 0 |
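
The two filters presumably matter because a title search can return reviews for other movies, or for an older movie with the same name. A minimal base R sketch of the same logic, using made-up titles and dates rather than real API output:

```r
# Made-up search results: a same-titled older review and an unrelated title
reviews <- data.frame(
  display_title=c('Movie A', 'Movie A', 'Movie A Returns'),
  publication_date=c('2017-09-06', '1990-11-16', '2017-06-08'),
  stringsAsFactors=FALSE
)
# Made-up list of scraped titles
wanted <- c('Movie A', 'Movie B')

# Keep reviews published in 2017 whose title exactly matches a scraped title
keep <- startsWith(reviews$publication_date, '2017') &
  reviews$display_title %in% wanted
reviews[keep, ]
```

Only the first row survives: the second fails the year check and the third fails the exact title match.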

I combined the two data frames together, keeping the relevant information.

combined.frame <- movie.data.frame %>%
  inner_join(review.data.frame, by=c('Movie'='display_title')) %>%
  # Convert through character first: as.numeric() on a factor column
  # returns the internal level codes (1, 13, 24, ...) rather than the ranks
  mutate(Rank = Rank %>% as.character() %>% as.numeric(),
         Total = Total %>% as.character() %>% parse_number(),
         critics_pick = critics_pick %>% as.factor()
         ) %>%
  distinct()
combined.frame %>% head(10) %>% kable()

| Rank | Movie | Studio | Total | mpaa_rating | critics_pick |
|---|---|---|---|---|---|
| 1 | Star Wars: The Last Jedi | BV | 619896809 | PG-13 | 1 |
| 2 | Beauty and the Beast | BV | 504014165 | PG | 1 |
| 3 | Wonder Woman | WB | 412563408 | PG-13 | 1 |
| 4 | Jumanji: Welcome to the Jungle | Sony | 400818450 | PG-13 | 0 |
| 5 | Guardians of the Galaxy Vol. 2 | BV | 389813101 | PG-13 | 0 |
| 6 | Spider-Man: Homecoming | Sony | 334201140 | PG-13 | 0 |
| 8 | Thor: Ragnarok | BV | 315058289 | PG-13 | 0 |
| 9 | Despicable Me 3 | Uni. | 264624300 | PG | 0 |
| 10 | Justice League | WB | 229024295 | PG-13 | 0 |
| 11 | Logan | Fox | 226277068 | R | 0 |
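
One thing to watch when converting a column such as Rank to numeric: if the column is a factor, as.numeric() returns the internal level codes rather than the printed values. A minimal base R sketch with made-up ranks, plus a base R stand-in for parse_number() that only handles simple dollar strings:

```r
# Made-up ranks stored as a factor, as R versions before 4.0 do by default
rank.factor <- factor(c('1', '2', '10'))

# Levels sort as text ('1', '10', '2'), so the codes are not the ranks
codes <- as.numeric(rank.factor)                  # 1 3 2

# Converting to character first recovers the printed values
ranks <- as.numeric(as.character(rank.factor))    # 1 2 10

# Base R stand-in for parse_number() on a simple dollar string
total <- as.numeric(gsub('[$,]', '', '$619,896,809'))
```

Because the levels are sorted as text, a rank of 2 falls after 1 and 10 through 19, which is how small ranks can turn into surprisingly large numbers.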

The first graph shows the number of movies for each MPAA rating. It highlights how many of the top-grossing movies are rated either PG-13 or R.

ggplot(combined.frame) +
  geom_bar(aes(x=mpaa_rating, fill=Studio)) +
  labs(x='MPAA Rating',
       y='Count',
       title='PG-13 and R Movies are Biggest Sellers'
       )

This graph plots each movie's total gross by MPAA rating. The three top-grossing movies in each rating are labeled, and a horizontal line marks the gross of the top movie in each rating. Although PG-13 and R movies have roughly equal representation in the top 100, the PG-13 movies are clearly the bigger earners.

ggplot(combined.frame) + 
  geom_point(aes(x=mpaa_rating %>% as.factor(), y=Total)) +
  # One horizontal line per rating, at that rating's highest gross
  geom_hline(yintercept=combined.frame %>% 
                          group_by(mpaa_rating) %>% 
                          summarize(top=max(Total)) %>% 
                          .$top
             ) +
  # Label the three highest-grossing movies within each rating
  geom_label_repel(data=. %>% 
                          group_by(mpaa_rating) %>% 
                          top_n(3, Total),
                   aes(x=mpaa_rating, y=Total, label=Movie, color=mpaa_rating)
                  ) +
  scale_y_continuous(limits=range(combined.frame$Total),
                     labels=comma
  ) + 
  labs(x='MPAA Rating',
       title='PG-13 Movies Are Highest Earners'
       ) +
  theme(legend.position='none')
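
The horizontal lines sit at the highest gross within each rating. As a quick way to inspect those values outside the plot, here is a base R sketch on made-up totals; the toy data frame and its numbers are hypothetical and do not come from the scraped data.

```r
# Made-up totals for three ratings
toy <- data.frame(
  mpaa_rating=c('PG', 'PG', 'PG-13', 'PG-13', 'R'),
  Total=c(100, 250, 400, 300, 150),
  stringsAsFactors=FALSE
)

# Highest Total within each rating, returned as a named vector
top.by.rating <- tapply(toy$Total, toy$mpaa_rating, max)
top.by.rating
```

Each value in top.by.rating corresponds to one yintercept in the plot above.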

For the next analysis I limited the data to only the top 5 most represented movie studios.

best.studios <- combined.frame %>%
  group_by(Studio) %>%
  summarize(count=n()) %>%
  arrange(count %>% desc()) %>%
  top_n(5, count)
best.studios %>% kable()
| Studio | count |
|---|---|
| Fox | 12 |
| Uni. | 11 |
| WB | 9 |
| BV | 7 |
| Par. | 7 |

This graph shows the five most represented movie studios, with each studio's movies separated by whether or not the NY Times recommended them. It shows that Buena Vista (Disney) was the best-liked studio of the five, while Universal was the least liked.

combined.frame %>%
  filter(Studio %in% best.studios$Studio) %>%
  ggplot() +
  # geom_bar counts rows per category, the same result as a count histogram
  geom_bar(aes(x=critics_pick)) + 
  facet_wrap(~Studio) +
  theme(axis.title.y=element_blank()) +
  labs(x='Critics Picks', 
       title='Top 100 Grossing Movies of 2017 by Studio and Recommendations') +
  scale_x_discrete(labels=c('No', 'Yes')) +
  scale_y_continuous(labels=0:10, breaks=0:10)
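
The comparison between studios can also be expressed as a share of recommended movies per studio. This is only a sketch in base R on made-up data, assuming critics_pick is a factor with levels '0' and '1' as in combined.frame; the studios 'A' and 'B' are hypothetical.

```r
# Made-up data: studio A has 2 picks out of 3 movies, studio B has 0 out of 2
toy <- data.frame(
  Studio=c('A', 'A', 'A', 'B', 'B'),
  critics_pick=factor(c(1, 1, 0, 0, 0))
)

# The mean of a logical vector is the proportion of TRUE values
pick.share <- tapply(toy$critics_pick == '1', toy$Studio, mean)
pick.share
```

A share puts studios with different numbers of movies on the same footing, which raw counts in the facets do not.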

PG-13 movies are the highest grossing. R movies are also well represented in the top 100, though with lower totals. G movies rarely appear.

Buena Vista had the best-received blockbuster movies of 2017, while Universal had the least well received.