I began by writing a function that handles each call to the NY Times movie reviews API. The function pauses for one second, attaches my API key to whatever query parameters I pass in, sends the request to the search URL, and returns the results as a tibble.
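library(tidyverse)  # readr, dplyr, purrr, stringr, ggplot2, tibble
library(httr)       # GET(), content()
library(jsonlite)   # fromJSON()

url <- 'https://api.nytimes.com/svc/movies/v2/reviews/search.json'
# Read the API key from a local text file so it never appears in the code
api.key <- read_table('C:\\Users\\Brian\\Desktop\\GradClasses\\Spring18\\607\\assignments\\week6assignmentKey.txt', col_names=FALSE) %>%
  unlist() %>%
  as.vector()

API.Query <- function(params){
  Sys.sleep(1)  # wait one second before each request
  GET(url, query=c('api-key'=api.key, params)) %>%
    content(as='text') %>%
    fromJSON(flatten=TRUE) %>%
    .[[5]] %>%  # fifth element of the parsed response holds the reviews
    as_tibble()
}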
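To see what `API.Query` sends without spending a request, the query list can be assembled and rendered as a URL offline. This is a minimal sketch, assuming `httr` is installed; `demo.key` and `demo.params` are placeholder values rather than real credentials, and `modify_url()` is used only for illustration, since the function above calls `GET()` directly.

```r
library(httr)

url <- 'https://api.nytimes.com/svc/movies/v2/reviews/search.json'
demo.key <- 'DEMO-KEY'                 # placeholder, not a real API key
demo.params <- list('query'='Logan')   # same shape as the params argument

# c() on a character vector and a list yields one named list,
# which is the form the query argument expects
demo.query <- c('api-key'=demo.key, demo.params)
names(demo.query)

# Build the request URL without contacting the server
demo.url <- modify_url(url, query=demo.query)
demo.url
```

Because the key is merged in inside the function, callers only ever supply the search parameters.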
I then scraped Box Office Mojo for the 100 top-grossing domestic movies of 2017, along with several other useful pieces of data such as studio and total receipts.
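library(RCurl)  # getURL()
library(XML)    # htmlParse(), xpathSApply()
library(knitr)  # kable()

movie.data.html <- getURL('http://www.boxofficemojo.com/yearly/chart/?yr=2017&p=.htm') %>%
  htmlParse()

# Column names come from the header cells; keep the first word of each
movie.data.headers <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  .[c(7:9, 11:14)] %>%
  str_split(' / ') %>%
  unlist() %>%
  str_extract('\\w+') %>%
  unique()

# Cells 15 to 914 hold the 100 movies, 9 cells per row
movie.data.frame <- movie.data.html %>%
  xpathSApply('//*[@id="body"]/table[3]//tr//td', xmlValue) %>%
  .[15:914] %>%
  matrix(ncol=9, byrow=TRUE) %>%
  as.data.frame() %>%
  select(-7) %>%
  setNames(movie.data.headers) %>%
  # Strip a trailing " (2017)" so titles match the NY Times display titles
  mutate(Movie=str_replace(Movie, '(.*?)( \\(2017\\))$', '\\1')) %>%
  select(1:4)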
movie.data.frame %>% head(20) %>% kable()
| Rank | Movie | Studio | Total |
|---|---|---|---|
| 1 | Star Wars: The Last Jedi | BV | $619,896,809 |
| 2 | Beauty and the Beast | BV | $504,014,165 |
| 3 | Wonder Woman | WB | $412,563,408 |
| 4 | Jumanji: Welcome to the Jungle | Sony | $400,818,450 |
| 5 | Guardians of the Galaxy Vol. 2 | BV | $389,813,101 |
| 6 | Spider-Man: Homecoming | Sony | $334,201,140 |
| 7 | It | WB (NL) | $327,481,748 |
| 8 | Thor: Ragnarok | BV | $315,058,289 |
| 9 | Despicable Me 3 | Uni. | $264,624,300 |
| 10 | Justice League | WB | $229,024,295 |
| 11 | Logan | Fox | $226,277,068 |
| 12 | The Fate of the Furious | Uni. | $226,008,385 |
| 13 | Coco | BV | $209,271,528 |
| 14 | Dunkirk | WB | $188,045,546 |
| 15 | Get Out | Uni. | $176,040,665 |
| 16 | The LEGO Batman Movie | WB | $175,750,384 |
| 17 | The Boss Baby | Fox | $175,003,033 |
| 18 | Pirates of the Caribbean: Dead Men Tell No Tales | BV | $172,558,876 |
| 19 | The Greatest Showman | Fox | $170,448,688 |
| 20 | Kong: Skull Island | WB | $168,052,812 |
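The scraping step flattens the table into one long vector of cell values and rebuilds the rows with `matrix(..., byrow=TRUE)`. Here is a small sketch of that reshaping and of the title cleanup, using made-up cell values in three columns rather than the nine on the real page. The names `cells` and `toy.frame` are hypothetical.

```r
library(stringr)

# Hypothetical flattened cells: rank, title, studio for three movies
cells <- c('1', 'Alpha (2017)', 'BV',
           '2', 'Beta', 'WB',
           '3', 'Gamma (2017)', 'Fox')

# byrow=TRUE fills across each row, so every 3 cells become one record
toy.frame <- cells %>%
  matrix(ncol=3, byrow=TRUE) %>%
  as.data.frame(stringsAsFactors=FALSE) %>%
  setNames(c('Rank', 'Movie', 'Studio'))

# Drop a trailing " (2017)" so titles can be matched against other sources
toy.frame$Movie <- str_replace(toy.frame$Movie, ' \\(2017\\)$', '')
toy.frame
```

Without `byrow=TRUE` the matrix would fill column by column and scramble the records.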
I made 100 calls to the NY Times API, sending in one movie at a time. Every movie that had a 2017 review under a matching title was stored in a data frame.
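review.data.frame <- movie.data.frame$Movie %>%
  # One API call per title; results are row-bound into a single data frame
  map_df(~API.Query(list('query'=as.character(.)))) %>%
  # Keep only 2017 reviews whose title matches a movie on the box office list
  filter(publication_date %>% startsWith('2017')) %>%
  filter(display_title %in% movie.data.frame$Movie) %>%
  select(1:3) %>%
  unique()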
review.data.frame %>% head() %>% kable()
| display_title | mpaa_rating | critics_pick |
|---|---|---|
| Star Wars: The Last Jedi | PG-13 | 1 |
| Beauty and the Beast | PG | 1 |
| Wonder Woman | PG-13 | 1 |
| Jumanji: Welcome to the Jungle | PG-13 | 0 |
| Guardians of the Galaxy Vol. 2 | PG-13 | 0 |
| Spider-Man: Homecoming | PG-13 | 0 |
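The pattern of one request per title, with the results stacked into a single data frame, can be tried without network access. In this sketch `fake.query` is a hypothetical stand-in for `API.Query`, and the titles and dates are invented.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for API.Query: returns one row per "review"
fake.query <- function(params){
  tibble(display_title=params$query,
         publication_date=if (params$query == 'Old Film') '2009-05-01' else '2017-03-02')
}

titles <- c('Alpha', 'Old Film', 'Beta')

toy.reviews <- titles %>%
  map_df(~fake.query(list('query'=.))) %>%                # row-bind per-title results
  filter(publication_date %>% startsWith('2017')) %>%     # keep 2017 reviews only
  filter(display_title %in% titles)                       # keep exact title matches
toy.reviews
```

The second filter presumably matters in the real pipeline because a title search can return reviews of other films with similar names.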
I joined the two data frames on movie title, keeping the relevant information.
combined.frame <- movie.data.frame %>%
  inner_join(review.data.frame, by=c('Movie'='display_title')) %>%
  # Convert through character first: if Rank is a factor, as.numeric()
  # alone returns the level codes rather than the printed ranks
  mutate(Rank = Rank %>% as.character() %>% as.numeric(),
         Total = Total %>% as.character() %>% parse_number(),
         critics_pick = critics_pick %>% as.factor()
  ) %>%
  distinct()
combined.frame %>% head(10) %>% kable()
| Rank | Movie | Studio | Total | mpaa_rating | critics_pick |
|---|---|---|---|---|---|
| 1 | Star Wars: The Last Jedi | BV | 619896809 | PG-13 | 1 |
| 2 | Beauty and the Beast | BV | 504014165 | PG | 1 |
| 3 | Wonder Woman | WB | 412563408 | PG-13 | 1 |
| 4 | Jumanji: Welcome to the Jungle | Sony | 400818450 | PG-13 | 0 |
| 5 | Guardians of the Galaxy Vol. 2 | BV | 389813101 | PG-13 | 0 |
| 6 | Spider-Man: Homecoming | Sony | 334201140 | PG-13 | 0 |
| 8 | Thor: Ragnarok | BV | 315058289 | PG-13 | 0 |
| 9 | Despicable Me 3 | Uni. | 264624300 | PG | 0 |
| 10 | Justice League | WB | 229024295 | PG-13 | 0 |
| 11 | Logan | Fox | 226277068 | R | 0 |
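Type conversion is the delicate part of this step. On R versions before 4.0, `as.data.frame()` turns text columns into factors by default, and calling `as.numeric()` on a factor returns its internal level codes instead of the values it displays. A short sketch of the pitfall, using a made-up factor of rank labels:

```r
library(readr)

rank.factor <- factor(c('1', '2', '10', '11'))

# as.numeric() on a factor returns the internal level codes,
# and the levels sort as text: "1", "10", "11", "2"
as.numeric(rank.factor)                 # 1 4 2 3

# Going through character first recovers the printed values
as.numeric(as.character(rank.factor))   # 1 2 10 11

# parse_number() drops the currency symbol and grouping commas
parse_number('$619,896,809')            # 619896809
```

Converting through `as.character()` is harmless when the column is already character, so it is a safe habit for scraped tables.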
This first graph shows the distribution of movies by MPAA rating. It highlights how many of the top-grossing movies are rated either PG-13 or R.
ggplot(combined.frame) +
  geom_bar(aes(x=mpaa_rating, fill=Studio)) +
  labs(x='MPAA Rating',
       y='Count',
       title='PG-13 and R Movies are Biggest Sellers'
  )
This graph plots each movie's total gross by rating. The three top-grossing movies in each rating are labeled, and a horizontal line marks the highest gross in each rating group. Although PG-13 and R movies have roughly equal representation in the top 100, the PG-13 movies clearly gross more.
library(ggrepel)  # geom_label_repel()
library(scales)   # comma()

ggplot(combined.frame) +
  geom_point(aes(x=mpaa_rating %>% as.factor(), y=Total)) +
  # One horizontal line at the highest gross within each rating
  geom_hline(yintercept=combined.frame %>%
               group_by(mpaa_rating) %>%
               top_n(1, Total) %>%
               .$Total
  ) +
  # Label the three highest-grossing movies in each rating
  geom_label_repel(data=. %>%
                     group_by(mpaa_rating) %>%
                     top_n(3, Total),
                   aes(x=mpaa_rating, y=Total, label=Movie, color=mpaa_rating)
  ) +
  scale_y_continuous(limits=range(combined.frame$Total),
                     labels=comma
  ) +
  labs(x='MPAA Rating',
       title='PG-13 Movies Are Highest Earners'
  ) +
  theme(legend.position='none')
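The reference lines and the labels both rest on picking the largest values within each rating group. A minimal sketch of that grouping on invented figures follows; the `toy` data frame is hypothetical, and `summarize(max())` is used as a compact alternative to the `top_n(1, Total)` call in the plot code.

```r
library(dplyr)

toy <- tibble(mpaa_rating=c('PG', 'PG', 'PG', 'PG-13', 'PG-13', 'R'),
              Movie=c('A', 'B', 'F', 'C', 'D', 'E'),
              Total=c(50, 80, 30, 120, 90, 60))

# Highest gross per rating: these values would become the horizontal lines
top.per.rating <- toy %>%
  group_by(mpaa_rating) %>%
  summarize(top.total=max(Total))

# Rows to label: the top 2 per rating here (the plot uses the top 3)
to.label <- toy %>%
  group_by(mpaa_rating) %>%
  top_n(2, Total) %>%
  ungroup()

top.per.rating
to.label
```

Passing a vector to `yintercept` draws one line per value, which is why a single `geom_hline()` call covers every rating.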
For the next analysis I limited the data to only the top 5 most represented movie studios.
best.studios <- combined.frame %>%
  group_by(Studio) %>%
  summarize(count=n()) %>%
  arrange(count %>% desc()) %>%
  top_n(5, count)
best.studios %>% kable()
| Studio | count |
|---|---|
| Fox | 12 |
| Uni. | 11 |
| WB | 9 |
| BV | 7 |
| Par. | 7 |
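One detail worth knowing is that `top_n()` keeps ties, so it can return more rows than requested when several studios share the cutoff count. A sketch on invented studio names, where `toy.movies` is hypothetical:

```r
library(dplyr)

toy.movies <- tibble(Studio=c('A', 'A', 'A', 'B', 'B', 'C', 'C', 'D'))

# count(sort=TRUE) is shorthand for group_by + summarize(n()) + arrange(desc())
studio.counts <- toy.movies %>%
  count(Studio, sort=TRUE, name='count')

# Asking for the top 2 returns 3 rows because B and C tie on a count of 2
top.studios <- studio.counts %>%
  top_n(2, count)
top.studios
```

If exactly five studios were required regardless of ties, the selection would need an explicit tie-breaking rule.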
This graph shows the five most represented movie studios, with each studio's movies separated by whether the NY Times marked them as a critic's pick. It suggests that Buena Vista (Disney) is the best-liked studio while Universal is the least liked.
combined.frame %>%
  filter(Studio %in% best.studios$Studio) %>%
  ggplot() +
  # critics_pick is a factor, so geom_bar() counts the movies in each level
  geom_bar(aes(x=critics_pick)) +
  facet_wrap(~Studio) +
  theme(axis.title.y=element_blank()) +
  labs(x='Critics Picks',
       title='Top 100 Grossing Movies of 2017 by Studio and Recommendations') +
  scale_x_discrete(labels=c('No', 'Yes')) +
  scale_y_continuous(labels=0:10, breaks=0:10)
PG-13 movies are the highest grossing, while R movies have strong, if somewhat lower, representation in the top 100. G movies rarely appear.
Buena Vista had the best-received blockbuster movies of 2017, judged by NY Times critics' picks, while Universal had the least well-received.