The data is obtained from the website BoxOfficeMojo, which has detailed data concerning the box office totals for on going movie releases.
URL: https://www.boxofficemojo.com/alltime/world/
The website has information about the movies spread in 8 webpages.
Information is read from BoxOfficeMojo into R.
We will perfom analyses in this project to answer the following questions:
Do certain movie studios have a higher percentage of domestic grosses?
Has the percentage of international grosses gone up recently (I’m defining “recent” this as the last decade or so)?
Have movies been getting more popular, grossing more money lately?
library(knitr)
library(XML)
suppressMessages(library(RCurl))
suppressMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1
start_url <- "https://www.boxofficemojo.com/alltime/world/?pagenum="
end_url <- "&p=.htm"
# function
pages <- function(n){
urls <- vector('character')
for (i in 1:n){
temp <- str_c(start_url,i,end_url, collapse = "")
urls <- c(urls, temp)
}
return(urls)
}
urls <-pages(8)
urls
## [1] "https://www.boxofficemojo.com/alltime/world/?pagenum=1&p=.htm"
## [2] "https://www.boxofficemojo.com/alltime/world/?pagenum=2&p=.htm"
## [3] "https://www.boxofficemojo.com/alltime/world/?pagenum=3&p=.htm"
## [4] "https://www.boxofficemojo.com/alltime/world/?pagenum=4&p=.htm"
## [5] "https://www.boxofficemojo.com/alltime/world/?pagenum=5&p=.htm"
## [6] "https://www.boxofficemojo.com/alltime/world/?pagenum=6&p=.htm"
## [7] "https://www.boxofficemojo.com/alltime/world/?pagenum=7&p=.htm"
## [8] "https://www.boxofficemojo.com/alltime/world/?pagenum=8&p=.htm"
We create a dataframe with empty vectors to storage all the information that is going to be read from the website.
movie_df <- data.frame(V1 = character(0), V2 = character(0), V3 = character(0), V4 = character(0), V5 = character(0), V6 = character(0), V7 = character(0), V8 = character(0), V9=character(0), stringsAsFactors = FALSE)
sapply(movie_df, class)
## V1 V2 V3 V4 V5 V6
## "character" "character" "character" "character" "character" "character"
## V7 V8 V9
## "character" "character" "character"
Now we retrieve the information from each page, parse it and append to the above dataframe.
Tyding the data will also be performed in this step, since the numerical data contains the symbols $, % and ^.
for (i in 1:8){
url <-urls[i]
htmlData <- getURL(url)
parsedData <-htmlParse(htmlData, encoding = "UTF-8")
data <- readHTMLTable(parsedData, skip.rows =1 , stringsAsFactors = FALSE)
table <- data[[2]]
# combine dataframes
movie_df <- suppressWarnings(bind_rows(movie_df, table))
}
tbl_df(movie_df)
## # A tibble: 752 x 9
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Avatar Fox $2,78… $760… 27.3% $2,02… 72.7% 2009^
## 2 2 Titanic Par. $2,18… $659… 30.1% $1,52… 69.9% 1997^
## 3 3 Star Wars: The Force… BV $2,06… $936… 45.3% $1,13… 54.7% 2015
## 4 4 Avengers: Infinity W… BV $2,04… $678… 33.2% $1,36… 66.8% 2018
## 5 5 Jurassic World Uni. $1,67… $652… 39.0% $1,01… 61.0% 2015
## 6 6 Marvel's The Avengers BV $1,51… $623… 41.0% $895.5 59.0% 2012
## 7 7 Furious 7 Uni. $1,51… $353… 23.3% $1,16… 76.7% 2015
## 8 8 Avengers: Age of Ult… BV $1,40… $459… 32.7% $946.4 67.3% 2015
## 9 9 Black Panther BV $1,34… $700… 52.0% $646.8 48.0% 2018
## 10 10 Harry Potter and the… WB $1,34… $381… 28.4% $960.5 71.6% 2011
## # ... with 742 more rows
We identify the box office values in the thounsands and convert to millions.
For example, we want to extract the string “18.1” from “$18.1k”, then convert to numeric and divide by 10000.
for (k in c(4,5,6)){
indices <- grep(".k$",movie_df[,k],value=F)
if(length(indices)>0){
values <- grep(".k$",movie_df[,k],value=T)
cat("Before... ",values, "\n")
tempVal <- unlist(str_extract_all(values,'[\\d\\.]+'))
tempVal <- as.numeric(as.character(tempVal))
tempVal <- as.character(tempVal/1000)
cat("After... ",tempVal, "\n\n")
movie_df[indices,k] <- tempVal
print("This is the result of the changes...")
print(movie_df[indices,])
}
}
## Before... $32.8k $706k $363k $880k $18.1k
## After... 0.0328 0.706 0.363 0.88 0.0181
##
## [1] "This is the result of the changes..."
## V1 V2 V3 V4 V5
## 288 288 Monster Hunt FR $385.3 0.0328
## 330 330 Monster Hunt 2 LGF $361.7 0.706
## 548 548 Kung Fu Yoga WGUSA $254.2 0.363
## 568 568 Journey to the West: The Demons Strike Back Sony $246.6 0.88
## 726 726 Journey to the West Magn. $205.6 0.0181
## V6 V7 V8 V9
## 288 0.0% $385.3 100.0% 2016
## 330 0.2% $361.0 99.8% 2018
## 548 0.1% $253.8 99.9% 2017
## 568 0.4% $245.7 99.6% 2017
## 726 0.0% $205.6 100.0% 2014
Removing the symbols $, % and ^ from all the other columns and changing them to numeric type in R.
for(j in seq(4,9)){
movie_df[,j] <- suppressWarnings(as.numeric(gsub("[%,\\$\\^]","", movie_df[,j])))
}
sapply(movie_df, class)
## V1 V2 V3 V4 V5 V6
## "character" "character" "character" "numeric" "numeric" "numeric"
## V7 V8 V9
## "numeric" "numeric" "numeric"
And rename the columns of the movie_df.
colnames(movie_df) <- c( "Rank", "Title" , "Studio", "Worldwide", "Domestic", "Domestic.Perc", "Overseas", "Overseas.Perc", "Year")
tbl_df(movie_df)
## # A tibble: 752 x 9
## Rank Title Studio Worldwide Domestic Domestic.Perc Overseas
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Avat… Fox 2788 760. 27.3 2028.
## 2 2 Tita… Par. 2188. 659. 30.1 1528.
## 3 3 Star… BV 2068. 937. 45.3 1132.
## 4 4 Aven… BV 2046. 679. 33.2 1368.
## 5 5 Jura… Uni. 1672. 652. 39 1019.
## 6 6 Marv… BV 1519. 623. 41 896.
## 7 7 Furi… Uni. 1516 353 23.3 1163
## 8 8 Aven… BV 1405. 459 32.7 946.
## 9 9 Blac… BV 1347. 700. 52 647.
## 10 10 Harr… WB 1342. 381 28.4 960.
## # ... with 742 more rows, and 2 more variables: Overseas.Perc <dbl>,
## # Year <dbl>
Let’s find the top studios with the highest Ratio of Domestic to International percentages:
StudioData2<-movie_df %>% group_by(Studio) %>%
summarise(mean=round(mean(Domestic.Perc),2))
StudioData2[order(StudioData2$mean,decreasing=T)[1:10],]
## # A tibble: 10 x 2
## Studio mean
## <chr> <dbl>
## 1 Dis. 68
## 2 IFC 65.5
## 3 NM 60.6
## 4 SPC 60
## 5 USA 59.8
## 6 Art. 56.5
## 7 Col. 54.9
## 8 SonR 54.5
## 9 Lions 53.6
## 10 NL 52.2
To answer that question, let’s calculate the average % of overseas grosses in the last decade and compare it with the % in earlier years.
#Average International gross percentages between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
summarise(mean=mean(Overseas.Perc))
## mean
## 1 63.80488
#Average International gross percentages before 2008
movie_df %>% filter(Year<2008) %>%
summarise(mean=mean(Overseas.Perc))
## mean
## 1 54.97135
We can conclude that it seems that International grosses have gone up in the last decade.
To answer that question, let’s calculate the average total gross dollars in the last decade and compare it with the dollars in earlier years.
#Average total gross dollars between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
summarise(mean=mean(Worldwide))
## mean
## 1 482.6488
#Average total gross dollars before 2008
movie_df %>% filter(Year<2008) %>%
summarise(mean=mean(Worldwide))
## mean
## 1 375.0611
We can conclude that movies have been grossing more money in the past decade.
plot(movie_df$Year, movie_df$Worldwide,
xlim=c(1938,2018),
main = "Worldwide Movie Grosses",
xlab = "Year",
ylab = "Worlwide Gross Dollars (Millions)"
)
The graph above illustrated that visually as well. There seems to be more top films in recent years.