As a first step, we came up with a stategy for working on the assignment together. First, we picked the 3 datasets we were interested in analyzing. Then we each picked one dataset to take a lead on after discussing some basic strategies. We worked on those individually, collaborating and reaching out to each other with questions. That gave us an opportunity to perfect our data wrangling and collaboration skils. Then we worked on a 3rd dataset together.
The data is obtained from the website BoxOfficeMojo, which has detailed data concerning the box office totals for on going movie releases.
URL: https://www.boxofficemojo.com/alltime/world/
The website has information about the movies spread in 8 webpages.
Information is read from BoxOfficeMojo into R.
We will perfom analyses in this project to answer the following questions:
Do certain movie studios have a higher percentage of domestic grosses?
Has the percentage of international grosses gone up recently (I’m defining “recent” this as the last decade or so)?
Have movies been getting more popular, grossing more money lately?
library(knitr)
library(XML)
suppressMessages(library(RCurl))
suppressMessages(library(tidyverse))
start_url <- "https://www.boxofficemojo.com/alltime/world/?pagenum="
end_url <- "&p=.htm"
# function
pages <- function(n){
urls <- vector('character')
for (i in 1:n){
temp <- str_c(start_url,i,end_url, collapse = "")
urls <- c(urls, temp)
}
return(urls)
}
urls <-pages(8)
urls
## [1] "https://www.boxofficemojo.com/alltime/world/?pagenum=1&p=.htm"
## [2] "https://www.boxofficemojo.com/alltime/world/?pagenum=2&p=.htm"
## [3] "https://www.boxofficemojo.com/alltime/world/?pagenum=3&p=.htm"
## [4] "https://www.boxofficemojo.com/alltime/world/?pagenum=4&p=.htm"
## [5] "https://www.boxofficemojo.com/alltime/world/?pagenum=5&p=.htm"
## [6] "https://www.boxofficemojo.com/alltime/world/?pagenum=6&p=.htm"
## [7] "https://www.boxofficemojo.com/alltime/world/?pagenum=7&p=.htm"
## [8] "https://www.boxofficemojo.com/alltime/world/?pagenum=8&p=.htm"
We create a dataframe with empty vectors to storage all the information that is going to be read from the website.
movie_df <- data.frame(V1 = character(0), V2 = character(0), V3 = character(0), V4 = character(0), V5 = character(0), V6 = character(0), V7 = character(0), V8 = character(0), V9=character(0), stringsAsFactors = FALSE)
sapply(movie_df, class)
## V1 V2 V3 V4 V5 V6
## "character" "character" "character" "character" "character" "character"
## V7 V8 V9
## "character" "character" "character"
Now we retrieve the information from each page, parse it and append to the above dataframe.
Tyding the data will also be performed in this step, since the numerical data contains the symbols $, % and ^.
for (i in 1:8){
url <-urls[i]
htmlData <- getURL(url)
parsedData <-htmlParse(htmlData, encoding = "UTF-8")
data <- readHTMLTable(parsedData, skip.rows =1 , stringsAsFactors = FALSE)
table <- data[[2]]
# combine dataframes
movie_df <- suppressWarnings(bind_rows(movie_df, table))
}
tbl_df(movie_df)
## # A tibble: 752 x 9
## V1 V2 V3 V4 V5 V6 V7 V8 V9
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 Avatar Fox $2,78… $760… 27.3% $2,02… 72.7% 2009^
## 2 2 Titanic Par. $2,18… $659… 30.1% $1,52… 69.9% 1997^
## 3 3 Star Wars: The Force… BV $2,06… $936… 45.3% $1,13… 54.7% 2015
## 4 4 Avengers: Infinity W… BV $2,04… $678… 33.2% $1,36… 66.8% 2018
## 5 5 Jurassic World Uni. $1,67… $652… 39.0% $1,01… 61.0% 2015
## 6 6 Marvel's The Avengers BV $1,51… $623… 41.0% $895.5 59.0% 2012
## 7 7 Furious 7 Uni. $1,51… $353… 23.3% $1,16… 76.7% 2015
## 8 8 Avengers: Age of Ult… BV $1,40… $459… 32.7% $946.4 67.3% 2015
## 9 9 Black Panther BV $1,34… $700… 52.0% $646.8 48.0% 2018
## 10 10 Harry Potter and the… WB $1,34… $381… 28.4% $960.5 71.6% 2011
## # ... with 742 more rows
We identify the box office values in the thounsands and convert to millions.
For example, we want to extract the string “18.1” from “$18.1k”, then convert to numeric and divide by 10000.
for (k in c(4,5,6)){
indices <- grep(".k$",movie_df[,k],value=F)
if(length(indices)>0){
values <- grep(".k$",movie_df[,k],value=T)
cat("Before... ",values, "\n")
tempVal <- unlist(str_extract_all(values,'[\\d\\.]+'))
tempVal <- as.numeric(as.character(tempVal))
tempVal <- as.character(tempVal/1000)
cat("After... ",tempVal, "\n\n")
movie_df[indices,k] <- tempVal
print("This the result of the changes...")
print(movie_df[indices,])
}
}
## Before... $32.8k $706k $363k $880k $18.1k
## After... 0.0328 0.706 0.363 0.88 0.0181
##
## [1] "This the result of the changes..."
## V1 V2 V3 V4 V5
## 288 288 Monster Hunt FR $385.3 0.0328
## 330 330 Monster Hunt 2 LGF $361.7 0.706
## 548 548 Kung Fu Yoga WGUSA $254.2 0.363
## 568 568 Journey to the West: The Demons Strike Back Sony $246.6 0.88
## 726 726 Journey to the West Magn. $205.6 0.0181
## V6 V7 V8 V9
## 288 0.0% $385.3 100.0% 2016
## 330 0.2% $361.0 99.8% 2018
## 548 0.1% $253.8 99.9% 2017
## 568 0.4% $245.7 99.6% 2017
## 726 0.0% $205.6 100.0% 2014
Removing the symbols $, % and ^ from all the other columns and changing them to numeric type in R.
for(j in seq(4,9)){
movie_df[,j] <- suppressWarnings(as.numeric(gsub("[%,\\$\\^]","", movie_df[,j])))
}
sapply(movie_df, class)
## V1 V2 V3 V4 V5 V6
## "character" "character" "character" "numeric" "numeric" "numeric"
## V7 V8 V9
## "numeric" "numeric" "numeric"
And rename the columns of the movie_df.
colnames(movie_df) <- c( "Rank", "Title" , "Studio", "Worldwide", "Domestic", "Domestic.Perc", "Overseas", "Overseas.Perc", "Year")
tbl_df(movie_df)
## # A tibble: 752 x 9
## Rank Title Studio Worldwide Domestic Domestic.Perc Overseas
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Avat… Fox 2788 760. 27.3 2028.
## 2 2 Tita… Par. 2188. 659. 30.1 1528.
## 3 3 Star… BV 2068. 937. 45.3 1132.
## 4 4 Aven… BV 2046. 679. 33.2 1368.
## 5 5 Jura… Uni. 1672. 652. 39 1019.
## 6 6 Marv… BV 1519. 623. 41 896.
## 7 7 Furi… Uni. 1516 353 23.3 1163
## 8 8 Aven… BV 1405. 459 32.7 946.
## 9 9 Blac… BV 1347. 700. 52 647.
## 10 10 Harr… WB 1342. 381 28.4 960.
## # ... with 742 more rows, and 2 more variables: Overseas.Perc <dbl>,
## # Year <dbl>
Let’s sort the domestic grosses to see what the top 10 studios are.
StudioData2<-movie_df %>% group_by(Studio) %>%
summarise(mean=round(mean(Domestic.Perc),2))
StudioData2[order(StudioData2$mean,decreasing=T)[1:10],]
## # A tibble: 10 x 2
## Studio mean
## <chr> <dbl>
## 1 Dis. 68
## 2 IFC 65.5
## 3 NM 60.6
## 4 SPC 60
## 5 USA 59.8
## 6 Art. 56.5
## 7 Col. 54.9
## 8 SonR 54.5
## 9 Lions 53.6
## 10 NL 52.2
To answer that question, let’s calculate the average % of overseas grosses in the last decade and compare it with the % in earlier years.
#Average International gross percentages between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
summarise(mean=mean(Overseas.Perc))
## mean
## 1 63.80488
#Average International gross percentages before 2008
movie_df %>% filter(Year<2008) %>%
summarise(mean=mean(Overseas.Perc))
## mean
## 1 54.97135
We can conclude that it seems that International grosses have gone up in the last decade.
To answer that question, let’s calculate the average total gross dollars in the last decade and compare it with the dollars in earlier years.
#Average total gross dollars between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
summarise(mean=mean(Worldwide))
## mean
## 1 482.6488
#Average total gross dollars before 2008
movie_df %>% filter(Year<2008) %>%
summarise(mean=mean(Worldwide))
## mean
## 1 375.0611
We can conclude that movies have been grossing more money in the past decade.
plot(movie_df$Year, movie_df$Worldwide,
xlim=c(1938,2018),
main = "Worldwide Movie Grosses",
xlab = "Year",
ylab = "Worlwide Gross Dollars (Millions)"
)
The graph above illustrated that visually as well. There seems to be more top films in recent years.