Dataset

The data is obtained from the website BoxOfficeMojo, which has detailed data concerning the box office totals for on going movie releases.

URL: https://www.boxofficemojo.com/alltime/world/

The website has information about the movies spread in 8 webpages.

Tidying, transforming and analyzing Data

Information is read from BoxOfficeMojo into R.

We will perfom analyses in this project to answer the following questions:


Load packages

library(knitr)
library(XML)
suppressMessages(library(RCurl))
suppressMessages(library(tidyverse))
## Warning: package 'dplyr' was built under R version 3.5.1

Parsing data from web site

We create a dataframe with empty vectors to storage all the information that is going to be read from the website.

movie_df <- data.frame(V1 = character(0), V2 = character(0), V3 = character(0), V4 = character(0), V5 = character(0), V6 = character(0), V7 = character(0), V8 = character(0), V9=character(0), stringsAsFactors = FALSE)

sapply(movie_df, class)
##          V1          V2          V3          V4          V5          V6 
## "character" "character" "character" "character" "character" "character" 
##          V7          V8          V9 
## "character" "character" "character"

Now we retrieve the information from each page, parse it and append to the above dataframe.

Tyding the data will also be performed in this step, since the numerical data contains the symbols $, % and ^.

for (i in 1:8){
        url <-urls[i]
        htmlData <- getURL(url)
        parsedData <-htmlParse(htmlData, encoding = "UTF-8")
        data <- readHTMLTable(parsedData, skip.rows =1 , stringsAsFactors = FALSE)
        table <- data[[2]]
        
        # combine dataframes
        movie_df <- suppressWarnings(bind_rows(movie_df, table)) 
}    

tbl_df(movie_df)
## # A tibble: 752 x 9
##    V1    V2                    V3    V4     V5    V6    V7     V8    V9   
##    <chr> <chr>                 <chr> <chr>  <chr> <chr> <chr>  <chr> <chr>
##  1 1     Avatar                Fox   $2,78… $760… 27.3% $2,02… 72.7% 2009^
##  2 2     Titanic               Par.  $2,18… $659… 30.1% $1,52… 69.9% 1997^
##  3 3     Star Wars: The Force… BV    $2,06… $936… 45.3% $1,13… 54.7% 2015 
##  4 4     Avengers: Infinity W… BV    $2,04… $678… 33.2% $1,36… 66.8% 2018 
##  5 5     Jurassic World        Uni.  $1,67… $652… 39.0% $1,01… 61.0% 2015 
##  6 6     Marvel's The Avengers BV    $1,51… $623… 41.0% $895.5 59.0% 2012 
##  7 7     Furious 7             Uni.  $1,51… $353… 23.3% $1,16… 76.7% 2015 
##  8 8     Avengers: Age of Ult… BV    $1,40… $459… 32.7% $946.4 67.3% 2015 
##  9 9     Black Panther         BV    $1,34… $700… 52.0% $646.8 48.0% 2018 
## 10 10    Harry Potter and the… WB    $1,34… $381… 28.4% $960.5 71.6% 2011 
## # ... with 742 more rows

We identify the box office values in the thounsands and convert to millions.

For example, we want to extract the string “18.1” from “$18.1k”, then convert to numeric and divide by 10000.

for (k in c(4,5,6)){
        indices <- grep(".k$",movie_df[,k],value=F)
        if(length(indices)>0){
                values <- grep(".k$",movie_df[,k],value=T)
                cat("Before... ",values, "\n")
                tempVal <- unlist(str_extract_all(values,'[\\d\\.]+'))
                tempVal <- as.numeric(as.character(tempVal))
                tempVal <- as.character(tempVal/1000)
                cat("After... ",tempVal, "\n\n")
                movie_df[indices,k] <- tempVal
                print("This is the result of the changes...")
                print(movie_df[indices,])
        }
}
## Before...  $32.8k $706k $363k $880k $18.1k 
## After...  0.0328 0.706 0.363 0.88 0.0181 
## 
## [1] "This is the result of the changes..."
##      V1                                          V2    V3     V4     V5
## 288 288                                Monster Hunt    FR $385.3 0.0328
## 330 330                              Monster Hunt 2   LGF $361.7  0.706
## 548 548                                Kung Fu Yoga WGUSA $254.2  0.363
## 568 568 Journey to the West: The Demons Strike Back  Sony $246.6   0.88
## 726 726                         Journey to the West Magn. $205.6 0.0181
##       V6     V7     V8   V9
## 288 0.0% $385.3 100.0% 2016
## 330 0.2% $361.0  99.8% 2018
## 548 0.1% $253.8  99.9% 2017
## 568 0.4% $245.7  99.6% 2017
## 726 0.0% $205.6 100.0% 2014

Removing the symbols $, % and ^ from all the other columns and changing them to numeric type in R.

for(j in seq(4,9)){
        movie_df[,j] <- suppressWarnings(as.numeric(gsub("[%,\\$\\^]","", movie_df[,j])))
}

sapply(movie_df, class)
##          V1          V2          V3          V4          V5          V6 
## "character" "character" "character"   "numeric"   "numeric"   "numeric" 
##          V7          V8          V9 
##   "numeric"   "numeric"   "numeric"

And rename the columns of the movie_df.

colnames(movie_df) <- c( "Rank", "Title" , "Studio", "Worldwide", "Domestic", "Domestic.Perc", "Overseas", "Overseas.Perc", "Year")
tbl_df(movie_df)
## # A tibble: 752 x 9
##    Rank  Title Studio Worldwide Domestic Domestic.Perc Overseas
##    <chr> <chr> <chr>      <dbl>    <dbl>         <dbl>    <dbl>
##  1 1     Avat… Fox        2788      760.          27.3    2028.
##  2 2     Tita… Par.       2188.     659.          30.1    1528.
##  3 3     Star… BV         2068.     937.          45.3    1132.
##  4 4     Aven… BV         2046.     679.          33.2    1368.
##  5 5     Jura… Uni.       1672.     652.          39      1019.
##  6 6     Marv… BV         1519.     623.          41       896.
##  7 7     Furi… Uni.       1516      353           23.3    1163 
##  8 8     Aven… BV         1405.     459           32.7     946.
##  9 9     Blac… BV         1347.     700.          52       647.
## 10 10    Harr… WB         1342.     381           28.4     960.
## # ... with 742 more rows, and 2 more variables: Overseas.Perc <dbl>,
## #   Year <dbl>

Analysis

1. Do certain movie studios have a higher percentage of domestic grosses?

Let’s find the top studios with the highest Ratio of Domestic to International percentages:

StudioData2<-movie_df %>% group_by(Studio) %>%
  summarise(mean=round(mean(Domestic.Perc),2))
StudioData2[order(StudioData2$mean,decreasing=T)[1:10],]
## # A tibble: 10 x 2
##    Studio  mean
##    <chr>  <dbl>
##  1 Dis.    68  
##  2 IFC     65.5
##  3 NM      60.6
##  4 SPC     60  
##  5 USA     59.8
##  6 Art.    56.5
##  7 Col.    54.9
##  8 SonR    54.5
##  9 Lions   53.6
## 10 NL      52.2

2. Has the percentage of international grosses gone up recently (I’m defining “recent” this as the last decade or so)?

To answer that question, let’s calculate the average % of overseas grosses in the last decade and compare it with the % in earlier years.

#Average International gross percentages between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
  summarise(mean=mean(Overseas.Perc))
##       mean
## 1 63.80488
#Average International gross percentages before 2008
movie_df %>% filter(Year<2008) %>%
  summarise(mean=mean(Overseas.Perc))
##       mean
## 1 54.97135

We can conclude that it seems that International grosses have gone up in the last decade.