Tidying, transforming and analyzing Data

Information is read from BoxOfficeMojo into R.

We will perfom analyses in this project to answer the following questions:

Do certain movie studios have a higher percentage of domestic grosses?
Has the percentage of international grosses gone up recently (I’m defining “recent” this as the last decade or so)?
Have movies been getting more popular, grossing more money lately?

Load packages

library(knitr)
library(XML)
suppressMessages(library(RCurl))
suppressMessages(library(tidyverse))

## Warning: package 'dplyr' was built under R version 3.5.1

Function that creates a vector with all links to be accessed to retrieve data

start_url <- "https://www.boxofficemojo.com/alltime/world/?pagenum="
end_url <- "&p=.htm"
# function
pages <- function(n){
        urls <- vector('character')
        for (i in 1:n){
                temp <- str_c(start_url,i,end_url, collapse = "")
                urls <- c(urls, temp)
        }
        return(urls)
}

urls <-pages(8)
urls

## [1] "https://www.boxofficemojo.com/alltime/world/?pagenum=1&p=.htm"
## [2] "https://www.boxofficemojo.com/alltime/world/?pagenum=2&p=.htm"
## [3] "https://www.boxofficemojo.com/alltime/world/?pagenum=3&p=.htm"
## [4] "https://www.boxofficemojo.com/alltime/world/?pagenum=4&p=.htm"
## [5] "https://www.boxofficemojo.com/alltime/world/?pagenum=5&p=.htm"
## [6] "https://www.boxofficemojo.com/alltime/world/?pagenum=6&p=.htm"
## [7] "https://www.boxofficemojo.com/alltime/world/?pagenum=7&p=.htm"
## [8] "https://www.boxofficemojo.com/alltime/world/?pagenum=8&p=.htm"

Parsing data from web site

We create a dataframe with empty vectors to storage all the information that is going to be read from the website.

movie_df <- data.frame(V1 = character(0), V2 = character(0), V3 = character(0), V4 = character(0), V5 = character(0), V6 = character(0), V7 = character(0), V8 = character(0), V9=character(0), stringsAsFactors = FALSE)

sapply(movie_df, class)

##          V1          V2          V3          V4          V5          V6 
## "character" "character" "character" "character" "character" "character" 
##          V7          V8          V9 
## "character" "character" "character"

Now we retrieve the information from each page, parse it and append to the above dataframe.

Tyding the data will also be performed in this step, since the numerical data contains the symbols $, % and ^.

for (i in 1:8){
        url <-urls[i]
        htmlData <- getURL(url)
        parsedData <-htmlParse(htmlData, encoding = "UTF-8")
        data <- readHTMLTable(parsedData, skip.rows =1 , stringsAsFactors = FALSE)
        table <- data[[2]]
        
        # combine dataframes
        movie_df <- suppressWarnings(bind_rows(movie_df, table)) 
}    

tbl_df(movie_df)

## # A tibble: 752 x 9
##    V1    V2                    V3    V4     V5    V6    V7     V8    V9   
##    <chr> <chr>                 <chr> <chr>  <chr> <chr> <chr>  <chr> <chr>
##  1 1     Avatar                Fox   $2,78… $760… 27.3% $2,02… 72.7% 2009^
##  2 2     Titanic               Par.  $2,18… $659… 30.1% $1,52… 69.9% 1997^
##  3 3     Star Wars: The Force… BV    $2,06… $936… 45.3% $1,13… 54.7% 2015 
##  4 4     Avengers: Infinity W… BV    $2,04… $678… 33.2% $1,36… 66.8% 2018 
##  5 5     Jurassic World        Uni.  $1,67… $652… 39.0% $1,01… 61.0% 2015 
##  6 6     Marvel's The Avengers BV    $1,51… $623… 41.0% $895.5 59.0% 2012 
##  7 7     Furious 7             Uni.  $1,51… $353… 23.3% $1,16… 76.7% 2015 
##  8 8     Avengers: Age of Ult… BV    $1,40… $459… 32.7% $946.4 67.3% 2015 
##  9 9     Black Panther         BV    $1,34… $700… 52.0% $646.8 48.0% 2018 
## 10 10    Harry Potter and the… WB    $1,34… $381… 28.4% $960.5 71.6% 2011 
## # ... with 742 more rows

We identify the box office values in the thounsands and convert to millions.

For example, we want to extract the string “18.1” from “$18.1k”, then convert to numeric and divide by 10000.

for (k in c(4,5,6)){
        indices <- grep(".k$",movie_df[,k],value=F)
        if(length(indices)>0){
                values <- grep(".k$",movie_df[,k],value=T)
                cat("Before... ",values, "\n")
                tempVal <- unlist(str_extract_all(values,'[\\d\\.]+'))
                tempVal <- as.numeric(as.character(tempVal))
                tempVal <- as.character(tempVal/1000)
                cat("After... ",tempVal, "\n\n")
                movie_df[indices,k] <- tempVal
                print("This is the result of the changes...")
                print(movie_df[indices,])
        }
}

## Before...  $32.8k $706k $363k $880k $18.1k 
## After...  0.0328 0.706 0.363 0.88 0.0181 
## 
## [1] "This is the result of the changes..."
##      V1                                          V2    V3     V4     V5
## 288 288                                Monster Hunt    FR $385.3 0.0328
## 330 330                              Monster Hunt 2   LGF $361.7  0.706
## 548 548                                Kung Fu Yoga WGUSA $254.2  0.363
## 568 568 Journey to the West: The Demons Strike Back  Sony $246.6   0.88
## 726 726                         Journey to the West Magn. $205.6 0.0181
##       V6     V7     V8   V9
## 288 0.0% $385.3 100.0% 2016
## 330 0.2% $361.0  99.8% 2018
## 548 0.1% $253.8  99.9% 2017
## 568 0.4% $245.7  99.6% 2017
## 726 0.0% $205.6 100.0% 2014

Removing the symbols $, % and ^ from all the other columns and changing them to numeric type in R.

for(j in seq(4,9)){
        movie_df[,j] <- suppressWarnings(as.numeric(gsub("[%,\\$\\^]","", movie_df[,j])))
}

sapply(movie_df, class)

##          V1          V2          V3          V4          V5          V6 
## "character" "character" "character"   "numeric"   "numeric"   "numeric" 
##          V7          V8          V9 
##   "numeric"   "numeric"   "numeric"

And rename the columns of the movie_df.

colnames(movie_df) <- c( "Rank", "Title" , "Studio", "Worldwide", "Domestic", "Domestic.Perc", "Overseas", "Overseas.Perc", "Year")
tbl_df(movie_df)

## # A tibble: 752 x 9
##    Rank  Title Studio Worldwide Domestic Domestic.Perc Overseas
##    <chr> <chr> <chr>      <dbl>    <dbl>         <dbl>    <dbl>
##  1 1     Avat… Fox        2788      760.          27.3    2028.
##  2 2     Tita… Par.       2188.     659.          30.1    1528.
##  3 3     Star… BV         2068.     937.          45.3    1132.
##  4 4     Aven… BV         2046.     679.          33.2    1368.
##  5 5     Jura… Uni.       1672.     652.          39      1019.
##  6 6     Marv… BV         1519.     623.          41       896.
##  7 7     Furi… Uni.       1516      353           23.3    1163 
##  8 8     Aven… BV         1405.     459           32.7     946.
##  9 9     Blac… BV         1347.     700.          52       647.
## 10 10    Harr… WB         1342.     381           28.4     960.
## # ... with 742 more rows, and 2 more variables: Overseas.Perc <dbl>,
## #   Year <dbl>

Analysis

1. Do certain movie studios have a higher percentage of domestic grosses?

Let’s find the top studios with the highest Ratio of Domestic to International percentages:

StudioData2<-movie_df %>% group_by(Studio) %>%
  summarise(mean=round(mean(Domestic.Perc),2))
StudioData2[order(StudioData2$mean,decreasing=T)[1:10],]

## # A tibble: 10 x 2
##    Studio  mean
##    <chr>  <dbl>
##  1 Dis.    68  
##  2 IFC     65.5
##  3 NM      60.6
##  4 SPC     60  
##  5 USA     59.8
##  6 Art.    56.5
##  7 Col.    54.9
##  8 SonR    54.5
##  9 Lions   53.6
## 10 NL      52.2

2. Has the percentage of international grosses gone up recently (I’m defining “recent” this as the last decade or so)?

To answer that question, let’s calculate the average % of overseas grosses in the last decade and compare it with the % in earlier years.

#Average International gross percentages between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
  summarise(mean=mean(Overseas.Perc))

##       mean
## 1 63.80488

#Average International gross percentages before 2008
movie_df %>% filter(Year<2008) %>%
  summarise(mean=mean(Overseas.Perc))

##       mean
## 1 54.97135

We can conclude that it seems that International grosses have gone up in the last decade.

3. Have movies been getting more popular, grossing more money lately?

To answer that question, let’s calculate the average total gross dollars in the last decade and compare it with the dollars in earlier years.

#Average total gross dollars between 2008 and 2018
movie_df %>% filter(Year>=2008) %>%
  summarise(mean=mean(Worldwide))

##       mean
## 1 482.6488

#Average total gross dollars before 2008
movie_df %>% filter(Year<2008) %>%
  summarise(mean=mean(Worldwide))

##       mean
## 1 375.0611

We can conclude that movies have been grossing more money in the past decade.

plot(movie_df$Year, movie_df$Worldwide, 
   xlim=c(1938,2018),
   main = "Worldwide Movie Grosses",
   xlab = "Year",
   ylab = "Worlwide Gross Dollars (Millions)"
)

The graph above illustrated that visually as well. There seems to be more top films in recent years.

Project 2 - Dataset 3

E. Azrilyan and B. Sosnovski

10/04/2018

Dataset