Capstone Project

library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)

##Abstract

The film industry was greatly impacted by the Covid-19 Pandemic with the production and theatrical releases stopping in the majority of 2020 and early 2021. During this time, studios tested releasing films in theaters, but it has not been promising. For example, the anticipated blockbuster movie of 2020, Christopher Nolan’s Tenet netted only 20% of expected box office sales last summer. This has resulted in studios to pivot their strategy to either selling their to streaming platforms (Netflix, Amazon, Apple TV+) or releasing it on their own sister platforms (HBO Max, Disney Plus, Peacock). While there have been significant changes to the industry, studios are still committed to theatrical releases and have strategically used this time to prepare for life after the pandemic. Covid-19. One project that can help the industry is improving the strategy around which theaters to release new movies, as some movies perform better in some areas while underperforming in others.

##The Problem Movie sales were projected to increase in revenue prior to the pandemic happening. According to the US census Bureau, the Motion Picture & Sound Recording Industry revenue sales decreased by 14.8% in 2020, however the industry is looking to recover by 2024 with a projected increase 1.3% from 2019 to 2024 (Statista, 2020). According to another study conducted by the Video Trends Report from TiVo, 44% of people in the US and Canada said they were less likely to see a movie, while 66% said they were more or about the same likelihood of watching a movie in theaters (Stoll, 2021). One market that saw a decline in the movie and entertainment market prior to the pandemic, was the UK. According to a journal published by MarketLine, the movies and entertainment industry in the UK fell by 5.5% in revenue from 2014- 2018 (MarketLine, 2019). The main reason for the decline was the increase in streaming services available in the nation. While this represents a decline in the industry, studios won’t stop theater releases in the UK and there will still be a market for theater experience. While the industry is paused, studios have an opportunity to strategize their theatrical releases during this time. With all the data that is available right now, studios can modify their strategy on how they open their films in the theater. Studios are one of the most traditional industries and they are very conservative on their planning strategies. For example, the strategy around trailer releases has not changed in decades where studios continue to release teaser trailers months before the release date and one the date is finalized, they release a new trailer exactly four weeks prior to the release. Currently theater revenues dictate if a film will be released in a specific theater, but it doesn’t expand beyond that and opens an opportunity for growth. One question that can aid the industry is why certain movies over perform, while others underperform.

##The Data

I was able to come across movie theater revenue data for top 100 movies from 2015-2019 in the UK. The data is broken down by film, region, city, theater name, with weekly revenue results. After exploring the data, it needs to be cleaned and used for predictive modeling. The project needs to be a tool used for studios planning and budgeting, so I want the tool to be interactive with predictive modeling happening in the backend.

##Data cleaning First we need to clean the data to make sure we have everything we need.

master_file <- read_excel("C:/Users/munkhnaran.gankhuyag/OneDrive - insidemedia.net/Python/Rentrak Project/master_file_0509.xlsx", 
col_types = c("numeric", "text", "numeric", 
"text", "text", "numeric", "text", 
"text", "text", "text", "date", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric","numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric"))
summary(master_file)

##       Year          Film           TheatreNumber     TheatreName       
##  Min.   :2015   Length:57068       Min.   :   7550   Length:57068      
##  1st Qu.:2016   Class :character   1st Qu.:   8467   Class :character  
##  Median :2017   Mode  :character   Median :  14767   Mode  :character  
##  Mean   :2017                      Mean   : 229319                     
##  3rd Qu.:2018                      3rd Qu.:  50273                     
##  Max.   :2019                      Max.   :1092965                     
##                                                                        
##      City           City Population     Circuit             Region         
##  Length:57068       Min.   :   1409   Length:57068       Length:57068      
##  Class :character   1st Qu.:  48516   Class :character   Class :character  
##  Mode  :character   Median : 117763   Mode  :character   Mode  :character  
##                     Mean   : 703944                                        
##                     3rd Qu.: 318014                                        
##                     Max.   :7556900                                        
##                     NA's   :1919                                           
##     Branch             State           DateofFirstEngagement        
##  Length:57068       Length:57068       Min.   :2015-01-01 00:00:00  
##  Class :character   Class :character   1st Qu.:2016-06-18 00:00:00  
##  Mode  :character   Mode  :character   Median :2017-09-08 00:00:00  
##                                        Mean   :2017-08-03 12:47:23  
##                                        3rd Qu.:2018-11-14 00:00:00  
##                                        Max.   :2020-10-19 00:00:00  
##                                                                     
##      Weeks        Weeks\r\nInRange  MarketShare          TotalGross     
##  Min.   : 1.000   Min.   : 1.000   Min.   :2.000e-08   Min.   :      1  
##  1st Qu.: 6.000   1st Qu.: 5.000   1st Qu.:9.240e-04   1st Qu.:  18990  
##  Median : 9.000   Median : 8.000   Median :1.890e-03   Median :  40989  
##  Mean   : 9.411   Mean   : 8.882   Mean   :2.278e-03   Mean   :  59970  
##  3rd Qu.:12.000   3rd Qu.:12.000   3rd Qu.:3.137e-03   3rd Qu.:  77285  
##  Max.   :33.000   Max.   :33.000   Max.   :3.651e-02   Max.   :1885916  
##                                                                         
##  OpeningWeekendGross Opening7DayGross   Week0Gross        Week1WeekendGross
##  Min.   :     0      Min.   :     0   Min.   :    0.000   Min.   :     0   
##  1st Qu.:  3667      1st Qu.:  6097   1st Qu.:    0.000   1st Qu.:  2650   
##  Median :  8615      Median : 14378   Median :    0.000   Median :  8341   
##  Mean   : 14324      Mean   : 23707   Mean   :    8.365   Mean   : 13893   
##  3rd Qu.: 17730      3rd Qu.: 29256   3rd Qu.:    0.000   3rd Qu.: 17715   
##  Max.   :301698      Max.   :609211   Max.   :13828.830   Max.   :301698   
##                                                                            
##    Week1Gross     Week2WeekendGross   Week2Gross     Week3WeekendGross
##  Min.   :     0   Min.   :     0    Min.   :     0   Min.   :     0   
##  1st Qu.:  4536   1st Qu.:  2671    1st Qu.:  4491   1st Qu.:  1629   
##  Median : 13494   Median :  5891    Median :  9729   Median :  3849   
##  Mean   : 21723   Mean   :  8738    Mean   : 14333   Mean   :  5596   
##  3rd Qu.: 28083   3rd Qu.: 11270    3rd Qu.: 18360   3rd Qu.:  7423   
##  Max.   :492339   Max.   :158272    Max.   :314517   Max.   :153820   
##                                                                       
##    Week3Gross     Week4WeekendGross    Week4Gross     Week5WeekendGross
##  Min.   :     0   Min.   :     0.0   Min.   :     0   Min.   :     0   
##  1st Qu.:  2654   1st Qu.:   808.6   1st Qu.:  1285   1st Qu.:   239   
##  Median :  6127   Median :  2308.8   Median :  3582   Median :  1356   
##  Mean   :  8890   Mean   :  3408.2   Mean   :  5466   Mean   :  2171   
##  3rd Qu.: 11739   3rd Qu.:  4645.8   3rd Qu.:  7242   3rd Qu.:  3008   
##  Max.   :298679   Max.   :126346.5   Max.   :235305   Max.   :101123   
##                                                                        
##    Week5Gross       Week6WeekendGross   Week6Gross     Week7WeekendGross
##  Min.   :     0.0   Min.   :    0.0   Min.   :     0   Min.   :    0.0  
##  1st Qu.:   395.2   1st Qu.:    0.0   1st Qu.:     0   1st Qu.:    0.0  
##  Median :  2133.1   Median :  676.1   Median :  1024   Median :  223.1  
##  Mean   :  3520.2   Mean   : 1329.3   Mean   :  2145   Mean   :  869.0  
##  3rd Qu.:  4882.1   3rd Qu.: 1896.0   3rd Qu.:  2923   3rd Qu.: 1195.9  
##  Max.   :177792.0   Max.   :72739.7   Max.   :133714   Max.   :64386.0  
##                                                                         
##    Week7Gross       Week8WeekendGross   Week8Gross      Week9WeekendGross
##  Min.   :     0.0   Min.   :    0.0   Min.   :    0.0   Min.   :    0.0  
##  1st Qu.:     0.0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:    0.0  
##  Median :   335.1   Median :    0.0   Median :    0.0   Median :    0.0  
##  Mean   :  1355.4   Mean   :  517.8   Mean   :  801.7   Mean   :  307.4  
##  3rd Qu.:  1745.8   3rd Qu.:  610.5   3rd Qu.:  880.3   3rd Qu.:  285.4  
##  Max.   :112019.8   Max.   :60438.4   Max.   :83665.4   Max.   :32806.6  
##                                                                          
##    Week9Gross      Week10WeekendGross  Week10Gross        Adventure     
##  Min.   :    0.0   Min.   :    0.0    Min.   :    0.0   Min.   :0.0000  
##  1st Qu.:    0.0   1st Qu.:    0.0    1st Qu.:    0.0   1st Qu.:0.0000  
##  Median :    0.0   Median :    0.0    Median :    0.0   Median :1.0000  
##  Mean   :  489.9   Mean   :  208.5    Mean   :  316.2   Mean   :0.6703  
##  3rd Qu.:  466.6   3rd Qu.:  115.2    3rd Qu.:  181.4   3rd Qu.:1.0000  
##  Max.   :62609.3   Max.   :21501.2    Max.   :32906.2   Max.   :1.0000  
##                                                                         
##      Action          Musical           Drama           History       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.4493   Mean   :0.1635   Mean   :0.2803   Mean   :0.08411  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##                                                                      
##      Comedy          Mystery           Romance          Thriller     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.3971   Mean   :0.08658   Mean   :0.1141   Mean   :0.2056  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##                                                                      
##      Sci Fi            War              Family           Horror       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.4815   Mean   :0.03701   Mean   :0.3236   Mean   :0.02173  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##                                                                       
##    Superhero          Novel          Animation        Franchise     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.1963   Mean   :0.1097   Mean   :0.2322   Mean   :0.6068  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##      Sequel      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4687  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
##

When we observe the data, we see that the year we have years range from 2015-2019. 2020 data was excluded, as limited movies were released during that time. In addition to the original data, some time was spent gathering additional data such as genres and city populations. Our primary focus will be total gross, which will be our independent variable in our dataset.

We need to look for ourliers in our dataset. Since there are multiple levels(Film, Theatre), we need to observe both the average gross for the film and theaters to make sure there aren’t any outliers.

film_avg <- master_file %>% group_by(Film) %>% summarise(avg_gross = mean(TotalGross), avg_weeks = mean(Weeks)) %>% arrange(desc(avg_gross))

head(film_avg)

## # A tibble: 6 x 3
##   Film                    avg_gross avg_weeks
##   <chr>                       <dbl>     <dbl>
## 1 Force Awakens             239999.     11.6 
## 2 Spectre                   157770.     10.1 
## 3 AVENGERS ENDGAME          153506.      8.67
## 4 STAR WARS THE LAST JEDI   148581.      7.79
## 5 BEAUTY AND THE BEAST      139429.     12.3 
## 6 AVENGERS INFINITY WAR     136245.      9.70

theatre_avg <- master_file %>% group_by(TheatreNumber) %>% summarise(avg_gross = mean(TotalGross), avg_weeks = mean(Weeks)) %>% arrange(desc(avg_gross))

head(theatre_avg)

## # A tibble: 6 x 3
##   TheatreNumber avg_gross avg_weeks
##           <dbl>     <dbl>     <dbl>
## 1         52140   276850.     13.6 
## 2         14600   251588.     11.3 
## 3         14568   246481.      4.47
## 4         14382   227757.     12.1 
## 5         14289   218552.     13.2 
## 6        997733   204193.     13.9

Instantly, we notice that Force Awakens has a significant gross per theater.

Splitting the data in quartiles

"Film Average Gross Sales Summary"

## [1] "Film Average Gross Sales Summary"

summary(film_avg$avg_gross)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   22546   37162   47063   61637   78539  239999

print("")

## [1] ""

"Theatre Average Gross Sales Summary"

## [1] "Theatre Average Gross Sales Summary"

summary(theatre_avg$avg_gross)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      28    4122   16003   33052   50026  276850

Some charts to further investigate the data.

ggplot(film_avg, aes(x=avg_weeks, y=avg_gross)) +
  geom_point()+labs(title="Average Film Sales vs Average of Weeks in Theatre",
        x ="Average Number of Weeks", y = "Average Gross")

ggplot(film_avg) +
  aes(x = "", y = avg_gross) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal() + labs(title="Average Film Sales",
        x ="Films", y = "Average Gross")

ggplot(theatre_avg, aes(x=avg_weeks, y=avg_gross)) +
  geom_point()+labs(title="Average Theatre Sales vs Average of Weeks in Theatre",
        x ="Average Number of Weeks", y = "Average Gross")

ggplot(theatre_avg) +
  aes(x = "", y = avg_gross) +
  geom_boxplot(fill = "#0c4c8a") +
  theme_minimal() + labs(title="Average Threatre Sales",
        x ="Threatres", y = "Average Gross")

Histogram charts.

hist(film_avg$avg_gross, 50, main = "Histogram Avg. Gross Sales by Film", xlab = "Average Sales")

hist(theatre_avg$avg_gross, 50, main = "Histogram Avg. Gross Sales by Theatre", xlab = "Average Sales")

We will use IQR method to remove the outliers in the data.

Q <- quantile(theatre_avg$avg_gross, probs= c(0.25, .75), na.rm = TRUE)
Q2 <- quantile(film_avg$avg_gross, probs= c(0.25, .75), na.rm = TRUE)

iqr <- IQR(theatre_avg$avg_gross)
iqr2 <- IQR(film_avg$avg_gross)

up <-  Q[2]+1.5*iqr # Upper Range  
low<- Q[1]-1.5*iqr # Lower Range


up2 <-  Q2[2]+1.5*iqr2 # Upper Range  
low2<- Q2[1]-1.5*iqr2 # Lower Range

We notice the 75% cutoff is 118

'Theatre and Film cutoff'

## [1] "Theatre and Film cutoff"

"Theatre Gross"

## [1] "Theatre Gross"

up

##      75% 
## 118882.2

low

##       25% 
## -64733.78

"Film Gross"

## [1] "Film Gross"

up2

##      75% 
## 140605.5

low2

##       25% 
## -24904.13

Eliminate the data outside of cutoff point and identify which rows were removed. We see that 43 Theatres and 4 Films were considered outliers and were removed from our dataset.

#eliminate outliers
eliminated<- subset(theatre_avg, avg_gross > (Q[1] - 1.5*iqr) & theatre_avg$avg_gross < (Q[2]+1.5*iqr))
eliminated2<- subset(film_avg, avg_gross > (Q2[1] - 1.5*iqr2) & film_avg$avg_gross < (Q2[2]+1.5*iqr2))

#which outliers were removed
which(theatre_avg$avg_gross %in% c(boxplot.stats(theatre_avg$avg_gross)$out))

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43

which(film_avg$avg_gross %in% c(boxplot.stats(film_avg$avg_gross)$out))

## [1] 1 2 3 4

library(outliers)
highest <- grubbs.test(eliminated$avg_gross)
lowest <- grubbs.test(eliminated$avg_gross, opposite = TRUE )

highest2 <- grubbs.test(eliminated2$avg_gross)
lowest2 <- grubbs.test(eliminated2$avg_gross, opposite = TRUE )

highest

## 
##  Grubbs test for one outlier
## 
## data:  eliminated$avg_gross
## G = 3.05789, U = 0.98998, p-value = 1
## alternative hypothesis: highest value 118306.926076923 is an outlier

lowest

## 
##  Grubbs test for one outlier
## 
## data:  eliminated$avg_gross
## G = 0.90979, U = 0.99911, p-value = 1
## alternative hypothesis: lowest value 28.0000000030941 is an outlier

highest2

## 
##  Grubbs test for one outlier
## 
## data:  eliminated2$avg_gross
## G = 2.82005, U = 0.93318, p-value = 0.254
## alternative hypothesis: highest value 139428.994704322 is an outlier

lowest2

## 
##  Grubbs test for one outlier
## 
## data:  eliminated2$avg_gross
## G = 1.22238, U = 0.98744, p-value = 1
## alternative hypothesis: lowest value 22546.4670281431 is an outlier

After cleaning our dataset and removing our outliers, we have a clean dataset to work with for our analysis.

exclude_films <- film_avg[1:4,1]
exclude_theatres <- theatre_avg[1:43,1]

x<- subset(master_file, !Film %in% exclude_films[[1]])
master_clean <- subset(x, !TheatreNumber %in% exclude_theatres[[1]])

##Analysis

Now that we have the data we need, we can start our analysis.

master_clean_r <- read_excel("C:/Users/munkhnaran.gankhuyag/OneDrive - insidemedia.net/Python/Rentrak Project/master_clean_r5.xlsx", 
col_types = c("numeric", "text", "text", 
"text", "text", "numeric", "text", 
"text", "text", "text", "date", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "numeric", "numeric", 
"numeric", "text", "numeric", "numeric", 
"numeric", "numeric"))
summary(master_clean_r)

##       Year          Film           TheatreNumber      TheatreName       
##  Min.   :2015   Length:46795       Length:46795       Length:46795      
##  1st Qu.:2016   Class :character   Class :character   Class :character  
##  Median :2017   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :2017                                                           
##  3rd Qu.:2018                                                           
##  Max.   :2019                                                           
##      City           CityPopulation      Circuit             Region         
##  Length:46795       Min.   :   1409   Length:46795       Length:46795      
##  Class :character   1st Qu.:  46870   Class :character   Class :character  
##  Mode  :character   Median : 102629   Mode  :character   Mode  :character  
##                     Mean   : 613024                                        
##                     3rd Qu.: 287535                                        
##                     Max.   :7556900                                        
##     Branch             State           DateofFirstEngagement        
##  Length:46795       Length:46795       Min.   :2015-01-01 00:00:00  
##  Class :character   Class :character   1st Qu.:2016-06-23 00:00:00  
##  Mode  :character   Mode  :character   Median :2017-07-21 00:00:00  
##                                        Mean   :2017-07-29 10:16:57  
##                                        3rd Qu.:2018-11-09 00:00:00  
##                                        Max.   :2020-09-04 00:00:00  
##      Weeks        Weeks\r\r\r\nInRange  MarketShare          TotalGross    
##  Min.   : 1.000   Min.   : 1.00        Min.   :1.765e-05   Min.   :  1003  
##  1st Qu.: 6.000   1st Qu.: 6.00        1st Qu.:1.028e-03   1st Qu.: 20619  
##  Median : 9.000   Median : 8.00        Median :1.859e-03   Median : 39131  
##  Mean   : 9.688   Mean   : 9.15        Mean   :2.059e-03   Mean   : 50992  
##  3rd Qu.:12.000   3rd Qu.:12.00        3rd Qu.:2.878e-03   3rd Qu.: 67364  
##  Max.   :33.000   Max.   :33.00        Max.   :1.742e-02   Max.   :417447  
##  OpeningWeekendGross Opening7DayGross   Week0Gross       Week1WeekendGross
##  Min.   :     0      Min.   :     0   Min.   :    0.00   Min.   :     0   
##  1st Qu.:  4064      1st Qu.:  6714   1st Qu.:    0.00   1st Qu.:  3296   
##  Median :  8287      Median : 13852   Median :    0.00   Median :  8126   
##  Mean   : 11934      Mean   : 19405   Mean   :    6.34   Mean   : 11539   
##  3rd Qu.: 15480      3rd Qu.: 25571   3rd Qu.:    0.00   3rd Qu.: 15512   
##  Max.   :157095      Max.   :257244   Max.   :13828.83   Max.   :153963   
##    Week1Gross     Week2WeekendGross   Week2Gross     Week3WeekendGross
##  Min.   :     0   Min.   :    0     Min.   :     0   Min.   :    0    
##  1st Qu.:  5569   1st Qu.: 2961     1st Qu.:  4931   1st Qu.: 1830    
##  Median : 13152   Median : 5654     Median :  9329   Median : 3685    
##  Mean   : 18073   Mean   : 7487     Mean   : 12137   Mean   : 4798    
##  3rd Qu.: 24571   3rd Qu.: 9936     3rd Qu.: 16165   3rd Qu.: 6543    
##  Max.   :224744   Max.   :75661     Max.   :119096   Max.   :40255    
##    Week3Gross    Week4WeekendGross   Week4Gross    Week5WeekendGross
##  Min.   :    0   Min.   :    0     Min.   :    0   Min.   :    0.0  
##  1st Qu.: 2953   1st Qu.:  946     1st Qu.: 1501   1st Qu.:  365.8  
##  Median : 5897   Median : 2217     Median : 3434   Median : 1311.0  
##  Mean   : 7704   Mean   : 2963     Mean   : 4807   Mean   : 1893.2  
##  3rd Qu.:10389   3rd Qu.: 4127     3rd Qu.: 6458   3rd Qu.: 2687.9  
##  Max.   :70610   Max.   :28730     Max.   :53169   Max.   :24929.8  
##    Week5Gross      Week6WeekendGross   Week6Gross    Week7WeekendGross
##  Min.   :    0.0   Min.   :    0.0   Min.   :    0   Min.   :    0.0  
##  1st Qu.:  584.1   1st Qu.:    0.0   1st Qu.:    0   1st Qu.:    0.0  
##  Median : 2053.0   Median :  661.1   Median : 1002   Median :  230.4  
##  Mean   : 3085.8   Mean   : 1144.3   Mean   : 1862   Mean   :  739.6  
##  3rd Qu.: 4377.1   3rd Qu.: 1694.6   3rd Qu.: 2608   3rd Qu.: 1071.2  
##  Max.   :38053.9   Max.   :18573.1   Max.   :32885   Max.   :18245.0  
##    Week7Gross    Week8WeekendGross   Week8Gross      Week9WeekendGross
##  Min.   :    0   Min.   :    0.0   Min.   :    0.0   Min.   :   0.0   
##  1st Qu.:    0   1st Qu.:    0.0   1st Qu.:    0.0   1st Qu.:   0.0   
##  Median :  339   Median :    0.0   Median :    0.0   Median :   0.0   
##  Mean   : 1156   Mean   :  437.5   Mean   :  673.5   Mean   : 255.1   
##  3rd Qu.: 1556   3rd Qu.:  548.5   3rd Qu.:  784.6   3rd Qu.: 260.1   
##  Max.   :23313   Max.   :11116.2   Max.   :23738.2   Max.   :7576.1   
##    Week9Gross      Week10WeekendGross  Week10Gross        Adventure     
##  Min.   :    0.0   Min.   :   0.00    Min.   :    0.0   Min.   :0.0000  
##  1st Qu.:    0.0   1st Qu.:   0.00    1st Qu.:    0.0   1st Qu.:0.0000  
##  Median :    0.0   Median :   0.00    Median :    0.0   Median :1.0000  
##  Mean   :  395.8   Mean   : 165.88    Mean   :  252.4   Mean   :0.6561  
##  3rd Qu.:  421.3   3rd Qu.:  98.75    3rd Qu.:  155.1   3rd Qu.:1.0000  
##  Max.   :14170.3   Max.   :9175.00    Max.   :13230.9   Max.   :1.0000  
##      Action          Musical           Drama           History       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :0.4322   Mean   :0.1641   Mean   :0.2916   Mean   :0.08326  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##      Comedy          Mystery           Romance          Thriller     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.0000  
##  Mean   :0.4125   Mean   :0.08377   Mean   :0.1204   Mean   :0.2025  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##      SciFi             War              Family           Horror       
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.4781   Mean   :0.03964   Mean   :0.3288   Mean   :0.02321  
##  3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##    Superhero          Novel          Animation        Franchise     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000   Median :1.0000  
##  Mean   :0.2006   Mean   :0.1172   Mean   :0.2346   Mean   :0.5935  
##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      Sequel         NoTheatres      GrossPerWeek     PrimaryGenre      
##  Min.   :0.0000   Min.   : 1.000   Min.   :  107.8   Length:46795      
##  1st Qu.:0.0000   1st Qu.: 1.000   1st Qu.: 2573.6   Class :character  
##  Median :0.0000   Median : 2.000   Median : 4290.7   Mode  :character  
##  Mean   :0.4516   Mean   : 6.038   Mean   : 5331.0                     
##  3rd Qu.:1.0000   3rd Qu.: 4.000   3rd Qu.: 6905.4                     
##  Max.   :1.0000   Max.   :57.000   Max.   :48592.9                     
##  AvgTheatreGross   AvgCityGross    MonthofFirstEngagement    CitySize     
##  Min.   :  1004   Min.   :  1347   Min.   : 1.000         Min.   :  5000  
##  1st Qu.: 29668   1st Qu.: 37796   1st Qu.: 3.000         1st Qu.: 10000  
##  Median : 49048   Median : 49836   Median : 6.000         Median :100000  
##  Mean   : 50992   Mean   : 50992   Mean   : 6.247         Mean   :144163  
##  3rd Qu.: 70987   3rd Qu.: 62607   3rd Qu.: 9.000         3rd Qu.:250000  
##  Max.   :116479   Max.   :110064   Max.   :12.000         Max.   :500000

####Multi Regression Model

model_data <- master_clean_r %>% dplyr::select(Adventure, Action, Musical, Drama, History, Comedy, Mystery, Romance, Thriller, SciFi, War, Family, Horror, Superhero, Novel, Animation, Franchise, Sequel, MonthofFirstEngagement, CitySize,NoTheatres, TotalGross, AvgCityGross,CityPopulation)

##Sample the dataset. The return for this is row nos.
set.seed(1)
row.number <- sample(1:nrow(model_data), 0.8*nrow(model_data))
train = model_data[row.number,]
test = model_data[-row.number,]
dim(train)

## [1] 37436    24

dim(test)

## [1] 9359   24

ggplot(train, aes(TotalGross)) + geom_density(fill="blue")

ggplot(train, aes(log(TotalGross))) + geom_density(fill="blue")

ggplot(train, aes(sqrt(TotalGross))) + geom_density(fill="blue")

#Lets make default model.
model1 = lm(log(TotalGross)~., data=train)
summary(model1)

## 
## Call:
## lm(formula = log(TotalGross) ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5818 -0.4048  0.0603  0.5231  2.4725 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.830e+00  2.423e-02 364.375  < 2e-16 ***
## Adventure               1.021e-01  1.237e-02   8.255  < 2e-16 ***
## Action                 -1.071e-01  1.635e-02  -6.551 5.78e-11 ***
## Musical                 1.606e-01  1.406e-02  11.424  < 2e-16 ***
## Drama                  -9.121e-02  1.553e-02  -5.872 4.35e-09 ***
## History                 1.334e-01  2.187e-02   6.100 1.07e-09 ***
## Comedy                 -5.587e-02  1.127e-02  -4.959 7.10e-07 ***
## Mystery                -1.157e-01  1.801e-02  -6.426 1.32e-10 ***
## Romance                 2.843e-01  1.723e-02  16.499  < 2e-16 ***
## Thriller                8.115e-02  1.435e-02   5.654 1.58e-08 ***
## SciFi                  -2.107e-02  1.158e-02  -1.820   0.0688 .  
## War                     1.440e-01  2.405e-02   5.985 2.18e-09 ***
## Family                  3.096e-01  1.687e-02  18.351  < 2e-16 ***
## Horror                 -3.529e-02  3.348e-02  -1.054   0.2919    
## Superhero               3.634e-01  1.543e-02  23.555  < 2e-16 ***
## Novel                   1.409e-01  1.630e-02   8.644  < 2e-16 ***
## Animation              -1.832e-02  1.804e-02  -1.015   0.3099    
## Franchise               2.577e-02  1.204e-02   2.139   0.0324 *  
## Sequel                  1.653e-01  1.072e-02  15.414  < 2e-16 ***
## MonthofFirstEngagement  1.608e-02  1.351e-03  11.903  < 2e-16 ***
## CitySize                5.308e-08  3.584e-08   1.481   0.1386    
## NoTheatres             -1.780e-02  2.990e-03  -5.953 2.65e-09 ***
## AvgCityGross            2.409e-05  2.033e-07 118.490  < 2e-16 ***
## CityPopulation          1.236e-07  2.136e-08   5.788 7.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8064 on 37412 degrees of freedom
## Multiple R-squared:  0.3214, Adjusted R-squared:  0.321 
## F-statistic: 770.5 on 23 and 37412 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model1)

Improve the model by removing statistically insignificant variables.

model2 = update(model1, ~.-CitySize-Horror-Animation-SciFi) 
summary(model2)

## 
## Call:
## lm(formula = log(TotalGross) ~ Adventure + Action + Musical + 
##     Drama + History + Comedy + Mystery + Romance + Thriller + 
##     War + Family + Superhero + Novel + Franchise + Sequel + MonthofFirstEngagement + 
##     NoTheatres + AvgCityGross + CityPopulation, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5862 -0.4057  0.0598  0.5245  2.4731 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             8.824e+00  2.386e-02 369.916  < 2e-16 ***
## Adventure               9.655e-02  1.207e-02   8.001 1.26e-15 ***
## Action                 -1.070e-01  1.534e-02  -6.977 3.07e-12 ***
## Musical                 1.646e-01  1.382e-02  11.908  < 2e-16 ***
## Drama                  -8.772e-02  1.530e-02  -5.733 9.97e-09 ***
## History                 1.342e-01  2.182e-02   6.151 7.77e-10 ***
## Comedy                 -5.694e-02  1.081e-02  -5.267 1.39e-07 ***
## Mystery                -1.176e-01  1.648e-02  -7.139 9.56e-13 ***
## Romance                 2.853e-01  1.667e-02  17.120  < 2e-16 ***
## Thriller                7.545e-02  1.356e-02   5.562 2.68e-08 ***
## War                     1.454e-01  2.400e-02   6.059 1.39e-09 ***
## Family                  2.965e-01  1.345e-02  22.049  < 2e-16 ***
## Superhero               3.513e-01  1.451e-02  24.216  < 2e-16 ***
## Novel                   1.442e-01  1.533e-02   9.407  < 2e-16 ***
## Franchise               2.973e-02  1.134e-02   2.622  0.00875 ** 
## Sequel                  1.635e-01  1.056e-02  15.485  < 2e-16 ***
## MonthofFirstEngagement  1.592e-02  1.330e-03  11.978  < 2e-16 ***
## NoTheatres             -1.585e-02  2.691e-03  -5.890 3.90e-09 ***
## AvgCityGross            2.413e-05  1.998e-07 120.771  < 2e-16 ***
## CityPopulation          1.125e-07  2.003e-08   5.615 1.98e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8064 on 37416 degrees of freedom
## Multiple R-squared:  0.3213, Adjusted R-squared:  0.3209 
## F-statistic: 932.1 on 19 and 37416 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model2)

pred1 <- predict(model2, newdata = test)
rmse <- sqrt(sum((exp(pred1) - test$TotalGross)^2)/length(test$TotalGross))
c(RMSE = rmse, R2=summary(model2)$r.squared)

##         RMSE           R2 
## 38383.509900     0.321274

Since we have a cross-sectional design, our R2 of 32% is acceptable for out dataset.

par(mfrow=c(1,1))
plot(test$TotalGross, exp(pred1))

###Multi level Regression Model

library(gridExtra)
library(lme4)
library(knitr)
library(broom)
library(tidyverse)

We’ll take the same data set but include City, Film and TheatreNumber to address the different levels in our data.

model_data2 <- master_clean_r %>% dplyr::select(City, Film, TheatreNumber, Adventure, Action, Musical, Drama, History, Comedy, Mystery, Romance, Thriller, SciFi, War, Family, Horror, Superhero, Novel, Animation, Franchise, Sequel, MonthofFirstEngagement, CitySize,NoTheatres, TotalGross, AvgCityGross,CityPopulation)

While controlling the city as a random variable, we get the following model

model2.1 = lmer(TotalGross ~1 + (1|City), REML = FALSE, data = model_data2)
summary(model2.1)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | City)
##    Data: model_data2
## 
##       AIC       BIC    logLik  deviance  df.resid 
## 1122515.4 1122541.6 -561254.7 1122509.4     46792 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.6109 -0.6108 -0.2196  0.3325  9.3887 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.
##  City     (Intercept) 6.598e+08 25686   
##  Residual             1.491e+09 38615   
## Number of obs: 46795, groups:  City, 405
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)    40341       1363   29.59

While controlling the Film and Theatre as a random variable, we get the following model

model.a1 <- lmer(TotalGross~ 1 + (1|Film) + (1|TheatreNumber)
                  , data = model_data2)

summary(model.a1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | Film) + (1 | TheatreNumber)
##    Data: model_data2
## 
## REML criterion at convergence: 1079736
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.5131 -0.5220 -0.0298  0.4425 11.2152 
## 
## Random effects:
##  Groups        Name        Variance  Std.Dev.
##  TheatreNumber (Intercept) 666057910 25808   
##  Film          (Intercept) 677626440 26031   
##  Residual                  573290128 23943   
## Number of obs: 46795, groups:  TheatreNumber, 692; Film, 121
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)    40081       2582   15.53

While controlling the film, theatre and city as a random variable, we get the following model

model.b1 <- lmer(TotalGross~ 1 + (1|Film) + (1|TheatreNumber) +(1|City)
                  , data = model_data2)

summary(model.b1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | Film) + (1 | TheatreNumber) + (1 | City)
##    Data: model_data2
## 
## REML criterion at convergence: 1079731
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.5140 -0.5220 -0.0298  0.4425 11.2144 
## 
## Random effects:
##  Groups        Name        Variance  Std.Dev.
##  TheatreNumber (Intercept) 643846341 25374   
##  City          (Intercept)  17937618  4235   
##  Film          (Intercept) 677702086 26033   
##  Residual                  573296991 23944   
## Number of obs: 46795, groups:  TheatreNumber, 692; City, 405; Film, 121
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)    40633       2599   15.63

##Conclusion The random effects of lmer models allow us to account for the different levels in the dataset such as city and theatre. We are not looking for specific film prediction, we need to predict performance of a theatre based on specific movie titles. While multiple level modeling can be helpful, its challenging to interpret therefore, when we develop our program, we can use the multiple linear model and account for the different cities based on population and number of movie theaters as it does a good job fitting our model.