library(readxl)
library(dplyr)
library(tidyr)
library(ggplot2)
##Abstract
The film industry was greatly impacted by the Covid-19 Pandemic with the production and theatrical releases stopping in the majority of 2020 and early 2021. During this time, studios tested releasing films in theaters, but it has not been promising. For example, the anticipated blockbuster movie of 2020, Christopher Nolan’s Tenet netted only 20% of expected box office sales last summer. This has resulted in studios to pivot their strategy to either selling their to streaming platforms (Netflix, Amazon, Apple TV+) or releasing it on their own sister platforms (HBO Max, Disney Plus, Peacock). While there have been significant changes to the industry, studios are still committed to theatrical releases and have strategically used this time to prepare for life after the pandemic. Covid-19. One project that can help the industry is improving the strategy around which theaters to release new movies, as some movies perform better in some areas while underperforming in others.
##The Problem Movie sales were projected to increase in revenue prior to the pandemic happening. According to the US census Bureau, the Motion Picture & Sound Recording Industry revenue sales decreased by 14.8% in 2020, however the industry is looking to recover by 2024 with a projected increase 1.3% from 2019 to 2024 (Statista, 2020). According to another study conducted by the Video Trends Report from TiVo, 44% of people in the US and Canada said they were less likely to see a movie, while 66% said they were more or about the same likelihood of watching a movie in theaters (Stoll, 2021). One market that saw a decline in the movie and entertainment market prior to the pandemic, was the UK. According to a journal published by MarketLine, the movies and entertainment industry in the UK fell by 5.5% in revenue from 2014- 2018 (MarketLine, 2019). The main reason for the decline was the increase in streaming services available in the nation. While this represents a decline in the industry, studios won’t stop theater releases in the UK and there will still be a market for theater experience. While the industry is paused, studios have an opportunity to strategize their theatrical releases during this time. With all the data that is available right now, studios can modify their strategy on how they open their films in the theater. Studios are one of the most traditional industries and they are very conservative on their planning strategies. For example, the strategy around trailer releases has not changed in decades where studios continue to release teaser trailers months before the release date and one the date is finalized, they release a new trailer exactly four weeks prior to the release. Currently theater revenues dictate if a film will be released in a specific theater, but it doesn’t expand beyond that and opens an opportunity for growth. One question that can aid the industry is why certain movies over perform, while others underperform.
##The Data
I was able to come across movie theater revenue data for top 100 movies from 2015-2019 in the UK. The data is broken down by film, region, city, theater name, with weekly revenue results. After exploring the data, it needs to be cleaned and used for predictive modeling. The project needs to be a tool used for studios planning and budgeting, so I want the tool to be interactive with predictive modeling happening in the backend.
##Data cleaning First we need to clean the data to make sure we have everything we need.
master_file <- read_excel("C:/Users/munkhnaran.gankhuyag/OneDrive - insidemedia.net/Python/Rentrak Project/master_file_0509.xlsx",
col_types = c("numeric", "text", "numeric",
"text", "text", "numeric", "text",
"text", "text", "text", "date", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric","numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric"))
summary(master_file)
## Year Film TheatreNumber TheatreName
## Min. :2015 Length:57068 Min. : 7550 Length:57068
## 1st Qu.:2016 Class :character 1st Qu.: 8467 Class :character
## Median :2017 Mode :character Median : 14767 Mode :character
## Mean :2017 Mean : 229319
## 3rd Qu.:2018 3rd Qu.: 50273
## Max. :2019 Max. :1092965
##
## City City Population Circuit Region
## Length:57068 Min. : 1409 Length:57068 Length:57068
## Class :character 1st Qu.: 48516 Class :character Class :character
## Mode :character Median : 117763 Mode :character Mode :character
## Mean : 703944
## 3rd Qu.: 318014
## Max. :7556900
## NA's :1919
## Branch State DateofFirstEngagement
## Length:57068 Length:57068 Min. :2015-01-01 00:00:00
## Class :character Class :character 1st Qu.:2016-06-18 00:00:00
## Mode :character Mode :character Median :2017-09-08 00:00:00
## Mean :2017-08-03 12:47:23
## 3rd Qu.:2018-11-14 00:00:00
## Max. :2020-10-19 00:00:00
##
## Weeks Weeks\r\nInRange MarketShare TotalGross
## Min. : 1.000 Min. : 1.000 Min. :2.000e-08 Min. : 1
## 1st Qu.: 6.000 1st Qu.: 5.000 1st Qu.:9.240e-04 1st Qu.: 18990
## Median : 9.000 Median : 8.000 Median :1.890e-03 Median : 40989
## Mean : 9.411 Mean : 8.882 Mean :2.278e-03 Mean : 59970
## 3rd Qu.:12.000 3rd Qu.:12.000 3rd Qu.:3.137e-03 3rd Qu.: 77285
## Max. :33.000 Max. :33.000 Max. :3.651e-02 Max. :1885916
##
## OpeningWeekendGross Opening7DayGross Week0Gross Week1WeekendGross
## Min. : 0 Min. : 0 Min. : 0.000 Min. : 0
## 1st Qu.: 3667 1st Qu.: 6097 1st Qu.: 0.000 1st Qu.: 2650
## Median : 8615 Median : 14378 Median : 0.000 Median : 8341
## Mean : 14324 Mean : 23707 Mean : 8.365 Mean : 13893
## 3rd Qu.: 17730 3rd Qu.: 29256 3rd Qu.: 0.000 3rd Qu.: 17715
## Max. :301698 Max. :609211 Max. :13828.830 Max. :301698
##
## Week1Gross Week2WeekendGross Week2Gross Week3WeekendGross
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 4536 1st Qu.: 2671 1st Qu.: 4491 1st Qu.: 1629
## Median : 13494 Median : 5891 Median : 9729 Median : 3849
## Mean : 21723 Mean : 8738 Mean : 14333 Mean : 5596
## 3rd Qu.: 28083 3rd Qu.: 11270 3rd Qu.: 18360 3rd Qu.: 7423
## Max. :492339 Max. :158272 Max. :314517 Max. :153820
##
## Week3Gross Week4WeekendGross Week4Gross Week5WeekendGross
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 2654 1st Qu.: 808.6 1st Qu.: 1285 1st Qu.: 239
## Median : 6127 Median : 2308.8 Median : 3582 Median : 1356
## Mean : 8890 Mean : 3408.2 Mean : 5466 Mean : 2171
## 3rd Qu.: 11739 3rd Qu.: 4645.8 3rd Qu.: 7242 3rd Qu.: 3008
## Max. :298679 Max. :126346.5 Max. :235305 Max. :101123
##
## Week5Gross Week6WeekendGross Week6Gross Week7WeekendGross
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 395.2 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
## Median : 2133.1 Median : 676.1 Median : 1024 Median : 223.1
## Mean : 3520.2 Mean : 1329.3 Mean : 2145 Mean : 869.0
## 3rd Qu.: 4882.1 3rd Qu.: 1896.0 3rd Qu.: 2923 3rd Qu.: 1195.9
## Max. :177792.0 Max. :72739.7 Max. :133714 Max. :64386.0
##
## Week7Gross Week8WeekendGross Week8Gross Week9WeekendGross
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 335.1 Median : 0.0 Median : 0.0 Median : 0.0
## Mean : 1355.4 Mean : 517.8 Mean : 801.7 Mean : 307.4
## 3rd Qu.: 1745.8 3rd Qu.: 610.5 3rd Qu.: 880.3 3rd Qu.: 285.4
## Max. :112019.8 Max. :60438.4 Max. :83665.4 Max. :32806.6
##
## Week9Gross Week10WeekendGross Week10Gross Adventure
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.:0.0000
## Median : 0.0 Median : 0.0 Median : 0.0 Median :1.0000
## Mean : 489.9 Mean : 208.5 Mean : 316.2 Mean :0.6703
## 3rd Qu.: 466.6 3rd Qu.: 115.2 3rd Qu.: 181.4 3rd Qu.:1.0000
## Max. :62609.3 Max. :21501.2 Max. :32906.2 Max. :1.0000
##
## Action Musical Drama History
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.4493 Mean :0.1635 Mean :0.2803 Mean :0.08411
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
##
## Comedy Mystery Romance Thriller
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.3971 Mean :0.08658 Mean :0.1141 Mean :0.2056
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
##
## Sci Fi War Family Horror
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.4815 Mean :0.03701 Mean :0.3236 Mean :0.02173
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
##
## Superhero Novel Animation Franchise
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.1963 Mean :0.1097 Mean :0.2322 Mean :0.6068
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## Sequel
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4687
## 3rd Qu.:1.0000
## Max. :1.0000
##
When we observe the data, we see that the year we have years range from 2015-2019. 2020 data was excluded, as limited movies were released during that time. In addition to the original data, some time was spent gathering additional data such as genres and city populations. Our primary focus will be total gross, which will be our independent variable in our dataset.
We need to look for ourliers in our dataset. Since there are multiple levels(Film, Theatre), we need to observe both the average gross for the film and theaters to make sure there aren’t any outliers.
film_avg <- master_file %>% group_by(Film) %>% summarise(avg_gross = mean(TotalGross), avg_weeks = mean(Weeks)) %>% arrange(desc(avg_gross))
head(film_avg)
## # A tibble: 6 x 3
## Film avg_gross avg_weeks
## <chr> <dbl> <dbl>
## 1 Force Awakens 239999. 11.6
## 2 Spectre 157770. 10.1
## 3 AVENGERS ENDGAME 153506. 8.67
## 4 STAR WARS THE LAST JEDI 148581. 7.79
## 5 BEAUTY AND THE BEAST 139429. 12.3
## 6 AVENGERS INFINITY WAR 136245. 9.70
theatre_avg <- master_file %>% group_by(TheatreNumber) %>% summarise(avg_gross = mean(TotalGross), avg_weeks = mean(Weeks)) %>% arrange(desc(avg_gross))
head(theatre_avg)
## # A tibble: 6 x 3
## TheatreNumber avg_gross avg_weeks
## <dbl> <dbl> <dbl>
## 1 52140 276850. 13.6
## 2 14600 251588. 11.3
## 3 14568 246481. 4.47
## 4 14382 227757. 12.1
## 5 14289 218552. 13.2
## 6 997733 204193. 13.9
Instantly, we notice that Force Awakens has a significant gross per theater.
Splitting the data in quartiles
"Film Average Gross Sales Summary"
## [1] "Film Average Gross Sales Summary"
summary(film_avg$avg_gross)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22546 37162 47063 61637 78539 239999
print("")
## [1] ""
"Theatre Average Gross Sales Summary"
## [1] "Theatre Average Gross Sales Summary"
summary(theatre_avg$avg_gross)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 28 4122 16003 33052 50026 276850
Some charts to further investigate the data.
ggplot(film_avg, aes(x=avg_weeks, y=avg_gross)) +
geom_point()+labs(title="Average Film Sales vs Average of Weeks in Theatre",
x ="Average Number of Weeks", y = "Average Gross")
ggplot(film_avg) +
aes(x = "", y = avg_gross) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal() + labs(title="Average Film Sales",
x ="Films", y = "Average Gross")
ggplot(theatre_avg, aes(x=avg_weeks, y=avg_gross)) +
geom_point()+labs(title="Average Theatre Sales vs Average of Weeks in Theatre",
x ="Average Number of Weeks", y = "Average Gross")
ggplot(theatre_avg) +
aes(x = "", y = avg_gross) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal() + labs(title="Average Threatre Sales",
x ="Threatres", y = "Average Gross")
Histogram charts.
hist(film_avg$avg_gross, 50, main = "Histogram Avg. Gross Sales by Film", xlab = "Average Sales")
hist(theatre_avg$avg_gross, 50, main = "Histogram Avg. Gross Sales by Theatre", xlab = "Average Sales")
We will use IQR method to remove the outliers in the data.
Q <- quantile(theatre_avg$avg_gross, probs= c(0.25, .75), na.rm = TRUE)
Q2 <- quantile(film_avg$avg_gross, probs= c(0.25, .75), na.rm = TRUE)
iqr <- IQR(theatre_avg$avg_gross)
iqr2 <- IQR(film_avg$avg_gross)
up <- Q[2]+1.5*iqr # Upper Range
low<- Q[1]-1.5*iqr # Lower Range
up2 <- Q2[2]+1.5*iqr2 # Upper Range
low2<- Q2[1]-1.5*iqr2 # Lower Range
We notice the 75% cutoff is 118
'Theatre and Film cutoff'
## [1] "Theatre and Film cutoff"
"Theatre Gross"
## [1] "Theatre Gross"
up
## 75%
## 118882.2
low
## 25%
## -64733.78
"Film Gross"
## [1] "Film Gross"
up2
## 75%
## 140605.5
low2
## 25%
## -24904.13
Eliminate the data outside of cutoff point and identify which rows were removed. We see that 43 Theatres and 4 Films were considered outliers and were removed from our dataset.
#eliminate outliers
eliminated<- subset(theatre_avg, avg_gross > (Q[1] - 1.5*iqr) & theatre_avg$avg_gross < (Q[2]+1.5*iqr))
eliminated2<- subset(film_avg, avg_gross > (Q2[1] - 1.5*iqr2) & film_avg$avg_gross < (Q2[2]+1.5*iqr2))
#which outliers were removed
which(theatre_avg$avg_gross %in% c(boxplot.stats(theatre_avg$avg_gross)$out))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
which(film_avg$avg_gross %in% c(boxplot.stats(film_avg$avg_gross)$out))
## [1] 1 2 3 4
library(outliers)
highest <- grubbs.test(eliminated$avg_gross)
lowest <- grubbs.test(eliminated$avg_gross, opposite = TRUE )
highest2 <- grubbs.test(eliminated2$avg_gross)
lowest2 <- grubbs.test(eliminated2$avg_gross, opposite = TRUE )
highest
##
## Grubbs test for one outlier
##
## data: eliminated$avg_gross
## G = 3.05789, U = 0.98998, p-value = 1
## alternative hypothesis: highest value 118306.926076923 is an outlier
lowest
##
## Grubbs test for one outlier
##
## data: eliminated$avg_gross
## G = 0.90979, U = 0.99911, p-value = 1
## alternative hypothesis: lowest value 28.0000000030941 is an outlier
highest2
##
## Grubbs test for one outlier
##
## data: eliminated2$avg_gross
## G = 2.82005, U = 0.93318, p-value = 0.254
## alternative hypothesis: highest value 139428.994704322 is an outlier
lowest2
##
## Grubbs test for one outlier
##
## data: eliminated2$avg_gross
## G = 1.22238, U = 0.98744, p-value = 1
## alternative hypothesis: lowest value 22546.4670281431 is an outlier
After cleaning our dataset and removing our outliers, we have a clean dataset to work with for our analysis.
exclude_films <- film_avg[1:4,1]
exclude_theatres <- theatre_avg[1:43,1]
x<- subset(master_file, !Film %in% exclude_films[[1]])
master_clean <- subset(x, !TheatreNumber %in% exclude_theatres[[1]])
##Analysis
Now that we have the data we need, we can start our analysis.
master_clean_r <- read_excel("C:/Users/munkhnaran.gankhuyag/OneDrive - insidemedia.net/Python/Rentrak Project/master_clean_r5.xlsx",
col_types = c("numeric", "text", "text",
"text", "text", "numeric", "text",
"text", "text", "text", "date", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "numeric", "numeric",
"numeric", "text", "numeric", "numeric",
"numeric", "numeric"))
summary(master_clean_r)
## Year Film TheatreNumber TheatreName
## Min. :2015 Length:46795 Length:46795 Length:46795
## 1st Qu.:2016 Class :character Class :character Class :character
## Median :2017 Mode :character Mode :character Mode :character
## Mean :2017
## 3rd Qu.:2018
## Max. :2019
## City CityPopulation Circuit Region
## Length:46795 Min. : 1409 Length:46795 Length:46795
## Class :character 1st Qu.: 46870 Class :character Class :character
## Mode :character Median : 102629 Mode :character Mode :character
## Mean : 613024
## 3rd Qu.: 287535
## Max. :7556900
## Branch State DateofFirstEngagement
## Length:46795 Length:46795 Min. :2015-01-01 00:00:00
## Class :character Class :character 1st Qu.:2016-06-23 00:00:00
## Mode :character Mode :character Median :2017-07-21 00:00:00
## Mean :2017-07-29 10:16:57
## 3rd Qu.:2018-11-09 00:00:00
## Max. :2020-09-04 00:00:00
## Weeks Weeks\r\r\r\nInRange MarketShare TotalGross
## Min. : 1.000 Min. : 1.00 Min. :1.765e-05 Min. : 1003
## 1st Qu.: 6.000 1st Qu.: 6.00 1st Qu.:1.028e-03 1st Qu.: 20619
## Median : 9.000 Median : 8.00 Median :1.859e-03 Median : 39131
## Mean : 9.688 Mean : 9.15 Mean :2.059e-03 Mean : 50992
## 3rd Qu.:12.000 3rd Qu.:12.00 3rd Qu.:2.878e-03 3rd Qu.: 67364
## Max. :33.000 Max. :33.00 Max. :1.742e-02 Max. :417447
## OpeningWeekendGross Opening7DayGross Week0Gross Week1WeekendGross
## Min. : 0 Min. : 0 Min. : 0.00 Min. : 0
## 1st Qu.: 4064 1st Qu.: 6714 1st Qu.: 0.00 1st Qu.: 3296
## Median : 8287 Median : 13852 Median : 0.00 Median : 8126
## Mean : 11934 Mean : 19405 Mean : 6.34 Mean : 11539
## 3rd Qu.: 15480 3rd Qu.: 25571 3rd Qu.: 0.00 3rd Qu.: 15512
## Max. :157095 Max. :257244 Max. :13828.83 Max. :153963
## Week1Gross Week2WeekendGross Week2Gross Week3WeekendGross
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 5569 1st Qu.: 2961 1st Qu.: 4931 1st Qu.: 1830
## Median : 13152 Median : 5654 Median : 9329 Median : 3685
## Mean : 18073 Mean : 7487 Mean : 12137 Mean : 4798
## 3rd Qu.: 24571 3rd Qu.: 9936 3rd Qu.: 16165 3rd Qu.: 6543
## Max. :224744 Max. :75661 Max. :119096 Max. :40255
## Week3Gross Week4WeekendGross Week4Gross Week5WeekendGross
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.0
## 1st Qu.: 2953 1st Qu.: 946 1st Qu.: 1501 1st Qu.: 365.8
## Median : 5897 Median : 2217 Median : 3434 Median : 1311.0
## Mean : 7704 Mean : 2963 Mean : 4807 Mean : 1893.2
## 3rd Qu.:10389 3rd Qu.: 4127 3rd Qu.: 6458 3rd Qu.: 2687.9
## Max. :70610 Max. :28730 Max. :53169 Max. :24929.8
## Week5Gross Week6WeekendGross Week6Gross Week7WeekendGross
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 584.1 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
## Median : 2053.0 Median : 661.1 Median : 1002 Median : 230.4
## Mean : 3085.8 Mean : 1144.3 Mean : 1862 Mean : 739.6
## 3rd Qu.: 4377.1 3rd Qu.: 1694.6 3rd Qu.: 2608 3rd Qu.: 1071.2
## Max. :38053.9 Max. :18573.1 Max. :32885 Max. :18245.0
## Week7Gross Week8WeekendGross Week8Gross Week9WeekendGross
## Min. : 0 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median : 339 Median : 0.0 Median : 0.0 Median : 0.0
## Mean : 1156 Mean : 437.5 Mean : 673.5 Mean : 255.1
## 3rd Qu.: 1556 3rd Qu.: 548.5 3rd Qu.: 784.6 3rd Qu.: 260.1
## Max. :23313 Max. :11116.2 Max. :23738.2 Max. :7576.1
## Week9Gross Week10WeekendGross Week10Gross Adventure
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. :0.0000
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.:0.0000
## Median : 0.0 Median : 0.00 Median : 0.0 Median :1.0000
## Mean : 395.8 Mean : 165.88 Mean : 252.4 Mean :0.6561
## 3rd Qu.: 421.3 3rd Qu.: 98.75 3rd Qu.: 155.1 3rd Qu.:1.0000
## Max. :14170.3 Max. :9175.00 Max. :13230.9 Max. :1.0000
## Action Musical Drama History
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.4322 Mean :0.1641 Mean :0.2916 Mean :0.08326
## 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Comedy Mystery Romance Thriller
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.0000
## Mean :0.4125 Mean :0.08377 Mean :0.1204 Mean :0.2025
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.0000
## SciFi War Family Horror
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.4781 Mean :0.03964 Mean :0.3288 Mean :0.02321
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## Superhero Novel Animation Franchise
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0.0000 Median :1.0000
## Mean :0.2006 Mean :0.1172 Mean :0.2346 Mean :0.5935
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Sequel NoTheatres GrossPerWeek PrimaryGenre
## Min. :0.0000 Min. : 1.000 Min. : 107.8 Length:46795
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.: 2573.6 Class :character
## Median :0.0000 Median : 2.000 Median : 4290.7 Mode :character
## Mean :0.4516 Mean : 6.038 Mean : 5331.0
## 3rd Qu.:1.0000 3rd Qu.: 4.000 3rd Qu.: 6905.4
## Max. :1.0000 Max. :57.000 Max. :48592.9
## AvgTheatreGross AvgCityGross MonthofFirstEngagement CitySize
## Min. : 1004 Min. : 1347 Min. : 1.000 Min. : 5000
## 1st Qu.: 29668 1st Qu.: 37796 1st Qu.: 3.000 1st Qu.: 10000
## Median : 49048 Median : 49836 Median : 6.000 Median :100000
## Mean : 50992 Mean : 50992 Mean : 6.247 Mean :144163
## 3rd Qu.: 70987 3rd Qu.: 62607 3rd Qu.: 9.000 3rd Qu.:250000
## Max. :116479 Max. :110064 Max. :12.000 Max. :500000
####Multi Regression Model
model_data <- master_clean_r %>% dplyr::select(Adventure, Action, Musical, Drama, History, Comedy, Mystery, Romance, Thriller, SciFi, War, Family, Horror, Superhero, Novel, Animation, Franchise, Sequel, MonthofFirstEngagement, CitySize,NoTheatres, TotalGross, AvgCityGross,CityPopulation)
##Sample the dataset. The return for this is row nos.
set.seed(1)
row.number <- sample(1:nrow(model_data), 0.8*nrow(model_data))
train = model_data[row.number,]
test = model_data[-row.number,]
dim(train)
## [1] 37436 24
dim(test)
## [1] 9359 24
ggplot(train, aes(TotalGross)) + geom_density(fill="blue")
ggplot(train, aes(log(TotalGross))) + geom_density(fill="blue")
ggplot(train, aes(sqrt(TotalGross))) + geom_density(fill="blue")
#Lets make default model.
model1 = lm(log(TotalGross)~., data=train)
summary(model1)
##
## Call:
## lm(formula = log(TotalGross) ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5818 -0.4048 0.0603 0.5231 2.4725
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.830e+00 2.423e-02 364.375 < 2e-16 ***
## Adventure 1.021e-01 1.237e-02 8.255 < 2e-16 ***
## Action -1.071e-01 1.635e-02 -6.551 5.78e-11 ***
## Musical 1.606e-01 1.406e-02 11.424 < 2e-16 ***
## Drama -9.121e-02 1.553e-02 -5.872 4.35e-09 ***
## History 1.334e-01 2.187e-02 6.100 1.07e-09 ***
## Comedy -5.587e-02 1.127e-02 -4.959 7.10e-07 ***
## Mystery -1.157e-01 1.801e-02 -6.426 1.32e-10 ***
## Romance 2.843e-01 1.723e-02 16.499 < 2e-16 ***
## Thriller 8.115e-02 1.435e-02 5.654 1.58e-08 ***
## SciFi -2.107e-02 1.158e-02 -1.820 0.0688 .
## War 1.440e-01 2.405e-02 5.985 2.18e-09 ***
## Family 3.096e-01 1.687e-02 18.351 < 2e-16 ***
## Horror -3.529e-02 3.348e-02 -1.054 0.2919
## Superhero 3.634e-01 1.543e-02 23.555 < 2e-16 ***
## Novel 1.409e-01 1.630e-02 8.644 < 2e-16 ***
## Animation -1.832e-02 1.804e-02 -1.015 0.3099
## Franchise 2.577e-02 1.204e-02 2.139 0.0324 *
## Sequel 1.653e-01 1.072e-02 15.414 < 2e-16 ***
## MonthofFirstEngagement 1.608e-02 1.351e-03 11.903 < 2e-16 ***
## CitySize 5.308e-08 3.584e-08 1.481 0.1386
## NoTheatres -1.780e-02 2.990e-03 -5.953 2.65e-09 ***
## AvgCityGross 2.409e-05 2.033e-07 118.490 < 2e-16 ***
## CityPopulation 1.236e-07 2.136e-08 5.788 7.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8064 on 37412 degrees of freedom
## Multiple R-squared: 0.3214, Adjusted R-squared: 0.321
## F-statistic: 770.5 on 23 and 37412 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)
Improve the model by removing statistically insignificant variables.
model2 = update(model1, ~.-CitySize-Horror-Animation-SciFi)
summary(model2)
##
## Call:
## lm(formula = log(TotalGross) ~ Adventure + Action + Musical +
## Drama + History + Comedy + Mystery + Romance + Thriller +
## War + Family + Superhero + Novel + Franchise + Sequel + MonthofFirstEngagement +
## NoTheatres + AvgCityGross + CityPopulation, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5862 -0.4057 0.0598 0.5245 2.4731
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.824e+00 2.386e-02 369.916 < 2e-16 ***
## Adventure 9.655e-02 1.207e-02 8.001 1.26e-15 ***
## Action -1.070e-01 1.534e-02 -6.977 3.07e-12 ***
## Musical 1.646e-01 1.382e-02 11.908 < 2e-16 ***
## Drama -8.772e-02 1.530e-02 -5.733 9.97e-09 ***
## History 1.342e-01 2.182e-02 6.151 7.77e-10 ***
## Comedy -5.694e-02 1.081e-02 -5.267 1.39e-07 ***
## Mystery -1.176e-01 1.648e-02 -7.139 9.56e-13 ***
## Romance 2.853e-01 1.667e-02 17.120 < 2e-16 ***
## Thriller 7.545e-02 1.356e-02 5.562 2.68e-08 ***
## War 1.454e-01 2.400e-02 6.059 1.39e-09 ***
## Family 2.965e-01 1.345e-02 22.049 < 2e-16 ***
## Superhero 3.513e-01 1.451e-02 24.216 < 2e-16 ***
## Novel 1.442e-01 1.533e-02 9.407 < 2e-16 ***
## Franchise 2.973e-02 1.134e-02 2.622 0.00875 **
## Sequel 1.635e-01 1.056e-02 15.485 < 2e-16 ***
## MonthofFirstEngagement 1.592e-02 1.330e-03 11.978 < 2e-16 ***
## NoTheatres -1.585e-02 2.691e-03 -5.890 3.90e-09 ***
## AvgCityGross 2.413e-05 1.998e-07 120.771 < 2e-16 ***
## CityPopulation 1.125e-07 2.003e-08 5.615 1.98e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8064 on 37416 degrees of freedom
## Multiple R-squared: 0.3213, Adjusted R-squared: 0.3209
## F-statistic: 932.1 on 19 and 37416 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)
pred1 <- predict(model2, newdata = test)
rmse <- sqrt(sum((exp(pred1) - test$TotalGross)^2)/length(test$TotalGross))
c(RMSE = rmse, R2=summary(model2)$r.squared)
## RMSE R2
## 38383.509900 0.321274
Since we have a cross-sectional design, our R2 of 32% is acceptable for out dataset.
par(mfrow=c(1,1))
plot(test$TotalGross, exp(pred1))
###Multi level Regression Model
library(gridExtra)
library(lme4)
library(knitr)
library(broom)
library(tidyverse)
We’ll take the same data set but include City, Film and TheatreNumber to address the different levels in our data.
model_data2 <- master_clean_r %>% dplyr::select(City, Film, TheatreNumber, Adventure, Action, Musical, Drama, History, Comedy, Mystery, Romance, Thriller, SciFi, War, Family, Horror, Superhero, Novel, Animation, Franchise, Sequel, MonthofFirstEngagement, CitySize,NoTheatres, TotalGross, AvgCityGross,CityPopulation)
While controlling the city as a random variable, we get the following model
model2.1 = lmer(TotalGross ~1 + (1|City), REML = FALSE, data = model_data2)
summary(model2.1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | City)
## Data: model_data2
##
## AIC BIC logLik deviance df.resid
## 1122515.4 1122541.6 -561254.7 1122509.4 46792
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.6109 -0.6108 -0.2196 0.3325 9.3887
##
## Random effects:
## Groups Name Variance Std.Dev.
## City (Intercept) 6.598e+08 25686
## Residual 1.491e+09 38615
## Number of obs: 46795, groups: City, 405
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 40341 1363 29.59
While controlling the Film and Theatre as a random variable, we get the following model
model.a1 <- lmer(TotalGross~ 1 + (1|Film) + (1|TheatreNumber)
, data = model_data2)
summary(model.a1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | Film) + (1 | TheatreNumber)
## Data: model_data2
##
## REML criterion at convergence: 1079736
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.5131 -0.5220 -0.0298 0.4425 11.2152
##
## Random effects:
## Groups Name Variance Std.Dev.
## TheatreNumber (Intercept) 666057910 25808
## Film (Intercept) 677626440 26031
## Residual 573290128 23943
## Number of obs: 46795, groups: TheatreNumber, 692; Film, 121
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 40081 2582 15.53
While controlling the film, theatre and city as a random variable, we get the following model
model.b1 <- lmer(TotalGross~ 1 + (1|Film) + (1|TheatreNumber) +(1|City)
, data = model_data2)
summary(model.b1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: TotalGross ~ 1 + (1 | Film) + (1 | TheatreNumber) + (1 | City)
## Data: model_data2
##
## REML criterion at convergence: 1079731
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -5.5140 -0.5220 -0.0298 0.4425 11.2144
##
## Random effects:
## Groups Name Variance Std.Dev.
## TheatreNumber (Intercept) 643846341 25374
## City (Intercept) 17937618 4235
## Film (Intercept) 677702086 26033
## Residual 573296991 23944
## Number of obs: 46795, groups: TheatreNumber, 692; City, 405; Film, 121
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 40633 2599 15.63
##Conclusion The random effects of lmer models allow us to account for the different levels in the dataset such as city and theatre. We are not looking for specific film prediction, we need to predict performance of a theatre based on specific movie titles. While multiple level modeling can be helpful, its challenging to interpret therefore, when we develop our program, we can use the multiple linear model and account for the different cities based on population and number of movie theaters as it does a good job fitting our model.