Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)
library(grid)
library(gridExtra)

Load data

load("C:/Users/danie/OneDrive/Desktop/movies.RData")

Part 1: Data

The dataset movies is a collection of data on movies produced and released before 2016, it was sampled randomly from Rotten Tomatoes and IMDB. movies composes 651 rows and 32 columns. Because of its decent sample size, the results can be generalized to other movies that are also produced and released before 2016.
The data is observational, hence we can’t conclude any casualty from it. In terms bias, there’s no bias, but there exists some potential bias that may be caused by the missing values. For example:

sum(is.na(movies$studio))

## [1] 8

Next, select some variables that may affect the popularity of a movie and then we’ll clean the data so that we can use them for further analysis.

str(movies)

## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

From the list above, I will not choose variables like thtr_rel_year etc, actor1 etc, best_pic_nom etc, top200_box, imdb_url and rt_url since I don’t consider them as important factors to our analysis.

Clean data with selected variables:

movies_select <- na.omit(movies) %>% 
  select(title_type, genre, runtime, mpaa_rating, imdb_rating, critics_score, critics_rating, audience_rating, audience_score)

head(movies_select)

## # A tibble: 6 x 9
##   title_type   genre  runtime mpaa_rating imdb_rating critics_score critics_rating
##   <fct>        <fct>    <dbl> <fct>             <dbl>         <dbl> <fct>         
## 1 Feature Film Drama       80 R                   5.5            45 Rotten        
## 2 Feature Film Drama      101 PG-13               7.3            96 Certified Fre~
## 3 Feature Film Comedy      84 R                   7.6            91 Certified Fre~
## 4 Feature Film Drama      139 PG                  7.2            80 Certified Fre~
## 5 Feature Film Horror      90 R                   5.1            33 Rotten        
## 6 Feature Film Drama      142 PG-13               7.2            57 Rotten        
## # ... with 2 more variables: audience_rating <fct>, audience_score <dbl>

Part 2: Research question

What does it take to make popular movies? a.k.a. what affects the audience_score.

Part 3: Exploratory data analysis

boxplot(movies_select$audience_score)

hist(movies_select$audience_score)

We can see that the audience_score is left skewed with a median around 65.

Discover relationships between audience_score and runtime:

ggplot(movies_select, aes(x = runtime, y = audience_score))+
  geom_jitter()+
  geom_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

The relationship between audience_score and runtime is linear, but it needs more investigations later.

audience_score and critics_score:

ggplot(movies_select, aes(x = critics_score, y = audience_score))+
  geom_jitter()+
  geom_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

The relationship between audience_score and title_type is obvious linear.

imdb_rating and audience_score:

ggplot(movies_select, aes(x = imdb_rating, y = audience_score))+
  geom_jitter()+
  geom_smooth(method = 'lm', se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

mpaa_rating and audience_score:

ggplot(movies_select, aes(x = mpaa_rating, y = audience_score))+
  geom_jitter()

We can see that the audience score is mostly concentrated on rated R movies.
Let’s consider all the rest categorical variables together:

c1<- ggplot(movies_select, aes(x = title_type)) + geom_bar() + coord_flip() + ggtitle('Movies by Title Type')
c2<- ggplot(movies_select, aes(x = genre))+geom_bar()+labs(title = 'Movies by Genre') + coord_flip()
c3<- ggplot(movies_select, aes(x = critics_rating)) + geom_bar() + ggtitle('Movies by Critics Rating')+ coord_flip()
c4<- ggplot(movies_select, aes(x = audience_rating)) + geom_bar() + ggtitle('Movies by Audience Rating') + coord_flip()
grid.arrange(c1, c2, c3, c4, ncol = 2)

It’s clear that the spread of critics rating and audience rating are reasonable, while the spread of title type and genre are highly concentrated in one category respectively.

Let’s check the correlations between the following numerical variables: runtime, imdb_rating and critics_score

ggpairs(movies_select, columns = c(3, 5, 6))

We can see that critics_score and imdb_rating are highly correlated with a 0.76 correlation coefficient, hence these two are collinear, which means adding more than one of these variables to the model would not add much value to the model. Here I decide to drop critics_score.

movies_select %>% 
  summarise(cor(runtime, audience_score), cor(imdb_rating, audience_score))

## # A tibble: 1 x 2
##   `cor(runtime, audience_score)` `cor(imdb_rating, audience_score)`
##                            <dbl>                              <dbl>
## 1                          0.203                              0.861

runtime has a low correlation coefficient, therefore I will drop it as well.

Part 4: Modeling

Search for a model:

m1 <- lm(audience_score ~ title_type + genre + mpaa_rating + critics_rating + audience_rating +imdb_rating, data = movies_select)
summary(m1)$adj.r.squared

## [1] 0.8832178

This model has a high correlation coefficient, hence we will continue using this model for the next step.

Note that this \(R^{2}_{adj}\) is larger than the one from the model with runtime (0.8832), this again confirmed that runtime does not contribute to our predicting model.

Search for the best model possible via backward stepwise elimination:

step(m1, direction = 'backward', trace = FALSE)

## 
## Call:
## lm(formula = audience_score ~ genre + audience_rating + imdb_rating, 
##     data = movies_select)
## 
## Coefficients:
##                    (Intercept)                  genreAnimation  
##                       -12.2663                          5.5041  
## genreArt House & International                     genreComedy  
##                        -2.2434                          1.6229  
##               genreDocumentary                      genreDrama  
##                         1.3614                         -0.6963  
##                    genreHorror  genreMusical & Performing Arts  
##                        -1.3892                          2.7493  
##        genreMystery & Suspense                      genreOther  
##                        -3.2959                          0.7362  
## genreScience Fiction & Fantasy          audience_ratingUpright  
##                         0.5883                         20.5042  
##                    imdb_rating  
##                         9.7107

Final model (Parsimonious Model):

m_final <- lm(audience_score ~ genre + audience_rating + imdb_rating, data = movies_select)

summary(m_final)$adj.r.squared

## [1] 0.8833851

Final model confirmed based on the high coefficient above.

Interpretation of the model:

Intercept: -12.2663, is the estimated audience score considering all other variables being zero.
Genre: i.e. comedy with coefficient 1.6229 meaning Comedy gets on average an audience score 1.6229 higher than other genres if the rest variables are controlled. (Same for other genres)
Audience rating (Upright) with coefficient 20.5042 meaning the audience rating of Upright is on average 20.5042 higher than that of Spilled if all other variables hold constant.
Imdb rating with coefficient 9.7107 meaning with one unit increase of the imdb rating, the model predicts a 9.7107 increase in audience score.

Model disgnostics

Check linearity

ggplot(m_final, aes(x = .fitted, y = .resid))+
  geom_point()+
  geom_hline(yintercept = 0, linetype = 'dashed')+
  xlab('Fitted Values')+
  ylab('Residuals')

No fan shape in the residual plot, hence the residuals have constant variance.

plot(m_final$residuals)

The residuals are randomly scattered around 0, therefore the model has independent residuals.

Check nearly normality

hist(m_final$residuals)

The distribution of residuals is nearly normal.

or a normal probability plot of residuals:

qqnorm(m_final$residuals)
qqline(m_final$residuals, col = 5)

The condition of normality is met confirmed.

Part 5: Prediction (Jurassic Park)

test_movie <- data.frame(genre = 'Action & Adventure', imdb_rating = 8.1, audience_rating = 'Upright')
predict(m_final, test_movie)

##        1 
## 86.89458

We can also construct a prediction interval around this prediction, which will provide a measure of uncertainty around the prediction.

predict(m_final, test_movie, interval = 'prediction', level = 0.95)

##        fit      lwr      upr
## 1 86.89458 73.20089 100.5883

Hence, the model predicts, with 95% confidence, that a movie with genre “Action & Adventure”, 8.1 imdb rating and an “Upright” audience rating is expected to have an evaluation score between 73.20 and 100.59.

Part 6: Conclusion

Our model m_final can be used to predict the popularity of movies produced and released before 2016 with just three variables: genre, imdb_rating and audience_rating.
However, we have to admit that this model has flaws and the predicting power is limited because of the data we used is not large enough. To increase its predicting power, we will need much more data and a wider variety of factors.