Setup

Load Packages

library(ggplot2)
library(dplyr)
library(statsr)
library(grid)
library(gridExtra)
library(knitr)

Load Data

load("movies.Rdata")

Part 1: Data

The raw data provided for this analysis consists of audience and critic review scores for 651 movies released during the years prior to 2016. The data comes from the Rotten Tomatoes and IMDB web sites. In addition to review scores, the data contains several other variables regarding each movie such as genre, running time, MPAA rating, production studio, Oscar nominations, and more.

The raw data is not a complete list of all movies released prior to 2016. Instead, it is a random sample taken from the full data set. No information is available regarding the specific sampling method used. It is assumed for the sake of this analysis that the conclusions drawn are generalizable to the population of all movies and that there is no bias introduced by the sampling method.

A possible non-independent bias may arise with regard to movie sequels (for example, all of the Star Wars episodes) whereby the popularity of a sequel movie may be influenced by that of the prior releases. A 2011 study by metacritic showed that critics and audiences tended to rate sequels and other derivative works higher than the original work. On the other hand, a 2016 chart produced by reddit of ratings for original movies versus their sequls (considering 1990-2016 films only) appears to indicate just the opposite. In light of

  1. these conflicting results,
  2. not knowing how sequels were treated in the data sampling method, and
  3. not having a definitive means of identifying sequels and originals in the raw data set so as to try to identify a bias,

it is assumed that no bias exists for the purpose of this analysis.

Overall, it is assumed that the individual movie observations are independent of each other. The size of the general population of movies produced is difficult to quantify exactly, however the IMDB web site lists over 105,000 feature film titles produced in the U.S. alone. A sample of 651 movies is certainly well below the 10% of the population threshold needed for independence.

As the raw data is a sample taken from existing data, this is an observational study and no causal relationships can be inferred or assumed from the conclusions drawn.


Part 2: Research Question

The research question addressed in this analysis is this: is it possible to predict the popularity of a feature film prior to its release based upon certain characteristic information about it? More specifically, do variables such as movie genre, MPAA rating, run length, etc. work as reasonable predictors of a movie’s popularity?

A good predictive model would be valuable to a theater owner when choosing which movies to show. Popular movies are more likely to draw larger crowds of patrons, leading to higher ticket and concession revenues. Predicting which movies are likely to be popular ahead of their release would be valuable information when deciding which movies to show.


Part 3: Exploratory Data Analysis

Data Codebook

The following is a copy of the codebook provided with the raw data.

  1. title: Title of movie
  2. title_type: Type of movie (Documentary, Feature Film, TV Movie)
  3. genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
  4. runtime: Runtime of movie (in minutes)
  5. mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
  6. studio: Studio that produced the movie
  7. thtr_rel_year: Year the movie is released in theaters
  8. thtr_rel_month: Month the movie is released in theaters
  9. thtr_rel_day: Day of the month the movie is released in theaters
  10. dvd_rel_year: Year the movie is released on DVD
  11. dvd_rel_month: Month the movie is released on DVD
  12. dvd_rel_day: Day of the month the movie is released on DVD
  13. imdb_rating: Rating on IMDB
  14. imdb_num_votes: Number of votes on IMDB
  15. critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
  16. critics_score: Critics score on Rotten Tomatoes
  17. audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
  18. audience_score: Audience score on Rotten Tomatoes
  19. best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
  20. best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
  21. best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given movie
  22. best_actress win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the actresses won an Oscar for their role in the given movie
  23. best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the director won an Oscar for the given movie
  24. top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
  25. director: Director of the movie
  26. actor1: First main actor/actress in the abridged cast of the movie
  27. actor2: Second main actor/actress in the abridged cast of the movie
  28. actor3: Third main actor/actress in the abridged cast of the movie
  29. actor4: Fourth main actor/actress in the abridged cast of the movie
  30. actor5: Fifth main actor/actress in the abridged cast of the movie
  31. imdb_url: Link to IMDB page for the movie
  32. rt_url: Link to Rotten Tomatoes page for the movie

Data Characteristics

As noted before, there are 651 movies represented in the raw data. The following charts show a breakdown of the type of movies included in the sample.

# Create histograms of some of the key movie characteristic data.
p1 <- ggplot(data=movies, aes(x=genre)) + 
      geom_bar(fill="blue") + 
      xlab("Movie Genre") +
      theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))
p2 <- ggplot(data=movies, aes(x=title_type)) + 
      geom_bar(fill="blue") + 
      xlab("Movie Type") +
      theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))
p3 <- ggplot(data=movies, aes(x=mpaa_rating)) + 
      geom_bar(fill="blue") + 
      xlab("Movie MPAA Rating") +
      theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0))
p4 <- ggplot(data=movies, aes(x=runtime)) + 
      geom_histogram(binwidth=10, fill="blue") +
      xlab("Movie Runtime")
grid.arrange(p2, p3, p1, p4, nrow=2,
             top="Movie Characteristics")

There are 60 movies in the raw data that are type “Documentary” or “TV Movie”. These will be removed from the analysis as they would not likely be shown in a typical movie theater. Also, there are 52 movies with MPAA ratings of NC-17 or are unrated. These, as well, would not likely be shown in a typical movie theater and will be excluded from the analysis.

# Clean the raw data to remove the movies to be excluded from the analysis.
pmovies <- movies %>% filter(title_type=="Feature Film") %>%
           filter(!(mpaa_rating %in% c("NC-17", "Unrated")))

After removing these data, 573 movies remain in the processed data set for analysis.


Part 4: Modeling

Model Development

The target response variable for the prediction model is a movie rating score, but with three to choose from, which one should be used? Two of the ratings come from the Rotten Tomatoes web site: one is an average of reviews by movie critics and the other is an average of reviews from the public (a.k.a., audience). The third rating is an average of reviews on the IMDB web site (no distinction made between critics and audience reviews).

One would expect to see a correlation between the different rating scores. The following plots show that to be the case.

# Helper function for adding correlation coeficient values to a pairwise plot
# (taken from pairs() help page).
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    usr <- par("usr"); on.exit(par(usr))
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}

# Create pairwise plots of the movie rating scores to test for colinearity.
# Using the helper function above, the linear correlation R value is included
# on the chart.
pairs(~ imdb_rating + critics_score + audience_score, 
      data=pmovies, lower.panel=panel.smooth, upper.panel=panel.cor)

Given these correlations, only one of the ratings will be used as the response variable.

Histograms of the data for the three ratings are shown below.

# Create histograms of the movie rating scores.
p1 <- ggplot(data=pmovies, aes(x=imdb_rating)) + 
      geom_histogram(binwidth=0.5, fill="blue") +
      xlab("IMDB Scores")
p2 <- ggplot(data=pmovies, aes(x=critics_score)) + 
      geom_histogram(binwidth=5, fill="blue") +
      xlab("Critics Scores")
p3 <- ggplot(data=pmovies, aes(x=audience_score)) + 
      geom_histogram(binwidth=5, fill="blue") +
      xlab("Audience Scores")
grid.arrange(p1, p2, p3, nrow=1,
             top="Distribution of Rating Scores")

Contrary to the two Rotten Tomatoe scores, the IMDB scores show a nice, mostly normal distribution centered around a mean of 6.37 with somewhat of a left-side skew. Given its distribution and the fact that it has the highest pairwise correlation with the other scores, the IMDB rating (imdb_rating) was the chosen response variable.

Since the goal is to predict the popularity of a movie prior to its release, the prediction model uses only variables from the data set that could be known ahead of time. Thus, variables such as DVD release date, number of IMDB votes, best picture nomination/win, etc. were not chosen to be in the model. Variables with large domains, such as studio name, actor/director names, URLs, etc. were excluded as well.

The first pass model was a multi-variable linear regression model using all relevant variables as predictors to derive a predicted IMDB rating score. Specifically, the initial set of predictors was:

  1. genre
  2. runtime
  3. mpaa_rating
  4. thtr_rel_month
  5. best_actor_win
  6. best_actress win
  7. best_dir_win

Theater release month was included assuming that movies released at certain times of the year may be more popular than others. Release year was discarded as being irrelevant (without time travel capabilities, no future movie will be released in a year that has already passed) and release day was thought to be too granular to be a worthwhile predictor.

The initial model fitting results are summarized below.

# Fit a linear regression model using all candidate predictor variables.
fullMod <- lm(imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_month + 
              best_actor_win + best_actress_win + best_dir_win, data=pmovies)
summary(fullMod)
## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + thtr_rel_month + 
##     best_actor_win + best_actress_win + best_dir_win, data = pmovies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8670 -0.5307  0.0430  0.6227  1.9998 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.033861   0.376917  13.355  < 2e-16 ***
## genreAnimation                 -0.306365   0.373898  -0.819 0.412921    
## genreArt House & International  0.386649   0.321918   1.201 0.230233    
## genreComedy                    -0.071803   0.158142  -0.454 0.649977    
## genreDocumentary                0.790707   0.675727   1.170 0.242441    
## genreDrama                      0.593064   0.134232   4.418 1.20e-05 ***
## genreHorror                    -0.113862   0.241935  -0.471 0.638089    
## genreMusical & Performing Arts  0.922245   0.353553   2.609 0.009339 ** 
## genreMystery & Suspense         0.384125   0.175588   2.188 0.029113 *  
## genreOther                      0.736434   0.279093   2.639 0.008558 ** 
## genreScience Fiction & Fantasy -0.266850   0.333834  -0.799 0.424431    
## runtime                         0.015921   0.002688   5.923 5.56e-09 ***
## mpaa_ratingPG                  -0.790634   0.284121  -2.783 0.005574 ** 
## mpaa_ratingPG-13               -1.057501   0.289001  -3.659 0.000277 ***
## mpaa_ratingR                   -0.715151   0.282431  -2.532 0.011613 *  
## thtr_rel_month                  0.006145   0.011526   0.533 0.594129    
## best_actor_winyes              -0.044683   0.114659  -0.390 0.696908    
## best_actress_winyes             0.065389   0.124816   0.524 0.600569    
## best_dir_winyes                 0.358104   0.155371   2.305 0.021545 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9365 on 554 degrees of freedom
## Multiple R-squared:  0.248,  Adjusted R-squared:  0.2236 
## F-statistic: 10.15 on 18 and 554 DF,  p-value: < 2.2e-16
anova(fullMod)
## Analysis of Variance Table
## 
## Response: imdb_rating
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## genre             10  95.24   9.524 10.8585 < 2.2e-16 ***
## runtime            1  41.21  41.211 46.9851 1.913e-11 ***
## mpaa_rating        3  18.54   6.181  7.0468 0.0001178 ***
## thtr_rel_month     1   0.25   0.252  0.2869 0.5924429    
## best_actor_win     1   0.10   0.096  0.1093 0.7410594    
## best_actress_win   1   0.27   0.272  0.3101 0.5778658    
## best_dir_win       1   4.66   4.659  5.3122 0.0215453 *  
## Residuals        554 485.92   0.877                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With an adjusted R-square value of 0.2236, there is plenty of room for improvement in the model.

The next procedure was to one-by-one remove insignificant variables based on p values, eliminating the variable with the highest p value each time, until all remaining variables were significant. P value was used for the elimination criteria because the goal was to create a model with the highest predictive value using only variables with demonstrable sigificance. It seems like a possible case of overfitting to create a model with an artificially inflated R-squared value due to incuding statistically insignificant predictors.

The result of pruning the predictors is a model using only the variables for genre, runtime, MPAA rating, and whether the director ever won an Oscar as predictors. The model results are summarized below.

# Fit the final parsimonious model.
finalMod <- lm(imdb_rating ~ genre + runtime + mpaa_rating + 
               best_dir_win, data=pmovies)
summary(finalMod)
## 
## Call:
## lm(formula = imdb_rating ~ genre + runtime + mpaa_rating + best_dir_win, 
##     data = pmovies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8689 -0.5436  0.0404  0.6056  2.0432 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.035253   0.368299  13.672  < 2e-16 ***
## genreAnimation                 -0.290502   0.371977  -0.781 0.435153    
## genreArt House & International  0.392689   0.320509   1.225 0.221017    
## genreComedy                    -0.059992   0.156688  -0.383 0.701958    
## genreDocumentary                0.779294   0.673510   1.157 0.247742    
## genreDrama                      0.599714   0.132672   4.520 7.55e-06 ***
## genreHorror                    -0.107019   0.241248  -0.444 0.657499    
## genreMusical & Performing Arts  0.924963   0.352691   2.623 0.008965 ** 
## genreMystery & Suspense         0.382923   0.173083   2.212 0.027347 *  
## genreOther                      0.731349   0.277479   2.636 0.008631 ** 
## genreScience Fiction & Fantasy -0.263124   0.332956  -0.790 0.429709    
## runtime                         0.016264   0.002488   6.537 1.42e-10 ***
## mpaa_ratingPG                  -0.790039   0.283350  -2.788 0.005481 ** 
## mpaa_ratingPG-13               -1.060774   0.288355  -3.679 0.000257 ***
## mpaa_ratingR                   -0.715358   0.281770  -2.539 0.011394 *  
## best_dir_winyes                 0.359052   0.155015   2.316 0.020907 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9346 on 557 degrees of freedom
## Multiple R-squared:  0.2471, Adjusted R-squared:  0.2268 
## F-statistic: 12.19 on 15 and 557 DF,  p-value: < 2.2e-16
anova(finalMod)
## Analysis of Variance Table
## 
## Response: imdb_rating
##               Df Sum Sq Mean Sq F value    Pr(>F)    
## genre         10  95.24   9.524 10.9040 < 2.2e-16 ***
## runtime        1  41.21  41.211 47.1820 1.735e-11 ***
## mpaa_rating    3  18.54   6.181  7.0763 0.0001129 ***
## best_dir_win   1   4.69   4.686  5.3650 0.0209069 *  
## Residuals    557 486.51   0.873                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that the adjusted R-square value (0.2268) is virtually the same as the initial model, but all predictor variables in the final model are statistically significant. With a virtually zero overall p value, we would reject a null hpothesis that all of the model coefficients are actually zero in favor of an alternative hypothesis that at least one of the coefficients is not.

The coefficients of the model tell us a number of things. First, the genre variable is a mixed bag. The coefficients for only six of the eleven genre categories show statistical significance. Of those, it appears that movies in the “Musicals & Performing Arts” genre are rated almost a full point higher than the base genre, which is “Action & Adventure” movies (all other predictors held constant).

The MPAA rating predictor is similiar, although this time all the rating categories are significant and all have a depressive affect relative to the base “G” rating; up to more than a full point reduction for a PG-13 rated movie (all other predictors held constant).

Movie run time appears to have a positive effect on movie rating. There is probably an upper limit to this that was not tested in this analysis. Would a five hour movie really be rated four points higher than a one hour movie? The relationship holds at least over the range of movie runtimes in the analysis data (68 - 202 minutes).

Finally and surprisingly, the model raises the predicted movie rating if the director has ever won an Oscar (all other predictors held constant) whereas the variables for the same being true of the lead actor or actress were removed from the model as being statistically insignificant.

Model Diagnostics

# Supplement the model data to make it easier to produce the diagnostic plots.
pMod <- fortify(finalMod)

# Create residuals scatter plot.
p1 <- ggplot(pMod, aes(x=.fitted, y=.resid))+geom_point() +
      geom_smooth(se=FALSE)+geom_hline(yintercept=0, col="red", linetype="dashed") +
      xlab("Fitted Values")+ylab("Residuals") +
      ggtitle("Residual vs Fitted Plot")

# The following is a bunch of extra code to get around ggplot not being able
# to automatically draw a normal distribution line on a QQ plot.
# This code comes from a blog post at http://mgimond.github.io/ES218/Week06a.html
pMod$.qqnorm <- qqnorm(pMod$.stdresid, plot.it=FALSE)$x  
y <- quantile(pMod$.stdresid, c(0.25, 0.75)) # Find the 1st and 3rd quartiles
x <- quantile(pMod$.qqnorm, c(0.25, 0.75))   # Find the 1st and 3rd quartiles
slope <- diff(y) / diff(x)             # Compute the line slope
int <- y[1] - slope * x[1]             # Compute the line intercept

# Create residuals QQ plot.
p2 <- ggplot(pMod, aes(.qqnorm, .stdresid)) +
      geom_point(na.rm = TRUE) +
      geom_abline(intercept=int, slope=slope, color="red") +
      xlab("Theoretical Quantiles")+ylab("Standardized Residuals") +
      ggtitle("Normal Q-Q Plot")

# Create residuals histogram plot.
p3 <- ggplot(data=pMod, aes(x=.resid)) + 
      geom_histogram(binwidth=0.5, fill="blue") +
      xlab("Residuals") +
      ggtitle("Distribution of Residuals")

grid.arrange(p1, p3, p2, nrow=1, top="Model Diagnostic Plots")

The model diagnostic plots above show that the model is passable. There is good scatter of the residuals around zero for the range of fitted values (the mean value of the residuals is, in fact, zero). The residuals Q-Q plot and distribution histogram show a pretty normal distribution, and one that mimics the left-hand skew of the original rating scores.

Overall, the evidence points toward the final model being valid.


Part 5: Prediction

The predictive capability of the model was tested using two movies released early in 2016. The movies were Dirty Grandpa and Deadpool. Early release dates were chosen to give time for the movies to have been in the theaters for a while and for the ratings to mature. The results are below. The information for these movies was obtained from the IMDB web site in order to be consistent with the analysis data (links: Dirty Grandpa, Deadpool).

# Use the final model to generate rating predictions for Dirty Grandpa released
# in January 2016 and for Deadpool released in February 2016.
dataDG <- data.frame(genre="Comedy", runtime=102, mpaa_rating="R", best_dir_win="no")
predDG <- predict(finalMod, dataDG, interval="predict")

dataDead <- data.frame(genre="Action & Adventure", runtime=108, mpaa_rating="R", best_dir_win="no")
predDead <- predict(finalMod, dataDead, interval="predict")

# Show prediction results.
df <- data.frame(t=c("Dirty Grandpa", "Deadpool"),
                 p=c(sprintf("%2.1f", predDG[1]), 
                     sprintf("%2.1f", predDead[1])),
                 i=c(sprintf("%2.1f - %2.1f", predDG[2], predDG[3]), 
                     sprintf("%2.1f - %2.1f", predDead[2], predDead[3])),
                 r=c("6.0", "8.1"))
kable(df, col.names=c("Movie Title", "Predicted Rating", "95% Prediction Interval", "Actual Rating"))
Movie Title Predicted Rating 95% Prediction Interval Actual Rating
Dirty Grandpa 5.9 4.1 - 7.8 6.0
Deadpool 6.1 4.2 - 7.9 8.1

As can be seen, the model was very close in predicting the rating for Dirty Grandpa, but significantly off in its prediction for Deadpool; the real rating for which is even outside of the 95% confidence prediction interval (the interval around the predicted rating score within which we are 95% confident the real movie score would fall).

Note that the 95% confidence prediction intervals are very wide. This is a reflection of the poor predictive capability of the model (further evidenced by its F-statistic and adjusted R-square values).


Part 6: Conclusion

In answer to the research question stated in Part 2, yes, it is possible to predict the popularity of a movie based upon basic movie characteristic data. In this analysis, a valid, parsimonious, multi-variable, linear regression model was created that proved to have some capability for predicting movie popularity as indicated by IMDB movie rating score.

But, there is much room for improvement. As shown in Part 5, the predictive power of the model is limited. Further analysis could pursue the following suggestions for improving the model.

  1. Start with a larger analysis sample to capture more variability in the population data.
  2. Use a stratified sample reflecting the true proportion of movie genres in the population rather than a simple random sample.
  3. Create separate models for each movie genre.
  4. Identify other movie characteristic data to add to the model; identificaton of sequels and their ratings or searching for keywords in the movie title or description, for example.