Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(leaps)
library(grid)
library(gridExtra)

Load data

load("movies.Rdata")

Part 1: Data

The data were derived from an observational, not experimental, study which precludes a generalization to the population from which it was obtained. Note that the sample size is very small, comparatively, which also does not allow us to draw conclusions about the population of movie watchers as a whole. Nevertheless, the sample was obtained randomly so we can, cautiously, generalize to the population as a whole but we cannot infer causality.

Part 2: Research question

The research question considered here is whether a combination of variables, e.g., genre can be used to predict the audience score given to a particular movie. This would of particular interest to companies producing movies and movie websites such as Netflix which can focus their offerings on movies that the audience is likely to enjoy and, as a corollary, spend their money on.

Part 3: Exploratory data analysis

A boxplot of the audience score as a function of the genre shows some interesting patterns of the data.

ggplot(movies, aes(x=factor(genre), y=audience_score)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Several genres such as Animation, Documentary, and Musical & Performing Arts seem to have a relatively high audience score with “little” variation while the audience scores for other genres such as Action and Adventure, Drama, and Science Fiction & Fantasy are lower on average with a much greater range of scores. The lower variation for the genres Animation and Science Fiction $ Fantasy could be due to the relatively low number of movies within these genres.

A plot of either the theater release date or the DVD release date as the predictor variable and audience score as the response variable did not show any obvious pattern for any particular genre. The Figure shown below displays the audience score as a function of the theater release year.

ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
  geom_point(aes(colour=factor(genre))) +
  xlab("Year of Release (theater)") +
  ylab("Audience Score")

Although it might be possible to combine the actor’s names of a particular movie into a single numerical or categorical variable it would not aid the interpretability of the regression model. Moreover, given the small sample size, it is a distinct possibility that most actors’ names may not be meaningful predictors. The following code block removes some variables and lists those selected for inclusion into the regression model.

df <- movies[ -c(1, 6:12, 14, 25:32) ]
df <- na.omit(df)
#Variables included for modeling
names(df)

##  [1] "title_type"       "genre"            "runtime"         
##  [4] "mpaa_rating"      "imdb_rating"      "critics_rating"  
##  [7] "critics_score"    "audience_rating"  "audience_score"  
## [10] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [13] "best_actress_win" "best_dir_win"     "top200_box"

Part 4: Modeling

The modeling approach selected here will use adjusted R^2 in order to obtain more reliable predictions on movies not present in the original dataset. Using the leaps package, we performed a backwards regression subset selection without limiting the number of predictors (nvmax=NULL). Subsequently, a graphic table of the best subsets, based on adjusted R-squared is displayed. Note that it is possible to select more than one “best” model but this makes the subsequent graphic table almost incomprehensible.

regsubsets.out <- regsubsets(audience_score ~., data=df, nbest=1,
                    nvmax=NULL, method="backward")
plot(regsubsets.out, scale = "adjr2", main = "Adjusted R^2")

The resulting model has the following predictors:

genre
runtime
mpaa_rating
imdb rating
critics rating
audience rating
best pic nom
best actress win

The actual coefficients of the model are:

model <- lm(audience_score ~ genre + runtime + mpaa_rating + 
              imdb_rating + critics_rating + audience_rating + best_pic_nom + 
              best_actress_win, data=df)
summary(model)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     imdb_rating + critics_rating + audience_rating + best_pic_nom + 
##     best_actress_win, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.4399  -4.3327   0.5186   4.3101  24.7995 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -7.17284    3.45462  -2.076   0.0383 *  
## genreAnimation                  2.61630    2.68840   0.973   0.3308    
## genreArt House & International -2.54056    2.07513  -1.224   0.2213    
## genreComedy                     1.49709    1.14511   1.307   0.1916    
## genreDocumentary                0.15730    1.57397   0.100   0.9204    
## genreDrama                     -0.53785    0.99627  -0.540   0.5895    
## genreHorror                    -1.70849    1.71366  -0.997   0.3192    
## genreMusical & Performing Arts  2.78928    2.20407   1.266   0.2062    
## genreMystery & Suspense        -2.81156    1.28981  -2.180   0.0296 *  
## genreOther                     -0.08428    1.94461  -0.043   0.9654    
## genreScience Fiction & Fantasy -0.20950    2.44927  -0.086   0.9319    
## runtime                        -0.02367    0.01601  -1.479   0.1398    
## mpaa_ratingNC-17               -0.61549    5.19552  -0.118   0.9057    
## mpaa_ratingPG                  -0.10462    1.88935  -0.055   0.9559    
## mpaa_ratingPG-13               -0.89298    1.94049  -0.460   0.6455    
## mpaa_ratingR                   -1.13266    1.87162  -0.605   0.5453    
## mpaa_ratingUnrated             -0.52847    2.14424  -0.246   0.8054    
## imdb_rating                     9.58879    0.41679  23.007   <2e-16 ***
## critics_ratingFresh            -0.10915    0.79970  -0.136   0.8915    
## critics_ratingRotten           -1.17379    0.91116  -1.288   0.1981    
## audience_ratingUpright         20.00528    0.78593  25.454   <2e-16 ***
## best_pic_nomyes                 3.44937    1.62148   2.127   0.0338 *  
## best_actress_winyes            -1.31640    0.90172  -1.460   0.1448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.842 on 627 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8857 
## F-statistic: 229.5 on 22 and 627 DF,  p-value: < 2.2e-16

In order for the multiple regression model to be valid it will be necessary that (i) there is a linear relationship between any numerical predictor variables (runtime and imdb rating) and the response variable (audience_score), (ii) the residuals are nearly normally distributed, (iii) residuals display constant variability, and (iv) the residuals are independent.

First, we’ll examine whether the two numerical variables included in the model, namely runtime and imdb_rating, are linearly related to the response variable, audience score by examining the distribution of the residuals.

g_runtime <- ggplot(data=NULL, aes(x=df$runtime, y=model$residuals)) +
  geom_point()
g_imdb_rating <- ggplot(data=NULL, aes(x=df$imdb_rating, y=model$residuals)) +
  geom_point()
grid.arrange(g_runtime, g_imdb_rating)

For both runtime and imdb rating it would indeed appear that residuals are scattered randomly around 0.

Next, we need to check whether the residuals display a nearly normal distribution centered around 0.

par(mfrow = c(1, 2))
hist(model$residuals)
qqnorm(model$residuals)
qqline(model$residuals)

The resuls of the histogram plot suggest that the residuals are indeed distributed normally around 0 while the Q-Q plot indicates some skewness in the tails but there are no huge deviations.

The next two plots display the (absolute) values of the model’s residuals as a function of the model’s fitted values.

par(mfrow = c(1, 2))
plot(model$fitted.values, model$residuals)
plot(model$fitted.values, abs(model$residuals))

The results show that the residuals are equally variable for low and high values of the predicted values, i.e., residuals have a constant variability.

Finally, we need to check whether the residuals are independent which would indicate that our observations are independent.

par(mfrow=c(1,1))
plot(model$residuals)

The results show that there does not appear to be any structure to the residual values. In addition, the residuals do not show any pattern when plotted as a function of the theater release data (not shown).

With regards to inference for the model, the P-value of the model’s F-statistic indicates that the model as a whole is significant. It should be noted that not all explanatory variables have a significant P-value as the model was developed using highest adjusted R-squared as a determinant. Nevertheless, as an example, the coefficient for imdb rating shows that for each unit increase in the imdb rating value, the audience score is increased by approximately 9.6%.

One particular reason we might prefer to look at an ANOVA table is that it allows the use information about the possible associations between the independent variables and the dependent variable that gets thrown away by the t-tests in the summary output. The results of the ANOVA test are shown below.

anova(model)

## Analysis of Variance Table
## 
## Response: audience_score
##                   Df Sum Sq Mean Sq   F value    Pr(>F)    
## genre             10  51633    5163  110.2867 < 2.2e-16 ***
## runtime            1   6236    6236  133.1918 < 2.2e-16 ***
## mpaa_rating        5   5480    1096   23.4084 < 2.2e-16 ***
## imdb_rating        1 140669  140669 3004.6451 < 2.2e-16 ***
## critics_rating     2   1568     784   16.7415 8.256e-08 ***
## audience_rating    1  30510   30510  651.6720 < 2.2e-16 ***
## best_pic_nom       1    178     178    3.8079   0.05146 .  
## best_actress_win   1    100     100    2.1313   0.14482    
## Residuals        627  29354      47                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results show that all independent variables except “best pic nom” and “best actress win” are indeed considered significant predictors based on their P-values.

Part 5: Prediction

Finally, the model is used to predict the audience score for the movie “The Revenant” which was released in December 2015 in order for the values for “best pic nom” and “best actress win” to be included. The values for the predictor values were obtained from the sources mentioned in the codebook.

movie1 <- data.frame(title_type="Feature Film",
                     genre="Action & Adventure",
                     runtime=156,
                     mpaa_rating="R",
                     imdb_rating=8.1,
                     critics_rating="Certified Fresh",
                     critics_score=82,
                     audience_rating="Upright",
                     audience_score=85,
                     best_pic_nom="yes",
                     best_pic_win="yes",
                     best_actor_win="yes",
                     best_actress_win="no",
                     best_dir_win="yes",
                     top200_box="yes")
prediction_Revenant <- predict(model, newdata=movie1, interval="confidence")
prediction_Revenant

##        fit      lwr      upr
## 1 89.12618 85.44594 92.80641

The value obtained, 89.1, is fairly close to the actual audience score of 85 and, based on the confidence interval, we can be 95% confident that the actual audience score for this particular movie has a lower bound of approximately 85.4 and a higher bound of approximately 92.8.

Part 6: Conclusion

In conclusion, the predictive model presented here may be used to predict audience scores for a movie. It should be noted that the model is based on a very small sample and it may be beneficial to remove one or more predictors, such as “best pic nom”, if these values will be not available. In addition, some genres were not sufficiently represented in the data set which may decrease the usefullness of the model for these particular types of movies.