Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

The movies data set is comprised of 651 randomly sampled movies produced and released before 2016. Because it is a random sample of movies, the results are generalizable to the target population of all movies produced and released before 2016. The variables relate to how much audiences and critics like the movies and contain other variables about the movies.

The data set is an observational study. The movie attributes are not randomly assigned, so there no causal inferences can be made.

Part 2: Research question

We would like to know what attributes make a movie popular. Because we cannot infer causality, our research question is what attributes are associated with popular movies? We are interested in learning something new about movies.

Part 3: Exploratory data analysis

There are 32 variables in the data set, but only a subset are relevant to our investigation of variables related to popular movies. The following variables are possible measures of a popular movie:

imdb_rating: Rating on IMDB
critics_score: Critics score on Rotten Tomatoes
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)

The following variables are possible explanatory variables of popular movies:

title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes)
best_actress win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes)
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes)

There are multiple candidate response variables. Two response variables are factor variables (best_pic_nom and top200_box). Stick with the numeric variables so that we can see degrees of popularity. For the remaining three candidate response variables, check their similarity with paired covariance plots.
```
ggpairs(movies, columns = c(13, 16, 18))
```
The numeric variables are collinear. It is reasonable to select a single variable as the response variable. We will examine the audience_score variable because it has the most variability (scores of 1-100), and because audience scores are a more direct indicator of audience reactions to movies.

Consider the relationship of each explanatory variable on the response variable audience_score. The explanatory variables are title_type, genre, runtime, mpaa_rating, best_actor_win, best_actress win, and best_dir_win.
```
ggplot(movies, aes(x = title_type, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs Movie Type")
```
The three movie types have distinct distributions. Documentaries have the highest median and tightest distribution. It appears the most popular films are the documentaries. Feature films have the lowest median and the widest range. TV movies have the middle median, but a very wide inter-quartile range.
```
ggplot(movies, aes(x = genre, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs Genre") +
  scale_x_discrete(labels = function(attend) lapply(strwrap(attend, width = 5, simplify = FALSE), paste, collapse="\n"))
```
The genres have quite different distributions of audience scores. Again, the Documentaries have the highest median. Horror films have the lowest median. It looks like film studies should steer away from horror films.
```
ggplot(movies, aes(x = runtime, y = audience_score)) +
  geom_point() +
  ggtitle("Audience Scores vs Runtime")
```
```
## Warning: Removed 1 rows containing missing values (geom_point).
```
Movie run times are mostly between 75 and 130 minutes. Within this range it is difficult to discern a relationship at a glance.
```
ggplot(movies, aes(x = mpaa_rating, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs MPAA Rating")
```
It appears that Unrated movies have the highest popularity. They have the highest median with a tight distribution. NC-17 movies appear to receive the lowest audience scores.
```
ggplot(movies, aes(x = best_actor_win, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs Main Actor has Won an Oscar")
```
There does not appear to be any relationship between having a former Oscar winning actor in the cast. The median audience score appears to be the same for both groups.
```
ggplot(movies, aes(x = best_actress_win, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs Main Actress has Won an Oscar")
```
There may be some benefit including a former Oscar winning actress in the movie. However, the interquartile range of the two groups are pretty similar.
```
ggplot(movies, aes(x = best_dir_win, y = audience_score)) +
  geom_boxplot() +
  ggtitle("Audience Scores vs Director has Won an Oscar")
```
There appears to be a benefit of hiring an Oscar-winning director for a movie. The median is higher. The interquartile ranges are similar, but the Oscar-winning director range is shifted upward. * * *

Part 4: Modeling

Our goal is to determine which movie attributes are most important in a popular movie. We are not necesarly trying to create a model that can identify popular movies based on the explanatory variables. For this reason we will use the p-value approach of model selection so that our model only contains explanatory variables significant at the 95% level.

mdl <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + best_actor_win + best_actress_win + best_dir_win, data = movies)
summary(mdl)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + best_actor_win + best_actress_win + best_dir_win, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.922 -13.141   1.345  13.288  39.379 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     57.87280    8.98988   6.438 2.41e-10 ***
## title_typeFeature Film         -11.81598    6.60709  -1.788  0.07420 .  
## title_typeTV Movie             -21.28000   10.40310  -2.046  0.04122 *  
## genreAnimation                   4.57893    6.96119   0.658  0.51092    
## genreArt House & International   9.52693    5.38828   1.768  0.07753 .  
## genreComedy                      0.93615    2.98160   0.314  0.75364    
## genreDocumentary                16.63436    7.09395   2.345  0.01934 *  
## genreDrama                      10.83157    2.53401   4.274 2.21e-05 ***
## genreHorror                     -6.87161    4.44575  -1.546  0.12269    
## genreMusical & Performing Arts  19.74893    6.04714   3.266  0.00115 ** 
## genreMystery & Suspense          1.14310    3.32519   0.344  0.73113    
## genreOther                      12.05988    5.03688   2.394  0.01694 *  
## genreScience Fiction & Fantasy  -4.00734    6.35190  -0.631  0.52834    
## runtime                          0.17427    0.04147   4.202 3.03e-05 ***
## mpaa_ratingNC-17               -10.29690   13.52500  -0.761  0.44675    
## mpaa_ratingPG                   -9.78790    4.89685  -1.999  0.04606 *  
## mpaa_ratingPG-13               -15.58905    4.99853  -3.119  0.00190 ** 
## mpaa_ratingR                   -10.01117    4.84975  -2.064  0.03940 *  
## mpaa_ratingUnrated              -6.44836    5.60048  -1.151  0.25001    
## best_actor_winyes               -1.03796    2.11178  -0.492  0.62324    
## best_actress_winyes             -0.12591    2.32486  -0.054  0.95683    
## best_dir_winyes                  6.13622    2.92350   2.099  0.03622 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.82 on 628 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2493, Adjusted R-squared:  0.2242 
## F-statistic: 9.932 on 21 and 628 DF,  p-value: < 2.2e-16

The best_actress_win variable has a p-value of .95683. Remove this variable and re-run the regression.

mdl <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + best_actor_win + best_dir_win, data = movies)
summary(mdl)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + best_actor_win + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.902 -13.126   1.301  13.301  39.384 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     57.90767    8.95968   6.463 2.06e-10 ***
## title_typeFeature Film         -11.81662    6.60184  -1.790  0.07395 .  
## title_typeTV Movie             -21.29456   10.39138  -2.049  0.04085 *  
## genreAnimation                   4.56065    6.94749   0.656  0.51178    
## genreArt House & International   9.51314    5.37800   1.769  0.07739 .  
## genreComedy                      0.92043    2.96510   0.310  0.75634    
## genreDocumentary                16.62488    7.08616   2.346  0.01928 *  
## genreDrama                      10.81328    2.50942   4.309 1.90e-05 ***
## genreHorror                     -6.87936    4.43992  -1.549  0.12178    
## genreMusical & Performing Arts  19.75002    6.04231   3.269  0.00114 ** 
## genreMystery & Suspense          1.12324    3.30227   0.340  0.73386    
## genreOther                      12.04755    5.02774   2.396  0.01686 *  
## genreScience Fiction & Fantasy  -4.00936    6.34675  -0.632  0.52780    
## runtime                          0.17394    0.04099   4.244 2.53e-05 ***
## mpaa_ratingNC-17               -10.27603   13.50878  -0.761  0.44713    
## mpaa_ratingPG                   -9.78875    4.89294  -2.001  0.04587 *  
## mpaa_ratingPG-13               -15.58932    4.99457  -3.121  0.00188 ** 
## mpaa_ratingR                   -10.00672    4.84521  -2.065  0.03931 *  
## mpaa_ratingUnrated              -6.44191    5.59478  -1.151  0.25000    
## best_actor_winyes               -1.04551    2.10550  -0.497  0.61967    
## best_dir_winyes                  6.13159    2.91994   2.100  0.03613 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.81 on 629 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2493, Adjusted R-squared:  0.2254 
## F-statistic: 10.44 on 20 and 629 DF,  p-value: < 2.2e-16

The highest p-value is genreComedy, but there are levels of the genre factor variable that have p-values below .05, so look for the next highest p-value outside this factor variable. The next highest factor variable is best_actor_win at .61967. Remove this variable and re-run the regression.

mdl <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + best_dir_win, data = movies)
summary(mdl)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.671 -13.017   1.343  13.316  39.399 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     58.3175     8.9163   6.541 1.27e-10 ***
## title_typeFeature Film         -11.8562     6.5974  -1.797  0.07280 .  
## title_typeTV Movie             -21.2172    10.3840  -2.043  0.04144 *  
## genreAnimation                   4.4799     6.9414   0.645  0.51891    
## genreArt House & International   9.5847     5.3728   1.784  0.07492 .  
## genreComedy                      0.9012     2.9631   0.304  0.76111    
## genreDocumentary                16.5874     7.0815   2.342  0.01947 *  
## genreDrama                      10.7557     2.5052   4.293 2.04e-05 ***
## genreHorror                     -6.8424     4.4366  -1.542  0.12352    
## genreMusical & Performing Arts  19.7933     6.0381   3.278  0.00110 ** 
## genreMystery & Suspense          0.9625     3.2844   0.293  0.76958    
## genreOther                      11.9933     5.0236   2.387  0.01726 *  
## genreScience Fiction & Fantasy  -3.9078     6.3397  -0.616  0.53785    
## runtime                          0.1696     0.0400   4.239 2.58e-05 ***
## mpaa_ratingNC-17               -10.6730    13.4770  -0.792  0.42869    
## mpaa_ratingPG                   -9.8586     4.8880  -2.017  0.04413 *  
## mpaa_ratingPG-13               -15.6005     4.9915  -3.125  0.00186 ** 
## mpaa_ratingR                   -10.0048     4.8423  -2.066  0.03923 *  
## mpaa_ratingUnrated              -6.4048     5.5909  -1.146  0.25241    
## best_dir_winyes                  6.0833     2.9166   2.086  0.03740 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.8 on 630 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.2264 
## F-statistic: 10.99 on 19 and 630 DF,  p-value: < 2.2e-16

Now all six remaining explanatory variables have p-values below .05. P-values and parameter estimates should only be trusted if the conditions for the regression are reasonable. Check the regression conditions with diagnostic plots.

(to run diagnostics, there cannot be explanatory variables with null values, so filter them out)

movies2 <- movies %>%
  filter(!is.na(title_type), !is.na(genre), !is.na(runtime), !is.na(mpaa_rating), !is.na(best_dir_win))
mdl <- lm(audience_score ~ title_type + genre + runtime + mpaa_rating + best_dir_win, data = movies2)

Linearity. The explanatory variables must each linearly related to the response variable. A residual plot of (e ~ x) with no pattern indicates a linear relationship because the linear model captures the trend and leaves behind only noise. For factor variables, create a box plot of residuals for each level. Look for differences in the distribution shape or variability of the residuals.

ggplot(movies2, aes(x = title_type, y = mdl$residuals)) +
  geom_boxplot() +
  ggtitle("Residuals plot of title_type")

ggplot(movies2, aes(x = genre, y = mdl$residuals)) +
  geom_boxplot() +
  ggtitle("Residuals plot of genre") +
  scale_x_discrete(labels = function(attend) lapply(strwrap(attend, width = 5, simplify = FALSE), paste, collapse="\n"))

ggplot(movies2, aes(x = runtime, y = mdl$residuals)) +
  geom_point() +
  ggtitle("Residuals plot of runtime")

ggplot(movies2, aes(x = mpaa_rating, y = mdl$residuals)) +
  geom_boxplot() +
  ggtitle("Residuals plot of mpaa_rating")

ggplot(movies2, aes(x = best_dir_win, y = mdl$residuals)) +
  geom_boxplot() +
  ggtitle("Residuals plot of best_dir_win")

The numeric variable runtime shows no obvious pattern. The boxplots for the factor variables all have similar distributions among their factor levels.

Normality. The residuals must be distributed nearly normally. A normal probability plot or a normal quantile plot will indicate deviations from normality. Look for a bow-shaped pattern. A histogram plot should appear normal.

# Normal residuals:
ggplot(movies2, aes(x = mdl$residuals)) +
  geom_histogram() + 
  ggtitle("Residuals Histogram")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(mdl$residuals) 
qqline(mdl$residuals)

There appears to be some left skew in the distribution, but not a large amount.

Homoscedasticity. The variance of the residuals must be nearly constant. The residuals should be the same size at both low and high values of the response variable. A residuals plot of (e ~ y) should have random scatter in a band of constant width around 0 (no fan shape!).

ggplot(movies2, aes(x = mdl$fitted.values, y = mdl$residuals)) +
  geom_point() +
  ggtitle("Homoscedasticity test")

The residuals band does not appear to be constant - it is smaller for the larger fitted values. Heteroscedasticity can compromise the parameter confidence intervals.

Independance. The observations should be a random sample from the population. This assumption may not hold. First, it is a random sample of movies whose ratings may be inconsistently contributed from various segments of society.

Let’s look at the model results.

summary(mdl)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + runtime + 
##     mpaa_rating + best_dir_win, data = movies2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.671 -13.017   1.343  13.316  39.399 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     58.3175     8.9163   6.541 1.27e-10 ***
## title_typeFeature Film         -11.8562     6.5974  -1.797  0.07280 .  
## title_typeTV Movie             -21.2172    10.3840  -2.043  0.04144 *  
## genreAnimation                   4.4799     6.9414   0.645  0.51891    
## genreArt House & International   9.5847     5.3728   1.784  0.07492 .  
## genreComedy                      0.9012     2.9631   0.304  0.76111    
## genreDocumentary                16.5874     7.0815   2.342  0.01947 *  
## genreDrama                      10.7557     2.5052   4.293 2.04e-05 ***
## genreHorror                     -6.8424     4.4366  -1.542  0.12352    
## genreMusical & Performing Arts  19.7933     6.0381   3.278  0.00110 ** 
## genreMystery & Suspense          0.9625     3.2844   0.293  0.76958    
## genreOther                      11.9933     5.0236   2.387  0.01726 *  
## genreScience Fiction & Fantasy  -3.9078     6.3397  -0.616  0.53785    
## runtime                          0.1696     0.0400   4.239 2.58e-05 ***
## mpaa_ratingNC-17               -10.6730    13.4770  -0.792  0.42869    
## mpaa_ratingPG                   -9.8586     4.8880  -2.017  0.04413 *  
## mpaa_ratingPG-13               -15.6005     4.9915  -3.125  0.00186 ** 
## mpaa_ratingR                   -10.0048     4.8423  -2.066  0.03923 *  
## mpaa_ratingUnrated              -6.4048     5.5909  -1.146  0.25241    
## best_dir_winyes                  6.0833     2.9166   2.086  0.03740 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.8 on 630 degrees of freedom
## Multiple R-squared:  0.249,  Adjusted R-squared:  0.2264 
## F-statistic: 10.99 on 19 and 630 DF,  p-value: < 2.2e-16

The reference level for the title variable is Documentaries. The model shows that, all else equal, Documentaries receive higher audience ratings that Feature Films (12 points), and TV moview (21 points).

The reference level for the genre variable is Action & Adventure. The model shows that, all else equal, Musical & Performing Arts receive the highest audience ratings. Documentaries are second. Horror films receive the worst.

For each additional minute of runtime, audience ratings increase by .17 points. Longer movies are more popular than shorter movies.

The reference level for the mpaa_rating variable is “G”. G-rated movies receive the highest audience ratings, on average. PG-13 receive ratings, all else equal, 16 points below G movies.

If the model was directed by an Oscar award winning director, audiences ratings will be 6 points higher on average.

Part 5: Prediction

Pick the 2016 movie “Deadpool” link (a new movie that is not in the sample). What does the model predict its user rating will be?

new.movie <- data.frame("Feature Film", "Action & Adventure", 108, "PG-13", "no")
colnames(new.movie) <- c("title_type", "genre", "runtime", "mpaa_rating", "best_dir_win")
predict(mdl, newdata = new.movie, interval = "confidence")

##        fit      lwr      upr
## 1 49.17337 44.05692 54.28981

The model predicts an audience rating of 49. The 95% confidence interval is 44 to 54, meaning we are 95% confident that, all else being equal, the model predicts that the moview would receive an audience rating of 44 to 54.

Part 6: Conclusion

The model shows that all else equal,

Documentaries are more popular that other title types and genres,
Longer movies are better than shorter movies,
G rated movies are more popular than other rated movies,
It’s better to hire an Oscar award winning director, if available.

So Paramount Pictures should consider hiring Steven Spielburg to direct a 3 hour long cartoon documentary, perhaps on statistics.

This study has some shortcomings. There may be bias in the ratings. For example, G-rated films may be rated by children, or by their mothers. Horror films may receive the highest viewership from teenages, and college drop-outs. In other words, the audiences behind the audience ratings may not be independent and identically distributed.

The movie length is also troubling. Surely there are diminishing returns to tacking on another scene to your already wearisome documentary on William Gosset.