library(ggplot2)
library(dplyr)
library(grid)
library(gridExtra)
library(MASS)

load("movies.Rdata")

The data come from an observational, not experimental, study: random assignment was not used, so no causal conclusions can be drawn. The sample is also comparatively small, which limits how strongly we can generalize to the population of movie watchers as a whole. Nevertheless, because the sample was obtained randomly, we can cautiously generalize to that population, even though we cannot infer causality.
Construct some new variables:
#Construct new variables
movies$feature_film <- ifelse(movies$title_type == "Feature Film", "yes", "no")
movies$drama <- ifelse(movies$genre == "Drama", "yes", "no")
movies$mpaa_rating_R <- ifelse(movies$mpaa_rating == "R", "yes", "no")
movies$oscar_season <- ifelse(movies$thtr_rel_month == 10 | movies$thtr_rel_month == 11 | movies$thtr_rel_month == 12, "yes", "no")
movies$summer_season <- ifelse(movies$thtr_rel_month > 4 & movies$thtr_rel_month < 9, "yes", "no")

ggplot(movies, aes(x=factor(genre), y=audience_score)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Several genres, such as Animation, Documentary, and Musical & Performing Arts, have relatively high audience scores with little variation, while scores for genres such as Action & Adventure, Drama, and Science Fiction & Fantasy are lower on average and span a much wider range. The relatively small number of movies in genres such as Animation and Science Fiction & Fantasy means the apparent spread for these genres should be interpreted with caution.
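To see how sparsely some genres are represented, a quick count per genre helps put the boxplot spreads in context (a minimal sketch using the dplyr package loaded above):

#Count movies per genre; genres with few movies have less reliable spreads
movies %>%
  count(genre) %>%
  arrange(n)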
Plotting either the theater release date or the DVD release date against audience score did not reveal any obvious pattern for any particular genre. The figure below displays audience score as a function of the theater release year.
ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
geom_point(aes(colour=factor(genre))) +
xlab("Year of Release (theater)") +
ylab("Audience Score")Although it might be possible to combine the actor’s names of a particular movie into a single numerical or categorical variable it would not aid the interpretability of the regression model. Moreover, given the small sample size, it is a distinct possibility that most actors’ names may not be meaningful predictors. The following code block removes some variables and lists those selected for inclusion into the regression model.
df <- movies[ -c(1, 6, 8:12, 25:32) ]
df <- na.omit(df)
#Variables included for modeling
names(df)

## [1] "title_type" "genre" "runtime"
## [4] "mpaa_rating" "thtr_rel_year" "imdb_rating"
## [7] "imdb_num_votes" "critics_rating" "critics_score"
## [10] "audience_rating" "audience_score" "best_pic_nom"
## [13] "best_pic_win" "best_actor_win" "best_actress_win"
## [16] "best_dir_win" "top200_box" "feature_film"
## [19] "drama" "mpaa_rating_R" "oscar_season"
## [22] "summer_season"
Part 4A. Obtain regression coefficients.
The modeling approach selected here uses backward elimination guided by the Bayesian Information Criterion (BIC), which penalizes model complexity and therefore favours parsimonious models that should give more reliable predictions for movies not present in the original dataset. Using the MASS package, stepAIC() removes variables one at a time until the BIC can no longer be lowered. It takes as inputs a full model and a penalty parameter k, set here to the logarithm of the number of data points, which turns the default AIC criterion into BIC. The following code block performs the selection and prints the stepwise trace.
lm_rating <- stepAIC(lm(audience_score ~ ., data = df), direction = "backward", k = log(nrow(df)))

## Start: AIC=2535.89
## audience_score ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating +
## critics_score + audience_rating + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box +
## feature_film + drama + mpaa_rating_R + oscar_season + summer_season
##
##
## Step: AIC=2535.89
## audience_score ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating +
## critics_score + audience_rating + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box +
## feature_film + drama + oscar_season + summer_season
##
##
## Step: AIC=2535.89
## audience_score ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating +
## critics_score + audience_rating + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box +
## feature_film + oscar_season + summer_season
##
##
## Step: AIC=2535.89
## audience_score ~ title_type + genre + runtime + mpaa_rating +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_rating +
## critics_score + audience_rating + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box +
## oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - mpaa_rating 5 91.7 29053 2527.9
## - critics_rating 2 24.6 28986 2532.4
## - title_type 2 53.0 29014 2533.1
## - best_actor_win 1 0.1 28962 2533.9
## - best_dir_win 1 3.2 28965 2534.0
## - critics_score 1 3.6 28965 2534.0
## - top200_box 1 6.1 28968 2534.0
## - best_pic_win 1 42.7 29004 2534.8
## - summer_season 1 73.0 29034 2535.5
## - imdb_num_votes 1 82.6 29044 2535.7
## <none> 28961 2535.9
## - best_actress_win 1 94.4 29056 2536.0
## - runtime 1 110.4 29072 2536.4
## - thtr_rel_year 1 116.8 29078 2536.5
## - oscar_season 1 129.5 29091 2536.8
## - genre 10 962.2 29924 2537.1
## - best_pic_nom 1 209.5 29171 2538.6
## - imdb_rating 1 17033.1 45995 2834.6
## - audience_rating 1 29450.2 58412 2989.9
##
## Step: AIC=2527.94
## audience_score ~ title_type + genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_rating + critics_score +
## audience_rating + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win + top200_box + oscar_season +
## summer_season
##
## Df Sum of Sq RSS AIC
## - critics_rating 2 19.0 29072 2524.4
## - title_type 2 42.5 29096 2524.9
## - best_actor_win 1 0.3 29053 2525.9
## - top200_box 1 2.1 29055 2526.0
## - best_dir_win 1 2.3 29055 2526.0
## - critics_score 1 7.7 29061 2526.1
## - best_pic_win 1 38.3 29091 2526.8
## - imdb_num_votes 1 67.3 29120 2527.4
## - summer_season 1 67.9 29121 2527.5
## <none> 29053 2527.9
## - best_actress_win 1 90.0 29143 2527.9
## - runtime 1 117.0 29170 2528.6
## - oscar_season 1 126.7 29180 2528.8
## - thtr_rel_year 1 148.5 29202 2529.2
## - best_pic_nom 1 214.2 29267 2530.7
## - genre 10 1141.3 30194 2533.0
## - imdb_rating 1 17057.6 46111 2826.2
## - audience_rating 1 29593.2 58646 2982.5
##
## Step: AIC=2524.37
## audience_score ~ title_type + genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + audience_rating +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
## best_dir_win + top200_box + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - title_type 2 45.6 29118 2521.4
## - best_actor_win 1 0.3 29072 2522.4
## - top200_box 1 2.0 29074 2522.4
## - best_dir_win 1 2.8 29075 2522.4
## - best_pic_win 1 38.7 29111 2523.2
## - summer_season 1 68.3 29140 2523.9
## - imdb_num_votes 1 82.7 29155 2524.2
## <none> 29072 2524.4
## - best_actress_win 1 92.4 29165 2524.4
## - critics_score 1 103.2 29175 2524.7
## - runtime 1 123.9 29196 2525.1
## - oscar_season 1 127.6 29200 2525.2
## - thtr_rel_year 1 147.8 29220 2525.7
## - best_pic_nom 1 218.5 29291 2527.2
## - genre 10 1141.1 30213 2529.4
## - imdb_rating 1 17419.5 46492 2827.5
## - audience_rating 1 30206.9 59279 2985.5
##
## Step: AIC=2521.38
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_pic_win + best_actor_win + best_actress_win + best_dir_win +
## top200_box + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 0.5 29118 2519.4
## - top200_box 1 2.0 29120 2519.4
## - best_dir_win 1 3.6 29121 2519.5
## - best_pic_win 1 40.8 29158 2520.3
## - summer_season 1 59.9 29178 2520.7
## - critics_score 1 87.6 29205 2521.3
## <none> 29118 2521.4
## - imdb_num_votes 1 93.4 29211 2521.5
## - best_actress_win 1 93.9 29212 2521.5
## - oscar_season 1 124.6 29242 2522.2
## - runtime 1 125.5 29243 2522.2
## - thtr_rel_year 1 161.0 29279 2523.0
## - best_pic_nom 1 223.1 29341 2524.3
## - genre 10 1169.7 30287 2527.0
## - imdb_rating 1 17539.7 46657 2825.8
## - audience_rating 1 30171.3 59289 2981.6
##
## Step: AIC=2519.4
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_pic_win + best_actress_win + best_dir_win + top200_box +
## oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - top200_box 1 1.9 29120 2517.4
## - best_dir_win 1 3.8 29122 2517.5
## - best_pic_win 1 41.6 29160 2518.3
## - summer_season 1 60.3 29179 2518.7
## - critics_score 1 87.7 29206 2519.3
## <none> 29118 2519.4
## - imdb_num_votes 1 93.1 29211 2519.5
## - best_actress_win 1 93.4 29212 2519.5
## - oscar_season 1 124.1 29242 2520.2
## - runtime 1 126.2 29245 2520.2
## - thtr_rel_year 1 161.5 29280 2521.0
## - best_pic_nom 1 228.2 29346 2522.5
## - genre 10 1170.8 30289 2525.0
## - imdb_rating 1 17549.2 46668 2824.0
## - audience_rating 1 30257.7 59376 2980.5
##
## Step: AIC=2517.44
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_pic_win + best_actress_win + best_dir_win + oscar_season +
## summer_season
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 4.0 29124 2515.5
## - best_pic_win 1 41.6 29162 2516.4
## - summer_season 1 61.3 29181 2516.8
## - critics_score 1 86.5 29207 2517.4
## <none> 29120 2517.4
## - imdb_num_votes 1 92.2 29212 2517.5
## - best_actress_win 1 95.2 29215 2517.6
## - oscar_season 1 126.8 29247 2518.3
## - runtime 1 127.1 29247 2518.3
## - thtr_rel_year 1 159.6 29280 2519.0
## - best_pic_nom 1 229.7 29350 2520.6
## - genre 10 1169.4 30290 2523.0
## - imdb_rating 1 17618.5 46739 2823.0
## - audience_rating 1 30272.3 59393 2978.7
##
## Step: AIC=2515.53
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_pic_win + best_actress_win + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 37.7 29162 2514.4
## - summer_season 1 59.9 29184 2514.9
## - critics_score 1 89.0 29213 2515.5
## <none> 29124 2515.5
## - imdb_num_votes 1 93.4 29218 2515.6
## - best_actress_win 1 95.3 29219 2515.7
## - runtime 1 123.3 29247 2516.3
## - oscar_season 1 125.0 29249 2516.3
## - thtr_rel_year 1 163.8 29288 2517.2
## - best_pic_nom 1 226.9 29351 2518.6
## - genre 10 1167.6 30292 2521.1
## - imdb_rating 1 17659.3 46783 2821.6
## - audience_rating 1 30319.0 59443 2977.3
##
## Step: AIC=2514.37
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_actress_win + oscar_season + summer_season
##
## Df Sum of Sq RSS AIC
## - summer_season 1 61.1 29223 2513.7
## - imdb_num_votes 1 74.1 29236 2514.0
## - critics_score 1 86.6 29248 2514.3
## <none> 29162 2514.4
## - best_actress_win 1 101.1 29263 2514.6
## - oscar_season 1 120.4 29282 2515.1
## - runtime 1 128.4 29290 2515.2
## - thtr_rel_year 1 151.6 29313 2515.7
## - best_pic_nom 1 189.2 29351 2516.6
## - genre 10 1155.9 30318 2519.6
## - imdb_rating 1 17736.5 46898 2821.2
## - audience_rating 1 30368.8 59531 2976.2
##
## Step: AIC=2513.73
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + audience_rating + best_pic_nom +
## best_actress_win + oscar_season
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 70.0 29293 2513.3
## - oscar_season 1 71.3 29294 2513.3
## - critics_score 1 74.5 29297 2513.4
## <none> 29223 2513.7
## - best_actress_win 1 103.7 29327 2514.0
## - runtime 1 136.6 29360 2514.8
## - thtr_rel_year 1 152.6 29376 2515.1
## - best_pic_nom 1 193.8 29417 2516.0
## - genre 10 1176.5 30399 2519.4
## - imdb_rating 1 18220.5 47443 2826.7
## - audience_rating 1 30433.0 59656 2975.6
##
## Step: AIC=2513.28
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win +
## oscar_season
##
## Df Sum of Sq RSS AIC
## - critics_score 1 71.5 29364 2512.9
## - oscar_season 1 74.7 29368 2512.9
## <none> 29293 2513.3
## - runtime 1 99.3 29392 2513.5
## - best_actress_win 1 99.5 29392 2513.5
## - thtr_rel_year 1 108.0 29401 2513.7
## - best_pic_nom 1 260.6 29554 2517.0
## - genre 10 1163.4 30456 2518.6
## - imdb_rating 1 19926.8 49220 2848.6
## - audience_rating 1 30627.2 59920 2976.5
##
## Step: AIC=2512.87
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## audience_rating + best_pic_nom + best_actress_win + oscar_season
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 80.8 29445 2512.7
## <none> 29364 2512.9
## - best_actress_win 1 97.9 29462 2513.0
## - runtime 1 107.6 29472 2513.2
## - thtr_rel_year 1 128.5 29493 2513.7
## - best_pic_nom 1 279.5 29644 2517.0
## - genre 10 1181.5 30546 2518.5
## - audience_rating 1 31227.7 60592 2981.7
## - imdb_rating 1 31291.6 60656 2982.4
##
## Step: AIC=2512.65
## audience_score ~ genre + runtime + thtr_rel_year + imdb_rating +
## audience_rating + best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## <none> 29445 2512.7
## - best_actress_win 1 99.1 29544 2512.8
## - thtr_rel_year 1 121.2 29566 2513.3
## - runtime 1 150.2 29595 2514.0
## - best_pic_nom 1 248.3 29693 2516.1
## - genre 10 1131.0 30576 2517.2
## - imdb_rating 1 31210.8 60656 2980.4
## - audience_rating 1 31496.1 60941 2983.4
The final model retains the following predictor variables: genre, runtime, thtr_rel_year, imdb_rating, audience_rating, best_pic_nom, and best_actress_win.
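The formula and coefficients of the model chosen by the stepwise procedure can be inspected directly from the stepAIC result before it is re-fitted below:

#Formula and coefficients of the model selected by stepAIC
formula(lm_rating)
round(coef(lm_rating), 2)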
The coefficients for the different genres indicate that, relative to the reference genre (Action & Adventure), a movie categorized as “Animation” is expected to score, on average, 3.62 points higher, while a “Horror” movie is expected to score, on average, 2.05 points lower, all else being equal. Additionally, with every unit increase in imdb_rating, the audience score is expected to increase, on average, by 9.84 points; the remaining coefficients are interpreted analogously.
The next code block runs the selected model:
lm_score <- lm(formula = audience_score ~ genre + runtime + thtr_rel_year + imdb_rating + audience_rating + best_pic_nom + best_actress_win, data = df)
summary(lm_score)

##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + audience_rating + best_pic_nom + best_actress_win,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.541 -4.433 0.542 4.320 24.949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.06751 50.28940 1.413 0.1581
## genreAnimation 3.62120 2.45779 1.473 0.1412
## genreArt House & International -2.62636 2.02374 -1.298 0.1948
## genreComedy 1.48198 1.13027 1.311 0.1903
## genreDocumentary 0.73529 1.39522 0.527 0.5984
## genreDrama -0.58155 0.96311 -0.604 0.5462
## genreHorror -2.04842 1.67007 -1.227 0.2204
## genreMusical & Performing Arts 2.94579 2.18186 1.350 0.1775
## genreMystery & Suspense -2.94324 1.24792 -2.359 0.0187 *
## genreOther -0.02557 1.92584 -0.013 0.9894
## genreScience Fiction & Fantasy -0.02531 2.43020 -0.010 0.9917
## runtime -0.02816 0.01567 -1.797 0.0728 .
## thtr_rel_year -0.04048 0.02507 -1.614 0.1069
## imdb_rating 9.83752 0.37978 25.903 <2e-16 ***
## audience_ratingUpright 20.11201 0.77292 26.021 <2e-16 ***
## best_pic_nomyes 3.66702 1.58725 2.310 0.0212 *
## best_actress_winyes -1.30702 0.89544 -1.460 0.1449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.82 on 633 degrees of freedom
## Multiple R-squared: 0.8892, Adjusted R-squared: 0.8864
## F-statistic: 317.5 on 16 and 633 DF, p-value: < 2.2e-16
For the multiple regression model to be valid, it is necessary that (i) each numerical predictor variable (runtime and imdb_rating) is linearly related to the response variable (audience_score), (ii) the residuals are nearly normally distributed, (iii) the residuals display constant variability, and (iv) the residuals are independent.
First, we examine whether the two numerical variables included in the model, runtime and imdb_rating, are linearly related to the response variable, audience_score, by plotting the residuals against each of them.
g_runtime <- ggplot(data=NULL, aes(x=df$runtime, y=lm_score$residuals)) +
  geom_point()
g_imdb_rating <- ggplot(data=NULL, aes(x=df$imdb_rating, y=lm_score$residuals)) +
  geom_point()
grid.arrange(g_runtime, g_imdb_rating)

For both runtime and imdb_rating, the residuals do indeed appear to be scattered randomly around 0.
Next, we need to check whether the residuals display a nearly normal distribution centered around 0.
par(mfrow = c(1, 2))
hist(lm_score$residuals)
qqnorm(lm_score$residuals)
qqline(lm_score$residuals)

The histogram suggests that the residuals are indeed distributed normally around 0, while the Q-Q plot indicates some skewness in the tails, but there are no large deviations.
The next two plots display the (absolute) values of the model’s residuals as a function of the model’s fitted values.
par(mfrow = c(1, 2))
plot(lm_score$fitted.values, lm_score$residuals)
plot(lm_score$fitted.values, abs(lm_score$residuals))

The plots show that the residuals are roughly equally variable for low and high fitted values, i.e., the residuals have constant variability.
Finally, we check whether the residuals are independent, which would indicate that the observations themselves are independent.
par(mfrow=c(1,1))
plot(lm_score$residuals)

The residuals show no apparent structure when plotted in order. In addition, they show no pattern when plotted as a function of the theater release date (not shown).
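That release-date check can be reproduced with a quick plot of the residuals against the theater release year (a minimal sketch; the figure is not part of the original report):

#Residuals versus theater release year; no time trend should be visible
#if the independence assumption holds
plot(df$thtr_rel_year, lm_score$residuals,
     xlab = "Year of Release (theater)", ylab = "Residuals")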
With regard to inference for the model, the P-value of the model’s F-statistic indicates that the model as a whole is significant. It should be noted that not all explanatory variables have a significant P-value, since the model was selected by minimizing BIC rather than by individual significance tests. Nevertheless, as an example, the coefficient for imdb_rating shows that for each unit increase in the IMDb rating, the audience score is expected to increase by approximately 9.8 points.
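As a complementary inference step, a 95% confidence interval for the imdb_rating slope can be obtained directly from the fitted model:

#95% confidence interval for the imdb_rating coefficient
confint(lm_score, "imdb_rating", level = 0.95)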
One reason to also look at an ANOVA table is that it uses information about the possible associations between the independent variables and the dependent variable that is discarded by the individual t-tests in the summary output. The results of the ANOVA are shown below.
anova(lm_score)

## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## genre 10 51633 5163 110.9992 < 2.2e-16 ***
## runtime 1 6236 6236 134.0523 < 2.2e-16 ***
## thtr_rel_year 1 1952 1952 41.9608 1.871e-10 ***
## imdb_rating 1 144347 144347 3103.1097 < 2.2e-16 ***
## audience_rating 1 31807 31807 683.7802 < 2.2e-16 ***
## best_pic_nom 1 209 209 4.4826 0.03463 *
## best_actress_win 1 99 99 2.1305 0.14489
## Residuals 633 29445 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show that all independent variables except “best actress win” are indeed considered significant predictors based on their P-values.
Finally, the model is used to predict the audience score for the movie “The Revenant”, released in December 2015; choosing a December release ensures that the awards-related variables such as best_pic_nom and best_actress_win are already known. The values of the predictor variables were obtained from the sources mentioned in the codebook.
movie1 <- data.frame(title_type="Feature Film",
genre="Action & Adventure",
runtime=156,
thtr_rel_year=2015,
mpaa_rating="R",
imdb_rating=8.1,
critics_rating="Certified Fresh",
critics_score=82,
audience_rating="Upright",
audience_score=85,
best_pic_nom="yes",
best_pic_win="yes",
best_actor_win="yes",
best_actress_win="no",
best_dir_win="yes",
top200_box="yes")
prediction_Revenant <- predict(lm_score, newdata=movie1, interval="confidence")
prediction_Revenant

## fit lwr upr
## 1 88.57269 84.86001 92.28538
The value obtained, 88.6, is fairly close to the actual audience score of 85. Because interval="confidence" was used, the interval refers to the mean audience score of movies with these characteristics: we can be 95% confident that this mean lies between approximately 84.9 and 92.3. A prediction interval for an individual movie would be wider.
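For a single movie, a prediction interval is the more appropriate uncertainty statement; it can be obtained by changing the interval argument (a minimal sketch, expected to give a wider range than the confidence interval above):

#95% prediction interval for an individual movie with these characteristics
predict(lm_score, newdata = movie1, interval = "prediction")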
In conclusion, the model presented here may be used to predict audience scores for a movie. It should be noted that the model is based on a fairly small sample, and it may be beneficial to remove one or more predictors, such as best_pic_nom, if their values are not available at prediction time. In addition, some genres are not well represented in the dataset, which may reduce the usefulness of the model for those particular types of movies.