Bayesian modeling and prediction for movies

Week 5 Project of the course Bayesian Statistics under the course track Statistics with R

Submitted by Olusola Afuwape

September 6^th 2019

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(ggridges)
library(lattice)
library(grid)
library(gridExtra)

Load data

load("movies.Rdata")

Part 1: Data

The variable thtr_rel_year is about years the movies were released

span <- movies %>% filter(!is.na(thtr_rel_year)) %>% select(thtr_rel_year)
range(span$thtr_rel_year)

[1] 1970 2014

Overview

The movies dataset consisted of 651 randomly sampled movies produced and released between 1970 and 2014. The dataset was generated from the Rotten Tomatoes and IMDb databases.

Scope of Inference

The movies dataset was compiled through observational studies. The focus of an observational study is to infer association or correlation. An observational study shows how the response variable is related to the explanatory variables. Observational studies do not utilize experimental measures like control, reproducibility, blocking etc. An observational study focuses on outcome rather than input. The movies dataset can only imply generalizability and not causality. To be able to infer causality, principles of expermental design (including control, randomize, replicate or reproduce, and block) must be engaged.

Part 2: Data manipulation

# Observe the dimension of the data
dim(movies)

[1] 651  32

Discussion

For this data analysis, five more varaibles must be created from four of the 32 existing variables. Sixteen explanatory variables are required in all for the modeling. The new five variables include:

feature_film created from the variable title_type
drama created from the variable genre
mpaa_rating_R created from the variable mpaa_rating
oscar_season created from the variable thtr_rel_month
summer_season created from the variable thtr_rel_month

# Variable feature_film
movies <- movies %>% mutate(feature_film = as.factor(ifelse(title_type == "Feature Film", "yes", "no")))

# Variable drama
movies <- movies %>% mutate(drama = as.factor(ifelse(genre == "Drama", "yes", "no")))

# Variable mpaa_rating_R
movies <- movies %>% mutate(mpaa_rating_R = as.factor(ifelse(mpaa_rating == "R", "yes", "no")))

# Variable oscar_season
movies <- movies %>% mutate(oscar_season = as.factor(ifelse(thtr_rel_month %in% c("10", "11", "12"), "yes", "no")))

# Variable summer_season
movies <- movies %>% mutate(summer_season = as.factor(ifelse(thtr_rel_month %in% c("6", "7", "8"), "yes", "no")))

Since five more variables have been added to the dataset, the variables in the dataset should increase to 37 from 32. Let verify this.

dim(movies)

[1] 651  37

Part 3: Exploratory data analysis

Discussion

Out of the 32 variables in the movies dataset, we only require 17 variables including the response variable for this analysis. The 17 variables are listed below starting with the response variable.

audience_score
feature_film
drama
mpaa_rating_R
oscar_season
summer_season
runtime
thtr_rel_year
imdb_rating
imdb_num_votes
critics_score
best_pic_nom
best_pic_win
best_actor_win
best_actress_win
best_dir_win
top200_box

# Select the 17 required variables from the dataset

bayesian_movies <- movies %>% filter(!is.na(audience_score), !is.na(feature_film), !is.na(drama), !is.na(mpaa_rating_R), !is.na(oscar_season), !is.na(summer_season), !is.na(runtime), !is.na(thtr_rel_year), !is.na(imdb_rating), !is.na(imdb_num_votes), !is.na(critics_score), !is.na(best_pic_nom), !is.na(best_pic_win), !is.na(best_actor_win), !is.na(best_actress_win), !is.na(best_dir_win), !is.na(top200_box)) %>% select(audience_score, feature_film, drama, mpaa_rating_R, oscar_season, summer_season, runtime, thtr_rel_year, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box)

# Observe bayesian_movies

str(bayesian_movies)

Classes 'tbl_df', 'tbl' and 'data.frame':   650 obs. of  17 variables:
 $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
 $ feature_film    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 1 2 2 1 2 ...
 $ drama           : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 2 2 1 2 ...
 $ mpaa_rating_R   : Factor w/ 2 levels "no","yes": 2 1 2 1 2 1 1 2 1 1 ...
 $ oscar_season    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ summer_season   : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 1 1 1 1 ...
 $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
 $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
 $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
 $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
 $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
 $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
 $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
 $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Graphical visualization

# Boxplots of variable `audience_score` and categorical variables in the bayesian_movies dataset
box1 <- bwplot(~ audience_score | oscar_season, data = bayesian_movies, main = "Audience vs Oscar")

box2 <- bwplot(~ audience_score | summer_season, data = bayesian_movies, main = "Audience vs Summer")

box3 <- bwplot(~ audience_score | feature_film, data = bayesian_movies, main = "Audience vs Feature")

box4 <- bwplot(~ audience_score | drama, data = bayesian_movies, main = "Audience vs Drama")

box5 <- bwplot(~ audience_score | mpaa_rating_R, data = bayesian_movies, main = "Audience vs R")

box6 <- bwplot(~ audience_score | best_pic_nom, data = bayesian_movies, main = "Audience vs Pic nom")

box7 <- bwplot(~ audience_score | best_pic_win, data = bayesian_movies, main = "Audience vs Pic win")

box8 <- bwplot(~ audience_score | best_actor_win, data = bayesian_movies, main = "Audience vs Actor")

box9 <- bwplot(~ audience_score | best_actress_win, data = bayesian_movies, main = "Audience vs Actress")

box10 <- bwplot(~ audience_score | best_dir_win, data = bayesian_movies, main = "Audience vs Director")

box11 <- bwplot(~ audience_score | top200_box, data = bayesian_movies, main = "Audience vs Top200")



grid.arrange(box1, box2, ncol = 2)

grid.arrange(box3, box4, ncol = 2)

grid.arrange(box5, box6, ncol = 2)

grid.arrange(box7, box8, ncol = 2)

grid.arrange(box9, box10, ncol = 2)

grid.arrange(box11, ncol = 2)

Discussion

For much better view of the variables, let’s observe the boxplots again with the response variable on the y axis.

boxplot(audience_score ~ oscar_season, data = bayesian_movies, col = c("thistle", "tan"), xlab = "Oscar", ylab = "Audience")

boxplot(audience_score ~ summer_season, data = bayesian_movies, col = c("yellowgreen", "snow"), xlab = "Summer", ylab = "Audience")

boxplot(audience_score ~ feature_film, data = bayesian_movies, col = c("slateblue", "mediumaquamarine"), xlab = "Feature film", ylab = "Audience")

boxplot(audience_score ~ drama, data = bayesian_movies, col = c("pink4", "slategray1"), xlab = "Drama", ylab = "Audience")

boxplot(audience_score ~ mpaa_rating_R, data = bayesian_movies, col = c("mintcream", "lightgreen"), xlab = "MPAA Rating", ylab = "Audience")

boxplot(audience_score ~ best_pic_nom, data = bayesian_movies, col = c("ivory", "ivory4"), xlab = "Best picture nomination", ylab = "Audience")

boxplot(audience_score ~ best_pic_win, data = bayesian_movies, col = c("olivedrab1", "oldlace"), xlab = "Best picture winner", ylab = "Audience")

boxplot(audience_score ~ best_actor_win, data = bayesian_movies, col = c("lemonchiffon", "lemonchiffon4"), xlab = "Best actor winner", ylab = "Audience")

boxplot(audience_score ~ best_actress_win, data = bayesian_movies, col = c("lightskyblue", "lightskyblue4"), xlab = "Best actress winner", ylab = "Audience")

boxplot(audience_score ~ best_dir_win, data = bayesian_movies, col = c("paleturquoise1", "paleturquoise4"), xlab = "Best director winner", ylab = "Audience")

boxplot(audience_score ~ top200_box, data = bayesian_movies, col = c("maroon1", "lavender"), xlab = "Box office Top 200", ylab = "Audience")

Discussion

A close examination of the two forms of the set of boxplots showed that the plots of variable audience_score versus oscar_season, summer_season, drama, mpaa_rating, best_actor_winner, best_actress_winner, best_director_winner respectively, give no distinct trends between the response variable and the ‘yes’ or ‘no’ levels of the explanatory variables.

The plot of audience_score and feature_film gives useful information because level ‘no’ of the feature_film with its lowest ‘no’ values which is an outlier also correspond with level ‘yes’ lowest value which is not an outlier.

The plots of audience_score versus best_pic_nom, audience_score versus best_pic_win and audience_score versus top200_box respectively showed high ‘yes’ levels

# Histogram of numerical response variable showing mean and standard deviation

mean <- mean(bayesian_movies$audience_score)
standard_deviation <- sd(bayesian_movies$audience_score)
median <- median(bayesian_movies$audience_score)

as <- ggplot(bayesian_movies) + geom_histogram(aes(x = audience_score, y = ..density..), bins = 30, fill = "lightcyan1", color = "black") + geom_vline(mapping = aes(xintercept = mean, color = "limegreen")) + geom_vline(mapping = aes(xintercept = standard_deviation, color = "mediumblue"))

as + stat_function(fun = function(x) dnorm(x, mean, standard_deviation), color = "yellow")

mean

[1] 62.34769

standard_deviation

[1] 20.23466

median

[1] 65

Discussion

The histogram of the response variable is left skewed showing the mean, standard deviation and the regression curve. The calculated value of the median is slightly higher than that of mean because the distribution is left skewed.

# Scatter plots of the numeric response variable with the numeric explanatory variables

#class <- c("runtime", "thtr_rel_year", "imdb_rating", "imdb_num_votes", "critics_score")
r1 <- ggplot(bayesian_movies) + geom_point(aes(x = audience_score, y = runtime), color = "peru") + labs(title = "Audience score vs runtime") + labs(x = "Audience score", y = "Runtime")

r2 <- ggplot(bayesian_movies) + geom_point(aes(x = audience_score, y = thtr_rel_year), color = "pink4") + labs(title = "Audience score vs Year of release") + labs(x = "Audience score", y = "Year of release")

r3 <- ggplot(bayesian_movies) + geom_point(aes(x = audience_score, y = imdb_rating), color = "violetred4") + labs(title = "Audience score vs Rating") + labs(x = "Audience score", y = "IMDB rating")

r4 <- ggplot(bayesian_movies) + geom_point(aes(x = audience_score, y = imdb_num_votes), color = "navajowhite4") + labs(title = "Audience score vs Number of votes") + labs(x = "Audience score", y = "IMDB votes")

r5 <- ggplot(bayesian_movies) + geom_point(aes(x = audience_score, y = critics_score), color = "seagreen") + labs(title = "Audience score vs Critics score") + labs(x = "Audience score", y = "Critics score")

grid.arrange(r1, r2, r3, r4, r5, ncol = 2)

Discussion

The scatterplots of the response variable and the numerical explanatory variables showed some levels of variations. Linear correlation was exhibited between audience_score versus runtime, audience_score against thtr_rel_year and audience_score versus imdb_num_votes. However, positive correlation was seen for audience_score versus imdb_rating and audience_score versus critics_score respectively.

Part 4: Modeling

Bayesian Model Averaging will be used for this project.

# Data summary
summary(bayesian_movies)

 audience_score  feature_film drama     mpaa_rating_R oscar_season
 Min.   :11.00   no : 59      no :345   no :321       no :460     
 1st Qu.:46.00   yes:591      yes:305   yes:329       yes:190     
 Median :65.00                                                    
 Mean   :62.35                                                    
 3rd Qu.:80.00                                                    
 Max.   :97.00                                                    
 summer_season    runtime      thtr_rel_year   imdb_rating   
 no :486       Min.   : 39.0   Min.   :1970   Min.   :1.900  
 yes:164       1st Qu.: 92.0   1st Qu.:1990   1st Qu.:5.900  
               Median :103.0   Median :2000   Median :6.600  
               Mean   :105.8   Mean   :1998   Mean   :6.492  
               3rd Qu.:115.8   3rd Qu.:2007   3rd Qu.:7.300  
               Max.   :267.0   Max.   :2014   Max.   :9.000  
 imdb_num_votes   critics_score    best_pic_nom best_pic_win
 Min.   :   180   Min.   :  1.00   no :628      no :643     
 1st Qu.:  4584   1st Qu.: 33.00   yes: 22      yes:  7     
 Median : 15204   Median : 61.00                            
 Mean   : 57620   Mean   : 57.65                            
 3rd Qu.: 58484   3rd Qu.: 83.00                            
 Max.   :893008   Max.   :100.00                            
 best_actor_win best_actress_win best_dir_win top200_box
 no :557        no :578          no :607      no :635   
 yes: 93        yes: 72          yes: 43      yes: 15

# Modeling
bmm_all <- bas.lm(audience_score ~ ., data = bayesian_movies, prior = "BIC", modelprior = uniform())
summary(bmm_all)

                    P(B != 0 | Y)    model 1       model 2       model 3
Intercept              1.00000000     1.0000     1.0000000     1.0000000
feature_filmyes        0.06475617     0.0000     0.0000000     0.0000000
dramayes               0.04308556     0.0000     0.0000000     0.0000000
mpaa_rating_Ryes       0.20009502     0.0000     0.0000000     0.0000000
oscar_seasonyes        0.07703667     0.0000     0.0000000     0.0000000
summer_seasonyes       0.04080086     0.0000     0.0000000     0.0000000
runtime                0.47067030     1.0000     0.0000000     0.0000000
thtr_rel_year          0.09050567     0.0000     0.0000000     0.0000000
imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
imdb_num_votes         0.05796731     0.0000     0.0000000     0.0000000
critics_score          0.89156709     1.0000     1.0000000     1.0000000
best_pic_nomyes        0.13014756     0.0000     0.0000000     0.0000000
best_pic_winyes        0.03982897     0.0000     0.0000000     0.0000000
best_actor_winyes      0.14531668     0.0000     0.0000000     1.0000000
best_actress_winyes    0.14134972     0.0000     0.0000000     0.0000000
best_dir_winyes        0.06679156     0.0000     0.0000000     0.0000000
top200_boxyes          0.04771535     0.0000     0.0000000     0.0000000
BF                             NA     1.0000     0.9968489     0.2543185
PostProbs                      NA     0.1353     0.1348000     0.0344000
R2                             NA     0.7549     0.7525000     0.7539000
dim                            NA     4.0000     3.0000000     4.0000000
logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
                          model 4       model 5
Intercept               1.0000000     1.0000000
feature_filmyes         0.0000000     0.0000000
dramayes                0.0000000     0.0000000
mpaa_rating_Ryes        1.0000000     1.0000000
oscar_seasonyes         0.0000000     0.0000000
summer_seasonyes        0.0000000     0.0000000
runtime                 0.0000000     1.0000000
thtr_rel_year           0.0000000     0.0000000
imdb_rating             1.0000000     1.0000000
imdb_num_votes          0.0000000     0.0000000
critics_score           1.0000000     1.0000000
best_pic_nomyes         0.0000000     0.0000000
best_pic_winyes         0.0000000     0.0000000
best_actor_winyes       0.0000000     0.0000000
best_actress_winyes     0.0000000     0.0000000
best_dir_winyes         0.0000000     0.0000000
top200_boxyes           0.0000000     0.0000000
BF                      0.2521327     0.2391994
PostProbs               0.0341000     0.0324000
R2                      0.7539000     0.7563000
dim                     4.0000000     5.0000000
logmarg             -3616.6568544 -3616.7095127

bmm_all


Call:
bas.lm(formula = audience_score ~ ., data = bayesian_movies, 
    prior = "BIC", modelprior = uniform())


 Marginal Posterior Inclusion Probabilities: 
          Intercept      feature_filmyes             dramayes  
            1.00000              0.06476              0.04309  
   mpaa_rating_Ryes      oscar_seasonyes     summer_seasonyes  
            0.20010              0.07704              0.04080  
            runtime        thtr_rel_year          imdb_rating  
            0.47067              0.09051              1.00000  
     imdb_num_votes        critics_score      best_pic_nomyes  
            0.05797              0.89157              0.13015  
    best_pic_winyes    best_actor_winyes  best_actress_winyes  
            0.03983              0.14532              0.14135  
    best_dir_winyes        top200_boxyes  
            0.06679              0.04772

# Visualize model
image(bmm_all, rotate = FALSE)

Discussion

The result of the modeling showed imdb_rating and critics_score give meaningful posterior probabilities.

Model Diagnostics

# Plot model
plot(bmm_all, which = 1)

# Observe the confidence intervals
ci <- coefficients(bmm_all)
confint(ci)

                             2.5%        97.5%          beta
Intercept            6.157608e+01 6.309979e+01  6.234769e+01
feature_filmyes     -7.825445e-01 1.281690e-01 -1.026910e-01
dramayes             0.000000e+00 0.000000e+00  1.578519e-02
mpaa_rating_Ryes    -2.093391e+00 4.140017e-05 -3.041432e-01
oscar_seasonyes     -1.082559e+00 9.183840e-04 -8.402533e-02
summer_seasonyes     0.000000e+00 0.000000e+00  1.399590e-02
runtime             -8.393681e-02 0.000000e+00 -2.574711e-02
thtr_rel_year       -5.266187e-02 5.900535e-05 -4.518100e-03
imdb_rating          1.376169e+01 1.663050e+01  1.497376e+01
imdb_num_votes      -7.961308e-07 1.714902e-06  2.101237e-07
critics_score        0.000000e+00 1.052716e-01  6.332453e-02
best_pic_nomyes      0.000000e+00 5.014580e+00  5.009061e-01
best_pic_winyes      0.000000e+00 0.000000e+00 -8.236391e-03
best_actor_winyes   -2.604697e+00 0.000000e+00 -2.903941e-01
best_actress_winyes -2.824991e+00 5.833662e-03 -3.090744e-01
best_dir_winyes     -1.037789e+00 9.836323e-02 -1.190122e-01
top200_boxyes        0.000000e+00 0.000000e+00  8.704627e-02
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"

Discussion

The plot showed extreme values above and below the zero midline with some outliers above the line.

Part 5: Prediction

The bayesian model will be used to predict the movies titled Acrimony produced by Tyler Perry. The variables are gotten from Rotten Tomatoes and IMDb databases.

data_predict <- data.frame(feature_film = "yes", drama = "yes", runtime = 120, thtr_rel_year = 2018, oscar_season = "no", summer_season = "yes", imdb_num_votes = 1096, best_pic_nom = "no", best_pic_win = "no", best_actor_win = "no", best_actress_win = "no", best_dir_win = "no", top200_box = "no", critics_score = 32, imdb_rating = 6.3, mpaa_rating_R = "yes")
new_bmm <- predict(newdata = data_predict, bmm_all, estimator = "BMA", se.fit = TRUE)
movie_predict <- confint(new_bmm, estimator = "BMA")
movie_predict

         2.5%    97.5%     pred
[1,] 36.94274 76.17199 57.33616
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"

Part 6: Conclusion

The result of the prediction showed a credible interval that is 37% and 77% with an audience_score of 57%. According to the Rotten Tomatoes site, the audience_score of the movie is 51%. The prediction does not perform too badly. Though, this is not a foolproof prediction.