Bayesian Modeling and Prediction for Movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(tidyr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")
df <- movies

Part 1: Data

This dataset was gathered from a random sampling of movies from IMDB and Rotten Tomatoes that have release years between 1970 and 2016.

Since this was a random sample, as opposed to an experiment, only associations can be suggested, no causal relation can be confirmed from this data.

The results of this regression analysis can be generalized to the entire population of movies released in the time frame of the dataset.

Part 2: Data manipulation

Data manipulation will proceed as per the instructions in the assignment.

Create new variable based on title_type: New variable should be called feature_film with levels yes (movies that are feature films) and no

df <- df %>% mutate(feature_film = ifelse(title_type=="Feature Film","yes","no"))

Create new variable based on genre: New variable should be called drama with levels yes (movies that are dramas) and no

df <- df %>% mutate(drama = ifelse(genre=="Drama","yes","no"))

Create new variable based on mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and no

df <- df %>% mutate(mpaa_rating_R = ifelse(mpaa_rating=="R","yes","no"))

Create two new variables based on thtr_rel_month:
1. New variable called oscar_season with levels yes (if movie is released in November, October, or December) and no (2 pt)
2. New variable called summer_season with levels yes (if movie is released in May, June, July, or August) and no

df <- df %>% mutate(oscar_season = ifelse(thtr_rel_month %in% 10:12 ,"yes","no"),
                    summer_season = ifelse(thtr_rel_month %in% 5:8 ,"yes","no"))

Confirm all new variables were added

tail(names(df),5)

## [1] "feature_film"  "drama"         "mpaa_rating_R" "oscar_season" 
## [5] "summer_season"

Part 3: Exploratory data analysis

Conduct exploratory analysis between audience_score and the new variables created in the last section.

First, create a new data frame with just the columns we need.

explDF <- df %>% select(audience_score, feature_film, drama, mpaa_rating_R, oscar_season, summer_season)

Next, create box plots for each variable against audience_score.

#Turn data frame into "long" form
gathDF <- gather(explDF,key=varname,value=val,-audience_score)

ggplot(data=gathDF, aes(x=val,y=audience_score,fill=val)) + 
  geom_boxplot() +
  facet_grid(~varname) +
  xlab("") + ylab("Audience Score") +
  labs(title="Audience Score by Variable",fill="Values")

The most interesting variable appears to be feature_film. The median for non-feature films is 85.5, while the median for feature films is 62. We’ll see if this ends up being an important factor in the model.

Part 4: Modeling

Here, I’ll use the BAS package for developing the Bayesian Regression Model. I will use the averaged model (BMA) for making predictions later.

Build model and print model diagnostics…

set.seed(123)
fit <- bas.lm(audience_score~feature_film+drama+
                runtime+mpaa_rating_R+thtr_rel_year+
                oscar_season+summer_season+imdb_rating+
                imdb_num_votes+critics_score+
                best_pic_nom+best_pic_win+best_actor_win+
                best_actress_win+best_dir_win+top200_box,
              data=df,
              prior="BIC",
              modelprior = uniform())
fit

## 
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime +     mpaa_rating_R + thtr_rel_year + oscar_season + summer_season +     imdb_rating + imdb_num_votes + critics_score + best_pic_nom +     best_pic_win + best_actor_win + best_actress_win + best_dir_win +     top200_box, data = df, prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762

The coefficients in the output demonstrate the likelihood that a variable is included in the posterior model. For instance, IMDB rating has a probability of 1, effectively guaranteeing that it will be included in the posterior model.

Part 5: Prediction

I will use the 2016 movie “Arrival”" to populate the criteria for prediction.

Make sure it is not in the dataset

grep("Arrival",df$title)

## integer(0)

References for where the data for this movie come from…

Top Box Office
http://www.boxofficemojo.com/alltime/domestic.htm?page=1&p=.htm
Movie Data
http://www.imdb.com/title/tt2543164/
Critic Score
https://www.rottentomatoes.com/m/arrival_2016/

Run prediction through the model

testMovie <- data.frame(feature_film="yes",
                        drama="yes",
                        runtime=116,
                        mpaa_rating_R="no",
                        thtr_rel_year=2016,
                        oscar_season="yes",
                        summer_season="no",
                        imdb_rating=7.9,
                        imdb_num_votes=433683,
                        critics_score=94,
                        best_pic_nom="no",
                        best_pic_win="no",
                        best_actor_win="yes",
                        best_actress_win="no",
                        best_dir_win="no",
                        top200_box="no")
BMA_pred_movies =  predict(newdata=testMovie, fit, estimator="BMA", se.fit=TRUE)
ci_bma_movies = confint(BMA_pred_movies, estimator="BMA")
ci_bma_movies

##          2.5%    97.5%     pred
## [1,] 65.09299 104.1806 85.31738
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

The credible interval for the Audience Score from Rotten Tomatoes for the movie Arrival is 65 to 104, with a prediction of 85. The actual score from Rotten Tomatoes is 82.

Part 6: Conclusion

Conclusion not repetitive of earlier statements…
Since the true Audience Score was within the credible interval, our model performed quite well with this movie.

Cohesive synthesis…
The model performs well when using the limited variables required by the assignment. More testing is needed before determining whether this model is a good fit overall, or just performed well with the single prediction, as the assignment stipulates.

Discussion of shortcomings…
The major shortcoming of this model is in the testing phase. Using the current number of IMDB votes and rating as input into the predictor is extrapolating the results. Further, testing on only a single movie does not mean that this model performs well overall. It would be much more appropriate to hold back a portion of the original data and test with that, in order to get a more accurate measure of model accuracy.