Setup

Congratulations on getting a job as a data scientist at a Hollywood film studio! Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies and is interested in learning what attributes make a movie popular. As part of this project you will complete exploratory data analysis (EDA), modeling, and prediction.

Develop a Bayesian regression model to predict Audience Score. Some variables are in the original dataset provided, and others are new variables you will need to construct in the data manipulation section using the mutate function in dplyr.

Load Packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)

Load Data

load("movies.Rdata")

Part 1: Data

This dataset includes information from Rotten Tomatoes and IMDB for a random sample of movies.

Because random sampling was used, we can believe that the sample is in fact random. However, the sample is not randomly assigned into different groups due to the specific characteristics of each movie. A causal relationship cannot be established.


Part 2: Data Manipulation

Here we want to create new variables, each with two levels of either “yes” or "no:

movies <- movies %>%
  mutate(feature_film = as.factor(ifelse(title_type == "Feature Film", "yes","no")),
         drama = as.factor(ifelse(genre == "Drama", "yes", "no")),
         mpaa_rating_R = as.factor(ifelse(mpaa_rating == "R", "yes", "no")),
         oscar_season = as.factor(ifelse(thtr_rel_month %in% 10:12, "yes", "no")),
         summer_season = as.factor(ifelse(thtr_rel_month %in% 5:8, "yes", "no")))

Part 3: Exploratory Data Analysis

First, we want to find the relationship between audience score and our newly created variables

newvar <- movies %>%
  dplyr::select(feature_film, drama, mpaa_rating_R, oscar_season, summer_season, audience_score)

ggplot(data = newvar, aes(x = feature_film, y = audience_score, color = feature_film)) + geom_boxplot()  +
  labs(title = "Audience Score of Feature Film", x = "Feature Film", 
       y = "Audience Score")  + theme_classic()

From the box plot above, we see that the median score for feature film is much less than that of non-featured films.

ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
  geom_point(aes(color=factor(genre))) +
  xlab("Year Released") +
  ylab("Audience Score")

Here we are looking at other predictor variables that may affect audience score. There does not appear to be any obvious pattern for any particular genre or release year.

ggplot(data = newvar, aes(x = mpaa_rating_R, y= audience_score, color = mpaa_rating_R)) + geom_boxplot()  +
  labs(title = "Audience Score of MPAA Rating R", x = "MPAA Rating R", 
       y = "Audience Score")  + theme_classic()

Again, there does not appear to be any obvious pattern for MPAA Rating R.


Part 4: Modeling

There following are variables selected to perform the modeling analysis:

f_model <- movies %>%
  select(audience_score, feature_film, drama, runtime, mpaa_rating_R, 
         thtr_rel_year, oscar_season, summer_season, imdb_rating, 
         imdb_num_votes, critics_score, best_pic_nom, best_pic_win, 
         best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
  filter(!is.na(runtime))
score_model = bas.lm(audience_score ~ . , data = f_model)

Bayesian regression will be conducted using Bayesian Model Averaging (BMA) with these parameters to be used in the model:

Prior: Zeller-Siow Cauchy

Model Prior: Uniform (equal probabilities to all models)

Method: Markov Chain Monte Carlo (MCMC) to ensure efficient sampling method


Residuals VS Fitted Values

The model is likely to perform better prediction at higher scores.

plot(score_model, which = 1)

Model Probabilities

The plot displays probability of the models it is sampled.

plot(score_model, which = 2)

Model Complexity

The plot shows the model complexity which suggests the number of parameters in the model.

plot(score_model, which = 3)

Marginal Inclusion Probabilities

Posterior Inclusion Probability is analyzed below.

plot(score_model, which = 4, ask = F, cex.lab = 0.75, cex.axis = 0.75)

Model Selection

It is apparent that imdb_rating and critics_score have posterior inclusion probability greater than 0.5 and are likely to be included in the model.

print(score_model)
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = f_model)
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##            1.000000             0.018164             0.008494  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##            0.159697             0.048519             0.018601  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##            0.018996             0.019348             1.000000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##            0.010200             0.574202             0.022927  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##            0.008150             0.039335             0.035222  
##     best_dir_winyes        top200_boxyes  
##            0.015566             0.009816

The same results are produced, with the model having the highest posterior odds in runtime, imdb_rating, and critics_score.

image(score_model, rotate = F, cex.lab = 0.75, cex.axis=0.75)

Model Coefficients

Given the data, there is a 95% chance that the audience_score increases with an increase in the IMDb rating.

coef_movies <- coef(score_model)
coef_movies
## 
##  Marginal Posterior Summaries of Coefficients: 
## 
##  Using  BMA 
## 
##  Based on the top  65536 models 
##                      post mean   post SD     post p(B != 0)
## Intercept             6.235e+01   3.964e-01   1.000e+00    
## feature_filmyes      -3.539e-02   3.332e-01   1.816e-02    
## dramayes              1.934e-03   8.098e-02   8.494e-03    
## runtime              -8.806e-03   2.195e-02   1.597e-01    
## mpaa_rating_Ryes     -7.490e-02   3.749e-01   4.852e-02    
## thtr_rel_year        -9.038e-04   8.283e-03   1.860e-02    
## oscar_seasonyes      -2.253e-02   2.039e-01   1.900e-02    
## summer_seasonyes      2.250e-02   2.003e-01   1.935e-02    
## imdb_rating           1.532e+01   8.767e-01   1.000e+00    
## imdb_num_votes        2.649e-08   4.940e-07   1.020e-02    
## critics_score         4.156e-02   3.937e-02   5.742e-01    
## best_pic_nomyes       7.996e-02   6.367e-01   2.293e-02    
## best_pic_winyes      -2.294e-03   3.678e-01   8.150e-03    
## best_actor_winyes    -8.273e-02   4.689e-01   3.933e-02    
## best_actress_winyes  -7.941e-02   4.806e-01   3.522e-02    
## best_dir_winyes      -2.972e-02   3.138e-01   1.557e-02    
## top200_boxyes         1.715e-02   3.174e-01   9.816e-03
confint(coef_movies)
##                            2.5%      97.5%          beta
## Intercept           61.57289804 63.1156306  6.234769e+01
## feature_filmyes      0.00000000  0.0000000 -3.538671e-02
## dramayes             0.00000000  0.0000000  1.933600e-03
## runtime             -0.06497544  0.0000000 -8.805543e-03
## mpaa_rating_Ryes     0.00000000  0.0000000 -7.490287e-02
## thtr_rel_year        0.00000000  0.0000000 -9.038291e-04
## oscar_seasonyes      0.00000000  0.0000000 -2.253495e-02
## summer_seasonyes     0.00000000  0.0000000  2.250274e-02
## imdb_rating         13.74683722 16.7897388  1.532291e+01
## imdb_num_votes       0.00000000  0.0000000  2.649436e-08
## critics_score        0.00000000  0.1017715  4.155546e-02
## best_pic_nomyes      0.00000000  0.0000000  7.996231e-02
## best_pic_winyes      0.00000000  0.0000000 -2.294183e-03
## best_actor_winyes    0.00000000  0.0000000 -8.273357e-02
## best_actress_winyes  0.00000000  0.0000000 -7.940600e-02
## best_dir_winyes      0.00000000  0.0000000 -2.971645e-02
## top200_boxyes        0.00000000  0.0000000  1.714854e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

Part 5: Prediction

The film Arrival (2016) will be used to perform prediction using Bayesian Model Averaging (BMA).

The model predicts, with 95% credible interval, that Arrival is expected to have audience_score of 85.

Arrival <- data.frame(runtime = 116,
                            thtr_rel_year = 2016,
                            imdb_rating = 7.9,
                            imdb_num_votes = 619667,
                            critics_score = 94,
                            best_pic_nom = "no",
                            best_pic_win = "no",
                            best_actor_win = "no",
                            best_actress_win = "no",
                            best_dir_win = "no",
                            top200_box = "no",
                            feature_film = "yes",
                            drama = "yes",
                            mpaa_rating_R = "no",
                            oscar_season = "no",
                            summer_season = "yes")

prediction_model <- predict(score_model, Arrival, estimator = "BMA")$fit
prediction_model
## [1] 85.42587

The film’s actual audience_score is 82, which falls within the prediction interval.


Part 6: Conclusion

The most important predictors for audience_score prediction modeling are runtime, imdb_rating, and critics_score when using the Bayesian Model Averaging (BMA) method. Although the predicted score for Arrival is higher than the actual score by 3, it still falls within the prediction interval. The shortcomings for the prediction model include large variance and lack of unknown yet influential variables that could make significant impact.


Bayesian Statistics: Data Analysis Project by Cory Torrella
Statistics with R Specialization, Duke University