Congratulations on getting a job as a data scientist at a Hollywood film studio! Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies and is interested in learning what attributes make a movie popular. As part of this project you will complete exploratory data analysis (EDA), modeling, and prediction.
Develop a Bayesian regression model to predict Audience Score. Some variables are in the original dataset provided, and others are new variables you will need to construct in the data manipulation section using the mutate function in dplyr.
library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
load("movies.Rdata")
This dataset includes information from Rotten Tomatoes and IMDb for a random sample of movies.
Because the movies were randomly sampled, the results can be generalized to the population of movies in these databases. However, the movies were not randomly assigned to groups, so a causal relationship cannot be established.
Here we create five new variables, each with two levels, "yes" or "no": feature_film, drama, mpaa_rating_R, oscar_season, and summer_season.
movies <- movies %>%
mutate(feature_film = as.factor(ifelse(title_type == "Feature Film", "yes","no")),
drama = as.factor(ifelse(genre == "Drama", "yes", "no")),
mpaa_rating_R = as.factor(ifelse(mpaa_rating == "R", "yes", "no")),
oscar_season = as.factor(ifelse(thtr_rel_month %in% 10:12, "yes", "no")),
summer_season = as.factor(ifelse(thtr_rel_month %in% 5:8, "yes", "no")))
First, we want to find the relationship between audience score and our newly created variables.
newvar <- movies %>%
dplyr::select(feature_film, drama, mpaa_rating_R, oscar_season, summer_season, audience_score)
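As a quick optional sanity check (not part of the original write-up), summary() tabulates the yes/no counts for each new indicator:
# Optional check: level counts for the new yes/no variables (audience_score is summarized numerically)
summary(newvar)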
ggplot(data = newvar, aes(x = feature_film, y = audience_score, color = feature_film)) + geom_boxplot() +
labs(title = "Audience Score of Feature Film", x = "Feature Film",
y = "Audience Score") + theme_classic()From the box plot above, we see that the median score for feature film is much less than that of non-featured films.
ggplot(movies, aes(x=thtr_rel_year, y=audience_score)) +
geom_point(aes(color=factor(genre))) +
xlab("Year Released") +
ylab("Audience Score")Here we are looking at other predictor variables that may affect audience score. There does not appear to be any obvious pattern for any particular genre or release year.
ggplot(data = newvar, aes(x = mpaa_rating_R, y= audience_score, color = mpaa_rating_R)) + geom_boxplot() +
labs(title = "Audience Score of MPAA Rating R", x = "MPAA Rating R",
y = "Audience Score") + theme_classic()Again, there does not appear to be any obvious pattern for MPAA Rating R.
The following variables were selected for the modeling analysis: audience_score, feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box.
f_model <- movies %>%
select(audience_score, feature_film, drama, runtime, mpaa_rating_R,
thtr_rel_year, oscar_season, summer_season, imdb_rating,
imdb_num_votes, critics_score, best_pic_nom, best_pic_win,
best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
filter(!is.na(runtime))
score_model <- bas.lm(audience_score ~ ., data = f_model)
Bayesian regression is conducted using Bayesian Model Averaging (BMA) with the following modeling choices (an explicit version of the call is sketched after this list):
Prior: Zellner-Siow Cauchy
Model prior: uniform (equal prior probability for all models)
Method: Markov chain Monte Carlo (MCMC) for efficient sampling of the model space
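The bas.lm() call above does not pass these arguments, so it relies on the package defaults; the sketch below shows how the stated choices could be supplied explicitly (score_model_explicit is a hypothetical name, and this is not the call whose output is printed in the remainder of the report):
# Sketch only: explicit prior, model prior, and sampling method matching the list above
score_model_explicit <- bas.lm(audience_score ~ ., data = f_model,
  prior = "ZS-null",       # Zellner-Siow Cauchy prior on the coefficients
  modelprior = uniform(),  # equal prior probability for every model
  method = "MCMC")         # MCMC exploration of the model space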
Residuals vs. Fitted Values
The model is likely to give better predictions at higher scores.
plot(score_model, which = 1)
Model Probabilities
The plot shows the cumulative probability of the models in the order they are sampled.
plot(score_model, which = 2)
Model Complexity
The plot shows model complexity, relating each sampled model's fit to the number of predictors it includes.
plot(score_model, which = 3)
Marginal Inclusion Probabilities
Posterior inclusion probabilities are analyzed below.
plot(score_model, which = 4, ask = F, cex.lab = 0.75, cex.axis = 0.75)
Model Selection
It is apparent that imdb_rating and critics_score have posterior inclusion probabilities greater than 0.5 and are likely to be included in the model.
print(score_model)
##
## Call:
## bas.lm(formula = audience_score ~ ., data = f_model)
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.000000 0.018164 0.008494
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.159697 0.048519 0.018601
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.018996 0.019348 1.000000
## imdb_num_votes critics_score best_pic_nomyes
## 0.010200 0.574202 0.022927
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.008150 0.039335 0.035222
## best_dir_winyes top200_boxyes
## 0.015566 0.009816
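The predictors crossing the 0.5 threshold noted above can also be extracted programmatically; a small optional snippet, assuming the fitted bas object exposes the probne0 and namesx components as in current versions of BAS:
# Optional: terms with marginal posterior inclusion probability above 0.5
score_model$namesx[score_model$probne0 > 0.5]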
The image of the model space below tells the same story: the model with the highest posterior odds includes runtime, imdb_rating, and critics_score.
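As an optional numeric companion to the model-space image below, summary() on a bas object lists the top-ranked models along with their posterior probabilities:
# Optional: top-ranked models and their posterior probabilities
summary(score_model)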
image(score_model, rotate = F, cex.lab = 0.75, cex.axis = 0.75)
Model Coefficients
Given the data, there is a 95% posterior probability that audience_score increases by roughly 13.7 to 16.8 points for each additional point of IMDb rating, holding the other predictors constant.
coef_movies <- coef(score_model)
coef_movies
##
## Marginal Posterior Summaries of Coefficients:
##
## Using BMA
##
## Based on the top 65536 models
## post mean post SD post p(B != 0)
## Intercept 6.235e+01 3.964e-01 1.000e+00
## feature_filmyes -3.539e-02 3.332e-01 1.816e-02
## dramayes 1.934e-03 8.098e-02 8.494e-03
## runtime -8.806e-03 2.195e-02 1.597e-01
## mpaa_rating_Ryes -7.490e-02 3.749e-01 4.852e-02
## thtr_rel_year -9.038e-04 8.283e-03 1.860e-02
## oscar_seasonyes -2.253e-02 2.039e-01 1.900e-02
## summer_seasonyes 2.250e-02 2.003e-01 1.935e-02
## imdb_rating 1.532e+01 8.767e-01 1.000e+00
## imdb_num_votes 2.649e-08 4.940e-07 1.020e-02
## critics_score 4.156e-02 3.937e-02 5.742e-01
## best_pic_nomyes 7.996e-02 6.367e-01 2.293e-02
## best_pic_winyes -2.294e-03 3.678e-01 8.150e-03
## best_actor_winyes -8.273e-02 4.689e-01 3.933e-02
## best_actress_winyes -7.941e-02 4.806e-01 3.522e-02
## best_dir_winyes -2.972e-02 3.138e-01 1.557e-02
## top200_boxyes 1.715e-02 3.174e-01 9.816e-03
confint(coef_movies)
## 2.5% 97.5% beta
## Intercept 61.57289804 63.1156306 6.234769e+01
## feature_filmyes 0.00000000 0.0000000 -3.538671e-02
## dramayes 0.00000000 0.0000000 1.933600e-03
## runtime -0.06497544 0.0000000 -8.805543e-03
## mpaa_rating_Ryes 0.00000000 0.0000000 -7.490287e-02
## thtr_rel_year 0.00000000 0.0000000 -9.038291e-04
## oscar_seasonyes 0.00000000 0.0000000 -2.253495e-02
## summer_seasonyes 0.00000000 0.0000000 2.250274e-02
## imdb_rating 13.74683722 16.7897388 1.532291e+01
## imdb_num_votes 0.00000000 0.0000000 2.649436e-08
## critics_score 0.00000000 0.1017715 4.155546e-02
## best_pic_nomyes 0.00000000 0.0000000 7.996231e-02
## best_pic_winyes 0.00000000 0.0000000 -2.294183e-03
## best_actor_winyes 0.00000000 0.0000000 -8.273357e-02
## best_actress_winyes 0.00000000 0.0000000 -7.940600e-02
## best_dir_winyes 0.00000000 0.0000000 -2.971645e-02
## top200_boxyes 0.00000000 0.0000000 1.714854e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The film Arrival (2016) will be used to perform prediction using Bayesian Model Averaging (BMA).
Using BMA, the model's point prediction for Arrival's audience_score is about 85.
Arrival <- data.frame(runtime = 116,
thtr_rel_year = 2016,
imdb_rating = 7.9,
imdb_num_votes = 619667,
critics_score = 94,
best_pic_nom = "no",
best_pic_win = "no",
best_actor_win = "no",
best_actress_win = "no",
best_dir_win = "no",
top200_box = "no",
feature_film = "yes",
drama = "yes",
mpaa_rating_R = "no",
oscar_season = "no",
summer_season = "yes")
prediction_model <- predict(score_model, Arrival, estimator = "BMA")$fit
prediction_model
## [1] 85.42587
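The 95% credible interval is not shown by $fit alone; a sketch for obtaining it, assuming BAS's se.fit option and the confint() method for pred.bas objects:
# Sketch: 95% prediction (credible) interval for Arrival under BMA
pred_arrival <- predict(score_model, Arrival, estimator = "BMA", se.fit = TRUE)
confint(pred_arrival, parm = "pred")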
The film’s actual audience_score is 82, which falls within the prediction interval.
The most important predictors of audience_score under Bayesian Model Averaging (BMA) are runtime, imdb_rating, and critics_score. Although the predicted score for Arrival is about 3 points higher than the actual score, it still falls within the prediction interval. Shortcomings of the model include its large predictive variance and the possible omission of influential variables that are not in the dataset.
Bayesian Statistics: Data Analysis Project by Cory Torrella
Statistics with R Specialization, Duke University