library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(tidyr)Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.
load("movies.Rdata")
df <- moviesThis dataset was gathered from a random sampling of movies from IMDB and Rotten Tomatoes that have release years between 1970 and 2016.
Since this was a random sample, as opposed to an experiment, only associations can be suggested, no causal relation can be confirmed from this data.
The results of this regression analysis can be generalized to the entire population of movies released in the time frame of the dataset.
Data manipulation will proceed as per the instructions in the assignment.
title_type: New variable should be called feature_film with levels yes (movies that are feature films) and nodf <- df %>% mutate(feature_film = ifelse(title_type=="Feature Film","yes","no"))genre: New variable should be called drama with levels yes (movies that are dramas) and nodf <- df %>% mutate(drama = ifelse(genre=="Drama","yes","no"))mpaa_rating: New variable should be called mpaa_rating_R with levels yes (movies that are R rated) and nodf <- df %>% mutate(mpaa_rating_R = ifelse(mpaa_rating=="R","yes","no"))Create two new variables based on thtr_rel_month:
oscar_season with levels yes (if movie is released in November, October, or December) and no (2 pt)summer_season with levels yes (if movie is released in May, June, July, or August) and nodf <- df %>% mutate(oscar_season = ifelse(thtr_rel_month %in% 10:12 ,"yes","no"),
summer_season = ifelse(thtr_rel_month %in% 5:8 ,"yes","no"))Confirm all new variables were added
tail(names(df),5)## [1] "feature_film" "drama" "mpaa_rating_R" "oscar_season"
## [5] "summer_season"
Conduct exploratory analysis between audience_score and the new variables created in the last section.
First, create a new data frame with just the columns we need.
explDF <- df %>% select(audience_score, feature_film, drama, mpaa_rating_R, oscar_season, summer_season)Next, create box plots for each variable against audience_score.
#Turn data frame into "long" form
gathDF <- gather(explDF,key=varname,value=val,-audience_score)
ggplot(data=gathDF, aes(x=val,y=audience_score,fill=val)) +
geom_boxplot() +
facet_grid(~varname) +
xlab("") + ylab("Audience Score") +
labs(title="Audience Score by Variable",fill="Values")The most interesting variable appears to be feature_film. The median for non-feature films is 85.5, while the median for feature films is 62. We’ll see if this ends up being an important factor in the model.
Here, I’ll use the BAS package for developing the Bayesian Regression Model. I will use the averaged model (BMA) for making predictions later.
Build model and print model diagnostics…
set.seed(123)
fit <- bas.lm(audience_score~feature_film+drama+
runtime+mpaa_rating_R+thtr_rel_year+
oscar_season+summer_season+imdb_rating+
imdb_num_votes+critics_score+
best_pic_nom+best_pic_win+best_actor_win+
best_actress_win+best_dir_win+top200_box,
data=df,
prior="BIC",
modelprior = uniform())
fit##
## Call:
## bas.lm(formula = audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = df, prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
The coefficients in the output demonstrate the likelihood that a variable is included in the posterior model. For instance, IMDB rating has a probability of 1, effectively guaranteeing that it will be included in the posterior model.
I will use the 2016 movie “Arrival”" to populate the criteria for prediction.
Make sure it is not in the dataset
grep("Arrival",df$title)## integer(0)
References for where the data for this movie come from…
Top Box Office
http://www.boxofficemojo.com/alltime/domestic.htm?page=1&p=.htm
Movie Data
http://www.imdb.com/title/tt2543164/
Critic Score
https://www.rottentomatoes.com/m/arrival_2016/
Run prediction through the model
testMovie <- data.frame(feature_film="yes",
drama="yes",
runtime=116,
mpaa_rating_R="no",
thtr_rel_year=2016,
oscar_season="yes",
summer_season="no",
imdb_rating=7.9,
imdb_num_votes=433683,
critics_score=94,
best_pic_nom="no",
best_pic_win="no",
best_actor_win="yes",
best_actress_win="no",
best_dir_win="no",
top200_box="no")
BMA_pred_movies = predict(newdata=testMovie, fit, estimator="BMA", se.fit=TRUE)
ci_bma_movies = confint(BMA_pred_movies, estimator="BMA")
ci_bma_movies## 2.5% 97.5% pred
## [1,] 65.09299 104.1806 85.31738
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
The credible interval for the Audience Score from Rotten Tomatoes for the movie Arrival is 65 to 104, with a prediction of 85. The actual score from Rotten Tomatoes is 82.
Conclusion not repetitive of earlier statements…
Since the true Audience Score was within the credible interval, our model performed quite well with this movie.
Cohesive synthesis…
The model performs well when using the limited variables required by the assignment. More testing is needed before determining whether this model is a good fit overall, or just performed well with the single prediction, as the assignment stipulates.
Discussion of shortcomings…
The major shortcoming of this model is in the testing phase. Using the current number of IMDB votes and rating as input into the predictor is extrapolating the results. Further, testing on only a single movie does not mean that this model performs well overall. It would be much more appropriate to hold back a portion of the original data and test with that, in order to get a more accurate measure of model accuracy.