library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies
. Delete this note when before you submit your work.
load("movies.Rdata")
According to the cookbook provided, the data set is comprised of 651 randomly sampled movies produced and released before 2016. As the sample is a random sample, the results are able to be generalized to the population (all movies produced and released before 2016), however, since it is not an experiment with random assignment, no causality can be inferred based on the correlations observed.
movies <- movies %>%
mutate(feature_film = if_else(title_type == 'Feature Film', 'yes', 'no'),
drama = if_else(genre == 'Drama', 'yes', 'no'),
mpaa_rating_R = if_else(mpaa_rating == 'R', 'yes', 'no'),
oscar_season = if_else((thtr_rel_month == 10 |
thtr_rel_month == 11 |
thtr_rel_month == 12), 'yes', 'no'),
summer_season = if_else((thtr_rel_month == 5 |
thtr_rel_month == 6 |
thtr_rel_month == 7 |
thtr_rel_month == 8), 'yes', 'no'))
#Converting the new variables into factors
movies <- movies %>%
mutate(feature_film = as.factor(feature_film),
drama = as.factor(drama),
mpaa_rating_R = as.factor(mpaa_rating_R),
oscar_season = as.factor(oscar_season),
summer_season = as.factor(summer_season))
movies %>%
select(audience_score, feature_film, drama, mpaa_rating_R,
oscar_season, summer_season) %>%
summary()
## audience_score feature_film drama mpaa_rating_R oscar_season
## Min. :11.00 no : 60 no :346 no :322 no :460
## 1st Qu.:46.00 yes:591 yes:305 yes:329 yes:191
## Median :65.00
## Mean :62.36
## 3rd Qu.:80.00
## Max. :97.00
## summer_season
## no :443
## yes:208
##
##
##
##
According to this summary, the mean of the audience scores is 62.36 and the median is 65, so the sample is probably a little left-skewed. The range is 11-97. The drama and mpaa_rating categories are evenly represented in the dataset while the feature_film, oscar_season and summer_season seem to have a bias towards one of the “no” or “yes” bins.
movies %>%
ggplot(aes(x=feature_film, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()
This boxplot shows that apart from a few outliers, movies that are not feature films tend to garner a bigger audience score. However we should be careful with this interpretation as there are many more feature films in the dataset than no feature films.
movies %>%
ggplot(aes(x=drama, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()
According to this boxplot, drama movies seem to have a bigger median audience score with similar variability.
movies %>%
ggplot(aes(x=mpaa_rating_R, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()
Movies with a mpaa rating of R seem to have a lower median audience score but the results are mostly similar.
movies %>%
ggplot(aes(x=oscar_season, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()
Movies in the oscar season have a higher audience score with similar variability to those who are not oscar season. The number of movies in the oscar season is small (191) compared to those not in oscar season (460) in the dataset, so interpretation must be careful.
movies %>%
ggplot(aes(x=summer_season, y=audience_score)) + geom_boxplot(fill='#f46542') + theme_minimal()
Movies in summer season appear to have a smaller median audience score than those not in summer season.The smaller number of movies in summer season (208) than not in summer season (443) also shows a need for a careful interpretation.
set.seed(42)
movies_no_na = na.omit(movies)
n = nrow(movies_no_na)
n_cv = 50
ape = matrix(NA, ncol=4, nrow=n_cv)
colnames(ape) = c("BMA", "BPM", "HPM", "MPM")
for (i in 1:n_cv) {
train = sample(1:n, size=round(.90*n), replace=FALSE)
lmovies_train = movies_no_na[train,]
lmovies_test = movies_no_na[-train,]
bma_train_movie = bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=lmovies_train,
prior="BIC", modelprior=uniform(), initprobs="eplogp")
yhat_bma = predict(bma_train_movie, lmovies_test, estimator="BMA")$fit
yhat_hpm = predict(bma_train_movie, lmovies_test, estimator="HPM")$fit
yhat_mpm = predict(bma_train_movie, lmovies_test, estimator="MPM")$fit
yhat_bpm = predict(bma_train_movie, lmovies_test, estimator="BPM")$fit
ape[i, "BMA"] = cv.summary.bas(yhat_bma, lmovies_test$audience_score)
ape[i, "BPM"] = cv.summary.bas(yhat_bpm, lmovies_test$audience_score)
ape[i, "HPM"] = cv.summary.bas(yhat_hpm, lmovies_test$audience_score)
ape[i, "MPM"] = cv.summary.bas(yhat_mpm, lmovies_test$audience_score)
}
boxplot(ape)
set.seed(42)
movies_no_na = na.omit(movies)
n = nrow(movies_no_na)
n_cv = 50
ape = matrix(NA, ncol=4, nrow=n_cv)
colnames(ape) = c("BMA", "BPM", "HPM", "MPM")
for (i in 1:n_cv) {
train = sample(1:n, size=round(.90*n), replace=FALSE)
lmovies_train = movies_no_na[train,]
lmovies_test = movies_no_na[-train,]
bma_train_movie = bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=lmovies_train,
prior="AIC", modelprior=uniform(), initprobs="eplogp")
yhat_bma = predict(bma_train_movie, lmovies_test, estimator="BMA")$fit
yhat_hpm = predict(bma_train_movie, lmovies_test, estimator="HPM")$fit
yhat_mpm = predict(bma_train_movie, lmovies_test, estimator="MPM")$fit
yhat_bpm = predict(bma_train_movie, lmovies_test, estimator="BPM")$fit
ape[i, "BMA"] = cv.summary.bas(yhat_bma, lmovies_test$audience_score)
ape[i, "BPM"] = cv.summary.bas(yhat_bpm, lmovies_test$audience_score)
ape[i, "HPM"] = cv.summary.bas(yhat_hpm, lmovies_test$audience_score)
ape[i, "MPM"] = cv.summary.bas(yhat_mpm, lmovies_test$audience_score)
}
boxplot(ape)
According to those results, choosing the AIC prior with the BMA estimator seems to be the better choice for reducing the average prediction errors.
movies.lm <- bas.lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box , data=movies,
prior="AIC", modelprior=uniform(), initprobs="eplogp")
## Warning in bas.lm(audience_score ~ feature_film + drama + runtime +
## mpaa_rating_R + : dropping 1 rows due to missing data
summary(movies.lm)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.3811635 0.0000 0.0000000 0.0000000
## dramayes 0.3794742 0.0000 0.0000000 0.0000000
## runtime 0.8560184 1.0000 1.0000000 1.0000000
## mpaa_rating_Ryes 0.6831204 1.0000 1.0000000 1.0000000
## thtr_rel_year 0.5764312 1.0000 0.0000000 0.0000000
## oscar_seasonyes 0.3647827 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.4160076 0.0000 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.4505720 0.0000 0.0000000 0.0000000
## critics_score 0.9666890 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.6933310 1.0000 1.0000000 1.0000000
## best_pic_winyes 0.3032774 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.5155485 1.0000 1.0000000 0.0000000
## best_actress_winyes 0.5941480 1.0000 1.0000000 1.0000000
## best_dir_winyes 0.3621784 0.0000 0.0000000 0.0000000
## top200_boxyes 0.3051228 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9783606 0.9298905
## PostProbs NA 0.0017 0.0017000 0.0016000
## R2 NA 0.7601 0.7593000 0.7585000
## dim NA 9.0000 8.0000000 7.0000000
## logmarg NA -3604.4206 -3604.4424632 -3604.4932746
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 1.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 1.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 1.0000000 1.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 1.0000000 1.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.8903369 0.7914465
## PostProbs 0.0015000 0.0013000
## R2 0.7592000 0.7592000
## dim 8.0000000 8.0000000
## logmarg -3604.5367415 -3604.6544792
According to the results obtained from this summary, the best model for this data will be a model containing runtime, mpaa_rating_R, thtr_rel_year, imdb_rating, critics_score, best_pic_nom, best_actor_win and best_actress_win.
movies_final.lm <- bas.lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data = movies, prior='AIC', modelprior=uniform(), initprobs='eplogp')
## Warning in bas.lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year
## + : dropping 1 rows due to missing data
summary(movies_final.lm)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## runtime 0.8628615 1.0000 1.0000000 1.0000000
## mpaa_rating_Ryes 0.7010683 1.0000 1.0000000 1.0000000
## thtr_rel_year 0.4914510 1.0000 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000 1.0000000 1.0000000
## critics_score 0.9782012 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.6956091 1.0000 1.0000000 1.0000000
## best_actor_winyes 0.5429452 1.0000 1.0000000 0.0000000
## best_actress_winyes 0.5834202 1.0000 1.0000000 1.0000000
## BF NA 1.0000 0.9783606 0.9298905
## PostProbs NA 0.0666 0.0652000 0.0619000
## R2 NA 0.7601 0.7593000 0.7585000
## dim NA 9.0000 8.0000000 7.0000000
## logmarg NA -3604.4206 -3604.4424632 -3604.4932746
## model 4 model 5
## Intercept 1.0000000 1.0000000
## runtime 1.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 1.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 1.0000000 1.0000000
## best_actor_winyes 0.0000000 1.0000000
## best_actress_winyes 1.0000000 0.0000000
## BF 0.8903369 0.7062613
## PostProbs 0.0593000 0.0471000
## R2 0.7592000 0.7583000
## dim 8.0000000 7.0000000
## logmarg -3604.5367415 -3604.7683561
confint(coef(movies_final.lm))
## 2.5% 97.5% beta
## Intercept 61.61295213 63.136609084 62.34769231
## runtime -0.09087433 0.000000000 -0.04627277
## mpaa_rating_Ryes -2.82778849 0.002599150 -1.05608442
## thtr_rel_year -0.11114646 0.007007131 -0.02459562
## imdb_rating 13.77385201 16.180712792 14.98141407
## critics_score 0.01920235 0.114034122 0.06559367
## best_pic_nomyes 0.00000000 8.398831718 3.08695161
## best_actor_winyes -3.72767988 0.226782251 -0.97505211
## best_actress_winyes -4.46217267 0.030184526 -1.23735197
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
Apart from critics_score and imdb_rating, most of the 95% credible intervals cross 0. However this is to be expected using the AIC prior, as the model may use non-significant variables in order to improve prediction compared to the BIC prior. As the main objective of this project is prediction of the audience_score for a new movie, this was a choice made by me. For true explanatory power, perhaps a simple model with critics_score and imdb_rating would suffice.
plot(movies_final.lm, which = 1, add.smooth = F)
image(movies_final.lm, subset = -1, rotate = F)
According to the diagnostics, the residual vs fitted plot of the model seems to follow a pattern signifying the presence of a non-linear relationship between the predictors. The assumption of equal variance also seems to not being fulfilled in this model due to the reisduals not being spread equally along the range of the predictors.
As such, any predictions or credible intervals estimated from this model may be wrong and as such, one must be careful using it as the conclusions may be plain wrong.
The movie used for prediction will be Thor: Ragnarok (2017) using information from Rotten Tomatoes and IMDB.
thor.df <- data.frame(runtime = 130, mpaa_rating_R = 'no',
thtr_rel_year = 2017, imdb_rating = 8.2, critics_score = 92, best_pic_nom = 'no',
best_actor_win = 'no',
best_actress_win = 'no')
thor.predict <- predict(movies_final.lm, newdata = thor.df, estimator = 'BMA')
thor.predict$Ybma
## [,1]
## [1,] 89.31444
The prediction made by the model was an audience score of 89.3%. As the true audience score according to rotten tomatoes is 88%, I’m very happy with the prediction capabilities of the model even with all the limitations described in the diagnostics.
From the results, the predictive model built fulfilled the expectation even though the diagnostics were not very good and it used non-significant predictors. Some limitations of this work consisted on the very small sample of movies (651) in the dataset as well as the fact that basically all of the new features created were useless for the model and also the small sample for prediction (n = 1).
As such, better feature engineering and the a bigger sample of movies to train the model with may prove to be benefical in building a better predictive model for the prediction of audience scores.