Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(readr)

Load data

The dataset is a movie reviews data set with 651 observations and 32 variables.

load("movies.Rdata")
str(movies)
## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Part 1: Data

The dataset comes from two data sources. The IMDB and Rotten tomatoes. The observations that make up this sample comes and from 212 movie studios with movies from 1970 to 2014, and contains 11 different genres. The sample in my opinion is well diversified, has enough sample size and encompasses all genres of movies from the last 4 decades with lot of interesting features that are relevant to answer the question of predicting audience scores. A suitable model built from the data set should be able generalize well for new data.


Part 2: Data manipulation

Here we are creating new features from existing variables using mutate funcation in the dplr package. The features created are

  1. feature_file - a factor variable with ‘yes’ if the movie type is a ‘Feature Film’ and ‘no’ otherwise
  2. drama - a factor variable with ‘yes’ if the movie genre is ‘Drama’ and ‘no’ otherwise
  3. mpaa_rating_r - a factor variable with ‘yes’ if the movie’s mpaa_rating is ‘R’ and ‘no’ otherwise
  4. oscar_season - a factor variable with ‘yes’ if the movie is released in the months on Oct, Nov and Dec. It will be ‘no’ otherwise
  5. summer_season - a factor variable with ‘yes’ if the movie is released in the months on May, June and July. It will be ‘no’ otherwise
movies <- mutate(movies, feature_film = factor(ifelse(movies$title_type == "Feature Film", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, drama = factor(ifelse(movies$genre == "Drama", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, mpaa_rating_R = factor(ifelse(movies$mpaa_rating == "R", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, oscar_season = factor(ifelse(movies$thtr_rel_month %in% c(10,11,12), 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, summer_season = factor(ifelse(movies$thtr_rel_month %in% c(5,6,7), 'yes', 'no'), levels = c('yes', 'no')))

Part 3: Exploratory data analysis

  1. Feature_film on audience_score - The mean and the median values of the audience_Score is significantly higher for movie types that are not ‘feature films’ compared to movie types that are ‘feature film’. The following summay and the plot attests to this fact.
select(movies, feature_film, audience_score) %>%
group_by(feature_film) %>%
summarise(audience_score = mean(audience_score))
## # A tibble: 2 x 2
##   feature_film audience_score
##         <fctr>          <dbl>
## 1          yes       60.46531
## 2           no       81.05000
qplot(feature_film, audience_score, data = movies, geom = 'boxplot', main = "Feature_film on Audience_score")

2. Drama on Audience_score - movies under genre of ‘drama’ has a slightly higher audience_score distribution that other genres.

select(movies, drama, audience_score) %>%
group_by(drama) %>%
summarise(audience_score = mean(audience_score))
## # A tibble: 2 x 2
##    drama audience_score
##   <fctr>          <dbl>
## 1    yes       65.34754
## 2     no       59.73121
qplot(drama, audience_score, data = movies, geom = 'boxplot', main = "Drama genre vs Audience_Score")

  1. mpaa_rating_r on audience_score - ‘R’ rated movies have very slight lower mean audience score than other movies. But this is very slight and may not be statistically significant.
select(movies, mpaa_rating_R, audience_score) %>%
group_by(mpaa_rating_R) %>%
summarise(audience_score = mean(audience_score))
## # A tibble: 2 x 2
##   mpaa_rating_R audience_score
##          <fctr>          <dbl>
## 1           yes       62.04255
## 2            no       62.68944
qplot(mpaa_rating_R, audience_score, data = movies, geom = 'boxplot', main = "mpaa_rating of R on Audience_score")

  1. oscar_season on audience_score - movies released in the oscar season has a slightly higher audience_score distribution. It is not clear right now if this difference is significant enough for this feature to be selected to predict the audience score in the selected model.
select(movies, oscar_season, audience_score) %>%
group_by(oscar_season) %>%
summarise(audience_score = mean(audience_score))
## # A tibble: 2 x 2
##   oscar_season audience_score
##         <fctr>          <dbl>
## 1          yes       63.68586
## 2           no       61.81304
qplot(oscar_season, audience_score, data = movies, geom = 'boxplot', main = "Oscar season movies on Audience_score" )

  1. summer_season on audience_score - movies released in summer have very slightly lower mean audience score than other movies. But this is very slight and may not be statistically significant.
select(movies, summer_season, audience_score) %>%
group_by(summer_season) %>%
summarise(audience_score = mean(audience_score))
## # A tibble: 2 x 2
##   summer_season audience_score
##          <fctr>          <dbl>
## 1           yes       62.00610
## 2            no       62.48255
qplot(summer_season, audience_score, data = movies, geom = 'boxplot', main = "Summer season movies on Audience_score")

Part 4: Modeling

As instructed, a filtered list of feature is selected from the original dataset with the following features and the na.omit() is applied ot remove the the one observation that has ‘na’ values

movies_ds <- movies[,c('feature_film', 'drama', 'runtime', 'mpaa_rating_R', 'thtr_rel_year', 'oscar_season', 'summer_season', 'imdb_rating', 'critics_score', 'best_pic_nom', 'best_pic_win', 'best_actor_win', 'best_actress_win', 'best_dir_win', 'top200_box', 'audience_score')]

movies_ds = na.omit(movies_ds)

Bayesina Model Averaging (BMA) is used to select a parsimonious model to predict the target variable. The BPM (Best Predictive model) and HPM (Highest probability model) both have the same 3 predictors namely runtime, imdb_rating, critics_score. These set of predictors have the least RMSE(Root Mean Square Error) and hence it is selected to make the prediction

bma_movies = bas.lm(audience_score ~ . , data=movies_ds, prior="BIC", modelprior=uniform(), initprobs="eplogp")

BPM_pred_movies <- predict(bma_movies, estimator = "BPM", se.fit = TRUE)
bma_movies$namesx[BPM_pred_movies$bestmodel + 1]
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"
cv.summary.bas(BPM_pred_movies$fit, movies_ds$audience_score)
## [1] 10.00968
MPM_pred_movies <- predict(bma_movies, estimator = "MPM")
bma_movies$namesx[MPM_pred_movies$bestmodel + 1]
## [1] "Intercept"     "imdb_rating"   "critics_score"
cv.summary.bas(MPM_pred_movies$fit, movies_ds$audience_score)
## [1] 10.05972
HPM_pred_movies <- predict(bma_movies, estimator = "HPM")
bma_movies$namesx[HPM_pred_movies$bestmodel + 1]
## [1] "Intercept"     "runtime"       "imdb_rating"   "critics_score"
cv.summary.bas(HPM_pred_movies$fit, movies_ds$audience_score)
## [1] 10.00968

Part 5: Prediction

To test the prediction a dataset of 8 movies released in 2006 is compiled with the necessary predictors. This teste dataset is loaded below

movies_test_ds <- read_csv('movies_test.csv')
## Parsed with column specification:
## cols(
##   title = col_character(),
##   runtime = col_integer(),
##   imdb_rating = col_double(),
##   critics_score = col_integer(),
##   audience_score = col_integer()
## )
str(movies_test_ds, give.attr = F)
## Classes 'tbl_df', 'tbl' and 'data.frame':    8 obs. of  5 variables:
##  $ title         : chr  "Split" "Suicide Squad" "LBJ" "Arrival" ...
##  $ runtime       : int  117 123 98 116 107 109 89 92
##  $ imdb_rating   : num  7.3 6.2 6 8 7.6 5.3 6.6 6.5
##  $ critics_score : int  75 26 52 94 96 30 90 74
##  $ audience_score: int  79 61 63 82 89 42 69 68

A parsimonious model is build using the 3 predictors selected above and it is used to predict the audience_score in the test set. It did well giving a out of sample RMSE of 5.23 which is better than the training RMSE.

bma_movies_parsiomon = bas.lm(audience_score ~ runtime + imdb_rating + critics_score , data=movies_ds, prior = 'BIC', modelprior = uniform(), initprobs = 'eplogp')

yhat_movies_BPM <- predict(bma_movies_parsiomon, movies_test_ds, estimator = "BPM")$fit

cv.summary.bas(yhat_movies_BPM, movies_test_ds$audience_score)
## [1] 5.232874

Part 6: Conclusion

Thus the power of using BMA to perform feature selection is demonstrated here. I validated the selected set of features seperately by using AIC (Akaike Information Criterior) which returned the same set of 3 predictors while trying to minimize for AIC. So I am fairly confident that this parsimonious model has all the predictive information of the original dataset with less noise.

I am still curious about 2 things that I observed and don’t understand why yet.

  1. Why didn’t the feature_film variable have a high enough PIP (posterior Inclusion Probability) to be included in the model. The plot of feature_film version audience_score showed a significant difference in audience_score between these two groups of movies.

  2. Why is my out of sample RMSE (5.23) lower than my training RMSE (10.01). Shouldn’t it be the other way around? Is it due to my selecting movies with high ratings in my test set that inadvertently increased the accuracy of the model? Something to look into at a later time.