library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(readr)The dataset is a movie reviews data set with 651 observations and 32 variables.
load("movies.Rdata")
str(movies)## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 32 variables:
## $ title : chr "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
The dataset comes from two data sources. The IMDB and Rotten tomatoes. The observations that make up this sample comes and from 212 movie studios with movies from 1970 to 2014, and contains 11 different genres. The sample in my opinion is well diversified, has enough sample size and encompasses all genres of movies from the last 4 decades with lot of interesting features that are relevant to answer the question of predicting audience scores. A suitable model built from the data set should be able generalize well for new data.
Here we are creating new features from existing variables using mutate funcation in the dplr package. The features created are
movies <- mutate(movies, feature_film = factor(ifelse(movies$title_type == "Feature Film", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, drama = factor(ifelse(movies$genre == "Drama", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, mpaa_rating_R = factor(ifelse(movies$mpaa_rating == "R", 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, oscar_season = factor(ifelse(movies$thtr_rel_month %in% c(10,11,12), 'yes', 'no'), levels = c('yes', 'no')))
movies <- mutate(movies, summer_season = factor(ifelse(movies$thtr_rel_month %in% c(5,6,7), 'yes', 'no'), levels = c('yes', 'no')))select(movies, feature_film, audience_score) %>%
group_by(feature_film) %>%
summarise(audience_score = mean(audience_score))## # A tibble: 2 x 2
## feature_film audience_score
## <fctr> <dbl>
## 1 yes 60.46531
## 2 no 81.05000
qplot(feature_film, audience_score, data = movies, geom = 'boxplot', main = "Feature_film on Audience_score") 2. Drama on Audience_score - movies under genre of ‘drama’ has a slightly higher audience_score distribution that other genres.
select(movies, drama, audience_score) %>%
group_by(drama) %>%
summarise(audience_score = mean(audience_score))## # A tibble: 2 x 2
## drama audience_score
## <fctr> <dbl>
## 1 yes 65.34754
## 2 no 59.73121
qplot(drama, audience_score, data = movies, geom = 'boxplot', main = "Drama genre vs Audience_Score")select(movies, mpaa_rating_R, audience_score) %>%
group_by(mpaa_rating_R) %>%
summarise(audience_score = mean(audience_score))## # A tibble: 2 x 2
## mpaa_rating_R audience_score
## <fctr> <dbl>
## 1 yes 62.04255
## 2 no 62.68944
qplot(mpaa_rating_R, audience_score, data = movies, geom = 'boxplot', main = "mpaa_rating of R on Audience_score")select(movies, oscar_season, audience_score) %>%
group_by(oscar_season) %>%
summarise(audience_score = mean(audience_score))## # A tibble: 2 x 2
## oscar_season audience_score
## <fctr> <dbl>
## 1 yes 63.68586
## 2 no 61.81304
qplot(oscar_season, audience_score, data = movies, geom = 'boxplot', main = "Oscar season movies on Audience_score" )select(movies, summer_season, audience_score) %>%
group_by(summer_season) %>%
summarise(audience_score = mean(audience_score))## # A tibble: 2 x 2
## summer_season audience_score
## <fctr> <dbl>
## 1 yes 62.00610
## 2 no 62.48255
qplot(summer_season, audience_score, data = movies, geom = 'boxplot', main = "Summer season movies on Audience_score")As instructed, a filtered list of feature is selected from the original dataset with the following features and the na.omit() is applied ot remove the the one observation that has ‘na’ values
movies_ds <- movies[,c('feature_film', 'drama', 'runtime', 'mpaa_rating_R', 'thtr_rel_year', 'oscar_season', 'summer_season', 'imdb_rating', 'critics_score', 'best_pic_nom', 'best_pic_win', 'best_actor_win', 'best_actress_win', 'best_dir_win', 'top200_box', 'audience_score')]
movies_ds = na.omit(movies_ds)Bayesina Model Averaging (BMA) is used to select a parsimonious model to predict the target variable. The BPM (Best Predictive model) and HPM (Highest probability model) both have the same 3 predictors namely runtime, imdb_rating, critics_score. These set of predictors have the least RMSE(Root Mean Square Error) and hence it is selected to make the prediction
bma_movies = bas.lm(audience_score ~ . , data=movies_ds, prior="BIC", modelprior=uniform(), initprobs="eplogp")
BPM_pred_movies <- predict(bma_movies, estimator = "BPM", se.fit = TRUE)
bma_movies$namesx[BPM_pred_movies$bestmodel + 1]## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
cv.summary.bas(BPM_pred_movies$fit, movies_ds$audience_score)## [1] 10.00968
MPM_pred_movies <- predict(bma_movies, estimator = "MPM")
bma_movies$namesx[MPM_pred_movies$bestmodel + 1]## [1] "Intercept" "imdb_rating" "critics_score"
cv.summary.bas(MPM_pred_movies$fit, movies_ds$audience_score)## [1] 10.05972
HPM_pred_movies <- predict(bma_movies, estimator = "HPM")
bma_movies$namesx[HPM_pred_movies$bestmodel + 1]## [1] "Intercept" "runtime" "imdb_rating" "critics_score"
cv.summary.bas(HPM_pred_movies$fit, movies_ds$audience_score)## [1] 10.00968
To test the prediction a dataset of 8 movies released in 2006 is compiled with the necessary predictors. This teste dataset is loaded below
movies_test_ds <- read_csv('movies_test.csv')## Parsed with column specification:
## cols(
## title = col_character(),
## runtime = col_integer(),
## imdb_rating = col_double(),
## critics_score = col_integer(),
## audience_score = col_integer()
## )
str(movies_test_ds, give.attr = F)## Classes 'tbl_df', 'tbl' and 'data.frame': 8 obs. of 5 variables:
## $ title : chr "Split" "Suicide Squad" "LBJ" "Arrival" ...
## $ runtime : int 117 123 98 116 107 109 89 92
## $ imdb_rating : num 7.3 6.2 6 8 7.6 5.3 6.6 6.5
## $ critics_score : int 75 26 52 94 96 30 90 74
## $ audience_score: int 79 61 63 82 89 42 69 68
A parsimonious model is build using the 3 predictors selected above and it is used to predict the audience_score in the test set. It did well giving a out of sample RMSE of 5.23 which is better than the training RMSE.
bma_movies_parsiomon = bas.lm(audience_score ~ runtime + imdb_rating + critics_score , data=movies_ds, prior = 'BIC', modelprior = uniform(), initprobs = 'eplogp')
yhat_movies_BPM <- predict(bma_movies_parsiomon, movies_test_ds, estimator = "BPM")$fit
cv.summary.bas(yhat_movies_BPM, movies_test_ds$audience_score)## [1] 5.232874
Thus the power of using BMA to perform feature selection is demonstrated here. I validated the selected set of features seperately by using AIC (Akaike Information Criterior) which returned the same set of 3 predictors while trying to minimize for AIC. So I am fairly confident that this parsimonious model has all the predictive information of the original dataset with less noise.
I am still curious about 2 things that I observed and don’t understand why yet.
Why didn’t the feature_film variable have a high enough PIP (posterior Inclusion Probability) to be included in the model. The plot of feature_film version audience_score showed a significant difference in audience_score between these two groups of movies.
Why is my out of sample RMSE (5.23) lower than my training RMSE (10.01). Shouldn’t it be the other way around? Is it due to my selecting movies with high ratings in my test set that inadvertently increased the accuracy of the model? Something to look into at a later time.