This project is a Bayesian analysis of movie data. We’ll build linear models to predict the audience score of a film.
The dataset comprises 651 randomly sampled movies produced and released before 2016. The data is drawn from the APIs of imdb.com, rottentomatoes.com, and flixster.com. Because this is an observational study that uses random sampling but no random assignment, we can identify associations but cannot establish causation. Because of the random selection and the size of the dataset, the results should be generalizable to the population the movies were sampled from. Since this project is presented through an English-speaking platform, and the data is drawn from sources that are based in the English-speaking world and cater to English speakers, the data will be biased toward movies where English is the main language. Many foreign films, such as Asian or Indian films, are therefore likely to be underrepresented.
* * *
We are first going to create some new variables to aid in our exploratory data analysis. Below is a summary of the new variables.
feature_film: “yes” if title_type is Feature Film, “no” otherwise.
drama: “yes” if genre is Drama, “no” otherwise.
mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise.
oscar_season: “yes” if movie is released in October, November, or December (based on thtr_rel_month), “no” otherwise.
summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.
library(dplyr)  # provides %>% and mutate()

# Derive the yes/no indicator variables described above
movies <- movies %>%
  mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"),
         drama = ifelse(genre == "Drama", "yes", "no"),
         mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"),
         oscar_season = ifelse(thtr_rel_month == 10 | thtr_rel_month == 11 | thtr_rel_month == 12, "yes", "no"),
         summer_season = ifelse(thtr_rel_month == 5 | thtr_rel_month == 6 | thtr_rel_month == 7 | thtr_rel_month == 8, "yes", "no"))
We’ll then create a new data frame, movies2, that will include a subset of the total variables.
movies2_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies2 <- movies[movies2_features]
We’ll start at a broader level by taking a look at a summary of the variables in movies2.
## audience_score feature_film drama runtime
## Min. :11.00 Length:651 Length:651 Min. : 39.0
## 1st Qu.:46.00 Class :character Class :character 1st Qu.: 92.0
## Median :65.00 Mode :character Mode :character Median :103.0
## Mean :62.36 Mean :105.8
## 3rd Qu.:80.00 3rd Qu.:115.8
## Max. :97.00 Max. :267.0
## NA's :1
## mpaa_rating_R thtr_rel_year oscar_season summer_season
## Length:651 Min. :1970 Length:651 Length:651
## Class :character 1st Qu.:1990 Class :character Class :character
## Mode :character Median :2000 Mode :character Mode :character
## Mean :1998
## 3rd Qu.:2007
## Max. :2014
##
## imdb_rating imdb_num_votes critics_score best_pic_nom best_pic_win
## Min. :1.900 Min. : 180 Min. : 1.00 no :629 no :644
## 1st Qu.:5.900 1st Qu.: 4546 1st Qu.: 33.00 yes: 22 yes: 7
## Median :6.600 Median : 15116 Median : 61.00
## Mean :6.493 Mean : 57533 Mean : 57.69
## 3rd Qu.:7.300 3rd Qu.: 58301 3rd Qu.: 83.00
## Max. :9.000 Max. :893008 Max. :100.00
##
## best_actor_win best_actress_win best_dir_win top200_box
## no :558 no :579 no :608 no :636
## yes: 93 yes: 72 yes: 43 yes: 15
##
##
##
##
##
This summary gives us a look at the spread of each variable.
Let’s also take a look at the structure and type of each variable.
## tibble [651 x 17] (S3: tbl_df/tbl/data.frame)
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ feature_film : chr [1:651] "yes" "yes" "yes" "yes" ...
## $ drama : chr [1:651] "yes" "yes" "no" "yes" ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating_R : chr [1:651] "yes" "no" "yes" "no" ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ oscar_season : chr [1:651] "no" "no" "no" "yes" ...
## $ summer_season : chr [1:651] "no" "no" "yes" "no" ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
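The summary above reports only Length, Class, and Mode for the five derived variables because they are stored as character vectors rather than factors. As a minimal sketch, using a small made-up data frame (`toy` is hypothetical and its values are invented, not taken from movies2), character columns could be converted to factors so that `summary()` shows counts per level:

```r
# Toy stand-in for movies2; values are invented for illustration only.
toy <- data.frame(
  audience_score = c(73, 81, 91, 27),
  drama          = c("yes", "yes", "no", "no"),
  oscar_season   = c("no", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor, leaving numeric columns alone.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

summary(toy)  # factor columns now report counts of "no" / "yes"
str(toy)
```

This is optional for modelling, since `lm()` treats character predictors as categorical anyway; it mainly makes the summaries easier to read.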
Let’s use boxplots to visualize how the newly-formed variables interact with audience_score.
library(ggplot2)    # plotting
library(gridExtra)  # provides grid.arrange()

plot1 <- ggplot(movies2, aes(x = mpaa_rating_R, y = audience_score)) +
  geom_boxplot(colour = "aquamarine4")
plot2 <- ggplot(movies2, aes(x = oscar_season, y = audience_score)) +
  geom_boxplot(colour = "aquamarine4")
plot3 <- ggplot(movies2, aes(x = summer_season, y = audience_score)) +
  geom_boxplot(colour = "aquamarine4")
plot4 <- ggplot(movies2, aes(x = feature_film, y = audience_score)) +
  geom_boxplot(colour = "aquamarine4")
plot5 <- ggplot(movies2, aes(x = drama, y = audience_score)) +
  geom_boxplot(colour = "aquamarine4")
# Arrange the five boxplots in a three-column grid
grid.arrange(plot1, plot2, plot3, plot4, plot5, ncol = 3)
We’ll then map out correlation charts that will show the relationships between audience_score and all other variables in movies2.
Notice the high correlation between audience_score and critics_score. Let’s visualize this correlation with a scatter plot with a regression line.
## [1] 0.7042762
ggplot(data = movies2, aes(x = audience_score, y = critics_score)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Let’s do the same with imdb_rating, again due to the high correlation it has with audience_score.
## [1] 0.8648652
ggplot(data = movies2, aes(x = audience_score, y = imdb_rating)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
We can see strong positive correlations for both pairs of variables.
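The two correlation values printed above appear without the code that produced them. A minimal sketch of how such Pearson correlations could be computed with base R follows; the data frame `toy` and its values are invented for illustration, and the `use` argument is an assumption about how missing values might be handled:

```r
# Toy stand-in for movies2; the numbers are invented, not taken from the data.
toy <- data.frame(
  audience_score = c(73, 81, 91, 76, 27, 86),
  critics_score  = c(45, 96, 91, 80, 33, 91),
  imdb_rating    = c(5.5, 7.3, 7.6, 7.2, 5.1, NA)
)

# Pearson correlation of the response with each candidate predictor.
# use = "complete.obs" drops rows containing NA before computing.
r_critics <- cor(toy$audience_score, toy$critics_score)
r_imdb    <- cor(toy$audience_score, toy$imdb_rating, use = "complete.obs")

# Correlation matrix for all numeric columns at once.
cor_mat <- cor(toy, use = "pairwise.complete.obs")
round(cor_mat, 3)
```

Computing the full matrix is a quick way to also spot strongly correlated predictors, such as imdb_rating and critics_score, before fitting a model.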
We’ll first create the full linear model, incorporating every variable in movies2.
We will then use the stepAIC function from the MASS package to perform backward elimination, dropping one variable at a time until the AIC can no longer be lowered.
##
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies2))
##
## Coefficients:
## (Intercept) feature_filmyes dramayes
## 1.244e+02 -2.248e+00 1.292e+00
## runtime mpaa_rating_Ryes thtr_rel_year
## -5.614e-02 -1.444e+00 -7.657e-02
## oscar_seasonyes summer_seasonyes imdb_rating
## -5.333e-01 9.106e-01 1.472e+01
## imdb_num_votes critics_score best_pic_nomyes
## 7.234e-06 5.748e-02 5.321e+00
## best_pic_winyes best_actor_winyes best_actress_winyes
## -3.212e+00 -1.544e+00 -2.198e+00
## best_dir_winyes top200_boxyes
## -1.231e+00 8.478e-01
We will use the stepAIC function with its default penalty (k = 2), which optimizes for AIC, to find the best model. The model will be built by backward elimination.
## Start: AIC=3006.94
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 9 62999 3005.0
## - oscar_season 1 28 63018 3005.2
## - best_pic_win 1 48 63038 3005.4
## - best_dir_win 1 51 63040 3005.5
## - summer_season 1 92 63081 3005.9
## - best_actor_win 1 171 63160 3006.7
## - feature_film 1 177 63166 3006.8
## <none> 62990 3006.9
## - drama 1 216 63206 3007.2
## - imdb_num_votes 1 255 63244 3007.6
## - best_actress_win 1 283 63273 3007.9
## - mpaa_rating_R 1 314 63304 3008.2
## - thtr_rel_year 1 397 63386 3009.0
## - best_pic_nom 1 408 63398 3009.1
## - runtime 1 538 63527 3010.5
## - critics_score 1 669 63659 3011.8
## - imdb_rating 1 58556 121545 3432.2
##
## Step: AIC=3005.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 26 63025 3003.3
## - best_pic_win 1 49 63047 3003.5
## - best_dir_win 1 52 63051 3003.6
## - summer_season 1 94 63093 3004.0
## - best_actor_win 1 169 63168 3004.8
## - feature_film 1 176 63175 3004.8
## <none> 62999 3005.0
## - drama 1 214 63213 3005.2
## - best_actress_win 1 279 63278 3005.9
## - imdb_num_votes 1 302 63301 3006.1
## - mpaa_rating_R 1 330 63329 3006.4
## - best_pic_nom 1 404 63403 3007.2
## - thtr_rel_year 1 415 63414 3007.3
## - runtime 1 535 63534 3008.5
## - critics_score 1 681 63680 3010.0
## - imdb_rating 1 58606 121604 3430.5
##
## Step: AIC=3003.31
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3001.8
## - best_dir_win 1 56 63081 3001.9
## - best_actor_win 1 174 63200 3003.1
## - summer_season 1 177 63202 3003.1
## - feature_film 1 182 63207 3003.2
## <none> 63025 3003.3
## - drama 1 222 63247 3003.6
## - best_actress_win 1 281 63307 3004.2
## - imdb_num_votes 1 302 63328 3004.4
## - mpaa_rating_R 1 329 63354 3004.7
## - best_pic_nom 1 387 63412 3005.3
## - thtr_rel_year 1 410 63436 3005.5
## - runtime 1 587 63613 3007.3
## - critics_score 1 679 63704 3008.3
## - imdb_rating 1 58603 121628 3428.6
##
## Step: AIC=3001.78
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3000.7
## - best_actor_win 1 163 63234 3001.5
## - feature_film 1 171 63242 3001.5
## - summer_season 1 174 63245 3001.6
## <none> 63071 3001.8
## - drama 1 220 63291 3002.0
## - imdb_num_votes 1 271 63342 3002.6
## - best_actress_win 1 294 63365 3002.8
## - mpaa_rating_R 1 330 63401 3003.2
## - best_pic_nom 1 342 63414 3003.3
## - thtr_rel_year 1 397 63468 3003.9
## - runtime 1 586 63657 3005.8
## - critics_score 1 680 63751 3006.8
## - imdb_rating 1 58858 121929 3428.2
##
## Step: AIC=3000.75
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3000.5
## - best_actor_win 1 171 63336 3000.5
## - feature_film 1 183 63348 3000.6
## <none> 63165 3000.7
## - drama 1 228 63394 3001.1
## - imdb_num_votes 1 247 63412 3001.3
## - best_actress_win 1 299 63464 3001.8
## - best_pic_nom 1 326 63491 3002.1
## - mpaa_rating_R 1 345 63510 3002.3
## - thtr_rel_year 1 368 63533 3002.5
## - critics_score 1 651 63816 3005.4
## - runtime 1 673 63839 3005.6
## - imdb_rating 1 58895 122061 3426.9
##
## Step: AIC=3000.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3000.1
## <none> 63332 3000.5
## - best_actor_win 1 195 63527 3000.5
## - drama 1 204 63536 3000.6
## - imdb_num_votes 1 260 63592 3001.1
## - best_pic_nom 1 297 63629 3001.5
## - best_actress_win 1 297 63629 3001.5
## - mpaa_rating_R 1 356 63688 3002.1
## - thtr_rel_year 1 361 63693 3002.2
## - runtime 1 690 64022 3005.5
## - critics_score 1 732 64064 3005.9
## - imdb_rating 1 58763 122095 3425.1
##
## Step: AIC=3000.06
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 2999.3
## - imdb_num_votes 1 173 63661 2999.8
## <none> 63488 3000.1
## - best_actor_win 1 219 63706 3000.3
## - thtr_rel_year 1 277 63765 3000.9
## - best_pic_nom 1 291 63778 3001.0
## - best_actress_win 1 306 63794 3001.2
## - mpaa_rating_R 1 453 63941 3002.7
## - runtime 1 715 64203 3005.3
## - critics_score 1 875 64363 3007.0
## - imdb_rating 1 63189 126677 3447.1
##
## Step: AIC=2999.3
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 2998.8
## <none> 63609 2999.3
## - best_actor_win 1 209 63818 2999.4
## - thtr_rel_year 1 272 63881 3000.1
## - best_actress_win 1 274 63883 3000.1
## - best_pic_nom 1 307 63916 3000.4
## - mpaa_rating_R 1 391 64000 3001.3
## - runtime 1 631 64240 3003.7
## - critics_score 1 916 64525 3006.6
## - imdb_rating 1 63434 127043 3447.0
##
## Step: AIC=2998.81
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## <none> 63757 2998.8
## - thtr_rel_year 1 201 63958 2998.9
## - best_actor_win 1 219 63976 2999.0
## - best_actress_win 1 266 64023 2999.5
## - mpaa_rating_R 1 367 64124 3000.5
## - best_pic_nom 1 442 64199 3001.3
## - runtime 1 519 64276 3002.1
## - critics_score 1 879 64635 3005.7
## - imdb_rating 1 67356 131113 3465.4
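The trace above is the kind of output `stepAIC` prints during backward elimination. Since the call itself is not shown, here is a minimal sketch of such a call on simulated data; the data frame `sim` and the object names are hypothetical, and `trace = 0` is used only to suppress the step-by-step table:

```r
library(MASS)  # provides stepAIC()

set.seed(1)
n <- 200
# Simulated data: y depends on x1 and x2 only; x3 and x4 are pure noise.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- 2 + 3 * sim$x1 - 2 * sim$x2 + rnorm(n)

full_fit <- lm(y ~ ., data = sim)

# Backward elimination; k = 2 is the default penalty and corresponds to AIC.
# trace = 0 suppresses the per-step table.
aic_fit <- stepAIC(full_fit, direction = "backward", k = 2, trace = 0)

formula(aic_fit)
```

At each step the procedure removes the term whose deletion gives the lowest criterion value and stops once `<none>` is at the top of the table, which is what happens in the last step of the trace.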
The final model built using AIC consists of the following variables:
runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win
AIC.lm <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data = movies2)
Taking a look at the coefficients of this model:
## (Intercept) runtime mpaa_rating_Ryes thtr_rel_year
## 70.10675281 -0.05115515 -1.50528039 -0.05122557
## imdb_rating critics_score best_pic_nomyes best_actor_winyes
## 15.00149242 0.06409989 4.88277038 -1.73481942
## best_actress_winyes
## -2.11568281
Taking a look at the residual standard error of the model (the estimated standard deviation of the residuals):
## [1] 9.973201
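The value above is consistent with the residual standard error of the fitted model, that is, the square root of the residual sum of squares divided by the residual degrees of freedom. A minimal sketch on simulated data (the names `sim` and `fit` are hypothetical) showing that quantity computed by hand and with the built-in extractors:

```r
set.seed(2)
# Simulated data, for illustration only.
sim <- data.frame(x = rnorm(50))
sim$y <- 1 + 2 * sim$x + rnorm(50)
fit <- lm(y ~ x, data = sim)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
rss     <- sum(residuals(fit)^2)
by_hand <- sqrt(rss / df.residual(fit))

sigma(fit)          # built-in extractor
summary(fit)$sigma  # same quantity
all.equal(by_hand, sigma(fit))
```

Applied to the figures reported for the AIC model, and assuming 650 complete rows and 9 estimated coefficients, sqrt(63757 / 641) is approximately 9.973.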
Plotting the residuals of the model:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the residuals are approximately normally distributed.
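A histogram gives only an informal impression of normality. As a hedged sketch on simulated data (the names `sim`, `fit`, and `res` are hypothetical), a normal Q-Q plot and a Shapiro-Wilk test are two complementary checks that could be run on the residuals:

```r
set.seed(3)
# Simulated data, for illustration only.
sim <- data.frame(x = rnorm(100))
sim$y <- 1 + 2 * sim$x + rnorm(100)
fit <- lm(y ~ x, data = sim)
res <- residuals(fit)

# Visual checks: histogram and normal Q-Q plot.
hist(res, breaks = 20, main = "Residuals", xlab = "Residual")
qqnorm(res)
qqline(res)

# Shapiro-Wilk test: a small p-value suggests non-normal residuals.
sw <- shapiro.test(res)
sw$p.value
```

With several hundred observations, even mild skewness can produce a small p-value, so the plots are usually the more informative of the two.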
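We will use the stepAIC function again, this time with the penalty set to k = log(n) so that it optimizes for BIC, to find the best model. The model will again be built by backward elimination. Note that stepAIC still labels the criterion column AIC in its output, even though the values shown use the BIC penalty.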
## Start: AIC=3083.07
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box
##
## Df Sum of Sq RSS AIC
## - top200_box 1 9 62999 3076.7
## - oscar_season 1 28 63018 3076.9
## - best_pic_win 1 48 63038 3077.1
## - best_dir_win 1 51 63040 3077.1
## - summer_season 1 92 63081 3077.5
## - best_actor_win 1 171 63160 3078.4
## - feature_film 1 177 63166 3078.4
## - drama 1 216 63206 3078.8
## - imdb_num_votes 1 255 63244 3079.2
## - best_actress_win 1 283 63273 3079.5
## - mpaa_rating_R 1 314 63304 3079.8
## - thtr_rel_year 1 397 63386 3080.7
## - best_pic_nom 1 408 63398 3080.8
## - runtime 1 538 63527 3082.1
## <none> 62990 3083.1
## - critics_score 1 669 63659 3083.5
## - imdb_rating 1 58556 121545 3503.9
##
## Step: AIC=3076.69
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + oscar_season + summer_season + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - oscar_season 1 26 63025 3070.5
## - best_pic_win 1 49 63047 3070.7
## - best_dir_win 1 52 63051 3070.8
## - summer_season 1 94 63093 3071.2
## - best_actor_win 1 169 63168 3072.0
## - feature_film 1 176 63175 3072.0
## - drama 1 214 63213 3072.4
## - best_actress_win 1 279 63278 3073.1
## - imdb_num_votes 1 302 63301 3073.3
## - mpaa_rating_R 1 330 63329 3073.6
## - best_pic_nom 1 404 63403 3074.4
## - thtr_rel_year 1 415 63414 3074.5
## - runtime 1 535 63534 3075.7
## <none> 62999 3076.7
## - critics_score 1 681 63680 3077.2
## - imdb_rating 1 58606 121604 3497.7
##
## Step: AIC=3070.49
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_pic_win + best_actor_win +
## best_actress_win + best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_pic_win 1 46 63071 3064.5
## - best_dir_win 1 56 63081 3064.6
## - best_actor_win 1 174 63200 3065.8
## - summer_season 1 177 63202 3065.8
## - feature_film 1 182 63207 3065.9
## - drama 1 222 63247 3066.3
## - best_actress_win 1 281 63307 3066.9
## - imdb_num_votes 1 302 63328 3067.1
## - mpaa_rating_R 1 329 63354 3067.4
## - best_pic_nom 1 387 63412 3068.0
## - thtr_rel_year 1 410 63436 3068.2
## - runtime 1 587 63613 3070.0
## <none> 63025 3070.5
## - critics_score 1 679 63704 3071.0
## - imdb_rating 1 58603 121628 3491.3
##
## Step: AIC=3064.48
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## best_dir_win
##
## Df Sum of Sq RSS AIC
## - best_dir_win 1 94 63165 3059.0
## - best_actor_win 1 163 63234 3059.7
## - feature_film 1 171 63242 3059.8
## - summer_season 1 174 63245 3059.8
## - drama 1 220 63291 3060.3
## - imdb_num_votes 1 271 63342 3060.8
## - best_actress_win 1 294 63365 3061.0
## - mpaa_rating_R 1 330 63401 3061.4
## - best_pic_nom 1 342 63414 3061.5
## - thtr_rel_year 1 397 63468 3062.1
## - runtime 1 586 63657 3064.0
## <none> 63071 3064.5
## - critics_score 1 680 63751 3065.0
## - imdb_rating 1 58858 121929 3486.5
##
## Step: AIC=3058.97
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + summer_season + imdb_rating + imdb_num_votes +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - summer_season 1 167 63332 3054.2
## - best_actor_win 1 171 63336 3054.2
## - feature_film 1 183 63348 3054.4
## - drama 1 228 63394 3054.8
## - imdb_num_votes 1 247 63412 3055.0
## - best_actress_win 1 299 63464 3055.6
## - best_pic_nom 1 326 63491 3055.8
## - mpaa_rating_R 1 345 63510 3056.0
## - thtr_rel_year 1 368 63533 3056.3
## <none> 63165 3059.0
## - critics_score 1 651 63816 3059.2
## - runtime 1 673 63839 3059.4
## - imdb_rating 1 58895 122061 3480.7
##
## Step: AIC=3054.2
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R +
## thtr_rel_year + imdb_rating + imdb_num_votes + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - feature_film 1 156 63488 3049.3
## - best_actor_win 1 195 63527 3049.7
## - drama 1 204 63536 3049.8
## - imdb_num_votes 1 260 63592 3050.4
## - best_pic_nom 1 297 63629 3050.8
## - best_actress_win 1 297 63629 3050.8
## - mpaa_rating_R 1 356 63688 3051.4
## - thtr_rel_year 1 361 63693 3051.4
## <none> 63332 3054.2
## - runtime 1 690 64022 3054.8
## - critics_score 1 732 64064 3055.2
## - imdb_rating 1 58763 122095 3474.4
##
## Step: AIC=3049.32
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_nom +
## best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - drama 1 121 63609 3044.1
## - imdb_num_votes 1 173 63661 3044.6
## - best_actor_win 1 219 63706 3045.1
## - thtr_rel_year 1 277 63765 3045.7
## - best_pic_nom 1 291 63778 3045.8
## - best_actress_win 1 306 63794 3046.0
## - mpaa_rating_R 1 453 63941 3047.5
## <none> 63488 3049.3
## - runtime 1 715 64203 3050.1
## - critics_score 1 875 64363 3051.7
## - imdb_rating 1 63189 126677 3491.9
##
## Step: AIC=3044.09
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## imdb_num_votes + critics_score + best_pic_nom + best_actor_win +
## best_actress_win
##
## Df Sum of Sq RSS AIC
## - imdb_num_votes 1 148 63757 3039.1
## - best_actor_win 1 209 63818 3039.7
## - thtr_rel_year 1 272 63881 3040.4
## - best_actress_win 1 274 63883 3040.4
## - best_pic_nom 1 307 63916 3040.7
## - mpaa_rating_R 1 391 64000 3041.6
## - runtime 1 631 64240 3044.0
## <none> 63609 3044.1
## - critics_score 1 916 64525 3046.9
## - imdb_rating 1 63434 127043 3487.3
##
## Step: AIC=3039.12
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - thtr_rel_year 1 201 63958 3034.7
## - best_actor_win 1 219 63976 3034.9
## - best_actress_win 1 266 64023 3035.3
## - mpaa_rating_R 1 367 64124 3036.4
## - best_pic_nom 1 442 64199 3037.1
## - runtime 1 519 64276 3037.9
## <none> 63757 3039.1
## - critics_score 1 879 64635 3041.5
## - imdb_rating 1 67356 131113 3501.3
##
## Step: AIC=3034.68
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actor_win + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actor_win 1 207 64165 3030.3
## - best_actress_win 1 261 64219 3030.9
## - mpaa_rating_R 1 373 64331 3032.0
## - best_pic_nom 1 447 64405 3032.7
## - runtime 1 468 64425 3032.9
## <none> 63958 3034.7
## - critics_score 1 968 64926 3038.0
## - imdb_rating 1 67172 131129 3494.9
##
## Step: AIC=3030.3
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom + best_actress_win
##
## Df Sum of Sq RSS AIC
## - best_actress_win 1 296 64461 3026.8
## - mpaa_rating_R 1 366 64531 3027.5
## - best_pic_nom 1 396 64561 3027.8
## <none> 64165 3030.3
## - runtime 1 643 64808 3030.3
## - critics_score 1 968 65133 3033.6
## - imdb_rating 1 67296 131461 3490.0
##
## Step: AIC=3026.82
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score +
## best_pic_nom
##
## Df Sum of Sq RSS AIC
## - best_pic_nom 1 303 64765 3023.4
## - mpaa_rating_R 1 354 64815 3023.9
## <none> 64461 3026.8
## - runtime 1 814 65275 3028.5
## - critics_score 1 957 65418 3029.9
## - imdb_rating 1 67424 131885 3485.7
##
## Step: AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## - mpaa_rating_R 1 361 65126 3020.5
## - runtime 1 638 65403 3023.3
## <none> 64765 3023.4
## - critics_score 1 1027 65792 3027.1
## - imdb_rating 1 68173 132937 3484.3
##
## Step: AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
##
## Df Sum of Sq RSS AIC
## <none> 65126 3020.5
## - runtime 1 653 65779 3020.5
## - critics_score 1 1073 66199 3024.7
## - imdb_rating 1 67874 133000 3478.2
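The only difference between the two stepwise runs is the penalty per estimated coefficient. A minimal sketch on simulated data (all object names are hypothetical) of the criterion that `stepAIC` reports for `lm` fits, and of how setting `k = log(n)` switches it from AIC to BIC:

```r
library(MASS)  # provides stepAIC()

set.seed(4)
n <- 200
# Simulated data, for illustration only.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)
full_fit <- lm(y ~ ., data = sim)

# The criterion stepAIC() reports for lm fits: n * log(RSS / n) + k * edf,
# where edf is the number of estimated coefficients.
rss     <- sum(residuals(full_fit)^2)
edf     <- length(coef(full_fit))
by_hand <- n * log(rss / n) + log(n) * edf
all.equal(by_hand, extractAIC(full_fit, k = log(n))[2])

# k = log(n) turns the AIC penalty into the BIC penalty.
bic_fit <- stepAIC(full_fit, direction = "backward", k = log(n), trace = 0)
formula(bic_fit)
```

As a check against the first trace, and assuming 650 complete rows, an RSS of 62990 with 17 estimated coefficients and k = 2 gives 650 * log(62990 / 650) + 34, which is approximately 3006.94, the starting AIC reported there. Because log(n) is larger than 2 for any n above 7, the BIC penalty is heavier and the procedure removes more variables.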
The final model will use the following variables:
audience_score ~ runtime + imdb_rating + critics_score
## (Intercept) runtime imdb_rating critics_score
## -33.28320569 -0.05361506 14.98076157 0.07035672
## [1] 10.04062
Taking a look at the residuals:
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see that the residuals are approximately normally distributed.
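library(BAS)  # provides bas.lm() and uniform()

# Bayesian model averaging over all predictors, with a BIC-based prior on the
# coefficients and a uniform prior over models
as_full.bas <- bas.lm(audience_score ~ .,
                      prior = "BIC",
                      modelprior = uniform(),
                      data = na.omit(movies2))
as_full.bas
##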
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies2),
## prior = "BIC", modelprior = uniform())
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept feature_filmyes dramayes
## 1.00000 0.06537 0.04320
## runtime mpaa_rating_Ryes thtr_rel_year
## 0.46971 0.19984 0.09069
## oscar_seasonyes summer_seasonyes imdb_rating
## 0.07506 0.08042 1.00000
## imdb_num_votes critics_score best_pic_nomyes
## 0.05774 0.88855 0.13119
## best_pic_winyes best_actor_winyes best_actress_winyes
## 0.03985 0.14435 0.14128
## best_dir_winyes top200_boxyes
## 0.06694 0.04762
According to this output, imdb_rating has a posterior inclusion probability of essentially 1 (reported as 1.00000), meaning it appears in virtually every model that carries non-negligible posterior probability. Other noteworthy variables are critics_score (~89%) and runtime (~47%). The next highest is mpaa_rating_Ryes at ~20%.
## 2.5% 97.5% beta
## Intercept 6.159980e+01 6.314012e+01 6.234769e+01
## feature_filmyes -9.335871e-01 1.875713e-01 -1.046908e-01
## dramayes 0.000000e+00 0.000000e+00 1.604413e-02
## runtime -8.308220e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes -2.108859e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year -5.473572e-02 1.090637e-04 -4.532635e-03
## oscar_seasonyes -1.035255e+00 8.710594e-03 -8.034940e-02
## summer_seasonyes -9.139496e-03 1.055576e+00 8.704545e-02
## imdb_rating 1.370488e+01 1.659557e+01 1.498203e+01
## imdb_num_votes -8.960385e-08 1.536983e-06 2.080713e-07
## critics_score 0.000000e+00 1.058527e-01 6.296648e-02
## best_pic_nomyes -1.007777e-01 4.771271e+00 5.068035e-01
## best_pic_winyes 0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes -2.581776e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.833973e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes -1.145373e+00 0.000000e+00 -1.195011e-01
## top200_boxyes -3.053916e-02 7.534309e-02 8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## feature_filmyes 0.06536947 0.0000 0.0000000 0.0000000
## dramayes 0.04319833 0.0000 0.0000000 0.0000000
## runtime 0.46971477 1.0000 0.0000000 0.0000000
## mpaa_rating_Ryes 0.19984016 0.0000 0.0000000 0.0000000
## thtr_rel_year 0.09068970 0.0000 0.0000000 0.0000000
## oscar_seasonyes 0.07505684 0.0000 0.0000000 0.0000000
## summer_seasonyes 0.08042023 0.0000 0.0000000 0.0000000
## imdb_rating 1.00000000 1.0000 1.0000000 1.0000000
## imdb_num_votes 0.05773502 0.0000 0.0000000 0.0000000
## critics_score 0.88855056 1.0000 1.0000000 1.0000000
## best_pic_nomyes 0.13119140 0.0000 0.0000000 0.0000000
## best_pic_winyes 0.03984766 0.0000 0.0000000 0.0000000
## best_actor_winyes 0.14434896 0.0000 0.0000000 1.0000000
## best_actress_winyes 0.14128087 0.0000 0.0000000 0.0000000
## best_dir_winyes 0.06693898 0.0000 0.0000000 0.0000000
## top200_boxyes 0.04762234 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.9968489 0.2543185
## PostProbs NA 0.1297 0.1293000 0.0330000
## R2 NA 0.7549 0.7525000 0.7539000
## dim NA 4.0000 3.0000000 4.0000000
## logmarg NA -3615.2791 -3615.2822108 -3616.6482224
## model 4 model 5
## Intercept 1.0000000 1.0000000
## feature_filmyes 0.0000000 0.0000000
## dramayes 0.0000000 0.0000000
## runtime 0.0000000 1.0000000
## mpaa_rating_Ryes 1.0000000 1.0000000
## thtr_rel_year 0.0000000 0.0000000
## oscar_seasonyes 0.0000000 0.0000000
## summer_seasonyes 0.0000000 0.0000000
## imdb_rating 1.0000000 1.0000000
## imdb_num_votes 0.0000000 0.0000000
## critics_score 1.0000000 1.0000000
## best_pic_nomyes 0.0000000 0.0000000
## best_pic_winyes 0.0000000 0.0000000
## best_actor_winyes 0.0000000 0.0000000
## best_actress_winyes 0.0000000 0.0000000
## best_dir_winyes 0.0000000 0.0000000
## top200_boxyes 0.0000000 0.0000000
## BF 0.2521327 0.2391994
## PostProbs 0.0327000 0.0310000
## R2 0.7539000 0.7563000
## dim 4.0000000 5.0000000
## logmarg -3616.6568544 -3616.7095127
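The BF row in the table above can be related to the logmarg row: each Bayes factor is the exponentiated difference between a model's log marginal likelihood and that of the top-ranked model. A minimal sketch using the values as printed (the vector name `logmarg` mirrors the table row; the value for model 1 is only shown to four decimals, so results agree only approximately):

```r
# Log marginal likelihoods as printed in the summary table
# (model 1 is shown rounded to four decimals there).
logmarg <- c(model1 = -3615.2791,
             model2 = -3615.2822108,
             model3 = -3616.6482224,
             model4 = -3616.6568544,
             model5 = -3616.7095127)

# Bayes factor of each model relative to the top-ranked model.
bf <- exp(logmarg - logmarg["model1"])
round(bf, 4)

# Under a uniform model prior, the posterior odds between two models equal
# their Bayes factor, e.g. model 2 vs model 1 using the printed PostProbs:
0.1293 / 0.1297
```

The near-unit Bayes factor between models 1 and 2 shows how little the data distinguish a model with runtime from one without it.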
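The model with the highest posterior probability (model 1, at about 0.13) contains the variables runtime, imdb_rating, and critics_score. Notice that this is the same model selected by the backward stepwise BIC method above. Model 2, which drops runtime, is nearly as probable (0.1293 vs 0.1297).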
Below, we can visualize the goodness of each of the models analyzed using the bas.lm function. The best model (rank 1) shows on the left, with the colored squares representing variables that would be selected for that particular model.
We see a normal distribution here.
Now let’s plot the residuals against the fitted values.
We see some left-skewness here, but the data is generally scattered around 0.
Now let’s plot the absolute value of the residuals against the fitted values.
We do not see a fan shape, so the constant-variance condition appears to be met.
The movie I’ve chosen is Finding Dory. The information I will be using for the prediction comes from IMDB and Rotten Tomatoes.
I’ll create a data frame containing Finding Dory’s information.
finding_dory_df <- data.frame(imdb_rating = 7.5, runtime = 97, critics_score = 94, mpaa_rating_R = "no", thtr_rel_year = 2016, best_pic_nom = "no", best_actor_win = "no", best_actress_win = "no")
I will run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.
## fit lwr upr
## 1 80.48538 60.72202 100.2487
The BIC model predicts a score of 80.48538.
## fit lwr upr
## 1 80.41053 60.71769 100.1034
The AIC model predicts a score of 80.41053.
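The fit, lwr, and upr columns above have the shape of `predict()` output with an interval requested, and their width is consistent with a 95% prediction interval. Since the calls are not shown, here is a minimal sketch on simulated data; the training data `sim`, the model `fit`, and the two-predictor formula are hypothetical stand-ins, not the models fitted in this report:

```r
set.seed(5)
# Simulated training data, for illustration only.
sim <- data.frame(runtime     = rnorm(100, 105, 15),
                  imdb_rating = runif(100, 3, 9))
sim$audience_score <- -30 + 15 * sim$imdb_rating - 0.05 * sim$runtime +
  rnorm(100, sd = 10)
fit <- lm(audience_score ~ runtime + imdb_rating, data = sim)

# One new observation; column names must match the model's predictors.
new_movie <- data.frame(runtime = 97, imdb_rating = 7.5)

# interval = "prediction" returns fit, lwr, upr for an individual new outcome
# (95% level by default).
pred <- predict(fit, newdata = new_movie, interval = "prediction")
pred
```

A prediction interval accounts for both the uncertainty in the estimated coefficients and the residual variation of a single film, which is why it is wide enough here to extend past the maximum possible score of 100.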
As the true score was 86, both models underpredicted by roughly 5.5 points, and the BIC model was only marginally closer (its prediction is 93.588% of the actual score vs 93.501% for the AIC model).
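The percentages quoted above are the ratio of each prediction to the actual score. A short sketch of that arithmetic, alongside the absolute error in score points, using only the figures reported in this section (the variable names are illustrative):

```r
actual <- 86
pred   <- c(BIC = 80.48538, AIC = 80.41053)  # point predictions reported above

abs_error <- abs(actual - pred)  # error in audience-score points
ratio     <- pred / actual       # prediction as a fraction of the actual score

round(abs_error, 3)
round(100 * ratio, 3)
```

The absolute error is arguably the easier quantity to interpret, since it is in the same units as audience_score.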
The model selected by stepAIC with the BIC penalty was the same model that bas.lm ranked highest. In the end, the AIC and BIC models produced almost identical predictions. I believe that if the scope of this project were increased, departures from normally distributed errors, such as the slight left skew seen in the residuals, could become a concern. One method to deal with such issues, which was not touched on in this project, is variable transformation.