Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(MASS)
library(GGally)
library(gridExtra)

Load data

load("movies.Rdata")

Part 0: Intro

This project is a Bayesian analysis of movie data. We’ll build linear models to predict the audience score of a film.

Part 1: Data

The dataset comprises 651 randomly sampled movies produced and released before 2016, drawn from the APIs of imdb.com, rottentomatoes.com, and flixster.com. Because this is an observational sample with no random assignment, we can only identify associations, not causal relationships. Because the movies were randomly sampled and the sample is reasonably large, the results should generalize to the population the sample was drawn from. That population is skewed, however: this project is presented through an English-speaking platform, and the data sources are based in the English-speaking world and cater to English speakers, so the data will be biased toward movies whose main language is English. Many foreign films, such as Asian or Indian films, are likely to be under-represented.

Part 2: Data manipulation

We are first going to create some new variables to aid in our exploratory data analysis. Below is a summary of the new variables.

feature_film: “yes” if title_type is Feature Film, “no” otherwise.

drama: “yes” if genre is Drama, “no” otherwise.

mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise.

oscar_season: “yes” if movie is released in October, November, or December (based on thtr_rel_month), “no” otherwise.

summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.

movies <- movies %>%
  mutate(feature_film = ifelse(title_type == "Feature Film", "yes", "no"),
         drama = ifelse(genre == "Drama", "yes", "no"),
         mpaa_rating_R = ifelse(mpaa_rating == "R","yes","no"),
         oscar_season = ifelse(thtr_rel_month == 11 | thtr_rel_month == 10 | thtr_rel_month == 12, "yes", "no"),
         summer_season = ifelse(thtr_rel_month == 5 | thtr_rel_month == 6 | thtr_rel_month == 7 | thtr_rel_month == 8, "yes","no"))
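
As an aside, the chained == comparisons above can be written more compactly with %in%. The following is a minimal sketch on a toy data frame; the object toy and its values are made up for illustration, and only the column name thtr_rel_month comes from the dataset.

```r
# Toy stand-in for the movies data; the month values are invented.
toy <- data.frame(thtr_rel_month = c(1, 5, 8, 10, 12))

# %in% tests membership in a set of months in a single expression.
toy$oscar_season  <- ifelse(toy$thtr_rel_month %in% 10:12, "yes", "no")
toy$summer_season <- ifelse(toy$thtr_rel_month %in% 5:8, "yes", "no")

print(toy)
```

One difference to keep in mind: %in% returns FALSE for a missing month, so such a row would be labelled “no”, whereas the == version would yield NA.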

We’ll then create a new dataframe movies2 that will include a subset of the total variables.

movies2_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies2 <- movies[movies2_features]

Part 3: Exploratory data analysis

We’ll start with a high-level look at the data by summarizing the variables in movies2.

summary(movies2)
##  audience_score  feature_film          drama              runtime     
##  Min.   :11.00   Length:651         Length:651         Min.   : 39.0  
##  1st Qu.:46.00   Class :character   Class :character   1st Qu.: 92.0  
##  Median :65.00   Mode  :character   Mode  :character   Median :103.0  
##  Mean   :62.36                                         Mean   :105.8  
##  3rd Qu.:80.00                                         3rd Qu.:115.8  
##  Max.   :97.00                                         Max.   :267.0  
##                                                        NA's   :1      
##  mpaa_rating_R      thtr_rel_year  oscar_season       summer_season     
##  Length:651         Min.   :1970   Length:651         Length:651        
##  Class :character   1st Qu.:1990   Class :character   Class :character  
##  Mode  :character   Median :2000   Mode  :character   Mode  :character  
##                     Mean   :1998                                        
##                     3rd Qu.:2007                                        
##                     Max.   :2014                                        
##                                                                         
##   imdb_rating    imdb_num_votes   critics_score    best_pic_nom best_pic_win
##  Min.   :1.900   Min.   :   180   Min.   :  1.00   no :629      no :644     
##  1st Qu.:5.900   1st Qu.:  4546   1st Qu.: 33.00   yes: 22      yes:  7     
##  Median :6.600   Median : 15116   Median : 61.00                            
##  Mean   :6.493   Mean   : 57533   Mean   : 57.69                            
##  3rd Qu.:7.300   3rd Qu.: 58301   3rd Qu.: 83.00                            
##  Max.   :9.000   Max.   :893008   Max.   :100.00                            
##                                                                             
##  best_actor_win best_actress_win best_dir_win top200_box
##  no :558        no :579          no :608      no :636   
##  yes: 93        yes: 72          yes: 43      yes: 15   
##                                                         
##                                                         
##                                                         
##                                                         
## 

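This summary gives us a look at the spread of each numeric variable and the counts for each factor. It also shows that runtime has one missing value.

Let’s also take a look at the structure of the data: the type of each variable and, for factors, its levels.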

str(movies2)
## tibble [651 x 17] (S3: tbl_df/tbl/data.frame)
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ feature_film    : chr [1:651] "yes" "yes" "yes" "yes" ...
##  $ drama           : chr [1:651] "yes" "yes" "no" "yes" ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating_R   : chr [1:651] "yes" "no" "yes" "no" ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ oscar_season    : chr [1:651] "no" "no" "no" "yes" ...
##  $ summer_season   : chr [1:651] "no" "no" "yes" "no" ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
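
The output shows that the five new variables are stored as character vectors, which have no levels, while the award variables are factors. lm converts character predictors to factors on its own, so the models below are unaffected. If explicit levels are wanted, one possible approach is sketched here on a made-up toy data frame (toy and its values are assumptions, not part of the original analysis).

```r
# Toy data frame with one character column and one numeric column.
toy <- data.frame(drama = c("yes", "no", "yes"),
                  runtime = c(80, 101, 84),
                  stringsAsFactors = FALSE)

# Find the character columns and convert each one to a factor
# with "no" as the reference level.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor, levels = c("no", "yes"))

str(toy)
print(levels(toy$drama))
```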

Let’s use boxplots to visualize how the newly formed variables relate to audience_score.

plot1 <- ggplot(movies2, aes(x=mpaa_rating_R,y=audience_score))+
            geom_boxplot(colour="aquamarine4")
  
plot2 <- ggplot(movies2, aes(x=oscar_season, y=audience_score))+
            geom_boxplot(colour="aquamarine4")
  
plot3 <- ggplot(movies2, aes(x=summer_season,y=audience_score))+
            geom_boxplot(colour="aquamarine4")
  
plot4 <- ggplot(movies2, aes(x=feature_film, y=audience_score))+
            geom_boxplot(colour="aquamarine4")
  
plot5 <- ggplot(movies2, aes(x=drama, y=audience_score))+
            geom_boxplot(colour="aquamarine4")
            
grid.arrange(plot1,plot2,plot3,plot4,plot5, ncol=3)

We’ll then map out correlation charts that will show the relationships between audience_score and all other variables in movies2.

suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = 1:8))))

suppressWarnings(suppressMessages(print(ggpairs(movies2, columns = c(1,9:17)))))

Notice the high correlation between audience_score and critics_score. Let’s visualize this correlation with a scatter plot with a regression line.

cor(movies2$audience_score, movies2$critics_score)
## [1] 0.7042762
ggplot(data=movies2, aes(x = audience_score, y = critics_score)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Let’s do the same with imdb_rating, again due to the high correlation it has with audience_score.

cor(movies2$audience_score, movies2$imdb_rating)
## [1] 0.8648652
ggplot(data=movies2, aes(x = audience_score, y = imdb_rating)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Both pairs of variables show strong positive correlations (about 0.70 and 0.86).


Part 4: Modeling

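We’ll first create the full linear model, incorporating every variable in movies2. Rows with missing values are dropped with na.omit so that every candidate model is fit to the same observations.

This full model is the starting point for backward selection with the stepAIC function from the MASS package, which removes one variable at a time until the AIC cannot be lowered any further.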

as_full <- lm(audience_score ~ ., data= na.omit(movies2))

as_full
## 
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies2))
## 
## Coefficients:
##         (Intercept)      feature_filmyes             dramayes  
##           1.244e+02           -2.248e+00            1.292e+00  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##          -5.614e-02           -1.444e+00           -7.657e-02  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##          -5.333e-01            9.106e-01            1.472e+01  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##           7.234e-06            5.748e-02            5.321e+00  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##          -3.212e+00           -1.544e+00           -2.198e+00  
##     best_dir_winyes        top200_boxyes  
##          -1.231e+00            8.478e-01

Creating a model based on AIC

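We will use stepAIC with its default penalty (k = 2), which corresponds to AIC. Selection runs backward: starting from the full model, each step drops the variable whose removal lowers the AIC the most, and the search stops when no removal helps.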

stepAIC.model <- stepAIC(as_full, direction = "backward", trace = TRUE)
## Start:  AIC=3006.94
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         9  62999 3005.0
## - oscar_season      1        28  63018 3005.2
## - best_pic_win      1        48  63038 3005.4
## - best_dir_win      1        51  63040 3005.5
## - summer_season     1        92  63081 3005.9
## - best_actor_win    1       171  63160 3006.7
## - feature_film      1       177  63166 3006.8
## <none>                           62990 3006.9
## - drama             1       216  63206 3007.2
## - imdb_num_votes    1       255  63244 3007.6
## - best_actress_win  1       283  63273 3007.9
## - mpaa_rating_R     1       314  63304 3008.2
## - thtr_rel_year     1       397  63386 3009.0
## - best_pic_nom      1       408  63398 3009.1
## - runtime           1       538  63527 3010.5
## - critics_score     1       669  63659 3011.8
## - imdb_rating       1     58556 121545 3432.2
## 
## Step:  AIC=3005.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        26  63025 3003.3
## - best_pic_win      1        49  63047 3003.5
## - best_dir_win      1        52  63051 3003.6
## - summer_season     1        94  63093 3004.0
## - best_actor_win    1       169  63168 3004.8
## - feature_film      1       176  63175 3004.8
## <none>                           62999 3005.0
## - drama             1       214  63213 3005.2
## - best_actress_win  1       279  63278 3005.9
## - imdb_num_votes    1       302  63301 3006.1
## - mpaa_rating_R     1       330  63329 3006.4
## - best_pic_nom      1       404  63403 3007.2
## - thtr_rel_year     1       415  63414 3007.3
## - runtime           1       535  63534 3008.5
## - critics_score     1       681  63680 3010.0
## - imdb_rating       1     58606 121604 3430.5
## 
## Step:  AIC=3003.31
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3001.8
## - best_dir_win      1        56  63081 3001.9
## - best_actor_win    1       174  63200 3003.1
## - summer_season     1       177  63202 3003.1
## - feature_film      1       182  63207 3003.2
## <none>                           63025 3003.3
## - drama             1       222  63247 3003.6
## - best_actress_win  1       281  63307 3004.2
## - imdb_num_votes    1       302  63328 3004.4
## - mpaa_rating_R     1       329  63354 3004.7
## - best_pic_nom      1       387  63412 3005.3
## - thtr_rel_year     1       410  63436 3005.5
## - runtime           1       587  63613 3007.3
## - critics_score     1       679  63704 3008.3
## - imdb_rating       1     58603 121628 3428.6
## 
## Step:  AIC=3001.78
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3000.7
## - best_actor_win    1       163  63234 3001.5
## - feature_film      1       171  63242 3001.5
## - summer_season     1       174  63245 3001.6
## <none>                           63071 3001.8
## - drama             1       220  63291 3002.0
## - imdb_num_votes    1       271  63342 3002.6
## - best_actress_win  1       294  63365 3002.8
## - mpaa_rating_R     1       330  63401 3003.2
## - best_pic_nom      1       342  63414 3003.3
## - thtr_rel_year     1       397  63468 3003.9
## - runtime           1       586  63657 3005.8
## - critics_score     1       680  63751 3006.8
## - imdb_rating       1     58858 121929 3428.2
## 
## Step:  AIC=3000.75
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3000.5
## - best_actor_win    1       171  63336 3000.5
## - feature_film      1       183  63348 3000.6
## <none>                           63165 3000.7
## - drama             1       228  63394 3001.1
## - imdb_num_votes    1       247  63412 3001.3
## - best_actress_win  1       299  63464 3001.8
## - best_pic_nom      1       326  63491 3002.1
## - mpaa_rating_R     1       345  63510 3002.3
## - thtr_rel_year     1       368  63533 3002.5
## - critics_score     1       651  63816 3005.4
## - runtime           1       673  63839 3005.6
## - imdb_rating       1     58895 122061 3426.9
## 
## Step:  AIC=3000.46
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3000.1
## <none>                           63332 3000.5
## - best_actor_win    1       195  63527 3000.5
## - drama             1       204  63536 3000.6
## - imdb_num_votes    1       260  63592 3001.1
## - best_pic_nom      1       297  63629 3001.5
## - best_actress_win  1       297  63629 3001.5
## - mpaa_rating_R     1       356  63688 3002.1
## - thtr_rel_year     1       361  63693 3002.2
## - runtime           1       690  64022 3005.5
## - critics_score     1       732  64064 3005.9
## - imdb_rating       1     58763 122095 3425.1
## 
## Step:  AIC=3000.06
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 2999.3
## - imdb_num_votes    1       173  63661 2999.8
## <none>                           63488 3000.1
## - best_actor_win    1       219  63706 3000.3
## - thtr_rel_year     1       277  63765 3000.9
## - best_pic_nom      1       291  63778 3001.0
## - best_actress_win  1       306  63794 3001.2
## - mpaa_rating_R     1       453  63941 3002.7
## - runtime           1       715  64203 3005.3
## - critics_score     1       875  64363 3007.0
## - imdb_rating       1     63189 126677 3447.1
## 
## Step:  AIC=2999.3
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 2998.8
## <none>                           63609 2999.3
## - best_actor_win    1       209  63818 2999.4
## - thtr_rel_year     1       272  63881 3000.1
## - best_actress_win  1       274  63883 3000.1
## - best_pic_nom      1       307  63916 3000.4
## - mpaa_rating_R     1       391  64000 3001.3
## - runtime           1       631  64240 3003.7
## - critics_score     1       916  64525 3006.6
## - imdb_rating       1     63434 127043 3447.0
## 
## Step:  AIC=2998.81
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                           63757 2998.8
## - thtr_rel_year     1       201  63958 2998.9
## - best_actor_win    1       219  63976 2999.0
## - best_actress_win  1       266  64023 2999.5
## - mpaa_rating_R     1       367  64124 3000.5
## - best_pic_nom      1       442  64199 3001.3
## - runtime           1       519  64276 3002.1
## - critics_score     1       879  64635 3005.7
## - imdb_rating       1     67356 131113 3465.4
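
To make the AIC column concrete: for a linear model, stepAIC relies on extractAIC, which computes n * log(RSS / n) + k * edf, where k = 2 for AIC and edf is the number of estimated coefficients. The sketch below plugs in figures from the first block of the trace and then checks the same formula on the built-in mtcars data. The value n = 650 is an assumption (651 movies minus the one row with a missing runtime), and the RSS is the rounded value printed in the trace.

```r
# Figures read off the first block of the trace.
rss <- 62990   # RSS of the full model, rounded as printed
n   <- 650     # assumed number of complete rows used in the fit
edf <- 17      # intercept plus 16 predictor coefficients

aic_start <- n * log(rss / n) + 2 * edf
print(round(aic_start, 1))   # compare with the "Start: AIC=" line

# Check the same formula against extractAIC() on a stand-in model.
fit    <- lm(mpg ~ wt + hp, data = mtcars)
n_fit  <- nrow(mtcars)
manual <- n_fit * log(sum(residuals(fit)^2) / n_fit) + 2 * length(coef(fit))
stopifnot(isTRUE(all.equal(manual, extractAIC(fit)[2])))
```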

The final model built using AIC consists of the following variables:

runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win

AIC.lm <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data=movies2)

Taking a look at the coefficients of this model:

AIC.lm$coefficients
##         (Intercept)             runtime    mpaa_rating_Ryes       thtr_rel_year 
##         70.10675281         -0.05115515         -1.50528039         -0.05122557 
##         imdb_rating       critics_score     best_pic_nomyes   best_actor_winyes 
##         15.00149242          0.06409989          4.88277038         -1.73481942 
## best_actress_winyes 
##         -2.11568281

Taking a look at the residual standard error (sigma) of the model:

summary(AIC.lm)$sigma
## [1] 9.973201
#plot(movies2$audience_score ~ AIC.lm$residuals)

Plotting the residuals of the model:

ggplot(data.frame(resid = residuals(AIC.lm)), aes(x = resid)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram suggests that the residuals are approximately normally distributed.
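
A histogram is only a rough check of normality. As a complementary sketch, the snippet below computes normal Q-Q coordinates and a Shapiro-Wilk test for the residuals of a stand-in model fit to the built-in mtcars data. The stand-in model is an assumption made so that the example runs without movies.Rdata; it is not the movie model.

```r
# Stand-in model; with the real data, residuals(AIC.lm) would be used instead.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Q-Q coordinates without drawing; a correlation near 1 between the
# theoretical and sample quantiles is consistent with normal residuals.
qq     <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)

print(qq_cor)
print(sw$p.value)
```

Calling qqnorm(res) followed by qqline(res) draws the corresponding plot.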

Creating a model based on BIC

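We will again use stepAIC with backward elimination, but set the penalty k to log(n) so that the criterion being minimized is BIC rather than AIC. The trace still labels the criterion column “AIC”. Note that k is computed from nrow(movies2) (651 rows), while as_full was fit to the 650 complete rows; the effect on the penalty is negligible.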

stepBIC.model <- stepAIC(as_full, direction = "backward", k=log(nrow(movies2)), trace = TRUE)
## Start:  AIC=3083.07
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         9  62999 3076.7
## - oscar_season      1        28  63018 3076.9
## - best_pic_win      1        48  63038 3077.1
## - best_dir_win      1        51  63040 3077.1
## - summer_season     1        92  63081 3077.5
## - best_actor_win    1       171  63160 3078.4
## - feature_film      1       177  63166 3078.4
## - drama             1       216  63206 3078.8
## - imdb_num_votes    1       255  63244 3079.2
## - best_actress_win  1       283  63273 3079.5
## - mpaa_rating_R     1       314  63304 3079.8
## - thtr_rel_year     1       397  63386 3080.7
## - best_pic_nom      1       408  63398 3080.8
## - runtime           1       538  63527 3082.1
## <none>                           62990 3083.1
## - critics_score     1       669  63659 3083.5
## - imdb_rating       1     58556 121545 3503.9
## 
## Step:  AIC=3076.69
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        26  63025 3070.5
## - best_pic_win      1        49  63047 3070.7
## - best_dir_win      1        52  63051 3070.8
## - summer_season     1        94  63093 3071.2
## - best_actor_win    1       169  63168 3072.0
## - feature_film      1       176  63175 3072.0
## - drama             1       214  63213 3072.4
## - best_actress_win  1       279  63278 3073.1
## - imdb_num_votes    1       302  63301 3073.3
## - mpaa_rating_R     1       330  63329 3073.6
## - best_pic_nom      1       404  63403 3074.4
## - thtr_rel_year     1       415  63414 3074.5
## - runtime           1       535  63534 3075.7
## <none>                           62999 3076.7
## - critics_score     1       681  63680 3077.2
## - imdb_rating       1     58606 121604 3497.7
## 
## Step:  AIC=3070.49
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        46  63071 3064.5
## - best_dir_win      1        56  63081 3064.6
## - best_actor_win    1       174  63200 3065.8
## - summer_season     1       177  63202 3065.8
## - feature_film      1       182  63207 3065.9
## - drama             1       222  63247 3066.3
## - best_actress_win  1       281  63307 3066.9
## - imdb_num_votes    1       302  63328 3067.1
## - mpaa_rating_R     1       329  63354 3067.4
## - best_pic_nom      1       387  63412 3068.0
## - thtr_rel_year     1       410  63436 3068.2
## - runtime           1       587  63613 3070.0
## <none>                           63025 3070.5
## - critics_score     1       679  63704 3071.0
## - imdb_rating       1     58603 121628 3491.3
## 
## Step:  AIC=3064.48
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        94  63165 3059.0
## - best_actor_win    1       163  63234 3059.7
## - feature_film      1       171  63242 3059.8
## - summer_season     1       174  63245 3059.8
## - drama             1       220  63291 3060.3
## - imdb_num_votes    1       271  63342 3060.8
## - best_actress_win  1       294  63365 3061.0
## - mpaa_rating_R     1       330  63401 3061.4
## - best_pic_nom      1       342  63414 3061.5
## - thtr_rel_year     1       397  63468 3062.1
## - runtime           1       586  63657 3064.0
## <none>                           63071 3064.5
## - critics_score     1       680  63751 3065.0
## - imdb_rating       1     58858 121929 3486.5
## 
## Step:  AIC=3058.97
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       167  63332 3054.2
## - best_actor_win    1       171  63336 3054.2
## - feature_film      1       183  63348 3054.4
## - drama             1       228  63394 3054.8
## - imdb_num_votes    1       247  63412 3055.0
## - best_actress_win  1       299  63464 3055.6
## - best_pic_nom      1       326  63491 3055.8
## - mpaa_rating_R     1       345  63510 3056.0
## - thtr_rel_year     1       368  63533 3056.3
## <none>                           63165 3059.0
## - critics_score     1       651  63816 3059.2
## - runtime           1       673  63839 3059.4
## - imdb_rating       1     58895 122061 3480.7
## 
## Step:  AIC=3054.2
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       156  63488 3049.3
## - best_actor_win    1       195  63527 3049.7
## - drama             1       204  63536 3049.8
## - imdb_num_votes    1       260  63592 3050.4
## - best_pic_nom      1       297  63629 3050.8
## - best_actress_win  1       297  63629 3050.8
## - mpaa_rating_R     1       356  63688 3051.4
## - thtr_rel_year     1       361  63693 3051.4
## <none>                           63332 3054.2
## - runtime           1       690  64022 3054.8
## - critics_score     1       732  64064 3055.2
## - imdb_rating       1     58763 122095 3474.4
## 
## Step:  AIC=3049.32
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       121  63609 3044.1
## - imdb_num_votes    1       173  63661 3044.6
## - best_actor_win    1       219  63706 3045.1
## - thtr_rel_year     1       277  63765 3045.7
## - best_pic_nom      1       291  63778 3045.8
## - best_actress_win  1       306  63794 3046.0
## - mpaa_rating_R     1       453  63941 3047.5
## <none>                           63488 3049.3
## - runtime           1       715  64203 3050.1
## - critics_score     1       875  64363 3051.7
## - imdb_rating       1     63189 126677 3491.9
## 
## Step:  AIC=3044.09
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       148  63757 3039.1
## - best_actor_win    1       209  63818 3039.7
## - thtr_rel_year     1       272  63881 3040.4
## - best_actress_win  1       274  63883 3040.4
## - best_pic_nom      1       307  63916 3040.7
## - mpaa_rating_R     1       391  64000 3041.6
## - runtime           1       631  64240 3044.0
## <none>                           63609 3044.1
## - critics_score     1       916  64525 3046.9
## - imdb_rating       1     63434 127043 3487.3
## 
## Step:  AIC=3039.12
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - thtr_rel_year     1       201  63958 3034.7
## - best_actor_win    1       219  63976 3034.9
## - best_actress_win  1       266  64023 3035.3
## - mpaa_rating_R     1       367  64124 3036.4
## - best_pic_nom      1       442  64199 3037.1
## - runtime           1       519  64276 3037.9
## <none>                           63757 3039.1
## - critics_score     1       879  64635 3041.5
## - imdb_rating       1     67356 131113 3501.3
## 
## Step:  AIC=3034.68
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actor_win    1       207  64165 3030.3
## - best_actress_win  1       261  64219 3030.9
## - mpaa_rating_R     1       373  64331 3032.0
## - best_pic_nom      1       447  64405 3032.7
## - runtime           1       468  64425 3032.9
## <none>                           63958 3034.7
## - critics_score     1       968  64926 3038.0
## - imdb_rating       1     67172 131129 3494.9
## 
## Step:  AIC=3030.3
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actress_win  1       296  64461 3026.8
## - mpaa_rating_R     1       366  64531 3027.5
## - best_pic_nom      1       396  64561 3027.8
## <none>                           64165 3030.3
## - runtime           1       643  64808 3030.3
## - critics_score     1       968  65133 3033.6
## - imdb_rating       1     67296 131461 3490.0
## 
## Step:  AIC=3026.82
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom
## 
##                 Df Sum of Sq    RSS    AIC
## - best_pic_nom   1       303  64765 3023.4
## - mpaa_rating_R  1       354  64815 3023.9
## <none>                        64461 3026.8
## - runtime        1       814  65275 3028.5
## - critics_score  1       957  65418 3029.9
## - imdb_rating    1     67424 131885 3485.7
## 
## Step:  AIC=3023.39
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## - mpaa_rating_R  1       361  65126 3020.5
## - runtime        1       638  65403 3023.3
## <none>                        64765 3023.4
## - critics_score  1      1027  65792 3027.1
## - imdb_rating    1     68173 132937 3484.3
## 
## Step:  AIC=3020.53
## audience_score ~ runtime + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                        65126 3020.5
## - runtime        1       653  65779 3020.5
## - critics_score  1      1073  66199 3024.7
## - imdb_rating    1     67874 133000 3478.2
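
The only change from the AIC run is the per-coefficient penalty: log(n) instead of 2. The sketch below reproduces the starting value of this trace from figures printed in it and confirms the formula with extractAIC on the built-in mtcars data. As before, n_fit = 650 is an assumption and the RSS is the rounded printed value.

```r
rss   <- 62990      # RSS of the full model, rounded as printed in the trace
n_fit <- 650        # assumed number of complete rows used to fit as_full
edf   <- 17         # intercept plus 16 predictor coefficients
k_bic <- log(651)   # penalty passed in the call: log(nrow(movies2))

bic_start <- n_fit * log(rss / n_fit) + k_bic * edf
print(round(bic_start, 1))   # compare with the first "Start:" line
print(round(k_bic, 2))       # penalty per coefficient, versus 2 for AIC

# Same check on a stand-in model: extractAIC with k = log(n).
fit    <- lm(mpg ~ wt + hp, data = mtcars)
n      <- nrow(mtcars)
manual <- n * log(sum(residuals(fit)^2) / n) + log(n) * length(coef(fit))
stopifnot(isTRUE(all.equal(manual, extractAIC(fit, k = log(n))[2])))
```

Because each retained coefficient costs about 6.5 rather than 2, the BIC search keeps removing variables after the AIC search has stopped, which is why it ends with a smaller model.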

The final model will use the following variables:

audience_score ~ runtime + imdb_rating + critics_score

BIC.lm <- lm(audience_score ~ runtime + imdb_rating + critics_score, data=movies2)
BIC.lm$coefficients
##   (Intercept)       runtime   imdb_rating critics_score 
##  -33.28320569   -0.05361506   14.98076157    0.07035672
summary(BIC.lm)$sigma
## [1] 10.04062
#plot(na.omit(movies2$audience_score) ~ BIC.lm$residuals)

Taking a look at the residuals:

ggplot(data.frame(resid = residuals(BIC.lm)), aes(x = resid)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As with the AIC model, the histogram suggests that the residuals are approximately normally distributed.

Creating a model using Bayesian model averaging

as_full.bas <- bas.lm(audience_score ~ .,
       prior ="BIC",
       modelprior = uniform(),
       data = na.omit(movies2))

as_full.bas
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies2), 
##     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes  
##             1.00000              0.06537              0.04320  
##             runtime     mpaa_rating_Ryes        thtr_rel_year  
##             0.46971              0.19984              0.09069  
##     oscar_seasonyes     summer_seasonyes          imdb_rating  
##             0.07506              0.08042              1.00000  
##      imdb_num_votes        critics_score      best_pic_nomyes  
##             0.05774              0.88855              0.13119  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes  
##             0.03985              0.14435              0.14128  
##     best_dir_winyes        top200_boxyes  
##             0.06694              0.04762

According to this output, imdb_rating has a marginal posterior inclusion probability of essentially 1, meaning that nearly all of the posterior probability falls on models that include it. Other noteworthy variables are critics_score (~89%) and runtime (~47%). The next highest is mpaa_rating_R (coefficient mpaa_rating_Ryes) at ~20%.
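
To illustrate what an inclusion probability is, the sketch below enumerates a tiny model space by hand. It assumes the common approximation that, under a uniform model prior, posterior model probabilities are proportional to exp(-BIC / 2). It uses the built-in mtcars data with two candidate predictors, and it is a simplified illustration rather than a description of how BAS computes its results internally.

```r
# All four models over two candidate predictors of mpg.
forms <- c("mpg ~ 1", "mpg ~ wt", "mpg ~ hp", "mpg ~ wt + hp")
bic   <- sapply(forms, function(f) BIC(lm(as.formula(f), data = mtcars)))

# Approximate posterior model probabilities under a uniform model prior.
# Subtracting the minimum BIC avoids numerical underflow in exp().
w        <- exp(-0.5 * (bic - min(bic)))
postprob <- w / sum(w)

# Inclusion probability of a variable: the total posterior probability
# of the models that contain it.
pip_wt <- sum(postprob[grepl("wt", forms)])
pip_hp <- sum(postprob[grepl("hp", forms)])

print(round(postprob, 4))
print(c(wt = pip_wt, hp = pip_hp))
```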

confint(coef(as_full.bas))
##                              2.5%        97.5%          beta
## Intercept            6.159980e+01 6.314012e+01  6.234769e+01
## feature_filmyes     -9.335871e-01 1.875713e-01 -1.046908e-01
## dramayes             0.000000e+00 0.000000e+00  1.604413e-02
## runtime             -8.308220e-02 0.000000e+00 -2.567772e-02
## mpaa_rating_Ryes    -2.108859e+00 0.000000e+00 -3.036174e-01
## thtr_rel_year       -5.473572e-02 1.090637e-04 -4.532635e-03
## oscar_seasonyes     -1.035255e+00 8.710594e-03 -8.034940e-02
## summer_seasonyes    -9.139496e-03 1.055576e+00  8.704545e-02
## imdb_rating          1.370488e+01 1.659557e+01  1.498203e+01
## imdb_num_votes      -8.960385e-08 1.536983e-06  2.080713e-07
## critics_score        0.000000e+00 1.058527e-01  6.296648e-02
## best_pic_nomyes     -1.007777e-01 4.771271e+00  5.068035e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -8.502836e-03
## best_actor_winyes   -2.581776e+00 0.000000e+00 -2.876695e-01
## best_actress_winyes -2.833973e+00 0.000000e+00 -3.088382e-01
## best_dir_winyes     -1.145373e+00 0.000000e+00 -1.195011e-01
## top200_boxyes       -3.053916e-02 7.534309e-02  8.648185e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(as_full.bas)
##                     P(B != 0 | Y)    model 1       model 2       model 3
## Intercept              1.00000000     1.0000     1.0000000     1.0000000
## feature_filmyes        0.06536947     0.0000     0.0000000     0.0000000
## dramayes               0.04319833     0.0000     0.0000000     0.0000000
## runtime                0.46971477     1.0000     0.0000000     0.0000000
## mpaa_rating_Ryes       0.19984016     0.0000     0.0000000     0.0000000
## thtr_rel_year          0.09068970     0.0000     0.0000000     0.0000000
## oscar_seasonyes        0.07505684     0.0000     0.0000000     0.0000000
## summer_seasonyes       0.08042023     0.0000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000
## imdb_num_votes         0.05773502     0.0000     0.0000000     0.0000000
## critics_score          0.88855056     1.0000     1.0000000     1.0000000
## best_pic_nomyes        0.13119140     0.0000     0.0000000     0.0000000
## best_pic_winyes        0.03984766     0.0000     0.0000000     0.0000000
## best_actor_winyes      0.14434896     0.0000     0.0000000     1.0000000
## best_actress_winyes    0.14128087     0.0000     0.0000000     0.0000000
## best_dir_winyes        0.06693898     0.0000     0.0000000     0.0000000
## top200_boxyes          0.04762234     0.0000     0.0000000     0.0000000
## BF                             NA     1.0000     0.9968489     0.2543185
## PostProbs                      NA     0.1297     0.1293000     0.0330000
## R2                             NA     0.7549     0.7525000     0.7539000
## dim                            NA     4.0000     3.0000000     4.0000000
## logmarg                        NA -3615.2791 -3615.2822108 -3616.6482224
##                           model 4       model 5
## Intercept               1.0000000     1.0000000
## feature_filmyes         0.0000000     0.0000000
## dramayes                0.0000000     0.0000000
## runtime                 0.0000000     1.0000000
## mpaa_rating_Ryes        1.0000000     1.0000000
## thtr_rel_year           0.0000000     0.0000000
## oscar_seasonyes         0.0000000     0.0000000
## summer_seasonyes        0.0000000     0.0000000
## imdb_rating             1.0000000     1.0000000
## imdb_num_votes          0.0000000     0.0000000
## critics_score           1.0000000     1.0000000
## best_pic_nomyes         0.0000000     0.0000000
## best_pic_winyes         0.0000000     0.0000000
## best_actor_winyes       0.0000000     0.0000000
## best_actress_winyes     0.0000000     0.0000000
## best_dir_winyes         0.0000000     0.0000000
## top200_boxyes           0.0000000     0.0000000
## BF                      0.2521327     0.2391994
## PostProbs               0.0327000     0.0310000
## R2                      0.7539000     0.7563000
## dim                     4.0000000     5.0000000
## logmarg             -3616.6568544 -3616.7095127

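The highest-probability model (model 1, posterior probability ~0.13) contains the variables runtime, imdb_rating, and critics_score. Notice that this is the same model created by the backwards stepwise BIC method above. Model 2, which drops runtime, is almost equally probable (Bayes factor ~0.997 relative to model 1).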
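The BF row of the summary can be reproduced approximately from the logmarg row, since each Bayes factor compares a model's marginal likelihood with that of the top-ranked model. This is a minimal sketch using the printed values, so small rounding differences from the table are expected.

```r
# Log marginal likelihoods of the five top models, copied from summary(as_full.bas)
logmarg <- c(-3615.2791, -3615.2822108, -3616.6482224, -3616.6568544, -3616.7095127)

# Bayes factor of each model against the top-ranked model
bf <- exp(logmarg - logmarg[1])

# Under the uniform model prior used here, these ratios also equal
# the ratios of posterior model probabilities
round(bf, 4)
```

For example, the ratio of the PostProbs entries for models 2 and 1 (0.1293 / 0.1297) is about 0.997, in line with the Bayes factor for model 2.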

Below, we can visualize how the models analyzed by the bas.lm function rank against each other. The best model (rank 1) is shown on the left, with the colored squares marking the variables included in each model.

image(as_full.bas, rotate = FALSE)

qqnorm(BIC.lm$residuals, col="red")
qqline(BIC.lm$residuals)

The Q-Q plot is consistent with approximately normally distributed residuals.

Now let’s plot the residuals against the fitted values.

plot(BIC.lm$residuals ~ BIC.lm$fitted.values, col="red")
abline(h=0, lty=2)

We see some left-skewness here, but the residuals are generally scattered around 0.

Now let’s plot the absolute value of the residuals against the fitted values.

plot(abs(BIC.lm$residuals) ~ BIC.lm$fitted.values, col="red")

We do not see a fan shape, so the constant-variance condition appears to be met.


Part 5: Prediction

The movie I’ve chosen is Finding Dory. The information I will be using for the prediction comes from IMDB and Rotten Tomatoes.

I’ll create a data frame containing Finding Dory’s information.

finding_dory_df <- data.frame(imdb_rating = 7.5,
                              runtime = 97,
                              critics_score = 94,
                              mpaa_rating_R = "no",
                              thtr_rel_year = 2016,
                              best_pic_nom = "no",
                              best_actor_win = "no",
                              best_actress_win = "no")

I will run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.

predict(BIC.lm, newdata = finding_dory_df, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 80.48538 60.72202 100.2487

The BIC model predicts a score of 80.48538.
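The point prediction can be checked by hand from the BIC.lm coefficients printed earlier. The sketch below copies the printed values, so the variable names are illustrative, and it also compares the half-width of the reported interval with the residual standard error.

```r
# Coefficients of BIC.lm, copied from the output of BIC.lm$coefficients
beta <- c(intercept = -33.28320569, runtime = -0.05361506,
          imdb_rating = 14.98076157, critics_score = 0.07035672)

# Finding Dory's predictor values, with 1 for the intercept
x <- c(intercept = 1, runtime = 97, imdb_rating = 7.5, critics_score = 94)

# The point prediction is the dot product of coefficients and predictor values
fit <- sum(beta * x)

# Half-width of the reported 95% prediction interval,
# expressed in units of the residual standard error (10.04062)
half_width <- (100.2487 - 60.72202) / 2
ratio <- half_width / 10.04062

round(c(fit = fit, half_width = half_width, ratio = ratio), 3)
```

The ratio comes out close to the familiar 1.96 multiplier, which is what one would expect when the residual standard error dominates the prediction uncertainty. Note also that the upper bound of the interval slightly exceeds 100, the maximum possible audience score.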

predict(AIC.lm, newdata = finding_dory_df, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 80.41053 60.71769 100.1034

The AIC model predicts a score of 80.41053.

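The true audience score was 86, which falls inside both prediction intervals. The BIC model was only marginally closer: its prediction misses by about 5.5 points (93.59% of the true score), compared with about 5.6 points (93.50%) for the AIC model.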
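The accuracy percentages appear to be the ratio of the predicted score to the true score. That definition is inferred from the numbers and is not stated in the analysis. A short sketch of the arithmetic, with illustrative variable names:

```r
# True audience score and the two point predictions reported above
true_score <- 86
pred <- c(BIC = 80.48538, AIC = 80.41053)

# Absolute error in score points
abs_error <- abs(true_score - pred)

# Prediction as a percentage of the true score
pct_of_true <- 100 * pred / true_score

round(rbind(abs_error, pct_of_true), 3)
```

The two models differ by less than a tenth of a point, far less than the width of either prediction interval.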


Part 6: Conclusion

The model created using stepAIC tuned toward BIC was the same model that bas.lm ranked highest. In the end, the AIC and BIC models produced almost identical predictions. The residual plots showed some left-skewness, and I believe that with a larger project scope the errors could be brought closer to a normal distribution. One method for dealing with these issues, which was not touched on in this project, is variable transformation.