Setup

Load packages

library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.2
library(statsr)
library(BAS)
## Warning: package 'BAS' was built under R version 4.0.2
library(MASS)
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.2
library("gplots")
## Warning: package 'gplots' was built under R version 4.0.2
library(gridExtra)
library(grid)
library(plotmo)
## Warning: package 'plotmo' was built under R version 4.0.2
## Warning: package 'TeachingDemos' was built under R version 4.0.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.0.2
## Warning: package 'readr' was built under R version 4.0.2
## Warning: package 'forcats' was built under R version 4.0.2
library(broom)
options(width=100)

Load data

load("movies.Rdata")

Introduction

The goal of this project is to develop a Bayesian regression model to predict audience score of a film.

Part 1: Data

The data draws from APIs from imdb.com, rottentomatoes.com, and flixster.com. It is comprised of 651 randomly sampled movies produced and released before 2016.

Since random sampling is used, only correlations can be drawn, hence we can generalize our conclusions about audience_score to the population of movies rated by users of the two web-systems. One source of bias comes from the fact that the data is drawn from sources that cater to English speakers only, the data will be biased towards movies where English is the main language.

Another source of bias is that users who rated a movie on IMDb may also have rated the movies on Rotten Tomatoes. Therefore, if we create a model that uses a Rotten Tomatoes measure as response variable and an IMDb measure as explanatory variable, we may have bias, since part of the information are given by the same users in both systems. Another fact is that the ratings are expected to be correlated, given the bias described. Random assignment was not used, so our conclusions cannot be of causation.


Part 2: Data manipulation

In this section, we would do the following: 1. Remove Observations with NUll values 2. Create new variables to aid exploratory data analysis. 3. Create a subset of the movies data

Step 1: Remove observations with Null values

movies <- na.omit(movies)

Step 2: Create new variables

Finally, We are create some new variables to aid our data analysis. Below is a summary of the new variables.

feature_film: “yes” if title_type is Feature Film, “no” otherwise.

drama: “yes” if genre is Drama, “no” otherwise runtime.

mpaa_rating_R: “yes” if mpaa_rating is R, “no” otherwise

oscar_season: “yes” if movie is released in November, October, or December (based on thtr_rel_month), “no” otherwise.

summer_season: “yes” if movie is released in May, June, July, or August (based on thtr_rel_month), “no” otherwise.

# Create new variable based on `title_type`: New variable should be called 
# `feature_film` with levels yes (movies that are feature films) and no
movies <- movies %>%
            mutate(feature_film=ifelse(title_type=="Feature Film", "yes", "no"))
# Create new variable based on `genre`: New variable should be called `drama`
# with levels yes (movies that are dramas) and no
movies <- movies %>%
            mutate(drama=ifelse(genre=="Drama", "yes", "no"))
# Create new variable based on `mpaa_rating`: New variable should be called 
# `mpaa_rating_R` with levels yes (movies that are R rated) and no
movies <- movies %>%
            mutate(mpaa_rating_R=ifelse(mpaa_rating=="R", "yes", "no"))
# Create two new variables based on `thtr_rel_month`: 
# + New variable called `oscar_season` with levels yes (if movie is released in November, October, or December) and no 
# + New variable called `summer_season` with levels yes (if movie is released in May, June, July, or August) and no
movies <- movies %>%
            mutate(oscar_season=ifelse(thtr_rel_month >= 10 & thtr_rel_month <= 12 , "yes", "no"),
                   
                   summer_season=ifelse(thtr_rel_month >= 5 & thtr_rel_month <= 8, "yes", "no"))

Step 3: Create a subset of the movies data

We’ll then create a subset of the movies data

movies_features <- c("audience_score", "feature_film", "drama", "runtime", "mpaa_rating_R", "thtr_rel_year", "oscar_season", "summer_season", "imdb_rating", "imdb_num_votes", "critics_score", "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box")
movies <- movies[movies_features]

Part 3: Exploratory data analysis

We’ll start out at a higher, broader level by taking a look at a summary of the variables in movies.

summary(movies)
##  audience_score  feature_film          drama              runtime      mpaa_rating_R     
##  Min.   :11.00   Length:619         Length:619         Min.   : 65.0   Length:619        
##  1st Qu.:46.00   Class :character   Class :character   1st Qu.: 93.0   Class :character  
##  Median :65.00   Mode  :character   Mode  :character   Median :103.0   Mode  :character  
##  Mean   :62.21                                         Mean   :106.5                     
##  3rd Qu.:80.00                                         3rd Qu.:116.0                     
##  Max.   :97.00                                         Max.   :267.0                     
##  thtr_rel_year  oscar_season       summer_season       imdb_rating    imdb_num_votes  
##  Min.   :1972   Length:619         Length:619         Min.   :1.900   Min.   :   183  
##  1st Qu.:1991   Class :character   Class :character   1st Qu.:5.900   1st Qu.:  5026  
##  Median :2000   Mode  :character   Mode  :character   Median :6.600   Median : 16480  
##  Mean   :1998                                         Mean   :6.486   Mean   : 60014  
##  3rd Qu.:2007                                         3rd Qu.:7.300   3rd Qu.: 62507  
##  Max.   :2014                                         Max.   :9.000   Max.   :893008  
##  critics_score    best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win top200_box
##  Min.   :  1.00   no :597      no :612      no :528        no :548          no :576      no :604   
##  1st Qu.: 33.00   yes: 22      yes:  7      yes: 91        yes: 71          yes: 43      yes: 15   
##  Median : 61.00                                                                                    
##  Mean   : 57.43                                                                                    
##  3rd Qu.: 82.50                                                                                    
##  Max.   :100.00

This summary gives us a look at the spread of each variable.

Let’s also take a look at the levels of each variable.

str(movies)
## tibble [619 x 17] (S3: tbl_df/tbl/data.frame)
##  $ audience_score  : num [1:619] 73 81 91 76 27 76 47 89 66 75 ...
##  $ feature_film    : chr [1:619] "yes" "yes" "yes" "yes" ...
##  $ drama           : chr [1:619] "yes" "yes" "no" "yes" ...
##  $ runtime         : num [1:619] 80 101 84 139 90 142 93 88 119 127 ...
##  $ mpaa_rating_R   : chr [1:619] "yes" "no" "yes" "no" ...
##  $ thtr_rel_year   : num [1:619] 2013 2001 1996 1993 2004 ...
##  $ oscar_season    : chr [1:619] "no" "no" "no" "yes" ...
##  $ summer_season   : chr [1:619] "no" "no" "yes" "no" ...
##  $ imdb_rating     : num [1:619] 5.5 7.3 7.6 7.2 5.1 7.2 5.5 7.5 6.6 6.8 ...
##  $ imdb_num_votes  : int [1:619] 899 12285 22381 35096 2386 5016 2272 880 12496 71979 ...
##  $ critics_score   : num [1:619] 45 96 91 80 33 57 17 90 83 89 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 2 1 1 2 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 2 ...
##  - attr(*, "na.action")= 'omit' Named int [1:32] 6 25 94 100 131 172 175 184 198 207 ...
##   ..- attr(*, "names")= chr [1:32] "6" "25" "94" "100" ...

We are going to conduct analyses on the relation of the new variables feature_film, drama, mpaa_rating_R, oscar_season and summer_season to the response variable audience_score.

The following boxplots show the visualization of distribution of audience_score from each group of yes and no of our new binary-categorical variables. The histogram shows the distribution of audience_score.

# plotting variables as horizontal -> Labels are read horizontally
a <- ggplot(aes(x=audience_score), data=movies) + geom_histogram(binwidth = 7, color = "steelblue", fill="white")
b <- ggplot(aes(x=feature_film, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
c <- ggplot(aes(x=drama, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
d <- ggplot(aes(x=mpaa_rating_R, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
e <- ggplot(aes(x=oscar_season, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
f <- ggplot(aes(x=summer_season, y=audience_score), data=movies) + geom_boxplot(color = "steelblue")
# Create a grid of plots
grid.arrange(
  a, b, c, d, e, f,
  nrow = 2,
  top = "Audience Score",
  bottom = textGrob(
    "",
    gp = gpar(fontface = 3, fontsize = 9),
    hjust = 1,
    x = 1
  )
)

From the plots above we see that the biggest difference is from audience_scores from feature_film. > Other variables have similar and small differences in the median.

summary(movies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.21   80.00   97.00

The median of audience_score is 65 and the mean is 62, which are not easily understood form the histogram.

We’ll then map out correlation charts that will show the relationships between audience_score and all other variables in movies.

suppressWarnings(suppressMessages(print(ggpairs(movies, columns = 1:8))))

suppressWarnings(suppressMessages(print(ggpairs(movies, columns = c(1,9:17)))))

Notice the high correlation between audience_score and critics_score. Let’s visualize this correlation with a scatter plot with a regression line.

cor(movies$audience_score, movies$critics_score)
## [1] 0.7015256
ggplot(data=movies, aes(x = audience_score, y = critics_score)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

Let’s do the same with imdb_rating, again due to the high correlation it has with audience_score.

cor(movies$audience_score, movies$imdb_rating)
## [1] 0.8605425
ggplot(data=movies, aes(x = audience_score, y = imdb_rating)) +
  geom_jitter() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

We can see strong positive correlations with both sets of variables.


Part 4: Modeling

We’ll first create the full linear model with all the variables in movies.

We will use the stepAIC function from library MASS to build a model (backwards) until the AIC can not be lowered.

as_full <- lm(audience_score ~ ., data= na.omit(movies))

as_full
## 
## Call:
## lm(formula = audience_score ~ ., data = na.omit(movies))
## 
## Coefficients:
##         (Intercept)      feature_filmyes             dramayes              runtime  
##           1.279e+02           -2.478e+00            1.340e+00           -6.438e-02  
##    mpaa_rating_Ryes        thtr_rel_year      oscar_seasonyes     summer_seasonyes  
##          -1.417e+00           -7.778e-02           -3.442e-01            9.833e-01  
##         imdb_rating       imdb_num_votes        critics_score      best_pic_nomyes  
##           1.467e+01            7.547e-06            6.176e-02            5.258e+00  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes      best_dir_winyes  
##          -3.097e+00           -1.347e+00           -2.311e+00           -1.233e+00  
##       top200_boxyes  
##           7.978e-01

Creating a model based on AIC

We will use the stepAIC function, tuned to optimize for AIC, to find the best model. The model will be built backwards.

stepAIC.model <- stepAIC(as_full, direction = "backward", trace = TRUE)
## Start:  AIC=2878.42
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         8  61293 2876.5
## - oscar_season      1        11  61296 2876.5
## - best_pic_win      1        45  61330 2876.9
## - best_dir_win      1        50  61336 2876.9
## - summer_season     1       102  61388 2877.5
## - best_actor_win    1       126  61411 2877.7
## - feature_film      1       177  61463 2878.2
## <none>                           61285 2878.4
## - drama             1       226  61511 2878.7
## - imdb_num_votes    1       272  61557 2879.2
## - mpaa_rating_R     1       289  61574 2879.3
## - best_actress_win  1       308  61593 2879.5
## - thtr_rel_year     1       378  61663 2880.2
## - best_pic_nom      1       397  61682 2880.4
## - runtime           1       647  61932 2882.9
## - critics_score     1       737  62022 2883.8
## - imdb_rating       1     54331 115617 3269.3
## 
## Step:  AIC=2876.5
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        10  61303 2874.6
## - best_pic_win      1        45  61338 2875.0
## - best_dir_win      1        52  61345 2875.0
## - summer_season     1       105  61399 2875.6
## - best_actor_win    1       125  61419 2875.8
## - feature_film      1       176  61470 2876.3
## <none>                           61293 2876.5
## - drama             1       224  61517 2876.8
## - mpaa_rating_R     1       303  61597 2877.6
## - best_actress_win  1       305  61598 2877.6
## - imdb_num_votes    1       320  61613 2877.7
## - best_pic_nom      1       393  61686 2878.5
## - thtr_rel_year     1       396  61689 2878.5
## - runtime           1       645  61939 2881.0
## - critics_score     1       748  62042 2882.0
## - imdb_rating       1     54384 115677 3267.7
## 
## Step:  AIC=2874.6
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        43  61347 2873.0
## - best_dir_win      1        54  61358 2873.2
## - best_actor_win    1       129  61432 2873.9
## - summer_season     1       167  61470 2874.3
## - feature_film      1       180  61483 2874.4
## <none>                           61303 2874.6
## - drama             1       231  61534 2874.9
## - mpaa_rating_R     1       303  61607 2875.7
## - best_actress_win  1       305  61609 2875.7
## - imdb_num_votes    1       321  61624 2875.8
## - best_pic_nom      1       384  61687 2876.5
## - thtr_rel_year     1       394  61697 2876.6
## - runtime           1       692  61995 2879.6
## - critics_score     1       748  62052 2880.1
## - imdb_rating       1     54374 115677 3265.7
## 
## Step:  AIC=2873.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        91  61438 2872.0
## - best_actor_win    1       119  61466 2872.2
## - summer_season     1       164  61511 2872.7
## - feature_film      1       170  61517 2872.8
## <none>                           61347 2873.0
## - drama             1       229  61576 2873.3
## - imdb_num_votes    1       289  61636 2874.0
## - mpaa_rating_R     1       305  61651 2874.1
## - best_actress_win  1       319  61665 2874.3
## - best_pic_nom      1       342  61689 2874.5
## - thtr_rel_year     1       381  61728 2874.9
## - runtime           1       692  62038 2878.0
## - critics_score     1       750  62097 2878.6
## - imdb_rating       1     54632 115979 3265.3
## 
## Step:  AIC=2871.96
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actor_win    1       125  61563 2871.2
## - summer_season     1       157  61595 2871.5
## - feature_film      1       185  61622 2871.8
## <none>                           61438 2872.0
## - drama             1       238  61676 2872.4
## - imdb_num_votes    1       266  61704 2872.6
## - mpaa_rating_R     1       318  61756 2873.2
## - best_actress_win  1       324  61762 2873.2
## - best_pic_nom      1       327  61764 2873.2
## - thtr_rel_year     1       352  61790 2873.5
## - critics_score     1       719  62157 2877.2
## - runtime           1       789  62227 2877.9
## - imdb_rating       1     54671 116109 3264.0
## 
## Step:  AIC=2871.22
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       176  61739 2871.0
## <none>                           61563 2871.2
## - feature_film      1       212  61775 2871.3
## - drama             1       238  61801 2871.6
## - imdb_num_votes    1       282  61845 2872.0
## - best_pic_nom      1       294  61857 2872.2
## - mpaa_rating_R     1       310  61873 2872.3
## - thtr_rel_year     1       349  61912 2872.7
## - best_actress_win  1       354  61917 2872.8
## - critics_score     1       712  62275 2876.3
## - runtime           1       964  62527 2878.8
## - imdb_rating       1     54679 116242 3262.7
## 
## Step:  AIC=2870.99
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       188  61927 2870.9
## <none>                           61739 2871.0
## - drama             1       214  61953 2871.1
## - best_pic_nom      1       262  62002 2871.6
## - imdb_num_votes    1       300  62039 2872.0
## - mpaa_rating_R     1       319  62058 2872.2
## - thtr_rel_year     1       342  62082 2872.4
## - best_actress_win  1       353  62092 2872.5
## - critics_score     1       813  62553 2877.1
## - runtime           1      1004  62743 2879.0
## - imdb_rating       1     54557 116296 3261.0
## 
## Step:  AIC=2870.87
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       124  62052 2870.1
## - imdb_num_votes    1       196  62124 2870.8
## <none>                           61927 2870.9
## - best_pic_nom      1       248  62176 2871.4
## - thtr_rel_year     1       251  62178 2871.4
## - best_actress_win  1       364  62291 2872.5
## - mpaa_rating_R     1       411  62339 2873.0
## - critics_score     1       982  62909 2878.6
## - runtime           1      1005  62932 2878.8
## - imdb_rating       1     58584 120511 3281.0
## 
## Step:  AIC=2870.11
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       167  62219 2869.8
## <none>                           62052 2870.1
## - thtr_rel_year     1       244  62296 2870.5
## - best_pic_nom      1       266  62317 2870.8
## - best_actress_win  1       330  62381 2871.4
## - mpaa_rating_R     1       351  62403 2871.6
## - runtime           1       910  62961 2877.1
## - critics_score     1      1021  63073 2878.2
## - imdb_rating       1     58890 120941 3281.2
## 
## Step:  AIC=2869.78
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - thtr_rel_year     1       169  62388 2869.5
## <none>                           62219 2869.8
## - best_actress_win  1       320  62539 2871.0
## - mpaa_rating_R     1       330  62549 2871.1
## - best_pic_nom      1       395  62614 2871.7
## - runtime           1       777  62996 2875.5
## - critics_score     1       979  63198 2877.4
## - imdb_rating       1     63047 125266 3300.9
## 
## Step:  AIC=2869.46
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                           62388 2869.5
## - best_actress_win  1       313  62701 2870.6
## - mpaa_rating_R     1       333  62720 2870.8
## - best_pic_nom      1       400  62787 2871.4
## - runtime           1       713  63101 2874.5
## - critics_score     1      1066  63454 2877.9
## - imdb_rating       1     62879 125267 3299.0

The final model built using AIC consists of the following variables:

runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win

AIC.lm <- lm(audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + critics_score + best_pic_nom + best_actor_win + best_actress_win, data=movies)

The coefficients of this model:

AIC.lm$coefficients
##         (Intercept)             runtime    mpaa_rating_Ryes       thtr_rel_year         imdb_rating 
##         68.80456944         -0.05746878         -1.47061512         -0.05028853         14.97379533 
##       critics_score     best_pic_nomyes   best_actor_winyes best_actress_winyes 
##          0.06895589          4.89109667         -1.55307354         -2.20922544

Taking a look at the standard deviation of the model:

summary(AIC.lm)$sigma
## [1] 10.08554
plot(movies$audience_score ~ AIC.lm$residuals)

Plotting the residuals of the model:

ggplot(data=AIC.lm, aes(x=AIC.lm$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that the residuals are normally distributed.

Creating a model based on BIC

We will use the stepAIC function, tuned to optimize for BIC, to find the best model. The model will be built backwards.

stepBIC.model <- stepAIC(as_full, direction = "backward", k=log(nrow(movies)), trace = TRUE)
## Start:  AIC=2953.7
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win + top200_box
## 
##                    Df Sum of Sq    RSS    AIC
## - top200_box        1         8  61293 2947.4
## - oscar_season      1        11  61296 2947.4
## - best_pic_win      1        45  61330 2947.7
## - best_dir_win      1        50  61336 2947.8
## - summer_season     1       102  61388 2948.3
## - best_actor_win    1       126  61411 2948.5
## - feature_film      1       177  61463 2949.1
## - drama             1       226  61511 2949.6
## - imdb_num_votes    1       272  61557 2950.0
## - mpaa_rating_R     1       289  61574 2950.2
## - best_actress_win  1       308  61593 2950.4
## - thtr_rel_year     1       378  61663 2951.1
## - best_pic_nom      1       397  61682 2951.3
## <none>                           61285 2953.7
## - runtime           1       647  61932 2953.8
## - critics_score     1       737  62022 2954.7
## - imdb_rating       1     54331 115617 3340.2
## 
## Step:  AIC=2947.35
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + oscar_season + summer_season + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
##     best_actor_win + best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - oscar_season      1        10  61303 2941.0
## - best_pic_win      1        45  61338 2941.4
## - best_dir_win      1        52  61345 2941.5
## - summer_season     1       105  61399 2942.0
## - best_actor_win    1       125  61419 2942.2
## - feature_film      1       176  61470 2942.7
## - drama             1       224  61517 2943.2
## - mpaa_rating_R     1       303  61597 2944.0
## - best_actress_win  1       305  61598 2944.0
## - imdb_num_votes    1       320  61613 2944.1
## - best_pic_nom      1       393  61686 2944.9
## - thtr_rel_year     1       396  61689 2944.9
## <none>                           61293 2947.4
## - runtime           1       645  61939 2947.4
## - critics_score     1       748  62042 2948.4
## - imdb_rating       1     54384 115677 3334.1
## 
## Step:  AIC=2941.03
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_pic_win + best_actor_win + 
##     best_actress_win + best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_pic_win      1        43  61347 2935.0
## - best_dir_win      1        54  61358 2935.1
## - best_actor_win    1       129  61432 2935.9
## - summer_season     1       167  61470 2936.3
## - feature_film      1       180  61483 2936.4
## - drama             1       231  61534 2936.9
## - mpaa_rating_R     1       303  61607 2937.7
## - best_actress_win  1       305  61609 2937.7
## - imdb_num_votes    1       321  61624 2937.8
## - best_pic_nom      1       384  61687 2938.5
## - thtr_rel_year     1       394  61697 2938.6
## <none>                           61303 2941.0
## - runtime           1       692  61995 2941.5
## - critics_score     1       748  62052 2942.1
## - imdb_rating       1     54374 115677 3327.6
## 
## Step:  AIC=2935.04
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win + 
##     best_dir_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_dir_win      1        91  61438 2929.5
## - best_actor_win    1       119  61466 2929.8
## - summer_season     1       164  61511 2930.3
## - feature_film      1       170  61517 2930.3
## - drama             1       229  61576 2930.9
## - imdb_num_votes    1       289  61636 2931.5
## - mpaa_rating_R     1       305  61651 2931.7
## - best_actress_win  1       319  61665 2931.8
## - best_pic_nom      1       342  61689 2932.1
## - thtr_rel_year     1       381  61728 2932.4
## <none>                           61347 2935.0
## - runtime           1       692  62038 2935.5
## - critics_score     1       750  62097 2936.1
## - imdb_rating       1     54632 115979 3322.8
## 
## Step:  AIC=2929.53
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actor_win + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actor_win    1       125  61563 2924.4
## - summer_season     1       157  61595 2924.7
## - feature_film      1       185  61622 2925.0
## - drama             1       238  61676 2925.5
## - imdb_num_votes    1       266  61704 2925.8
## - mpaa_rating_R     1       318  61756 2926.3
## - best_actress_win  1       324  61762 2926.4
## - best_pic_nom      1       327  61764 2926.4
## - thtr_rel_year     1       352  61790 2926.6
## <none>                           61438 2929.5
## - critics_score     1       719  62157 2930.3
## - runtime           1       789  62227 2931.0
## - imdb_rating       1     54671 116109 3317.1
## 
## Step:  AIC=2924.36
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
##     critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - summer_season     1       176  61739 2919.7
## - feature_film      1       212  61775 2920.1
## - drama             1       238  61801 2920.3
## - imdb_num_votes    1       282  61845 2920.8
## - best_pic_nom      1       294  61857 2920.9
## - mpaa_rating_R     1       310  61873 2921.0
## - thtr_rel_year     1       349  61912 2921.4
## - best_actress_win  1       354  61917 2921.5
## <none>                           61563 2924.4
## - critics_score     1       712  62275 2925.1
## - runtime           1       964  62527 2927.5
## - imdb_rating       1     54679 116242 3311.4
## 
## Step:  AIC=2919.7
## audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
##     thtr_rel_year + imdb_rating + imdb_num_votes + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - feature_film      1       188  61927 2915.2
## - drama             1       214  61953 2915.4
## - best_pic_nom      1       262  62002 2915.9
## - imdb_num_votes    1       300  62039 2916.3
## - mpaa_rating_R     1       319  62058 2916.5
## - thtr_rel_year     1       342  62082 2916.7
## - best_actress_win  1       353  62092 2916.8
## <none>                           61739 2919.7
## - critics_score     1       813  62553 2921.4
## - runtime           1      1004  62743 2923.3
## - imdb_rating       1     54557 116296 3305.2
## 
## Step:  AIC=2915.16
## audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year + 
##     imdb_rating + imdb_num_votes + critics_score + best_pic_nom + 
##     best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - drama             1       124  62052 2910.0
## - imdb_num_votes    1       196  62124 2910.7
## - best_pic_nom      1       248  62176 2911.2
## - thtr_rel_year     1       251  62178 2911.2
## - best_actress_win  1       364  62291 2912.4
## - mpaa_rating_R     1       411  62339 2912.8
## <none>                           61927 2915.2
## - critics_score     1       982  62909 2918.5
## - runtime           1      1005  62932 2918.7
## - imdb_rating       1     58584 120511 3320.8
## 
## Step:  AIC=2909.97
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     imdb_num_votes + critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - imdb_num_votes    1       167  62219 2905.2
## - thtr_rel_year     1       244  62296 2906.0
## - best_pic_nom      1       266  62317 2906.2
## - best_actress_win  1       330  62381 2906.8
## - mpaa_rating_R     1       351  62403 2907.0
## <none>                           62052 2910.0
## - runtime           1       910  62961 2912.5
## - critics_score     1      1021  63073 2913.6
## - imdb_rating       1     58890 120941 3316.6
## 
## Step:  AIC=2905.21
## audience_score ~ runtime + mpaa_rating_R + thtr_rel_year + imdb_rating + 
##     critics_score + best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - thtr_rel_year     1       169  62388 2900.5
## - best_actress_win  1       320  62539 2902.0
## - mpaa_rating_R     1       330  62549 2902.1
## - best_pic_nom      1       395  62614 2902.7
## <none>                           62219 2905.2
## - runtime           1       777  62996 2906.5
## - critics_score     1       979  63198 2908.4
## - imdb_rating       1     63047 125266 3331.9
## 
## Step:  AIC=2900.46
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom + best_actress_win
## 
##                    Df Sum of Sq    RSS    AIC
## - best_actress_win  1       313  62701 2897.1
## - mpaa_rating_R     1       333  62720 2897.3
## - best_pic_nom      1       400  62787 2898.0
## <none>                           62388 2900.5
## - runtime           1       713  63101 2901.1
## - critics_score     1      1066  63454 2904.5
## - imdb_rating       1     62879 125267 3325.5
## 
## Step:  AIC=2897.13
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score + 
##     best_pic_nom
## 
##                 Df Sum of Sq    RSS    AIC
## - best_pic_nom   1       303  63004 2893.7
## - mpaa_rating_R  1       323  63024 2893.9
## <none>                        62701 2897.1
## - runtime        1       884  63585 2899.4
## - critics_score  1      1061  63762 2901.1
## - imdb_rating    1     62968 125669 3321.1
## 
## Step:  AIC=2893.69
## audience_score ~ runtime + mpaa_rating_R + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## - mpaa_rating_R  1       332  63336 2890.5
## <none>                        63004 2893.7
## - runtime        1       704  63708 2894.1
## - critics_score  1      1135  64139 2898.3
## - imdb_rating    1     63737 126741 3319.9
## 
## Step:  AIC=2890.51
## audience_score ~ runtime + imdb_rating + critics_score
## 
##                 Df Sum of Sq    RSS    AIC
## <none>                        63336 2890.5
## - runtime        1       690  64025 2890.8
## - critics_score  1      1177  64513 2895.5
## - imdb_rating    1     63413 126749 3313.5

The final model will use the following variables:

audience_score ~ runtime + imdb_rating + critics_score

BIC.lm <- lm(audience_score ~ runtime + imdb_rating + critics_score, data=movies)
BIC.lm$coefficients
##   (Intercept)       runtime   imdb_rating critics_score 
##  -32.90824894   -0.05790520   14.95043323    0.07530741
summary(BIC.lm)$sigma
## [1] 10.14815
plot(na.omit(movies$audience_score) ~ BIC.lm$residuals)

Taking a look at the residuals:

ggplot(data=BIC.lm, aes(x=BIC.lm$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see that the residuals are normally distributed.

Creating a model using Bayesian averaging

as_full.bas <- bas.lm(audience_score ~ .,
       prior ="BIC",
       modelprior = uniform(),
       data = na.omit(movies))

as_full.bas
## 
## Call:
## bas.lm(formula = audience_score ~ ., data = na.omit(movies), 
##     prior = "BIC", modelprior = uniform())
## 
## 
##  Marginal Posterior Inclusion Probabilities: 
##           Intercept      feature_filmyes             dramayes              runtime  
##             1.00000              0.05876              0.04509              0.51400  
##    mpaa_rating_Ryes        thtr_rel_year      oscar_seasonyes     summer_seasonyes  
##             0.16498              0.08089              0.06526              0.07935  
##         imdb_rating       imdb_num_votes        critics_score      best_pic_nomyes  
##             1.00000              0.06242              0.92016              0.13201  
##     best_pic_winyes    best_actor_winyes  best_actress_winyes      best_dir_winyes  
##             0.04077              0.11565              0.14770              0.06701  
##       top200_boxyes  
##             0.04876

According to this model, there is a 100% chance that imdb_rating will be included in the final model. Other noteworthy variables are runtime (~47%), critics_score (~89%). The variable with the nearest score to these is mpaa_rating_R:yes at ~20%.

confint(coef(as_full.bas))
##                              2.5%        97.5%          beta
## Intercept            6.143478e+01 6.301352e+01  6.221002e+01
## feature_filmyes     -6.649503e-01 1.249856e-01 -9.162435e-02
## dramayes             0.000000e+00 0.000000e+00  1.956484e-02
## runtime             -8.989580e-02 0.000000e+00 -3.042190e-02
## mpaa_rating_Ryes    -1.964520e+00 0.000000e+00 -2.409288e-01
## thtr_rel_year       -4.634386e-02 0.000000e+00 -3.878883e-03
## oscar_seasonyes     -7.105776e-01 1.738174e-02 -6.293171e-02
## summer_seasonyes    -2.130373e-04 1.097149e+00  8.660919e-02
## imdb_rating          1.351437e+01 1.645399e+01  1.490680e+01
## imdb_num_votes      -6.167253e-08 2.423759e-06  2.466254e-07
## critics_score        0.000000e+00 1.111559e-01  6.951964e-02
## best_pic_nomyes      0.000000e+00 4.846975e+00  5.130600e-01
## best_pic_winyes      0.000000e+00 0.000000e+00 -6.383140e-03
## best_actor_winyes   -2.259661e+00 1.732811e-02 -2.123363e-01
## best_actress_winyes -3.099880e+00 0.000000e+00 -3.323225e-01
## best_dir_winyes     -1.553895e+00 4.523009e-02 -1.188402e-01
## top200_boxyes        0.000000e+00 0.000000e+00  8.972019e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
summary(as_full.bas)
##                     P(B != 0 | Y)    model 1       model 2       model 3       model 4
## Intercept              1.00000000     1.0000     1.0000000     1.0000000     1.0000000
## feature_filmyes        0.05876309     0.0000     0.0000000     0.0000000     0.0000000
## dramayes               0.04508592     0.0000     0.0000000     0.0000000     0.0000000
## runtime                0.51399873     1.0000     0.0000000     0.0000000     1.0000000
## mpaa_rating_Ryes       0.16498090     0.0000     0.0000000     0.0000000     1.0000000
## thtr_rel_year          0.08088668     0.0000     0.0000000     0.0000000     0.0000000
## oscar_seasonyes        0.06525993     0.0000     0.0000000     0.0000000     0.0000000
## summer_seasonyes       0.07935408     0.0000     0.0000000     0.0000000     0.0000000
## imdb_rating            1.00000000     1.0000     1.0000000     1.0000000     1.0000000
## imdb_num_votes         0.06242230     0.0000     0.0000000     0.0000000     0.0000000
## critics_score          0.92015573     1.0000     1.0000000     1.0000000     1.0000000
## best_pic_nomyes        0.13200908     0.0000     0.0000000     0.0000000     0.0000000
## best_pic_winyes        0.04076727     0.0000     0.0000000     0.0000000     0.0000000
## best_actor_winyes      0.11565473     0.0000     0.0000000     0.0000000     0.0000000
## best_actress_winyes    0.14770126     0.0000     0.0000000     1.0000000     0.0000000
## best_dir_winyes        0.06701259     0.0000     0.0000000     0.0000000     0.0000000
## top200_boxyes          0.04875994     0.0000     0.0000000     0.0000000     0.0000000
## BF                             NA     1.0000     0.8715404     0.2048238     0.2039916
## PostProbs                      NA     0.1558     0.1358000     0.0319000     0.0318000
## R2                             NA     0.7483     0.7455000     0.7470000     0.7496000
## dim                            NA     4.0000     3.0000000     4.0000000     5.0000000
## logmarg                        NA -3434.7520 -3434.8894481 -3436.3375603 -3436.3416317
##                           model 5
## Intercept               1.0000000
## feature_filmyes         0.0000000
## dramayes                0.0000000
## runtime                 1.0000000
## mpaa_rating_Ryes        0.0000000
## thtr_rel_year           0.0000000
## oscar_seasonyes         0.0000000
## summer_seasonyes        0.0000000
## imdb_rating             1.0000000
## imdb_num_votes          0.0000000
## critics_score           1.0000000
## best_pic_nomyes         1.0000000
## best_pic_winyes         0.0000000
## best_actor_winyes       0.0000000
## best_actress_winyes     0.0000000
## best_dir_winyes         0.0000000
## top200_boxyes           0.0000000
## BF                      0.1851908
## PostProbs               0.0289000
## R2                      0.7495000
## dim                     5.0000000
## logmarg             -3436.4383237

The best model chosen contains the variables runtime, imdb_rating, and critics_score. Notice that this is the same model created by the backwards stepwise BIC method above.

Below, we can visualize the goodness of each of the models analyzed using the bas.lm function. The best model (rank 1) shows on the left, with the colored squares representing variables that would be selected for that particular model.

image(as_full.bas, rotate = F )

qqnorm(BIC.lm$residuals, col="steelblue")
qqline(BIC.lm$residuals)

We see a normal distribution here.

Now let’s plot the residuals against the fitted values.

plot(BIC.lm$residuals ~ BIC.lm$fitted, col="steelblue")
abline(h=0, lty=2)

We see some left-skewness here, but the data is generally scattered around 0.

Now let’s plot the absolute value of the residuals against the fitted values.

plot(abs(BIC.lm$residuals) ~ BIC.lm$fitted, col="steelblue")

We do not see a fan shape, meeting the necessary condition.


Part 5: Prediction

The movie I’ve chosen is Captain America: Civil War. The information for the prediction comes from:

IMDB and Rotten Tomatoes.

I’ll create the data frames containing information on Captain America: Civil War information.

captain_america <- data.frame(imdb_rating = 8.1, runtime = 147, critics_score = 91, mpaa_rating_R="no", thtr_rel_year=2016, best_pic_nom="no",best_actor_win="no", best_actress_win="no")

I will run predictions using both the BIC and AIC models, to contrast them. Note that the set of variables the BIC model uses is a subset of the variables the AIC model uses.

predict(BIC.lm, newdata = captain_america, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 86.53117 66.50016 106.5622

The BIC model predicts a score of 86.53.

predict(AIC.lm, newdata = captain_america, interval = "prediction", level = 0.95)
##       fit      lwr      upr
## 1 86.5377 66.51882 106.5566

The AIC model predicts a score of 86.54.

As the true score was 89, both the BIC and AIC models were very accurate, with the BIC model being marginally more precise than the AIC model.


Part 6: Conclusion

The model created using the stepAIC tuned toward BIC was the same model found to be ideal by bas.lm.