Bayesian modeling of movie data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(BAS)
library(broom)

Load data

load("movies.Rdata")

Part 1: Data

The data set comprises 651 randomly sampled movies produced and released before 2016. Because the movies were randomly sampled, the results can be generalized to movies produced and released before 2016. Because this is an observational study with no experimental groups or random assignment, the analysis cannot establish causality, only associations.

Part 2: Data manipulation

#Create new variable based on `title_type`: New variable should be called `feature_film` with levels yes (movies that are feature films) and no
summary(movies$title_type)

##  Documentary Feature Film     TV Movie 
##           55          591            5

movies <- movies %>% mutate(feature_film = ifelse(as.character(title_type) == "Feature Film", "yes", "no"))
movies %>% group_by(feature_film) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   feature_film count
##   <chr>        <int>
## 1 no              60
## 2 yes            591

#Create new variable based on `genre`: New variable should be called `drama` with levels yes (movies that are dramas) and no
summary(movies$genre)

##        Action & Adventure                 Animation Art House & International 
##                        65                         9                        14 
##                    Comedy               Documentary                     Drama 
##                        87                        52                       305 
##                    Horror Musical & Performing Arts        Mystery & Suspense 
##                        23                        12                        59 
##                     Other Science Fiction & Fantasy 
##                        16                         9

movies <- movies %>% mutate(drama = ifelse(as.character(genre) == "Drama", "yes", "no"))
movies %>% group_by(drama) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   drama count
##   <chr> <int>
## 1 no      346
## 2 yes     305

#Create new variable based on `mpaa_rating`: New variable should be called `mpaa_rating_R` with levels yes (movies that are R rated) and no
summary(movies$mpaa_rating)

##       G   NC-17      PG   PG-13       R Unrated 
##      19       2     118     133     329      50

movies <- movies %>% mutate(mpaa_rating_R = ifelse(as.character(mpaa_rating) == "R", "yes", "no"))
movies %>% group_by(mpaa_rating_R) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   mpaa_rating_R count
##   <chr>         <int>
## 1 no              322
## 2 yes             329

#Create two new variables based on `thtr_rel_month`: `oscar_season` with levels yes (if movie is released in October, November, or December) and no; `summer_season` with levels yes (if movie is released in May, June, July, or August) and no
summary(movies$thtr_rel_month)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00    6.74   10.00   12.00

oscar_month <- c(11, 10, 12)
movies <- movies %>% mutate(oscar_season = ifelse(thtr_rel_month %in% oscar_month, "yes", "no"))
movies %>% select(oscar_season, thtr_rel_month) %>% head(n = 10)

## # A tibble: 10 x 2
##    oscar_season thtr_rel_month
##    <chr>                 <dbl>
##  1 no                        4
##  2 no                        3
##  3 no                        8
##  4 yes                      10
##  5 no                        9
##  6 no                        1
##  7 no                        1
##  8 yes                      11
##  9 no                        9
## 10 no                        3

summer_months <- c(5,6,7,8)
movies <- movies %>% mutate(summer_season = ifelse(thtr_rel_month %in% summer_months, "yes", "no"))
movies %>% select(summer_season, thtr_rel_month) %>% head(n = 10)

## # A tibble: 10 x 2
##    summer_season thtr_rel_month
##    <chr>                  <dbl>
##  1 no                         4
##  2 no                         3
##  3 yes                        8
##  4 no                        10
##  5 no                         9
##  6 no                         1
##  7 no                         1
##  8 no                        11
##  9 no                         9
## 10 no                         3

Part 3: Exploratory data analysis

Conduct exploratory data analysis of the relationship between audience_score and the new variables constructed in the previous part

#Audience score vs feature film and title type
ggplot(data = movies, aes(x = title_type, y = audience_score, fill = feature_film)) + geom_boxplot()

ggplot(data = movies, aes(x = feature_film, y = audience_score, fill = feature_film)) + geom_boxplot()

Feature films typically have lower audience scores than documentaries and TV movies. Even when TV movies and documentaries are combined into a single group, feature films still score lower.
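A boxplot comparison like this one can be backed up with a per-group median. The following is a minimal sketch on made-up scores; the `toy` data frame is an assumption for illustration and is not the `movies` data.

```r
# Toy data standing in for `movies`; the scores below are invented.
toy <- data.frame(
  feature_film   = c("yes", "yes", "yes", "no", "no", "no"),
  audience_score = c(40, 60, 70, 80, 85, 90)
)

# Median audience score within each level of the indicator variable
medians <- tapply(toy$audience_score, toy$feature_film, median)
print(medians)
```

On the real data, the same summary could be obtained in the style used above with `movies %>% group_by(feature_film) %>% summarise(median_score = median(audience_score))`.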

#Audience score vs dramas and genres
ggplot(data = movies, aes(x = genre, y = audience_score, fill = drama)) + geom_boxplot()

ggplot(data = movies, aes(x = drama, y = audience_score, fill = drama)) + geom_boxplot()

Compared with the other genres, dramas score somewhere in the middle: three genres have a higher median, and the rest generally score below dramas. When the genres are combined using the drama variable, dramas have a slightly higher median audience score than all the other genres combined.

#Audience score vs MPAA rating and specifically R-rated movies
ggplot(data = movies, aes(x = mpaa_rating, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()

ggplot(data = movies, aes(x = mpaa_rating_R, y = audience_score, fill = mpaa_rating_R)) + geom_boxplot()

When separated by rating, the movies generally fall into a similar range, except for Unrated movies, which score a bit higher but also have some low-scoring outliers. The R-rated movies have a distribution of audience scores that is very similar to that of PG-rated movies. When all non-R movies are combined using the mpaa_rating_R variable, the two distributions end up being very similar.

#Audience scores vs Oscar seasons
ggplot(data = movies, aes(x = as.character(thtr_rel_month), y = audience_score, fill = oscar_season)) + geom_boxplot() + xlab("Theater Release Month")

ggplot(data = movies, aes(x = oscar_season, y = audience_score, fill = oscar_season)) + geom_boxplot()

Movies released during the Oscar Season (months 10, 11, and 12) have median audience scores that are only slightly higher than the rest.

#Audience scores vs summer seasons
ggplot(data = movies, aes(x = as.character(thtr_rel_month), y = audience_score, fill = summer_season)) + geom_boxplot() + xlab("Theater Release Month")

ggplot(data = movies, aes(x = summer_season, y = audience_score, fill = summer_season)) + geom_boxplot()

Movies released in the summer season (months 5, 6, 7, and 8) have about the same median audience scores as other movies.

Part 4: Modeling

Develop a Bayesian regression model to predict audience_score using the following explanatory variables: feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box. For the regression model we will start with some exploration of audience_score since it will be the response variable in the model.

ggplot(data = movies, aes(x = audience_score)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(movies$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

The median of the distribution is 65 and the third quartile is 80, so 25% of these randomly sampled movies scored at least 80 points. The distribution is left-skewed: the mean (62.36) falls below the median, which means that in this data set more movies have audience scores above the mean than below it.
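The link between left skew, a mean below the median, and more than half of the observations lying above the mean can be checked directly. This is a small sketch on hypothetical scores, not the movie data.

```r
# Hypothetical left-skewed scores (a long tail of low values)
scores <- c(15, 40, 60, 65, 70, 75, 80, 85, 90)

mean_score   <- mean(scores)    # pulled down by the low tail
median_score <- median(scores)

# Proportion of observations that lie above the mean
share_above_mean <- mean(scores > mean_score)

print(c(mean = mean_score, median = median_score, share_above = share_above_mean))
```

With the real data, replacing `scores` with `movies$audience_score` would give the corresponding share.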

m_audscore_full <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + thtr_rel_year + oscar_season + summer_season + imdb_rating + imdb_num_votes + critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win + top200_box, data = movies)
tidy(m_audscore_full)

## # A tibble: 17 x 5
##    term                    estimate   std.error statistic  p.value
##    <chr>                      <dbl>       <dbl>     <dbl>    <dbl>
##  1 (Intercept)         124.         77.5            1.61  1.09e- 1
##  2 feature_filmyes      -2.25        1.69          -1.33  1.83e- 1
##  3 dramayes              1.29        0.877          1.47  1.41e- 1
##  4 runtime              -0.0561      0.0242        -2.32  2.04e- 2
##  5 mpaa_rating_Ryes     -1.44        0.813         -1.78  7.60e- 2
##  6 thtr_rel_year        -0.0766      0.0383        -2.00  4.63e- 2
##  7 oscar_seasonyes      -0.533       0.997         -0.535 5.93e- 1
##  8 summer_seasonyes      0.911       0.949          0.959 3.38e- 1
##  9 imdb_rating          14.7         0.607         24.3   2.03e-92
## 10 imdb_num_votes        0.00000723  0.00000452     1.60  1.10e- 1
## 11 critics_score         0.0575      0.0222         2.59  9.73e- 3
## 12 best_pic_nomyes       5.32        2.63           2.02  4.33e- 2
## 13 best_pic_winyes      -3.21        4.61          -0.697 4.86e- 1
## 14 best_actor_winyes    -1.54        1.18          -1.31  1.91e- 1
## 15 best_actress_winyes  -2.20        1.30          -1.69  9.23e- 2
## 16 best_dir_winyes      -1.23        1.73          -0.713 4.76e- 1
## 17 top200_boxyes         0.848       2.78           0.305 7.61e- 1

As the summary of the full linear model shows, many of the coefficients are not statistically significant at the 5% level. We will use the Bayesian Information Criterion (BIC) as our criterion for model selection. BIC rewards model fit while penalizing each additional parameter by the log of the sample size.
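For reference, BIC is defined as -2 log(L) + k log(n), where L is the maximized likelihood, k is the number of estimated parameters, and n is the number of observations. The sketch below reproduces the value returned by `BIC()` by hand; it uses the built-in `mtcars` data as a stand-in, so the model and variables are illustrative only.

```r
# Illustrative model on a built-in data set (not the movie data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # estimated parameters: coefficients plus residual variance
n  <- nobs(fit)        # number of observations used in the fit

bic_by_hand <- -2 * as.numeric(ll) + k * log(n)

# Should agree with R's built-in BIC()
print(all.equal(bic_by_hand, BIC(fit)))
```

Because the penalty term is k log(n), dropping a predictor lowers BIC only when the loss in fit is smaller than log(n) on the -2 log-likelihood scale.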

BIC(m_audscore_full)

## [1] 4934.145

Now we will remove one variable at a time from the full model and see which removals cause the BIC to decrease.

#Step 1 of model selection
m_audscore_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year
                       + oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + thtr_rel_year + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + thtr_rel_year + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + thtr_rel_year + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + summer_season + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + top200_box, data = movies)
m_audscore_wo_v16 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
BIC(m_audscore_wo_v1)

## [1] 4929.489

BIC(m_audscore_wo_v2)

## [1] 4929.897

BIC(m_audscore_wo_v3)

## [1] 4940.193

BIC(m_audscore_wo_v4)

## [1] 4930.904

BIC(m_audscore_wo_v5)

## [1] 4931.75

BIC(m_audscore_wo_v6)

## [1] 4927.962

BIC(m_audscore_wo_v7)

## [1] 4928.613

BIC(m_audscore_wo_v8)

## [1] 5354.924

BIC(m_audscore_wo_v9)

## [1] 4930.291

BIC(m_audscore_wo_v10)

## [1] 4934.538

BIC(m_audscore_wo_v11)

## [1] 4931.865

BIC(m_audscore_wo_v12)

## [1] 4928.167

BIC(m_audscore_wo_v13)

## [1] 4929.428

BIC(m_audscore_wo_v14)

## [1] 4930.581

BIC(m_audscore_wo_v15)

## [1] 4928.19

BIC(m_audscore_wo_v16)

## [1] 4927.764

The BIC of the model without the 16th variable, top200_box, is the lowest (4927.764, down from 4934.145 for the full model), so we'll continue stepping from that model.
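A single elimination step of this kind can also be written as a loop rather than one `lm()` call per dropped variable. The following is a sketch under assumptions: `bic_drop_one` is a hypothetical helper, and the built-in `mtcars` data stands in for the movie data.

```r
# One backward-elimination step: refit the model with each term dropped in turn
# and return the BIC of every reduced model, named by the dropped term.
# `bic_drop_one` is a hypothetical helper, not part of any package.
bic_drop_one <- function(fit) {
  terms_in <- attr(terms(fit), "term.labels")
  vapply(terms_in, function(term) {
    reduced <- update(fit, as.formula(paste(". ~ . -", term)))
    BIC(reduced)
  }, numeric(1))
}

# Illustrative full model on a built-in data set
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
step_bic <- bic_drop_one(full)

print(BIC(full))        # BIC before dropping anything
print(sort(step_bic))   # the smallest value identifies the best term to drop
```

Base R's `drop1(full, k = log(nobs(full)))` performs a similar comparison; its criterion column differs from `BIC()` by an additive constant, so the ranking of candidate terms is the same.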

#Eliminated `top200_box` in step 1. Step 2 of selection:
m_audscore1_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                           critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                           best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win, data = movies)
m_audscore1_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win, data = movies)
BIC(m_audscore1_wo_v1)

## [1] 4923.097

BIC(m_audscore1_wo_v2)

## [1] 4923.488

BIC(m_audscore1_wo_v3)

## [1] 4933.787

BIC(m_audscore1_wo_v4)

## [1] 4924.684

BIC(m_audscore1_wo_v5)

## [1] 4925.556

BIC(m_audscore1_wo_v6)

## [1] 4921.56

BIC(m_audscore1_wo_v7)

## [1] 4922.261

BIC(m_audscore1_wo_v8)

## [1] 5348.763

BIC(m_audscore1_wo_v9)

## [1] 4924.399

BIC(m_audscore1_wo_v10)

## [1] 4928.271

BIC(m_audscore1_wo_v11)

## [1] 4925.442

BIC(m_audscore1_wo_v12)

## [1] 4921.787

BIC(m_audscore1_wo_v13)

## [1] 4923.032

BIC(m_audscore1_wo_v14)

## [1] 4924.164

BIC(m_audscore1_wo_v15)

## [1] 4921.824

From a BIC of 4927.764 we can bring the BIC down to 4921.56 by removing oscar_season.
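One way to read a BIC difference, under the usual large-sample approximation (an assumption added here, not something the analysis above relies on), is as an approximate Bayes factor of exp(ΔBIC / 2) in favor of the model with the lower BIC. A minimal sketch using the two BIC values reported above:

```r
bic_before <- 4927.764  # model without top200_box
bic_after  <- 4921.56   # model without top200_box and oscar_season

delta_bic <- bic_before - bic_after

# Approximate Bayes factor in favor of the smaller model
approx_bf <- exp(delta_bic / 2)
print(approx_bf)
```

Here the approximation gives a factor of roughly 22 in favor of the model without oscar_season.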

#Removed `oscar_season` in step 2. Step 3 of selection:
m_audscore2_wo_v1 <- lm(audience_score ~  drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +  
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                         best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win, data = movies)
m_audscore2_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win, data = movies)
BIC(m_audscore2_wo_v1)

## [1] 4916.952

BIC(m_audscore2_wo_v2)

## [1] 4917.37

BIC(m_audscore2_wo_v3)

## [1] 4928.204

BIC(m_audscore2_wo_v4)

## [1] 4918.466

BIC(m_audscore2_wo_v5)

## [1] 4919.302

BIC(m_audscore2_wo_v6)

## [1] 4916.902

BIC(m_audscore2_wo_v7)

## [1] 5342.412

BIC(m_audscore2_wo_v8)

## [1] 4918.194

BIC(m_audscore2_wo_v9)

## [1] 4922.047

BIC(m_audscore2_wo_v10)

## [1] 4919.06

BIC(m_audscore2_wo_v11)

## [1] 4915.554

BIC(m_audscore2_wo_v12)

## [1] 4916.879

BIC(m_audscore2_wo_v13)

## [1] 4917.978

BIC(m_audscore2_wo_v14)

## [1] 4915.657

From a BIC of 4921.56, the BIC was brought down to 4915.554 by removing best_pic_win.

#Removed `best_pic_win` in step 3. Step 4 of selection:
m_audscore3_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season +  
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_dir_win, data = movies)
m_audscore3_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
BIC(m_audscore3_wo_v1)

## [1] 4910.835

BIC(m_audscore3_wo_v2)

## [1] 4911.341

BIC(m_audscore3_wo_v3)

## [1] 4922.185

BIC(m_audscore3_wo_v4)

## [1] 4912.47

BIC(m_audscore3_wo_v5)

## [1] 4913.16

BIC(m_audscore3_wo_v6)

## [1] 4910.867

BIC(m_audscore3_wo_v7)

## [1] 5337.54

BIC(m_audscore3_wo_v8)

## [1] 4911.865

BIC(m_audscore3_wo_v9)

## [1] 4916.05

BIC(m_audscore3_wo_v10)

## [1] 4912.597

BIC(m_audscore3_wo_v11)

## [1] 4910.753

BIC(m_audscore3_wo_v12)

## [1] 4912.104

BIC(m_audscore3_wo_v13)

## [1] 4910.045

The largest decrease in BIC came from removing best_dir_win, which brought the BIC down to 4910.045.
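
Each elimination step above writes out one `lm()` call per candidate predictor by hand. As a rough sketch of how a single step could be generated programmatically, the helper below drops each term in turn and reports the BIC of the reduced model. The name `bic_drop_one` and its arguments are hypothetical and not part of the original analysis, and the sketch assumes the formula contains main effects only.

```r
# Hypothetical helper (not from the original analysis): refit the model once
# per predictor with that predictor removed, and return the BIC of each
# reduced model alongside the BIC of the current model.
bic_drop_one <- function(current_formula, data) {
  predictors <- attr(terms(current_formula), "term.labels")

  reduced_bic <- vapply(predictors, function(term) {
    reduced_formula <- update(current_formula,
                              as.formula(paste(". ~ . -", term)))
    BIC(lm(reduced_formula, data = data))
  }, numeric(1))

  result <- data.frame(
    dropped = c("<none>", predictors),
    BIC = c(BIC(lm(current_formula, data = data)), unname(reduced_bic)),
    stringsAsFactors = FALSE
  )
  # Smallest BIC first: the top row is the best candidate for this step.
  result[order(result$BIC), ]
}
```

Called with a step's current formula and `data = movies`, this should list the same values as the individual `BIC()` calls. If the `<none>` row sorts to the top, no single removal lowers the BIC.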

#Removed `best_dir_win` in step 4. Step 5 of selection:
m_audscore4_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season +
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score +   
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore4_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore4_wo_v1)

## [1] 4905.453

BIC(m_audscore4_wo_v2)

## [1] 4905.915

BIC(m_audscore4_wo_v3)

## [1] 4917.564

BIC(m_audscore4_wo_v4)

## [1] 4907.106

BIC(m_audscore4_wo_v5)

## [1] 4907.34

BIC(m_audscore4_wo_v6)

## [1] 4905.283

BIC(m_audscore4_wo_v7)

## [1] 5331.766

BIC(m_audscore4_wo_v8)

## [1] 4906.101

BIC(m_audscore4_wo_v9)

## [1] 4910.234

BIC(m_audscore4_wo_v10)

## [1] 4906.911

BIC(m_audscore4_wo_v11)

## [1] 4905.325

BIC(m_audscore4_wo_v12)

## [1] 4906.64

The BIC was made lowest (4905.283) when summer_season was removed from the model.

#Removed `summer_season` in step 5. Step 6 of selection:
m_audscore5_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v3 <- lm(audience_score ~ feature_film + drama +  mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes +  best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore5_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore5_wo_v1)

## [1] 4900.403

BIC(m_audscore5_wo_v2)

## [1] 4900.901

BIC(m_audscore5_wo_v3)

## [1] 4912.993

BIC(m_audscore5_wo_v4)

## [1] 4902.447

BIC(m_audscore5_wo_v5)

## [1] 4902.497

BIC(m_audscore5_wo_v6)

## [1] 5325.472

BIC(m_audscore5_wo_v7)

## [1] 4901.47

BIC(m_audscore5_wo_v8)

## [1] 4906.272

BIC(m_audscore5_wo_v9)

## [1] 4901.849

BIC(m_audscore5_wo_v10)

## [1] 4900.807

BIC(m_audscore5_wo_v11)

## [1] 4901.85

The BIC was made lowest (4900.403) when feature_film was removed from the model.

#Removed `feature_film` in step 6. Step 7 of selection:
m_audscore6_wo_v1 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v2 <- lm(audience_score ~ drama + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v3 <- lm(audience_score ~ drama + runtime + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v4 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v5 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+  
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v6 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v7 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v8 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + 
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v9 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore6_wo_v10 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore6_wo_v1)

## [1] 4895.167

BIC(m_audscore6_wo_v2)

## [1] 4908.164

BIC(m_audscore6_wo_v3)

## [1] 4898.551

BIC(m_audscore6_wo_v4)

## [1] 4896.755

BIC(m_audscore6_wo_v5)

## [1] 5342.94

BIC(m_audscore6_wo_v6)

## [1] 4895.693

BIC(m_audscore6_wo_v7)

## [1] 4902.823

BIC(m_audscore6_wo_v8)

## [1] 4896.895

BIC(m_audscore6_wo_v9)

## [1] 4896.159

BIC(m_audscore6_wo_v10)

## [1] 4897.052

The BIC is made lowest (4895.167) when drama is removed from the model.

#removed `drama` in step 7. Step 8 of selection:
m_audscore7_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v2 <- lm(audience_score ~ runtime +
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v8 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore7_wo_v9 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore7_wo_v1)

## [1] 4902.088

BIC(m_audscore7_wo_v2)

## [1] 4892.677

BIC(m_audscore7_wo_v3)

## [1] 4891.464

BIC(m_audscore7_wo_v4)

## [1] 5338.34

BIC(m_audscore7_wo_v5)

## [1] 4890.199

BIC(m_audscore7_wo_v6)

## [1] 4897.984

BIC(m_audscore7_wo_v7)

## [1] 4891.817

BIC(m_audscore7_wo_v8)

## [1] 4890.824

BIC(m_audscore7_wo_v9)

## [1] 4891.487

The BIC is made lowest (4890.199) when imdb_num_votes is removed from the model.

#removed `imdb_num_votes` in step 8. Step 9:
m_audscore8_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v2 <- lm(audience_score ~ runtime + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+  
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore8_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore8_wo_v8 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore8_wo_v1)

## [1] 4896.011

BIC(m_audscore8_wo_v2)

## [1] 4887.453

BIC(m_audscore8_wo_v3)

## [1] 4885.766

BIC(m_audscore8_wo_v4)

## [1] 5352.361

BIC(m_audscore8_wo_v5)

## [1] 4892.618

BIC(m_audscore8_wo_v6)

## [1] 4888.214

BIC(m_audscore8_wo_v7)

## [1] 4885.954

BIC(m_audscore8_wo_v8)

## [1] 4886.425

The BIC is made lowest (4885.766) when thtr_rel_year is removed from the model.

#removed `thtr_rel_year` in step 9. Step 10:
m_audscore9_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v2 <- lm(audience_score ~ runtime +  
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                          best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore9_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore9_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore9_wo_v1)

## [1] 4891.111

BIC(m_audscore9_wo_v2)

## [1] 4883.072

BIC(m_audscore9_wo_v3)

## [1] 5345.964

BIC(m_audscore9_wo_v4)

## [1] 4889.055

BIC(m_audscore9_wo_v5)

## [1] 4883.82

BIC(m_audscore9_wo_v6)

## [1] 4881.39

BIC(m_audscore9_wo_v7)

## [1] 4881.941

The BIC is made lowest (4881.39) when best_actor_win is removed from the model.

#removed `best_actor_win` in step 10. Step 11 of selection:
m_audscore10_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore10_wo_v2 <- lm(audience_score ~ runtime + 
                          imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore10_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore10_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore10_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score +  
                          best_actress_win, data = movies)
m_audscore10_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom, data = movies)
BIC(m_audscore10_wo_v1)

## [1] 4888.433

BIC(m_audscore10_wo_v2)

## [1] 4878.608

BIC(m_audscore10_wo_v3)

## [1] 5341.127

BIC(m_audscore10_wo_v4)

## [1] 4884.644

BIC(m_audscore10_wo_v5)

## [1] 4878.911

BIC(m_audscore10_wo_v6)

## [1] 4877.909

The BIC is made lowest (4877.909) when best_actress_win is removed from the model.

#removed `best_actress_win` in step 11. Step 12 of selection:
m_audscore11_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          imdb_rating + 
                           critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v2 <- lm(audience_score ~ runtime + 
                          imdb_rating + 
                           critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          critics_score + best_pic_nom, data = movies)
m_audscore11_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           best_pic_nom, data = movies)
m_audscore11_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                           critics_score, data = movies)
BIC(m_audscore11_wo_v1)

## [1] 4886.586

BIC(m_audscore11_wo_v2)

## [1] 4874.994

BIC(m_audscore11_wo_v3)

## [1] 5336.746

BIC(m_audscore11_wo_v4)

## [1] 4881.009

BIC(m_audscore11_wo_v5)

## [1] 4874.484

The BIC is made lowest (4874.484) when best_pic_nom is removed from the model.

#removed `best_pic_nom` in step 12. Step 13 of selection:
m_audscore12_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          imdb_rating + 
                           critics_score, data = movies)
m_audscore12_wo_v2 <- lm(audience_score ~ runtime +  
                          imdb_rating + 
                           critics_score, data = movies)
m_audscore12_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          critics_score, data = movies)
m_audscore12_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating, data = movies)
BIC(m_audscore12_wo_v1)

## [1] 4881.401

BIC(m_audscore12_wo_v2)

## [1] 4871.623

BIC(m_audscore12_wo_v3)

## [1] 5335.434

BIC(m_audscore12_wo_v4)

## [1] 4878.238

The BIC is made lowest (4871.623) when mpaa_rating_R is removed from the model.

#removed `mpaa_rating_R` in step 13. Step 14:
m_audscore13_wo_v1 <- lm(audience_score ~ imdb_rating + 
                           critics_score, data = movies)
m_audscore13_wo_v2 <- lm(audience_score ~ runtime +  
                          critics_score, data = movies)
m_audscore13_wo_v3 <- lm(audience_score ~ runtime +  
                          imdb_rating, data = movies)
BIC(m_audscore13_wo_v1)

## [1] 4878.542

BIC(m_audscore13_wo_v2)

## [1] 5329.265

BIC(m_audscore13_wo_v3)

## [1] 4875.773

None of these removals lowers the BIC below its current value of 4871.623, so the selection stops here. The final model includes runtime, imdb_rating, and critics_score.
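
The manual search above can be cross-checked with base R's `step()`, which performs backward elimination on an AIC-type criterion and applies the BIC penalty when `k = log(n)`. The wrapper below is only a sketch: `backward_bic` is a hypothetical name, and restricting the data to complete cases is an assumption made here because `step()` requires every candidate model to use the same rows. The `lm()` calls in this report instead drop missing rows model by model, so the two approaches could differ slightly.

```r
# Hypothetical wrapper (not from the original analysis) around stats::step().
backward_bic <- function(full_formula, data) {
  # Assumption: keep only rows that are complete in every model variable,
  # so that all candidate models are fit to the same observations.
  model_data <- na.omit(data[, all.vars(full_formula), drop = FALSE])
  full_model <- lm(full_formula, data = model_data)

  # k = log(n) turns step()'s AIC-type criterion into a BIC-type criterion.
  step(full_model, direction = "backward",
       k = log(nrow(model_data)), trace = 0)
}
```

The returned object is an ordinary `lm` fit, so `formula()` and `BIC()` can be applied to it to compare against the hand-selected model.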

#final model
m_audscore_final <- lm(audience_score ~ runtime + imdb_rating + critics_score, data = movies)
BIC(m_audscore_final)

## [1] 4871.623

summary(m_audscore_final)

## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.998  -6.565   0.557   5.475  52.448 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.28321    3.21939 -10.338  < 2e-16 ***
## runtime        -0.05362    0.02107  -2.545  0.01117 *  
## imdb_rating    14.98076    0.57735  25.947  < 2e-16 ***
## critics_score   0.07036    0.02156   3.263  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.04 on 646 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7549, Adjusted R-squared:  0.7538 
## F-statistic: 663.3 on 3 and 646 DF,  p-value: < 2.2e-16

The coefficients of these predictor variables indicate the following, in each case holding the other predictors constant:

- for every additional minute of runtime, the audience score is expected to decrease by about 0.05 points
- for every one-point increase in imdb_rating, the audience score is expected to increase by about 15 points on average
- for every one-point increase in critics_score, the audience score is expected to increase by about 0.07 points
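
The point estimates above say nothing about their uncertainty. A minimal sketch of how interval estimates could be placed next to them with base R's `confint()` follows. The helper name `coef_with_ci` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: coefficient estimates with confidence bounds.
coef_with_ci <- function(model, level = 0.95) {
  ci <- confint(model, level = level)  # matrix with one row per coefficient
  data.frame(
    term = rownames(ci),
    estimate = unname(coef(model)[rownames(ci)]),
    lower = ci[, 1],
    upper = ci[, 2],
    row.names = NULL
  )
}
```

Applied to `m_audscore_final`, this would return one row each for the intercept, runtime, imdb_rating, and critics_score.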

Now we will run some model diagnostics on the model built from the variables we’ve deemed to be decent predictors.

m_audscore_final_aug <- augment(m_audscore_final)

#Linearity and constant variance
ggplot(data = m_audscore_final_aug, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals")

#Normality
ggplot(data = m_audscore_final_aug, aes(x = .resid)) +
  geom_histogram(binwidth = 5) +
  xlab("Residuals")
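
The residuals-versus-fitted plot above is judged by eye. As a rough numerical complement, the sketch below regresses the squared residuals on the fitted values and forms the n * R^2 statistic, which is Koenker's studentized form of the Breusch-Pagan test with the fitted values as the only auxiliary regressor. The function name `bp_fitted_check` is hypothetical, and this is an illustrative simplification rather than a procedure used in the original analysis.

```r
# Hypothetical helper: Breusch-Pagan-style check for non-constant variance.
bp_fitted_check <- function(model) {
  sq_resid <- resid(model)^2
  fitted_values <- fitted(model)

  # Auxiliary regression: do the squared residuals vary with the fitted values?
  aux <- lm(sq_resid ~ fitted_values)

  # n * R^2 is compared against a chi-squared distribution with 1 df,
  # one degree of freedom per auxiliary regressor.
  statistic <- length(sq_resid) * summary(aux)$r.squared
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  c(statistic = statistic, p_value = p_value)
}
```

A small p-value from `bp_fitted_check(m_audscore_final)` would be consistent with residual spread that changes across the fitted values.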

The residuals plot shows a fan shape, which suggests that the variance of the residuals is not constant across the fitted values, so the constant-variance condition may not be met. The residuals do appear to be roughly normally distributed.

* * *

Part 5: Prediction

#Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using the model you developed and the `predict` function in R.
movies %>% filter(title == 'Train to Busan')

## # A tibble: 0 x 37
## # … with 37 variables: title <chr>, title_type <fct>, genre <fct>,
## #   runtime <dbl>, mpaa_rating <fct>, studio <fct>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fct>, critics_score <dbl>,
## #   audience_rating <fct>, audience_score <dbl>, best_pic_nom <fct>,
## #   best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## #   best_dir_win <fct>, top200_box <fct>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## #   rt_url <chr>, feature_film <chr>, drama <chr>, mpaa_rating_R <chr>,
## #   oscar_season <chr>, summer_season <chr>

#The movie isn't in the data set, so we can use it for a prediction
#Data for this movie came from the IMDb and Rotten Tomatoes websites
busan <- data.frame(runtime = 118, imdb_rating = 7.5, critics_score = 94)
predict(m_audscore_final, busan)

##        1 
## 79.35946

predict(m_audscore_final, busan, interval = "prediction", level = .95)

##        fit      lwr      upr
## 1 79.35946 59.60028 99.11864
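
To make explicit what `predict()` computes for the point estimate, the sketch below reproduces it by hand from the coefficients printed in the model summary. Because those coefficients are rounded to five decimal places, the result agrees with the `predict()` output only to about three decimal places.

```r
# Coefficients copied from the summary() output of m_audscore_final.
coefs <- c(intercept = -33.28321, runtime = -0.05362,
           imdb_rating = 14.98076, critics_score = 0.07036)

# Predictor values used for Train to Busan, with a leading 1 for the intercept.
busan_values <- c(intercept = 1, runtime = 118,
                  imdb_rating = 7.5, critics_score = 94)

# Linear predictor: intercept plus each coefficient times its predictor value.
manual_prediction <- sum(coefs * busan_values)
round(manual_prediction, 2)
## [1] 79.36
```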

The actual audience score on Rotten Tomatoes is 88, which falls inside the 95% prediction interval.

* * *

Part 6: Conclusion

Using the variables given to us and the ones generated from the data, we found that the model with the lowest BIC for predicting a movie’s audience score on Rotten Tomatoes uses three variables: the movie’s runtime, its IMDB rating, and its critics score on Rotten Tomatoes. The variables explored in the EDA section that showed little visual difference were all eliminated during selection. In the prediction above, the model is not perfect but is in the right ballpark: the point prediction is about 79 while the actual score is 88, and the 95% prediction interval is quite wide, running from roughly 60 to 99. One shortcoming of the candidate predictors is that many had multiple levels, which could make prediction difficult.