Setup

Part 1: Data

The data set comprises 651 randomly sampled movies produced and released before 2016. Because the movies were randomly sampled, the results of any analysis generalize to movies produced and released before 2016. Since there were no experimental groups or random assignment, analysis of this data set cannot establish causality, only relationships and correlations.


Part 2: Data manipulation

##  Documentary Feature Film     TV Movie 
##           55          591            5
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   feature_film count
##   <chr>        <int>
## 1 no              60
## 2 yes            591
##        Action & Adventure                 Animation Art House & International 
##                        65                         9                        14 
##                    Comedy               Documentary                     Drama 
##                        87                        52                       305 
##                    Horror Musical & Performing Arts        Mystery & Suspense 
##                        23                        12                        59 
##                     Other Science Fiction & Fantasy 
##                        16                         9
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   drama count
##   <chr> <int>
## 1 no      346
## 2 yes     305
##       G   NC-17      PG   PG-13       R Unrated 
##      19       2     118     133     329      50
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   mpaa_rating_R count
##   <chr>         <int>
## 1 no              322
## 2 yes             329
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00    6.74   10.00   12.00
## # A tibble: 10 x 2
##    oscar_season thtr_rel_month
##    <chr>                 <dbl>
##  1 no                        4
##  2 no                        3
##  3 no                        8
##  4 yes                      10
##  5 no                        9
##  6 no                        1
##  7 no                        1
##  8 yes                      11
##  9 no                        9
## 10 no                        3
## # A tibble: 10 x 2
##    summer_season thtr_rel_month
##    <chr>                  <dbl>
##  1 no                         4
##  2 no                         3
##  3 yes                        8
##  4 no                        10
##  5 no                         9
##  6 no                         1
##  7 no                         1
##  8 no                        11
##  9 no                         9
## 10 no                         3
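
The output above comes from five derived yes/no variables. As a minimal sketch of how such variables could be built, the example below uses base R on a small made-up data frame. The input column names title_type, genre and mpaa_rating are assumptions (only thtr_rel_month and the derived names appear in this report), and the four rows are invented for illustration.

```r
# Hypothetical toy data; the real analysis uses the full movies data set.
movies_toy <- data.frame(
  title_type     = c("Feature Film", "Documentary", "TV Movie", "Feature Film"),
  genre          = c("Drama", "Documentary", "Comedy", "Horror"),
  mpaa_rating    = c("R", "Unrated", "PG", "R"),
  thtr_rel_month = c(10, 3, 8, 5),
  stringsAsFactors = FALSE
)

# Derive the five indicator variables as "yes"/"no" labels.
movies_toy <- transform(
  movies_toy,
  feature_film  = ifelse(title_type == "Feature Film", "yes", "no"),
  drama         = ifelse(genre == "Drama", "yes", "no"),
  mpaa_rating_R = ifelse(mpaa_rating == "R", "yes", "no"),
  oscar_season  = ifelse(thtr_rel_month %in% 10:12, "yes", "no"),
  summer_season = ifelse(thtr_rel_month %in% 5:8, "yes", "no")
)

# Count how many movies fall in each level, as in the tables above.
table(movies_toy$feature_film)
```

The month ranges match the definitions used in Part 3: months 10 to 12 for the Oscar season and months 5 to 8 for the summer season.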

Part 3: Exploratory data analysis

Conduct exploratory data analysis of the relationship between audience_score and the new variables constructed in the previous part.

The audience scores for feature films tend to be lower than those of documentaries and TV movies. When TV movies are combined with documentaries, feature films still have lower audience scores.

Compared to the other genres, dramas score somewhere in the middle range: three genres have a higher median, and the rest generally score below dramas. When the genres are combined using the drama variable, dramas have a slightly higher median audience score than all the other genres combined.

When separated by rating, the movies generally fall into a similar range, except for Unrated movies, which score a bit higher but also have some low-scoring outliers. The R-rated movies have a distribution of audience scores that is very similar to that of PG-rated movies. When all the movies other than R are combined using the mpaa_rating_R variable, the two distributions end up being very similar.

Movies released during the Oscar Season (months 10, 11, and 12) have median audience scores that are only slightly higher than the rest.

Movies released in the summer season (months 5, 6, 7, and 8) have about the same median audience scores as other movies.
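
The comparisons above are all comparisons of medians and spreads between the two levels of a yes/no variable. A minimal sketch of that kind of check in base R follows; the scores are invented for illustration and are not taken from the movies data.

```r
# Hypothetical toy scores, three movies per group.
scores <- data.frame(
  audience_score = c(55, 62, 70, 81, 88, 90),
  feature_film   = c("yes", "yes", "yes", "no", "no", "no")
)

# Median audience score within each level of the grouping variable.
medians <- tapply(scores$audience_score, scores$feature_film, median)
medians

# Side-by-side boxplots show the medians together with spread and outliers.
boxplot(audience_score ~ feature_film, data = scores,
        xlab = "feature_film", ylab = "audience_score")
```

The same two calls can be repeated with drama, mpaa_rating_R, oscar_season, or summer_season in place of feature_film.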

Part 4: Modeling

Develop a Bayesian regression model to predict audience_score using the following explanatory variables: feature_film, drama, runtime, mpaa_rating_R, thtr_rel_year, oscar_season, summer_season, imdb_rating, imdb_num_votes, critics_score, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box. For the regression model we will start with some exploration of audience_score since it will be the response variable in the model.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

The median of the distribution is 65, and 25% of these randomly sampled movies scored at least 80 points. The distribution is left-skewed: the mean (62.36) lies below the median (65), so in this data set more movies have audience scores above the mean than below it.
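
As a rough illustration of the skewness argument, the sketch below compares the mean and median of a small invented vector and computes the share of values above the mean. A mean below the median is only a rule-of-thumb indicator of left skew, and the numbers here are made up rather than drawn from the movies data.

```r
# Hypothetical scores with a long lower tail.
x <- c(11, 35, 46, 60, 65, 70, 80, 85, 97)

summary(x)

# Rule of thumb: a mean below the median suggests a left-skewed distribution.
mean_below_median <- mean(x) < median(x)

# Proportion of observations that lie above the mean.
share_above_mean <- mean(x > mean(x))

mean_below_median
share_above_mean
```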

## # A tibble: 17 x 5
##    term                    estimate   std.error statistic  p.value
##    <chr>                      <dbl>       <dbl>     <dbl>    <dbl>
##  1 (Intercept)         124.         77.5            1.61  1.09e- 1
##  2 feature_filmyes      -2.25        1.69          -1.33  1.83e- 1
##  3 dramayes              1.29        0.877          1.47  1.41e- 1
##  4 runtime              -0.0561      0.0242        -2.32  2.04e- 2
##  5 mpaa_rating_Ryes     -1.44        0.813         -1.78  7.60e- 2
##  6 thtr_rel_year        -0.0766      0.0383        -2.00  4.63e- 2
##  7 oscar_seasonyes      -0.533       0.997         -0.535 5.93e- 1
##  8 summer_seasonyes      0.911       0.949          0.959 3.38e- 1
##  9 imdb_rating          14.7         0.607         24.3   2.03e-92
## 10 imdb_num_votes        0.00000723  0.00000452     1.60  1.10e- 1
## 11 critics_score         0.0575      0.0222         2.59  9.73e- 3
## 12 best_pic_nomyes       5.32        2.63           2.02  4.33e- 2
## 13 best_pic_winyes      -3.21        4.61          -0.697 4.86e- 1
## 14 best_actor_winyes    -1.54        1.18          -1.31  1.91e- 1
## 15 best_actress_winyes  -2.20        1.30          -1.69  9.23e- 2
## 16 best_dir_winyes      -1.23        1.73          -0.713 4.76e- 1
## 17 top200_boxyes         0.848       2.78           0.305 7.61e- 1

As you can see from a quick summary of the full linear model, many coefficients of the explanatory variables are not statistically significant. We will use the Bayesian Information Criterion (BIC) as our criterion for model selection. BIC rewards model fit while penalizing the number of parameters, with a penalty that grows with the logarithm of the sample size.
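
To make the criterion concrete, here is a small sketch that recomputes BIC by hand as -2 * log-likelihood + k * log(n) and compares it with R's BIC(). It uses the built-in mtcars data purely as a stand-in, since the movies data is not reproduced here; k counts the regression coefficients plus the residual variance.

```r
# Stand-in model on a built-in data set (not the movies model).
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
k  <- attr(ll, "df")   # number of estimated parameters, including the error variance
n  <- nobs(fit)        # number of observations used in the fit

# BIC = -2 * log-likelihood + k * log(n)
bic_manual  <- -2 * as.numeric(ll) + k * log(n)
bic_builtin <- BIC(fit)

c(manual = bic_manual, builtin = bic_builtin)
```

Lower values are better, so dropping a variable is worthwhile whenever the loss in fit is smaller than the log(n) penalty saved.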

## [1] 4934.145

Now we will remove one variable at a time and see which removals cause the BIC to decrease.

#Step 1 of model selection
m_audscore_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + thtr_rel_year
                       + oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + thtr_rel_year + 
                        oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                        critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                        best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + thtr_rel_year + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + thtr_rel_year + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         oscar_season + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + summer_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + imdb_rating + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + summer_season + imdb_num_votes + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                         thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                         critics_score + best_pic_nom + best_pic_win + best_actor_win + 
                         best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_actor_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actress_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win + top200_box, data = movies)
m_audscore_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + top200_box, data = movies)
m_audscore_wo_v16 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
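# BIC() is evaluated for every candidate, m_audscore_wo_v1 through m_audscore_wo_v16;
# the 16 values below appear in that order.
BIC(m_audscore_wo_v1)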
## [1] 4929.489
## [1] 4929.897
## [1] 4940.193
## [1] 4930.904
## [1] 4931.75
## [1] 4927.962
## [1] 4928.613
## [1] 5354.924
## [1] 4930.291
## [1] 4934.538
## [1] 4931.865
## [1] 4928.167
## [1] 4929.428
## [1] 4930.581
## [1] 4928.19
## [1] 4927.764

The BIC of the model without the 16th variable, top200_box, is the lowest (4927.764, down from 4934.145 for the full model), so we’ll continue to step with that model.
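
Writing out every reduced model by hand is easy to get wrong, so one elimination step like the one above can also be sketched as a loop. The helper name bic_without_each is hypothetical, and the example again uses the built-in mtcars data as a stand-in for movies.

```r
# Hypothetical helper: refit the model without each term in turn and return the BICs.
bic_without_each <- function(fit) {
  terms_in_model <- attr(terms(fit), "term.labels")
  sapply(terms_in_model, function(term) {
    reduced <- update(fit, as.formula(paste(". ~ . -", term)))
    BIC(reduced)
  })
}

# Stand-in full model (not the movies model).
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

candidates <- bic_without_each(full)
candidates

# The term whose removal gives the lowest BIC is the one to drop in this step,
# provided that BIC is below the BIC of the current model.
names(which.min(candidates))
BIC(full)
```

Each entry of candidates plays the same role as one of the m_audscore_wo_v* models, named by the variable that was left out.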

#Eliminated `top200_box` in step 1. Step 2 of selection:
m_audscore1_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                           critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                           best_actress_win + best_dir_win, data = movies)
m_audscore1_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win, data = movies)
m_audscore1_wo_v15 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + oscar_season + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win, data = movies)
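# BIC() is evaluated for every candidate, m_audscore1_wo_v1 through m_audscore1_wo_v15;
# the 15 values below appear in that order.
BIC(m_audscore1_wo_v1)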
## [1] 4923.097
## [1] 4923.488
## [1] 4933.787
## [1] 4924.684
## [1] 4925.556
## [1] 4921.56
## [1] 4922.261
## [1] 5348.763
## [1] 4924.399
## [1] 4928.271
## [1] 4925.442
## [1] 4921.787
## [1] 4923.032
## [1] 4924.164
## [1] 4921.824

From a BIC of 4927.764 we can bring the BIC down to 4921.56 by removing oscar_season.

#Removed `oscar_season` in step 2. Step 3 of selection:
m_audscore2_wo_v1 <- lm(audience_score ~  drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime +  
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_win + 
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                         best_actress_win + best_dir_win, data = movies)
m_audscore2_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_dir_win, data = movies)
m_audscore2_wo_v14 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom + best_pic_win + 
                          best_actor_win + best_actress_win, data = movies)
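# BIC() is evaluated for every candidate, m_audscore2_wo_v1 through m_audscore2_wo_v14;
# the 14 values below appear in that order.
BIC(m_audscore2_wo_v1)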
## [1] 4916.952
## [1] 4917.37
## [1] 4928.204
## [1] 4918.466
## [1] 4919.302
## [1] 4916.902
## [1] 5342.412
## [1] 4918.194
## [1] 4922.047
## [1] 4919.06
## [1] 4915.554
## [1] 4916.879
## [1] 4917.978
## [1] 4915.657

From a BIC of 4921.56, the BIC was brought down to 4915.554 by removing best_pic_win.

#Removed `best_pic_win` in step 3. Step 4 of selection:
m_audscore3_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season +  
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win + best_dir_win, data = movies)
m_audscore3_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_dir_win, data = movies)
m_audscore3_wo_v13 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
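# BIC() is evaluated for every candidate, m_audscore3_wo_v1 through m_audscore3_wo_v13;
# the 13 values below appear in that order.
BIC(m_audscore3_wo_v1)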
## [1] 4910.835
## [1] 4911.341
## [1] 4922.185
## [1] 4912.47
## [1] 4913.16
## [1] 4910.867
## [1] 5337.54
## [1] 4911.865
## [1] 4916.05
## [1] 4912.597
## [1] 4910.753
## [1] 4912.104
## [1] 4910.045

The biggest decrease in BIC was found when removing best_dir_win, which brought the BIC from 4915.554 down to 4910.045.

#Removed `best_dir_win` in step 4. Step 5 of selection:
m_audscore4_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v3 <- lm(audience_score ~ feature_film + drama + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season +
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score +   
                          best_actor_win + best_actress_win, data = movies)
m_audscore4_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore4_wo_v12 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year + summer_season + imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
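# BIC() is evaluated for every candidate, m_audscore4_wo_v1 through m_audscore4_wo_v12;
# the 12 values below appear in that order.
BIC(m_audscore4_wo_v1)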
## [1] 4905.453
## [1] 4905.915
## [1] 4917.564
## [1] 4907.106
## [1] 4907.34
## [1] 4905.283
## [1] 5331.766
## [1] 4906.101
## [1] 4910.234
## [1] 4906.911
## [1] 4905.325
## [1] 4906.64

The BIC was made lowest (4905.283) when summer_season was removed from the model.
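
The repeated drop-one-variable steps amount to backward elimination under BIC. As a cross-check, the same procedure can be sketched with step() from base R by setting the penalty k to log(n). This swaps in step() for the manual procedure used in this report, and it runs on the built-in mtcars data as a stand-in. step() reports its criterion on a scale that differs from BIC() by a constant, so the ranking of models is the same even though the printed numbers differ.

```r
# Stand-in full model (not the movies model).
full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
n <- nobs(full)

# Backward elimination; k = log(n) turns the AIC-style penalty into the BIC penalty.
selected <- step(full, direction = "backward", k = log(n), trace = 0)

formula(selected)
c(full = BIC(full), selected = BIC(selected))
```

Setting trace = 1 prints each elimination step, which can be compared line by line against the manual steps.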

#Removed `summer_season` in step 5. Step 6 of selection:
m_audscore5_wo_v1 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v2 <- lm(audience_score ~ feature_film + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v3 <- lm(audience_score ~ feature_film + drama +  mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v4 <- lm(audience_score ~ feature_film + drama + runtime + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v5 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                           imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v6 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v7 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v8 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes +  best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v9 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore5_wo_v10 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore5_wo_v11 <- lm(audience_score ~ feature_film + drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore5_wo_v1)
## [1] 4900.403
## [1] 4900.901
## [1] 4912.993
## [1] 4902.447
## [1] 4902.497
## [1] 5325.472
## [1] 4901.47
## [1] 4906.272
## [1] 4901.849
## [1] 4900.807
## [1] 4901.85

Removing feature_film gives the lowest BIC (4900.403), so it is dropped from the model.

#Removed `feature_film` in step 6. Step 7 of selection:
m_audscore6_wo_v1 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v2 <- lm(audience_score ~ drama + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v3 <- lm(audience_score ~ drama + runtime + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v4 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v5 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+  
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v6 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v7 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v8 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + 
                          best_actor_win + best_actress_win, data = movies)
m_audscore6_wo_v9 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                           best_actress_win, data = movies)
m_audscore6_wo_v10 <- lm(audience_score ~ drama + runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore6_wo_v1)
## [1] 4895.167
## [1] 4908.164
## [1] 4898.551
## [1] 4896.755
## [1] 5342.94
## [1] 4895.693
## [1] 4902.823
## [1] 4896.895
## [1] 4896.159
## [1] 4897.052

Removing drama gives the lowest BIC (4895.167), so it is dropped from the model.

#Removed `drama` in step 7. Step 8 of selection:
m_audscore7_wo_v1 <- lm(audience_score ~ mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v2 <- lm(audience_score ~ runtime +
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v3 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v4 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v5 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                           critics_score + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v6 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + best_pic_nom +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v7 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score +  
                          best_actor_win + best_actress_win, data = movies)
m_audscore7_wo_v8 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actress_win, data = movies)
m_audscore7_wo_v9 <- lm(audience_score ~ runtime + mpaa_rating_R + 
                          thtr_rel_year+ imdb_rating + 
                          imdb_num_votes + critics_score + best_pic_nom +  
                          best_actor_win, data = movies)
BIC(m_audscore7_wo_v1)
## [1] 4902.088
## [1] 4892.677
## [1] 4891.464
## [1] 5338.34
## [1] 4890.199
## [1] 4897.984
## [1] 4891.817
## [1] 4890.824
## [1] 4891.487

Removing imdb_num_votes gives the lowest BIC (4890.199), so it is dropped from the model.

## [1] 4896.011
## [1] 4887.453
## [1] 4885.766
## [1] 5352.361
## [1] 4892.618
## [1] 4888.214
## [1] 4885.954
## [1] 4886.425

Removing thtr_rel_year gives the lowest BIC (4885.766), so it is dropped from the model.

## [1] 4891.111
## [1] 4883.072
## [1] 5345.964
## [1] 4889.055
## [1] 4883.82
## [1] 4881.39
## [1] 4881.941

Removing best_actor_win gives the lowest BIC (4881.39), so it is dropped from the model.

## [1] 4888.433
## [1] 4878.608
## [1] 5341.127
## [1] 4884.644
## [1] 4878.911
## [1] 4877.909

Removing best_actress_win gives the lowest BIC (4877.909), so it is dropped from the model.

## [1] 4886.586
## [1] 4874.994
## [1] 5336.746
## [1] 4881.009
## [1] 4874.484

Removing best_pic_nom gives the lowest BIC (4874.484), so it is dropped from the model.

## [1] 4881.401
## [1] 4871.623
## [1] 5335.434
## [1] 4878.238

Removing mpaa_rating_R gives the lowest BIC (4871.623), so it is dropped from the model.

## [1] 4878.542
## [1] 5329.265
## [1] 4875.773

Removing any of the three remaining variables raises the BIC above that of the current model (4871.623), so the final model includes runtime, imdb_rating, and critics_score.
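
The whole backward search can also be run with base R's `step()`, where setting the penalty to `k = log(n)` makes it rank models by BIC rather than AIC. This is a sketch on simulated data (`toy` and the coefficients used to generate it are assumptions), not a rerun of the analysis above.

```r
# Simulated stand-in for the movies data (assumption, for illustration only)
set.seed(2)
n <- 200
toy <- data.frame(
  imdb_rating   = runif(n, 2, 9),
  critics_score = runif(n, 0, 100),
  runtime       = rnorm(n, 105, 20),
  noise1        = rnorm(n),           # predictors unrelated to the response
  noise2        = rnorm(n)
)
toy$audience_score <- -30 + 15 * toy$imdb_rating + 0.07 * toy$critics_score -
  0.05 * toy$runtime + rnorm(n, sd = 10)

# Start from the model with every candidate predictor
full <- lm(audience_score ~ ., data = toy)

# Backward elimination; k = log(n) turns the AIC penalty into the BIC penalty
selected <- step(full, direction = "backward", k = log(nrow(toy)), trace = 0)

print(formula(selected))
print(c(full = BIC(full), selected = BIC(selected)))
```

The criterion values `step()` reports when `trace` is switched on differ from `BIC()` by an additive constant for a fixed number of observations, so the ordering of candidate models is the same even though the printed numbers are not.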

## [1] 4871.623
## 
## Call:
## lm(formula = audience_score ~ runtime + imdb_rating + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.998  -6.565   0.557   5.475  52.448 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -33.28321    3.21939 -10.338  < 2e-16 ***
## runtime        -0.05362    0.02107  -2.545  0.01117 *  
## imdb_rating    14.98076    0.57735  25.947  < 2e-16 ***
## critics_score   0.07036    0.02156   3.263  0.00116 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.04 on 646 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7549, Adjusted R-squared:  0.7538 
## F-statistic: 663.3 on 3 and 646 DF,  p-value: < 2.2e-16

The coefficients of these predictor variables indicate the following, in each case holding the other two predictors constant:
- for every 1-minute increase in runtime, we can expect the audience score to decrease by about 0.05 points on average
- for every 1-point increase in imdb_rating, we can expect the audience score to increase by about 15 points on average
- for every 1-point increase in critics_score, we can expect the audience score to increase by about 0.07 points on average
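
To make the interpretation concrete, the fitted equation can be evaluated by hand. The sketch below copies the coefficient estimates from the summary output; `predict_score()` and the example movie (120 minutes, IMDB rating 7.0, critics score 80) are hypothetical and only serve to show how each coefficient acts when the others are held fixed.

```r
# Coefficient estimates copied from the model summary
coefs <- c(intercept = -33.28321, runtime = -0.05362,
           imdb_rating = 14.98076, critics_score = 0.07036)

# Hypothetical helper: evaluate the fitted linear equation for one movie
predict_score <- function(runtime, imdb_rating, critics_score) {
  unname(coefs["intercept"] +
           coefs["runtime"] * runtime +
           coefs["imdb_rating"] * imdb_rating +
           coefs["critics_score"] * critics_score)
}

# Hypothetical movie and two one-unit variations of it
base        <- predict_score(120, 7.0, 80)
longer      <- predict_score(121, 7.0, 80)   # one extra minute
higher_imdb <- predict_score(120, 8.0, 80)   # one extra IMDB point

print(base)
print(longer - base)        # equals the runtime coefficient
print(higher_imdb - base)   # equals the imdb_rating coefficient
```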

Now we will run some model diagnostics on the variables we’ve deemed to be decent predictors.
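
The diagnostic plots are not reproduced in this text. As a rough sketch of how such plots might be drawn in base R, the example below fits a model to simulated data whose noise deliberately grows with the predictor, so that the residuals-versus-fitted panel fans out. The data and variable names are assumptions; with the real analysis, the fitted final model would take the place of `fit`.

```r
# Simulated data with non-constant variance (assumption, for illustration only)
set.seed(3)
n <- 300
x <- runif(n, 1, 10)
y <- 5 + 3 * x + rnorm(n, sd = x)   # noise spread grows with x
fit <- lm(y ~ x)

op <- par(mfrow = c(1, 3))

# Residuals vs fitted: a widening band indicates non-constant variance
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Histogram and normal Q-Q plot: checks for roughly normal residuals
hist(resid(fit), xlab = "Residuals", main = "Histogram of residuals")
qqnorm(resid(fit))
qqline(resid(fit))

par(op)
```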

The residuals plot shows a fan shape, which suggests that the variance of the residuals is not constant and that the model may not be accounting for all the relationships between the variables. The residuals do, however, seem mostly normally distributed.

* * *

Part 5: Prediction

## # A tibble: 0 x 37
## # … with 37 variables: title <chr>, title_type <fct>, genre <fct>,
## #   runtime <dbl>, mpaa_rating <fct>, studio <fct>, thtr_rel_year <dbl>,
## #   thtr_rel_month <dbl>, thtr_rel_day <dbl>, dvd_rel_year <dbl>,
## #   dvd_rel_month <dbl>, dvd_rel_day <dbl>, imdb_rating <dbl>,
## #   imdb_num_votes <int>, critics_rating <fct>, critics_score <dbl>,
## #   audience_rating <fct>, audience_score <dbl>, best_pic_nom <fct>,
## #   best_pic_win <fct>, best_actor_win <fct>, best_actress_win <fct>,
## #   best_dir_win <fct>, top200_box <fct>, director <chr>, actor1 <chr>,
## #   actor2 <chr>, actor3 <chr>, actor4 <chr>, actor5 <chr>, imdb_url <chr>,
## #   rt_url <chr>, feature_film <chr>, drama <chr>, mpaa_rating_R <chr>,
## #   oscar_season <chr>, summer_season <chr>
##        1 
## 79.35946
##        fit      lwr      upr
## 1 79.35946 59.60028 99.11864
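
The three-column output (`fit`, `lwr`, `upr`) has the shape that `predict()` returns when an `interval` argument is supplied. The sketch below, on simulated data with a hypothetical new movie, shows the call and contrasts a prediction interval for a single movie with the much narrower confidence interval for the mean score; all names and values in it are assumptions.

```r
# Simulated stand-in for the movies data (assumption, for illustration only)
set.seed(4)
n <- 300
toy <- data.frame(
  runtime       = rnorm(n, 105, 20),
  imdb_rating   = runif(n, 2, 9),
  critics_score = runif(n, 0, 100)
)
toy$audience_score <- -30 - 0.05 * toy$runtime + 15 * toy$imdb_rating +
  0.07 * toy$critics_score + rnorm(n, sd = 10)

fit <- lm(audience_score ~ runtime + imdb_rating + critics_score, data = toy)

# Hypothetical movie to predict for
new_movie <- data.frame(runtime = 110, imdb_rating = 7.5, critics_score = 85)

# Interval for one new movie's score (includes residual variation)
pred_int <- predict(fit, newdata = new_movie,
                    interval = "prediction", level = 0.95)

# Interval for the average score of all movies with these values
conf_int <- predict(fit, newdata = new_movie,
                    interval = "confidence", level = 0.95)

print(pred_int)
print(conf_int)
```

Both calls return the same point estimate; only the interval widths differ, because the prediction interval adds the residual spread of individual movies to the uncertainty in the fitted mean.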

The actual audience score on Rotten Tomatoes is 88.

* * *

Part 6: Conclusion

Using the variables given to us and the ones generated from the data, we found that the best model for predicting a movie’s audience score on Rotten Tomatoes depends mainly on three variables: the movie’s runtime, the IMDB rating, and the critics score on Rotten Tomatoes. Of the variables explored in the EDA section, the ones that didn’t show much difference visually were all understandably eliminated from the model. This model was selected by backward elimination, choosing at each step the model with the lowest BIC. In the above prediction, the model isn’t perfect but is in the ballpark: the 95% prediction interval is quite wide, placing the score anywhere between about 60 and 99, while the actual score is 88. One shortcoming of the predictor variables that we chose from was that many had multiple levels, which could make prediction difficult.