Setup

Load data

Make sure your data and R Markdown files are in the same directory. When loaded, your data file will be called movies. Delete this note before you submit your work.


Part 1: Data

The dataset comprises 651 randomly sampled movies produced and released in America prior to 2016. The data is observational and includes general information about each movie, as well as ratings from two sources (IMDB and Rotten Tomatoes). Because the movies were randomly sampled, conclusions from the analysis should be generalizable to American movies made prior to 2016. Because we also want to extrapolate our findings to movies made in 2016 or later, we assume that there aren’t any major changes to the movie industry between 2015 and 2016 (for example, Rotten Tomatoes changing which critics’ scores are included in their scoring).

Because the data is observational, causality cannot be inferred from the data.


Part 2: Research question

What variables can we use to predict the critics’ score of a movie? Anyone can see from looking at a site like Rotten Tomatoes that professional film critics rate movies differently from audiences. It seems likely that critics’ scores of a movie could be biased due to their connections within the film industry, whereas audiences might be more objective. It also seems likely that critics and audiences consider different variables when rating a movie. Which variables are the most important in predicting the critics’ score?


Part 3: Exploratory data analysis

First, I want to take a look at the distribution of critics’ scores:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

It is left skewed, with a mean of about 58% that sits below the median of 61%. 25% of scores fall at 83 or above. Multiple linear regression requires nearly normal residuals rather than a normally distributed response, but a skewed response is a hint that not all of the assumptions may be met, which we will check later on.
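A quick way to double-check the direction of skew is to compare the mean with the median and to compute a moment-based skewness. This is only a minimal sketch on made-up scores: the `scores` vector and the `skewness` helper are illustrative and not part of the original analysis. With the real data one would pass `movies$critics_score` instead.

```r
# Toy stand-in for movies$critics_score (values are made up for illustration).
scores <- c(5, 20, 45, 60, 70, 80, 85, 90, 95, 100)

# Hypothetical helper: moment-based sample skewness.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

summary(scores)

# A mean below the median hints at a long left tail.
mean_below_median <- mean(scores) < median(scores)

# Negative skewness indicates a left skew, positive a right skew.
skew <- skewness(scores)
```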

Next, I want to see what explanatory variables look promising to include in the model:

##  [1] "title"            "title_type"       "genre"           
##  [4] "runtime"          "mpaa_rating"      "studio"          
##  [7] "thtr_rel_year"    "thtr_rel_month"   "thtr_rel_day"    
## [10] "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"  
## [16] "critics_score"    "audience_rating"  "audience_score"  
## [19] "best_pic_nom"     "best_pic_win"     "best_actor_win"  
## [22] "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"          
## [28] "actor3"           "actor4"           "actor5"          
## [31] "imdb_url"         "rt_url"

I want to check whether the numerical variables have a linear relationship with critics_score because this is an assumption of multiple linear regression:

Based on these graphs, we can safely eliminate theater and DVD release year, and number of IMDB votes (imbalanced), from the list of potential explanatory variables in our model, as they do not have a linear relationship with critics’ score.
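One way such a linearity check could be done is a scatterplot with a smoothed trend line next to the straight-line fit. The sketch below uses simulated data (the `toy` data frame is an assumption, not the movies dataset). The `pdf(NULL)` call only keeps the sketch runnable in a non-interactive session and would be dropped when knitting.

```r
# Simulated stand-in for two columns of the movies data (illustrative only).
set.seed(1)
toy <- data.frame(imdb_rating = runif(100, 2, 9))
toy$critics_score <- 12 * toy$imdb_rating - 20 + rnorm(100, sd = 8)

pdf(NULL)  # draw to a null device; remove this line inside an R Markdown chunk
plot(critics_score ~ imdb_rating, data = toy)
# Smoothed trend: if it stays close to the dashed straight line,
# a linear relationship is plausible.
lines(lowess(toy$imdb_rating, toy$critics_score), col = "red")
abline(lm(critics_score ~ imdb_rating, data = toy), lty = 2)
invisible(dev.off())

# Pearson correlation as a rough numeric companion to the plot.
r <- cor(toy$imdb_rating, toy$critics_score)
```

If the smoothed line bends away from the straight line, the variable is a weaker candidate for a linear term.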

I also want to check the number of levels for each available categorical variable:

## # A tibble: 1 x 32
##   title title_type genre runtime mpaa_rating studio thtr_rel_year
##   <int>      <int> <int>   <int>       <int>  <int>         <int>
## 1   647          3    11      90           6    212            44
## # … with 25 more variables: thtr_rel_month <int>, thtr_rel_day <int>,
## #   dvd_rel_year <int>, dvd_rel_month <int>, dvd_rel_day <int>,
## #   imdb_rating <int>, imdb_num_votes <int>, critics_rating <int>,
## #   critics_score <int>, audience_rating <int>, audience_score <int>,
## #   best_pic_nom <int>, best_pic_win <int>, best_actor_win <int>,
## #   best_actress_win <int>, best_dir_win <int>, top200_box <int>,
## #   director <int>, actor1 <int>, actor2 <int>, actor3 <int>,
## #   actor4 <int>, actor5 <int>, imdb_url <int>, rt_url <int>

Because some of these categorical variables have too many levels, there would be only a small number of data points within each level, which would add noise to the model. Therefore I will remove title, genre, MPAA rating, studio, release months and days, actors, directors, and URLs.
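The count of distinct values per column shown above can be sketched in base R as follows. The `toy` data frame and the cutoff of 3 levels are assumptions made for illustration. The tibble output suggests the original used a tidyverse function, which is not reproduced here.

```r
# Toy stand-in for a few columns of the movies data (values are made up).
toy <- data.frame(
  title_type = c("Feature Film", "Documentary", "TV Movie", "Feature Film"),
  studio     = c("A", "B", "C", "D"),
  runtime    = c(90, 120, 90, 101),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column.
n_levels <- sapply(toy, function(col) length(unique(col)))

# Hypothetical cutoff: keep only columns with at most 3 distinct values.
few_levels <- names(n_levels)[n_levels <= 3]
```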

I will then check the relationship between the remaining categorical variables and critics’ score:

Finally, I will create the dataset with only the variables of interest for the model:


Part 4: Modeling

## Warning in model.matrix.default(mt, mf, contrasts): the response appeared
## on the right-hand side and was dropped
## Warning in model.matrix.default(mt, mf, contrasts): problem with term 3 in
## model.matrix: no columns are assigned
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     audience_rating + best_dir_win + best_actor_win + best_actress_win + 
##     runtime + title_type + audience_score + best_pic_nom + best_pic_win + 
##     top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.280  -7.963  -0.315   7.845  29.152 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.971048   5.034370   5.357 1.18e-07 ***
## imdb_rating              8.402413   0.885966   9.484  < 2e-16 ***
## critics_ratingFresh     -5.620482   1.302385  -4.316 1.85e-05 ***
## critics_ratingRotten   -40.370282   1.477742 -27.319  < 2e-16 ***
## audience_ratingUpright  -0.666895   1.807468  -0.369 0.712276    
## best_dir_winyes          0.891771   1.916787   0.465 0.641917    
## best_actor_winyes        0.357911   1.311466   0.273 0.785012    
## best_actress_winyes      1.271671   1.451627   0.876 0.381344    
## runtime                  0.005861   0.025813   0.227 0.820467    
## title_typeFeature Film  -5.827762   1.746327  -3.337 0.000896 ***
## title_typeTV Movie      -0.087248   5.263967  -0.017 0.986781    
## audience_score           0.028789   0.063556   0.453 0.650724    
## best_pic_nomyes          0.777389   2.904594   0.268 0.789062    
## best_pic_winyes         -1.158386   5.068584  -0.229 0.819298    
## top200_boxyes            0.352934   2.972743   0.119 0.905532    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.1 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8474 
## F-statistic: 258.4 on 14 and 635 DF,  p-value: < 2.2e-16
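The two warnings at the top of this output appear because critics_score is listed on both sides of the formula. R drops it from the right-hand side, so the fitted coefficients are unaffected. A minimal sketch on simulated data (the `toy` data frame is an assumption) shows that the `.` shorthand, which expands to every column except the response, gives the same fit without the warnings:

```r
# Simulated stand-in data (illustrative only).
set.seed(2)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 1 + 2 * toy$x1 + rnorm(50)

# Response repeated on the right-hand side: R warns and drops it.
fit_dup <- suppressWarnings(lm(y ~ x1 + y + x2, data = toy))

# "." stands for all columns other than the response, so no warning is raised.
fit_ok <- lm(y ~ ., data = toy)

same_coefficients <- isTRUE(all.equal(unname(coef(fit_dup)), unname(coef(fit_ok))))
```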

Model Selection

Now that we have the full model, we want to drop any variables that do not contribute meaningfully to the predictive power of the model. I’m going to use backward stepwise selection based on adjusted R-squared rather than p-values, because the p-values of the explanatory variables audience rating, audience score, best picture nominations/wins and top 200 box office status appear to be affected by collinearity.

First, I will try removing each variable from the model, and check if the adjusted R-squared is higher than the full model adjusted R-squared (0.8474):

## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     best_dir_win + best_actress_win + runtime + title_type + 
##     audience_score + best_pic_nom + best_pic_win + top200_box, 
##     data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.437  -7.894  -0.169   7.810  29.186 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.048183   4.956731   5.457 6.95e-08 ***
## imdb_rating              8.480366   0.864501   9.810  < 2e-16 ***
## critics_ratingFresh     -5.570342   1.296153  -4.298 2.00e-05 ***
## critics_ratingRotten   -40.318270   1.470807 -27.412  < 2e-16 ***
## best_dir_winyes          0.946484   1.910468   0.495 0.620475    
## best_actress_winyes      1.287968   1.445598   0.891 0.373288    
## runtime                  0.006896   0.025281   0.273 0.785115    
## title_typeFeature Film  -5.808538   1.741955  -3.334 0.000904 ***
## title_typeTV Movie      -0.199338   5.250858  -0.038 0.969729    
## audience_score           0.011261   0.043962   0.256 0.797910    
## best_pic_nomyes          0.930786   2.875855   0.324 0.746305    
## best_pic_winyes         -1.286867   5.046964  -0.255 0.798822    
## top200_boxyes            0.356089   2.967627   0.120 0.904528    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.08 on 637 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8506, Adjusted R-squared:  0.8478 
## F-statistic: 302.2 on 12 and 637 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     best_actress_win + runtime + title_type + audience_score + 
##     best_pic_nom + best_pic_win + top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.532  -7.835  -0.217   7.780  29.174 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.750905   4.917366   5.440 7.60e-08 ***
## imdb_rating              8.497817   0.863272   9.844  < 2e-16 ***
## critics_ratingFresh     -5.551316   1.294817  -4.287 2.09e-05 ***
## critics_ratingRotten   -40.349837   1.468557 -27.476  < 2e-16 ***
## best_actress_winyes      1.288724   1.444742   0.892    0.373    
## runtime                  0.008966   0.024919   0.360    0.719    
## title_typeFeature Film  -5.733265   1.734290  -3.306    0.001 ** 
## title_typeTV Movie      -0.183504   5.247655  -0.035    0.972    
## audience_score           0.010670   0.043920   0.243    0.808    
## best_pic_nomyes          0.853500   2.869922   0.297    0.766    
## best_pic_winyes         -0.569172   4.831727  -0.118    0.906    
## top200_boxyes            0.335652   2.965585   0.113    0.910    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.08 on 638 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8506, Adjusted R-squared:  0.848 
## F-statistic: 330.1 on 11 and 638 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     best_actress_win + title_type + audience_score + best_pic_nom + 
##     best_pic_win + top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.388  -7.820  -0.256   7.734  29.114 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.09020    4.76430   5.686 1.98e-08 ***
## imdb_rating              8.55987    0.83804  10.214  < 2e-16 ***
## critics_ratingFresh     -5.56899    1.29204  -4.310 1.89e-05 ***
## critics_ratingRotten   -40.33435    1.46467 -27.538  < 2e-16 ***
## best_actress_winyes      1.35965    1.43043   0.951   0.3422    
## title_typeFeature Film  -5.49656    1.68991  -3.253   0.0012 ** 
## title_typeTV Movie       0.03904    5.23025   0.007   0.9940    
## audience_score           0.01008    0.04367   0.231   0.8176    
## best_pic_nomyes          0.97721    2.84116   0.344   0.7310    
## best_pic_winyes         -0.45884    4.81492  -0.095   0.9241    
## top200_boxyes            0.41831    2.95067   0.142   0.8873    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.06 on 640 degrees of freedom
## Multiple R-squared:  0.8506, Adjusted R-squared:  0.8482 
## F-statistic: 364.3 on 10 and 640 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     title_type + audience_score + best_pic_nom + best_pic_win + 
##     top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.550  -7.782  -0.157   7.697  28.976 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.855129   4.757521   5.645 2.49e-08 ***
## imdb_rating              8.633165   0.834422  10.346  < 2e-16 ***
## critics_ratingFresh     -5.620756   1.290797  -4.354 1.55e-05 ***
## critics_ratingRotten   -40.359557   1.464324 -27.562  < 2e-16 ***
## title_typeFeature Film  -5.364643   1.684074  -3.186  0.00152 ** 
## title_typeTV Movie       0.328277   5.220998   0.063  0.94988    
## audience_score           0.006841   0.043535   0.157  0.87519    
## best_pic_nomyes          1.337543   2.815543   0.475  0.63491    
## best_pic_winyes         -0.211216   4.807507  -0.044  0.96497    
## top200_boxyes            0.549274   2.947226   0.186  0.85221    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.06 on 641 degrees of freedom
## Multiple R-squared:  0.8504, Adjusted R-squared:  0.8483 
## F-statistic: 404.8 on 9 and 641 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     title_type + best_pic_nom + best_pic_win + top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.518  -7.741  -0.175   7.614  28.949 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.6700     4.6058   5.791 1.10e-08 ***
## imdb_rating              8.7333     0.5385  16.218  < 2e-16 ***
## critics_ratingFresh     -5.6396     1.2842  -4.391 1.32e-05 ***
## critics_ratingRotten   -40.3971     1.4436 -27.983  < 2e-16 ***
## title_typeFeature Film  -5.3812     1.6795  -3.204  0.00142 ** 
## title_typeTV Movie       0.3237     5.2169   0.062  0.95055    
## best_pic_nomyes          1.3597     2.8099   0.484  0.62862    
## best_pic_winyes         -0.2407     4.8002  -0.050  0.96003    
## top200_boxyes            0.5551     2.9448   0.189  0.85054    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.06 on 642 degrees of freedom
## Multiple R-squared:  0.8504, Adjusted R-squared:  0.8485 
## F-statistic: 456.1 on 8 and 642 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     title_type + best_pic_win + top200_box, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.574  -7.680  -0.183   7.637  28.956 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.4750     4.5854   5.774 1.21e-08 ***
## imdb_rating              8.7651     0.5341  16.410  < 2e-16 ***
## critics_ratingFresh     -5.7146     1.2741  -4.485 8.63e-06 ***
## critics_ratingRotten   -40.4706     1.4348 -28.207  < 2e-16 ***
## title_typeFeature Film  -5.2918     1.6683  -3.172  0.00159 ** 
## title_typeTV Movie       0.4009     5.2114   0.077  0.93871    
## best_pic_winyes          0.7766     4.3128   0.180  0.85715    
## top200_boxyes            0.5717     2.9428   0.194  0.84603    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.05 on 643 degrees of freedom
## Multiple R-squared:  0.8503, Adjusted R-squared:  0.8487 
## F-statistic: 521.8 on 7 and 643 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     title_type + best_pic_win, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.586  -7.707  -0.203   7.852  28.952 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.4677     4.5818   5.777 1.19e-08 ***
## imdb_rating              8.7688     0.5334  16.440  < 2e-16 ***
## critics_ratingFresh     -5.7473     1.2620  -4.554 6.29e-06 ***
## critics_ratingRotten   -40.5025     1.4242 -28.438  < 2e-16 ***
## title_typeFeature Film  -5.2684     1.6627  -3.168   0.0016 ** 
## title_typeTV Movie       0.4180     5.2068   0.080   0.9360    
## best_pic_winyes          0.8127     4.3056   0.189   0.8503    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.04 on 644 degrees of freedom
## Multiple R-squared:  0.8503, Adjusted R-squared:  0.8489 
## F-statistic: 609.7 on 6 and 644 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ imdb_rating + critics_rating + critics_score + 
##     title_type, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.599  -7.731  -0.205   7.843  28.954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             26.4316     4.5744   5.778 1.18e-08 ***
## imdb_rating              8.7768     0.5313  16.518  < 2e-16 ***
## critics_ratingFresh     -5.7854     1.2448  -4.648 4.07e-06 ***
## critics_ratingRotten   -40.5349     1.4128 -28.691  < 2e-16 ***
## title_typeFeature Film  -5.2456     1.6571  -3.166  0.00162 ** 
## title_typeTV Movie       0.4420     5.2013   0.085  0.93230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.03 on 645 degrees of freedom
## Multiple R-squared:  0.8503, Adjusted R-squared:  0.8491 
## F-statistic: 732.7 on 5 and 645 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = critics_score ~ critics_rating + critics_score + 
##     title_type, data = movies_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.3887  -9.3887  -0.4458   9.5542  27.6113 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              96.708      2.003  48.275  < 2e-16 ***
## critics_ratingFresh      -9.800      1.455  -6.735 3.63e-11 ***
## critics_ratingRotten    -53.857      1.383 -38.953  < 2e-16 ***
## title_typeFeature Film  -11.462      1.924  -5.958 4.19e-09 ***
## title_typeTV Movie       -9.085      6.162  -1.474    0.141    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.15 on 646 degrees of freedom
## Multiple R-squared:  0.787,  Adjusted R-squared:  0.7856 
## F-statistic: 596.6 on 4 and 646 DF,  p-value: < 2.2e-16

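Removing imdb_rating lowers the adjusted R-squared from 0.8491 to 0.7856, and there are no other variables that we could remove to give the model a higher adjusted R-squared. Therefore the final model is new_model_9, which keeps imdb_rating, critics_rating, and title_type.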
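The selection procedure above can be sketched as a loop: at each step, refit the model with each remaining variable removed, keep the removal that raises adjusted R-squared the most, and stop when no removal helps. The function name `backward_adj_r2` and the simulated `toy` data are assumptions for illustration, not the code used in this report.

```r
# Hypothetical helper: backward elimination driven by adjusted R-squared.
backward_adj_r2 <- function(data, response, predictors) {
  fit_for <- function(vars) lm(reformulate(vars, response = response), data = data)
  adj_r2  <- function(fit) summary(fit)$adj.r.squared

  current <- predictors
  best    <- adj_r2(fit_for(current))
  repeat {
    if (length(current) <= 1) break
    # Adjusted R-squared after dropping each remaining predictor in turn.
    scores <- vapply(current,
                     function(v) adj_r2(fit_for(setdiff(current, v))),
                     numeric(1))
    if (max(scores) <= best) break  # no removal improves the model
    best    <- max(scores)
    current <- setdiff(current, names(which.max(scores)))
  }
  fit_for(current)
}

# Simulated data: one informative predictor and two pure-noise predictors.
set.seed(42)
toy <- data.frame(signal = rnorm(200), noise_a = rnorm(200), noise_b = rnorm(200))
toy$y <- 3 * toy$signal + rnorm(200)

final_fit <- backward_adj_r2(toy, "y", c("signal", "noise_a", "noise_b"))
kept <- attr(terms(final_fit), "term.labels")
```

Refitting from a list of variable names keeps every candidate model on the same data frame, which makes the adjusted R-squared values directly comparable.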

Now I want to see if this final model is valid, so I will check that the assumptions for multiple linear regression are met.

Check Assumptions:

  1. Residuals (and Y values) are independent. The movies were randomly selected for inclusion in the dataset, so they should be independent.

  2. X and Y have a linear relationship.

  3. Nearly normal residuals with mean 0.

  4. Constant variability of residuals (homoscedasticity).

## Warning in model.matrix.default(object, data = structure(list(critics_score
## = c(45, : the response appeared on the right-hand side and was dropped
## Warning in model.matrix.default(object, data = structure(list(critics_score
## = c(45, : problem with term 3 in model.matrix: no columns are assigned

The diagnostic graphs above show that the conditions are met: residuals are nearly normally distributed around 0, there is homoscedasticity, and we already checked linear relationships when we were selecting variables.
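Diagnostic graphs of this kind could be drawn roughly as follows: residuals against fitted values for constant variability, plus a histogram and a normal Q-Q plot for near-normality. This is a sketch on a simulated model (the `toy` data and `fit` object are assumptions). The `pdf(NULL)` call only keeps it runnable outside an interactive session.

```r
# Simulated stand-in model (illustrative only).
set.seed(3)
toy <- data.frame(x = runif(150, 0, 10))
toy$y <- 5 + 3 * toy$x + rnorm(150, sd = 2)
fit <- lm(y ~ x, data = toy)

pdf(NULL)  # null device; remove this line inside an R Markdown chunk
op <- par(mfrow = c(1, 3))

# Constant variability: the points should form an even band around 0.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Near-normality: roughly bell-shaped histogram, points close to the Q-Q line.
hist(resid(fit), main = "Residuals", xlab = "Residual")
qqnorm(resid(fit))
qqline(resid(fit))

par(op)
invisible(dev.off())

# With an intercept in the model, the residuals average to zero by construction.
mean_resid <- mean(resid(fit))
```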


Part 5: Prediction

Overall, the model fits pretty well:

Now I want to see how the model does on a future (as of this dataset) movie from 2016 or beyond.

I will look at the movie ‘La La Land’. I will use the data from IMDB as a source.

##        1 
## 91.40004

The actual score is 91%, so the model’s residual (actual minus predicted) is only about -0.4! Pretty good.

##        fit      lwr      upr
## 1 91.40004 69.63982 113.1603

The 95% prediction interval (about 69.6 to 113.2) is quite wide, and its upper bound exceeds the maximum possible score of 100, so I think this particular movie is predicted well by the model, but another movie might not be.
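The printed prediction can be reproduced by hand from the coefficients of the final model. The inputs below are assumptions, since the report does not list them: an IMDB rating of 8.0, a feature film, and the baseline critics_rating level (the one without its own coefficient). These values happen to reproduce the printed fit. The half-width is only a rough approximation that ignores uncertainty in the coefficients, which is why it comes out slightly narrower than the printed interval.

```r
# Coefficients copied from the printed summary of the final model.
intercept      <- 26.4316
b_imdb_rating  <- 8.7768
b_feature_film <- -5.2456  # title_typeFeature Film
resid_se       <- 11.03    # residual standard error
resid_df       <- 645      # residual degrees of freedom

# Assumed input (not listed in the report): IMDB rating of 8.0.
imdb_rating <- 8.0

# Baseline critics_rating level contributes no extra term.
point_pred <- intercept + b_imdb_rating * imdb_rating + b_feature_film

# Rough 95% half-width for a single new movie, ignoring coefficient uncertainty.
half_width <- qt(0.975, df = resid_df) * resid_se

c(fit = point_pred, lwr = point_pred - half_width, upr = point_pred + half_width)
```

A half-width this close to twice the residual standard error is what one would expect from an interval for an individual movie rather than for an average score.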


Part 6: Conclusion

We can use IMDB rating, critics rating, and title type to reliably predict critics’ score, so long as the movie is similar to the type we used to train the model (American movies released prior to 2016).

In reality, this probably isn’t a very useful model, because if you don’t know the critics’ score you probably also don’t know the critics’ rating. But even with the removal of that (helpful) variable, the model still does fairly well.