Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

Part 1: Data

As shown in the codebook, the data set is comprised of 651 randomly sampled movies produced and released before 2016. Hence, the inference method only implicates generalization rather than causality.

Part 2: Research question

What attributes do make a movie popular in Rotten Tomatoes?

Part 3: Exploratory data analysis

Predictors:
- “title_type”
- “genre”
- “runtime”
- “mpaa_rating”
- “thtr_rel_month”
- “critics_rating”
- “critics_score”
- “best_pic_nom”
- “best_pic_win”
- “best_actor_win”
- “best_actress_win”
- “best_dir_win”
- “top200_box”

# subset the needed columns
predictors <- movies[, c(2,3,4,5,8,15,16,19,20,21,22,23,24)]
str(predictors)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  13 variables:
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Check best_*_wins and top200_box

# Intuitively, more bests mean higher ratings, so I sum up the binary counts of "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", "top200_box"
predictors$bests <- rowSums(sapply(predictors[,8:13], function(x) ifelse(x == 'yes', 1, 0)))
# remove binary counts
predictors <- predictors[,-(8:13)]

Check the relationship between critics ratings and scores

ggpairs(predictors, columns = c(6,7))

# The critics score plotting are nearly bimodel whose peaks accord with the ratings of 'Rotten' and 'Certified Fresh'. The middle level of Fresh is quite close to Certified Fresh, which makes the distribution a bimodel rather than a trimodel. Here I only keep the numeric score and drop the categorical ratings.

# remove critic ratings
predictors$critics_rating <- NULL

Check the relationship between title type and genre

ggpairs(predictors, columns = c(1,2))

# The feature film dominates the title type, I tend to drop tile_type since it doesn't contribute much variance.

# remove title_type
predictors$title_type <- NULL

Check correlations between numeric variables

ggpairs(predictors, columns = c('runtime', 'thtr_rel_month', 'critics_score', 'bests'))

# Not much high corrections, so we are going to keep them all

Labels:
- “audience score”

# Labels: since this is a regression model, so the numeric audience score is used as the label rather than the categorical variabel audience rating
label <- movies[, c(18)]
str(label)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  1 variable:
##  $ audience_score: num  73 81 91 76 27 86 76 47 89 66 ...

# Pair-plooting the audience score and rating
ggpairs(movies, columns = c(17,18))

# The audience score plotting are bimodel whose peaks basically accord with the binary audience rating of 'Spilled' and 'Upright'. It makes sense.

Combine predictors and label

movies.rt <- cbind(predictors, label)
str(movies.rt)

## 'data.frame':    651 obs. of  7 variables:
##  $ genre         : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime       : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating   : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ thtr_rel_month: num  4 3 8 10 9 1 1 11 9 3 ...
##  $ critics_score : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ bests         : num  0 0 0 2 0 0 0 1 0 0 ...
##  $ audience_score: num  73 81 91 76 27 86 76 47 89 66 ...

sample the data for train and test

# sample the data for train and test
set.seed(100)
len <- nrow(movies)
idx <- sample(len, size=0.9 * len, replace=FALSE)
# test data
movies.test <- movies.rt[-idx, ]

# train data
movies.rt <- movies.rt[idx, ]

Part 4: Modeling

m.rt <- lm(audience_score ~ ., data = movies.rt)
summary(m.rt)

## 
## Call:
## lm(formula = audience_score ~ ., data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.321  -8.876   0.127   9.494  40.214 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     28.61036    5.28229   5.416 9.02e-08 ***
## genreAnimation                   5.07506    5.57720   0.910  0.36323    
## genreArt House & International   5.27970    4.68384   1.127  0.26013    
## genreComedy                     -2.29289    2.48090  -0.924  0.35577    
## genreDocumentary                10.01201    3.34578   2.992  0.00289 ** 
## genreDrama                       1.13472    2.18741   0.519  0.60414    
## genreHorror                     -9.16743    3.72782  -2.459  0.01422 *  
## genreMusical & Performing Arts  12.19676    6.07346   2.008  0.04510 *  
## genreMystery & Suspense         -5.16759    2.77687  -1.861  0.06327 .  
## genreOther                       1.09184    4.01148   0.272  0.78558    
## genreScience Fiction & Fantasy  -7.41032    5.01231  -1.478  0.13985    
## runtime                          0.08401    0.03524   2.384  0.01745 *  
## mpaa_ratingNC-17               -11.08995   10.61009  -1.045  0.29637    
## mpaa_ratingPG                   -0.71759    4.08751  -0.176  0.86071    
## mpaa_ratingPG-13                -0.90689    4.21393  -0.215  0.82968    
## mpaa_ratingR                    -0.24251    4.06035  -0.060  0.95239    
## mpaa_ratingUnrated              -1.95396    4.62310  -0.423  0.67271    
## thtr_rel_month                  -0.08981    0.16900  -0.531  0.59535    
## critics_score                    0.44494    0.02416  18.417  < 2e-16 ***
## bests                           -0.07399    0.89650  -0.083  0.93425    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.88 on 564 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.5208 
## F-statistic: 34.35 on 19 and 564 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.5208

Backward model selection

First drop: from the summary, thtr_rel_month, bests, and mpaa_rating are not significant. Let’s drop them alternatively to pick the model that increases most.

# remove month
m.rt.1.month <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + bests, data = movies.rt)
summary(m.rt.1.month)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     critics_score + bests, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.335  -8.847   0.326   9.347  40.068 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     28.43552    5.26868   5.397 9.98e-08 ***
## genreAnimation                   4.91440    5.56546   0.883  0.37760    
## genreArt House & International   5.32566    4.68006   1.138  0.25563    
## genreComedy                     -2.35475    2.47660  -0.951  0.34211    
## genreDocumentary                 9.98946    3.34338   2.988  0.00293 ** 
## genreDrama                       1.14701    2.18589   0.525  0.59997    
## genreHorror                     -9.18929    3.72523  -2.467  0.01393 *  
## genreMusical & Performing Arts  12.19229    6.06959   2.009  0.04504 *  
## genreMystery & Suspense         -5.09532    2.77178  -1.838  0.06655 .  
## genreOther                       1.17727    4.00572   0.294  0.76894    
## genreScience Fiction & Fantasy  -7.41155    5.00913  -1.480  0.13954    
## runtime                          0.08045    0.03457   2.327  0.02032 *  
## mpaa_ratingNC-17               -11.02524   10.60265  -1.040  0.29885    
## mpaa_ratingPG                   -0.77203    4.08363  -0.189  0.85012    
## mpaa_ratingPG-13                -0.91394    4.21124  -0.217  0.82827    
## mpaa_ratingR                    -0.30941    4.05582  -0.076  0.93922    
## mpaa_ratingUnrated              -1.98562    4.61978  -0.430  0.66750    
## critics_score                    0.44520    0.02414  18.443  < 2e-16 ***
## bests                           -0.10681    0.89380  -0.119  0.90492    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.88 on 565 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5362, Adjusted R-squared:  0.5214 
## F-statistic: 36.29 on 18 and 565 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.5214

# remove mpaa_rating
m.rt.1.mpaa_rating <-lm(audience_score ~ genre + runtime + thtr_rel_month + critics_score + bests, data = movies.rt)
summary(m.rt.1.mpaa_rating)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month + 
##     critics_score + bests, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.922  -8.791   0.187   9.434  40.512 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    28.20867    3.95954   7.124 3.18e-12 ***
## genreAnimation                  5.38062    5.02762   1.070  0.28498    
## genreArt House & International  5.04474    4.58880   1.099  0.27208    
## genreComedy                    -2.34839    2.45745  -0.956  0.33967    
## genreDocumentary                9.31338    2.99232   3.112  0.00195 ** 
## genreDrama                      1.15367    2.13256   0.541  0.58873    
## genreHorror                    -9.07345    3.64739  -2.488  0.01314 *  
## genreMusical & Performing Arts 12.50041    6.02466   2.075  0.03845 *  
## genreMystery & Suspense        -4.96047    2.70590  -1.833  0.06729 .  
## genreOther                      1.06880    3.98880   0.268  0.78883    
## genreScience Fiction & Fantasy -7.28103    4.99091  -1.459  0.14516    
## runtime                         0.08226    0.03438   2.393  0.01706 *  
## thtr_rel_month                 -0.08024    0.16797  -0.478  0.63303    
## critics_score                   0.44373    0.02343  18.937  < 2e-16 ***
## bests                          -0.07471    0.88377  -0.085  0.93266    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.84 on 569 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5351, Adjusted R-squared:  0.5236 
## F-statistic: 46.77 on 14 and 569 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.5236

# remove bests
m.rt.1.bests <- lm(audience_score ~ genre + runtime + mpaa_rating + thtr_rel_month + critics_score, data = movies.rt)
summary(m.rt.1.bests)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating + 
##     thtr_rel_month + critics_score, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.291  -8.887   0.164   9.531  40.235 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     28.71786    5.11467   5.615 3.09e-08 ***
## genreAnimation                   5.06645    5.57132   0.909  0.36354    
## genreArt House & International   5.27795    4.67967   1.128  0.25986    
## genreComedy                     -2.30212    2.47620  -0.930  0.35292    
## genreDocumentary                10.01858    3.34189   2.998  0.00284 ** 
## genreDrama                       1.12525    2.18247   0.516  0.60635    
## genreHorror                     -9.17231    3.72408  -2.463  0.01408 *  
## genreMusical & Performing Arts  12.22075    6.06116   2.016  0.04425 *  
## genreMystery & Suspense         -5.19123    2.75963  -1.881  0.06047 .  
## genreOther                       1.08189    4.00614   0.270  0.78722    
## genreScience Fiction & Fantasy  -7.40951    5.00790  -1.480  0.13955    
## runtime                          0.08296    0.03286   2.525  0.01184 *  
## mpaa_ratingNC-17               -11.09080   10.60075  -1.046  0.29590    
## mpaa_ratingPG                   -0.72392    4.08320  -0.177  0.85934    
## mpaa_ratingPG-13                -0.90648    4.21022  -0.215  0.82961    
## mpaa_ratingR                    -0.23352    4.05532  -0.058  0.95410    
## mpaa_ratingUnrated              -1.92912    4.60924  -0.419  0.67572    
## thtr_rel_month                  -0.09077    0.16845  -0.539  0.59021    
## critics_score                    0.44464    0.02386  18.634  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.87 on 565 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.5217 
## F-statistic: 36.32 on 18 and 565 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.5217

Pick the model of m.rt.1.mpaa_rating that remove mpaa_rating with highest increase of adjusted R2 from 0.5208 to 0.5236

Second drop: based on the previous drop, drop bests and thtr_rel_month

# remove bests
m.rt.2.bests <- lm(audience_score ~ genre + runtime + thtr_rel_month + critics_score, data = movies.rt)
summary(m.rt.2.bests)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month + 
##     critics_score, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.891  -8.848   0.211   9.422  40.542 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    28.31656    3.74498   7.561 1.61e-13 ***
## genreAnimation                  5.36880    5.02130   1.069  0.28543    
## genreArt House & International  5.05322    4.58371   1.102  0.27074    
## genreComedy                    -2.35696    2.45321  -0.961  0.33708    
## genreDocumentary                9.33348    2.98026   3.132  0.00183 ** 
## genreDrama                      1.14820    2.12972   0.539  0.59001    
## genreHorror                    -9.07015    3.64401  -2.489  0.01309 *  
## genreMusical & Performing Arts 12.52952    6.00958   2.085  0.03752 *  
## genreMystery & Suspense        -4.97941    2.69426  -1.848  0.06510 .  
## genreOther                      1.05958    3.98384   0.266  0.79036    
## genreScience Fiction & Fantasy -7.27932    4.98652  -1.460  0.14490    
## runtime                         0.08121    0.03204   2.535  0.01152 *  
## thtr_rel_month                 -0.08119    0.16744  -0.485  0.62795    
## critics_score                   0.44345    0.02317  19.140  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 570 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5351, Adjusted R-squared:  0.5245 
## F-statistic: 50.46 on 13 and 570 DF,  p-value: < 2.2e-16

# Adjusted R-squared: 0.5245

# remove thtr_rel_month
m.rt.2.month <- lm(audience_score ~ genre + runtime + critics_score + bests, data = movies.rt)
summary(m.rt.2.month)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score + 
##     bests, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.014  -9.041   0.304   9.425  40.357 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    28.00688    3.93428   7.119 3.29e-12 ***
## genreAnimation                  5.26196    5.01808   1.049  0.29481    
## genreArt House & International  5.08049    4.58508   1.108  0.26831    
## genreComedy                    -2.40145    2.45327  -0.979  0.32806    
## genreDocumentary                9.30364    2.99022   3.111  0.00196 ** 
## genreDrama                      1.15975    2.13108   0.544  0.58651    
## genreHorror                    -9.10802    3.64420  -2.499  0.01272 *  
## genreMusical & Performing Arts 12.47887    6.02042   2.073  0.03864 *  
## genreMystery & Suspense        -4.90492    2.70157  -1.816  0.06996 .  
## genreOther                      1.13852    3.98343   0.286  0.77513    
## genreScience Fiction & Fantasy -7.28585    4.98752  -1.461  0.14462    
## runtime                         0.07918    0.03375   2.346  0.01931 *  
## critics_score                   0.44387    0.02341  18.958  < 2e-16 ***
## bests                          -0.10293    0.88120  -0.117  0.90706    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 570 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5349, Adjusted R-squared:  0.5243 
## F-statistic: 50.42 on 13 and 570 DF,  p-value: < 2.2e-16

# Adjusted R-squared: 0.5243

Droping thtr_rel_month have the same increase of adjusted R2 from 0.5236 to 0.5245

Third drop: drop both bests

m.rt.3 <- lm(audience_score ~ genre + runtime + critics_score, data = movies.rt)
summary(m.rt.3)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score, 
##     data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.175  -9.062   0.339   9.491  40.395 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    28.15287    3.72723   7.553 1.69e-13 ***
## genreAnimation                  5.24367    5.01130   1.046  0.29583    
## genreArt House & International  5.09279    4.57991   1.112  0.26661    
## genreComedy                    -2.41418    2.44873  -0.986  0.32460    
## genreDocumentary                9.33129    2.97826   3.133  0.00182 ** 
## genreDrama                      1.15227    2.12827   0.541  0.58843    
## genreHorror                    -9.10403    3.64090  -2.500  0.01268 *  
## genreMusical & Performing Arts 12.51879    6.00551   2.085  0.03755 *  
## genreMystery & Suspense        -4.93022    2.69055  -1.832  0.06741 .  
## genreOther                      1.12690    3.97875   0.283  0.77710    
## genreScience Fiction & Fantasy -7.28356    4.98317  -1.462  0.14439    
## runtime                         0.07768    0.03118   2.491  0.01301 *  
## critics_score                   0.44348    0.02315  19.155  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.82 on 571 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5349, Adjusted R-squared:  0.5251 
## F-statistic: 54.72 on 12 and 571 DF,  p-value: < 2.2e-16

# Adjusted R-squared: 0.5251

Droping both bests and thtr_rel_month have the same increase of adjusted R2 from 0.5245 to 0.5251

Forth drop: drop runtime, critics_score, and genre alternatively to check the R2 increase

m.rt.4.runtime <- lm(audience_score ~ genre + critics_score, data = movies.rt)
summary(m.rt.4.runtime)

## 
## Call:
## lm(formula = audience_score ~ genre + critics_score, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.673  -8.807   0.027   9.448  41.835 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     35.83345    2.09962  17.067  < 2e-16 ***
## genreAnimation                   3.89933    5.00203   0.780  0.43598    
## genreArt House & International   4.67639    4.59508   1.018  0.30925    
## genreComedy                     -2.90570    2.45047  -1.186  0.23620    
## genreDocumentary                 8.19200    2.94955   2.777  0.00566 ** 
## genreDrama                       1.54055    2.13090   0.723  0.47000    
## genreHorror                    -10.06917    3.63465  -2.770  0.00578 ** 
## genreMusical & Performing Arts  13.20202    6.02301   2.192  0.02879 *  
## genreMystery & Suspense         -4.53977    2.69661  -1.684  0.09282 .  
## genreOther                       1.51605    3.99143   0.380  0.70421    
## genreScience Fiction & Fantasy  -7.55573    5.00180  -1.511  0.13144    
## critics_score                    0.45222    0.02298  19.676  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.88 on 573 degrees of freedom
## Multiple R-squared:  0.5297, Adjusted R-squared:  0.5207 
## F-statistic: 58.67 on 11 and 573 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.5207

m.rt.4.genre <- lm(audience_score ~ runtime + critics_score, data = movies.rt)
summary(m.rt.4.genre)

## 
## Call:
## lm(formula = audience_score ~ runtime + critics_score, data = movies.rt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.897  -9.714   0.298  10.046  42.177 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   26.88780    3.26629   8.232 1.22e-15 ***
## runtime        0.06773    0.03021   2.242   0.0253 *  
## critics_score  0.49105    0.02125  23.105  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.24 on 581 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.4976, Adjusted R-squared:  0.4959 
## F-statistic: 287.7 on 2 and 581 DF,  p-value: < 2.2e-16

# Adjusted R-squared:  0.4959

m.rt.4.critics <- lm(audience_score ~ genre + runtime, data = movies.rt)
summary(m.rt.4.critics)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime, data = movies.rt)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.75 -12.40   1.21  12.68  43.20 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    36.65187    4.73880   7.734 4.71e-14 ***
## genreAnimation                 11.24826    6.40442   1.756 0.079568 .  
## genreArt House & International  9.38533    5.85757   1.602 0.109651    
## genreComedy                    -2.35616    3.13561  -0.751 0.452709    
## genreDocumentary               30.46422    3.54235   8.600  < 2e-16 ***
## genreDrama                     10.76002    2.64850   4.063 5.53e-05 ***
## genreHorror                    -5.66784    4.65652  -1.217 0.224036    
## genreMusical & Performing Arts 26.50512    7.63303   3.472 0.000555 ***
## genreMystery & Suspense         1.65375    3.41703   0.484 0.628590    
## genreOther                     11.49510    5.04744   2.277 0.023130 *  
## genreScience Fiction & Fantasy -2.60474    6.37330  -0.409 0.682916    
## runtime                         0.16675    0.03948   4.224 2.80e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.7 on 572 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.236,  Adjusted R-squared:  0.2213 
## F-statistic: 16.06 on 11 and 572 DF,  p-value: < 2.2e-16

# Adjusted R-squared: 0.2213

Fourth drop doesn’t show any increase in R2, so our model selection stop here. m.rt.3 is our final model.

Diagnostics for MLR

Use residual plots to evaluate whether the conditions of least squares regression are reasonable.

ggplot(m.rt.3, aes(.fitted, .resid)) + 
    geom_jitter()

Linear association: The residuals plot doesn’t show a random scatter.
Constant variance of residuals: residuals vary less in the right than left.
Nearly normal residuals: Residuals are slightly right skewed
Independent observations: Classes sampled randomly

Conclusion: this model doesn’t quite meet the requirements for MLR.

Part 5: Prediction

movies.test <- movies.test[, c('genre','runtime','critics_score', 'audience_score')]

# predict with CI of 95%
preds.95 <- predict(m.rt.3, movies.test[, -c(4)], interval = "prediction", level = 0.95)

preds.95 <- cbind(preds.95, movies.test$audience_score)
colnames(preds.95)[4] <- "score"
preds.95 <- as.data.frame(preds.95)

# Evalutate 95% CI
preds.95$cover <- mutate(preds.95, cover = ifelse(score >= lwr & score <= upr, TRUE, FALSE))
# sum the predictions that CI covers real audience score
sum(preds.95$cover == TRUE) / nrow(preds.95)

## [1] 0.9545455

Part 6: Conclusion

Based on the fitted model m.rt.3, critics_score, runtime, and genere contribute most to the model. critics_score has the highest weight of 0.44348. In Genre, different genres have different effect on the score. For example, Documentary and Musical & Performing Arts have positive effect, while Science Fiction & Fantasy and Horror tend to have negative effect. Runtime has a little effect on the socre.

This model doesn’t provide a high R2(slightly higher 50%), which is also shown in the Diagnostics result. A polynomial model is worth trying for regession approach. Additionally, there are many categorical features in the dataset, it’s better to choose other methods like decision tree to imporve the prediction accuracy. But it’s beyond the scope of this class.