library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)
As shown in the codebook, the data set comprises 651 randomly sampled movies produced and released before 2016. Since the movies were randomly sampled but nothing was randomly assigned, the inference supports generalization rather than causality.
What attributes make a movie popular on Rotten Tomatoes?
# subset the needed columns
predictors <- movies[, c(2,3,4,5,8,15,16,19,20,21,22,23,24)]
str(predictors)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 13 variables:
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ thtr_rel_month : num 4 3 8 10 9 1 1 11 9 3 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
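As an aside, selecting columns by name is more robust than positional indices if the codebook ordering ever changes. A minimal sketch with dplyr (already loaded), using the names shown by str() above:
# name-based equivalent of the positional subset
predictors <- movies %>%
  select(title_type, genre, runtime, mpaa_rating, thtr_rel_month,
         critics_rating, critics_score, best_pic_nom, best_pic_win,
         best_actor_win, best_actress_win, best_dir_win, top200_box)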
# Intuitively, more award nominations and wins should mean higher ratings, so I sum the yes/no indicators "best_pic_nom", "best_pic_win", "best_actor_win", "best_actress_win", "best_dir_win", and "top200_box" into a single count
predictors$bests <- rowSums(sapply(predictors[,8:13], function(x) ifelse(x == 'yes', 1, 0)))
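A terser equivalent (a sketch): comparing the factor columns to "yes" yields a logical matrix that rowSums() can total directly.
# should be TRUE if the two constructions agree
all(rowSums(predictors[, 8:13] == "yes") == predictors$bests)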
# remove binary counts
predictors <- predictors[,-(8:13)]
ggpairs(predictors, columns = c(6,7))
# The critics score distribution is nearly bimodal, with peaks that accord with the ratings 'Rotten' and 'Certified Fresh'. The middle level, 'Fresh', sits close to 'Certified Fresh', which makes the distribution bimodal rather than trimodal. Here I keep only the numeric score and drop the categorical rating.
# remove critic ratings
predictors$critics_rating <- NULL
ggpairs(predictors, columns = c(1,2))
# Feature films dominate title_type, so I drop title_type since it contributes little variance.
# remove title_type
predictors$title_type <- NULL
ggpairs(predictors, columns = c('runtime', 'thtr_rel_month', 'critics_score', 'bests'))
# No strong correlations among these four, so we keep them all.
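The pairwise correlations can also be read off numerically; a sketch (use = 'complete.obs' guards against the one movie with a missing runtime):
cor(predictors[, c('runtime', 'thtr_rel_month', 'critics_score', 'bests')],
    use = 'complete.obs')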
Labels:
# Since this is a regression model, the numeric audience_score is used as the label rather than the categorical variable audience_rating
label <- movies[, c(18)]
str(label)
## Classes 'tbl_df', 'tbl' and 'data.frame': 651 obs. of 1 variable:
## $ audience_score: num 73 81 91 76 27 86 76 47 89 66 ...
# Pair-plotting the audience score and rating
ggpairs(movies, columns = c(17,18))
# The audience score distribution is bimodal, with peaks that roughly align with the binary audience ratings 'Spilled' and 'Upright', which makes sense.
movies.rt <- cbind(predictors, label)
str(movies.rt)
## 'data.frame': 651 obs. of 7 variables:
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ thtr_rel_month: num 4 3 8 10 9 1 1 11 9 3 ...
## $ critics_score : num 45 96 91 80 33 91 57 17 90 83 ...
## $ bests : num 0 0 0 2 0 0 0 1 0 0 ...
## $ audience_score: num 73 81 91 76 27 86 76 47 89 66 ...
# sample the data for train and test
set.seed(100)
len <- nrow(movies)
idx <- sample(len, size=0.9 * len, replace=FALSE)
# test data
movies.test <- movies.rt[-idx, ]
# train data
movies.rt <- movies.rt[idx, ]
m.rt <- lm(audience_score ~ ., data = movies.rt)
summary(m.rt)
##
## Call:
## lm(formula = audience_score ~ ., data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.321 -8.876 0.127 9.494 40.214
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.61036 5.28229 5.416 9.02e-08 ***
## genreAnimation 5.07506 5.57720 0.910 0.36323
## genreArt House & International 5.27970 4.68384 1.127 0.26013
## genreComedy -2.29289 2.48090 -0.924 0.35577
## genreDocumentary 10.01201 3.34578 2.992 0.00289 **
## genreDrama 1.13472 2.18741 0.519 0.60414
## genreHorror -9.16743 3.72782 -2.459 0.01422 *
## genreMusical & Performing Arts 12.19676 6.07346 2.008 0.04510 *
## genreMystery & Suspense -5.16759 2.77687 -1.861 0.06327 .
## genreOther 1.09184 4.01148 0.272 0.78558
## genreScience Fiction & Fantasy -7.41032 5.01231 -1.478 0.13985
## runtime 0.08401 0.03524 2.384 0.01745 *
## mpaa_ratingNC-17 -11.08995 10.61009 -1.045 0.29637
## mpaa_ratingPG -0.71759 4.08751 -0.176 0.86071
## mpaa_ratingPG-13 -0.90689 4.21393 -0.215 0.82968
## mpaa_ratingR -0.24251 4.06035 -0.060 0.95239
## mpaa_ratingUnrated -1.95396 4.62310 -0.423 0.67271
## thtr_rel_month -0.08981 0.16900 -0.531 0.59535
## critics_score 0.44494 0.02416 18.417 < 2e-16 ***
## bests -0.07399 0.89650 -0.083 0.93425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.88 on 564 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5364, Adjusted R-squared: 0.5208
## F-statistic: 34.35 on 19 and 564 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5208
First drop: from the summary, thtr_rel_month, bests, and mpaa_rating are not significant. Let's drop each of them in turn and pick the model whose adjusted R-squared increases the most.
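Adjusted R-squared is the yardstick here because, unlike plain R-squared, it penalizes extra predictors. It can be recomputed by hand from the summary; a sketch (s$df[1] counts the fitted coefficients, including the intercept):
s <- summary(m.rt)
n <- length(s$residuals)  # non-missing observations
p <- s$df[1] - 1          # predictors, excluding the intercept
1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)  # should match s$adj.r.squared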
# remove month
m.rt.1.month <- lm(audience_score ~ genre + runtime + mpaa_rating + critics_score + bests, data = movies.rt)
summary(m.rt.1.month)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating +
## critics_score + bests, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.335 -8.847 0.326 9.347 40.068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.43552 5.26868 5.397 9.98e-08 ***
## genreAnimation 4.91440 5.56546 0.883 0.37760
## genreArt House & International 5.32566 4.68006 1.138 0.25563
## genreComedy -2.35475 2.47660 -0.951 0.34211
## genreDocumentary 9.98946 3.34338 2.988 0.00293 **
## genreDrama 1.14701 2.18589 0.525 0.59997
## genreHorror -9.18929 3.72523 -2.467 0.01393 *
## genreMusical & Performing Arts 12.19229 6.06959 2.009 0.04504 *
## genreMystery & Suspense -5.09532 2.77178 -1.838 0.06655 .
## genreOther 1.17727 4.00572 0.294 0.76894
## genreScience Fiction & Fantasy -7.41155 5.00913 -1.480 0.13954
## runtime 0.08045 0.03457 2.327 0.02032 *
## mpaa_ratingNC-17 -11.02524 10.60265 -1.040 0.29885
## mpaa_ratingPG -0.77203 4.08363 -0.189 0.85012
## mpaa_ratingPG-13 -0.91394 4.21124 -0.217 0.82827
## mpaa_ratingR -0.30941 4.05582 -0.076 0.93922
## mpaa_ratingUnrated -1.98562 4.61978 -0.430 0.66750
## critics_score 0.44520 0.02414 18.443 < 2e-16 ***
## bests -0.10681 0.89380 -0.119 0.90492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.88 on 565 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5362, Adjusted R-squared: 0.5214
## F-statistic: 36.29 on 18 and 565 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5214
# remove mpaa_rating
m.rt.1.mpaa_rating <-lm(audience_score ~ genre + runtime + thtr_rel_month + critics_score + bests, data = movies.rt)
summary(m.rt.1.mpaa_rating)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month +
## critics_score + bests, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.922 -8.791 0.187 9.434 40.512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.20867 3.95954 7.124 3.18e-12 ***
## genreAnimation 5.38062 5.02762 1.070 0.28498
## genreArt House & International 5.04474 4.58880 1.099 0.27208
## genreComedy -2.34839 2.45745 -0.956 0.33967
## genreDocumentary 9.31338 2.99232 3.112 0.00195 **
## genreDrama 1.15367 2.13256 0.541 0.58873
## genreHorror -9.07345 3.64739 -2.488 0.01314 *
## genreMusical & Performing Arts 12.50041 6.02466 2.075 0.03845 *
## genreMystery & Suspense -4.96047 2.70590 -1.833 0.06729 .
## genreOther 1.06880 3.98880 0.268 0.78883
## genreScience Fiction & Fantasy -7.28103 4.99091 -1.459 0.14516
## runtime 0.08226 0.03438 2.393 0.01706 *
## thtr_rel_month -0.08024 0.16797 -0.478 0.63303
## critics_score 0.44373 0.02343 18.937 < 2e-16 ***
## bests -0.07471 0.88377 -0.085 0.93266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.84 on 569 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5351, Adjusted R-squared: 0.5236
## F-statistic: 46.77 on 14 and 569 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5236
# remove bests
m.rt.1.bests <- lm(audience_score ~ genre + runtime + mpaa_rating + thtr_rel_month + critics_score, data = movies.rt)
summary(m.rt.1.bests)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + mpaa_rating +
## thtr_rel_month + critics_score, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.291 -8.887 0.164 9.531 40.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.71786 5.11467 5.615 3.09e-08 ***
## genreAnimation 5.06645 5.57132 0.909 0.36354
## genreArt House & International 5.27795 4.67967 1.128 0.25986
## genreComedy -2.30212 2.47620 -0.930 0.35292
## genreDocumentary 10.01858 3.34189 2.998 0.00284 **
## genreDrama 1.12525 2.18247 0.516 0.60635
## genreHorror -9.17231 3.72408 -2.463 0.01408 *
## genreMusical & Performing Arts 12.22075 6.06116 2.016 0.04425 *
## genreMystery & Suspense -5.19123 2.75963 -1.881 0.06047 .
## genreOther 1.08189 4.00614 0.270 0.78722
## genreScience Fiction & Fantasy -7.40951 5.00790 -1.480 0.13955
## runtime 0.08296 0.03286 2.525 0.01184 *
## mpaa_ratingNC-17 -11.09080 10.60075 -1.046 0.29590
## mpaa_ratingPG -0.72392 4.08320 -0.177 0.85934
## mpaa_ratingPG-13 -0.90648 4.21022 -0.215 0.82961
## mpaa_ratingR -0.23352 4.05532 -0.058 0.95410
## mpaa_ratingUnrated -1.92912 4.60924 -0.419 0.67572
## thtr_rel_month -0.09077 0.16845 -0.539 0.59021
## critics_score 0.44464 0.02386 18.634 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.87 on 565 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5364, Adjusted R-squared: 0.5217
## F-statistic: 36.32 on 18 and 565 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5217
Pick m.rt.1.mpaa_rating, the model that removes mpaa_rating, since it gives the largest increase in adjusted R-squared, from 0.5208 to 0.5236.
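This compare-each-drop step could also be scripted; a minimal sketch over the same data (reformulate() builds a formula from term names):
# refit with each candidate term dropped and report the adjusted R-squared
terms.all <- c("genre", "runtime", "mpaa_rating", "thtr_rel_month",
               "critics_score", "bests")
sapply(c("thtr_rel_month", "bests", "mpaa_rating"), function(v) {
  f <- reformulate(setdiff(terms.all, v), response = "audience_score")
  summary(lm(f, data = movies.rt))$adj.r.squared
})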
Second drop: building on the previous drop, remove bests and thtr_rel_month in turn.
# remove bests
m.rt.2.bests <- lm(audience_score ~ genre + runtime + thtr_rel_month + critics_score, data = movies.rt)
summary(m.rt.2.bests)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_month +
## critics_score, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.891 -8.848 0.211 9.422 40.542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.31656 3.74498 7.561 1.61e-13 ***
## genreAnimation 5.36880 5.02130 1.069 0.28543
## genreArt House & International 5.05322 4.58371 1.102 0.27074
## genreComedy -2.35696 2.45321 -0.961 0.33708
## genreDocumentary 9.33348 2.98026 3.132 0.00183 **
## genreDrama 1.14820 2.12972 0.539 0.59001
## genreHorror -9.07015 3.64401 -2.489 0.01309 *
## genreMusical & Performing Arts 12.52952 6.00958 2.085 0.03752 *
## genreMystery & Suspense -4.97941 2.69426 -1.848 0.06510 .
## genreOther 1.05958 3.98384 0.266 0.79036
## genreScience Fiction & Fantasy -7.27932 4.98652 -1.460 0.14490
## runtime 0.08121 0.03204 2.535 0.01152 *
## thtr_rel_month -0.08119 0.16744 -0.485 0.62795
## critics_score 0.44345 0.02317 19.140 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 570 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5351, Adjusted R-squared: 0.5245
## F-statistic: 50.46 on 13 and 570 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5245
# remove thtr_rel_month
m.rt.2.month <- lm(audience_score ~ genre + runtime + critics_score + bests, data = movies.rt)
summary(m.rt.2.month)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score +
## bests, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.014 -9.041 0.304 9.425 40.357
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.00688 3.93428 7.119 3.29e-12 ***
## genreAnimation 5.26196 5.01808 1.049 0.29481
## genreArt House & International 5.08049 4.58508 1.108 0.26831
## genreComedy -2.40145 2.45327 -0.979 0.32806
## genreDocumentary 9.30364 2.99022 3.111 0.00196 **
## genreDrama 1.15975 2.13108 0.544 0.58651
## genreHorror -9.10802 3.64420 -2.499 0.01272 *
## genreMusical & Performing Arts 12.47887 6.02042 2.073 0.03864 *
## genreMystery & Suspense -4.90492 2.70157 -1.816 0.06996 .
## genreOther 1.13852 3.98343 0.286 0.77513
## genreScience Fiction & Fantasy -7.28585 4.98752 -1.461 0.14462
## runtime 0.07918 0.03375 2.346 0.01931 *
## critics_score 0.44387 0.02341 18.958 < 2e-16 ***
## bests -0.10293 0.88120 -0.117 0.90706
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.83 on 570 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5349, Adjusted R-squared: 0.5243
## F-statistic: 50.42 on 13 and 570 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5243
Dropping bests (0.5245) and dropping thtr_rel_month (0.5243) give nearly the same increase in adjusted R-squared over the previous 0.5236.
Third drop: drop both bests and thtr_rel_month.
m.rt.3 <- lm(audience_score ~ genre + runtime + critics_score, data = movies.rt)
summary(m.rt.3)
##
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score,
## data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.175 -9.062 0.339 9.491 40.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.15287 3.72723 7.553 1.69e-13 ***
## genreAnimation 5.24367 5.01130 1.046 0.29583
## genreArt House & International 5.09279 4.57991 1.112 0.26661
## genreComedy -2.41418 2.44873 -0.986 0.32460
## genreDocumentary 9.33129 2.97826 3.133 0.00182 **
## genreDrama 1.15227 2.12827 0.541 0.58843
## genreHorror -9.10403 3.64090 -2.500 0.01268 *
## genreMusical & Performing Arts 12.51879 6.00551 2.085 0.03755 *
## genreMystery & Suspense -4.93022 2.69055 -1.832 0.06741 .
## genreOther 1.12690 3.97875 0.283 0.77710
## genreScience Fiction & Fantasy -7.28356 4.98317 -1.462 0.14439
## runtime 0.07768 0.03118 2.491 0.01301 *
## critics_score 0.44348 0.02315 19.155 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.82 on 571 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.5349, Adjusted R-squared: 0.5251
## F-statistic: 54.72 on 12 and 571 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5251
Dropping both bests and thtr_rel_month raises the adjusted R-squared further, from 0.5245 to 0.5251.
Fourth drop: drop runtime, critics_score, and genre in turn to check the change in adjusted R-squared.
m.rt.4.runtime <- lm(audience_score ~ genre + critics_score, data = movies.rt)
summary(m.rt.4.runtime)
##
## Call:
## lm(formula = audience_score ~ genre + critics_score, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.673 -8.807 0.027 9.448 41.835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.83345 2.09962 17.067 < 2e-16 ***
## genreAnimation 3.89933 5.00203 0.780 0.43598
## genreArt House & International 4.67639 4.59508 1.018 0.30925
## genreComedy -2.90570 2.45047 -1.186 0.23620
## genreDocumentary 8.19200 2.94955 2.777 0.00566 **
## genreDrama 1.54055 2.13090 0.723 0.47000
## genreHorror -10.06917 3.63465 -2.770 0.00578 **
## genreMusical & Performing Arts 13.20202 6.02301 2.192 0.02879 *
## genreMystery & Suspense -4.53977 2.69661 -1.684 0.09282 .
## genreOther 1.51605 3.99143 0.380 0.70421
## genreScience Fiction & Fantasy -7.55573 5.00180 -1.511 0.13144
## critics_score 0.45222 0.02298 19.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.88 on 573 degrees of freedom
## Multiple R-squared: 0.5297, Adjusted R-squared: 0.5207
## F-statistic: 58.67 on 11 and 573 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5207
m.rt.4.genre <- lm(audience_score ~ runtime + critics_score, data = movies.rt)
summary(m.rt.4.genre)
##
## Call:
## lm(formula = audience_score ~ runtime + critics_score, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.897 -9.714 0.298 10.046 42.177
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.88780 3.26629 8.232 1.22e-15 ***
## runtime 0.06773 0.03021 2.242 0.0253 *
## critics_score 0.49105 0.02125 23.105 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.24 on 581 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4976, Adjusted R-squared: 0.4959
## F-statistic: 287.7 on 2 and 581 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.4959
m.rt.4.critics <- lm(audience_score ~ genre + runtime, data = movies.rt)
summary(m.rt.4.critics)
##
## Call:
## lm(formula = audience_score ~ genre + runtime, data = movies.rt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.75 -12.40 1.21 12.68 43.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.65187 4.73880 7.734 4.71e-14 ***
## genreAnimation 11.24826 6.40442 1.756 0.079568 .
## genreArt House & International 9.38533 5.85757 1.602 0.109651
## genreComedy -2.35616 3.13561 -0.751 0.452709
## genreDocumentary 30.46422 3.54235 8.600 < 2e-16 ***
## genreDrama 10.76002 2.64850 4.063 5.53e-05 ***
## genreHorror -5.66784 4.65652 -1.217 0.224036
## genreMusical & Performing Arts 26.50512 7.63303 3.472 0.000555 ***
## genreMystery & Suspense 1.65375 3.41703 0.484 0.628590
## genreOther 11.49510 5.04744 2.277 0.023130 *
## genreScience Fiction & Fantasy -2.60474 6.37330 -0.409 0.682916
## runtime 0.16675 0.03948 4.224 2.80e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.7 on 572 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.236, Adjusted R-squared: 0.2213
## F-statistic: 16.06 on 11 and 572 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.2213
The fourth drop shows no increase in adjusted R-squared for any candidate, so model selection stops here; m.rt.3 is our final model.
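As an aside, base R's step() automates backward elimination, though it selects by AIC rather than adjusted R-squared, so it may not reproduce the path above exactly. A sketch (na.omit() avoids step() complaining when dropping runtime changes the number of usable rows):
m.step <- step(lm(audience_score ~ ., data = na.omit(movies.rt)),
               direction = "backward", trace = FALSE)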
Use residual plots to evaluate whether the conditions of least squares regression are reasonable.
ggplot(m.rt.3, aes(.fitted, .resid)) +
  geom_jitter()
Conclusion: this model doesn't quite meet the conditions for multiple linear regression.
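A couple of further checks could firm up this conclusion; a sketch (ggplot2 fortifies the lm object as above; stat_qq_line() needs ggplot2 >= 3.0.0):
# residuals vs. fitted without jitter, plus a normal Q-Q plot of the residuals
ggplot(m.rt.3, aes(.fitted, .resid)) + geom_point() + geom_hline(yintercept = 0)
ggplot(m.rt.3, aes(sample = .resid)) + stat_qq() + stat_qq_line()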
movies.test <- movies.test[, c('genre','runtime','critics_score', 'audience_score')]
# predict with 95% prediction intervals (wider than confidence intervals for the mean)
preds.95 <- predict(m.rt.3, movies.test[, -c(4)], interval = "prediction", level = 0.95)
preds.95 <- cbind(preds.95, movies.test$audience_score)
colnames(preds.95)[4] <- "score"
preds.95 <- as.data.frame(preds.95)
# Evaluate the 95% prediction intervals
preds.95 <- mutate(preds.95, cover = score >= lwr & score <= upr)
# proportion of predictions whose interval covers the real audience score
sum(preds.95$cover) / nrow(preds.95)
## [1] 0.9545455
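Coverage only evaluates the intervals; as a point-prediction measure, the RMSE on the held-out set can be read off the same table (a sketch):
# root mean squared error of the point predictions on the test set
sqrt(mean((preds.95$fit - preds.95$score)^2))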
Based on the fitted model m.rt.3, critics_score, runtime, and genre all contribute to the prediction. critics_score is by far the most significant predictor, with a coefficient of 0.44348. Within genre, different levels affect the score in different directions: for example, Documentary and Musical & Performing Arts have positive effects, while Science Fiction & Fantasy and Horror tend to have negative effects. Runtime has only a small effect on the score.
This model doesn't achieve a high R-squared (only slightly above 50%), which the diagnostics also reflect. A polynomial model would be worth trying within the regression approach. Additionally, since the dataset has many categorical features, other methods such as decision trees might improve prediction accuracy, although that is beyond the scope of this class.
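For completeness, a regression tree along those lines might look like the following sketch, assuming the rpart package is installed (method = "anova" fits a regression tree); it mirrors the predictors of m.rt.3:
library(rpart)
# fit a regression tree on the training data and predict the held-out movies
m.tree <- rpart(audience_score ~ genre + runtime + critics_score,
                data = movies.rt, method = "anova")
preds.tree <- predict(m.tree, movies.test)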