Part 0: Setup

This analysis is designed to learn what attributes make a movie popular. The data comes from Rotten Tomatoes and IMDB as a random sample of movies.

0.1 Load packages

library(ggplot2); library(GGally)
library(tidyverse); library(kableExtra)
library(statsr); library(ggridges); library(ggpubr)

0.2 Load data

load("movies.Rdata")

head(movies) %>%
        kable() %>%
        kable_styling("striped") %>%
        scroll_box(width = "100%")
title title_type genre runtime mpaa_rating studio thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes critics_rating critics_score audience_rating audience_score best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win top200_box director actor1 actor2 actor3 actor4 actor5 imdb_url rt_url
Filly Brown Feature Film Drama 80 R Indomina Media Inc.  2013 4 19 2013 7 30 5.5 899 Rotten 45 Upright 73 no no no no no no Michael D. Olmos Gina Rodriguez Jenni Rivera Lou Diamond Phillips Emilio Rivera Joseph Julian Soria http://www.imdb.com/title/tt1869425/ //www.rottentomatoes.com/m/filly_brown_2012/
The Dish Feature Film Drama 101 PG-13 Warner Bros. Pictures 2001 3 14 2001 8 28 7.3 12285 Certified Fresh 96 Upright 81 no no no no no no Rob Sitch Sam Neill Kevin Harrington Patrick Warburton Tom Long Genevieve Mooy http://www.imdb.com/title/tt0205873/ //www.rottentomatoes.com/m/dish/
Waiting for Guffman Feature Film Comedy 84 R Sony Pictures Classics 1996 8 21 2001 8 21 7.6 22381 Certified Fresh 91 Upright 91 no no no no no no Christopher Guest Christopher Guest Catherine O’Hara Parker Posey Eugene Levy Bob Balaban http://www.imdb.com/title/tt0118111/ //www.rottentomatoes.com/m/waiting_for_guffman/
The Age of Innocence Feature Film Drama 139 PG Columbia Pictures 1993 10 1 2001 11 6 7.2 35096 Certified Fresh 80 Upright 76 no no yes no yes no Martin Scorsese Daniel Day-Lewis Michelle Pfeiffer Winona Ryder Richard E. Grant Alec McCowen http://www.imdb.com/title/tt0106226/ //www.rottentomatoes.com/m/age_of_innocence/
Malevolence Feature Film Horror 90 R Anchor Bay Entertainment 2004 9 10 2005 4 19 5.1 2386 Rotten 33 Spilled 27 no no no no no no Stevan Mena Samantha Dark R. Brandon Johnson Brandon Johnson Heather Magee Richard Glover http://www.imdb.com/title/tt0388230/ //www.rottentomatoes.com/m/10004684-malevolence/
Old Partner Documentary Documentary 78 Unrated Shcalo Media Group 2009 1 15 2010 4 20 7.8 333 Fresh 91 Upright 86 no no no no no no Chung-ryoul Lee Choi Won-kyun Lee Sam-soon Moo NA NA http://www.imdb.com/title/tt1334549/ //www.rottentomatoes.com/m/old-partner/

Part 1: Data

The dataset is comprised of 651 randomly sampled movies produced and released before 2016.

Some of these variables are only there for informational purposes and may be removed during the analysis process.

1.1 Missing and Duplicates Values

Before we move on to our analysis, let us confirm that we do not have any duplicate entries and/or missing values.

Let’s look for missing values first.

sapply(movies, function(x) sum(is.na(x)))
           title       title_type            genre          runtime 
               0                0                0                1 
     mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
               0                8                0                0 
    thtr_rel_day     dvd_rel_year    dvd_rel_month      dvd_rel_day 
               0                8                8                8 
     imdb_rating   imdb_num_votes   critics_rating    critics_score 
               0                0                0                0 
 audience_rating   audience_score     best_pic_nom     best_pic_win 
               0                0                0                0 
  best_actor_win best_actress_win     best_dir_win       top200_box 
               0                0                0                0 
        director           actor1           actor2           actor3 
               2                2                7                9 
          actor4           actor5         imdb_url           rt_url 
              13               15                0                0 

There are a small amount of NA’s, so for this analysis. Let’s just remove these observations.

movies <- movies %>%
        drop_na()

dim(movies)
[1] 619  32

After removing the NA’s, we now have 619 observations.

  • Let’s now take a look at duplicate movie titles.
movies %>%
        group_by(title) %>%
        tally() %>%
        filter(n > 1)
# A tibble: 4 x 2
  title                  n
  <chr>              <int>
1 Hairspray              2
2 Man on Wire            2
3 Maniac                 2
4 Where the Heart Is     2

We will remove the second entry of each of these movies.

movies <- movies[!duplicated(movies$title),]

dim(movies)
## [1] 615  32

Now, let’s turn to the variables. A few of these variables are strictly informative. We can remove title, studio, dvd_rel_xxx, actor1:5, imdb_url and rt_url.

movies_clean <- movies %>%
        select(title_type, genre, mpaa_rating, runtime, 
               thtr_rel_year, thtr_rel_month,
               thtr_rel_day, imdb_num_votes, critics_rating, 
               critics_score, best_pic_nom, best_pic_win,
               best_actor_win, best_actress_win, best_dir_win,
               top200_box, audience_score)

dim(movies_clean)
## [1] 615  17

Now we have 615 observations and 16 variables from which we can do our analysis.


Part 2: Research question

Can we identify the various factors that produce a high Rotten Tomatoes rating?

According to Wikipedia, major hollywood studios have come to see Rotten Tomatoes as a threat to their marketing, as scores have influenced box office sales. As a result, understanding what drives these ratings is paramount to the financial success of the film.

We will utilize the dataset to develop a parsimonious model to predict a Rotten Tomatoes rating, as measued by audience_score. But, since there may be lack of independence in the imdb_rating variable and it is likely valuing the same response, we will not include imdb_rating in the modeling.


Part 3: Exploratory data analysis

3.1 Collinearity

One set of variables that may be of some concern are critics_rating and critics_score.

Let’s take a look at these variables

ggplot(movies_clean, aes(x = critics_rating, y = critics_score, 
                         fill = critics_rating)) +
        geom_boxplot() +
        geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
        labs(x = "Critics Rating", y ="Critics Score",
             title = "Critics Score vs Critics Ratings")

Figure 3.1 Boxplots of Critics Score split by Critics Ratings

From this plot, it does not appear if Critics Rating adds any additional value to Critics Score, and adds collinearity to the model, so we will remove that variable.

movies_clean <- movies_clean %>%
        select(-critics_rating)

Next, we will begin to look at our variables for explanatory value of audience_score.

3.2 Numeric predictors

Let’s start with the numeric data.

movies_clean %>%
        select(runtime, thtr_rel_year, thtr_rel_month, thtr_rel_day, 
               imdb_num_votes, critics_score, audience_score) %>%
        ggpairs()

Figure 3.2 Pairs plot for numeric predictors

Looking down the diagonal of the distribution of the variables, most are skewed, but not enough for concern is our analysis. The one variable that will need to be transformed is imdb_num_votes. We will look at the log transformation.

par(mfrow = c(1, 2))

hist(movies_clean$imdb_num_votes, main = "", xlab = "Num of IMDB Votes", 
     col = "lightblue")
hist(log(movies_clean$imdb_num_votes), main = "", xlab = "Log(Num of IMDB Votes)", 
     col = "lightblue")

Figure 3.3 Histograms for transformation of IMDB Votes

That looks much better. We will transform that variable in our modeling.

movies_clean$imdb_num_votes <- log(movies_clean$imdb_num_votes)

3.3 Categorical Variables

Next we will at categorical variables

g1 <- ggplot(movies_clean, aes(x = audience_score, y = title_type, 
                               fill = title_type)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("Title Type")

g2 <- ggplot(movies_clean, aes(x = audience_score, y = genre, fill = genre)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("Genre")

g3 <- ggplot(movies_clean, aes(x = audience_score, y = mpaa_rating, 
                               fill = mpaa_rating)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("MPAA Rating")

ggarrange(g1, g2, g3, nrow = 2)
## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Figure 3.4 Ridgeline for categorical variables

Some interesting information here to dive into with another project. But for now we will keep these variables in the model.

3.4 Bernoulli Variables

Lastly, we need to took at the Bernoulli variables

movies_bern <- movies_clean %>%
        select(best_pic_nom, best_pic_win, best_actor_win, best_actress_win,
               best_dir_win, top200_box)

summary(movies_bern)
 best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
 no :593      no :608      no :524        no :545          no :572     
 yes: 22      yes:  7      yes: 91        yes: 70          yes: 43     
 top200_box
 no :600   
 yes: 15   

These won’t likely add much value to the model, since the responses are very imbalanced, but again we will keep them in for now.

Before we begin to develop the model, let’s take a quick look at each predictor’s adjusted R-squared individually vs the response variable.

response <- "audience_score"

predictors <- names(movies_clean)
predictors <- predictors[predictors != response]

len <- length(predictors)

adj_r2_all <- rep(NA, len)

for (i in 1:len){
        
        lm_fit <- lm(as.formula(paste0(response,"~",predictors[i])), movies_clean)
        
        adj_r2_i <- summary(lm_fit)$adj.r.squared
        
        adj_r2_all[i] <- adj_r2_i
}

names(adj_r2_all) <- predictors

as.data.frame(adj_r2_all) %>%
        kable() %>%
        kable_styling(full_width = FALSE)
adj_r2_all
title_type 0.0818793
genre 0.1732798
mpaa_rating 0.0571973
runtime 0.0401893
thtr_rel_year 0.0038180
thtr_rel_month 0.0014380
thtr_rel_day -0.0011344
imdb_num_votes 0.0445858
critics_score 0.4918509
best_pic_nom 0.0474277
best_pic_win 0.0128100
best_actor_win -0.0012782
best_actress_win -0.0010759
best_dir_win 0.0084121
top200_box 0.0078158

From this initial analysis it appears that critics_score does the best job at explaining the response variable with an Adj R^2 of 49.18, followed by Genre at 17.32.

With this information let’s begin to build the model.


Part 4: Modeling

Because of the number of variables, we will use backward selection to find a parsimonious model. We will remove the variable with the highest p-value, then compare adjusted r-squared of the model with p predictors vs the model with p-1 predictors. We will continue to do that until the simpler model does not produce a higher adjusted r-squared.

lm_movies <- lm(audience_score ~ ., data = movies_clean)
summary(lm_movies)
## 
## Call:
## lm(formula = audience_score ~ ., data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.882  -7.986   0.000   8.447  46.180 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    516.357153 118.231152   4.367 1.49e-05 ***
## title_typeFeature Film          -3.721121   5.366800  -0.693 0.488359    
## title_typeTV Movie              -2.220960   8.445028  -0.263 0.792651    
## genreAnimation                   6.721536   5.421468   1.240 0.215546    
## genreArt House & International  16.645804   4.332610   3.842 0.000135 ***
## genreComedy                      3.181519   2.258322   1.409 0.159426    
## genreDocumentary                20.518261   5.539044   3.704 0.000232 ***
## genreDrama                       7.407934   2.026967   3.655 0.000281 ***
## genreHorror                     -5.021897   3.409310  -1.473 0.141290    
## genreMusical & Performing Arts  18.869712   4.563141   4.135 4.07e-05 ***
## genreMystery & Suspense         -1.389018   2.539724  -0.547 0.584644    
## genreOther                       6.205597   3.875802   1.601 0.109892    
## genreScience Fiction & Fantasy  -3.595153   4.939821  -0.728 0.467033    
## mpaa_ratingNC-17                -9.721827  13.647115  -0.712 0.476518    
## mpaa_ratingPG                   -3.241724   3.880884  -0.835 0.403887    
## mpaa_ratingPG-13                -4.932626   4.044329  -1.220 0.223093    
## mpaa_ratingR                    -2.245532   3.896211  -0.576 0.564609    
## mpaa_ratingUnrated               1.445843   4.692828   0.308 0.758119    
## runtime                          0.001233   0.033947   0.036 0.971039    
## thtr_rel_year                   -0.257406   0.059555  -4.322 1.82e-05 ***
## thtr_rel_month                  -0.181163   0.159310  -1.137 0.255932    
## thtr_rel_day                    -0.006362   0.061142  -0.104 0.917165    
## imdb_num_votes                   4.180679   0.443269   9.431  < 2e-16 ***
## critics_score                    0.368057   0.023505  15.659  < 2e-16 ***
## best_pic_nomyes                  7.684932   3.446506   2.230 0.026141 *  
## best_pic_winyes                 -5.169908   5.989255  -0.863 0.388383    
## best_actor_winyes               -1.872669   1.581581  -1.184 0.236875    
## best_actress_winyes             -2.602515   1.748720  -1.488 0.137226    
## best_dir_winyes                 -1.757968   2.276946  -0.772 0.440383    
## top200_boxyes                   -1.283028   3.625040  -0.354 0.723515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 585 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5821 
## F-statistic: 30.49 on 29 and 585 DF,  p-value: < 2.2e-16

4.1 Backward Selection

Our initial model has an Adjusted R-Squared of 58.21. And the predictor with the highest p-score is runtime, so we will movie that from the model in our next iteration.

lm_movies2 <- lm(audience_score ~ . - runtime, data = movies_clean)
summary(lm_movies2)
## 
## Call:
## lm(formula = audience_score ~ . - runtime, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.858  -7.992   0.000   8.439  46.204 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    517.256651 115.509686   4.478 9.06e-06 ***
## title_typeFeature Film          -3.727526   5.359329  -0.696 0.487005    
## title_typeTV Movie              -2.227238   8.436061  -0.264 0.791862    
## genreAnimation                   6.714616   5.413499   1.240 0.215344    
## genreArt House & International  16.648771   4.328147   3.847 0.000133 ***
## genreComedy                      3.173567   2.245767   1.413 0.158148    
## genreDocumentary                20.513790   5.532955   3.708 0.000229 ***
## genreDrama                       7.414552   2.017039   3.676 0.000259 ***
## genreHorror                     -5.032853   3.393044  -1.483 0.138536    
## genreMusical & Performing Arts  18.881993   4.546716   4.153 3.77e-05 ***
## genreMystery & Suspense         -1.384074   2.533911  -0.546 0.585122    
## genreOther                       6.212412   3.867957   1.606 0.108786    
## genreScience Fiction & Fantasy  -3.597255   4.935271  -0.729 0.466362    
## mpaa_ratingNC-17                -9.747366  13.617370  -0.716 0.474398    
## mpaa_ratingPG                   -3.234766   3.872849  -0.835 0.403922    
## mpaa_ratingPG-13                -4.914878   4.011276  -1.225 0.220968    
## mpaa_ratingR                    -2.234294   3.880595  -0.576 0.564998    
## mpaa_ratingUnrated               1.469128   4.644864   0.316 0.751894    
## thtr_rel_year                   -0.257816   0.058428  -4.413 1.22e-05 ***
## thtr_rel_month                  -0.180093   0.156427  -1.151 0.250084    
## thtr_rel_day                    -0.006406   0.061077  -0.105 0.916503    
## imdb_num_votes                   4.183798   0.434496   9.629  < 2e-16 ***
## critics_score                    0.368076   0.023479  15.677  < 2e-16 ***
## best_pic_nomyes                  7.694093   3.434335   2.240 0.025443 *  
## best_pic_winyes                 -5.159310   5.977042  -0.863 0.388387    
## best_actor_winyes               -1.863518   1.560054  -1.195 0.232757    
## best_actress_winyes             -2.596356   1.738995  -1.493 0.135970    
## best_dir_winyes                 -1.747313   2.256043  -0.775 0.438946    
## top200_boxyes                   -1.276198   3.617073  -0.353 0.724346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 586 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5828 
## F-statistic: 31.63 on 28 and 586 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased to 58.28. So now we will remove thtr_rel_day.

lm_movies3 <- lm(audience_score ~ . -runtime -thtr_rel_day, data = movies_clean)
summary(lm_movies3)
## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.887  -7.998  -0.030   8.422  46.249 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    518.22679  115.04172   4.505 8.02e-06 ***
## title_typeFeature Film          -3.72193    5.35455  -0.695 0.487270    
## title_typeTV Movie              -2.24708    8.42683  -0.267 0.789826    
## genreAnimation                   6.71964    5.40873   1.242 0.214596    
## genreArt House & International  16.61461    4.31224   3.853 0.000130 ***
## genreComedy                      3.17911    2.24325   1.417 0.156958    
## genreDocumentary                20.50894    5.52810   3.710 0.000227 ***
## genreDrama                       7.41003    2.01488   3.678 0.000257 ***
## genreHorror                     -5.05159    3.38548  -1.492 0.136201    
## genreMusical & Performing Arts  18.87324    4.54212   4.155 3.73e-05 ***
## genreMystery & Suspense         -1.39174    2.53072  -0.550 0.582571    
## genreOther                       6.22489    3.86287   1.611 0.107616    
## genreScience Fiction & Fantasy  -3.56719    4.92279  -0.725 0.468969    
## mpaa_ratingNC-17                -9.80757   13.59380  -0.721 0.470905    
## mpaa_ratingPG                   -3.23871    3.86940  -0.837 0.402930    
## mpaa_ratingPG-13                -4.92505    4.00673  -1.229 0.219491    
## mpaa_ratingR                    -2.23008    3.87712  -0.575 0.565383    
## mpaa_ratingUnrated               1.48580    4.63823   0.320 0.748827    
## thtr_rel_year                   -0.25833    0.05817  -4.441 1.07e-05 ***
## thtr_rel_month                  -0.18186    0.15539  -1.170 0.242335    
## imdb_num_votes                   4.18173    0.43368   9.642  < 2e-16 ***
## critics_score                    0.36800    0.02345  15.694  < 2e-16 ***
## best_pic_nomyes                  7.70645    3.42942   2.247 0.025000 *  
## best_pic_winyes                 -5.18267    5.96786  -0.868 0.385513    
## best_actor_winyes               -1.86577    1.55859  -1.197 0.231755    
## best_actress_winyes             -2.60077    1.73702  -1.497 0.134862    
## best_dir_winyes                 -1.74107    2.25336  -0.773 0.440036    
## top200_boxyes                   -1.27214    3.61382  -0.352 0.724950    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.03 on 587 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5835 
## F-statistic: 32.86 on 27 and 587 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.35. So now we will remove top200_box.

lm_movies4 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box, 
                 data = movies_clean)
summary(lm_movies4)
## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box, 
##     data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.781  -7.942  -0.021   8.410  46.270 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    515.19390  114.63315   4.494 8.41e-06 ***
## title_typeFeature Film          -3.70260    5.35027  -0.692 0.489185    
## title_typeTV Movie              -2.24919    8.42055  -0.267 0.789480    
## genreAnimation                   6.89758    5.38104   1.282 0.200408    
## genreArt House & International  16.65523    4.30748   3.867 0.000123 ***
## genreComedy                      3.23497    2.23597   1.447 0.148489    
## genreDocumentary                20.55875    5.52217   3.723 0.000216 ***
## genreDrama                       7.45831    2.00871   3.713 0.000224 ***
## genreHorror                     -5.00468    3.38033  -1.481 0.139268    
## genreMusical & Performing Arts  18.93587    4.53525   4.175 3.43e-05 ***
## genreMystery & Suspense         -1.33745    2.52414  -0.530 0.596407    
## genreOther                       6.24109    3.85972   1.617 0.106419    
## genreScience Fiction & Fantasy  -3.62820    4.91607  -0.738 0.460791    
## mpaa_ratingNC-17                -9.59950   13.57082  -0.707 0.479621    
## mpaa_ratingPG                   -3.13005    3.85419  -0.812 0.417054    
## mpaa_ratingPG-13                -4.78934    3.98517  -1.202 0.229928    
## mpaa_ratingR                    -2.06900    3.84715  -0.538 0.590918    
## mpaa_ratingUnrated               1.61136    4.62105   0.349 0.727439    
## thtr_rel_year                   -0.25677    0.05796  -4.430 1.12e-05 ***
## thtr_rel_month                  -0.18518    0.15499  -1.195 0.232641    
## imdb_num_votes                   4.15412    0.42621   9.747  < 2e-16 ***
## critics_score                    0.36756    0.02340  15.710  < 2e-16 ***
## best_pic_nomyes                  7.72745    3.42635   2.255 0.024481 *  
## best_pic_winyes                 -5.25914    5.95946  -0.882 0.377874    
## best_actor_winyes               -1.87258    1.55731  -1.202 0.229676    
## best_actress_winyes             -2.63601    1.73284  -1.521 0.128745    
## best_dir_winyes                 -1.71247    2.25021  -0.761 0.446946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.02 on 588 degrees of freedom
## Multiple R-squared:  0.6017, Adjusted R-squared:  0.5841 
## F-statistic: 34.17 on 26 and 588 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.41. So now we will remove title_type

lm_movies5 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type, 
                 data = movies_clean)
summary(lm_movies5)
## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.760  -7.897   0.033   8.459  46.289 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    505.09575  113.60617   4.446 1.05e-05 ***
## genreAnimation                   6.82466    5.37330   1.270 0.204548    
## genreArt House & International  16.40948    4.28641   3.828 0.000143 ***
## genreComedy                      3.22658    2.23304   1.445 0.149009    
## genreDocumentary                23.49745    3.39065   6.930 1.11e-11 ***
## genreDrama                       7.36216    2.00024   3.681 0.000254 ***
## genreHorror                     -5.05539    3.37503  -1.498 0.134700    
## genreMusical & Performing Arts  19.91917    4.28851   4.645 4.20e-06 ***
## genreMystery & Suspense         -1.39473    2.51965  -0.554 0.580102    
## genreOther                       6.12753    3.85151   1.591 0.112158    
## genreScience Fiction & Fantasy  -3.67516    4.90942  -0.749 0.454401    
## mpaa_ratingNC-17                -9.63504   13.55363  -0.711 0.477438    
## mpaa_ratingPG                   -3.10732    3.84921  -0.807 0.419841    
## mpaa_ratingPG-13                -4.76483    3.97988  -1.197 0.231698    
## mpaa_ratingR                    -2.07359    3.84218  -0.540 0.589614    
## mpaa_ratingUnrated               1.89132    4.58307   0.413 0.679995    
## thtr_rel_year                   -0.25335    0.05769  -4.392 1.33e-05 ***
## thtr_rel_month                  -0.18502    0.15428  -1.199 0.230916    
## imdb_num_votes                   4.09983    0.41891   9.787  < 2e-16 ***
## critics_score                    0.37086    0.02290  16.197  < 2e-16 ***
## best_pic_nomyes                  7.72007    3.42200   2.256 0.024435 *  
## best_pic_winyes                 -5.19534    5.95119  -0.873 0.383023    
## best_actor_winyes               -1.88306    1.55478  -1.211 0.226324    
## best_actress_winyes             -2.61592    1.72921  -1.513 0.130870    
## best_dir_winyes                 -1.74554    2.24690  -0.777 0.437549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 590 degrees of freedom
## Multiple R-squared:  0.6014, Adjusted R-squared:  0.5851 
## F-statistic: 37.08 on 24 and 590 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.51. So now we will remove best_dir_win.

lm_movies6 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
                 -best_dir_win, data = movies_clean)
summary(lm_movies6)
## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type - best_dir_win, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.330  -7.782   0.167   8.460  46.302 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    495.61246  112.91053   4.389 1.35e-05 ***
## genreAnimation                   6.81604    5.37149   1.269 0.204965    
## genreArt House & International  16.47054    4.28425   3.844 0.000134 ***
## genreComedy                      3.24651    2.23215   1.454 0.146356    
## genreDocumentary                23.53253    3.38921   6.943 1.01e-11 ***
## genreDrama                       7.36114    1.99957   3.681 0.000253 ***
## genreHorror                     -5.04590    3.37388  -1.496 0.135298    
## genreMusical & Performing Arts  19.88402    4.28683   4.638 4.33e-06 ***
## genreMystery & Suspense         -1.42361    2.51853  -0.565 0.572115    
## genreOther                       6.16107    3.84997   1.600 0.110069    
## genreScience Fiction & Fantasy  -3.77220    4.90619  -0.769 0.442280    
## mpaa_ratingNC-17                -9.55803   13.54872  -0.705 0.480804    
## mpaa_ratingPG                   -3.24834    3.84364  -0.845 0.398385    
## mpaa_ratingPG-13                -4.92508    3.97320  -1.240 0.215625    
## mpaa_ratingR                    -2.23620    3.83519  -0.583 0.560067    
## mpaa_ratingUnrated               1.75819    4.57833   0.384 0.701098    
## thtr_rel_year                   -0.24834    0.05731  -4.333 1.73e-05 ***
## thtr_rel_month                  -0.19071    0.15405  -1.238 0.216231    
## imdb_num_votes                   4.06381    0.41619   9.764  < 2e-16 ***
## critics_score                    0.36936    0.02281  16.194  < 2e-16 ***
## best_pic_nomyes                  7.89884    3.41311   2.314 0.020995 *  
## best_pic_winyes                 -6.51594    5.70134  -1.143 0.253552    
## best_actor_winyes               -1.95312    1.55164  -1.259 0.208619    
## best_actress_winyes             -2.63169    1.72851  -1.523 0.128413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 591 degrees of freedom
## Multiple R-squared:  0.601,  Adjusted R-squared:  0.5854 
## F-statistic:  38.7 on 23 and 591 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.54. So now we will remove best_pic_win.

lm_movies7 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
                 -best_dir_win - best_pic_win, data = movies_clean)
summary(lm_movies7)
## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type - best_dir_win - best_pic_win, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.302  -7.834   0.113   8.469  46.213 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    487.60398  112.72202   4.326 1.78e-05 ***
## genreAnimation                   6.77447    5.37275   1.261 0.207844    
## genreArt House & International  16.41711    4.28510   3.831 0.000141 ***
## genreComedy                      3.14624    2.23100   1.410 0.158995    
## genreDocumentary                23.41513    3.38853   6.910 1.26e-11 ***
## genreDrama                       7.31808    1.99973   3.660 0.000275 ***
## genreHorror                     -5.06471    3.37471  -1.501 0.133945    
## genreMusical & Performing Arts  19.81030    4.28746   4.621 4.70e-06 ***
## genreMystery & Suspense         -1.49398    2.51843  -0.593 0.553260    
## genreOther                       6.42008    3.84429   1.670 0.095442 .  
## genreScience Fiction & Fantasy  -3.75378    4.90743  -0.765 0.444626    
## mpaa_ratingNC-17                -9.53541   13.55221  -0.704 0.481955    
## mpaa_ratingPG                   -3.32055    3.84411  -0.864 0.388046    
## mpaa_ratingPG-13                -4.90387    3.97418  -1.234 0.217718    
## mpaa_ratingR                    -2.28252    3.83597  -0.595 0.552051    
## mpaa_ratingUnrated               1.63296    4.57820   0.357 0.721457    
## thtr_rel_year                   -0.24405    0.05720  -4.266 2.31e-05 ***
## thtr_rel_month                  -0.18359    0.15397  -1.192 0.233590    
## imdb_num_votes                   4.00816    0.41344   9.695  < 2e-16 ***
## critics_score                    0.36930    0.02281  16.188  < 2e-16 ***
## best_pic_nomyes                  6.21227    3.07838   2.018 0.044039 *  
## best_actor_winyes               -1.84546    1.54918  -1.191 0.234032    
## best_actress_winyes             -2.75595    1.72553  -1.597 0.110764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 592 degrees of freedom
## Multiple R-squared:  0.6001, Adjusted R-squared:  0.5852 
## F-statistic: 40.38 on 22 and 592 DF,  p-value: < 2.2e-16

Our adjusted R-Squared has decreased to 58.52. So our lm_movies6 is our final model.

movies_final <- movies_clean %>%
        select(-runtime, -thtr_rel_day, -top200_box, -title_type,
                 -best_dir_win, -best_pic_win)

lm_final <- lm(audience_score ~ ., movies_final)

lm_final
## 
## Call:
## lm(formula = audience_score ~ ., data = movies_final)
## 
## Coefficients:
##                    (Intercept)                  genreAnimation  
##                       487.6040                          6.7745  
## genreArt House & International                     genreComedy  
##                        16.4171                          3.1462  
##               genreDocumentary                      genreDrama  
##                        23.4151                          7.3181  
##                    genreHorror  genreMusical & Performing Arts  
##                        -5.0647                         19.8103  
##        genreMystery & Suspense                      genreOther  
##                        -1.4940                          6.4201  
## genreScience Fiction & Fantasy                mpaa_ratingNC-17  
##                        -3.7538                         -9.5354  
##                  mpaa_ratingPG                mpaa_ratingPG-13  
##                        -3.3206                         -4.9039  
##                   mpaa_ratingR              mpaa_ratingUnrated  
##                        -2.2825                          1.6330  
##                  thtr_rel_year                  thtr_rel_month  
##                        -0.2440                         -0.1836  
##                 imdb_num_votes                   critics_score  
##                         4.0082                          0.3693  
##                best_pic_nomyes               best_actor_winyes  
##                         6.2123                         -1.8455  
##            best_actress_winyes  
##                        -2.7559

4.2 Model Testing

par(mfrow = c(2,2))

plot(lm_final$residuals, xlab = "", ylab = "Model Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)

hist(lm_final$residuals, ylab = "", xlab = "Model Residuals", main = "")
abline(v = 0, lty = 2, lwd = 2, col = "red")

plot(lm_final$residuals, movies_final$audience_score, xlab = "Residuals", 
     ylab = "Audience Score")

plot(lm_final$fitted.value, lm_final$residuals, xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)

Figure 4.1 Model Residuals plotted for model fit.

Some observations from these charts.

  • The residuals do not appear to exhibit auto-correlation.

  • The residuals appear to be normally distributed.

  • The residuals appear to have great dispersion around the normality line, but still appear normal.

  • The residuals do seem to exhibit heteroskadasticity, as the variance of the residuals tends to decrease as the fitted values increases.

We will note this along with our results.


Part 5: Prediction

We will now use of newly constructed model to test a random movie.

set.seed(4242)

test.index <- sample(nrow(movies_final), 1)
test<- movies_final[test.index, ]

predict(lm_final, newdata = test, interval = "prediction", level = 0.95)
##        fit      lwr      upr
## 1 62.63984 36.84401 88.43567
test$audience_score
## [1] 64

Our prediction falls within the 95% confidence interval, albeit a wide interval.


Part 6: Conclusion

We have made some interesting finding to explain audience scoring on Rotten Tomatoes. Our best model used the following predictors to explain nearly over 60% of the variability. Genre, MPAA Rating, Theatre release year and month, number of IMDB votes, critics score, best picture nominee, best actor winner and best actress winner.