Modeling and prediction for movies

Part 0: Setup
- 0.1 Load packages
- 0.2 Load data
Part 1: Data
- 1.1 Missing and Duplicates Values
Part 2: Research question
Part 3: Exploratory data analysis
Part 4: Modeling
- 4.1 Backward Selection
- 4.2 Model Testing
Part 5: Prediction
Part 6: Conclusion

Part 0: Setup

This analysis is designed to learn what attributes make a movie popular. The data comes from Rotten Tomatoes and IMDB as a random sample of movies.

0.1 Load packages

library(ggplot2); library(GGally)
library(tidyverse); library(kableExtra)
library(statsr); library(ggridges); library(ggpubr)

0.2 Load data

load("movies.Rdata")

head(movies) %>%
        kable() %>%
        kable_styling("striped") %>%
        scroll_box(width = "100%")

title	title_type	genre	runtime	mpaa_rating	studio	thtr_rel_year	thtr_rel_month	thtr_rel_day	dvd_rel_year	dvd_rel_month	dvd_rel_day	imdb_rating	imdb_num_votes	critics_rating	critics_score	audience_rating	audience_score	best_pic_nom	best_pic_win	best_actor_win	best_actress_win	best_dir_win	top200_box	director	actor1	actor2	actor3	actor4	actor5	imdb_url	rt_url
Filly Brown	Feature Film	Drama	80	R	Indomina Media Inc.	2013	4	19	2013	7	30	5.5	899	Rotten	45	Upright	73	no	no	no	no	no	no	Michael D. Olmos	Gina Rodriguez	Jenni Rivera	Lou Diamond Phillips	Emilio Rivera	Joseph Julian Soria	http://www.imdb.com/title/tt1869425/	//www.rottentomatoes.com/m/filly_brown_2012/
The Dish	Feature Film	Drama	101	PG-13	Warner Bros. Pictures	2001	3	14	2001	8	28	7.3	12285	Certified Fresh	96	Upright	81	no	no	no	no	no	no	Rob Sitch	Sam Neill	Kevin Harrington	Patrick Warburton	Tom Long	Genevieve Mooy	http://www.imdb.com/title/tt0205873/	//www.rottentomatoes.com/m/dish/
Waiting for Guffman	Feature Film	Comedy	84	R	Sony Pictures Classics	1996	8	21	2001	8	21	7.6	22381	Certified Fresh	91	Upright	91	no	no	no	no	no	no	Christopher Guest	Christopher Guest	Catherine O’Hara	Parker Posey	Eugene Levy	Bob Balaban	http://www.imdb.com/title/tt0118111/	//www.rottentomatoes.com/m/waiting_for_guffman/
The Age of Innocence	Feature Film	Drama	139	PG	Columbia Pictures	1993	10	1	2001	11	6	7.2	35096	Certified Fresh	80	Upright	76	no	no	yes	no	yes	no	Martin Scorsese	Daniel Day-Lewis	Michelle Pfeiffer	Winona Ryder	Richard E. Grant	Alec McCowen	http://www.imdb.com/title/tt0106226/	//www.rottentomatoes.com/m/age_of_innocence/
Malevolence	Feature Film	Horror	90	R	Anchor Bay Entertainment	2004	9	10	2005	4	19	5.1	2386	Rotten	33	Spilled	27	no	no	no	no	no	no	Stevan Mena	Samantha Dark	R. Brandon Johnson	Brandon Johnson	Heather Magee	Richard Glover	http://www.imdb.com/title/tt0388230/	//www.rottentomatoes.com/m/10004684-malevolence/
Old Partner	Documentary	Documentary	78	Unrated	Shcalo Media Group	2009	1	15	2010	4	20	7.8	333	Fresh	91	Upright	86	no	no	no	no	no	no	Chung-ryoul Lee	Choi Won-kyun	Lee Sam-soon	Moo	NA	NA	http://www.imdb.com/title/tt1334549/	//www.rottentomatoes.com/m/old-partner/

Part 1: Data

The dataset is comprised of 651 randomly sampled movies produced and released before 2016.

Some of these variables are only there for informational purposes and may be removed during the analysis process.

Generalizability - Although, the exact sampling method was not described, random smapling was used to arrive at our dataset. Assuming a true random sample, our results may be generalized to the population of movies.
Causality - This is an observational study, so causality cannot be determined by the analysis.
An additional observation is that imdb_rating, audience_score and audience_rating may not be independent variables. The same individual may have voted on each site, for this reason we will only use 1 of these variables.

1.1 Missing and Duplicates Values

Before we move on to our analysis, let us confirm that we do not have any duplicate entries and/or missing values.

Let’s look for missing values first.

sapply(movies, function(x) sum(is.na(x)))

           title       title_type            genre          runtime 
               0                0                0                1 
     mpaa_rating           studio    thtr_rel_year   thtr_rel_month 
               0                8                0                0 
    thtr_rel_day     dvd_rel_year    dvd_rel_month      dvd_rel_day 
               0                8                8                8 
     imdb_rating   imdb_num_votes   critics_rating    critics_score 
               0                0                0                0 
 audience_rating   audience_score     best_pic_nom     best_pic_win 
               0                0                0                0 
  best_actor_win best_actress_win     best_dir_win       top200_box 
               0                0                0                0 
        director           actor1           actor2           actor3 
               2                2                7                9 
          actor4           actor5         imdb_url           rt_url 
              13               15                0                0

There are a small amount of NA’s, so for this analysis. Let’s just remove these observations.

movies <- movies %>%
        drop_na()

dim(movies)

[1] 619  32

After removing the NA’s, we now have 619 observations.

Let’s now take a look at duplicate movie titles.

movies %>%
        group_by(title) %>%
        tally() %>%
        filter(n > 1)

# A tibble: 4 x 2
  title                  n
  <chr>              <int>
1 Hairspray              2
2 Man on Wire            2
3 Maniac                 2
4 Where the Heart Is     2

We will remove the second entry of each of these movies.

movies <- movies[!duplicated(movies$title),]

dim(movies)

## [1] 615  32

Now, let’s turn to the variables. A few of these variables are strictly informative. We can remove title, studio, dvd_rel_xxx, actor1:5, imdb_url and rt_url.

movies_clean <- movies %>%
        select(title_type, genre, mpaa_rating, runtime, 
               thtr_rel_year, thtr_rel_month,
               thtr_rel_day, imdb_num_votes, critics_rating, 
               critics_score, best_pic_nom, best_pic_win,
               best_actor_win, best_actress_win, best_dir_win,
               top200_box, audience_score)

dim(movies_clean)

## [1] 615  17

Now we have 615 observations and 16 variables from which we can do our analysis.

Part 2: Research question

Can we identify the various factors that produce a high Rotten Tomatoes rating?

According to Wikipedia, major hollywood studios have come to see Rotten Tomatoes as a threat to their marketing, as scores have influenced box office sales. As a result, understanding what drives these ratings is paramount to the financial success of the film.

We will utilize the dataset to develop a parsimonious model to predict a Rotten Tomatoes rating, as measued by audience_score. But, since there may be lack of independence in the imdb_rating variable and it is likely valuing the same response, we will not include imdb_rating in the modeling.

Part 3: Exploratory data analysis

3.1 Collinearity

One set of variables that may be of some concern are critics_rating and critics_score.

Let’s take a look at these variables

ggplot(movies_clean, aes(x = critics_rating, y = critics_score, 
                         fill = critics_rating)) +
        geom_boxplot() +
        geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
        labs(x = "Critics Rating", y ="Critics Score",
             title = "Critics Score vs Critics Ratings")

Figure 3.1 Boxplots of Critics Score split by Critics Ratings

From this plot, it does not appear if Critics Rating adds any additional value to Critics Score, and adds collinearity to the model, so we will remove that variable.

movies_clean <- movies_clean %>%
        select(-critics_rating)

Next, we will begin to look at our variables for explanatory value of audience_score.

3.2 Numeric predictors

Let’s start with the numeric data.

movies_clean %>%
        select(runtime, thtr_rel_year, thtr_rel_month, thtr_rel_day, 
               imdb_num_votes, critics_score, audience_score) %>%
        ggpairs()

Figure 3.2 Pairs plot for numeric predictors

Looking down the diagonal of the distribution of the variables, most are skewed, but not enough for concern is our analysis. The one variable that will need to be transformed is imdb_num_votes. We will look at the log transformation.

par(mfrow = c(1, 2))

hist(movies_clean$imdb_num_votes, main = "", xlab = "Num of IMDB Votes", 
     col = "lightblue")
hist(log(movies_clean$imdb_num_votes), main = "", xlab = "Log(Num of IMDB Votes)", 
     col = "lightblue")

Figure 3.3 Histograms for transformation of IMDB Votes

That looks much better. We will transform that variable in our modeling.

movies_clean$imdb_num_votes <- log(movies_clean$imdb_num_votes)

3.3 Categorical Variables

Next we will at categorical variables

g1 <- ggplot(movies_clean, aes(x = audience_score, y = title_type, 
                               fill = title_type)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("Title Type")

g2 <- ggplot(movies_clean, aes(x = audience_score, y = genre, fill = genre)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("Genre")

g3 <- ggplot(movies_clean, aes(x = audience_score, y = mpaa_rating, 
                               fill = mpaa_rating)) +
        geom_density_ridges() +
        theme_ridges() +
        theme(legend.position = "none") +
        xlab("Audience Score") + ylab("MPAA Rating")

ggarrange(g1, g2, g3, nrow = 2)

## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Figure 3.4 Ridgeline for categorical variables

Some interesting information here to dive into with another project. But for now we will keep these variables in the model.

3.4 Bernoulli Variables

Lastly, we need to took at the Bernoulli variables

movies_bern <- movies_clean %>%
        select(best_pic_nom, best_pic_win, best_actor_win, best_actress_win,
               best_dir_win, top200_box)

summary(movies_bern)

 best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
 no :593      no :608      no :524        no :545          no :572     
 yes: 22      yes:  7      yes: 91        yes: 70          yes: 43     
 top200_box
 no :600   
 yes: 15

These won’t likely add much value to the model, since the responses are very imbalanced, but again we will keep them in for now.

Before we begin to develop the model, let’s take a quick look at each predictor’s adjusted R-squared individually vs the response variable.

response <- "audience_score"

predictors <- names(movies_clean)
predictors <- predictors[predictors != response]

len <- length(predictors)

adj_r2_all <- rep(NA, len)

for (i in 1:len){
        
        lm_fit <- lm(as.formula(paste0(response,"~",predictors[i])), movies_clean)
        
        adj_r2_i <- summary(lm_fit)$adj.r.squared
        
        adj_r2_all[i] <- adj_r2_i
}

names(adj_r2_all) <- predictors

as.data.frame(adj_r2_all) %>%
        kable() %>%
        kable_styling(full_width = FALSE)

	adj_r2_all
title_type	0.0818793
genre	0.1732798
mpaa_rating	0.0571973
runtime	0.0401893
thtr_rel_year	0.0038180
thtr_rel_month	0.0014380
thtr_rel_day	-0.0011344
imdb_num_votes	0.0445858
critics_score	0.4918509
best_pic_nom	0.0474277
best_pic_win	0.0128100
best_actor_win	-0.0012782
best_actress_win	-0.0010759
best_dir_win	0.0084121
top200_box	0.0078158

From this initial analysis it appears that critics_score does the best job at explaining the response variable with an Adj R^2 of 49.18, followed by Genre at 17.32.

With this information let’s begin to build the model.

Part 4: Modeling

Because of the number of variables, we will use backward selection to find a parsimonious model. We will remove the variable with the highest p-value, then compare adjusted r-squared of the model with p predictors vs the model with p-1 predictors. We will continue to do that until the simpler model does not produce a higher adjusted r-squared.

lm_movies <- lm(audience_score ~ ., data = movies_clean)
summary(lm_movies)

## 
## Call:
## lm(formula = audience_score ~ ., data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.882  -7.986   0.000   8.447  46.180 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    516.357153 118.231152   4.367 1.49e-05 ***
## title_typeFeature Film          -3.721121   5.366800  -0.693 0.488359    
## title_typeTV Movie              -2.220960   8.445028  -0.263 0.792651    
## genreAnimation                   6.721536   5.421468   1.240 0.215546    
## genreArt House & International  16.645804   4.332610   3.842 0.000135 ***
## genreComedy                      3.181519   2.258322   1.409 0.159426    
## genreDocumentary                20.518261   5.539044   3.704 0.000232 ***
## genreDrama                       7.407934   2.026967   3.655 0.000281 ***
## genreHorror                     -5.021897   3.409310  -1.473 0.141290    
## genreMusical & Performing Arts  18.869712   4.563141   4.135 4.07e-05 ***
## genreMystery & Suspense         -1.389018   2.539724  -0.547 0.584644    
## genreOther                       6.205597   3.875802   1.601 0.109892    
## genreScience Fiction & Fantasy  -3.595153   4.939821  -0.728 0.467033    
## mpaa_ratingNC-17                -9.721827  13.647115  -0.712 0.476518    
## mpaa_ratingPG                   -3.241724   3.880884  -0.835 0.403887    
## mpaa_ratingPG-13                -4.932626   4.044329  -1.220 0.223093    
## mpaa_ratingR                    -2.245532   3.896211  -0.576 0.564609    
## mpaa_ratingUnrated               1.445843   4.692828   0.308 0.758119    
## runtime                          0.001233   0.033947   0.036 0.971039    
## thtr_rel_year                   -0.257406   0.059555  -4.322 1.82e-05 ***
## thtr_rel_month                  -0.181163   0.159310  -1.137 0.255932    
## thtr_rel_day                    -0.006362   0.061142  -0.104 0.917165    
## imdb_num_votes                   4.180679   0.443269   9.431  < 2e-16 ***
## critics_score                    0.368057   0.023505  15.659  < 2e-16 ***
## best_pic_nomyes                  7.684932   3.446506   2.230 0.026141 *  
## best_pic_winyes                 -5.169908   5.989255  -0.863 0.388383    
## best_actor_winyes               -1.872669   1.581581  -1.184 0.236875    
## best_actress_winyes             -2.602515   1.748720  -1.488 0.137226    
## best_dir_winyes                 -1.757968   2.276946  -0.772 0.440383    
## top200_boxyes                   -1.283028   3.625040  -0.354 0.723515    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 585 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5821 
## F-statistic: 30.49 on 29 and 585 DF,  p-value: < 2.2e-16

4.1 Backward Selection

Our initial model has an Adjusted R-Squared of 58.21. And the predictor with the highest p-score is runtime, so we will movie that from the model in our next iteration.

lm_movies2 <- lm(audience_score ~ . - runtime, data = movies_clean)
summary(lm_movies2)

## 
## Call:
## lm(formula = audience_score ~ . - runtime, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.858  -7.992   0.000   8.439  46.204 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    517.256651 115.509686   4.478 9.06e-06 ***
## title_typeFeature Film          -3.727526   5.359329  -0.696 0.487005    
## title_typeTV Movie              -2.227238   8.436061  -0.264 0.791862    
## genreAnimation                   6.714616   5.413499   1.240 0.215344    
## genreArt House & International  16.648771   4.328147   3.847 0.000133 ***
## genreComedy                      3.173567   2.245767   1.413 0.158148    
## genreDocumentary                20.513790   5.532955   3.708 0.000229 ***
## genreDrama                       7.414552   2.017039   3.676 0.000259 ***
## genreHorror                     -5.032853   3.393044  -1.483 0.138536    
## genreMusical & Performing Arts  18.881993   4.546716   4.153 3.77e-05 ***
## genreMystery & Suspense         -1.384074   2.533911  -0.546 0.585122    
## genreOther                       6.212412   3.867957   1.606 0.108786    
## genreScience Fiction & Fantasy  -3.597255   4.935271  -0.729 0.466362    
## mpaa_ratingNC-17                -9.747366  13.617370  -0.716 0.474398    
## mpaa_ratingPG                   -3.234766   3.872849  -0.835 0.403922    
## mpaa_ratingPG-13                -4.914878   4.011276  -1.225 0.220968    
## mpaa_ratingR                    -2.234294   3.880595  -0.576 0.564998    
## mpaa_ratingUnrated               1.469128   4.644864   0.316 0.751894    
## thtr_rel_year                   -0.257816   0.058428  -4.413 1.22e-05 ***
## thtr_rel_month                  -0.180093   0.156427  -1.151 0.250084    
## thtr_rel_day                    -0.006406   0.061077  -0.105 0.916503    
## imdb_num_votes                   4.183798   0.434496   9.629  < 2e-16 ***
## critics_score                    0.368076   0.023479  15.677  < 2e-16 ***
## best_pic_nomyes                  7.694093   3.434335   2.240 0.025443 *  
## best_pic_winyes                 -5.159310   5.977042  -0.863 0.388387    
## best_actor_winyes               -1.863518   1.560054  -1.195 0.232757    
## best_actress_winyes             -2.596356   1.738995  -1.493 0.135970    
## best_dir_winyes                 -1.747313   2.256043  -0.775 0.438946    
## top200_boxyes                   -1.276198   3.617073  -0.353 0.724346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 586 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5828 
## F-statistic: 31.63 on 28 and 586 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased to 58.28. So now we will remove thtr_rel_day.

lm_movies3 <- lm(audience_score ~ . -runtime -thtr_rel_day, data = movies_clean)
summary(lm_movies3)

## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.887  -7.998  -0.030   8.422  46.249 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    518.22679  115.04172   4.505 8.02e-06 ***
## title_typeFeature Film          -3.72193    5.35455  -0.695 0.487270    
## title_typeTV Movie              -2.24708    8.42683  -0.267 0.789826    
## genreAnimation                   6.71964    5.40873   1.242 0.214596    
## genreArt House & International  16.61461    4.31224   3.853 0.000130 ***
## genreComedy                      3.17911    2.24325   1.417 0.156958    
## genreDocumentary                20.50894    5.52810   3.710 0.000227 ***
## genreDrama                       7.41003    2.01488   3.678 0.000257 ***
## genreHorror                     -5.05159    3.38548  -1.492 0.136201    
## genreMusical & Performing Arts  18.87324    4.54212   4.155 3.73e-05 ***
## genreMystery & Suspense         -1.39174    2.53072  -0.550 0.582571    
## genreOther                       6.22489    3.86287   1.611 0.107616    
## genreScience Fiction & Fantasy  -3.56719    4.92279  -0.725 0.468969    
## mpaa_ratingNC-17                -9.80757   13.59380  -0.721 0.470905    
## mpaa_ratingPG                   -3.23871    3.86940  -0.837 0.402930    
## mpaa_ratingPG-13                -4.92505    4.00673  -1.229 0.219491    
## mpaa_ratingR                    -2.23008    3.87712  -0.575 0.565383    
## mpaa_ratingUnrated               1.48580    4.63823   0.320 0.748827    
## thtr_rel_year                   -0.25833    0.05817  -4.441 1.07e-05 ***
## thtr_rel_month                  -0.18186    0.15539  -1.170 0.242335    
## imdb_num_votes                   4.18173    0.43368   9.642  < 2e-16 ***
## critics_score                    0.36800    0.02345  15.694  < 2e-16 ***
## best_pic_nomyes                  7.70645    3.42942   2.247 0.025000 *  
## best_pic_winyes                 -5.18267    5.96786  -0.868 0.385513    
## best_actor_winyes               -1.86577    1.55859  -1.197 0.231755    
## best_actress_winyes             -2.60077    1.73702  -1.497 0.134862    
## best_dir_winyes                 -1.74107    2.25336  -0.773 0.440036    
## top200_boxyes                   -1.27214    3.61382  -0.352 0.724950    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.03 on 587 degrees of freedom
## Multiple R-squared:  0.6018, Adjusted R-squared:  0.5835 
## F-statistic: 32.86 on 27 and 587 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.35. So now we will remove top200_box.

lm_movies4 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box, 
                 data = movies_clean)
summary(lm_movies4)

## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box, 
##     data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.781  -7.942  -0.021   8.410  46.270 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    515.19390  114.63315   4.494 8.41e-06 ***
## title_typeFeature Film          -3.70260    5.35027  -0.692 0.489185    
## title_typeTV Movie              -2.24919    8.42055  -0.267 0.789480    
## genreAnimation                   6.89758    5.38104   1.282 0.200408    
## genreArt House & International  16.65523    4.30748   3.867 0.000123 ***
## genreComedy                      3.23497    2.23597   1.447 0.148489    
## genreDocumentary                20.55875    5.52217   3.723 0.000216 ***
## genreDrama                       7.45831    2.00871   3.713 0.000224 ***
## genreHorror                     -5.00468    3.38033  -1.481 0.139268    
## genreMusical & Performing Arts  18.93587    4.53525   4.175 3.43e-05 ***
## genreMystery & Suspense         -1.33745    2.52414  -0.530 0.596407    
## genreOther                       6.24109    3.85972   1.617 0.106419    
## genreScience Fiction & Fantasy  -3.62820    4.91607  -0.738 0.460791    
## mpaa_ratingNC-17                -9.59950   13.57082  -0.707 0.479621    
## mpaa_ratingPG                   -3.13005    3.85419  -0.812 0.417054    
## mpaa_ratingPG-13                -4.78934    3.98517  -1.202 0.229928    
## mpaa_ratingR                    -2.06900    3.84715  -0.538 0.590918    
## mpaa_ratingUnrated               1.61136    4.62105   0.349 0.727439    
## thtr_rel_year                   -0.25677    0.05796  -4.430 1.12e-05 ***
## thtr_rel_month                  -0.18518    0.15499  -1.195 0.232641    
## imdb_num_votes                   4.15412    0.42621   9.747  < 2e-16 ***
## critics_score                    0.36756    0.02340  15.710  < 2e-16 ***
## best_pic_nomyes                  7.72745    3.42635   2.255 0.024481 *  
## best_pic_winyes                 -5.25914    5.95946  -0.882 0.377874    
## best_actor_winyes               -1.87258    1.55731  -1.202 0.229676    
## best_actress_winyes             -2.63601    1.73284  -1.521 0.128745    
## best_dir_winyes                 -1.71247    2.25021  -0.761 0.446946    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.02 on 588 degrees of freedom
## Multiple R-squared:  0.6017, Adjusted R-squared:  0.5841 
## F-statistic: 34.17 on 26 and 588 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.41. So now we will remove title_type

lm_movies5 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type, 
                 data = movies_clean)
summary(lm_movies5)

## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.760  -7.897   0.033   8.459  46.289 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    505.09575  113.60617   4.446 1.05e-05 ***
## genreAnimation                   6.82466    5.37330   1.270 0.204548    
## genreArt House & International  16.40948    4.28641   3.828 0.000143 ***
## genreComedy                      3.22658    2.23304   1.445 0.149009    
## genreDocumentary                23.49745    3.39065   6.930 1.11e-11 ***
## genreDrama                       7.36216    2.00024   3.681 0.000254 ***
## genreHorror                     -5.05539    3.37503  -1.498 0.134700    
## genreMusical & Performing Arts  19.91917    4.28851   4.645 4.20e-06 ***
## genreMystery & Suspense         -1.39473    2.51965  -0.554 0.580102    
## genreOther                       6.12753    3.85151   1.591 0.112158    
## genreScience Fiction & Fantasy  -3.67516    4.90942  -0.749 0.454401    
## mpaa_ratingNC-17                -9.63504   13.55363  -0.711 0.477438    
## mpaa_ratingPG                   -3.10732    3.84921  -0.807 0.419841    
## mpaa_ratingPG-13                -4.76483    3.97988  -1.197 0.231698    
## mpaa_ratingR                    -2.07359    3.84218  -0.540 0.589614    
## mpaa_ratingUnrated               1.89132    4.58307   0.413 0.679995    
## thtr_rel_year                   -0.25335    0.05769  -4.392 1.33e-05 ***
## thtr_rel_month                  -0.18502    0.15428  -1.199 0.230916    
## imdb_num_votes                   4.09983    0.41891   9.787  < 2e-16 ***
## critics_score                    0.37086    0.02290  16.197  < 2e-16 ***
## best_pic_nomyes                  7.72007    3.42200   2.256 0.024435 *  
## best_pic_winyes                 -5.19534    5.95119  -0.873 0.383023    
## best_actor_winyes               -1.88306    1.55478  -1.211 0.226324    
## best_actress_winyes             -2.61592    1.72921  -1.513 0.130870    
## best_dir_winyes                 -1.74554    2.24690  -0.777 0.437549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 590 degrees of freedom
## Multiple R-squared:  0.6014, Adjusted R-squared:  0.5851 
## F-statistic: 37.08 on 24 and 590 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.51. So now we will remove best_dir_win.

lm_movies6 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
                 -best_dir_win, data = movies_clean)
summary(lm_movies6)

## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type - best_dir_win, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.330  -7.782   0.167   8.460  46.302 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    495.61246  112.91053   4.389 1.35e-05 ***
## genreAnimation                   6.81604    5.37149   1.269 0.204965    
## genreArt House & International  16.47054    4.28425   3.844 0.000134 ***
## genreComedy                      3.24651    2.23215   1.454 0.146356    
## genreDocumentary                23.53253    3.38921   6.943 1.01e-11 ***
## genreDrama                       7.36114    1.99957   3.681 0.000253 ***
## genreHorror                     -5.04590    3.37388  -1.496 0.135298    
## genreMusical & Performing Arts  19.88402    4.28683   4.638 4.33e-06 ***
## genreMystery & Suspense         -1.42361    2.51853  -0.565 0.572115    
## genreOther                       6.16107    3.84997   1.600 0.110069    
## genreScience Fiction & Fantasy  -3.77220    4.90619  -0.769 0.442280    
## mpaa_ratingNC-17                -9.55803   13.54872  -0.705 0.480804    
## mpaa_ratingPG                   -3.24834    3.84364  -0.845 0.398385    
## mpaa_ratingPG-13                -4.92508    3.97320  -1.240 0.215625    
## mpaa_ratingR                    -2.23620    3.83519  -0.583 0.560067    
## mpaa_ratingUnrated               1.75819    4.57833   0.384 0.701098    
## thtr_rel_year                   -0.24834    0.05731  -4.333 1.73e-05 ***
## thtr_rel_month                  -0.19071    0.15405  -1.238 0.216231    
## imdb_num_votes                   4.06381    0.41619   9.764  < 2e-16 ***
## critics_score                    0.36936    0.02281  16.194  < 2e-16 ***
## best_pic_nomyes                  7.89884    3.41311   2.314 0.020995 *  
## best_pic_winyes                 -6.51594    5.70134  -1.143 0.253552    
## best_actor_winyes               -1.95312    1.55164  -1.259 0.208619    
## best_actress_winyes             -2.63169    1.72851  -1.523 0.128413    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 591 degrees of freedom
## Multiple R-squared:  0.601,  Adjusted R-squared:  0.5854 
## F-statistic:  38.7 on 23 and 591 DF,  p-value: < 2.2e-16

Our adjusted R-Squared increased again to 58.54. So now we will remove best_pic_win.

lm_movies7 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
                 -best_dir_win - best_pic_win, data = movies_clean)
summary(lm_movies7)

## 
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box - 
##     title_type - best_dir_win - best_pic_win, data = movies_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.302  -7.834   0.113   8.469  46.213 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    487.60398  112.72202   4.326 1.78e-05 ***
## genreAnimation                   6.77447    5.37275   1.261 0.207844    
## genreArt House & International  16.41711    4.28510   3.831 0.000141 ***
## genreComedy                      3.14624    2.23100   1.410 0.158995    
## genreDocumentary                23.41513    3.38853   6.910 1.26e-11 ***
## genreDrama                       7.31808    1.99973   3.660 0.000275 ***
## genreHorror                     -5.06471    3.37471  -1.501 0.133945    
## genreMusical & Performing Arts  19.81030    4.28746   4.621 4.70e-06 ***
## genreMystery & Suspense         -1.49398    2.51843  -0.593 0.553260    
## genreOther                       6.42008    3.84429   1.670 0.095442 .  
## genreScience Fiction & Fantasy  -3.75378    4.90743  -0.765 0.444626    
## mpaa_ratingNC-17                -9.53541   13.55221  -0.704 0.481955    
## mpaa_ratingPG                   -3.32055    3.84411  -0.864 0.388046    
## mpaa_ratingPG-13                -4.90387    3.97418  -1.234 0.217718    
## mpaa_ratingR                    -2.28252    3.83597  -0.595 0.552051    
## mpaa_ratingUnrated               1.63296    4.57820   0.357 0.721457    
## thtr_rel_year                   -0.24405    0.05720  -4.266 2.31e-05 ***
## thtr_rel_month                  -0.18359    0.15397  -1.192 0.233590    
## imdb_num_votes                   4.00816    0.41344   9.695  < 2e-16 ***
## critics_score                    0.36930    0.02281  16.188  < 2e-16 ***
## best_pic_nomyes                  6.21227    3.07838   2.018 0.044039 *  
## best_actor_winyes               -1.84546    1.54918  -1.191 0.234032    
## best_actress_winyes             -2.75595    1.72553  -1.597 0.110764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 592 degrees of freedom
## Multiple R-squared:  0.6001, Adjusted R-squared:  0.5852 
## F-statistic: 40.38 on 22 and 592 DF,  p-value: < 2.2e-16

Our adjusted R-Squared has decreased to 58.52. So our lm_movies6 is our final model.

movies_final <- movies_clean %>%
        select(-runtime, -thtr_rel_day, -top200_box, -title_type,
                 -best_dir_win, -best_pic_win)

lm_final <- lm(audience_score ~ ., movies_final)

lm_final

## 
## Call:
## lm(formula = audience_score ~ ., data = movies_final)
## 
## Coefficients:
##                    (Intercept)                  genreAnimation  
##                       487.6040                          6.7745  
## genreArt House & International                     genreComedy  
##                        16.4171                          3.1462  
##               genreDocumentary                      genreDrama  
##                        23.4151                          7.3181  
##                    genreHorror  genreMusical & Performing Arts  
##                        -5.0647                         19.8103  
##        genreMystery & Suspense                      genreOther  
##                        -1.4940                          6.4201  
## genreScience Fiction & Fantasy                mpaa_ratingNC-17  
##                        -3.7538                         -9.5354  
##                  mpaa_ratingPG                mpaa_ratingPG-13  
##                        -3.3206                         -4.9039  
##                   mpaa_ratingR              mpaa_ratingUnrated  
##                        -2.2825                          1.6330  
##                  thtr_rel_year                  thtr_rel_month  
##                        -0.2440                         -0.1836  
##                 imdb_num_votes                   critics_score  
##                         4.0082                          0.3693  
##                best_pic_nomyes               best_actor_winyes  
##                         6.2123                         -1.8455  
##            best_actress_winyes  
##                        -2.7559

4.2 Model Testing

par(mfrow = c(2,2))

plot(lm_final$residuals, xlab = "", ylab = "Model Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)

hist(lm_final$residuals, ylab = "", xlab = "Model Residuals", main = "")
abline(v = 0, lty = 2, lwd = 2, col = "red")

plot(lm_final$residuals, movies_final$audience_score, xlab = "Residuals", 
     ylab = "Audience Score")

plot(lm_final$fitted.value, lm_final$residuals, xlab = "Fitted Values",
     ylab = "Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)

Figure 4.1 Model Residuals plotted for model fit.

Some observations from these charts.

The residuals do not appear to exhibit auto-correlation.
The residuals appear to be normally distributed.
The residuals appear to have great dispersion around the normality line, but still appear normal.
The residuals do seem to exhibit heteroskadasticity, as the variance of the residuals tends to decrease as the fitted values increases.

We will note this along with our results.

Part 5: Prediction

We will now use of newly constructed model to test a random movie.

set.seed(4242)

test.index <- sample(nrow(movies_final), 1)
test<- movies_final[test.index, ]

predict(lm_final, newdata = test, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 62.63984 36.84401 88.43567

test$audience_score

## [1] 64

Our prediction falls within the 95% confidence interval, albeit a wide interval.

Part 6: Conclusion

We have made some interesting finding to explain audience scoring on Rotten Tomatoes. Our best model used the following predictors to explain nearly over 60% of the variability. Genre, MPAA Rating, Theatre release year and month, number of IMDB votes, critics score, best picture nominee, best actor winner and best actress winner.