This analysis is designed to learn what attributes make a movie popular. The data comes from Rotten Tomatoes and IMDB as a random sample of movies.
load("movies.Rdata")
head(movies) %>%
kable() %>%
kable_styling("striped") %>%
scroll_box(width = "100%")| title | title_type | genre | runtime | mpaa_rating | studio | thtr_rel_year | thtr_rel_month | thtr_rel_day | dvd_rel_year | dvd_rel_month | dvd_rel_day | imdb_rating | imdb_num_votes | critics_rating | critics_score | audience_rating | audience_score | best_pic_nom | best_pic_win | best_actor_win | best_actress_win | best_dir_win | top200_box | director | actor1 | actor2 | actor3 | actor4 | actor5 | imdb_url | rt_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Filly Brown | Feature Film | Drama | 80 | R | Indomina Media Inc. | 2013 | 4 | 19 | 2013 | 7 | 30 | 5.5 | 899 | Rotten | 45 | Upright | 73 | no | no | no | no | no | no | Michael D. Olmos | Gina Rodriguez | Jenni Rivera | Lou Diamond Phillips | Emilio Rivera | Joseph Julian Soria | http://www.imdb.com/title/tt1869425/ | //www.rottentomatoes.com/m/filly_brown_2012/ |
| The Dish | Feature Film | Drama | 101 | PG-13 | Warner Bros. Pictures | 2001 | 3 | 14 | 2001 | 8 | 28 | 7.3 | 12285 | Certified Fresh | 96 | Upright | 81 | no | no | no | no | no | no | Rob Sitch | Sam Neill | Kevin Harrington | Patrick Warburton | Tom Long | Genevieve Mooy | http://www.imdb.com/title/tt0205873/ | //www.rottentomatoes.com/m/dish/ |
| Waiting for Guffman | Feature Film | Comedy | 84 | R | Sony Pictures Classics | 1996 | 8 | 21 | 2001 | 8 | 21 | 7.6 | 22381 | Certified Fresh | 91 | Upright | 91 | no | no | no | no | no | no | Christopher Guest | Christopher Guest | Catherine O’Hara | Parker Posey | Eugene Levy | Bob Balaban | http://www.imdb.com/title/tt0118111/ | //www.rottentomatoes.com/m/waiting_for_guffman/ |
| The Age of Innocence | Feature Film | Drama | 139 | PG | Columbia Pictures | 1993 | 10 | 1 | 2001 | 11 | 6 | 7.2 | 35096 | Certified Fresh | 80 | Upright | 76 | no | no | yes | no | yes | no | Martin Scorsese | Daniel Day-Lewis | Michelle Pfeiffer | Winona Ryder | Richard E. Grant | Alec McCowen | http://www.imdb.com/title/tt0106226/ | //www.rottentomatoes.com/m/age_of_innocence/ |
| Malevolence | Feature Film | Horror | 90 | R | Anchor Bay Entertainment | 2004 | 9 | 10 | 2005 | 4 | 19 | 5.1 | 2386 | Rotten | 33 | Spilled | 27 | no | no | no | no | no | no | Stevan Mena | Samantha Dark | R. Brandon Johnson | Brandon Johnson | Heather Magee | Richard Glover | http://www.imdb.com/title/tt0388230/ | //www.rottentomatoes.com/m/10004684-malevolence/ |
| Old Partner | Documentary | Documentary | 78 | Unrated | Shcalo Media Group | 2009 | 1 | 15 | 2010 | 4 | 20 | 7.8 | 333 | Fresh | 91 | Upright | 86 | no | no | no | no | no | no | Chung-ryoul Lee | Choi Won-kyun | Lee Sam-soon | Moo | NA | NA | http://www.imdb.com/title/tt1334549/ | //www.rottentomatoes.com/m/old-partner/ |
The dataset is comprised of 651 randomly sampled movies produced and released before 2016.
Some of these variables are only there for informational purposes and may be removed during the analysis process.
Generalizability - Although, the exact sampling method was not described, random smapling was used to arrive at our dataset. Assuming a true random sample, our results may be generalized to the population of movies.
Causality - This is an observational study, so causality cannot be determined by the analysis.
An additional observation is that imdb_rating, audience_score and audience_rating may not be independent variables. The same individual may have voted on each site, for this reason we will only use 1 of these variables.
Before we move on to our analysis, let us confirm that we do not have any duplicate entries and/or missing values.
Let’s look for missing values first.
title title_type genre runtime
0 0 0 1
mpaa_rating studio thtr_rel_year thtr_rel_month
0 8 0 0
thtr_rel_day dvd_rel_year dvd_rel_month dvd_rel_day
0 8 8 8
imdb_rating imdb_num_votes critics_rating critics_score
0 0 0 0
audience_rating audience_score best_pic_nom best_pic_win
0 0 0 0
best_actor_win best_actress_win best_dir_win top200_box
0 0 0 0
director actor1 actor2 actor3
2 2 7 9
actor4 actor5 imdb_url rt_url
13 15 0 0
There are a small amount of NA’s, so for this analysis. Let’s just remove these observations.
[1] 619 32
After removing the NA’s, we now have 619 observations.
# A tibble: 4 x 2
title n
<chr> <int>
1 Hairspray 2
2 Man on Wire 2
3 Maniac 2
4 Where the Heart Is 2
We will remove the second entry of each of these movies.
## [1] 615 32
Now, let’s turn to the variables. A few of these variables are strictly informative. We can remove title, studio, dvd_rel_xxx, actor1:5, imdb_url and rt_url.
movies_clean <- movies %>%
select(title_type, genre, mpaa_rating, runtime,
thtr_rel_year, thtr_rel_month,
thtr_rel_day, imdb_num_votes, critics_rating,
critics_score, best_pic_nom, best_pic_win,
best_actor_win, best_actress_win, best_dir_win,
top200_box, audience_score)
dim(movies_clean)## [1] 615 17
Now we have 615 observations and 16 variables from which we can do our analysis.
Can we identify the various factors that produce a high Rotten Tomatoes rating?
According to Wikipedia, major hollywood studios have come to see Rotten Tomatoes as a threat to their marketing, as scores have influenced box office sales. As a result, understanding what drives these ratings is paramount to the financial success of the film.
We will utilize the dataset to develop a parsimonious model to predict a Rotten Tomatoes rating, as measued by audience_score. But, since there may be lack of independence in the imdb_rating variable and it is likely valuing the same response, we will not include imdb_rating in the modeling.
One set of variables that may be of some concern are critics_rating and critics_score.
Let’s take a look at these variables
ggplot(movies_clean, aes(x = critics_rating, y = critics_score,
fill = critics_rating)) +
geom_boxplot() +
geom_jitter(color = "black", size = 0.4, alpha = 0.9) +
labs(x = "Critics Rating", y ="Critics Score",
title = "Critics Score vs Critics Ratings")Figure 3.1 Boxplots of Critics Score split by Critics Ratings
From this plot, it does not appear if Critics Rating adds any additional value to Critics Score, and adds collinearity to the model, so we will remove that variable.
Next, we will begin to look at our variables for explanatory value of audience_score.
Let’s start with the numeric data.
movies_clean %>%
select(runtime, thtr_rel_year, thtr_rel_month, thtr_rel_day,
imdb_num_votes, critics_score, audience_score) %>%
ggpairs()Figure 3.2 Pairs plot for numeric predictors
Looking down the diagonal of the distribution of the variables, most are skewed, but not enough for concern is our analysis. The one variable that will need to be transformed is imdb_num_votes. We will look at the log transformation.
par(mfrow = c(1, 2))
hist(movies_clean$imdb_num_votes, main = "", xlab = "Num of IMDB Votes",
col = "lightblue")
hist(log(movies_clean$imdb_num_votes), main = "", xlab = "Log(Num of IMDB Votes)",
col = "lightblue")Figure 3.3 Histograms for transformation of IMDB Votes
That looks much better. We will transform that variable in our modeling.
Next we will at categorical variables
g1 <- ggplot(movies_clean, aes(x = audience_score, y = title_type,
fill = title_type)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
xlab("Audience Score") + ylab("Title Type")
g2 <- ggplot(movies_clean, aes(x = audience_score, y = genre, fill = genre)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
xlab("Audience Score") + ylab("Genre")
g3 <- ggplot(movies_clean, aes(x = audience_score, y = mpaa_rating,
fill = mpaa_rating)) +
geom_density_ridges() +
theme_ridges() +
theme(legend.position = "none") +
xlab("Audience Score") + ylab("MPAA Rating")
ggarrange(g1, g2, g3, nrow = 2)## $`1`
##
## $`2`
##
## attr(,"class")
## [1] "list" "ggarrange"
Figure 3.4 Ridgeline for categorical variables
Some interesting information here to dive into with another project. But for now we will keep these variables in the model.
Lastly, we need to took at the Bernoulli variables
movies_bern <- movies_clean %>%
select(best_pic_nom, best_pic_win, best_actor_win, best_actress_win,
best_dir_win, top200_box)
summary(movies_bern) best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
no :593 no :608 no :524 no :545 no :572
yes: 22 yes: 7 yes: 91 yes: 70 yes: 43
top200_box
no :600
yes: 15
These won’t likely add much value to the model, since the responses are very imbalanced, but again we will keep them in for now.
Before we begin to develop the model, let’s take a quick look at each predictor’s adjusted R-squared individually vs the response variable.
response <- "audience_score"
predictors <- names(movies_clean)
predictors <- predictors[predictors != response]
len <- length(predictors)
adj_r2_all <- rep(NA, len)
for (i in 1:len){
lm_fit <- lm(as.formula(paste0(response,"~",predictors[i])), movies_clean)
adj_r2_i <- summary(lm_fit)$adj.r.squared
adj_r2_all[i] <- adj_r2_i
}
names(adj_r2_all) <- predictors
as.data.frame(adj_r2_all) %>%
kable() %>%
kable_styling(full_width = FALSE)| adj_r2_all | |
|---|---|
| title_type | 0.0818793 |
| genre | 0.1732798 |
| mpaa_rating | 0.0571973 |
| runtime | 0.0401893 |
| thtr_rel_year | 0.0038180 |
| thtr_rel_month | 0.0014380 |
| thtr_rel_day | -0.0011344 |
| imdb_num_votes | 0.0445858 |
| critics_score | 0.4918509 |
| best_pic_nom | 0.0474277 |
| best_pic_win | 0.0128100 |
| best_actor_win | -0.0012782 |
| best_actress_win | -0.0010759 |
| best_dir_win | 0.0084121 |
| top200_box | 0.0078158 |
From this initial analysis it appears that critics_score does the best job at explaining the response variable with an Adj R^2 of 49.18, followed by Genre at 17.32.
With this information let’s begin to build the model.
Because of the number of variables, we will use backward selection to find a parsimonious model. We will remove the variable with the highest p-value, then compare adjusted r-squared of the model with p predictors vs the model with p-1 predictors. We will continue to do that until the simpler model does not produce a higher adjusted r-squared.
##
## Call:
## lm(formula = audience_score ~ ., data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.882 -7.986 0.000 8.447 46.180
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 516.357153 118.231152 4.367 1.49e-05 ***
## title_typeFeature Film -3.721121 5.366800 -0.693 0.488359
## title_typeTV Movie -2.220960 8.445028 -0.263 0.792651
## genreAnimation 6.721536 5.421468 1.240 0.215546
## genreArt House & International 16.645804 4.332610 3.842 0.000135 ***
## genreComedy 3.181519 2.258322 1.409 0.159426
## genreDocumentary 20.518261 5.539044 3.704 0.000232 ***
## genreDrama 7.407934 2.026967 3.655 0.000281 ***
## genreHorror -5.021897 3.409310 -1.473 0.141290
## genreMusical & Performing Arts 18.869712 4.563141 4.135 4.07e-05 ***
## genreMystery & Suspense -1.389018 2.539724 -0.547 0.584644
## genreOther 6.205597 3.875802 1.601 0.109892
## genreScience Fiction & Fantasy -3.595153 4.939821 -0.728 0.467033
## mpaa_ratingNC-17 -9.721827 13.647115 -0.712 0.476518
## mpaa_ratingPG -3.241724 3.880884 -0.835 0.403887
## mpaa_ratingPG-13 -4.932626 4.044329 -1.220 0.223093
## mpaa_ratingR -2.245532 3.896211 -0.576 0.564609
## mpaa_ratingUnrated 1.445843 4.692828 0.308 0.758119
## runtime 0.001233 0.033947 0.036 0.971039
## thtr_rel_year -0.257406 0.059555 -4.322 1.82e-05 ***
## thtr_rel_month -0.181163 0.159310 -1.137 0.255932
## thtr_rel_day -0.006362 0.061142 -0.104 0.917165
## imdb_num_votes 4.180679 0.443269 9.431 < 2e-16 ***
## critics_score 0.368057 0.023505 15.659 < 2e-16 ***
## best_pic_nomyes 7.684932 3.446506 2.230 0.026141 *
## best_pic_winyes -5.169908 5.989255 -0.863 0.388383
## best_actor_winyes -1.872669 1.581581 -1.184 0.236875
## best_actress_winyes -2.602515 1.748720 -1.488 0.137226
## best_dir_winyes -1.757968 2.276946 -0.772 0.440383
## top200_boxyes -1.283028 3.625040 -0.354 0.723515
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.05 on 585 degrees of freedom
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5821
## F-statistic: 30.49 on 29 and 585 DF, p-value: < 2.2e-16
Our initial model has an Adjusted R-Squared of 58.21. And the predictor with the highest p-score is runtime, so we will movie that from the model in our next iteration.
##
## Call:
## lm(formula = audience_score ~ . - runtime, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.858 -7.992 0.000 8.439 46.204
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 517.256651 115.509686 4.478 9.06e-06 ***
## title_typeFeature Film -3.727526 5.359329 -0.696 0.487005
## title_typeTV Movie -2.227238 8.436061 -0.264 0.791862
## genreAnimation 6.714616 5.413499 1.240 0.215344
## genreArt House & International 16.648771 4.328147 3.847 0.000133 ***
## genreComedy 3.173567 2.245767 1.413 0.158148
## genreDocumentary 20.513790 5.532955 3.708 0.000229 ***
## genreDrama 7.414552 2.017039 3.676 0.000259 ***
## genreHorror -5.032853 3.393044 -1.483 0.138536
## genreMusical & Performing Arts 18.881993 4.546716 4.153 3.77e-05 ***
## genreMystery & Suspense -1.384074 2.533911 -0.546 0.585122
## genreOther 6.212412 3.867957 1.606 0.108786
## genreScience Fiction & Fantasy -3.597255 4.935271 -0.729 0.466362
## mpaa_ratingNC-17 -9.747366 13.617370 -0.716 0.474398
## mpaa_ratingPG -3.234766 3.872849 -0.835 0.403922
## mpaa_ratingPG-13 -4.914878 4.011276 -1.225 0.220968
## mpaa_ratingR -2.234294 3.880595 -0.576 0.564998
## mpaa_ratingUnrated 1.469128 4.644864 0.316 0.751894
## thtr_rel_year -0.257816 0.058428 -4.413 1.22e-05 ***
## thtr_rel_month -0.180093 0.156427 -1.151 0.250084
## thtr_rel_day -0.006406 0.061077 -0.105 0.916503
## imdb_num_votes 4.183798 0.434496 9.629 < 2e-16 ***
## critics_score 0.368076 0.023479 15.677 < 2e-16 ***
## best_pic_nomyes 7.694093 3.434335 2.240 0.025443 *
## best_pic_winyes -5.159310 5.977042 -0.863 0.388387
## best_actor_winyes -1.863518 1.560054 -1.195 0.232757
## best_actress_winyes -2.596356 1.738995 -1.493 0.135970
## best_dir_winyes -1.747313 2.256043 -0.775 0.438946
## top200_boxyes -1.276198 3.617073 -0.353 0.724346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 586 degrees of freedom
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5828
## F-statistic: 31.63 on 28 and 586 DF, p-value: < 2.2e-16
Our adjusted R-Squared increased to 58.28. So now we will remove thtr_rel_day.
lm_movies3 <- lm(audience_score ~ . -runtime -thtr_rel_day, data = movies_clean)
summary(lm_movies3)##
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.887 -7.998 -0.030 8.422 46.249
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.22679 115.04172 4.505 8.02e-06 ***
## title_typeFeature Film -3.72193 5.35455 -0.695 0.487270
## title_typeTV Movie -2.24708 8.42683 -0.267 0.789826
## genreAnimation 6.71964 5.40873 1.242 0.214596
## genreArt House & International 16.61461 4.31224 3.853 0.000130 ***
## genreComedy 3.17911 2.24325 1.417 0.156958
## genreDocumentary 20.50894 5.52810 3.710 0.000227 ***
## genreDrama 7.41003 2.01488 3.678 0.000257 ***
## genreHorror -5.05159 3.38548 -1.492 0.136201
## genreMusical & Performing Arts 18.87324 4.54212 4.155 3.73e-05 ***
## genreMystery & Suspense -1.39174 2.53072 -0.550 0.582571
## genreOther 6.22489 3.86287 1.611 0.107616
## genreScience Fiction & Fantasy -3.56719 4.92279 -0.725 0.468969
## mpaa_ratingNC-17 -9.80757 13.59380 -0.721 0.470905
## mpaa_ratingPG -3.23871 3.86940 -0.837 0.402930
## mpaa_ratingPG-13 -4.92505 4.00673 -1.229 0.219491
## mpaa_ratingR -2.23008 3.87712 -0.575 0.565383
## mpaa_ratingUnrated 1.48580 4.63823 0.320 0.748827
## thtr_rel_year -0.25833 0.05817 -4.441 1.07e-05 ***
## thtr_rel_month -0.18186 0.15539 -1.170 0.242335
## imdb_num_votes 4.18173 0.43368 9.642 < 2e-16 ***
## critics_score 0.36800 0.02345 15.694 < 2e-16 ***
## best_pic_nomyes 7.70645 3.42942 2.247 0.025000 *
## best_pic_winyes -5.18267 5.96786 -0.868 0.385513
## best_actor_winyes -1.86577 1.55859 -1.197 0.231755
## best_actress_winyes -2.60077 1.73702 -1.497 0.134862
## best_dir_winyes -1.74107 2.25336 -0.773 0.440036
## top200_boxyes -1.27214 3.61382 -0.352 0.724950
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 587 degrees of freedom
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5835
## F-statistic: 32.86 on 27 and 587 DF, p-value: < 2.2e-16
Our adjusted R-Squared increased again to 58.35. So now we will remove top200_box.
lm_movies4 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box,
data = movies_clean)
summary(lm_movies4)##
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box,
## data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.781 -7.942 -0.021 8.410 46.270
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 515.19390 114.63315 4.494 8.41e-06 ***
## title_typeFeature Film -3.70260 5.35027 -0.692 0.489185
## title_typeTV Movie -2.24919 8.42055 -0.267 0.789480
## genreAnimation 6.89758 5.38104 1.282 0.200408
## genreArt House & International 16.65523 4.30748 3.867 0.000123 ***
## genreComedy 3.23497 2.23597 1.447 0.148489
## genreDocumentary 20.55875 5.52217 3.723 0.000216 ***
## genreDrama 7.45831 2.00871 3.713 0.000224 ***
## genreHorror -5.00468 3.38033 -1.481 0.139268
## genreMusical & Performing Arts 18.93587 4.53525 4.175 3.43e-05 ***
## genreMystery & Suspense -1.33745 2.52414 -0.530 0.596407
## genreOther 6.24109 3.85972 1.617 0.106419
## genreScience Fiction & Fantasy -3.62820 4.91607 -0.738 0.460791
## mpaa_ratingNC-17 -9.59950 13.57082 -0.707 0.479621
## mpaa_ratingPG -3.13005 3.85419 -0.812 0.417054
## mpaa_ratingPG-13 -4.78934 3.98517 -1.202 0.229928
## mpaa_ratingR -2.06900 3.84715 -0.538 0.590918
## mpaa_ratingUnrated 1.61136 4.62105 0.349 0.727439
## thtr_rel_year -0.25677 0.05796 -4.430 1.12e-05 ***
## thtr_rel_month -0.18518 0.15499 -1.195 0.232641
## imdb_num_votes 4.15412 0.42621 9.747 < 2e-16 ***
## critics_score 0.36756 0.02340 15.710 < 2e-16 ***
## best_pic_nomyes 7.72745 3.42635 2.255 0.024481 *
## best_pic_winyes -5.25914 5.95946 -0.882 0.377874
## best_actor_winyes -1.87258 1.55731 -1.202 0.229676
## best_actress_winyes -2.63601 1.73284 -1.521 0.128745
## best_dir_winyes -1.71247 2.25021 -0.761 0.446946
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.02 on 588 degrees of freedom
## Multiple R-squared: 0.6017, Adjusted R-squared: 0.5841
## F-statistic: 34.17 on 26 and 588 DF, p-value: < 2.2e-16
Our adjusted R-Squared increased again to 58.41. So now we will remove title_type
lm_movies5 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type,
data = movies_clean)
summary(lm_movies5)##
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box -
## title_type, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.760 -7.897 0.033 8.459 46.289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 505.09575 113.60617 4.446 1.05e-05 ***
## genreAnimation 6.82466 5.37330 1.270 0.204548
## genreArt House & International 16.40948 4.28641 3.828 0.000143 ***
## genreComedy 3.22658 2.23304 1.445 0.149009
## genreDocumentary 23.49745 3.39065 6.930 1.11e-11 ***
## genreDrama 7.36216 2.00024 3.681 0.000254 ***
## genreHorror -5.05539 3.37503 -1.498 0.134700
## genreMusical & Performing Arts 19.91917 4.28851 4.645 4.20e-06 ***
## genreMystery & Suspense -1.39473 2.51965 -0.554 0.580102
## genreOther 6.12753 3.85151 1.591 0.112158
## genreScience Fiction & Fantasy -3.67516 4.90942 -0.749 0.454401
## mpaa_ratingNC-17 -9.63504 13.55363 -0.711 0.477438
## mpaa_ratingPG -3.10732 3.84921 -0.807 0.419841
## mpaa_ratingPG-13 -4.76483 3.97988 -1.197 0.231698
## mpaa_ratingR -2.07359 3.84218 -0.540 0.589614
## mpaa_ratingUnrated 1.89132 4.58307 0.413 0.679995
## thtr_rel_year -0.25335 0.05769 -4.392 1.33e-05 ***
## thtr_rel_month -0.18502 0.15428 -1.199 0.230916
## imdb_num_votes 4.09983 0.41891 9.787 < 2e-16 ***
## critics_score 0.37086 0.02290 16.197 < 2e-16 ***
## best_pic_nomyes 7.72007 3.42200 2.256 0.024435 *
## best_pic_winyes -5.19534 5.95119 -0.873 0.383023
## best_actor_winyes -1.88306 1.55478 -1.211 0.226324
## best_actress_winyes -2.61592 1.72921 -1.513 0.130870
## best_dir_winyes -1.74554 2.24690 -0.777 0.437549
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 590 degrees of freedom
## Multiple R-squared: 0.6014, Adjusted R-squared: 0.5851
## F-statistic: 37.08 on 24 and 590 DF, p-value: < 2.2e-16
Our adjusted R-Squared increased again to 58.51. So now we will remove best_dir_win.
lm_movies6 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
-best_dir_win, data = movies_clean)
summary(lm_movies6)##
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box -
## title_type - best_dir_win, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.330 -7.782 0.167 8.460 46.302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 495.61246 112.91053 4.389 1.35e-05 ***
## genreAnimation 6.81604 5.37149 1.269 0.204965
## genreArt House & International 16.47054 4.28425 3.844 0.000134 ***
## genreComedy 3.24651 2.23215 1.454 0.146356
## genreDocumentary 23.53253 3.38921 6.943 1.01e-11 ***
## genreDrama 7.36114 1.99957 3.681 0.000253 ***
## genreHorror -5.04590 3.37388 -1.496 0.135298
## genreMusical & Performing Arts 19.88402 4.28683 4.638 4.33e-06 ***
## genreMystery & Suspense -1.42361 2.51853 -0.565 0.572115
## genreOther 6.16107 3.84997 1.600 0.110069
## genreScience Fiction & Fantasy -3.77220 4.90619 -0.769 0.442280
## mpaa_ratingNC-17 -9.55803 13.54872 -0.705 0.480804
## mpaa_ratingPG -3.24834 3.84364 -0.845 0.398385
## mpaa_ratingPG-13 -4.92508 3.97320 -1.240 0.215625
## mpaa_ratingR -2.23620 3.83519 -0.583 0.560067
## mpaa_ratingUnrated 1.75819 4.57833 0.384 0.701098
## thtr_rel_year -0.24834 0.05731 -4.333 1.73e-05 ***
## thtr_rel_month -0.19071 0.15405 -1.238 0.216231
## imdb_num_votes 4.06381 0.41619 9.764 < 2e-16 ***
## critics_score 0.36936 0.02281 16.194 < 2e-16 ***
## best_pic_nomyes 7.89884 3.41311 2.314 0.020995 *
## best_pic_winyes -6.51594 5.70134 -1.143 0.253552
## best_actor_winyes -1.95312 1.55164 -1.259 0.208619
## best_actress_winyes -2.63169 1.72851 -1.523 0.128413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 591 degrees of freedom
## Multiple R-squared: 0.601, Adjusted R-squared: 0.5854
## F-statistic: 38.7 on 23 and 591 DF, p-value: < 2.2e-16
Our adjusted R-Squared increased again to 58.54. So now we will remove best_pic_win.
lm_movies7 <- lm(audience_score ~ . -runtime -thtr_rel_day -top200_box -title_type
-best_dir_win - best_pic_win, data = movies_clean)
summary(lm_movies7)##
## Call:
## lm(formula = audience_score ~ . - runtime - thtr_rel_day - top200_box -
## title_type - best_dir_win - best_pic_win, data = movies_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.302 -7.834 0.113 8.469 46.213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 487.60398 112.72202 4.326 1.78e-05 ***
## genreAnimation 6.77447 5.37275 1.261 0.207844
## genreArt House & International 16.41711 4.28510 3.831 0.000141 ***
## genreComedy 3.14624 2.23100 1.410 0.158995
## genreDocumentary 23.41513 3.38853 6.910 1.26e-11 ***
## genreDrama 7.31808 1.99973 3.660 0.000275 ***
## genreHorror -5.06471 3.37471 -1.501 0.133945
## genreMusical & Performing Arts 19.81030 4.28746 4.621 4.70e-06 ***
## genreMystery & Suspense -1.49398 2.51843 -0.593 0.553260
## genreOther 6.42008 3.84429 1.670 0.095442 .
## genreScience Fiction & Fantasy -3.75378 4.90743 -0.765 0.444626
## mpaa_ratingNC-17 -9.53541 13.55221 -0.704 0.481955
## mpaa_ratingPG -3.32055 3.84411 -0.864 0.388046
## mpaa_ratingPG-13 -4.90387 3.97418 -1.234 0.217718
## mpaa_ratingR -2.28252 3.83597 -0.595 0.552051
## mpaa_ratingUnrated 1.63296 4.57820 0.357 0.721457
## thtr_rel_year -0.24405 0.05720 -4.266 2.31e-05 ***
## thtr_rel_month -0.18359 0.15397 -1.192 0.233590
## imdb_num_votes 4.00816 0.41344 9.695 < 2e-16 ***
## critics_score 0.36930 0.02281 16.188 < 2e-16 ***
## best_pic_nomyes 6.21227 3.07838 2.018 0.044039 *
## best_actor_winyes -1.84546 1.54918 -1.191 0.234032
## best_actress_winyes -2.75595 1.72553 -1.597 0.110764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 592 degrees of freedom
## Multiple R-squared: 0.6001, Adjusted R-squared: 0.5852
## F-statistic: 40.38 on 22 and 592 DF, p-value: < 2.2e-16
Our adjusted R-Squared has decreased to 58.52. So our lm_movies6 is our final model.
movies_final <- movies_clean %>%
select(-runtime, -thtr_rel_day, -top200_box, -title_type,
-best_dir_win, -best_pic_win)
lm_final <- lm(audience_score ~ ., movies_final)
lm_final##
## Call:
## lm(formula = audience_score ~ ., data = movies_final)
##
## Coefficients:
## (Intercept) genreAnimation
## 487.6040 6.7745
## genreArt House & International genreComedy
## 16.4171 3.1462
## genreDocumentary genreDrama
## 23.4151 7.3181
## genreHorror genreMusical & Performing Arts
## -5.0647 19.8103
## genreMystery & Suspense genreOther
## -1.4940 6.4201
## genreScience Fiction & Fantasy mpaa_ratingNC-17
## -3.7538 -9.5354
## mpaa_ratingPG mpaa_ratingPG-13
## -3.3206 -4.9039
## mpaa_ratingR mpaa_ratingUnrated
## -2.2825 1.6330
## thtr_rel_year thtr_rel_month
## -0.2440 -0.1836
## imdb_num_votes critics_score
## 4.0082 0.3693
## best_pic_nomyes best_actor_winyes
## 6.2123 -1.8455
## best_actress_winyes
## -2.7559
par(mfrow = c(2,2))
plot(lm_final$residuals, xlab = "", ylab = "Model Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)
hist(lm_final$residuals, ylab = "", xlab = "Model Residuals", main = "")
abline(v = 0, lty = 2, lwd = 2, col = "red")
plot(lm_final$residuals, movies_final$audience_score, xlab = "Residuals",
ylab = "Audience Score")
plot(lm_final$fitted.value, lm_final$residuals, xlab = "Fitted Values",
ylab = "Residuals")
abline(h = 0, lty = 2, col = "red", lwd = 2)Figure 4.1 Model Residuals plotted for model fit.
Some observations from these charts.
The residuals do not appear to exhibit auto-correlation.
The residuals appear to be normally distributed.
The residuals appear to have great dispersion around the normality line, but still appear normal.
The residuals do seem to exhibit heteroskadasticity, as the variance of the residuals tends to decrease as the fitted values increases.
We will note this along with our results.
We will now use of newly constructed model to test a random movie.
set.seed(4242)
test.index <- sample(nrow(movies_final), 1)
test<- movies_final[test.index, ]
predict(lm_final, newdata = test, interval = "prediction", level = 0.95)## fit lwr upr
## 1 62.63984 36.84401 88.43567
## [1] 64
Our prediction falls within the 95% confidence interval, albeit a wide interval.
We have made some interesting finding to explain audience scoring on Rotten Tomatoes. Our best model used the following predictors to explain nearly over 60% of the variability. Genre, MPAA Rating, Theatre release year and month, number of IMDB votes, critics score, best picture nominee, best actor winner and best actress winner.