This data set comprises 651 randomly sampled movies produced and released before 2016, with 32 variables recorded for each movie.
As the movies in this set are randomly sampled, and there is significant representation from each genre type, studio and MPAA rating, one could generalize the results to the entire population of movies before 2016.
However, as random assignment was not used in this study, no causation can be established.
In this study, we want to understand which attributes make a movie popular. The proxy that we will use for movie popularity is the audience score, as it is crowdsourced and more representative of the general population than critics' opinions. As such, we are concerned with attributes that allow us to predict the audience score for a movie. Some key attributes to be tested are genre, runtime, thtr_rel_month, director and
We first want to understand the types of input the dataframe has.
| title | title_type | genre | runtime | mpaa_rating | studio | thtr_rel_year | thtr_rel_month | thtr_rel_day | dvd_rel_year | dvd_rel_month | dvd_rel_day | imdb_rating | imdb_num_votes | critics_rating | critics_score | audience_rating | audience_score | best_pic_nom | best_pic_win | best_actor_win | best_actress_win | best_dir_win | top200_box | director | actor1 | actor2 | actor3 | actor4 | actor5 | imdb_url | rt_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Filly Brown | Feature Film | Drama | 80 | R | Indomina Media Inc. | 2013 | 4 | 19 | 2013 | 7 | 30 | 5.5 | 899 | Rotten | 45 | Upright | 73 | no | no | no | no | no | no | Michael D. Olmos | Gina Rodriguez | Jenni Rivera | Lou Diamond Phillips | Emilio Rivera | Joseph Julian Soria | http://www.imdb.com/title/tt1869425/ | //www.rottentomatoes.com/m/filly_brown_2012/ |
| The Dish | Feature Film | Drama | 101 | PG-13 | Warner Bros. Pictures | 2001 | 3 | 14 | 2001 | 8 | 28 | 7.3 | 12285 | Certified Fresh | 96 | Upright | 81 | no | no | no | no | no | no | Rob Sitch | Sam Neill | Kevin Harrington | Patrick Warburton | Tom Long | Genevieve Mooy | http://www.imdb.com/title/tt0205873/ | //www.rottentomatoes.com/m/dish/ |
| Waiting for Guffman | Feature Film | Comedy | 84 | R | Sony Pictures Classics | 1996 | 8 | 21 | 2001 | 8 | 21 | 7.6 | 22381 | Certified Fresh | 91 | Upright | 91 | no | no | no | no | no | no | Christopher Guest | Christopher Guest | Catherine O’Hara | Parker Posey | Eugene Levy | Bob Balaban | http://www.imdb.com/title/tt0118111/ | //www.rottentomatoes.com/m/waiting_for_guffman/ |
| The Age of Innocence | Feature Film | Drama | 139 | PG | Columbia Pictures | 1993 | 10 | 1 | 2001 | 11 | 6 | 7.2 | 35096 | Certified Fresh | 80 | Upright | 76 | no | no | yes | no | yes | no | Martin Scorsese | Daniel Day-Lewis | Michelle Pfeiffer | Winona Ryder | Richard E. Grant | Alec McCowen | http://www.imdb.com/title/tt0106226/ | //www.rottentomatoes.com/m/age_of_innocence/ |
| Malevolence | Feature Film | Horror | 90 | R | Anchor Bay Entertainment | 2004 | 9 | 10 | 2005 | 4 | 19 | 5.1 | 2386 | Rotten | 33 | Spilled | 27 | no | no | no | no | no | no | Stevan Mena | Samantha Dark | R. Brandon Johnson | Brandon Johnson | Heather Magee | Richard Glover | http://www.imdb.com/title/tt0388230/ | //www.rottentomatoes.com/m/10004684-malevolence/ |
| Old Partner | Documentary | Documentary | 78 | Unrated | Shcalo Media Group | 2009 | 1 | 15 | 2010 | 4 | 20 | 7.8 | 333 | Fresh | 91 | Upright | 86 | no | no | no | no | no | no | Chung-ryoul Lee | Choi Won-kyun | Lee Sam-soon | Moo | NA | NA | http://www.imdb.com/title/tt1334549/ | //www.rottentomatoes.com/m/old-partner/ |
## title title_type genre
## Length:651 Documentary : 55 Drama :305
## Class :character Feature Film:591 Comedy : 87
## Mode :character TV Movie : 5 Action & Adventure: 65
## Mystery & Suspense: 59
## Documentary : 52
## Horror : 23
## (Other) : 60
## runtime mpaa_rating studio
## Min. : 39.0 G : 19 Paramount Pictures : 37
## 1st Qu.: 92.0 NC-17 : 2 Warner Bros. Pictures : 30
## Median :103.0 PG :118 Sony Pictures Home Entertainment: 27
## Mean :105.8 PG-13 :133 Universal Pictures : 23
## 3rd Qu.:115.8 R :329 Warner Home Video : 19
## Max. :267.0 Unrated: 50 (Other) :507
## NA's :1 NA's : 8
## thtr_rel_year thtr_rel_month thtr_rel_day dvd_rel_year
## Min. :1970 Min. : 1.00 Min. : 1.00 Min. :1991
## 1st Qu.:1990 1st Qu.: 4.00 1st Qu.: 7.00 1st Qu.:2001
## Median :2000 Median : 7.00 Median :15.00 Median :2004
## Mean :1998 Mean : 6.74 Mean :14.42 Mean :2004
## 3rd Qu.:2007 3rd Qu.:10.00 3rd Qu.:21.00 3rd Qu.:2008
## Max. :2014 Max. :12.00 Max. :31.00 Max. :2015
## NA's :8
## dvd_rel_month dvd_rel_day imdb_rating imdb_num_votes
## Min. : 1.000 Min. : 1.00 Min. :1.900 Min. : 180
## 1st Qu.: 3.000 1st Qu.: 7.00 1st Qu.:5.900 1st Qu.: 4546
## Median : 6.000 Median :15.00 Median :6.600 Median : 15116
## Mean : 6.333 Mean :15.01 Mean :6.493 Mean : 57533
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:7.300 3rd Qu.: 58301
## Max. :12.000 Max. :31.00 Max. :9.000 Max. :893008
## NA's :8 NA's :8
## critics_rating critics_score audience_rating audience_score
## Certified Fresh:135 Min. : 1.00 Spilled:275 Min. :11.00
## Fresh :209 1st Qu.: 33.00 Upright:376 1st Qu.:46.00
## Rotten :307 Median : 61.00 Median :65.00
## Mean : 57.69 Mean :62.36
## 3rd Qu.: 83.00 3rd Qu.:80.00
## Max. :100.00 Max. :97.00
##
## best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
## no :629 no :644 no :558 no :579 no :608
## yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
##
##
##
##
##
## top200_box director actor1 actor2
## no :636 Length:651 Length:651 Length:651
## yes: 15 Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## actor3 actor4 actor5
## Length:651 Length:651 Length:651
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## imdb_url rt_url
## Length:651 Length:651
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
Understanding the Underlying Distribution of Audience Scores
The audience score in this case follows this underlying distribution:
We can see that the distribution of audience scores tends to be left skewed, with a mean of around 62 (below the median of 65), a high concentration in the 70-90 range and a longer tail towards low scores.
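As a quick check on the direction of the skew, the quartiles and mean printed in the summary output can be compared directly. The sketch below uses only those reported values; the variable names and the choice of Bowley's quartile skewness as the measure are assumptions made for illustration, not part of the original analysis.

```r
# Quartiles, median and mean of audience_score, copied from the summary() output.
q1  <- 46
med <- 65
q3  <- 80
avg <- 62.36

# Bowley's quartile skewness: negative values point to a longer left tail.
bowley_skew <- (q3 + q1 - 2 * med) / (q3 - q1)

# A mean below the median is a second, rougher sign of left skew.
mean_below_median <- avg < med

print(round(bowley_skew, 3))   # -0.118
print(mean_below_median)       # TRUE
```

Both checks point the same way: the bulk of the scores sits high, with a tail stretching towards low scores.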
Let us try and visualize some of the other variables in relation to audience score:
Genre to Audience Score
There are a number of genres in which the movies are categorized, and the aim here is to determine if this variable is an influencing factor for the model.
library(ggplot2)
genreaudience <- ggplot(movies, aes(x = genre, y = audience_score)) + geom_boxplot()
genreaudience + ggtitle("Genre vs Audience Scores") +
geom_hline(yintercept = mean(movies$audience_score, na.rm = TRUE), col = "red", lwd = 1) + coord_flip()
The summary statistics of this are as follows:
## movies$genre: Action & Adventure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 37.00 52.00 53.78 65.00 94.00
## --------------------------------------------------------
## movies$genre: Animation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 59.00 65.00 62.44 70.00 88.00
## --------------------------------------------------------
## movies$genre: Art House & International
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 51.25 65.50 64.00 80.25 86.00
## --------------------------------------------------------
## movies$genre: Comedy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 37.00 50.00 52.51 67.50 93.00
## --------------------------------------------------------
## movies$genre: Documentary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 57.00 76.25 86.00 82.75 89.00 96.00
## --------------------------------------------------------
## movies$genre: Drama
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 52.00 70.00 65.35 80.00 95.00
## --------------------------------------------------------
## movies$genre: Horror
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 36.00 43.00 45.83 53.50 84.00
## --------------------------------------------------------
## movies$genre: Musical & Performing Arts
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.00 75.75 80.50 80.17 89.50 95.00
## --------------------------------------------------------
## movies$genre: Mystery & Suspense
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 40.50 54.00 55.95 70.50 97.00
## --------------------------------------------------------
## movies$genre: Other
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 53.00 73.50 66.69 82.50 91.00
## --------------------------------------------------------
## movies$genre: Science Fiction & Fantasy
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17.00 26.00 47.00 50.89 79.00 85.00
We can see from the above that documentaries, as well as musical and performing arts films, tend to get audience scores well above the overall mean of about 62 (means of 82.8 and 80.2 respectively), while genres like horror and comedy tend to fare below it (means of 45.8 and 52.5). This suggests that genre is likely to be a good predictor of audience scores.
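To make the comparison against the overall average explicit, the per-genre means printed above can be lined up against the overall mean of 62.36 from the data summary. This is only a sketch over the reported numbers; the vector and variable names are illustrative.

```r
# Mean audience score per genre, copied from the by-genre summaries above.
genre_means <- c(
  "Action & Adventure"        = 53.78,
  "Animation"                 = 62.44,
  "Art House & International" = 64.00,
  "Comedy"                    = 52.51,
  "Documentary"               = 82.75,
  "Drama"                     = 65.35,
  "Horror"                    = 45.83,
  "Musical & Performing Arts" = 80.17,
  "Mystery & Suspense"        = 55.95,
  "Other"                     = 66.69,
  "Science Fiction & Fantasy" = 50.89
)
overall_mean <- 62.36  # mean of audience_score from the data summary

# Gap between each genre mean and the overall mean, lowest first.
gap <- sort(genre_means - overall_mean)
print(round(gap, 2))

above <- names(gap)[gap > 0]  # genres averaging above the overall mean
below <- names(gap)[gap < 0]  # genres averaging below the overall mean
```

With the data loaded, a table in the same shape as the output above could presumably be regenerated with `by(movies$audience_score, movies$genre, summary)`; that call is an assumption about how the original output was produced.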
Runtime to Audience Score
There have been anecdotal claims that longer runtimes influence audience scores, so this section checks whether such a relationship exists.
plot(movies$runtime, movies$audience_score, main = "Runtime vs Audience Score", xlab = "Runtime in Minutes", ylab = "Audience Score")
abline(lm(movies$audience_score ~ movies$runtime), col = "red", lwd = 1)
As the plot shows, there is some linear relationship between audience score and runtime, but it is not very strong. As runtime could still have some predictive power, it is included in the construction of the linear model.
Month of Movie release to Audience Score
There might be a relationship between the month of a movie's release and its audience score, as one would suspect that seasonality influences how movies are perceived.
monthaudience <- ggplot(movies, aes(x = factor(thtr_rel_month), y = audience_score)) + geom_boxplot()
monthaudience + ggtitle("Theatre Release Month vs Audience Scores") +
geom_hline(yintercept = mean(movies$audience_score, na.rm = TRUE), col = "red", lwd = 1) + coord_flip()
Although the audience score appears highest in December compared with the other months (perhaps because of the festive season), the difference is not large enough to establish that release month is an influencing factor, hence we will not include it in our model.
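One way to put the month comparison on a firmer footing would be a one-way ANOVA of audience score on release month. The sketch below is not part of the original analysis: the `toy` data frame is simulated purely so the block runs on its own, and its values say nothing about the real data. With the real data loaded, `toy` would be replaced by `movies`.

```r
# Simulated stand-in for the movies data (NOT real values; for illustration only).
set.seed(1)
toy <- data.frame(
  thtr_rel_month = rep(1:12, each = 25),
  audience_score = pmin(pmax(round(rnorm(300, mean = 62, sd = 20)), 0), 100)
)

# One-way ANOVA: does the mean audience score differ across release months?
month_fit   <- lm(audience_score ~ factor(thtr_rel_month), data = toy)
month_anova <- anova(month_fit)

# p-value of the overall F-test for the month factor.
p_value <- month_anova[["Pr(>F)"]][1]
print(month_anova)
```

A small p-value would indicate that at least one month differs from the others; a large one would support leaving month out of the model.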
Critics Rating to Audience Score
It is not sensible to use the critics score as a proxy for the audience score, as the two would be strongly correlated. Instead, we can test whether the critics rating could be used to predict what scores audiences might give, which might be the reason why Rotten Tomatoes was set up in the first place.
criticsaudience <- boxplot(audience_score ~ critics_rating,
data = movies, main = 'Critics Rating vs Audience Scores',
xlab = 'Critics Rating', ylab = 'Audience Score')
It is clear from this that the critics rating has some influence on what the final score from audiences would be, and it should be included in the model. The summary statistics are as follows:
## movies$critics_rating: Certified Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 35.00 71.00 81.00 79.37 87.50 97.00
## --------------------------------------------------------
## movies$critics_rating: Fresh
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29.00 58.00 74.00 69.97 83.00 94.00
## --------------------------------------------------------
## movies$critics_rating: Rotten
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.0 36.0 48.0 49.7 64.0 95.0
Best Picture Nominee to Audience Score
Whether a movie is a Best Picture nominee might be a good indicator of whether audiences would like it. The EDA below tries to verify this:
boxplot(audience_score ~ best_pic_nom, data = movies,
main = "Best Picture Nomination vs Audience Score",
xlab = "Best Picture Nominee", ylab = "Audience Score")
It is clear that this will be a key factor in the linear model, due to the sharp difference between non-nominated films and nominated films. However, this will only apply to a subset of 22 films.
Number of Votes on IMDB to Audience Score
Good movies are likely to generate larger volumes of reviews online, and this EDA seeks to test if the number of votes on IMDB is a good proxy for the final audience score:
plot(movies$imdb_num_votes, movies$audience_score, main = "IMDB Votes vs Audience Score",
xlab = "Number of IMDB votes", ylab = "Audience Score")
abline(lm(movies$audience_score ~ movies$imdb_num_votes), col = "red", lwd = 1)
There is an overall linear relationship here, although it is weak. We will still add this as a potential predictor for the final audience score.
* * *
Critics rating, genre, best picture nomination, runtime and number of votes on IMDB were considered for the full model. Hence, a linear model was fit to determine if this would be a good predictor. The results are as follows:
et_model <- lm(movies$audience_score ~ movies$critics_rating + movies$genre+movies$best_pic_nom+
movies$runtime+movies$imdb_num_votes)
summary(et_model)
##
## Call:
## lm(formula = movies$audience_score ~ movies$critics_rating +
## movies$genre + movies$best_pic_nom + movies$runtime + movies$imdb_num_votes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.866 -9.794 0.187 10.261 42.617
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 6.303e+01 4.187e+00 15.054
## movies$critics_ratingFresh -4.527e+00 1.795e+00 -2.522
## movies$critics_ratingRotten -2.100e+01 1.776e+00 -11.824
## movies$genreAnimation 6.121e+00 5.291e+00 1.157
## movies$genreArt House & International 8.248e+00 4.387e+00 1.880
## movies$genreComedy -1.218e-01 2.436e+00 -0.050
## movies$genreDocumentary 1.974e+01 2.958e+00 6.674
## movies$genreDrama 5.471e+00 2.090e+00 2.618
## movies$genreHorror -6.304e+00 3.608e+00 -1.747
## movies$genreMusical & Performing Arts 1.874e+01 4.723e+00 3.968
## movies$genreMystery & Suspense -2.399e+00 2.686e+00 -0.893
## movies$genreOther 2.874e+00 4.170e+00 0.689
## movies$genreScience Fiction & Fantasy -8.275e+00 5.263e+00 -1.572
## movies$best_pic_nomyes 6.002e+00 3.499e+00 1.715
## movies$runtime 4.144e-02 3.412e-02 1.215
## movies$imdb_num_votes 3.258e-05 6.322e-06 5.153
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## movies$critics_ratingFresh 0.01193 *
## movies$critics_ratingRotten < 2e-16 ***
## movies$genreAnimation 0.24776
## movies$genreArt House & International 0.06058 .
## movies$genreComedy 0.96015
## movies$genreDocumentary 5.44e-11 ***
## movies$genreDrama 0.00907 **
## movies$genreHorror 0.08105 .
## movies$genreMusical & Performing Arts 8.08e-05 ***
## movies$genreMystery & Suspense 0.37207
## movies$genreOther 0.49100
## movies$genreScience Fiction & Fantasy 0.11639
## movies$best_pic_nomyes 0.08682 .
## movies$runtime 0.22490
## movies$imdb_num_votes 3.42e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.76 on 634 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4804, Adjusted R-squared: 0.4681
## F-statistic: 39.08 on 15 and 634 DF, p-value: < 2.2e-16
This model has reasonable predictive power, but the output suggests that runtime (p = 0.225) and best picture nomination (p = 0.087) can be removed without impacting the model too much. Let us try it here:
et_model2 <- lm(audience_score ~ critics_rating + genre+imdb_num_votes,data=movies)
summary(et_model2)
##
## Call:
## lm(formula = audience_score ~ critics_rating + genre + imdb_num_votes,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.077 -9.929 0.531 10.061 42.003
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.730e+01 2.528e+00 26.623 < 2e-16 ***
## critics_ratingFresh -4.851e+00 1.784e+00 -2.720 0.00672 **
## critics_ratingRotten -2.142e+01 1.761e+00 -12.162 < 2e-16 ***
## genreAnimation 5.458e+00 5.277e+00 1.034 0.30140
## genreArt House & International 8.468e+00 4.395e+00 1.927 0.05448 .
## genreComedy -1.671e-01 2.433e+00 -0.069 0.94528
## genreDocumentary 1.947e+01 2.948e+00 6.605 8.41e-11 ***
## genreDrama 6.107e+00 2.073e+00 2.946 0.00333 **
## genreHorror -6.527e+00 3.601e+00 -1.813 0.07036 .
## genreMusical & Performing Arts 1.930e+01 4.712e+00 4.096 4.75e-05 ***
## genreMystery & Suspense -1.960e+00 2.682e+00 -0.731 0.46522
## genreOther 3.645e+00 4.164e+00 0.875 0.38173
## genreScience Fiction & Fantasy -8.480e+00 5.273e+00 -1.608 0.10827
## imdb_num_votes 3.739e-05 5.921e-06 6.316 5.05e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.79 on 637 degrees of freedom
## Multiple R-squared: 0.4759, Adjusted R-squared: 0.4652
## F-statistic: 44.5 on 13 and 637 DF, p-value: < 2.2e-16
The new model has 3 variables, yet its adjusted R-squared of 0.465 is only slightly smaller than the 0.468 of the 5-variable model. This suggests that the most important variables are genre, critics rating and number of IMDB votes.
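The adjusted R-squared values can be reproduced from the multiple R-squared, the number of rows used and the number of estimated coefficients reported in the two summaries. The helper name `adj_r2` is made up for this sketch; the inputs are the rounded figures printed above, so the results match only to about three decimals.

```r
# Adjusted R-squared from R-squared, rows used (n) and coefficients excluding the intercept (p).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Full model: 650 rows used (1 dropped for missingness), 15 coefficients besides the intercept.
full <- adj_r2(r2 = 0.4804, n = 650, p = 15)

# Reduced model: all 651 rows, 13 coefficients besides the intercept.
reduced <- adj_r2(r2 = 0.4759, n = 651, p = 13)

print(round(c(full = full, reduced = reduced), 3))  # full 0.468, reduced 0.465
```

Note that the two models were fitted on slightly different row counts (650 versus 651), so the comparison is approximate rather than exact.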
par(mfrow = c(2, 2))
hist(et_model2$residuals, main = "Residuals")
qqnorm(et_model2$residuals)
qqline(et_model2$residuals)
plot(et_model2$residuals ~ et_model2$fitted.values, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
We can see that the residuals are approximately normally distributed around zero, which is a good sign. The normal Q-Q plot shows that the residuals are nearly normal, but their variance does not look entirely constant, as there is less scatter at larger fitted values. However, as this effect is small and the residuals are still centred around zero, it is ignored here.
As I just watched Thor: Ragnarok on Netflix and thought that it was hilarious, I wanted to see if my model could predict what its final audience score would be. To me, as there is space exploration and mythology involved, this falls under the genre of “Science Fiction & Fantasy”. The following code produces the prediction with a 95% prediction interval.
Thormovie<-data.frame(critics_rating="Fresh",genre="Science Fiction & Fantasy",imdb_num_votes=528202)
predict(et_model2, newdata = Thormovie, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 73.7176 42.58831 104.8469
The model predicted a score of 73.7, with a 95% prediction interval of 42.6 to 104.8. As audience scores cannot exceed 100, the upper bound is effectively 100. The actual audience score was 87%, which falls inside the interval, so this was not too bad a prediction.
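The point prediction can be checked by hand from the coefficients of the reduced model, and the interval can be clamped to the 0-100 range on which audience scores live. The coefficients below are the rounded values from the `summary(et_model2)` output, so the hand calculation agrees with `predict()` only to about one decimal; the variable names and the clamping step are assumptions added for illustration.

```r
# Coefficients copied (rounded) from the summary(et_model2) output.
b0      <- 67.30      # intercept: Certified Fresh, Action & Adventure, zero votes
b_fresh <- -4.851     # offset for critics_rating = Fresh
b_scifi <- -8.480     # offset for genre = Science Fiction & Fantasy
b_votes <- 3.739e-05  # change in score per IMDB vote

# Point prediction for a Fresh science fiction film with 528202 IMDB votes.
fit_by_hand <- b0 + b_fresh + b_scifi + b_votes * 528202
print(round(fit_by_hand, 1))  # 73.7

# Interval as reported by predict(), clamped to the valid 0-100 score range.
interval <- c(fit = 73.7176, lwr = 42.58831, upr = 104.8469)
clamped  <- pmin(pmax(interval, 0), 100)
print(clamped)
```

Clamping only tidies the reported bounds; it does not change the underlying model, which still treats the score as unbounded.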
EDA was conducted on the movies dataset, and a linear model was fitted using 3 variables to predict audience scores. The model performed satisfactorily in giving reasonable predictions, and can be extended to movies that are outside of the dataset (Thor was released in 2017, which is outside of the sample). Hope you have found this useful!
Eric Tay