Setup

Load packages

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(DT)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)

Load data

load("movies.Rdata")

1. Data

The data set comprises of 651 randomly sampled movies produced and released before 2016. Although random sampling was used, the population of movies considered is only Hollywood. Further, the rating of the movie is drawn from popular sites like IMDB and Rotten Tomatoes, whose users might not be representative of the world’s population in general.

Codebook

  1. title: Title of movie
  2. title_type: Type of movie (Documentary, Feature Film, TV Movie)
  3. genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
  4. runtime: Runtime of movie (in minutes)
  5. mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
  6. studio: Studio that produced the movie
  7. thtr_rel_year: Year the movie is released in theaters
  8. thtr_rel_month: Month the movie is released in theaters
  9. thtr_rel_day: Day of the month the movie is released in theaters
  10. dvd_rel_year: Year the movie is released on DVD
  11. dvd_rel_month: Month the movie is released on DVD
  12. dvd_rel_day: Day of the month the movie is released on DVD
  13. imdb_rating: Rating on IMDB
  14. imdb_num_votes: Number of votes on IMDB
  15. critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
  16. critics_score: Critics score on Rotten Tomatoes
  17. audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
  18. audience_score: Audience score on Rotten Tomatoes
  19. best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
  20. best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
  21. best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given movie
  22. best_actress win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the actresses won an Oscar for their role in the given movie
  23. best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the director won an Oscar for the given movie
  24. top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
  25. director: Director of the movie
  26. actor1: First main actor/actress in the abridged cast of the movie
  27. actor2: Second main actor/actress in the abridged cast of the movie
  28. actor3: Third main actor/actress in the abridged cast of the movie
  29. actor4: Fourth main actor/actress in the abridged cast of the movie
  30. actor5: Fifth main actor/actress in the abridged cast of the movie
  31. imdb_url: Link to IMDB page for the movie
  32. rt_url: Link to Rotten Tomatoes page for the movie

Conclusions

  1. As our sample might not be a good representation of the world population, we have to careful in making generalized inferences. Further, any inference drawn would be applicable to only Hollywood movies.

  2. Considering this is an observational study (as opposed to an experimental study with random assignment), we will refrain from establishing casual relationships between variables.


2. Research question

Perhaps the greatest art form of the modern age, movies embody all that’s great in the world of media. All rolled into one, we get beautiful visuals, gorgeous music, thought-provoking stories, parables on morality, commentaries on society, insights into history, and so much more. Not to mention, they’re just really friggin’ fun to watch; so why wouldn’t you want to learn more about them?

In this analysis we want to answer the question what makes a movie popular?


3. Exploratory data analysis

Our Data

We have a data of 651 randomly sampled movies with 32 variables

datatable(movies)

Removing 1 duplicate row

movies <- movies[!duplicated(movies),]
nrow(movies)
## [1] 650

Reponse Variable

Audience Scores vary from 11-97 and 50% of these scores range between 46-80.

summary(movies$audience_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.32   80.00   97.00

Audience Scores are left-skewed as well.

ggplot(movies,aes(x=audience_score)) + 
  geom_histogram()

Explanatory Variables - Numeric

  • runtime - Highly right skewed. Slight positive correlation of 0.182 with audience_score.
  • imdb_rating - Slightly left skewed. High positive correlation of 0.865 with audience_score. It could be included in the model.
  • imdb_num_votes - Highly right skewed. Low positive correlation of 0.29 with audience_score.
  • critics_score - Multi-Modal Distribution. High positive correlation of 0.703 with audience_score. It could be included in the model.
movies %>%
select(audience_score,runtime,imdb_rating,imdb_num_votes,critics_score) %>%
ggpairs()

Explanatory Variables - Categorical

title_type: Documentaries have a higher audience score as compared to Feature Films and TV Movies.

df<-
tapply(movies$audience_score,movies$title_type,mean) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("title_type")

ggplot(df,aes(x=title_type,y=mean_audience_score)) +
  geom_bar(stat="identity") +
  labs(x=NULL, y = "Audience Score")

genre: Documentaries and Musicals are rated highly whereas Horror movies are generally not that well accepted.

df<-
sort(tapply(movies$audience_score,movies$genre,mean),decreasing = TRUE) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("genre")

ggplot(df,aes(x=reorder(genre, mean_audience_score),y=mean_audience_score)) +
  geom_bar(stat="identity") +
  labs(x=NULL, y = "Audience Score") +
  coord_flip()

mpaa_rating: Movies for general audiences are rated slightly higher than movies meant for more matured audience.

  • Unrated - Rating not available
  • G - General Audiences. All ages admitted.
  • NC-17 - No Children. No one 17 and under admitted.
  • R - Restricted. Under 17 requires accompanying parent or adult guardian.
  • PG - Parental Guidance Suggested. Some material may not be suitable for children.
  • PG-13 - Parents Strongly Cautioned. Some material may be inappropriate for children under 13.
df <- 
sort(tapply(movies$audience_score,movies$mpaa_rating,mean),decreasing = TRUE) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("mpaa_rating")

ggplot(df,aes(x=reorder(mpaa_rating,-mean_audience_score),y=mean_audience_score)) +
  geom_bar(stat="identity") + 
  labs(x=NULL, y = "Audience_Score")

If the movie has won or been nominated for the best film at Oscars or featured in top 200 Box Office list on BoxOfficeMojo then it’s more popular among the audience.

At the same time having Oscar awarded actors, actresses or directors in the movie does not seem to impact the audience rating.

df <-
movies %>%
  select(best_pic_nom,best_pic_win,best_actor_win,best_actress_win,
         best_dir_win,top200_box,audience_score) %>%
  gather(Variable,Value, -audience_score) %>%
  group_by(Variable,Value) %>%
  summarize(mean_audience_score = mean(audience_score))

ggplot(df,aes(x=Variable,y=mean_audience_score,fill=Value)) +
  geom_bar(position="dodge",stat="identity") +
  labs(x=NULL, y = "Audience_Score")

4. Modeling

Building the Model

Using the Exploratory Data Analysis, these are the predictor variables which seemed to have some association with the audience score:

  • runtime
  • imdb_rating
  • imdb_num_votes
  • critics_score
  • genre+mpaa_rating
  • best_pic_nom
  • best_pic_win
  • top200_box

Now we will run a forward-selection algorithm to determine the best possible predictor variables based on Adjusted R-Squared

scope <- audience_score~
            runtime+
            imdb_rating+
            imdb_num_votes+
            critics_score+
            genre+mpaa_rating+
            best_pic_nom+
            best_pic_win+
            top200_box

model <- SignifReg(scope=scope,
                   data=movies,
                   alpha=0.05,
                   direction="forward",
                   criterion="r-adj",
                   correction="FDR")

summary(model)
## 
## Call:
## lm(formula = reg, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.708  -6.449   0.625   5.485  50.111 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -37.15256    3.11144 -11.941  < 2e-16 ***
## imdb_rating                     14.76629    0.57332  25.756  < 2e-16 ***
## genreAnimation                   9.11682    3.50121   2.604 0.009432 ** 
## genreArt House & International   0.03001    2.90680   0.010 0.991767    
## genreComedy                      2.09165    1.61538   1.295 0.195849    
## genreDocumentary                 1.20116    1.97792   0.607 0.543879    
## genreDrama                      -0.20340    1.38128  -0.147 0.882977    
## genreHorror                     -5.02802    2.38927  -2.104 0.035733 *  
## genreMusical & Performing Arts   4.39752    3.14076   1.400 0.161956    
## genreMystery & Suspense         -6.25294    1.78054  -3.512 0.000476 ***
## genreOther                       1.58199    2.76483   0.572 0.567400    
## genreScience Fiction & Fantasy  -0.29096    3.50593  -0.083 0.933884    
## critics_score                    0.06688    0.02144   3.120 0.001893 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.825 on 637 degrees of freedom
## Multiple R-squared:  0.7682, Adjusted R-squared:  0.7638 
## F-statistic: 175.9 on 12 and 637 DF,  p-value: < 2.2e-16

The final model has an Adjusted R-Squared of 0.7638 and the chosen predictors are:

  • imdb_rating
  • genre
  • critics_score

Model Diagnostics

We will check all the assumptions of a linear regression model:

  1. Errors are normally distributed with mean=0

Using a Histogram, errors seem to be normally distributed and centred at 0.

df <- data.frame(res=model$residuals)

df %>% 
  ggplot(aes(x=res)) +
  geom_histogram() 

Using a Q-Q Plot, errors seem to be normally distributed as well.

qqnorm(model$residuals)
qqline(model$residuals, col='red')

  1. Uncorrelated Errors

There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors (hence the observations) are uncorrelated.

plot(model$residuals)

  1. Constance Variance

This assumption is not satisfied. We can clearly see that the errors are more varied for lesser predicted values as compared to higher predicted values. We might need to transform our response variable or predictor variables or both to satisfy this condition.

df <- data.frame(residuals = model$residuals, predicted_values = model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

Using box-cox transformation, the best transformation of response variable is very close to the response variable itself (lambda ~ 1). Thus we will not transform the response variable.

bc <- MASS::boxcox(audience_score ~
               imdb_rating+
               critics_score+
               runtime +
               best_pic_nom +
               genre
             ,data=movies)

Applying Inverse Transformation to the predictor variable imdb_rating

new_model <- lm(audience_score ~
                I(1/imdb_rating)+
                genre+
                critics_score,
                data=movies)

We can see that after the transformation the errors - although not perfectly - are randomly scattered around 0. Thus we can asssume that the condition of constant variance is met.

df <- data.frame(residuals = new_model$residuals, predicted_values = new_model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

  1. Predictor Variables are independent of each other

As the Variation Inflation Factor < 5 for each predictor variable, we can assume that there is no multi-collinearity.

vif <- vif(new_model)
vif[,1]
## I(1/imdb_rating)            genre    critics_score 
##         1.816703         1.297577         1.892534
  1. No influential outliers

Almost all standardized errors are below the absolute value of 4. Thus we do not have any extreme outliers that may be influencing the regression line.

rstan <- rstandard(new_model)  # studentized
plot(rstan)

Model Interpretation

Interpreting the Model:

  • 64% variance in audience score is explained by 1/imdb_rating, genre and critics score.
  • All held constant, with 1 unit increase in critics score, the average audience score increases by 0.26 units.
  • A more concrete way of elaborating the above point would be: All held constant, we are 95 % confident that with 1 unit increase in critics score, the average audience score increases by 0.21 - 0.30 units.
  • All held constant, with 1 unit increase in 1/imdb_rating, the average audience score decreases by 237 units.
  • A more concrete way of elaborating the above point would be: All held constant, we are 95 % confident that with 1 unit increase in 1/imdb_rating, the average audience score decreases by 204 to 269 units.
  • For genre, the reference level is Action. The interpretation of all other genres must be made in relation to this reference level. Ex: All held constant, Musical & Performing Arts are rated, on an average, 7.77 points higher than the Action movies.
  • A more concrete way of elaborating the above point would be: All held constant, we are 95% confident that Musical & Performing Arts are rated, on an average, 0.11-15.43 points higher than the Action movies.
  • The t-tests correspond to the following hypothesis test:
    • H0: Beta = 0
    • HA: Beta !=0
    • For all p-values < 0.05, we reject H0
    • We can see that for each of the 3 predictor variables (and at least one level of genre), the p-value < 0.05. Thus the estimates (Beta’s) are significant.
  • The f-test correspond to the following hypotheis test:
    • H0: All Beta’s = 0
    • HA: At least one Beta != 0
    • As p-value < 0.05, we reject H0. Thus our model as a whole is significant.
summary(new_model)
## 
## Call:
## lm(formula = audience_score ~ I(1/imdb_rating) + genre + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.043  -8.557   0.586   8.567  58.276 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      84.98952    3.91377  21.716  < 2e-16 ***
## I(1/imdb_rating)               -237.28740   16.65417 -14.248  < 2e-16 ***
## genreAnimation                    8.07374    4.35753   1.853  0.06437 .  
## genreArt House & International    2.02195    3.61527   0.559  0.57617    
## genreComedy                       0.35921    2.00657   0.179  0.85798    
## genreDocumentary                  6.13712    2.44137   2.514  0.01219 *  
## genreDrama                        0.31551    1.71942   0.183  0.85447    
## genreHorror                      -8.43713    2.96655  -2.844  0.00460 ** 
## genreMusical & Performing Arts    7.77701    3.90137   1.993  0.04664 *  
## genreMystery & Suspense          -6.25401    2.21826  -2.819  0.00496 ** 
## genreOther                        1.64148    3.44009   0.477  0.63341    
## genreScience Fiction & Fantasy   -1.68851    4.36556  -0.387  0.69905    
## critics_score                     0.26343    0.02326  11.324  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 637 degrees of freedom
## Multiple R-squared:  0.6411, Adjusted R-squared:  0.6343 
## F-statistic: 94.82 on 12 and 637 DF,  p-value: < 2.2e-16
confint(new_model)
##                                       2.5 %       97.5 %
## (Intercept)                      77.3040726   92.6749693
## I(1/imdb_rating)               -269.9911088 -204.5836992
## genreAnimation                   -0.4831081   16.6305949
## genreArt House & International   -5.0773497    9.1212399
## genreComedy                      -3.5810772    4.2994924
## genreDocumentary                  1.3430199   10.9312164
## genreDrama                       -3.0609153    3.6919290
## genreHorror                     -14.2625310   -2.6117372
## genreMusical & Performing Arts    0.1159193   15.4381055
## genreMystery & Suspense         -10.6099955   -1.8980273
## genreOther                       -5.1137973    8.3967632
## genreScience Fiction & Fantasy  -10.2611487    6.8841194
## critics_score                     0.2177493    0.3091118

5. Prediction

Predicting the audience score for the movie La La Land:

The current audience score on Rotten Tomatoes is 81 and our model predicts this score to be 80.

Links:

new_movie <- data.frame(title = "La La Land",
                        genre = "Comedy",
                        imdb_rating = 8.1,
                        critics_score = 91)
predict(new_model,new_movie) %>%
  round(2)
##     1 
## 80.03

Prediction Interval:

We are 95% confident that, all else being equal, the predicted audience score for the movie ‘La La Land’ will be between 80 and 83 on average.

predict(new_model, new_movie, interval="confidence")
##        fit      lwr      upr
## 1 80.02617 76.79458 83.25775

6. Conclusion

  • Genre, IMDB Rating and Critics Score are the most useful predictors. These 3 combined can explain 64% of the variance in the audience score.
  • As the movies considered are only Hollywood, we can not make generalized statements about all movies.
  • Also, as the ratings are only drawn from IMDB and Rotten Tomatoes, we cannot make generalized inferences about the population in general.