What Makes a Movies Popular?

Setup

Load packages

library(dplyr)
library(ggplot2)
library(GGally)
library(tidyr)
library(lubridate)
library(tibble)
library(DT)
library(statsr)
library(SignifReg)
library(tidyverse)
library(car)

Load data

load("movies.Rdata")

1. Data

The data set comprises of 651 randomly sampled movies produced and released before 2016. Although random sampling was used, the population of movies considered is only Hollywood. Further, the rating of the movie is drawn from popular sites like IMDB and Rotten Tomatoes, whose users might not be representative of the world’s population in general.

Codebook

title: Title of movie
title_type: Type of movie (Documentary, Feature Film, TV Movie)
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other)
runtime: Runtime of movie (in minutes)
mpaa_rating: MPAA rating of the movie (G, PG, PG-13, R, Unrated)
studio: Studio that produced the movie
thtr_rel_year: Year the movie is released in theaters
thtr_rel_month: Month the movie is released in theaters
thtr_rel_day: Day of the month the movie is released in theaters
dvd_rel_year: Year the movie is released on DVD
dvd_rel_month: Month the movie is released on DVD
dvd_rel_day: Day of the month the movie is released on DVD
imdb_rating: Rating on IMDB
imdb_num_votes: Number of votes on IMDB
critics_rating: Categorical variable for critics rating on Rotten Tomatoes (Certified Fresh, Fresh, Rotten)
critics_score: Critics score on Rotten Tomatoes
audience_rating: Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
audience_score: Audience score on Rotten Tomatoes
best_pic_nom: Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win: Whether or not one of the main actors in the movie ever won an Oscar (no, yes) - note that this is not necessarily whether the actor won an Oscar for their role in the given movie
best_actress win: Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the actresses won an Oscar for their role in the given movie
best_dir_win: Whether or not the director of the movie ever won an Oscar (no, yes) - not that this is not necessarily whether the director won an Oscar for the given movie
top200_box: Whether or not the movie is in the Top 200 Box Office list on BoxOfficeMojo (no, yes)
director: Director of the movie
actor1: First main actor/actress in the abridged cast of the movie
actor2: Second main actor/actress in the abridged cast of the movie
actor3: Third main actor/actress in the abridged cast of the movie
actor4: Fourth main actor/actress in the abridged cast of the movie
actor5: Fifth main actor/actress in the abridged cast of the movie
imdb_url: Link to IMDB page for the movie
rt_url: Link to Rotten Tomatoes page for the movie

Conclusions

As our sample might not be a good representation of the world population, we have to careful in making generalized inferences. Further, any inference drawn would be applicable to only Hollywood movies.
Considering this is an observational study (as opposed to an experimental study with random assignment), we will refrain from establishing casual relationships between variables.

2. Research question

Perhaps the greatest art form of the modern age, movies embody all that’s great in the world of media. All rolled into one, we get beautiful visuals, gorgeous music, thought-provoking stories, parables on morality, commentaries on society, insights into history, and so much more. Not to mention, they’re just really friggin’ fun to watch; so why wouldn’t you want to learn more about them?

In this analysis we want to answer the question what makes a movie popular?

3. Exploratory data analysis

Our Data

We have a data of 651 randomly sampled movies with 32 variables

datatable(movies)

Removing 1 duplicate row

movies <- movies[!duplicated(movies),]
nrow(movies)

## [1] 650

Reponse Variable

Audience Scores vary from 11-97 and 50% of these scores range between 46-80.

summary(movies$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.32   80.00   97.00

Audience Scores are left-skewed as well.

ggplot(movies,aes(x=audience_score)) + 
  geom_histogram()

Explanatory Variables - Numeric

runtime - Highly right skewed. Slight positive correlation of 0.182 with audience_score.
imdb_rating - Slightly left skewed. High positive correlation of 0.865 with audience_score. It could be included in the model.
imdb_num_votes - Highly right skewed. Low positive correlation of 0.29 with audience_score.
critics_score - Multi-Modal Distribution. High positive correlation of 0.703 with audience_score. It could be included in the model.

movies %>%
select(audience_score,runtime,imdb_rating,imdb_num_votes,critics_score) %>%
ggpairs()

Explanatory Variables - Categorical

title_type: Documentaries have a higher audience score as compared to Feature Films and TV Movies.

df<-
tapply(movies$audience_score,movies$title_type,mean) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("title_type")

ggplot(df,aes(x=title_type,y=mean_audience_score)) +
  geom_bar(stat="identity") +
  labs(x=NULL, y = "Audience Score")

genre: Documentaries and Musicals are rated highly whereas Horror movies are generally not that well accepted.

df<-
sort(tapply(movies$audience_score,movies$genre,mean),decreasing = TRUE) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("genre")

ggplot(df,aes(x=reorder(genre, mean_audience_score),y=mean_audience_score)) +
  geom_bar(stat="identity") +
  labs(x=NULL, y = "Audience Score") +
  coord_flip()

mpaa_rating: Movies for general audiences are rated slightly higher than movies meant for more matured audience.

Unrated - Rating not available
G - General Audiences. All ages admitted.
NC-17 - No Children. No one 17 and under admitted.
R - Restricted. Under 17 requires accompanying parent or adult guardian.
PG - Parental Guidance Suggested. Some material may not be suitable for children.
PG-13 - Parents Strongly Cautioned. Some material may be inappropriate for children under 13.

df <- 
sort(tapply(movies$audience_score,movies$mpaa_rating,mean),decreasing = TRUE) %>%
  as.data.frame(names="mean_audience_score") %>%
  rename("mean_audience_score" = ".") %>%
  rownames_to_column("mpaa_rating")

ggplot(df,aes(x=reorder(mpaa_rating,-mean_audience_score),y=mean_audience_score)) +
  geom_bar(stat="identity") + 
  labs(x=NULL, y = "Audience_Score")

If the movie has won or been nominated for the best film at Oscars or featured in top 200 Box Office list on BoxOfficeMojo then it’s more popular among the audience.

At the same time having Oscar awarded actors, actresses or directors in the movie does not seem to impact the audience rating.

df <-
movies %>%
  select(best_pic_nom,best_pic_win,best_actor_win,best_actress_win,
         best_dir_win,top200_box,audience_score) %>%
  gather(Variable,Value, -audience_score) %>%
  group_by(Variable,Value) %>%
  summarize(mean_audience_score = mean(audience_score))

ggplot(df,aes(x=Variable,y=mean_audience_score,fill=Value)) +
  geom_bar(position="dodge",stat="identity") +
  labs(x=NULL, y = "Audience_Score")

4. Modeling

Building the Model

Using the Exploratory Data Analysis, these are the predictor variables which seemed to have some association with the audience score:

runtime
imdb_rating
imdb_num_votes
critics_score
genre+mpaa_rating
best_pic_nom
best_pic_win
top200_box

Now we will run a forward-selection algorithm to determine the best possible predictor variables based on Adjusted R-Squared

scope <- audience_score~
            runtime+
            imdb_rating+
            imdb_num_votes+
            critics_score+
            genre+mpaa_rating+
            best_pic_nom+
            best_pic_win+
            top200_box

model <- SignifReg(scope=scope,
                   data=movies,
                   alpha=0.05,
                   direction="forward",
                   criterion="r-adj",
                   correction="FDR")

summary(model)

## 
## Call:
## lm(formula = reg, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.708  -6.449   0.625   5.485  50.111 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -37.15256    3.11144 -11.941  < 2e-16 ***
## imdb_rating                     14.76629    0.57332  25.756  < 2e-16 ***
## genreAnimation                   9.11682    3.50121   2.604 0.009432 ** 
## genreArt House & International   0.03001    2.90680   0.010 0.991767    
## genreComedy                      2.09165    1.61538   1.295 0.195849    
## genreDocumentary                 1.20116    1.97792   0.607 0.543879    
## genreDrama                      -0.20340    1.38128  -0.147 0.882977    
## genreHorror                     -5.02802    2.38927  -2.104 0.035733 *  
## genreMusical & Performing Arts   4.39752    3.14076   1.400 0.161956    
## genreMystery & Suspense         -6.25294    1.78054  -3.512 0.000476 ***
## genreOther                       1.58199    2.76483   0.572 0.567400    
## genreScience Fiction & Fantasy  -0.29096    3.50593  -0.083 0.933884    
## critics_score                    0.06688    0.02144   3.120 0.001893 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.825 on 637 degrees of freedom
## Multiple R-squared:  0.7682, Adjusted R-squared:  0.7638 
## F-statistic: 175.9 on 12 and 637 DF,  p-value: < 2.2e-16

The final model has an Adjusted R-Squared of 0.7638 and the chosen predictors are:

imdb_rating
genre
critics_score

Model Diagnostics

We will check all the assumptions of a linear regression model:

Errors are normally distributed with mean=0

Using a Histogram, errors seem to be normally distributed and centred at 0.

df <- data.frame(res=model$residuals)

df %>% 
  ggplot(aes(x=res)) +
  geom_histogram()

Using a Q-Q Plot, errors seem to be normally distributed as well.

qqnorm(model$residuals)
qqline(model$residuals, col='red')

Uncorrelated Errors

There seems to be no pattern for the errors over time (index). Thus we can safely assume that the errors (hence the observations) are uncorrelated.

plot(model$residuals)

Constance Variance

This assumption is not satisfied. We can clearly see that the errors are more varied for lesser predicted values as compared to higher predicted values. We might need to transform our response variable or predictor variables or both to satisfy this condition.

df <- data.frame(residuals = model$residuals, predicted_values = model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

Using box-cox transformation, the best transformation of response variable is very close to the response variable itself (lambda ~ 1). Thus we will not transform the response variable.

bc <- MASS::boxcox(audience_score ~
               imdb_rating+
               critics_score+
               runtime +
               best_pic_nom +
               genre
             ,data=movies)

Applying Inverse Transformation to the predictor variable imdb_rating

new_model <- lm(audience_score ~
                I(1/imdb_rating)+
                genre+
                critics_score,
                data=movies)

We can see that after the transformation the errors - although not perfectly - are randomly scattered around 0. Thus we can asssume that the condition of constant variance is met.

df <- data.frame(residuals = new_model$residuals, predicted_values = new_model$fitted.values)
ggplot(df, aes(x=predicted_values,y=residuals)) +
  geom_point() +
  geom_hline(yintercept = 0)

Predictor Variables are independent of each other

As the Variation Inflation Factor < 5 for each predictor variable, we can assume that there is no multi-collinearity.

vif <- vif(new_model)
vif[,1]

## I(1/imdb_rating)            genre    critics_score 
##         1.816703         1.297577         1.892534

No influential outliers

Almost all standardized errors are below the absolute value of 4. Thus we do not have any extreme outliers that may be influencing the regression line.

rstan <- rstandard(new_model)  # studentized
plot(rstan)

Model Interpretation

Interpreting the Model:

64% variance in audience score is explained by 1/imdb_rating, genre and critics score.
All held constant, with 1 unit increase in critics score, the average audience score increases by 0.26 units.
A more concrete way of elaborating the above point would be: All held constant, we are 95 % confident that with 1 unit increase in critics score, the average audience score increases by 0.21 - 0.30 units.
All held constant, with 1 unit increase in 1/imdb_rating, the average audience score decreases by 237 units.
A more concrete way of elaborating the above point would be: All held constant, we are 95 % confident that with 1 unit increase in 1/imdb_rating, the average audience score decreases by 204 to 269 units.
For genre, the reference level is Action. The interpretation of all other genres must be made in relation to this reference level. Ex: All held constant, Musical & Performing Arts are rated, on an average, 7.77 points higher than the Action movies.
A more concrete way of elaborating the above point would be: All held constant, we are 95% confident that Musical & Performing Arts are rated, on an average, 0.11-15.43 points higher than the Action movies.
The t-tests correspond to the following hypothesis test:
- H0: Beta = 0
- HA: Beta !=0
- For all p-values < 0.05, we reject H0
- We can see that for each of the 3 predictor variables (and at least one level of genre), the p-value < 0.05. Thus the estimates (Beta’s) are significant.
The f-test correspond to the following hypotheis test:
- H0: All Beta’s = 0
- HA: At least one Beta != 0
- As p-value < 0.05, we reject H0. Thus our model as a whole is significant.

summary(new_model)

## 
## Call:
## lm(formula = audience_score ~ I(1/imdb_rating) + genre + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.043  -8.557   0.586   8.567  58.276 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      84.98952    3.91377  21.716  < 2e-16 ***
## I(1/imdb_rating)               -237.28740   16.65417 -14.248  < 2e-16 ***
## genreAnimation                    8.07374    4.35753   1.853  0.06437 .  
## genreArt House & International    2.02195    3.61527   0.559  0.57617    
## genreComedy                       0.35921    2.00657   0.179  0.85798    
## genreDocumentary                  6.13712    2.44137   2.514  0.01219 *  
## genreDrama                        0.31551    1.71942   0.183  0.85447    
## genreHorror                      -8.43713    2.96655  -2.844  0.00460 ** 
## genreMusical & Performing Arts    7.77701    3.90137   1.993  0.04664 *  
## genreMystery & Suspense          -6.25401    2.21826  -2.819  0.00496 ** 
## genreOther                        1.64148    3.44009   0.477  0.63341    
## genreScience Fiction & Fantasy   -1.68851    4.36556  -0.387  0.69905    
## critics_score                     0.26343    0.02326  11.324  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.22 on 637 degrees of freedom
## Multiple R-squared:  0.6411, Adjusted R-squared:  0.6343 
## F-statistic: 94.82 on 12 and 637 DF,  p-value: < 2.2e-16

confint(new_model)

##                                       2.5 %       97.5 %
## (Intercept)                      77.3040726   92.6749693
## I(1/imdb_rating)               -269.9911088 -204.5836992
## genreAnimation                   -0.4831081   16.6305949
## genreArt House & International   -5.0773497    9.1212399
## genreComedy                      -3.5810772    4.2994924
## genreDocumentary                  1.3430199   10.9312164
## genreDrama                       -3.0609153    3.6919290
## genreHorror                     -14.2625310   -2.6117372
## genreMusical & Performing Arts    0.1159193   15.4381055
## genreMystery & Suspense         -10.6099955   -1.8980273
## genreOther                       -5.1137973    8.3967632
## genreScience Fiction & Fantasy  -10.2611487    6.8841194
## critics_score                     0.2177493    0.3091118

5. Prediction

Predicting the audience score for the movie La La Land:

The current audience score on Rotten Tomatoes is 81 and our model predicts this score to be 80.

Links:

new_movie <- data.frame(title = "La La Land",
                        genre = "Comedy",
                        imdb_rating = 8.1,
                        critics_score = 91)
predict(new_model,new_movie) %>%
  round(2)

##     1 
## 80.03

Prediction Interval:

We are 95% confident that, all else being equal, the predicted audience score for the movie ‘La La Land’ will be between 80 and 83 on average.

predict(new_model, new_movie, interval="confidence")

##        fit      lwr      upr
## 1 80.02617 76.79458 83.25775

6. Conclusion

Genre, IMDB Rating and Critics Score are the most useful predictors. These 3 combined can explain 64% of the variance in the audience score.
As the movies considered are only Hollywood, we can not make generalized statements about all movies.
Also, as the ratings are only drawn from IMDB and Rotten Tomatoes, we cannot make generalized inferences about the population in general.