Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)
library(qqplotr)
library(patchwork)

Load data

load("movies.RData")

Part 1: Data

The data consist of 651 movies randomly sampled from the population of movies produced and released before 2016. Observations were collected by simple random sampling, so each movie in the population had an equal chance of being included in the sample.
Movies were not randomly assigned to groups, and the research team did not influence the information recorded for each movie; the values were only observed and recorded. This is therefore an observational study, not an experiment.
Consequently, no causal inference can be drawn between the explanatory variable(s) and the response variable; only associations can be inferred. Because the sample was drawn at random, those associations can be generalized to the population of movies in the United States.

Part 2: Research question

According to Statista, the global box office generated 42.3 billion dollars in revenue in 2019. Box office revenue had been rising for some years, with each year recording higher revenue than the one before. In 2020, however, the global movie industry suffered a major slump in revenue due to the global COVID-19 pandemic.
The enormous revenue generated by the motion picture industry depends directly on the success of the movies and video content put out by studios.
Since the success of movies is key, what informs the success of a movie? According to a paper published by Sood and Balamurugan in 2017 titled “Factors affecting the success of a movie - A Case study of Twin Movies”, a movie's success is affected by classical factors such as the producer, production house, director, cast, run time, genre, script, time of release and marketing, and by social factors such as IMDb ratings, viewer and critic reviews, and ongoing social, cultural, political and economic trends, amongst many others.
In view of this, we seek to ascertain which factors (classical and social), as described by Sood and Balamurugan, are associated with the popularity of a movie, where popularity is measured by the audience score.

Part 3: Exploratory data analysis

Summary Statistics

# Summary
summary(movies[, -c(1:3, 8:9, 11:12, 25:32)])

##     runtime       mpaa_rating                               studio   
##  Min.   : 39.0   G      : 19   Paramount Pictures              : 37  
##  1st Qu.: 92.0   NC-17  :  2   Warner Bros. Pictures           : 30  
##  Median :103.0   PG     :118   Sony Pictures Home Entertainment: 27  
##  Mean   :105.8   PG-13  :133   Universal Pictures              : 23  
##  3rd Qu.:115.8   R      :329   Warner Home Video               : 19  
##  Max.   :267.0   Unrated: 50   (Other)                         :507  
##  NA's   :1                     NA's                            :  8  
##  thtr_rel_year   dvd_rel_year   imdb_rating    imdb_num_votes  
##  Min.   :1970   Min.   :1991   Min.   :1.900   Min.   :   180  
##  1st Qu.:1990   1st Qu.:2001   1st Qu.:5.900   1st Qu.:  4546  
##  Median :2000   Median :2004   Median :6.600   Median : 15116  
##  Mean   :1998   Mean   :2004   Mean   :6.493   Mean   : 57533  
##  3rd Qu.:2007   3rd Qu.:2008   3rd Qu.:7.300   3rd Qu.: 58301  
##  Max.   :2014   Max.   :2015   Max.   :9.000   Max.   :893008  
##                 NA's   :8                                      
##          critics_rating critics_score    audience_rating audience_score 
##  Certified Fresh:135    Min.   :  1.00   Spilled:275     Min.   :11.00  
##  Fresh          :209    1st Qu.: 33.00   Upright:376     1st Qu.:46.00  
##  Rotten         :307    Median : 61.00                   Median :65.00  
##                         Mean   : 57.69                   Mean   :62.36  
##                         3rd Qu.: 83.00                   3rd Qu.:80.00  
##                         Max.   :100.00                   Max.   :97.00  
##                                                                         
##  best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
##  no :629      no :644      no :558        no :579          no :608     
##  yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##  top200_box
##  no :636   
##  yes: 15   
##            
##            
##            
##            
##

The above output shows summary statistics for the informative variables in the movies dataset, that is, the variables that could be relevant to the analysis. For numerical variables, it displays the minimum, maximum, quartiles and mean, whereas for categorical and character/string variables, it displays the frequency/count and class, respectively.

Histogram of the response variable (audience Score)

According to Moghaddam et al. (2019), a common measure of a movie's popularity is the number of user votes. In this sample, however, that variable (imdb_num_votes) shows neither a moderate nor a strong linear relationship with the other potential predictors.
Another variable that can inform the popularity of a movie is the audience score, which describes how the audience received the movie.

ggplot(
  data = movies,
  aes(
    x = audience_score
  )
) +
  geom_histogram(bins = 30) +
  geom_vline(
    # Dashed vertical line at the sample mean of the audience score
    xintercept = mean(movies$audience_score, na.rm = TRUE),
    size = 1,
    col = "black",
    linetype = "dashed"
  ) +
  annotate(
    "text",
    x = 53,
    y = 40,
    label = paste0(
      "mean = ",
      mean(movies$audience_score, na.rm = TRUE) |>
        round(2)
    ),
    size = 4
  ) +
  labs(
    y = "Count",
    x = "score",
    title = "Distribution of audience Score on Rotten Tomatoes"
  ) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5)
  )

The above histogram shows the distribution of the audience score on Rotten Tomatoes. The distribution is slightly left skewed, with a mean of 62.36 and a standard deviation of 20.22.

Collinearity

Collinearity describes a linear relationship between explanatory variables in a regression analysis. Ideally, explanatory variables should not have any form of linear association with each other. All variables other than audience_score, which is the response variable, are potential predictors. We check for collinearity using a pairwise correlation plot.

ggpairs(
  movies[
    ,
    c(
      "audience_score",
      "imdb_num_votes",
      "runtime",
      "thtr_rel_year",
      "imdb_rating",
      "critics_score"
    )
  ],
  title = "Pair-Wise Correlation Plots"
)

For a variable to be included in the model, it should have a linear relationship with the response variable, preferably a strong one. The explanatory variables, however, should not be strongly related to each other.
From the plot, only critics_score and imdb_rating exhibit a strong positive linear relationship with the response variable audience_score. The remaining variables do not; where a relationship exists, the correlation coefficient is negligible. imdb_rating is itself a measure of a movie's success or popularity, just like the audience score, and is therefore excluded from the model building process. critics_score may also seem to be a measure of success, but the approved critics on Rotten Tomatoes number far fewer than half the audience of these movies. In addition, the audience score and the critics score on Rotten Tomatoes do not always move together.
In this regard, the only numerical variable considered in building the multivariate linear regression model is critics_score.
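The visual screening above can be backed up with the correlation coefficients themselves. The following is a minimal sketch using base R's `cor()` on simulated data; the data frame `toy` and its columns `x1`, `x2` and `y` are hypothetical stand-ins, not part of the movies sample. On the real data, one would pass the same numeric columns of `movies` that were given to `ggpairs()`.

```r
# Simulated stand-in data; names are hypothetical, not from movies.RData
set.seed(42)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 + rnorm(n) # the response depends on x1 only

# Pairwise Pearson correlations, dropping rows with missing values
cor_mat <- cor(toy, use = "complete.obs")

# Correlation of each candidate predictor with the response
cor_with_y <- cor_mat[c("x1", "x2"), "y"]
round(cor_with_y, 2)

# Correlation between the predictors themselves (collinearity check)
round(cor_mat["x1", "x2"], 2)
```

A predictor is a reasonable candidate when its correlation with the response is large in absolute value and its correlation with the other predictors is small.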

Part 4: Modeling

The linear relationship between variables can be modeled with one explanatory variable or with several. A simple linear regression (least squares line), called uni-variate here, takes the form:
\(\hat{y}=b_{0}+b_{1}x\)
whereas a multiple linear regression model, called multi-variate here, takes the form:
\(\hat{y}=b_{0} + b_{1}x_{1} + b_{2}x_{2} + \dots + b_{n}x_{n}\)

where:
- \(\hat{y}\) is the predicted value of the response or dependent variable
- \(b_{0}\) is the intercept (the value of \(\hat{y}\) when every \(x_{i}\) is 0)
- \(b_{1}, \dots, b_{n}\) are the slopes, one for each explanatory variable
- \(x_{1}, \dots, x_{n}\) are the explanatory or predictor variables.

For this analysis, we use the multi-variate linear regression model.

The pair-wise correlation plot above shows that critics_score and imdb_rating have a strong linear relationship with the response variable (audience_score). However, imdb_rating is as much a measure of a movie's popularity as audience_score is, and will therefore not be included in the model. Thus, the only numerical variable included in the modeling is critics_score, along with the categorical variables listed below.
Below is the list of explanatory/predictor variables for the model building.

Variable	Description	class
critics_score	critics score on Rotten Tomatoes	numerical
critics_rating	critics rating on Rotten Tomatoes	categorical
best_pic_nom	whether the movie was nominated for a best picture Oscar	categorical
best_pic_win	whether the movie won a best picture Oscar	categorical
best_actor_win	whether one of the main actors has ever won an Oscar	categorical
best_actress_win	whether one of the main actresses has ever won an Oscar	categorical
best_dir_win	whether the director has ever won an Oscar	categorical

Full Model

To model the association between movie popularity (measured by audience_score) and the associated factors described by Sood and Balamurugan (2017), we begin with all identified explanatory variables and remove variables until a parsimonious model is achieved, if possible.
The full model is fitted as follows:

model_full <- lm(
  audience_score ~ critics_score + critics_rating + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win,
  data = movies
)

summary(model_full)

## 
## Call:
## lm(formula = audience_score ~ critics_score + critics_rating + 
##     best_pic_nom + best_pic_win + best_actor_win + best_actress_win + 
##     best_dir_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.140  -9.386   0.388   9.965  43.194 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.93755    3.81580   8.632   <2e-16 ***
## critics_score         0.52976    0.04168  12.710   <2e-16 ***
## critics_ratingFresh  -3.70659    1.66372  -2.228   0.0262 *  
## critics_ratingRotten  0.45182    2.74445   0.165   0.8693    
## best_pic_nomyes       9.00980    3.66911   2.456   0.0143 *  
## best_pic_winyes      -2.45449    6.49880  -0.378   0.7058    
## best_actor_winyes    -1.20693    1.63936  -0.736   0.4619    
## best_actress_winyes  -2.23041    1.83913  -1.213   0.2257    
## best_dir_winyes      -0.26992    2.41057  -0.112   0.9109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.25 on 642 degrees of freedom
## Multiple R-squared:  0.5093, Adjusted R-squared:  0.5032 
## F-statistic: 83.29 on 8 and 642 DF,  p-value: < 2.2e-16

Above is the output of the full multi-variate regression model, which uses all explanatory variables selected for model building. We seek the parsimonious model: the one with the fewest explanatory variables and the highest predictive power, measured by adjusted R squared. To find it, we apply backward elimination: starting from the full model, we remove one variable at a time, keep the removal that yields the highest adjusted R squared, and stop when no removal improves it.
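The procedure just described can be sketched as a small helper function. This is an illustration under stated assumptions, not the code used to produce the table further below: the function `backward_adj_r2()` and the simulated data frame `toy` are hypothetical names introduced here.

```r
# Backward elimination by adjusted R squared (hypothetical helper)
backward_adj_r2 <- function(data, response, predictors) {
  # Adjusted R squared of the model using the given predictors
  adj_r2 <- function(vars) {
    rhs <- if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
    fit <- lm(as.formula(paste(response, "~", rhs)), data = data)
    summary(fit)$adj.r.squared
  }

  current <- predictors
  best <- adj_r2(current)

  while (length(current) > 1) {
    # Adjusted R squared after dropping each remaining variable in turn
    candidates <- sapply(current, function(v) adj_r2(setdiff(current, v)))
    if (max(candidates) <= best) break # no removal improves the model
    best <- max(candidates)
    current <- setdiff(current, names(which.max(candidates)))
  }

  list(variables = current, adj_r2 = best)
}

# Simulated example: y depends on x1 only; x2 and x3 are pure noise
set.seed(1)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy$y <- 2 + 3 * toy$x1 + rnorm(200)

full_adj_r2 <- summary(lm(y ~ x1 + x2 + x3, data = toy))$adj.r.squared
result <- backward_adj_r2(toy, "y", c("x1", "x2", "x3"))
result
```

On the movies data, the call would take `"audience_score"` as the response and the seven explanatory variables listed above as the predictors.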

Why Adjusted R Squared?

A commonly used criterion for regression model selection is the p-value. However, the selected model depends on the chosen significance level (usually 0.05): change the cut-off and the model may change with it. Adjusted R squared does not depend on such a threshold and is therefore the preferred criterion here, where the aim is predictive power.
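Adjusted R squared penalizes R squared for the number of estimated slopes. As a worked check, the sketch below applies the standard formula \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\) to the rounded R squared values reported in the model summaries of this section, with \(n = 651\) movies and \(p\) slope coefficients. The function name `adj_r2()` is introduced here for illustration only.

```r
# Adjusted R squared from R squared, sample size n and number of slopes p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Full model: R squared 0.5093 with 8 slope coefficients (642 residual df)
full <- adj_r2(0.5093, n = 651, p = 8)

# Reduced model reported later: R squared 0.5087 with 5 slopes (645 residual df)
reduced <- adj_r2(0.5087, n = 651, p = 5)

round(c(full = full, reduced = reduced), 4)
```

The reduced model has a slightly lower R squared but a higher adjusted R squared, because it pays a smaller penalty for its number of coefficients.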

Model selection

Step	Variables	Adjusted \(R^2\)
Full	critics_score + critics_rating + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win	0.5032
1	critics_rating + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win	0.3791
2	critics_score + best_pic_nom + best_pic_win + best_actor_win + best_actress_win + best_dir_win	0.4988
3	critics_score + critics_rating + best_pic_win + best_actor_win + best_actress_win + best_dir_win	0.4993
4	critics_score + critics_rating + best_pic_nom + best_actor_win + best_actress_win + best_dir_win	0.5039
5	critics_score + critics_rating + best_pic_nom + best_pic_win + best_actress_win + best_dir_win	0.5035
6	critics_score + critics_rating + best_pic_nom + best_pic_win + best_actor_win + best_dir_win	0.5028
7	critics_score + critics_rating + best_pic_nom + best_pic_win + best_actor_win + best_actress_win	0.504

... (intermediate elimination steps omitted) ...

Variables	Adjusted \(R^2\)
critics_score + critics_rating + best_pic_nom + best_actress_win	0.5049
critics_score + critics_rating + best_pic_nom + best_actor_win + best_actress_win	0.5046

After testing multiple variable combinations, the combination of critics_score, critics_rating, best_pic_nom and best_actress_win achieved the highest adjusted \(R^2\) of 0.5049, and did so with the fewest predictors. Hence, the model output is

## 
## Call:
## lm(formula = audience_score ~ critics_score + critics_rating + 
##     best_pic_nom + best_actress_win, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.977  -9.378   0.404  10.069  43.322 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          32.83715    3.80681   8.626   <2e-16 ***
## critics_score         0.52892    0.04158  12.721   <2e-16 ***
## critics_ratingFresh  -3.71855    1.65180  -2.251   0.0247 *  
## critics_ratingRotten  0.43605    2.73558   0.159   0.8734    
## best_pic_nomyes       7.98570    3.26357   2.447   0.0147 *  
## best_actress_winyes  -2.43110    1.82081  -1.335   0.1823    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.23 on 645 degrees of freedom
## Multiple R-squared:  0.5087, Adjusted R-squared:  0.5049 
## F-statistic: 133.6 on 5 and 645 DF,  p-value: < 2.2e-16

Therefore, the relationship between the popularity of a movie and the factors associated with it is modeled as:

lm(
  audience_score ~ critics_score  + critics_rating  + best_pic_nom  + best_actress_win,
  data = movies
) -> mod_popularity
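Written out with the coefficient estimates from the output above (rounded to two decimals), the fitted model is approximately the following. The indicator notation is introduced here for illustration: each categorical term equals 1 when the movie is at that level and 0 otherwise, with Certified Fresh and "no" as the reference levels.

\(\widehat{\text{audience\_score}} = 32.84 + 0.53 \times \text{critics\_score} - 3.72 \times \text{critics\_ratingFresh} + 0.44 \times \text{critics\_ratingRotten} + 7.99 \times \text{best\_pic\_nomyes} - 2.43 \times \text{best\_actress\_winyes}\)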

For the multivariate linear regression to hold, we need to check if it meets certain conditions/criteria.

Linear Relationship between \(x\) (numerical) and \(y\).

Each numerical explanatory variable should have a linear relationship with the response variable. We check this using a plot of the model residuals against the only numerical variable in the model, critics_score.

ggplot(data = movies, aes(x = critics_score, y = mod_popularity$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = "dashed") +
  labs(title = "Residual Plot", x = "critics_score", y = "model residuals")

The above shows a random scatter around 0 and hence, we can say the linearity condition between the response audience_score and the numerical explanatory critics_score is met.

Nearly Normal Residuals centered around 0

What is left over after the model fit (the residuals) is expected to be nearly normally distributed around 0. We can check this condition using a histogram and a normal quantile plot.

# Histogram
ggplot(data = movies, aes(x = mod_popularity$residuals)) +
  geom_histogram(bins = 10) +
  labs(x = "Residuals", title = "Distribution of Residuals") -> histogram

# QQplot
ggplot(mapping = aes(sample = mod_popularity$residuals)) +
  stat_qq_point(col = "darkblue") +
  stat_qq_line() +
  labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Normal Q-Q Plot") -> qqplot

# Grid Plot
histogram | qqplot

The histogram of residuals shows a nearly normal distribution, although slightly skewed to the right, as the quantile plot on the right confirms. The quantile plot shows a small deviation from the diagonal line at the lower-left end and towards the top-right corner; nonetheless, most of the residuals fall on or close to the line.
With this, we can say the condition of nearly normal residuals is fairly satisfied.

Constant Variability of Residuals

Residuals are expected to be equally variable for low and high values of the predicted response variable. We check this condition by plotting the residuals (y-axis) against the fitted values (x-axis), and then the absolute values of the residuals against the fitted values.

ggplot(data = movies, aes(x = mod_popularity$fitted.values, y = mod_popularity$residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, lty = "dashed") +
  labs(x = "model fit", y = "residuals", title = "residuals vs fitted values") -> rs1


# Absolute values of residuals
ggplot(data = movies, aes(x = mod_popularity$fitted.values, y = abs(mod_popularity$residuals))) +
  geom_point() +
  labs(x = "model fit", y = "|residuals|", title = "Absolute values of residuals") -> rs2

# Composite plot
rs1 | rs2

The plot on the left shows a somewhat fan-shaped scatter, which is evidence that the variability of the residuals is not constant across the fitted values. In the plot on the right, the scatter has no definite shape, but it is not fully random either. We therefore conclude that the condition of constant variability of residuals is not met.
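A visual judgement like this can be complemented with a numerical check. The sketch below is a simplified variant in the spirit of the Breusch-Pagan test: it regresses the squared residuals on the fitted values and compares \(n R^2\) from that auxiliary regression with a chi-squared distribution with 1 degree of freedom. Using the fitted values as the only auxiliary regressor is an assumption of this sketch, and the data are simulated, not the movies sample. On the real model, `fit` would be replaced by `mod_popularity`.

```r
# Simulated data whose error variance grows with x (hypothetical example)
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on fitted values
aux <- lm(residuals(fit)^2 ~ fitted(fit))

# LM statistic n * R^2, compared with a chi-squared distribution (1 df)
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(statistic = lm_stat, p_value = p_value)
```

A small p-value indicates that the spread of the residuals changes with the fitted values, which would agree with a fan-shaped residual plot.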

Independence of residuals

To check for independence of observations, we plot the residuals against the order of data collection to see whether there is a pattern, e.g. a time series pattern.

ggplot(data = movies, aes(x = 1:nrow(movies), y = residuals(mod_popularity))) +
  geom_point() +
  labs(x = "Index", y = "residuals", title = "Independent Residuals")

The above plot of residuals against the order of data collection shows a random scatter without any pattern. Hence, we can conclude that this condition is met.

Interpretation of Model Coefficients

Model Output

coef(mod_popularity) |>
  as.data.frame()

##                      coef(mod_popularity)
## (Intercept)                    32.8371534
## critics_score                   0.5289223
## critics_ratingFresh            -3.7185498
## critics_ratingRotten            0.4360501
## best_pic_nomyes                 7.9856964
## best_actress_winyes            -2.4311047

Intercept: The intercept is the predicted value of the response (audience_score) when the numerical predictor is 0 and every categorical predictor is at its reference level. Here that means a critics score of 0, a critics rating of Certified Fresh, no best picture nomination and no Oscar-winning main actress, for which the model predicts an audience score of 32.8371534. This combination is rarely meaningful in practice; the intercept mainly sets the height of the regression line.
Slope of critics_score: All else held constant, for each one-point increase in the critics score of a movie, we would expect the model to predict its audience score to be higher on average by 0.5289223.
Slope of critics_ratingFresh: All else held constant, we would expect the model to predict the audience score of movies rated Fresh by critics to be 3.7185498 lower on average than that of movies rated Certified Fresh.
Slope of critics_ratingRotten: All else held constant, we would expect the model to predict the audience score of movies rated Rotten by critics to be 0.4360501 higher on average than that of movies rated Certified Fresh.
Slope of best_pic_nomyes: All else held constant, we would expect the model to predict the audience score of movies nominated for best picture to be 7.9856964 higher on average than that of movies not nominated for best picture.
Slope of best_actress_winyes: All else held constant, we would expect the model to predict the audience score of movies in which one of the main actresses has ever won an Oscar to be 2.4311047 lower on average than that of movies without an Oscar-winning main actress.

Part 5: Prediction

In this part, we predict the audience_score of the 2016 movie Arrival, starring Amy Adams, Jeremy Renner and Forest Whitaker, with a 95% prediction interval.
The data used to predict the audience score of Arrival were sourced from the following hyperlinks:
- critics score of Arrival
- critics rating of Arrival
- Wikipedia - best picture
- Wikipedia - best actress

The following are the values from Rotten Tomatoes and Wikipedia that are associated with this movie.

data.frame(
  critics_score = 94,
  critics_rating = as.factor("Certified Fresh"),
  best_pic_nom = as.factor("yes"),
  best_actress_win = as.factor("yes")
) -> Arrival

Arrival

##   critics_score  critics_rating best_pic_nom best_actress_win
## 1            94 Certified Fresh          yes              yes

To predict the audience score for the movie Arrival, we call the predict function and request a prediction interval at the 0.95 level:

predict(
  mod_popularity,
  Arrival,
  interval = "prediction",
  level = 0.95
)

##        fit      lwr      upr
## 1 88.11044 59.47009 116.7508

The model outputs a point prediction (fit) together with a range of values (lwr, upr) that quantifies the uncertainty around it. At the 95% level, this range is expected to capture the true audience score of the movie Arrival given \(x\), where \(x\) is:

##   critics_score  critics_rating best_pic_nom best_actress_win
## 1            94 Certified Fresh          yes              yes

For a movie with a critics score of 94, rated Certified Fresh by critics on Rotten Tomatoes, nominated for the Academy Award for best picture, and with a main actress recorded as having won an Academy Award, the 95% prediction interval for the audience score runs from 59.47009 to 116.7508.
In other words, we are 95% confident that the audience score for the movie Arrival lies between 59.47009 and 116.7508, with a point prediction of 88.11044. Because audience scores cannot exceed 100, the upper limit is effectively 100. The actual audience score for Arrival on Rotten Tomatoes is 82, which indeed falls within this rather wide prediction interval.
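The point prediction can be reproduced by hand from the coefficients, which makes clear how each predictor contributes. The sketch below copies the estimates from the coefficient output in Part 4; the vector `b` and its element names are introduced here for illustration. It also clips the reported interval to the 0 to 100 range of a percentage score, which is a simple post-processing choice rather than something `predict()` does.

```r
# Coefficients copied from the model output in Part 4
b <- c(
  intercept        = 32.8371534,
  critics_score    = 0.5289223,
  best_pic_nom_yes = 7.9856964,
  best_actress_yes = -2.4311047
)

# Arrival: critics_score = 94, Certified Fresh (reference level, adds 0),
# best_pic_nom = yes, best_actress_win = yes
fit <- b[["intercept"]] + b[["critics_score"]] * 94 +
  b[["best_pic_nom_yes"]] + b[["best_actress_yes"]]
round(fit, 5)

# Audience scores are percentages, so clip the reported interval to [0, 100]
interval <- c(lwr = 59.47009, upr = 116.7508)
clipped <- pmin(pmax(interval, 0), 100)
clipped
```

The hand-computed value agrees with the `fit` column returned by `predict()`, up to rounding of the coefficients.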

Part 6: Conclusion

Expert opinion, and a common sentiment among movie lovers all over the world, is that best_dir_win and best_actor_win are, amongst many equally important variables, very significant factors affecting movie popularity. However, from a purely statistical perspective (linear association), adding them to our linear model lowers its adjusted \(R^2\), and with it the model's predictive power.

The F-statistic of 133.6 suggests at least one of the explanatory variables in the model is a significant predictor of the response variable, which is the audience score.

In my opinion, the audience score may not always be a reliable measure of the popularity of a movie, because not everyone who gives a score on Rotten Tomatoes has actually watched the movie. Since signing up and creating an account is all it takes, anybody can give a score, whether or not they have watched the movie. The critics score, on the other hand, is given by persons who are approved to do so and who, I believe, have actually watched the movie. However, critics constitute a very small proportion of the audience and may under-represent the people who have actually seen the movie in question. The audience score was chosen as the measure because, after all, the popularity of a movie reflects how the audience receives and reacts to it.

Given that the audience score is used as a measure of popularity, I recommend that Rotten Tomatoes and IMDb put a system in place that allows only audience members who have actually watched a movie to submit scores and reviews for it.