Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)
library(DAAG)
library(devtools)

Load data

load("movies.Rdata")

Part 1: Data

The “movies” dataset for this project included 651 randomly sampled movies released in the United States between 1970 and 2016. The data likely represents a simple random sample of movies without the use of stratification. For example, the amount of movies for each theater release year varies and there are different proportions of genres that what we would expect from the population. From this, we can conclude that there are no controls in the simple random sample that would lead us to believe that it closely resembles the actual distribution of movies across the given time period.

The simple random sample does not appear to involve the assignment of movies to the factors under consideration for the data set. Equally, it can be said that the data is not representative of an experiment or observational study and because of this we can conclude that there are limitations on the conclusions that we are able to inference.

When viewing the data that is presented to us for this project, it is important to keep in mind that there are likely biases and lurking variables that could complicate identifying correlations between the variables. With that in mind, it can determined that through the course of this project, we should rule out causality that is determined, and only use the data for means of association.

Regardless of this fact, the data can still be useful for us in drawing a hypothesis and can be informative about the trend towards modern day movies

Part 2: Research question

Q: Are the ratings of a movie, as well as the genre, tied to the runtime of the movie?

Therefore, the response variable is the “run time”, and the explanatory variables are those linked to either the genre and type of movie it is or the ratings of the movie.

This data could reveal key characteristics that allow us to draw conclusions on if the overall film length is closely tied to the movie genre and how this can affect the ratings of the movie.

Part 3: Exploratory data analysis

Below is a list of variables that will be focused on throughout the course of this project, as well as what category of variable they fall into:

Numerical:

runtime (response variable)
imdb_rating
imdb_num_votes
critics_score
audience_score
thtr_rel_year
thtr_rel_month
thtr_rel_day

Categorical:

title_type
genre
mpaa_rating
critics rating
audience_rating
best_pic_nom
best_pic_win

To begin, we will start by looking at the response variable (run time) statistics:

ggplot(data = movies, aes(x = runtime)) + 
  geom_histogram(binwidth = 10, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Response Variable: Movie Length" )

summary(movies$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    39.0    92.0   103.0   105.8   115.8   267.0       1

From this data, we can see that there are traces of a normal distribution with 1 possible outlire to the far right. The mean time is 105min and the median time is 103min with a max time of 267min and min time of 39min.

Next, we will check how the other numerical variables are related, looking specifically at the correlation values using a pairwise plot:

num_cols <- c(13,14,16,18)

ggpairs(movies, 
        columns = num_cols, 
        upper = list(continuous = wrap("cor", size = 10)), 
        diag = list(continuous = wrap("barDiag", colour = "blue")), 
        lower = list(continuous = wrap("smooth", colour = 'blue')))

It can be seen here that _rating, audience_score, and critics_score are all strongly correlated. By including more than one of these variable, we could possibly be introducing lurking variables, causing issues with our results. Additionally, since the correlation values for these variables are all high (> 0.7), including two, or all three, of these variables would not significantly improve the predicting power of the model compared compared to using just one. In this case, we will use only the imdb_rating variable which has the highest correlation values.

Next, since the variable imdb_num_votes was not strongly correlated with the other variables, we will look at the distribution of this variable:

ggplot(data = movies, aes (x = imdb_num_votes)) + 
  geom_histogram(binwidth = 30000, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: imdb_num_votes" )

summary(movies$imdb_num_votes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4546   15116   57533   58300  893008

As can be seen from the above histogram, the data for imdb_num_votes is strongly right skewed. Since non-normal distributions (like the strongly skewed data above) can be problematic when is comes to making inferences about the population, it is sensible to transform this data. We will do this using a log transformation:

ggplot(data = movies, aes (x = log10(imdb_num_votes))) + 
  geom_histogram(binwidth = 0.25, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Log Transformed Variable: log10 of IMDB Votes" )

After the log transformation, the imdb_rating variable can now be seen to be nearly normal, and therefore suitable to use in the model.

Continuing, we will be viewing the critic scores to see where our average is:

ggplot(data = movies, aes (x = critics_score)) + 
  geom_histogram(binwidth = 5, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: Critics Score" )

summary(movies$critics_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

The last distribution to look at is for the other numeric variable, imdb_rating. This can be seen below:

ggplot(data = movies, aes (x = imdb_rating)) + 
  geom_histogram(binwidth = 0.3, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: imdb_rating" )

summary(movies$imdb_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000

This distribution can be seen to be nearly normal with a weak to moderate left skew. As we have a large sample size (n > 30), this skew should not affect any inferences made regarding the population.

For a linear regression model to be valid, the response and explanatory variables need to have a linear dependence. This can be checked using scatter plots as seen below:

ggplot(data = movies, aes (x = imdb_rating, y = runtime)) + 
  geom_jitter(color = 'darkblue') + 
  geom_smooth(method = 'lm', formula = y~x, color = 'red') + 
  labs(title = "Scatter Plot of runtime Vs imdb_rating" )

ggplot(data = movies, aes (x = log10(imdb_num_votes), y = runtime)) + 
  geom_point(color = 'darkblue') + 
  geom_smooth(method = 'lm', formula = y~x, color = 'red') + 
  labs(title = "Scatter Plot of runtime Vs log10(imdb_num_votes)" )

In this case, both of these explanatory variables appear to have a linear relationship with the response variable runtime. It is noticeable, however, that the imdb_rating variable does not have a constant variation of data which may pose as a problem for the model and will be discussed further in the modelling section.

Part 4: Modeling

For the model, a backward selection, p-value criteria model selection process will be used. This is because it is a more time efficient method than the \(R^{2}\) criteria and forward selection process when there are numerous variables in the model. As the model results will be used as more of a guideline rather than a strict value - a significance level of alpha = 0.2 will be used.

The full model and summary statistics are given below:

full_model = lm(runtime ~ imdb_rating + 
                  log10(imdb_num_votes) + 
                  title_type + 
                  genre + 
                  mpaa_rating +  
                  best_pic_nom + 
                  best_pic_win + 
                  top200_box, data = movies) 

summary(full_model)

## 
## Call:
## lm(formula = runtime ~ imdb_rating + log10(imdb_num_votes) + 
##     title_type + genre + mpaa_rating + best_pic_nom + best_pic_win + 
##     top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.308 -10.517  -1.905   8.563 168.396 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     54.7226     9.5546   5.727 1.58e-08 ***
## imdb_rating                      2.9745     0.8183   3.635 0.000301 ***
## log10(imdb_num_votes)            3.8133     1.2290   3.103 0.002004 ** 
## title_typeFeature Film           0.5568     6.4050   0.087 0.930751    
## title_typeTV Movie              -4.3478     9.9744  -0.436 0.663065    
## genreAnimation                  -6.6099     6.6495  -0.994 0.320581    
## genreArt House & International  -1.7858     5.2476  -0.340 0.733736    
## genreComedy                     -5.9823     2.8423  -2.105 0.035709 *  
## genreDocumentary                -9.0596     6.8184  -1.329 0.184433    
## genreDrama                       4.7180     2.5318   1.864 0.062858 .  
## genreHorror                     -9.1438     4.2303  -2.162 0.031033 *  
## genreMusical & Performing Arts   8.4133     5.8602   1.436 0.151597    
## genreMystery & Suspense          4.5845     3.1713   1.446 0.148777    
## genreOther                       4.7621     4.8557   0.981 0.327109    
## genreScience Fiction & Fantasy  -0.5089     6.0450  -0.084 0.932936    
## mpaa_ratingNC-17                 6.0170    12.8570   0.468 0.639951    
## mpaa_ratingPG                   11.4946     4.6769   2.458 0.014251 *  
## mpaa_ratingPG-13                17.5688     4.8200   3.645 0.000290 ***
## mpaa_ratingR                    12.6217     4.6569   2.710 0.006906 ** 
## mpaa_ratingUnrated              18.5272     5.3168   3.485 0.000527 ***
## best_pic_nomyes                 12.8459     4.3335   2.964 0.003149 ** 
## best_pic_winyes                 15.0925     7.3899   2.042 0.041538 *  
## top200_boxyes                    9.6923     4.6475   2.085 0.037430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.96 on 627 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.2396 
## F-statistic:  10.3 on 22 and 627 DF,  p-value: < 2.2e-16

Since both levels of the (Categorical) variable title_type have high p-values, title_type will be removed from the model and model 2 will be computed:

model_2 = lm(runtime ~ imdb_rating + 
               log10(imdb_num_votes)  + 
               genre + mpaa_rating +  
               best_pic_nom + 
               best_pic_win + 
               top200_box, data = movies)

summary((model_2))

## 
## Call:
## lm(formula = runtime ~ imdb_rating + log10(imdb_num_votes) + 
##     genre + mpaa_rating + best_pic_nom + best_pic_win + top200_box, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.173 -10.491  -2.048   8.365 168.514 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     54.9507     7.2606   7.568 1.35e-13 ***
## imdb_rating                      2.9846     0.8067   3.700 0.000235 ***
## log10(imdb_num_votes)            3.8646     1.2080   3.199 0.001448 ** 
## genreAnimation                  -6.5926     6.6410  -0.993 0.321234    
## genreArt House & International  -1.6446     5.2269  -0.315 0.753141    
## genreComedy                     -5.9777     2.8365  -2.107 0.035477 *  
## genreDocumentary                -9.3389     4.2212  -2.212 0.027298 *  
## genreDrama                       4.6810     2.5224   1.856 0.063954 .  
## genreHorror                     -9.0712     4.2229  -2.148 0.032086 *  
## genreMusical & Performing Arts   8.3101     5.5776   1.490 0.136754    
## genreMystery & Suspense          4.6016     3.1662   1.453 0.146620    
## genreOther                       4.4771     4.8242   0.928 0.353733    
## genreScience Fiction & Fantasy  -0.4891     6.0373  -0.081 0.935463    
## mpaa_ratingNC-17                 6.1258    12.8390   0.477 0.633441    
## mpaa_ratingPG                   11.5561     4.6699   2.475 0.013601 *  
## mpaa_ratingPG-13                17.6191     4.8131   3.661 0.000273 ***
## mpaa_ratingR                    12.6281     4.6509   2.715 0.006806 ** 
## mpaa_ratingUnrated              18.2300     5.2823   3.451 0.000596 ***
## best_pic_nomyes                 12.8742     4.3278   2.975 0.003044 ** 
## best_pic_winyes                 15.0344     7.3799   2.037 0.042046 *  
## top200_boxyes                    9.6716     4.6411   2.084 0.037573 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.93 on 629 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2649, Adjusted R-squared:  0.2416 
## F-statistic: 11.34 on 20 and 629 DF,  p-value: < 2.2e-16

After this iteration, all but one of the variables appear significant predictors of the movie running time (the categorical variables all have one or more levels which are highly significant). The only one that does not is the best_pic_win variable. This variable, however, is very close to the critical p-value (in this case, p-value = 0.2) and, at the authors discretion, deemed important and will remain in the model. Each coefficient in this model represents an increase or decrease in minutes for a predicted runtime; if the variable is numerical, this increase/decrease is for per unit of that variable. If the variable is categorical, then all but one of the levels of that variable will be set to 0 and the increase/decrease from remaining level will be added to the model.

For the model to provide valid results, there are certain conditions that need to be met. These are:

Linear relationships between (numerical) x and y
Nearly normal residuals with mean 0
Constant variability of residuals
Independent residuals

The first condition has already been verified by the previous scatter plots with the numerical variables. The condition for nearly normal residuals can be checked using histogram and/or Q-Q plot of the residuals:

ggplot(data = model_2, aes(x = model_2$residuals)) + 
  geom_histogram(color = 'darkgray', fill = 'darkblue', binwidth = 5)  + 
  labs(x = "residuals", title = "Distribution of Residuals from LRM" )

qqnorm(model_2$residuals, col = 'blue') 
qqline(model_2$residuals, col = 'red')

summary(model_2$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -57.173 -10.491  -2.048   0.000   8.365 168.514

From the summary statistics the mean of the residuals can be seen to equal zero. However, the histogram shows that the residuals are somewhat right skewed, due to a small proportion of outliers. The Q-Q plot also indicates an almost normal distribution, with some deviation from normality occurring on the right hand side, indicating again that the data is right skewed. As both plots show a nearly normal distribution, with the skew that is occurs appearing to be for more extreme values, we can be satisfied condition 2 has been met.

The variability of residuals can be checked by plotting the residuals against the predicted values from the model as seen below:

plot(model_2$residuals ~ model_2$fitted, 
     main = 'Plot of Residuals vs Model Prediction from MLR Model', 
     xlab = 'prediction', 
     ylab = 'residuals', 
     col = 'blue')

It is quite clear from this plot that the variability of residuals is not constant. The implications of this are that the values for the standard error are not entirely accurate and could therefore affect the accuracy of confidence intervals and the outcome of hypothesis tests from the model. If the model was altered to take into account the non-constancy that is present, then this would resolve this issue, however, this is outwith the scope of this project.

To check if we have independent residuals (condition 4), we can plot the residuals as they appear in the data which would reveal any time series dependence:

plot(model_2$residuals, 
     main = 'Plot of Residuals for MLR Model', 
     xlab = 'index', 
     ylab = 'residuals', 
     col = 'blue')

As can be seen from this plot, the residuals are distributed randomly, and are, therefore, independent. This is to be expected since the movie data was sampled randomly.

Part 5: Prediction

The model can now be used to predict the running time of a movie not present in the sample data. The chosen test movie is Arrival (2016). The information for this movie was found IMDB, Rotten Tomatoes, and Box Office Mojo.

As mentioned previously, a 80% confidence interval will be used for the model prediction:

arrival = data.frame(genre = 'Science Fiction & Fantasy' , 
                     mpaa_rating = 'PG-13', 
                     imdb_rating = 7.9, 
                     imdb_num_votes = 454979, 
                     best_pic_nom = 'yes', 
                     best_pic_win = 'no', 
                     top200_box = 'yes', 
                     stringsAsFactors=FALSE)

predict(model_2, 
        arrival, 
        interval = "prediction", 
        level = 0.8)

##        fit      lwr      upr
## 1 140.0708 115.9298 164.2119

Therefore, from the model, we are 80% confident that the runtime of the movie Arrival is between 116 and 164 minutes.

Part 6: Conclusion

In this instance, it is possible to check if the prediction of the model is correct or not. With the movie Arrival having a running time of 116 minutes, the prediction in this case was correct. Overall we can conclude that the explanatory variables explored during this project are significant predictors of runtime, and can be used to predict runtime to a certain degree of accuracy. There are a couple of points to note however. Firstly, the confidence level for the given interval may be lower than stated in reality due to the variability in the residuals. Secondly, the Adjusted \(R^{2}\) value for the model was given to be 0.24. This means that approx. 76% of the variability of the data is not explained by the model, the result of which is a model with poor predicting capabilities.

Future research, and improvements to the model, could stem from learning how to transform the data appropriately so that the non-constancy of the data is negated in the model.