Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)
library(DAAG)
library(devtools)

Load data

load("movies.Rdata")

Part 1: Data

The “movies” dataset for this project includes 651 randomly sampled movies released in the United States between 1970 and 2016. The data appear to be a simple random sample drawn without stratification: the number of movies varies by theater release year, and the genre proportions were not fixed to match those of the population. Because no such controls were applied, the sample should not be assumed to closely mirror the actual distribution of movies across the given time period.

The study does not involve random assignment of movies to the factors under consideration, so the data come from an observational study rather than an experiment. This limits the inferences that we can draw from the data set.

When working with this data, it is important to keep in mind that there are likely biases and lurking variables that could complicate identifying relationships between the variables. Because of this, we should check how the variables are related to each other before drawing conclusions. With that in mind, throughout this project we rule out causal claims and use the data only to describe associations.

Despite these limitations, the data can still help us form hypotheses, and some of the trends in it can be informative about how movies and the movie industry have changed over the years.

Part 2: Research question

Q: Are the ratings of a movie, as well as the genre, tied to the runtime of the movie?

Therefore, the response variable is runtime, and the explanatory variables are those linked either to the genre and type of the movie or to its ratings.

This analysis could reveal whether overall film length is closely tied to a movie's genre and to its ratings.

Part 3: Exploratory data analysis

Below is a list of variables that will be focused on throughout the course of this project, as well as what category of variable they fall into:

Ratio:

runtime (response variable)
imdb_rating
imdb_num_votes
critics_score
audience_score

Interval:

thtr_rel_year
thtr_rel_month
thtr_rel_day

Categorical:

genre
best_pic_nom
best_pic_win
top200_box

We will not be focusing on ordinal data at all with this data set.

To begin, we will look at the distribution and summary statistics of the response variable (runtime) as a baseline for the rest of the analysis:

ggplot(data = movies, aes(x = runtime)) + 
  geom_histogram(binwidth = 15, color='darkgray', fill  = 'blue') + 
  labs(title = "Response Variable: Movie Length" )

summary(movies$runtime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    39.0    92.0   103.0   105.8   115.8   267.0       1

From this data, we can see that the distribution is roughly bell-shaped and slightly right skewed, with one possible outlier on the far right. The mean runtime is about 106 min and the median is 103 min, with a maximum of 267 min and a minimum of 39 min. One movie has a missing runtime.

Next, we will check how the numerical variables are related to one another using a pair plot of the corresponding columns of the data set:
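The pair plot is not reproduced here. As a rough numeric companion, the pairwise correlations can be tabulated with base R's cor(). The sketch below runs on a small simulated data frame, toy, whose column names mirror numeric columns of movies. The simulated values are illustrative assumptions only; in practice movies would be used in place of toy.

```r
# Hypothetical stand-in for the numeric columns of `movies` (simulated values).
set.seed(1)
n <- 100
toy <- data.frame(
  runtime        = round(rnorm(n, mean = 105, sd = 20)),
  imdb_rating    = round(runif(n, min = 2, max = 9), 1),
  imdb_num_votes = round(rlnorm(n, meanlog = 9.5, sdlog = 1.5))
)
# Make critics_score track imdb_rating, clipped to the 1-100 range.
toy$critics_score <- pmin(100, pmax(1, round(10 * toy$imdb_rating + rnorm(n, sd = 10))))
toy$runtime[5] <- NA   # mimic the single missing runtime in the real data

num_vars <- c("runtime", "imdb_rating", "imdb_num_votes", "critics_score")

# Pairwise-complete correlations, so one NA does not blank out a whole row.
cor_mat <- cor(toy[, num_vars], use = "pairwise.complete.obs")
print(round(cor_mat, 2))
```

Using use = "pairwise.complete.obs" matters here because runtime has one missing value; with the default setting, every correlation involving runtime would come back as NA.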

Next, since the variable imdb_num_votes was not strongly correlated with the other variables, we will look at the distribution of this variable:

ggplot(data = movies, aes (x = imdb_num_votes)) + 
  geom_histogram(binwidth = 30000, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: imdb_num_votes" )

summary(movies$imdb_num_votes)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     180    4546   15116   57533   58300  893008

As can be seen from the above histogram, the data for imdb_num_votes is strongly right skewed. Since strongly skewed distributions like this one can be problematic when it comes to making inferences about the population, it is sensible to transform this data. We will do this using a log (base 10) transformation:

ggplot(data = movies, aes (x = log10(imdb_num_votes))) + 
  geom_histogram(binwidth = 0.25, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Log Transformed Variable: log10 of IMDB Votes" )
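One rough way to see the effect of the transformation numerically is to compare the mean-to-median ratio before and after taking logs. In the summary above, the raw ratio is 57533 / 15116, or about 3.8, which signals a long right tail. The sketch below repeats that comparison on simulated data; the votes vector is a hypothetical stand-in for movies$imdb_num_votes, not the real values.

```r
# Simulated, strictly positive, right-skewed counts (an assumption for illustration).
set.seed(42)
votes <- round(rlnorm(651, meanlog = 9.6, sdlog = 1.3)) + 1

# For a symmetric distribution the mean and median are close, so the ratio is near 1.
ratio_raw <- mean(votes) / median(votes)
ratio_log <- mean(log10(votes)) / median(log10(votes))

print(round(c(raw = ratio_raw, log10 = ratio_log), 3))
```

A ratio close to 1 after the transformation is consistent with a more symmetric distribution, which is what the log-scale histogram shows.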

Next, we look at the distribution of critics_score across all movies before relating it to runtime:

ggplot(data = movies, aes (x = critics_score)) + 
  geom_histogram(binwidth = 5, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: Critics Score" )

summary(movies$critics_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

The last distribution to look at is that of the other numeric variable, imdb_rating, again to see where ratings fall across all movies before relating them to runtime:

ggplot(data = movies, aes (x = imdb_rating)) + 
  geom_histogram(binwidth = 0.3, color='darkgray', fill  = 'blue') + 
  labs(title = "Distribution of Variable: imdb_rating" )

summary(movies$imdb_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000

This distribution is nearly normal with a weak left skew. Since the sample is large (n = 651, well above 30), this mild skew should not materially affect inferences about the population.

To ensure that a linear regression model is appropriate, we need to check that the numerical explanatory variables have a roughly linear relationship with the response. The scatter plots below show runtime against imdb_rating and against critics_score:

ggplot(data = movies, aes (x = imdb_rating, y = runtime)) + 
  geom_jitter(color = 'blue') + 
  geom_smooth(method = 'lm', formula = y~x, color = 'red') + 
  labs(title = "Scatter Plot of runtime Vs imdb_rating" )

ggplot(data = movies, aes (x = critics_score, y = runtime)) + 
  geom_jitter(color = 'blue') + 
  geom_smooth(method = 'lm', formula = y~x, color = 'red') + 
  labs(title = "Scatter Plot of runtime Vs critics score" )

From these plots, we can see a roughly linear relationship between runtime and imdb_rating. It is important to point out that there appear to be outliers in the scatter plot that could influence the fit. The relationship between runtime and critics_score also appears to be linear but weaker. Note that the two scores are on different scales (roughly 1 to 10 versus 1 to 100), so the slopes of the fitted lines are not directly comparable.

Part 4: Modeling

For the model, I will use backward selection with a p-value criterion. With this many candidate variables, it is more time-efficient than selection based on adjusted R-squared or than forward selection.
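One way such a procedure could be automated is with drop1(), which reports an F-test p-value for removing each term as a whole, so a factor such as genre is kept or dropped as a unit. The helper backward_p and the simulated toy data below are hypothetical illustrations under that assumption, not the code used to produce the models in this report.

```r
# Hypothetical helper: backward elimination using term-level F-test p-values.
backward_p <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left.
    if (length(attr(terms(model), "term.labels")) == 0) break
    tab        <- drop1(model, test = "F")
    pvals      <- tab[["Pr(>F)"]][-1]   # row 1 is "<none>", so skip it
    term_names <- rownames(tab)[-1]
    if (max(pvals) <= alpha) break      # every remaining term is significant
    worst <- term_names[which.max(pvals)]
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Simulated example: y depends on x1 and the factor g, but not on noise.
set.seed(7)
n <- 200
toy <- data.frame(
  x1    = rnorm(n),
  noise = rnorm(n),
  g     = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
toy$y <- 3 * toy$x1 + 2 * (toy$g == "b") + rnorm(n)

final <- backward_p(lm(y ~ x1 + noise + g, data = toy))
print(formula(final))
```

Testing whole terms differs from reading the per-level p-values in summary(): a factor can have several non-significant levels and still be worth keeping if at least one level matters.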

The full model and summary statistics are given below:

full_model = lm(runtime ~ imdb_rating + 
                  log10(imdb_num_votes) + 
                  title_type + 
                  genre + 
                  mpaa_rating +  
                  best_pic_nom + 
                  best_pic_win + 
                  top200_box, data = movies) 

summary(full_model)

## 
## Call:
## lm(formula = runtime ~ imdb_rating + log10(imdb_num_votes) + 
##     title_type + genre + mpaa_rating + best_pic_nom + best_pic_win + 
##     top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.308 -10.517  -1.905   8.563 168.396 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     54.7226     9.5546   5.727 1.58e-08 ***
## imdb_rating                      2.9745     0.8183   3.635 0.000301 ***
## log10(imdb_num_votes)            3.8133     1.2290   3.103 0.002004 ** 
## title_typeFeature Film           0.5568     6.4050   0.087 0.930751    
## title_typeTV Movie              -4.3478     9.9744  -0.436 0.663065    
## genreAnimation                  -6.6099     6.6495  -0.994 0.320581    
## genreArt House & International  -1.7858     5.2476  -0.340 0.733736    
## genreComedy                     -5.9823     2.8423  -2.105 0.035709 *  
## genreDocumentary                -9.0596     6.8184  -1.329 0.184433    
## genreDrama                       4.7180     2.5318   1.864 0.062858 .  
## genreHorror                     -9.1438     4.2303  -2.162 0.031033 *  
## genreMusical & Performing Arts   8.4133     5.8602   1.436 0.151597    
## genreMystery & Suspense          4.5845     3.1713   1.446 0.148777    
## genreOther                       4.7621     4.8557   0.981 0.327109    
## genreScience Fiction & Fantasy  -0.5089     6.0450  -0.084 0.932936    
## mpaa_ratingNC-17                 6.0170    12.8570   0.468 0.639951    
## mpaa_ratingPG                   11.4946     4.6769   2.458 0.014251 *  
## mpaa_ratingPG-13                17.5688     4.8200   3.645 0.000290 ***
## mpaa_ratingR                    12.6217     4.6569   2.710 0.006906 ** 
## mpaa_ratingUnrated              18.5272     5.3168   3.485 0.000527 ***
## best_pic_nomyes                 12.8459     4.3335   2.964 0.003149 ** 
## best_pic_winyes                 15.0925     7.3899   2.042 0.041538 *  
## top200_boxyes                    9.6923     4.6475   2.085 0.037430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.96 on 627 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2654, Adjusted R-squared:  0.2396 
## F-statistic:  10.3 on 22 and 627 DF,  p-value: < 2.2e-16

model_x = lm(runtime ~ imdb_rating + 
               log10(imdb_num_votes)  + 
               genre + mpaa_rating +  
               best_pic_nom + 
               best_pic_win + 
               top200_box, data = movies)

summary(model_x)

## 
## Call:
## lm(formula = runtime ~ imdb_rating + log10(imdb_num_votes) + 
##     genre + mpaa_rating + best_pic_nom + best_pic_win + top200_box, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.173 -10.491  -2.048   8.365 168.514 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     54.9507     7.2606   7.568 1.35e-13 ***
## imdb_rating                      2.9846     0.8067   3.700 0.000235 ***
## log10(imdb_num_votes)            3.8646     1.2080   3.199 0.001448 ** 
## genreAnimation                  -6.5926     6.6410  -0.993 0.321234    
## genreArt House & International  -1.6446     5.2269  -0.315 0.753141    
## genreComedy                     -5.9777     2.8365  -2.107 0.035477 *  
## genreDocumentary                -9.3389     4.2212  -2.212 0.027298 *  
## genreDrama                       4.6810     2.5224   1.856 0.063954 .  
## genreHorror                     -9.0712     4.2229  -2.148 0.032086 *  
## genreMusical & Performing Arts   8.3101     5.5776   1.490 0.136754    
## genreMystery & Suspense          4.6016     3.1662   1.453 0.146620    
## genreOther                       4.4771     4.8242   0.928 0.353733    
## genreScience Fiction & Fantasy  -0.4891     6.0373  -0.081 0.935463    
## mpaa_ratingNC-17                 6.1258    12.8390   0.477 0.633441    
## mpaa_ratingPG                   11.5561     4.6699   2.475 0.013601 *  
## mpaa_ratingPG-13                17.6191     4.8131   3.661 0.000273 ***
## mpaa_ratingR                    12.6281     4.6509   2.715 0.006806 ** 
## mpaa_ratingUnrated              18.2300     5.2823   3.451 0.000596 ***
## best_pic_nomyes                 12.8742     4.3278   2.975 0.003044 ** 
## best_pic_winyes                 15.0344     7.3799   2.037 0.042046 *  
## top200_boxyes                    9.6716     4.6411   2.084 0.037573 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.93 on 629 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2649, Adjusted R-squared:  0.2416 
## F-statistic: 11.34 on 20 and 629 DF,  p-value: < 2.2e-16

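In the full model, neither level of title_type is significant (p = 0.93 and p = 0.66), so title_type is dropped to give model_x. In model_x, every remaining variable has at least one coefficient that is significant at the 5% level, so all of them are kept. The adjusted R-squared rises slightly, from 0.2396 to 0.2416.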

There are certain conditions that need to be met. These are:

Linear relationships between (numerical) x and y
Nearly normal residuals with mean 0
Constant variability of residuals
Independent residuals

The first condition has already been checked with the previous scatter plots of the numerical variables. The condition of nearly normal residuals can be checked using a histogram and/or a Q-Q plot of the residuals:

ggplot(data = data.frame(residuals = model_x$residuals), aes(x = residuals)) + 
  geom_histogram(color = 'darkgray', fill = 'darkblue', binwidth = 5)  + 
  labs(x = "residuals", title = "Distribution of Residuals" )

qqnorm(model_x$residuals, col = 'blue') 
qqline(model_x$residuals, col = 'red')

summary(model_x$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -57.173 -10.491  -2.048   0.000   8.365 168.514

The histogram shows that the residuals are somewhat right skewed, due to a small number of large positive outliers (the largest residual is about 168 minutes). In the Q-Q plot, the points form a relatively straight line apart from those outliers, which indicates that the residuals are approximately normal. The residuals also have a mean of 0, so we treat the second condition as reasonably met, with some caution about the outliers.

Next, we are going to check the variability of the residuals:

plot(model_x$residuals ~ model_x$fitted.values, 
     main = 'Plot of Residuals vs Model Prediction', 
     xlab = 'prediction', 
     ylab = 'residuals', 
     col = 'blue')

This plot shows us that the variability of the residuals is not completely constant, which means the standard errors, and therefore the p-values and prediction intervals, may be less reliable than they appear.
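A residual plot can be hard to judge by eye, so one rough numeric check is whether the size of the residuals trends with the fitted values. The sketch below uses simulated data in which the noise deliberately grows with the predictor; the variables x, y and fit are assumptions for illustration, and the same two steps could be applied to model_x.

```r
# Simulated data with non-constant variance: the noise sd grows with x.
set.seed(3)
n <- 500
x <- runif(n, min = 1, max = 10)
y <- 5 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Under constant variance, |residual| should show no trend against the fitted values.
spread_trend <- cor(abs(residuals(fit)), fitted(fit))
print(round(spread_trend, 2))

# The same idea as an auxiliary regression: a clearly non-zero slope
# suggests that the spread changes with the predicted value.
spread_fit <- lm(abs(residuals(fit)) ~ fitted(fit))
print(summary(spread_fit)$coefficients)
```

A correlation near zero is what constant variability would look like; a clearly positive value, as in this simulated case, points to residuals that fan out as predictions increase.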

To check whether the residuals are independent (condition 4), we can plot them in the order in which they appear in the data, which would reveal any dependence on that order, such as a time series pattern:

plot(model_x$residuals, 
     main = 'Plot of Residuals', 
     xlab = 'index', 
     ylab = 'residuals', 
     col = 'blue')

After this check, we can see that the residuals are scattered randomly around zero with no visible pattern, which is consistent with independent residuals and satisfies our 4th condition.

Part 5: Prediction

The model can now be used to predict the running time of a movie not present in the sample data. The chosen test movie is The Interview (2014).

An 80% prediction interval will be used for the model prediction:

interview = data.frame(genre = 'Comedy' , 
                     mpaa_rating = 'R', 
                     imdb_rating = 6.6, 
                     imdb_num_votes = 235529, 
                     best_pic_nom = 'no', 
                     best_pic_win = 'no', 
                     top200_box = 'no', 
                     stringsAsFactors=FALSE)

predict(model_x, 
        interview, 
        interval = "prediction", 
        level = 0.8)

##        fit      lwr      upr
## 1 102.0603 80.12019 124.0003

Therefore, from the model, we predict with 80% confidence that the runtime of The Interview is between about 80 and 124 minutes. The film's actual runtime of 112 minutes falls within this interval.
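The interval above is a prediction interval for a single movie, which is wider than a confidence interval for the mean runtime of all movies with the same characteristics, because it also accounts for movie-to-movie variability. The sketch below shows the difference with predict() on simulated data; toy, fit and new_obs are hypothetical names used only for illustration.

```r
# Simulated data for a simple linear model (illustrative assumption).
set.seed(11)
toy <- data.frame(x = runif(80, min = 0, max = 10))
toy$y <- 50 + 4 * toy$x + rnorm(80, sd = 8)
fit <- lm(y ~ x, data = toy)
new_obs <- data.frame(x = 6)

# Same point estimate, two different kinds of interval.
pi80 <- predict(fit, new_obs, interval = "prediction", level = 0.8)
ci80 <- predict(fit, new_obs, interval = "confidence", level = 0.8)

width_pred <- unname(pi80[, "upr"] - pi80[, "lwr"])
width_conf <- unname(ci80[, "upr"] - ci80[, "lwr"])
print(round(c(prediction = width_pred, confidence = width_conf), 2))
```

In the output for The Interview, the half-width of the interval is about 21.9 minutes, slightly more than 1.28 times the residual standard error of 16.93 (about 21.7), which is what a prediction interval dominated by residual variability would be expected to look like.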

Part 6: Conclusion

In conclusion, we can answer our research question with the following statement:

While we cannot say there is a causal relationship, we have gained more knowledge about how runtime is associated with a movie's ratings and genre. The model explains only about 26% of the variability in runtime (R-squared of 0.2649), so these associations are modest. It is plausible that film producers have become aware of such patterns over the years, but this data cannot confirm that; what we can say is that runtimes in the sample are fairly standardized, with the middle half of movies running between 92 and about 116 minutes.