Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

setwd("~/coursera/probability and statistics/projectLM")
load("movies.Rdata")
movies<-movies[complete.cases(movies[,c(3,4,16,18,20,21,22,23)]),c(3,4,16,18,20,21,22,23)]

Part 1: Data

The dataset contains information about movies in Rotten Tomatoes and IMDB. There are 651 randomly sampled movies produced and released before 2016. There are 32 available variables. With this dataset and for the purpose of this project it is only possible to do an observational study and no causal analysis is done.The study can generalize to movies produced and released before 2016.

We considered that some of the variables are irrelevant to the purpose of identifying the popularity of a movie: the Link to IMDB page for the movie and the Link to Rotten Tomatoes page for the movie.

Data preprocessing

The dataset, with the remaining variables, has only one missing value, we decided to omit the corresponding observation from the analysis.

Part 2: Research question

For this study we want to determine how does the type of movie, runtime, awards obtained (actors, director and picture) and critics score relates to the popularity of a movie.

Answering this question con provide some insights to determine what characteristics are more influential to establish the popularity of a movie.

Part 3: Exploratory data analysis

The movies in the dataset are divided into 10 categories and and extra category named “other”. For the exploratory analysis we joined some of the smallest categories into the “other” category, in Figure 2.a) it is straightforward to determine that the majority of the movies,305 (Figure 1), are categorized as drama movies. Also the most common scenario is that where the majority of movies didn’t win an award as depicted in Figure 2.b); this scenario was also present in the actors, actresses and director awards variables (Figure 2 c,d and e).

summary(movies)

##                 genre        runtime      critics_score    audience_score 
##  Drama             :305   Min.   : 39.0   Min.   :  1.00   Min.   :11.00  
##  Comedy            : 87   1st Qu.: 92.0   1st Qu.: 33.00   1st Qu.:46.00  
##  Action & Adventure: 65   Median :103.0   Median : 61.00   Median :65.00  
##  Mystery & Suspense: 59   Mean   :105.8   Mean   : 57.65   Mean   :62.35  
##  Documentary       : 51   3rd Qu.:115.8   3rd Qu.: 83.00   3rd Qu.:80.00  
##  Horror            : 23   Max.   :267.0   Max.   :100.00   Max.   :97.00  
##  (Other)           : 60                                                   
##  best_pic_win best_actor_win best_actress_win best_dir_win
##  no :643      no :557        no :578          no :607     
##  yes:  7      yes: 93        yes: 72          yes: 43     
##                                                           
##                                                           
##                                                           
##                                                           
##

Figure 1. Summary statistics of genre,runtime,critics score, audience score, best picture award, best actor award, best actress award and best director award.

layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
barplot(summary(movies$genre,maxsum=6),main="a) Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other")
barplot(summary(movies$best_pic_win),main="b) Whether or not the movie won a best picture Oscar (no, yes)")
barplot(summary(movies$best_actor_win),main="c) Whether or not one of the main actors in the movie ever won an Oscar (no, yes)")
barplot(summary(movies$best_actress_win),main="d) Whether or not one of the main actresses in the movie ever won an Oscar (no, yes) - \n not that this is not necessarily whether the actresses won an Oscar for their role in the given movie")
barplot(summary(movies$best_dir_win),main="e) Whether or not the director of the movie ever won an Oscar (no, yes) - \n not that this is not necessarily whether the director won an Oscar for the given movie")

Figure 2. Barplots for Genre of movies, best picture award, main actors award, main actresses award, director award.

In the case of the numerical variables we used histograms and barplots (Figure 3). The average runtime of a movie is around 105 and its distribution is slightly right skewed (Figure 3 a and b), with some outliers movies with runtime around 250 minutes. The critics scores are nearly uniformly distributed, but slightly left skewed(Figure 3 c and d). The score given by the audience has a similar behavior has the critics scores, with a nearly uniform left skewed distribution.

layout(matrix(c(1,2,3,4,5,6), 3, 2, byrow = TRUE))
boxplot(movies$runtime,xlab="Runtime",main="a) Boxplot of movies runtime")
hist(movies$runtime, xlab="Runtime",main="b) Histogram of movies runtime")
boxplot(movies$critics_score,xlab="Critics score",main="c) Boxplot of critics score")
hist(movies$critics_score, xlab="Critics score",main="d) Histogram of critics score")
boxplot(movies$audience_score,xlab="Audience score",main="e) Boxplot of audience score")
hist(movies$audience_score, xlab="Audience score",main="f) Histogram of audience score")

Figure 3. Barplots and histograms of movies runtime, critics score and audience score.

Part 4: Modeling

In this project we will use linear regression and start by fitting a model with 8 variables(described in the previous section). Backward elimination will help us to define if better results can be obtained by using a smaller set of attributes. The advantage of backward elimination is that it allows to start with all the variables, deleting one variable at a time until there are no improvements in the model.

First, let’s fit an initial model with the 8 variables. The adjusted R-squared is 0.5223 and the least significant variable is whether or not the movie director won an oscar; then we will exclude this variable for the next model.

model<-lm(audience_score~.,data=movies)
summary(model)

## 
## Call:
## lm(formula = audience_score ~ ., data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.006  -9.286   0.315   9.441  41.169 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    27.12009    3.76375   7.206 1.65e-12 ***
## genreAnimation                  6.18849    5.01631   1.234 0.217782    
## genreArt House & International  5.73409    4.13645   1.386 0.166163    
## genreComedy                    -0.43313    2.31466  -0.187 0.851622    
## genreDocumentary                9.47933    2.82449   3.356 0.000838 ***
## genreDrama                      1.92732    1.98343   0.972 0.331567    
## genreHorror                    -8.29618    3.41550  -2.429 0.015419 *  
## genreMusical & Performing Arts  9.76812    4.46788   2.186 0.029159 *  
## genreMystery & Suspense        -4.02438    2.55797  -1.573 0.116156    
## genreOther                      2.09402    3.94544   0.531 0.595782    
## genreScience Fiction & Fantasy -6.61934    4.98306  -1.328 0.184535    
## runtime                         0.07925    0.03241   2.445 0.014745 *  
## critics_score                   0.44860    0.02202  20.371  < 2e-16 ***
## best_pic_winyes                 6.15503    5.74526   1.071 0.284433    
## best_actor_winyes              -1.13203    1.64849  -0.687 0.492520    
## best_actress_winyes            -1.62368    1.83069  -0.887 0.375457    
## best_dir_winyes                -0.41490    2.40725  -0.172 0.863215    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.99 on 633 degrees of freedom
## Multiple R-squared:  0.534,  Adjusted R-squared:  0.5223 
## F-statistic: 45.34 on 16 and 633 DF,  p-value: < 2.2e-16

Instead of doing the previous step n times until having a good model, we will make use of the step function.

small_model<-step(model, direction = "both", trace=FALSE ) 
summary(small_model)

## 
## Call:
## lm(formula = audience_score ~ genre + runtime + critics_score, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.151  -9.586   0.525   9.568  41.424 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    27.37145    3.59290   7.618 9.35e-14 ***
## genreAnimation                  5.94345    4.99993   1.189 0.234998    
## genreArt House & International  5.74324    4.12223   1.393 0.164035    
## genreComedy                    -0.53569    2.29935  -0.233 0.815856    
## genreDocumentary                9.45211    2.81096   3.363 0.000818 ***
## genreDrama                      1.68250    1.96516   0.856 0.392229    
## genreHorror                    -8.22215    3.40868  -2.412 0.016142 *  
## genreMusical & Performing Arts  9.77902    4.45744   2.194 0.028606 *  
## genreMystery & Suspense        -4.38861    2.53277  -1.733 0.083628 .  
## genreOther                      1.79864    3.93336   0.457 0.647627    
## genreScience Fiction & Fantasy -6.54296    4.97258  -1.316 0.188711    
## runtime                         0.07482    0.03040   2.462 0.014097 *  
## critics_score                   0.45007    0.02175  20.697  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.97 on 637 degrees of freedom
## Multiple R-squared:  0.5323, Adjusted R-squared:  0.5235 
## F-statistic: 60.42 on 12 and 637 DF,  p-value: < 2.2e-16

The smaller model has only 3 variables but it has a slightly larger adjusted R-squared of 0.5235. Then, using only a fraction of the variables this model captures almost the same variability (53%) than the full model. The genre and the critics score variables are the most significant variables.

plot(small_model$residuals~movies$runtime,xlab="Runtime",ylab="Residuals",main="a) Residuals vs. Runtime")

plot(small_model$residuals~movies$critics_score,xlab="Critics score",ylab="Residuals",main="b) Residuals vs critics score")

Figure 4. Residuals plots for Runtime and critics score. The residuals seem scattered around 0 for both Runtime and Critics score (Figure 4 a and b).

hist(small_model$residuals,ylab="Residuals",main="a) Histogram of smalll_model residuals")

qqnorm(small_model$residuals,main="b) Normal probability of residuals")
qqline(small_model$residuals)

Figure 5. Histogram and Q-Q plot of residuals.

On observation of the histogram of residuals in Figure 5 a there is a little skewness in the residuals, however in Figure 5 b there are deviations only in the tail of the graph.

plot(small_model$residuals~small_model$fitted,main="Residuals vs. fitted")
abline(0,0)

Figure 6. Residuals plot

The residuals seem to be generally homoscedastic. However, there is some degree of heteroscedasticity in the right end of Figure 6; then the model will be less accurate when predicting higher values.

Part 5: Prediction

We wanted to predict the audience score for a new movie that has not been used to fit the model. For the movie “Teenage Mutant Ninja Turtles: Out of the Shadows(2016)” we extracted the runtime, critics score and audience score from IMDB and Rotten tomatoes,the genre was set to “Science Fiction & Fantasy”. The real and the predicted audience score were 55 and 45.9 respectively, a confidence interval between 16.9 and 74.8.. With a 95% confidence interval if repeated samples were taken 95% of the samples would contain the population mean that can be between 16.9 and 74.8.

new_movie<-data.frame(genre="Science Fiction & Fantasy",runtime=112,critics_score=37)
predict(small_model,new_movie)

##        1 
## 45.86104

predict(small_model, new_movie, interval="predict")

##        fit      lwr      upr
## 1 45.86104 16.93388 74.78821

Part 6: Conclusion

Determining the popularity of a movie is not simple task. The intrinsic characteristics of a movie seem to have some degree of correlation with the popularity of a movie. However, external attributes, like critics score , also seem to be correlated with its popularity. Then, the movie genre, runtime and critics score can be used to predict the popularity of a movie. One important shortcoming of the current approach is the fact that we only used a subset of the features for the initial model, a better model could be trained using the whole set of features or a combination of some features. Another possible path for future research is to use external features independent of IMDN and Rotten tomatoes.