Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

Movies data set includes information from Rotten Tomatoes and IMDB for a random sample of movies.

651 movies
32 variables

Sample collection methodology: Rotten Tomatoes and IMDB are review-aggregation websites that include information related to films and television programs. The rating for a particular film or television program is the result of critics and audiences reviews.

Data collection method implications: Since the sample was optained from a randomly sample selected movies in Rotten Tomatoes and IMDB, we can reneralized the results to the entire film population from 1970 to 2016 included either in Rotten Tomatoes and IMDB. In addition, it is important to point out that audience reviews comes from volunteers subjects; as volunteers are more likely to participate in the reviews, it can be a source of bias.

Scope of inference: Each film is equally likely to be selected, therefore we are dealing with a large-scale obervational study, the sample is representative of the population. As the groups are not escencially the same (due there is no random assignment), causal conclutions cannot be made.

In short, we have an observational study: not-causal-generalizable.

Part 2: Research question

Research question: We are interested in learning what attributes make a movie popular. For this time we only consider IMDB ratings. We will take the following variables from movies dataset:

imdb_rating: Rating on IMDB.
imdb_num_votes: Number of votes on IMDB.
genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other).
runtime: Runtime of movie (in minutes).
thtr_rel_year: Year the movie is released in theaters.
best_pic_win: Whether or not the movie won a best picture Oscar (no, yes).

Part 3: Exploratory data analysis

Most of the variables are not numerical, therefore only a scatterplot would not be ideal for vizualizing any patterns in the data. Firstly, we would be interested in find out whether a particular film genre is more likely to have higher or lower scores than others. If so genre would be representative in order to estimate IMDB rating.

# Select the variables and remove NA values
best <- movies %>% select(imdb_rating,imdb_num_votes,genre,runtime,thtr_rel_year,best_pic_win) %>% na.omit() 

# Create facet mapped plot to film genre
gg1 <- ggplot(data=best,aes(x=imdb_rating,fill=genre)) + geom_density(alpha=0.4) +
         ggtitle("IMDB Rating Probability Densities") + labs(x="IMDB Rating",y="Kernel Estimate")

gg1 + facet_wrap(~genre)

Comments

At a first glance, there is no such apparent pattern between genre and IMDB Rating. Only Documentaries and Musical and Performing Arts seems to have higher IMDB Rating in average (maybe because of the type of audience that consume those kind of films), but the other categories have roughly the same behavior.
This result should be shown as high p-values in the later linear regression model. In other words, apparently the genre does not affect the rating.
Funny thing: science fiction and fantasy is sort of uniform.

Secondly, runtime of movie (in minutes) could have to do with a higher or lower scores, and also could be more likely to win an Oscar. As both IMDB and runtime are numerical variables, let use a simple scatterplot and add a smoth linear regression. We could use facet_wrap to create a plot of each decade separetely.

Note: In order to facilitate the data manipulation We will consider in 2000's category all the movies from this year to the present.

# Create decade factor and add it to the data set
best <- best %>% mutate(decade=cut(x=best$thtr_rel_year,breaks = c(1970,1980,1990,2000,2020),
    include.lowest = TRUE, right = TRUE,labels=c("70's","80's","90's","2000's"))) 

# Create faced mapped scatterplot to decade
gg2 <- ggplot(data=best,aes(x=runtime,y=imdb_rating)) + 
         geom_point(aes(col=best_pic_win)) + labs(x="Run Time (minutes)",y="IMDB Rating",col="Oscar Winner") +
         geom_smooth(method = "lm") 

gg2 + facet_wrap(~decade,nrow=1)

## `geom_smooth()` using formula 'y ~ x'

Comments

In all the plots the trend is positive: further analysis are required to corroborate the strength of linear association.
Indeed, winning and Oscar is an unique event: there are only a couple of points for each decade. Winning an Oscar is not a guarantee to have a high IMDB rating though.
The IMDB ratings are quite similar trhoughout the dacades of 70’s 80’s and 90’s: from 2000’s low IMDB ratings are more common.

Now let’s check the relationship between the number of votes and the IMDB Rating. We will use a simple scatterplot.

# Create the plot
ggplot(data=best,aes(x=imdb_num_votes,y=imdb_rating)) + geom_point(aes(col=best_pic_win)) +
         labs(x="IMDB Number of Votes",y="IMDB Rating",col="Oscar Winner") + geom_smooth(method="lm")

## `geom_smooth()` using formula 'y ~ x'

Comments

The relationsip between the two variables is positive: further analysis are required to corroborate the strength of linear association.
According as the number of votes increase, the variability in the rating seems to decrease.
High number a votes are less common as well as Oscar Winning Films.
As mention above, audience reviews comes from volunteers subjects: as volunteers are more likely to participate in the reviews, it can be a source of bias.

Part 4: Modeling

Now we can start modeling and confirm what we already figured out in the EDA. For this we will use the backward selection approach, by starting with a full model and then systematically drop terms according to significance of the changes in fit quality.

# Create the linear model
best.fit <- lm(imdb_rating~imdb_num_votes+runtime+genre+best_pic_win+decade,data=best)

summary(best.fit)

## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + runtime + genre + 
##     best_pic_win + decade, data = best)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0173 -0.4088  0.0597  0.5371  2.3980 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.372e+00  2.499e-01  21.496  < 2e-16 ***
## imdb_num_votes                  3.624e-06  3.472e-07  10.438  < 2e-16 ***
## runtime                         5.540e-03  1.983e-03   2.794 0.005368 ** 
## genreAnimation                  1.385e-01  3.060e-01   0.453 0.650855    
## genreArt House & International  9.195e-01  2.528e-01   3.637 0.000299 ***
## genreComedy                    -1.292e-02  1.416e-01  -0.091 0.927339    
## genreDocumentary                2.050e+00  1.637e-01  12.524  < 2e-16 ***
## genreDrama                      7.552e-01  1.184e-01   6.378 3.47e-10 ***
## genreHorror                     1.973e-02  2.086e-01   0.095 0.924672    
## genreMusical & Performing Arts  1.469e+00  2.701e-01   5.440 7.62e-08 ***
## genreMystery & Suspense         5.133e-01  1.547e-01   3.319 0.000957 ***
## genreOther                      4.008e-01  2.403e-01   1.668 0.095860 .  
## genreScience Fiction & Fantasy -2.710e-01  3.048e-01  -0.889 0.374371    
## best_pic_winyes                -9.971e-02  3.477e-01  -0.287 0.774373    
## decade80's                     -4.872e-02  1.386e-01  -0.352 0.725307    
## decade90's                     -3.901e-01  1.289e-01  -3.026 0.002576 ** 
## decade2000's                   -3.718e-01  1.225e-01  -3.034 0.002510 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8537 on 633 degrees of freedom
## Multiple R-squared:  0.3961, Adjusted R-squared:  0.3808 
## F-statistic: 25.95 on 16 and 633 DF,  p-value: < 2.2e-16

With this approach we want to remove all the terms that do not result in a statistically significant detriment of the goodness of fit. In this case we will use drop1 function and then we will choose the term to remove from the model that has the leas significant effect of reducinf the goodness of fit. In other words, the term with the largest p-value for its F-test.

# Apply drop1 funtion withpartial F-test to the model
drop1(best.fit,test="F")

## Single term deletions
## 
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + best_pic_win + 
##     decade
##                Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                      461.29 -188.919                       
## imdb_num_votes  1    79.396 540.68  -87.690 108.9515 < 2.2e-16 ***
## runtime         1     5.688 466.97 -182.953   7.8049 0.0053680 ** 
## genre          10   188.205 649.49   13.492  25.8264 < 2.2e-16 ***
## best_pic_win    1     0.060 461.35 -190.834   0.0822 0.7743726    
## decade          3    13.912 475.20 -175.605   6.3638 0.0002973 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Udate the model by removing the least significant term
best.fit1 <- update(best.fit,.~.-best_pic_win)

We can see that removing the variable best_pic_win, regarding whether or not is an Oscar Winning Film, has the single least significant effect on reducing the goodness of fit, so we will star with update function by removing this term from best.fit. The new model will be named best.fit1. Then we will apply drop1 function one more time and repeat the process.

# Apply drop1 funtion withpartial F-test to the model
drop1(best.fit1,test="F")

## Single term deletions
## 
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + decade
##                Df Sum of Sq    RSS      AIC  F value    Pr(>F)    
## <none>                      461.35 -190.834                       
## imdb_num_votes  1    85.362 546.71  -82.487 117.3074 < 2.2e-16 ***
## runtime         1     5.629 466.98 -184.951   7.7357 0.0055750 ** 
## genre          10   188.227 649.57   11.574  25.8668 < 2.2e-16 ***
## decade          3    13.863 475.21 -177.590   6.3504 0.0003028 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Update the model by removing the least significant term
best.fit2 <- update(best.fit1,.~.-runtime)

Finally we will remove runtime which, even though it has somewhat small p-value, we want to follow the principle of parsimony, which establishes that the simpler model is better. Also, the addition of runtime could bring nothing new to the model. The final model is shown below as bestfit2 We will see that some film genres do have large p-values, as we see it in the Kernel Estimates plot at the beginning od EDA. It is important to recall that we cannot remove a particular film genre from the model, although some genres does not affect the rating, we must treat the variable as a whole. Something similar happens with decade variable.

summary(best.fit2)

## 
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + decade, data = best)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0368 -0.4235  0.0674  0.5717  2.4467 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.936e+00  1.477e-01  40.187  < 2e-16 ***
## imdb_num_votes                  3.924e-06  3.118e-07  12.587  < 2e-16 ***
## genreAnimation                  6.037e-02  3.061e-01   0.197 0.843696    
## genreArt House & International  9.342e-01  2.538e-01   3.681 0.000252 ***
## genreComedy                    -4.029e-02  1.417e-01  -0.284 0.776208    
## genreDocumentary                2.041e+00  1.641e-01  12.441  < 2e-16 ***
## genreDrama                      7.973e-01  1.177e-01   6.774 2.85e-11 ***
## genreHorror                    -3.008e-02  2.088e-01  -0.144 0.885479    
## genreMusical & Performing Arts  1.540e+00  2.700e-01   5.704 1.79e-08 ***
## genreMystery & Suspense         5.473e-01  1.548e-01   3.536 0.000435 ***
## genreOther                      4.225e-01  2.411e-01   1.752 0.080239 .  
## genreScience Fiction & Fantasy -2.917e-01  3.061e-01  -0.953 0.340937    
## decade80's                     -3.766e-02  1.389e-01  -0.271 0.786294    
## decade90's                     -3.963e-01  1.289e-01  -3.073 0.002207 ** 
## decade2000's                   -4.007e-01  1.217e-01  -3.293 0.001046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8576 on 635 degrees of freedom
## Multiple R-squared:  0.3886, Adjusted R-squared:  0.3752 
## F-statistic: 28.83 on 14 and 635 DF,  p-value: < 2.2e-16

Check Conditions

In the residual plot if the variables are linearly associated the plot will show no pattern at all. This is not the case, in the Fitted vs Residuals plot we could see a kind of inverted fan shape pattern, which indicates a poor linear relationship between the response variable and the explanatory varibales.
On the histogram of residuals we look for a nearly normal residuals with mean 0. Indeed the residuals are left skewed, suggesting that the model is not suitable for forecasting. This behavior is also seen i the Normal QQ plot.
The model has not constant variability, then the model will be less accurate when predicting lower rating values.

par(mfrow=c(1,3))
plot(x=best.fit2$fitted.values,y=best.fit2$residuals,xlab="Fitted Values",ylab="Residuals",
     main="Fitted vs Residuals")
abline(h=0,lty=2,col="gray")
hist(x=best.fit2$residuals,main="Histogram of Residuals",xlab="Residuals")
plot(best.fit2,which = 2,add.smooth = FALSE)

Part 5: Prediction

As we could see, the model does not fit the data well, but we will use it anyway and pick a recent movie and do a prediction for its IMDB rating using the model. Also we will quantify the uncertainty around this prediction using an appropriate interval. We will try to predict IMDB Rating of It Chapter:2.

itChapter2 <- data.frame(imdb_num_votes=2184+418,genre="Horror", decade="2000's")
predict(best.fit2,newdata=itChapter2,interval="prediction",level=0.9)

##        fit      lwr      upr
## 1 5.515917 4.071357 6.960477

The real IMDB Rating is 6,6 which is include in the prediction interval, however not in all the film cases the interval will include the real rating since the model does not fit the data aqqurately. Moreover I used a very bad and predictable movie. If we use the model again but this time with another movie like The Shape of Water we see that the real IMDB rating is 7.3 which is not include in the prediction interval.

shapeOfWater <- data.frame(imdb_num_votes=1589+996,
                           genre="Science Fiction & Fantasy", decade="2000's")
predict(best.fit2,newdata=shapeOfWater,interval="prediction",level=0.9)

##        fit      lwr      upr
## 1 5.254198 3.761581 6.746816

Part 6: Conclusion

Here the overall goal of fitting any statistical model is to faithfully represent the data and the relationship between the variables. In general we are interested in figure out the balancing between goodness-of-fit (good reponse-predictor relationship) and complexity (how complicated the model is).

With this assignment we could confirm that selection of the appropriate variables can culminate in a suitable model which sadly was not the case this time. Even though satisfactory conclusions are not reached we have concluded that the variables used in the model, as a whole, are not significant to predict something as subjective as the rating of a film.