Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

The data file loads an object called movies. It should sit in the same directory as the R Markdown file.

load("movies.RData")

Part 1: Data

The data set comprises 651 randomly sampled movies produced and released before 2016. It records how much audiences and critics liked each movie, along with numerous other variables about the movies. The data are compiled from the IMDB and Rotten Tomatoes databases.

This is an observational study based on a random sample of movies. Because the movies were randomly selected, we can generalize the results to the population of movies they were sampled from. Because there was no random assignment, we cannot infer causation. In a general sense the ratings reflect “movie goers”, but it would help to know more about the people doing the rating (age, geographic location, etc.) so that we could specify more precisely the audience to which we are generalizing.

Part 2: Research question

My research question: is there an association between a movie's IMDB rating and its critics score, audience score, number of IMDB votes, whether it was nominated for the Best Picture Oscar, and whether it won the Best Picture Oscar?

Investigating these associations is interesting because it could show whether all of these variables, or only a subset, are associated with IMDB rating, letting us home in on a specific, defined set of variables. Narrowing down to the variables that do have a significant association with IMDB rating could help the movie industry focus on the things that matter. It would also help consumers to learn whether some of these variables have a stronger association with IMDB rating than others. For example: does the critics score or the audience score have the stronger association?

Regression formula:

$$\widehat{\text{imdb\_rating}} = b_0 + b_1\,\text{critics\_score} + b_2\,\text{audience\_score} + b_3\,\text{imdb\_num\_votes} + b_4\,\text{best\_pic\_nom} + b_5\,\text{best\_pic\_win}$$

Part 3: Exploratory data analysis

summary(movies$imdb_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.900   5.900   6.600   6.493   7.300   9.000

ggplot(movies, aes(x=imdb_rating)) + geom_histogram(binwidth=1)

The summary statistics and histogram for IMDB rating show a left-skewed distribution: the median (6.6) is greater than the mean (6.493), and the histogram shows a long tail of lower scores. Ratings range from a low of 1.9 to a high of 9. The low ratings pull the mean down, which is why it is lower than the median.
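The mean-versus-median comparison I use here (and again for the critics and audience scores) can be written as a small helper. This is only a sketch: skew_direction is a name I made up, the toy vector is invented, and comparing mean with median is a rule of thumb rather than a formal skewness measure.

```r
# Hypothetical helper (not part of the original analysis): labels the skew
# direction of a numeric vector by comparing its mean with its median.
skew_direction <- function(x) {
  x <- x[!is.na(x)]
  gap <- mean(x) - median(x)
  if (gap < 0) "left" else if (gap > 0) "right" else "symmetric"
}

# Toy ratings with a tail of low values, made up for illustration.
toy_ratings <- c(2, 5, 6, 6.5, 7, 7, 7.5, 8)
skew_direction(toy_ratings)

# With the data loaded, the same check would be:
# skew_direction(movies$imdb_rating)
```

For the toy vector the mean (6.125) sits below the median (6.75), so the helper reports a left skew, the same pattern as in the summary above.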

summary(movies$critics_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   33.00   61.00   57.69   83.00  100.00

ggplot(movies, aes(x=critics_score)) + geom_histogram(binwidth=10)

Similar to IMDB ratings, Critics scores also appear left skewed, with a long tail of lower scores. These lower scores pull down the mean, which is why the mean < median.

ggplot(data = movies, aes(x = critics_score, y = imdb_rating)) +
  geom_point()

A scatterplot of critics score against IMDB rating shows a positive, roughly linear association: IMDB rating increases as critics_score increases. Let’s see whether the association appears as strong for audience score and IMDB rating.

summary(movies$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

ggplot(movies, aes(x=audience_score)) + geom_histogram(binwidth=10)

Consistent with the other two variables, audience scores also appear left skewed, with a long tail of lower scores. These lower scores pull down the mean, which is why the mean < median.

ggplot(data = movies, aes(x = audience_score, y = imdb_rating)) +
  geom_point()

Again, we see a positive association between audience_score and imdb_rating: higher audience scores are associated with higher IMDB ratings. This association appears even stronger than the one between critics_score and imdb_rating, because the slope looks steeper and the points are tighter, with less scatter. We cannot say definitively until the multiple regression model is run.
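One way to put a number on "tighter, with less scatter" is to compare the correlation of each score with the rating. The sketch below is an illustration only: cor_with_response is a hypothetical helper, and the toy data frame is made up so the block runs without movies.RData.

```r
# Hypothetical helper: Pearson correlation of each listed predictor with a response.
cor_with_response <- function(data, response, predictors) {
  sapply(predictors, function(p) {
    cor(data[[p]], data[[response]], use = "complete.obs")
  })
}

# Toy data, made up for illustration only (not taken from the movies data).
toy_scores <- data.frame(
  imdb_rating    = c(5.0, 6.0, 6.5, 7.0, 8.0),
  critics_score  = c(20, 70, 40, 60, 90),
  audience_score = c(30, 45, 55, 65, 85)
)
toy_cors <- cor_with_response(toy_scores, "imdb_rating",
                              c("critics_score", "audience_score"))
toy_cors

# With the data loaded, the same call would be:
# cor_with_response(movies, "imdb_rating", c("critics_score", "audience_score"))
```

A larger correlation means less scatter around the line, but it does not say which slope is steeper, so this complements the scatterplots rather than replacing them.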

ggplot(movies, aes(x = factor(best_pic_nom), y = imdb_rating)) +
  geom_boxplot()

ggplot(movies, aes(x = factor(best_pic_win), y = imdb_rating)) +
  geom_boxplot()

Examining both Oscar nominations and Oscar wins shows that the median IMDB rating is higher for movies which were nominated or which won an Oscar. In fact, the increase in the median looks remarkably similar for nominations and wins, suggesting the two measures might be similar or correlated. I will come back to this idea.
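A cross-tabulation is a quick way to check whether nominations and wins overlap. The sketch below uses a made-up toy data frame; I am assuming the real columns are factors with levels "no" and "yes", which is what the coefficient name best_pic_nomyes in the model output suggests.

```r
# Toy data, made up for illustration; the "no"/"yes" levels are an assumption
# based on the coefficient names in the regression output.
toy_oscars <- data.frame(
  best_pic_nom = factor(c("no", "no", "yes", "yes", "no", "yes"),
                        levels = c("no", "yes")),
  best_pic_win = factor(c("no", "no", "no", "yes", "no", "yes"),
                        levels = c("no", "yes"))
)

# Counts of movies in each nomination/win combination.
nom_by_win <- table(nomination = toy_oscars$best_pic_nom,
                    win = toy_oscars$best_pic_win)
nom_by_win

# Share of wins within each nomination group (rows sum to 1).
prop.table(nom_by_win, margin = 1)

# With the data loaded, the same table would be:
# table(nomination = movies$best_pic_nom, win = movies$best_pic_win)
```

If nearly all winners fall in the "yes" nomination row, the two variables carry overlapping information.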

Part 4: Modeling

The variables that I am considering for the full model to predict imdb_rating are:

critics_score (Rotten Tomatoes)
audience_score (Rotten Tomatoes)
imdb_num_votes
best_pic_nom
best_pic_win

I’ve homed in on these five because it made intuitive sense that they could relate to IMDB rating, and my exploratory data analysis supported this. I’ve excluded several variables such as Title and Year, since these have many levels and it is unclear to me how they might be associated. I was equally uncertain how variables such as studio name, day and month might have an association.

For some groups of variables, I hypothesized that they could all be related and potentially measuring the same thing. For example, there are 5 variables pertaining to the Oscars, from the director or actors winning to the overall picture being nominated or winning. I decided to include just 2 of these variables in my model.

In fact, two of my variables, critics score and audience score, appear related in my EDA, both having a positive association with IMDB rating.

movies %>% 
  summarise(cor(critics_score, audience_score))

## # A tibble: 1 × 1
##   `cor(critics_score, audience_score)`
##                                  <dbl>
## 1                            0.7042762

Indeed, these two metrics have a correlation of 0.70, suggesting they measure much the same thing (collinearity) and that we may only need one of them. However, I am interested in seeing whether one of them is more strongly associated with IMDB rating than the other. For now, I will keep both in and make adjustments later if needed. I am using a bit of creative license here, or the “art” of fitting a model.

I’ve decided to use backward elimination based on p-values. I chose this approach because I want a model in which each predictor is itself significant. This method also tends to need fewer iterations to arrive at the final model.
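To get a rough sense of how much a correlation of this size matters, the arithmetic below turns it into a shared-variance figure and a variance inflation factor (VIF). This is a sketch under a simplifying assumption: the formula 1 / (1 - r^2) is the VIF for a model containing only these two predictors, so the values in the full model would differ somewhat.

```r
r <- 0.7042762                       # cor(critics_score, audience_score) reported above
shared_variance <- r^2               # share of variance one score explains in the other
vif_two_predictors <- 1 / (1 - r^2)  # VIF if these were the only two predictors

round(c(shared_variance = shared_variance, vif = vif_two_predictors), 3)
```

The two scores share roughly half of their variance and the VIF is just under 2. Common rules of thumb only flag a VIF above about 5 or 10, which is consistent with keeping both scores in the model for now.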
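The procedure can be sketched as a loop: fit the model, find the term with the largest p-value, drop it if that p-value is above the cutoff, and repeat. The function below is my own illustration, not a packaged routine; backward_eliminate, its arguments and the 0.05 cutoff are assumptions, and the toy data are simulated. It uses drop1() with an F test, which for single-coefficient terms gives the same p-values as summary().

```r
# Hypothetical helper: backward elimination on p-values for a linear model.
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    rhs <- if (length(predictors) > 0) predictors else "1"
    fit <- lm(reformulate(rhs, response = response), data = data)
    if (length(predictors) == 0) break       # nothing left to test
    tests <- drop1(fit, test = "F")          # one F test per term
    # First row of the drop1 table is "<none>", so skip it.
    p_values <- setNames(tests[["Pr(>F)"]][-1], rownames(tests)[-1])
    if (max(p_values) <= alpha) break        # every remaining term is significant
    predictors <- setdiff(predictors, names(which.max(p_values)))
  }
  fit
}

# Simulated toy data: y depends on signal only; noise is unrelated by design.
set.seed(1)
toy_sim <- data.frame(signal = rnorm(200), noise = rnorm(200))
toy_sim$y <- 2 + 3 * toy_sim$signal + rnorm(200)

toy_fit <- backward_eliminate(toy_sim, "y", c("signal", "noise"))
attr(terms(toy_fit), "term.labels")

# With the data loaded, the equivalent call would be:
# backward_eliminate(movies, "imdb_rating",
#   c("critics_score", "audience_score", "imdb_num_votes",
#     "best_pic_nom", "best_pic_win"))
```

Below I run the same steps by hand so that each intermediate model summary is visible.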

I will first run my full model with all 5 predictors:

IMDB_model <- lm(imdb_rating ~ critics_score + audience_score +imdb_num_votes + best_pic_nom + best_pic_win, data = movies)
summary(IMDB_model)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     best_pic_nom + best_pic_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.48960 -0.18770  0.02672  0.29454  1.17199 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.681e+00  6.241e-02  58.977  < 2e-16 ***
## critics_score    1.180e-02  9.428e-04  12.515  < 2e-16 ***
## audience_score   3.341e-02  1.352e-03  24.712  < 2e-16 ***
## imdb_num_votes   8.474e-07  1.880e-07   4.508 7.77e-06 ***
## best_pic_nomyes -2.773e-02  1.231e-01  -0.225    0.822    
## best_pic_winyes -3.274e-03  2.130e-01  -0.015    0.988    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4833 on 645 degrees of freedom
## Multiple R-squared:  0.8031, Adjusted R-squared:  0.8015 
## F-statistic:   526 on 5 and 645 DF,  p-value: < 2.2e-16

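Interestingly, my first 3 variables are significant, while neither Oscar nominations nor Oscar wins are. Note that best_pic_win actually has the highest p-value (P = .988), so strict backward elimination would drop it first. I remove best_pic_nom (P = .822) in this step and revisit best_pic_win in the next one.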

IMDB_model <- lm(imdb_rating ~ critics_score + audience_score +imdb_num_votes + best_pic_win, data = movies)
summary(IMDB_model)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     best_pic_win, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49017 -0.18635  0.02487  0.29438  1.17255 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.682e+00  6.202e-02  59.375  < 2e-16 ***
## critics_score    1.179e-02  9.411e-04  12.527  < 2e-16 ***
## audience_score   3.339e-02  1.348e-03  24.766  < 2e-16 ***
## imdb_num_votes   8.401e-07  1.850e-07   4.541 6.69e-06 ***
## best_pic_winyes -2.304e-02  1.940e-01  -0.119    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4829 on 646 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.8018 
## F-statistic: 658.5 on 4 and 646 DF,  p-value: < 2.2e-16

The reduced, 4-variable model shows the first 3 predictors are significant, while Oscar win is not (P = .905). I will remove it in my next run of the model.

IMDB_model <- lm(imdb_rating ~ critics_score + audience_score +imdb_num_votes, data = movies)
summary(IMDB_model)

## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49004 -0.18552  0.02332  0.29450  1.17298 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.683e+00  6.192e-02  59.471  < 2e-16 ***
## critics_score  1.178e-02  9.387e-04  12.552  < 2e-16 ***
## audience_score 3.340e-02  1.347e-03  24.794  < 2e-16 ***
## imdb_num_votes 8.335e-07  1.764e-07   4.726 2.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4825 on 647 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.8021 
## F-statistic: 879.3 on 3 and 647 DF,  p-value: < 2.2e-16

The reduced 3-variable model shows that all three predictors have a significant association with IMDB rating: critics_score, audience_score, and imdb_num_votes.

MODEL DIAGNOSTICS

I will next conduct model diagnostics checking for 4 things:

Linear Relationship between X and Y (Numerical Variables)
Nearly Normal Residuals
Constant Variability of Residuals
Independence of Residuals

I. Linear Relationship between X and Y

# Fit the final model once and reuse it for every diagnostic plot
IMDB_final <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes, data = movies)

# Residuals against each numerical predictor; look for random scatter around 0
plot(IMDB_final$residuals ~ movies$critics_score)

plot(IMDB_final$residuals ~ movies$audience_score)

plot(IMDB_final$residuals ~ movies$imdb_num_votes)

hist(IMDB_final$residuals)

qqnorm(IMDB_final$residuals)
qqline(IMDB_final$residuals)
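The histogram and normal Q-Q plot above cover the nearly normal residuals check. The remaining two checks on my list, constant variability and independence, could be sketched as below. plot_residual_checks is a hypothetical helper of my own, and I demonstrate it on R's built-in cars data so the block runs without movies.RData.

```r
# Hypothetical helper: draws the two remaining residual checks for a fitted lm.
plot_residual_checks <- function(fit) {
  res <- residuals(fit)
  # Constant variability: look for a fan shape in residuals vs. fitted values.
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)
  # Independence: look for trends or runs in residuals plotted in row order.
  plot(res, xlab = "Row order", ylab = "Residuals")
  abline(h = 0, lty = 2)
  invisible(data.frame(fitted = fitted(fit), residual = res))
}

# Demonstration on the built-in cars data (not the movies data).
cars_fit <- lm(dist ~ speed, data = cars)
checks <- plot_residual_checks(cars_fit)

# With the model from above, the call would be:
# plot_residual_checks(IMDB_final)
```

The row-order plot is only informative about independence if the rows have a meaningful order; since the movies were randomly sampled, I would mainly expect it to show no pattern.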

Part 5: Prediction
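As a minimal sketch of how the final model could be used for prediction, the block below plugs values into the rounded coefficients printed in the three-predictor summary from Part 4. The helper predict_imdb and the input values are made up for illustration; they do not describe a real movie.

```r
# Coefficients copied (rounded as printed) from the three-predictor model summary.
final_coefs <- c(intercept      = 3.683,
                 critics_score  = 1.178e-02,
                 audience_score = 3.340e-02,
                 imdb_num_votes = 8.335e-07)

# Hypothetical helper: point prediction of imdb_rating from those coefficients.
predict_imdb <- function(critics_score, audience_score, imdb_num_votes) {
  unname(final_coefs["intercept"] +
           final_coefs["critics_score"] * critics_score +
           final_coefs["audience_score"] * audience_score +
           final_coefs["imdb_num_votes"] * imdb_num_votes)
}

# Made-up inputs, not a real movie.
predict_imdb(critics_score = 80, audience_score = 85, imdb_num_votes = 100000)

# With the fitted model available, predict() also returns an interval:
# new_movie <- data.frame(critics_score = 80, audience_score = 85,
#                         imdb_num_votes = 100000)
# predict(IMDB_final, new_movie, interval = "prediction", level = 0.95)
```

For these inputs the point prediction is about 7.5. Because the coefficients are rounded, the result can differ slightly from what predict() would return on the fitted model.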
