Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

load("movies.Rdata")

Part 1: Data

Our working data set aggregates information about how much audiences and critics like movies and some other information collected on Rotten Tomatoes (https://www.rottentomatoes.com/) and IMDB (https://www.imdb.com/).

According to Coursera site (or our Paramount Pictures’ boss on this exercise), the data set is comprised of 651 randomly sampled movies produced and released before 2016. As we are dealing here with a random sample, we can expect the results of our analysis to have some degree of generabizability.

All we can do with this data set is to observe the relations among variables. There’s no information about interventions, control groups, or random allocation of subjects (in this case, movies) in order to search for causality among variables. So, although we will be searching for variable associations, we won’t be talking about causation among them.

Let’s first do some exploratory data analysis, taking a look at the dimensions of our data frame and, then, getting some counts and levels for some variables that we could use on this analysis.

Part 2: Exploratory data analysis

dim(movies)

## [1] 651  32

This dataset consists of 32 columns (features/variables) and 651 rows (people/observations). Let’s now search for some possible variables to analyze.

movies %>% 
  count(genre)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))

## # A tibble: 11 x 3
##    genre                         n freq  
##    <fct>                     <int> <chr> 
##  1 Action & Adventure           65 9.98% 
##  2 Animation                     9 1.38% 
##  3 Art House & International    14 2.15% 
##  4 Comedy                       87 13.36%
##  5 Documentary                  52 7.99% 
##  6 Drama                       305 46.85%
##  7 Horror                       23 3.53% 
##  8 Musical & Performing Arts    12 1.84% 
##  9 Mystery & Suspense           59 9.06% 
## 10 Other                        16 2.46% 
## 11 Science Fiction & Fantasy     9 1.38%

As we can see, great part of the movies (46.85%) of our sample were classified as Drama movies. That does not make us feel comfortable to use this variable (genre) do build a model, as we want to build something that could bring us information about all kinds of movies, not only drama ones.

The month the movie is released in theater (var. thtr_rel_month) is another interesting candidate. We can see the months the movies on our samples were released on the table below:

movies %>% 
  count(thtr_rel_month)  %>% 
  mutate(freq = paste0(round(100 * prop.table(n),2),"%"))

## # A tibble: 12 x 3
##    thtr_rel_month     n freq  
##             <dbl> <int> <chr> 
##  1              1    69 10.6% 
##  2              2    34 5.22% 
##  3              3    51 7.83% 
##  4              4    45 6.91% 
##  5              5    44 6.76% 
##  6              6    72 11.06%
##  7              7    48 7.37% 
##  8              8    44 6.76% 
##  9              9    53 8.14% 
## 10             10    70 10.75%
## 11             11    51 7.83% 
## 12             12    70 10.75%

The table shows that movies are roughly equally distributed along the months. That’s a good balance with no month being over represented on the sample.

Another variable we are interested are the scores (critics_score) given to movies on the Rotten Tomatoes website.

movies %>% 
  count(critics_score)  %>% 
  summarise(mean = mean(critics_score), median = median(critics_score), 
            sd = sd(critics_score), min = min(critics_score), max = max(critics_score),
            iqr = IQR(critics_score), "25%" = quantile(critics_score, 1/4), 
            "75%" = quantile(critics_score, 3/4))

## # A tibble: 1 x 8
##    mean median    sd   min   max   iqr `25%` `75%`
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  50.0     50  28.7     1   100    49  25.5  74.5

The code above gives us the basic summary on our critics scores values. The table displays the values for mean, median, IQR and an adittional information about 25% and 75% quantifies. It would be interesting to know if we have a most frequent value (or maybe more than one) for this score. R does not have a built-in function to do this, but we can create one like this:

# Mode function (only one mode)
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

movies %>% 
  summarise(Mode = Mode(critics_score))

## # A tibble: 1 x 1
##    Mode
##   <dbl>
## 1    67

As we can see, 67 is the most frequent value for a critic score on this data set. But could we have another score as frequent as 67? To investigate it, we can use basically the same code, with a few modifications:

# Creating Modes function (show more than one mode, if there is more than one)
Modes <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

movies %>% 
  summarise(Mode = Modes(critics_score))

## # A tibble: 2 x 1
##    Mode
##   <dbl>
## 1    67
## 2   100

And yes! We have 2 most frequent values for critic scores, 67 and 100. Let’s finally plot a bar graph for this variable to see how it looks like.

ggplot(data = movies, aes(x = critics_score)) +
  geom_histogram(binwidth = 1) +
  xlab("Rotten Tomatoes's critics score")

And it looks pretty much as we would expect it to be: no apparent shape, but with a slightly greater density on the right.

So, after this preliminary exploratory analysis, we can proceed to the formulation of our questions.

Part 3: Research questions

Question 1:

Is the month the movie is released in theater associated with audience approval? If we can find an association between these two variables, it could help producers to chose some months and avoid others when deciding when to release a movie in theaters.

Question 2:

Is there an association between the critics scores variable and the audience scores variable? If we can find an association between these two variables we could use the critics scores as a predictor for audience preferences.

Part 4: Modeling

RESEARCH QUESTION 1 - month of release and audiance scores

To adress the question 1, we start plotting a scatter plot for the 2 variables:

# audience_score ~ dvd_rel_month scatterplot
ggplot(data = movies, aes(x = thtr_rel_month, y = audience_score)) +
  geom_jitter() +
  geom_smooth(method = "lm", se = FALSE) #exclude shaded area

## `geom_smooth()` using formula 'y ~ x'

This visualization shows a very scattered dispositions of audience score month by month. That gives us a first intuition that we are dealing with variables that are not correlated or that may have a very low correlation coefficient. We can check this using the function cor. And just in case we have NA entries on our variables, we will use “complete.obs” at the end of our code to ignore them.

#correlation audience_score ~ dvd_rel_month
movies %>%
  summarise(cor(thtr_rel_month, audience_score, use = "complete.obs"))

## # A tibble: 1 x 1
##   `cor(thtr_rel_month, audience_score, use = "complete.obs")`
##                                                         <dbl>
## 1                                                      0.0327

As expected, audience scores and the month the movie is released in theaters seem to no correlation or, more precisely, the correlation between the two variables has a very low coefficient of 0.0327.

The linear model for this two variables would be like this:

#linear model
aud <- lm(audience_score ~ thtr_rel_month, data = movies)
summary(aud)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_month, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.039 -16.155   2.705  17.589  34.333 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     61.1088     1.7007  35.932   <2e-16 ***
## thtr_rel_month   0.1860     0.2232   0.833    0.405    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.23 on 649 degrees of freedom
## Multiple R-squared:  0.001069,   Adjusted R-squared:  -0.0004705 
## F-statistic: 0.6943 on 1 and 649 DF,  p-value: 0.405

And with a p-value = 0.405 and a R-squared of 0.001069 we found no evidence that these two variables could have any association. Therefore for the movies released before 2016, the month it was released could not be used to predict audience reaction to it. We will skip model diagnostics for this one, as it is a model we wouldn’t be willing to use anyway.

RESEARCH QUESTION 2 - critics score and audiance scores

As usual, let’s start ploting a scatter plot between our two variables:

# audience_score ~ critics_score scatterplot
ggplot(data = movies, aes(x = critics_score, y = audience_score)) +
  geom_jitter()+ #handling overplotting
  geom_smooth(method = "lm", se = FALSE) #exclude shaded area

## `geom_smooth()` using formula 'y ~ x'

Now it seems we found something. The scatter pattern shows a clear association between critics score and audience scores. We can verify the magnitude of this correlation:

#correlation audience_score ~ critics_score
movies %>%
  summarise(cor(critics_score, audience_score))

## # A tibble: 1 x 1
##   `cor(critics_score, audience_score)`
##                                  <dbl>
## 1                                0.704

And fit the model for audience_score and critics_score

#linear model
simple_lm <- lm(audience_score ~ critics_score, data = movies)
summary(simple_lm)

## 
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -37.043  -9.571   0.504  10.422  43.544 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   33.43551    1.27561   26.21   <2e-16 ***
## critics_score  0.50144    0.01984   25.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared:  0.496,  Adjusted R-squared:  0.4952 
## F-statistic: 638.7 on 1 and 649 DF,  p-value: < 2.2e-16

This time we have two variables with a correlation coefficient of 0.704. Besides, we can see that the critic scores is a statistically significant predictor of audience_score, predicting 49.6% (R-squared) of the audience score variation.

Model Diagnostics

To use linear regression, we must attend to some conditions.

The first condition is the linear relationship between the variables. As we can see on the analysis above, this condition is attended for audience score and critics score.
Second, the residuals of our regression must have a nearly normal distribution and must be constantly distributed (homoscedasticity). Let’s verify that, plotting the residuals for this regression:

#plot of the residuals vs. fitted (predicted) values.
ggplot(data = simple_lm, aes(x = .fitted, y = .resid)) +
  geom_jitter()  +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

The scatter pattern seems to have a fan shape, with more variability at lower values and less variability at higher values. That shape does not allow us to conclude for the homoscedasticity of the residuals. We can also check for nearly normal residuals:

#method 1 - Histogram
ggplot(data = simple_lm, aes(x = .resid)) +
  geom_histogram(binwidth = 4) +
  xlab("Residuals")

#method 2 - qqplot
ggplot(data = simple_lm, aes(sample = .resid)) +
  stat_qq() #qq = quantile quantile

The histogram and the qqplot show us what seems to be a nearly normal distribution for the residuals around zero.

Diagnostics conclusion.

Although we have a strong correlation coefficient, and a significative p-value for critics score as a predictor for audience score , we found a fan shape on the distribution of residuals. So, we cannot feel much comfortable to use the regression model here.

Multilinear Regression

We will add now three variables to our model: best_actor_win, best_actress_win and best_pic_nom. The first two variables inform (no, yes) whether or not one of the main actors (and actresses) in the movie ever won an Oscar (not necessarily for their role in the given movie). The last variable informs whether or not the movie was nominated for a best picture Oscar (no, yes).

In order to see if critics score is still a significant predictor of audience score after we include these variables, let’s plug them all into the model. We we treat this as our full model. Let’s also recall the R-squared we got ealier for the simple linear regression:

Full_Model <-lm(audience_score ~ critics_score + best_actor_win + best_actress_win + best_pic_nom, data = movies)
summary(Full_Model)

## 
## Call:
## lm(formula = audience_score ~ critics_score + best_actor_win + 
##     best_actress_win + best_pic_nom, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.951  -9.379   0.661   9.754  43.020 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         34.10115    1.29749  26.282  < 2e-16 ***
## critics_score        0.49134    0.02014  24.397  < 2e-16 ***
## best_actor_winyes   -1.38197    1.63310  -0.846  0.39774    
## best_actress_winyes -1.94742    1.83589  -1.061  0.28920    
## best_pic_nomyes      9.76076    3.25700   2.997  0.00283 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 646 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.5003 
## F-statistic: 163.7 on 4 and 646 DF,  p-value: < 2.2e-16

summary(simple_lm)$adj.r.squared

## [1] 0.4952284

Although we had a tiny increment on R-squared value (0.5003 for the full model vs. 0.4952284 for the simple regression), we can see that the p-values of our actors and actresses variables indicates they are not significant predictors. Let’s do a backwards elimination, excluding the highest p-value, and observe what happens:

aud_actress_nom <-lm(audience_score ~ critics_score + best_actress_win + best_pic_nom, data = movies)
summary(aud_actress_nom)

## 
## Call:
## lm(formula = audience_score ~ critics_score + best_actress_win + 
##     best_pic_nom, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.782  -9.313   0.621   9.798  43.178 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         33.94587    1.28416  26.434  < 2e-16 ***
## critics_score        0.49114    0.02013  24.394  < 2e-16 ***
## best_actress_winyes -2.11090    1.82530  -1.156  0.24792    
## best_pic_nomyes      9.37911    3.22491   2.908  0.00376 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.29 on 647 degrees of freedom
## Multiple R-squared:  0.5028, Adjusted R-squared:  0.5005 
## F-statistic: 218.1 on 3 and 647 DF,  p-value: < 2.2e-16

summary(simple_lm)$adj.r.squared

## [1] 0.4952284

summary(Full_Model)$adj.r.squared

## [1] 0.5002518

After removing best_actor_win, we have a decrease on best_actress_win p-value, even though it remains not significant as a predictor of audience scores. The change on R-squared is also very tiny. Let’s do one more step and remove best_actress_win:

aud_nom <-lm(audience_score ~ critics_score  + best_pic_nom, data = movies)
summary(aud_nom)

## 
## Call:
## lm(formula = audience_score ~ critics_score + best_pic_nom, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.564  -9.413   0.510  10.020  43.365 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     33.76610    1.27505  26.482  < 2e-16 ***
## critics_score    0.49064    0.02013  24.369  < 2e-16 ***
## best_pic_nomyes  8.64307    3.16231   2.733  0.00644 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.3 on 648 degrees of freedom
## Multiple R-squared:  0.5017, Adjusted R-squared:  0.5002 
## F-statistic: 326.3 on 2 and 648 DF,  p-value: < 2.2e-16

summary(simple_lm)$adj.r.squared

## [1] 0.4952284

summary(Full_Model)$adj.r.squared

## [1] 0.5002518

summary(aud_actress_nom)$adj.r.squared

## [1] 0.500471

This final model shows us a slightly increase on R-squared (comparing to the simple model audience_score ~ critics_score) and both variables remain significant predictors of audience scores. We can also see that, keeping all the other variables constant, being nominated for an Oscar was associated with an increment of 8.64307, on average, on the audience scores.

Therefore, we will stick with it and run the model diagnostics.

Model Diagnostics

final_model <- aud_nom

#plot of the residuals vs. fitted (predicted) values.
ggplot(data = final_model, aes(x = .fitted, y = .resid)) +
  geom_jitter()  +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")

#Nearly normal residuals?

#method 1 - Histogram
ggplot(data = final_model, aes(x = .resid)) +
  geom_histogram(binwidth = 4) +
  xlab("Residuals")

#method 2 - qqplot
ggplot(data = final_model, aes(sample = .resid)) +
  stat_qq() #qq = quantile quantile

Again we have the same patter we found before, what’s really not a big surprise. Although the residuals seem to have a nearly normal distribution around zero, we can’t sustain homoscedasticity for the residuals distributions, as we the residuals variation on the left tail is greater than the variation on the right tail.

So, one more time, although we found good coefficients and p-values, we cannot feel much comfortable again using the regression model here.

Part 5: Prediction

Now let’s pick a movie from 2016 (a movie that’s not in the sample) and do a prediction for this movie using our the model

The selected movie will be Moana, which have received a score of 95 from rotten tomatoes critics (https://www.rottentomatoes.com/m/moana_2016). The film was nominated for a Best Animated Feature Film of the Year.

First, we need to create a new data frame for this movie

moana <- data.frame(best_pic_nom = "yes", critics_score = 95)

Then, I can do the prediction using the predict function:

predict(final_model, moana)

##        1 
## 89.02031

Finally, we can also construct a prediction interval around this prediction, which will provide a measure of uncertainty around the prediction.

predict(final_model, moana, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 89.02031 60.31469 117.7259

Thus, the model predicts, with 95% confidence, that a movie with a critics score of 95 is expected to have an audience score between 60.31 and 117.72. This is an interval with a great uncertainty. Although the estimate (89.02) was exactly the number we can find on rotten tomatoes webesite.

Part 6: Conclusion

This study showed us that critics scores from rotten tomatoes are not precise, and sometimes reliable, predictors of audience scores. Specially when the critics scores are low.

It seems that when critics on rotten tomatoes give a good valuation, the audience tended to do the same. We can see that on the plot by the greater concentration of dots on the right tail of the curve. On the other hand, when critics scores are lower we can find a variability on audience scores that is greater than the one we observe on the right tail. We could also say that this regression model does more precise predictions when the critics scores are higher. The higher the score, the more we could expect the audience to approve the movie. However, we could not be so certain when trying to predict audience reactions when the movie receives a lower critic score.

Also, the search for a period of the year associated with a higher audience score ended with no evidence found to support it.

Future studies could try to investigate this data using other approaches, like studying the movie genres separately, or trying to predict audience scores using other variables like Top 200 Box Office list or Number of votes on IMDB.