Our working data set aggregates information about how much audiences and critics like movies and some other information collected on Rotten Tomatoes (https://www.rottentomatoes.com/) and IMDB (https://www.imdb.com/).
According to Coursera site (or our Paramount Pictures’ boss on this exercise), the data set is comprised of 651 randomly sampled movies produced and released before 2016. As we are dealing here with a random sample, we can expect the results of our analysis to have some degree of generabizability.
All we can do with this data set is to observe the relations among variables. There’s no information about interventions, control groups, or random allocation of subjects (in this case, movies) in order to search for causality among variables. So, although we will be searching for variable associations, we won’t be talking about causation among them.
Let’s first do some exploratory data analysis, taking a look at the dimensions of our data frame and, then, getting some counts and levels for some variables that we could use on this analysis.
## [1] 651 32
This dataset consists of 32 columns (features/variables) and 651 rows (people/observations). Let’s now search for some possible variables to analyze.
## # A tibble: 11 x 3
## genre n freq
## <fct> <int> <chr>
## 1 Action & Adventure 65 9.98%
## 2 Animation 9 1.38%
## 3 Art House & International 14 2.15%
## 4 Comedy 87 13.36%
## 5 Documentary 52 7.99%
## 6 Drama 305 46.85%
## 7 Horror 23 3.53%
## 8 Musical & Performing Arts 12 1.84%
## 9 Mystery & Suspense 59 9.06%
## 10 Other 16 2.46%
## 11 Science Fiction & Fantasy 9 1.38%
As we can see, great part of the movies (46.85%) of our sample were classified as Drama movies. That does not make us feel comfortable to use this variable (genre) do build a model, as we want to build something that could bring us information about all kinds of movies, not only drama ones.
The month the movie is released in theater (var. thtr_rel_month) is another interesting candidate. We can see the months the movies on our samples were released on the table below:
## # A tibble: 12 x 3
## thtr_rel_month n freq
## <dbl> <int> <chr>
## 1 1 69 10.6%
## 2 2 34 5.22%
## 3 3 51 7.83%
## 4 4 45 6.91%
## 5 5 44 6.76%
## 6 6 72 11.06%
## 7 7 48 7.37%
## 8 8 44 6.76%
## 9 9 53 8.14%
## 10 10 70 10.75%
## 11 11 51 7.83%
## 12 12 70 10.75%
The table shows that movies are roughly equally distributed along the months. That’s a good balance with no month being over represented on the sample.
Another variable we are interested are the scores (critics_score) given to movies on the Rotten Tomatoes website.
movies %>%
count(critics_score) %>%
summarise(mean = mean(critics_score), median = median(critics_score),
sd = sd(critics_score), min = min(critics_score), max = max(critics_score),
iqr = IQR(critics_score), "25%" = quantile(critics_score, 1/4),
"75%" = quantile(critics_score, 3/4)) ## # A tibble: 1 x 8
## mean median sd min max iqr `25%` `75%`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 50.0 50 28.7 1 100 49 25.5 74.5
The code above gives us the basic summary on our critics scores values. The table displays the values for mean, median, IQR and an adittional information about 25% and 75% quantifies. It would be interesting to know if we have a most frequent value (or maybe more than one) for this score. R does not have a built-in function to do this, but we can create one like this:
# Mode function (only one mode)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
movies %>%
summarise(Mode = Mode(critics_score))## # A tibble: 1 x 1
## Mode
## <dbl>
## 1 67
As we can see, 67 is the most frequent value for a critic score on this data set. But could we have another score as frequent as 67? To investigate it, we can use basically the same code, with a few modifications:
# Creating Modes function (show more than one mode, if there is more than one)
Modes <- function(x) {
ux <- unique(x)
tab <- tabulate(match(x, ux))
ux[tab == max(tab)]
}
movies %>%
summarise(Mode = Modes(critics_score))## # A tibble: 2 x 1
## Mode
## <dbl>
## 1 67
## 2 100
And yes! We have 2 most frequent values for critic scores, 67 and 100. Let’s finally plot a bar graph for this variable to see how it looks like.
ggplot(data = movies, aes(x = critics_score)) +
geom_histogram(binwidth = 1) +
xlab("Rotten Tomatoes's critics score")And it looks pretty much as we would expect it to be: no apparent shape, but with a slightly greater density on the right.
So, after this preliminary exploratory analysis, we can proceed to the formulation of our questions.
Is the month the movie is released in theater associated with audience approval? If we can find an association between these two variables, it could help producers to chose some months and avoid others when deciding when to release a movie in theaters.
Is there an association between the critics scores variable and the audience scores variable? If we can find an association between these two variables we could use the critics scores as a predictor for audience preferences.
To adress the question 1, we start plotting a scatter plot for the 2 variables:
# audience_score ~ dvd_rel_month scatterplot
ggplot(data = movies, aes(x = thtr_rel_month, y = audience_score)) +
geom_jitter() +
geom_smooth(method = "lm", se = FALSE) #exclude shaded area## `geom_smooth()` using formula 'y ~ x'
This visualization shows a very scattered dispositions of audience score month by month. That gives us a first intuition that we are dealing with variables that are not correlated or that may have a very low correlation coefficient. We can check this using the function cor. And just in case we have NA entries on our variables, we will use “complete.obs” at the end of our code to ignore them.
#correlation audience_score ~ dvd_rel_month
movies %>%
summarise(cor(thtr_rel_month, audience_score, use = "complete.obs"))## # A tibble: 1 x 1
## `cor(thtr_rel_month, audience_score, use = "complete.obs")`
## <dbl>
## 1 0.0327
As expected, audience scores and the month the movie is released in theaters seem to no correlation or, more precisely, the correlation between the two variables has a very low coefficient of 0.0327.
The linear model for this two variables would be like this:
##
## Call:
## lm(formula = audience_score ~ thtr_rel_month, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.039 -16.155 2.705 17.589 34.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 61.1088 1.7007 35.932 <2e-16 ***
## thtr_rel_month 0.1860 0.2232 0.833 0.405
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.23 on 649 degrees of freedom
## Multiple R-squared: 0.001069, Adjusted R-squared: -0.0004705
## F-statistic: 0.6943 on 1 and 649 DF, p-value: 0.405
And with a p-value = 0.405 and a R-squared of 0.001069 we found no evidence that these two variables could have any association. Therefore for the movies released before 2016, the month it was released could not be used to predict audience reaction to it. We will skip model diagnostics for this one, as it is a model we wouldn’t be willing to use anyway.
As usual, let’s start ploting a scatter plot between our two variables:
# audience_score ~ critics_score scatterplot
ggplot(data = movies, aes(x = critics_score, y = audience_score)) +
geom_jitter()+ #handling overplotting
geom_smooth(method = "lm", se = FALSE) #exclude shaded area## `geom_smooth()` using formula 'y ~ x'
Now it seems we found something. The scatter pattern shows a clear association between critics score and audience scores. We can verify the magnitude of this correlation:
#correlation audience_score ~ critics_score
movies %>%
summarise(cor(critics_score, audience_score))## # A tibble: 1 x 1
## `cor(critics_score, audience_score)`
## <dbl>
## 1 0.704
And fit the model for audience_score and critics_score
##
## Call:
## lm(formula = audience_score ~ critics_score, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37.043 -9.571 0.504 10.422 43.544
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.43551 1.27561 26.21 <2e-16 ***
## critics_score 0.50144 0.01984 25.27 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 649 degrees of freedom
## Multiple R-squared: 0.496, Adjusted R-squared: 0.4952
## F-statistic: 638.7 on 1 and 649 DF, p-value: < 2.2e-16
This time we have two variables with a correlation coefficient of 0.704. Besides, we can see that the critic scores is a statistically significant predictor of audience_score, predicting 49.6% (R-squared) of the audience score variation.
To use linear regression, we must attend to some conditions.
The first condition is the linear relationship between the variables. As we can see on the analysis above, this condition is attended for audience score and critics score.
Second, the residuals of our regression must have a nearly normal distribution and must be constantly distributed (homoscedasticity). Let’s verify that, plotting the residuals for this regression:
#plot of the residuals vs. fitted (predicted) values.
ggplot(data = simple_lm, aes(x = .fitted, y = .resid)) +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")The scatter pattern seems to have a fan shape, with more variability at lower values and less variability at higher values. That shape does not allow us to conclude for the homoscedasticity of the residuals. We can also check for nearly normal residuals:
#method 1 - Histogram
ggplot(data = simple_lm, aes(x = .resid)) +
geom_histogram(binwidth = 4) +
xlab("Residuals")#method 2 - qqplot
ggplot(data = simple_lm, aes(sample = .resid)) +
stat_qq() #qq = quantile quantileThe histogram and the qqplot show us what seems to be a nearly normal distribution for the residuals around zero.
Although we have a strong correlation coefficient, and a significative p-value for critics score as a predictor for audience score , we found a fan shape on the distribution of residuals. So, we cannot feel much comfortable to use the regression model here.
We will add now three variables to our model: best_actor_win, best_actress_win and best_pic_nom. The first two variables inform (no, yes) whether or not one of the main actors (and actresses) in the movie ever won an Oscar (not necessarily for their role in the given movie). The last variable informs whether or not the movie was nominated for a best picture Oscar (no, yes).
In order to see if critics score is still a significant predictor of audience score after we include these variables, let’s plug them all into the model. We we treat this as our full model. Let’s also recall the R-squared we got ealier for the simple linear regression:
Full_Model <-lm(audience_score ~ critics_score + best_actor_win + best_actress_win + best_pic_nom, data = movies)
summary(Full_Model)##
## Call:
## lm(formula = audience_score ~ critics_score + best_actor_win +
## best_actress_win + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.951 -9.379 0.661 9.754 43.020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.10115 1.29749 26.282 < 2e-16 ***
## critics_score 0.49134 0.02014 24.397 < 2e-16 ***
## best_actor_winyes -1.38197 1.63310 -0.846 0.39774
## best_actress_winyes -1.94742 1.83589 -1.061 0.28920
## best_pic_nomyes 9.76076 3.25700 2.997 0.00283 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 646 degrees of freedom
## Multiple R-squared: 0.5033, Adjusted R-squared: 0.5003
## F-statistic: 163.7 on 4 and 646 DF, p-value: < 2.2e-16
## [1] 0.4952284
Although we had a tiny increment on R-squared value (0.5003 for the full model vs. 0.4952284 for the simple regression), we can see that the p-values of our actors and actresses variables indicates they are not significant predictors. Let’s do a backwards elimination, excluding the highest p-value, and observe what happens:
aud_actress_nom <-lm(audience_score ~ critics_score + best_actress_win + best_pic_nom, data = movies)
summary(aud_actress_nom)##
## Call:
## lm(formula = audience_score ~ critics_score + best_actress_win +
## best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.782 -9.313 0.621 9.798 43.178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.94587 1.28416 26.434 < 2e-16 ***
## critics_score 0.49114 0.02013 24.394 < 2e-16 ***
## best_actress_winyes -2.11090 1.82530 -1.156 0.24792
## best_pic_nomyes 9.37911 3.22491 2.908 0.00376 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.29 on 647 degrees of freedom
## Multiple R-squared: 0.5028, Adjusted R-squared: 0.5005
## F-statistic: 218.1 on 3 and 647 DF, p-value: < 2.2e-16
## [1] 0.4952284
## [1] 0.5002518
After removing best_actor_win, we have a decrease on best_actress_win p-value, even though it remains not significant as a predictor of audience scores. The change on R-squared is also very tiny. Let’s do one more step and remove best_actress_win:
##
## Call:
## lm(formula = audience_score ~ critics_score + best_pic_nom, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.564 -9.413 0.510 10.020 43.365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.76610 1.27505 26.482 < 2e-16 ***
## critics_score 0.49064 0.02013 24.369 < 2e-16 ***
## best_pic_nomyes 8.64307 3.16231 2.733 0.00644 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.3 on 648 degrees of freedom
## Multiple R-squared: 0.5017, Adjusted R-squared: 0.5002
## F-statistic: 326.3 on 2 and 648 DF, p-value: < 2.2e-16
## [1] 0.4952284
## [1] 0.5002518
## [1] 0.500471
This final model shows us a slightly increase on R-squared (comparing to the simple model audience_score ~ critics_score) and both variables remain significant predictors of audience scores. We can also see that, keeping all the other variables constant, being nominated for an Oscar was associated with an increment of 8.64307, on average, on the audience scores.
Therefore, we will stick with it and run the model diagnostics.
final_model <- aud_nom
#plot of the residuals vs. fitted (predicted) values.
ggplot(data = final_model, aes(x = .fitted, y = .resid)) +
geom_jitter() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")#Nearly normal residuals?
#method 1 - Histogram
ggplot(data = final_model, aes(x = .resid)) +
geom_histogram(binwidth = 4) +
xlab("Residuals")#method 2 - qqplot
ggplot(data = final_model, aes(sample = .resid)) +
stat_qq() #qq = quantile quantileAgain we have the same patter we found before, what’s really not a big surprise. Although the residuals seem to have a nearly normal distribution around zero, we can’t sustain homoscedasticity for the residuals distributions, as we the residuals variation on the left tail is greater than the variation on the right tail.
So, one more time, although we found good coefficients and p-values, we cannot feel much comfortable again using the regression model here.
Now let’s pick a movie from 2016 (a movie that’s not in the sample) and do a prediction for this movie using our the model
The selected movie will be Moana, which have received a score of 95 from rotten tomatoes critics (https://www.rottentomatoes.com/m/moana_2016). The film was nominated for a Best Animated Feature Film of the Year.
First, we need to create a new data frame for this movie
Then, I can do the prediction using the predict function:
## 1
## 89.02031
Finally, we can also construct a prediction interval around this prediction, which will provide a measure of uncertainty around the prediction.
## fit lwr upr
## 1 89.02031 60.31469 117.7259
Thus, the model predicts, with 95% confidence, that a movie with a critics score of 95 is expected to have an audience score between 60.31 and 117.72. This is an interval with a great uncertainty. Although the estimate (89.02) was exactly the number we can find on rotten tomatoes webesite.
This study showed us that critics scores from rotten tomatoes are not precise, and sometimes reliable, predictors of audience scores. Specially when the critics scores are low.
It seems that when critics on rotten tomatoes give a good valuation, the audience tended to do the same. We can see that on the plot by the greater concentration of dots on the right tail of the curve. On the other hand, when critics scores are lower we can find a variability on audience scores that is greater than the one we observe on the right tail. We could also say that this regression model does more precise predictions when the critics scores are higher. The higher the score, the more we could expect the audience to approve the movie. However, we could not be so certain when trying to predict audience reactions when the movie receives a lower critic score.
Also, the search for a period of the year associated with a higher audience score ended with no evidence found to support it.
Future studies could try to investigate this data using other approaches, like studying the movie genres separately, or trying to predict audience scores using other variables like Top 200 Box Office list or Number of votes on IMDB.