Movies data set includes information from Rotten Tomatoes and IMDB for a random sample of movies.
Sample collection methodology: Rotten Tomatoes and IMDB are review-aggregation websites that include information related to films and television programs. The rating for a particular film or television program is the result of critics and audiences reviews.
Data collection method implications: Since the sample was optained from a randomly sample selected movies in Rotten Tomatoes and IMDB, we can reneralized the results to the entire film population from 1970 to 2016 included either in Rotten Tomatoes and IMDB. In addition, it is important to point out that audience reviews comes from volunteers subjects; as volunteers are more likely to participate in the reviews, it can be a source of bias.
Scope of inference: Each film is equally likely to be selected, therefore we are dealing with a large-scale obervational study, the sample is representative of the population. As the groups are not escencially the same (due there is no random assignment), causal conclutions cannot be made.
In short, we have an observational study: not-causal-generalizable.
Research question: We are interested in learning what attributes make a movie popular. For this time we only consider IMDB ratings. We will take the following variables from movies dataset:
imdb_rating: Rating on IMDB.imdb_num_votes: Number of votes on IMDB.genre: Genre of movie (Action & Adventure, Comedy, Documentary, Drama, Horror, Mystery & Suspense, Other).runtime: Runtime of movie (in minutes).thtr_rel_year: Year the movie is released in theaters.best_pic_win: Whether or not the movie won a best picture Oscar (no, yes).Most of the variables are not numerical, therefore only a scatterplot would not be ideal for vizualizing any patterns in the data. Firstly, we would be interested in find out whether a particular film genre is more likely to have higher or lower scores than others. If so genre would be representative in order to estimate IMDB rating.
# Select the variables and remove NA values
best <- movies %>% select(imdb_rating,imdb_num_votes,genre,runtime,thtr_rel_year,best_pic_win) %>% na.omit()
# Create facet mapped plot to film genre
gg1 <- ggplot(data=best,aes(x=imdb_rating,fill=genre)) + geom_density(alpha=0.4) +
ggtitle("IMDB Rating Probability Densities") + labs(x="IMDB Rating",y="Kernel Estimate")
gg1 + facet_wrap(~genre)Comments
Secondly, runtime of movie (in minutes) could have to do with a higher or lower scores, and also could be more likely to win an Oscar. As both IMDB and runtime are numerical variables, let use a simple scatterplot and add a smoth linear regression. We could use facet_wrap to create a plot of each decade separetely.
Note: In order to facilitate the data manipulation We will consider in 2000's category all the movies from this year to the present.
# Create decade factor and add it to the data set
best <- best %>% mutate(decade=cut(x=best$thtr_rel_year,breaks = c(1970,1980,1990,2000,2020),
include.lowest = TRUE, right = TRUE,labels=c("70's","80's","90's","2000's")))
# Create faced mapped scatterplot to decade
gg2 <- ggplot(data=best,aes(x=runtime,y=imdb_rating)) +
geom_point(aes(col=best_pic_win)) + labs(x="Run Time (minutes)",y="IMDB Rating",col="Oscar Winner") +
geom_smooth(method = "lm")
gg2 + facet_wrap(~decade,nrow=1)## `geom_smooth()` using formula 'y ~ x'
Comments
Now let’s check the relationship between the number of votes and the IMDB Rating. We will use a simple scatterplot.
# Create the plot
ggplot(data=best,aes(x=imdb_num_votes,y=imdb_rating)) + geom_point(aes(col=best_pic_win)) +
labs(x="IMDB Number of Votes",y="IMDB Rating",col="Oscar Winner") + geom_smooth(method="lm")## `geom_smooth()` using formula 'y ~ x'
Comments
Now we can start modeling and confirm what we already figured out in the EDA. For this we will use the backward selection approach, by starting with a full model and then systematically drop terms according to significance of the changes in fit quality.
# Create the linear model
best.fit <- lm(imdb_rating~imdb_num_votes+runtime+genre+best_pic_win+decade,data=best)
summary(best.fit)##
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + runtime + genre +
## best_pic_win + decade, data = best)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0173 -0.4088 0.0597 0.5371 2.3980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.372e+00 2.499e-01 21.496 < 2e-16 ***
## imdb_num_votes 3.624e-06 3.472e-07 10.438 < 2e-16 ***
## runtime 5.540e-03 1.983e-03 2.794 0.005368 **
## genreAnimation 1.385e-01 3.060e-01 0.453 0.650855
## genreArt House & International 9.195e-01 2.528e-01 3.637 0.000299 ***
## genreComedy -1.292e-02 1.416e-01 -0.091 0.927339
## genreDocumentary 2.050e+00 1.637e-01 12.524 < 2e-16 ***
## genreDrama 7.552e-01 1.184e-01 6.378 3.47e-10 ***
## genreHorror 1.973e-02 2.086e-01 0.095 0.924672
## genreMusical & Performing Arts 1.469e+00 2.701e-01 5.440 7.62e-08 ***
## genreMystery & Suspense 5.133e-01 1.547e-01 3.319 0.000957 ***
## genreOther 4.008e-01 2.403e-01 1.668 0.095860 .
## genreScience Fiction & Fantasy -2.710e-01 3.048e-01 -0.889 0.374371
## best_pic_winyes -9.971e-02 3.477e-01 -0.287 0.774373
## decade80's -4.872e-02 1.386e-01 -0.352 0.725307
## decade90's -3.901e-01 1.289e-01 -3.026 0.002576 **
## decade2000's -3.718e-01 1.225e-01 -3.034 0.002510 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8537 on 633 degrees of freedom
## Multiple R-squared: 0.3961, Adjusted R-squared: 0.3808
## F-statistic: 25.95 on 16 and 633 DF, p-value: < 2.2e-16
With this approach we want to remove all the terms that do not result in a statistically significant detriment of the goodness of fit. In this case we will use drop1 function and then we will choose the term to remove from the model that has the leas significant effect of reducinf the goodness of fit. In other words, the term with the largest p-value for its F-test.
## Single term deletions
##
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + best_pic_win +
## decade
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 461.29 -188.919
## imdb_num_votes 1 79.396 540.68 -87.690 108.9515 < 2.2e-16 ***
## runtime 1 5.688 466.97 -182.953 7.8049 0.0053680 **
## genre 10 188.205 649.49 13.492 25.8264 < 2.2e-16 ***
## best_pic_win 1 0.060 461.35 -190.834 0.0822 0.7743726
## decade 3 13.912 475.20 -175.605 6.3638 0.0002973 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Udate the model by removing the least significant term
best.fit1 <- update(best.fit,.~.-best_pic_win)We can see that removing the variable best_pic_win, regarding whether or not is an Oscar Winning Film, has the single least significant effect on reducing the goodness of fit, so we will star with update function by removing this term from best.fit. The new model will be named best.fit1. Then we will apply drop1 function one more time and repeat the process.
## Single term deletions
##
## Model:
## imdb_rating ~ imdb_num_votes + runtime + genre + decade
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 461.35 -190.834
## imdb_num_votes 1 85.362 546.71 -82.487 117.3074 < 2.2e-16 ***
## runtime 1 5.629 466.98 -184.951 7.7357 0.0055750 **
## genre 10 188.227 649.57 11.574 25.8668 < 2.2e-16 ***
## decade 3 13.863 475.21 -177.590 6.3504 0.0003028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Update the model by removing the least significant term
best.fit2 <- update(best.fit1,.~.-runtime)Finally we will remove runtime which, even though it has somewhat small p-value, we want to follow the principle of parsimony, which establishes that the simpler model is better. Also, the addition of runtime could bring nothing new to the model. The final model is shown below as bestfit2 We will see that some film genres do have large p-values, as we see it in the Kernel Estimates plot at the beginning od EDA. It is important to recall that we cannot remove a particular film genre from the model, although some genres does not affect the rating, we must treat the variable as a whole. Something similar happens with decade variable.
##
## Call:
## lm(formula = imdb_rating ~ imdb_num_votes + genre + decade, data = best)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0368 -0.4235 0.0674 0.5717 2.4467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.936e+00 1.477e-01 40.187 < 2e-16 ***
## imdb_num_votes 3.924e-06 3.118e-07 12.587 < 2e-16 ***
## genreAnimation 6.037e-02 3.061e-01 0.197 0.843696
## genreArt House & International 9.342e-01 2.538e-01 3.681 0.000252 ***
## genreComedy -4.029e-02 1.417e-01 -0.284 0.776208
## genreDocumentary 2.041e+00 1.641e-01 12.441 < 2e-16 ***
## genreDrama 7.973e-01 1.177e-01 6.774 2.85e-11 ***
## genreHorror -3.008e-02 2.088e-01 -0.144 0.885479
## genreMusical & Performing Arts 1.540e+00 2.700e-01 5.704 1.79e-08 ***
## genreMystery & Suspense 5.473e-01 1.548e-01 3.536 0.000435 ***
## genreOther 4.225e-01 2.411e-01 1.752 0.080239 .
## genreScience Fiction & Fantasy -2.917e-01 3.061e-01 -0.953 0.340937
## decade80's -3.766e-02 1.389e-01 -0.271 0.786294
## decade90's -3.963e-01 1.289e-01 -3.073 0.002207 **
## decade2000's -4.007e-01 1.217e-01 -3.293 0.001046 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8576 on 635 degrees of freedom
## Multiple R-squared: 0.3886, Adjusted R-squared: 0.3752
## F-statistic: 28.83 on 14 and 635 DF, p-value: < 2.2e-16
Check Conditions
Fitted vs Residuals plot we could see a kind of inverted fan shape pattern, which indicates a poor linear relationship between the response variable and the explanatory varibales.Normal QQ plot.par(mfrow=c(1,3))
plot(x=best.fit2$fitted.values,y=best.fit2$residuals,xlab="Fitted Values",ylab="Residuals",
main="Fitted vs Residuals")
abline(h=0,lty=2,col="gray")
hist(x=best.fit2$residuals,main="Histogram of Residuals",xlab="Residuals")
plot(best.fit2,which = 2,add.smooth = FALSE)As we could see, the model does not fit the data well, but we will use it anyway and pick a recent movie and do a prediction for its IMDB rating using the model. Also we will quantify the uncertainty around this prediction using an appropriate interval. We will try to predict IMDB Rating of It Chapter:2.
itChapter2 <- data.frame(imdb_num_votes=2184+418,genre="Horror", decade="2000's")
predict(best.fit2,newdata=itChapter2,interval="prediction",level=0.9)## fit lwr upr
## 1 5.515917 4.071357 6.960477
The real IMDB Rating is 6,6 which is include in the prediction interval, however not in all the film cases the interval will include the real rating since the model does not fit the data aqqurately. Moreover I used a very bad and predictable movie. If we use the model again but this time with another movie like The Shape of Water we see that the real IMDB rating is 7.3 which is not include in the prediction interval.
shapeOfWater <- data.frame(imdb_num_votes=1589+996,
genre="Science Fiction & Fantasy", decade="2000's")
predict(best.fit2,newdata=shapeOfWater,interval="prediction",level=0.9)## fit lwr upr
## 1 5.254198 3.761581 6.746816
Here the overall goal of fitting any statistical model is to faithfully represent the data and the relationship between the variables. In general we are interested in figure out the balancing between goodness-of-fit (good reponse-predictor relationship) and complexity (how complicated the model is).
With this assignment we could confirm that selection of the appropriate variables can culminate in a suitable model which sadly was not the case this time. Even though satisfactory conclusions are not reached we have concluded that the variables used in the model, as a whole, are not significant to predict something as subjective as the rating of a film.