library(dplyr)
library(ggplot2)
library(magrittr)
library(statsr)
library(GGally)
library(gridExtra)
library(knitr)
Load the required dataset:
load("movies.Rdata.gz")
In this project I will use a dataset of 651 randomly selected movies and many variables pertaining to them, such as their MPAA rating, release date, critics’ score, and audience score. I will perform some exploratory data analysis, followed by building a linear model to predict the critics’ score of a film.
The dataset comprises 651 randomly sampled movies produced and released before 2016; the data draw from imdb.com, rottentomatoes.com, and flixster.com. Because the movies were randomly sampled but not randomly assigned, we can draw only associations, not causal conclusions. Given the random selection and the size of the sample, the results should generalize to movies produced and released before 2016.
Can the number of IMDb votes, the audience rating, the audience score, and whether or not a movie has won an award be used to predict the critics’ score of a movie?
Let’s take a quick look at the variables in the movies dataset.
names(movies)
## [1] "title" "title_type" "genre" "runtime"
## [5] "mpaa_rating" "studio" "thtr_rel_year" "thtr_rel_month"
## [9] "thtr_rel_day" "dvd_rel_year" "dvd_rel_month" "dvd_rel_day"
## [13] "imdb_rating" "imdb_num_votes" "critics_rating" "critics_score"
## [17] "audience_rating" "audience_score" "best_pic_nom" "best_pic_win"
## [21] "best_actor_win" "best_actress_win" "best_dir_win" "top200_box"
## [25] "director" "actor1" "actor2" "actor3"
## [29] "actor4" "actor5" "imdb_url" "rt_url"
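To see the variable types alongside the names, dplyr’s glimpse() is a handy complement (output omitted here):
glimpse(movies)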
Let’s create a variable called won_award with the levels “yes” and “no”. If a movie won an award for best director, best actor, best actress, or best picture, the value for won_award would be “yes”.
movies <- movies %>%
  mutate(won_award = ifelse(best_pic_win == "yes" | best_actor_win == "yes" |
                              best_actress_win == "yes" | best_dir_win == "yes",
                            "yes", "no"))
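As a quick sanity check on the new variable, we can tabulate its levels (output omitted):
table(movies$won_award)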
Now let’s take a look at critics_rating and audience_rating and compare the two.
cr_plot <- ggplot(movies, aes(x = critics_rating)) +
  geom_bar(fill = c("green4", "green2", "brown4"))
ar_plot <- ggplot(movies, aes(x = audience_rating)) +
  geom_bar(fill = c("brown4", "green4"))
grid.arrange(cr_plot, ar_plot, ncol = 2)
Let’s take a look at the proportion of critics_rating values that are Certified Fresh or Fresh. To define the levels:
Fresh = 60%-74%
Certified Fresh = 75%+
with(movies, sum(critics_rating %in% c("Certified Fresh", "Fresh")) /
       sum(critics_rating %in% c("Certified Fresh", "Fresh", "Rotten")))
## [1] 0.5284178
Let’s take a look at the proportion of audience_rating values that are Upright. To define the level: Upright = 3.5 stars or higher (out of 5), as rated by Flixster and Rotten Tomatoes users.
with(movies, sum(audience_rating == "Upright") /
       sum(audience_rating %in% c("Spilled", "Upright")))
## [1] 0.577573
The dotcharts below visualize a sample of critics_score and audience_score values to show how critics and audiences can differ in how they rate a movie. Take a look at Mad Dog Time, for instance.
par(mfrow = c(1, 2))
dotchart(movies$critics_score[1:10], labels = movies$title[1:10], xlim = c(0, 100))
dotchart(movies$audience_score[1:10], labels = movies$title[1:10], xlim = c(0, 100))
Now let’s take a look at when movies get released.
ggplot(movies, aes(x = thtr_rel_month)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = seq(1, 12, 1))
Four months stand out as having far more releases: January, June, October, and December. This makes sense intuitively: June brings the summer blockbusters, October the Halloween horror movies, and December and January the holiday films.
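A quick tabulation of releases per month backs this up (counts omitted here):
table(movies$thtr_rel_month)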
Let’s take a look at the correlations among the variables imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, and audience_score.
suppressWarnings(suppressMessages(print(ggpairs(movies, columns=13:18))))
We see some clear correlations here. For instance, see the scatter plot between critics_score and audience_score. This plot gives us a good idea of how these variables interact.
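To put numbers on the pairwise relationships, we can also compute the correlation matrix of the numeric variables directly (a supplementary check; output omitted):
cor(movies[, c("imdb_rating", "imdb_num_votes", "critics_score", "audience_score")],
    use = "complete.obs")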
I chose a subset of variables to work with based on my exploratory data analysis; the correlation plot was especially informative about which variables would inform the model. My response variable will be critics_score. My explanatory variable set will be the following (a sketch of automating the selection appears after the list):
imdb_rating
imdb_num_votes
audience_score
audience_rating
won_award
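Below I step through the forward selection by hand. The same search can be automated; here is a minimal sketch, where forward_select_adjr2 is a helper name of my own rather than a function from any package:
# Forward selection by adjusted R-squared: at each pass, add the candidate
# that most improves adjusted R-squared; stop when no candidate improves it.
forward_select_adjr2 <- function(response, candidates, data) {
  selected <- character(0)
  best_adjr2 <- -Inf
  repeat {
    remaining <- setdiff(candidates, selected)
    if (length(remaining) == 0) break
    scores <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      summary(lm(f, data = data))$adj.r.squared
    })
    if (max(scores) <= best_adjr2) break  # no remaining candidate helps
    best_adjr2 <- max(scores)
    selected <- c(selected, names(which.max(scores)))
  }
  list(variables = selected, adj.r.squared = best_adjr2)
}
forward_select_adjr2("critics_score",
                     c("imdb_rating", "imdb_num_votes", "audience_score",
                       "audience_rating", "won_award"),
                     movies)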
#step 1
summary(lm(movies$critics_score ~ movies$imdb_rating))$adj.r.squared # 0.5846403
## [1] 0.5846403
summary(lm(movies$critics_score ~ movies$imdb_num_votes))$adj.r.squared # 0.04231254
## [1] 0.04231254
# excluded: critics_rating is itself a categorization of critics_score,
# so it would trivially predict the response
#summary(lm(movies$critics_score ~ movies$critics_rating))$adj.r.squared # 0.7745503
summary(lm(movies$critics_score ~ movies$audience_rating))$adj.r.squared # 0.3431939
## [1] 0.3431939
summary(lm(movies$critics_score ~ movies$audience_score))$adj.r.squared # 0.4952284
## [1] 0.4952284
#step 2
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$imdb_num_votes))$adj.r.squared # 0.5861897
## [1] 0.5861897
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_rating))$adj.r.squared # 0.5898401
## [1] 0.5898401
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))$adj.r.squared # 0.5912307
## [1] 0.5912307
#step 3
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes))$adj.r.squared # 0.5928506
## [1] 0.5928506
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating))$adj.r.squared # 0.5910414
## [1] 0.5910414
#step 4
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes + movies$audience_rating))$adj.r.squared
## [1] 0.5926277
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes + movies$audience_rating + movies$won_award))$adj.r.squared #reduces
## [1] 0.5922498
We see in step 4 that the addition of the variable won_award reduces the adjusted R2 score, so we will leave it out of the model.
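For reference, base R’s step() automates a similar search using AIC rather than adjusted R2 (an alternative criterion, not the one used above):
# Backward elimination by AIC, starting from the full candidate model.
full <- lm(critics_score ~ imdb_rating + imdb_num_votes + audience_score +
             audience_rating + won_award, data = movies)
step(full, direction = "backward", trace = 0)
Next, let’s repeat the selection using the p-value criterion.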
# step 1
summary(lm(movies$critics_score ~ movies$imdb_rating))[[4]] # 3.743006e-126
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -72.37919 4.3572315 -16.61128 6.716290e-52
## movies$imdb_rating 20.03167 0.6618978 30.26398 3.743006e-126
summary(lm(movies$critics_score ~ movies$imdb_num_votes))[[4]] # 7.105787e-08
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.463854e+01 1.224634e+00 44.616225 6.927730e-200
## movies$imdb_num_votes 5.300671e-05 9.723417e-06 5.451449 7.105787e-08
summary(lm(movies$critics_score ~ movies$audience_rating))[[4]] # 1.850804e-61
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.21818 1.388085 27.53303 3.754628e-111
## movies$audience_ratingUpright 33.71001 1.826470 18.45637 1.850804e-61
summary(lm(movies$critics_score ~ movies$audience_score))[[4]] # 1.214966e-98
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.9987221 2.56578617 -1.558478 1.196074e-01
## movies$audience_score 0.9891662 0.03913966 25.272732 1.214966e-98
summary(lm(movies$critics_score ~ movies$won_award))[[4]] # 5.681539e-02
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.422917 1.293787 43.61068 4.581765e-195
## movies$won_awardyes 4.816849 2.524382 1.90813 5.681539e-02
# step 2
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$imdb_num_votes))[[4]] # 6.447851e-02
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.444564e+01 4.489947e+00 -16.580514 9.898082e-52
## movies$imdb_rating 2.046108e+01 7.001676e-01 29.223115 2.129242e-120
## movies$imdb_num_votes -1.254511e-05 6.773766e-06 -1.852014 6.447851e-02
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_rating))[[4]] # 2.479742e-03
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.35563 5.2508843 -12.06571 2.207924e-30
## movies$imdb_rating 18.09940 0.9150093 19.78056 1.726793e-68
## movies$audience_ratingUpright 6.09938 2.0078938 3.03770 2.479742e-03
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.3239246 5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating 16.2014831 1.3080084 12.386376 8.941152e-32
## movies$audience_score 0.2375537 0.0701619 3.385794 7.526017e-04
# step 3
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating))[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -61.3467693 5.3740917 -11.4152816 1.263006e-27
## movies$imdb_rating 16.4346011 1.3376499 12.2861750 2.474313e-31
## movies$audience_score 0.1748909 0.1026356 1.7039977 8.886167e-02
## movies$audience_ratingUpright 2.4533719 2.9322439 0.8366875 4.030772e-01
# step 4
#final model
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.3239246 5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating 16.2014831 1.3080084 12.386376 8.941152e-32
## movies$audience_score 0.2375537 0.0701619 3.385794 7.526017e-04
As a further check, let’s start from the full model and drop the weakest variables one at a time.
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating + movies$won_award + movies$imdb_num_votes))[[4]]
## Estimate Std. Error t value
## (Intercept) -6.331079e+01 5.482227e+00 -11.5483723
## movies$imdb_rating 1.674909e+01 1.361012e+00 12.3063497
## movies$audience_score 1.819399e-01 1.026526e-01 1.7723856
## movies$audience_ratingUpright 2.396771e+00 2.929228e+00 0.8182262
## movies$won_awardyes 1.043272e+00 1.647156e+00 0.6333776
## movies$imdb_num_votes -1.319090e-05 6.787363e-06 -1.9434496
## Pr(>|t|)
## (Intercept) 3.575083e-28
## movies$imdb_rating 2.063833e-31
## movies$audience_score 7.680255e-02
## movies$audience_ratingUpright 4.135302e-01
## movies$won_awardyes 5.267116e-01
## movies$imdb_num_votes 5.239649e-02
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating + movies$imdb_num_votes))[[4]]
## Estimate Std. Error t value
## (Intercept) -6.342409e+01 5.476767e+00 -11.5805712
## movies$imdb_rating 1.684144e+01 1.352552e+00 12.4516018
## movies$audience_score 1.784088e-01 1.024536e-01 1.7413630
## movies$audience_ratingUpright 2.352686e+00 2.927044e+00 0.8037755
## movies$imdb_num_votes -1.261082e-05 6.722171e-06 -1.8760035
## Pr(>|t|)
## (Intercept) 2.598154e-28
## movies$imdb_rating 4.730279e-32
## movies$audience_score 8.209599e-02
## movies$audience_ratingUpright 4.218223e-01
## movies$imdb_num_votes 6.110549e-02
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes))[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.437715e+01 5.345409e+00 -12.043448 2.779897e-30
## movies$imdb_rating 1.662116e+01 1.324134e+00 12.552477 1.683568e-32
## movies$audience_score 2.385074e-01 7.002456e-02 3.406054 6.998851e-04
## movies$imdb_num_votes -1.270989e-05 6.719202e-06 -1.891577 5.899372e-02
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]]
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -62.3239246 5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating 16.2014831 1.3080084 12.386376 8.941152e-32
## movies$audience_score 0.2375537 0.0701619 3.385794 7.526017e-04
cs_ar2_final <- lm(critics_score ~ imdb_rating + audience_score + imdb_num_votes + audience_rating, data = movies)
cs_pv_final <- lm(critics_score ~ imdb_rating + audience_score, data = movies)
The two models differ: the adjusted R2 model additionally includes the variables audience_rating and imdb_num_votes.
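To compare the two candidates side by side, we can tabulate their adjusted R2 values (and, as a supplementary reference, AIC, which was not part of the selection above):
data.frame(model  = c("adjusted R-squared model", "p-value model"),
           adj_r2 = c(summary(cs_ar2_final)$adj.r.squared,
                      summary(cs_pv_final)$adj.r.squared),
           AIC    = c(AIC(cs_ar2_final), AIC(cs_pv_final)))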
If we plot the residuals of the adjusted R2 model, this is what we get:
ggplot(movies, aes(x = cs_ar2_final$residuals)) + geom_histogram(binwidth = 5)
The result is a nearly normal distribution centered at approximately 0, which satisfies the nearly-normal-residuals condition for linear regression.
Let’s take a look at a histogram plot of the p-value model’s residuals:
ggplot(movies, aes(x = cs_pv_final$residuals)) + geom_histogram(binwidth = 5)
The result is again nearly normal, which is what we want. Still, I don’t want to give up the accuracy that the extra variables in the adjusted R2 model may provide. For that reason I select the adjusted R2 model: it contains all the variables of the p-value model plus additional ones that may improve its predictions.
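Because the p-value model is nested in the adjusted R2 model, a partial F-test offers one more point of comparison (a supplementary check, not part of the selection above):
# F-test for whether audience_rating and imdb_num_votes jointly improve fit
anova(cs_pv_final, cs_ar2_final)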
plot(cs_ar2_final$residuals ~ movies$imdb_rating, col="red")
abline(h=0, lty=2)
plot(cs_ar2_final$residuals ~ movies$audience_score, col="red")
abline(h=0, lty=2)
plot(cs_ar2_final$residuals ~ movies$imdb_num_votes, col="red")
abline(h=0, lty=2)
We see random scatter around zero in all three plots, which is consistent with the linearity condition.
Let’s rename cs_ar2_final, giving it the name cs_final to denote its finality. We’ll then plot the residuals in a scatterplot and histogram and plot the normal probability plot of residuals.
cs_final <- lm(critics_score ~ imdb_rating + audience_score + audience_rating + imdb_num_votes, data=movies)
plot(cs_final$residuals, col="red")
abline(h=0, lty=2)
ggplot(movies, aes(x = cs_final$residuals)) + geom_histogram(binwidth = 5)
qqnorm(cs_final$residuals, col="red")
qqline(cs_final$residuals)
We definitely see a random scatter around 0 in this residuals plot.
Now let’s plot cs_final$residuals ~ cs_final$fitted and abs(cs_final$residuals) ~ cs_final$fitted to check for constant variability.
plot(cs_final$residuals ~ cs_final$fitted, col="red")
abline(h=0, lty=2)
plot(abs(cs_final$residuals) ~ cs_final$fitted, col="red")
We do not see a fan shape in these plots, so the constant variability condition appears to be met.
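For a formal complement to the visual check, the Breusch-Pagan test from the lmtest package could be used (assuming lmtest is installed; it is not loaded above):
lmtest::bptest(cs_final)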
The movie whose critics_score I will predict is Star Trek Beyond.
It was released on July 22, 2016. All information is from imdb.com and rottentomatoes.com.
imdb_num_votes = 122567
imdb_rating = 7.2
audience_score = 81
audience_rating = Upright
For reference, critics_score = 84
First, I’ll create the data.frame newmovie to hold the data about Star Trek Beyond.
newmovie <- data.frame(imdb_rating = 7.2, imdb_num_votes = 122567, audience_score = 81, audience_rating = "Upright")
Now, I will run the predict function.
predict(cs_final, newmovie, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 73.09239 37.43775 108.747
We predict that this movie would receive a critics’ score of 73.09.
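We can reproduce that point estimate by hand from the fitted coefficients, as a quick sanity check (audience_ratingUpright is the model’s indicator for an Upright rating):
b <- coef(cs_final)
b["(Intercept)"] + b["imdb_rating"] * 7.2 + b["audience_score"] * 81 +
  b["audience_ratingUpright"] * 1 + b["imdb_num_votes"] * 122567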
We can also see that, based on this model, with 95% confidence a movie (in this case Star Trek Beyond) with an imdb_rating of 7.2, an audience_score of 81, an audience_rating of “Upright”, and an imdb_num_votes of 122,567 is expected to receive a critics’ score between 37.44 and 100 (the upper bound of 108.75 is truncated, since the score cannot exceed 100).
Our predicted score of 73.09 is about 87% of the true score of 84, and the true score does fall within the 95% prediction interval. The point estimate is decent and could likely be improved.
To improve the predictions from this model, we would need more observations. To improve the model in general, we would need a greater number of useful variables.
I think many useful questions could be addressed using a similar model; forecasting how much money a movie will make is one example.