library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(magrittr)
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
## 
## Type BFManual() to open the manual.
## ************
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(knitr)

Load the required dataset*

load("movies.Rdata.gz")

Part 0: Abstract

In this project I will use a dataset of 651 randomly selected movies and many factors pertaining to them, such as their mpaa rating, release date, critics’ score, audience score, and others. I will perform some exploratory data analysis, followed by building a linear model to predict the critics’ score of a film.

Part 1: Data

The dataset is comprised of 651 randomly sampled movies produced and released before 2016. The mdata draws from imdb.com, rottentomatoes.com, and flixster.com. As this is random sampling, only correlations can be drawn. Because of the randomness of the selection and the size of the dataset, our results can be generalizable.


Part 2: Research question

Can the number of imdb votes, the audience rating, audience score, and knowledge of whether a movie has won an award or not be useful in predicting the critics’ score of a movie?


Part 3: Exploratory data analysis

Let’s take a quick look at the variables in the movies dataset.

names(movies)
##  [1] "title"            "title_type"       "genre"            "runtime"         
##  [5] "mpaa_rating"      "studio"           "thtr_rel_year"    "thtr_rel_month"  
##  [9] "thtr_rel_day"     "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"   "critics_score"   
## [17] "audience_rating"  "audience_score"   "best_pic_nom"     "best_pic_win"    
## [21] "best_actor_win"   "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"           "actor3"          
## [29] "actor4"           "actor5"           "imdb_url"         "rt_url"

Let’s create a variable called won_award with the levels “yes” and “no”. If a movie won an award for best director, best actor, best actress, or best picture, the value for won_award would be “yes”.

movies <- movies %>%
  mutate(won_award = ifelse(best_pic_win=="yes" | best_actor_win=="yes" | best_actress_win=="yes" | best_dir_win=="yes" , "yes", "no"))

Now let’s take a look at critics_ratings and audience_rating and compare the two.

require(gridExtra)
cr_plot <- ggplot(movies, aes(x=critics_rating)) + geom_bar(fill=c("green4","green2","brown4"))
ar_plot <- ggplot(movies, aes(x=audience_rating)) + geom_bar(fill=c("brown4","green4"))
suppressWarnings(suppressMessages(print(grid.arrange(cr_plot, ar_plot, ncol=2))))

## TableGrob (1 x 2) "arrange": 2 grobs
##   z     cells    name           grob
## 1 1 (1-1,1-1) arrange gtable[layout]
## 2 2 (1-1,2-2) arrange gtable[layout]

Let’s take a look at the proportion of critics_ratings that are Certified Fresh or Fresh. To define the levels:

( length(which(movies$critics_rating=="Certified Fresh")) + length(which(movies$critics_rating=="Fresh")) ) / ( length(which(movies$critics_rating=="Certified Fresh")) + length(which(movies$critics_rating=="Fresh")) + length(which(movies$critics_rating=="Rotten")) ) 
## [1] 0.5284178

Let’s take a look at the proportion of audience_rating that are upright. To define: Upright = 3.5 stars or higher (out of 5), rated by Flixster and Rotten Tomatoes Users.

length(which(movies$audience_rating=="Upright")) / ( length(which(movies$audience_rating=="Spilled")) + length(which(movies$audience_rating=="Upright")) )
## [1] 0.577573

These dotcharts below are simply to visualize a sample of critics_score and audience_score to show how critics and the audience can vary on how they rate a movie. Take a look at Mad Dog Time, for instance.

par(mfrow=c(1,2))
dotchart(movies$critics_score[1:10], movies$title[1:25],xlim=c(0,100),panel.first=grid())
dotchart(movies$audience_score[1:10], movies$title[1:25],xlim=c(0,100),panel.first=grid())

Now let’s take a look at when movies get released.

ggplot(movies, aes(x=thtr_rel_month)) + geom_histogram(binwidth = .2) + scale_x_continuous(breaks=seq(1,12,1),lim=c(0,13))
## Warning: Removed 2 rows containing missing values (geom_bar).

Four months stick out as having far more releases: January, June, October, and December. This makes sense intuitively; June would have the summer blockbusters, October would have the horror movies and Thanksgiving family films, and December and January would see the release of holiday films.

Let’s take a look at the correlation between the variables imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, and audience_score.

suppressWarnings(suppressMessages(print(ggpairs(movies, columns=13:18))))

We see some clear correlations here. For instance, see the scatter plot between critics_score and audience_score. This plot gives us a good idea of how these variables interact.


Part 4: Modeling

The final models I’ve selected are:

  • Using Adjusted R2
cs_ar2_final <- lm(critics_score ~ imdb_rating + audience_score + audience_rating + imdb_num_votes, data=movies)
  • Using p-values
cs_pv_final <- lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score, data = movies)

See below for my methods. I used forward selection to build the adjusted R2 model, and both forward selection and backwards elimination to build the p-value model (testing to make sure I’d get the same result).

I chose a subset of variables with which to work based on my exploratory data analysis. The correlation plot was very informative regarding which variables would inform the model. My response variable will be critics_score. My explanatory variable set will be:

  • imdb_rating

  • imdb_num_votes

  • audience_score

  • audience_rating

  • won_award

Methods

Forward selection using adjusted R2

Step 1

#step 1
summary(lm(movies$critics_score ~ movies$imdb_rating))$adj.r.squared # 0.5846403
## [1] 0.5846403
summary(lm(movies$critics_score ~ movies$imdb_num_votes))$adj.r.squared # 0.04231254
## [1] 0.04231254
#summary(lm(movies$critics_score ~ movies$critics_rating))$adj.r.squared # 0.7745503
summary(lm(movies$critics_score ~ movies$audience_rating))$adj.r.squared # 0.3431939
## [1] 0.3431939
summary(lm(movies$critics_score ~ movies$audience_score))$adj.r.squared # 0.4952284
## [1] 0.4952284

Step 2

#step 2
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$imdb_num_votes))$adj.r.squared # 0.5861897
## [1] 0.5861897
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_rating))$adj.r.squared # 0.5898401
## [1] 0.5898401
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))$adj.r.squared # 0.5912307
## [1] 0.5912307

Step 3

#step 3
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes))$adj.r.squared # 0.5928506
## [1] 0.5928506
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating))$adj.r.squared # 0.5910414
## [1] 0.5910414

Step 4

#step 4
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes + movies$audience_rating))$adj.r.squared
## [1] 0.5926277

Step 5

summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes + movies$audience_rating + movies$won_award))$adj.r.squared #reduces
## [1] 0.5922498

We see in step 5 that the addition of the variable won_award reduces the adjusted R2 score, so we will leave it out of the model.

Forward selection using p-values

Step 1

# step 1
summary(lm(movies$critics_score ~ movies$imdb_rating))[[4]] # 3.743006e-126
##                     Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)        -72.37919  4.3572315 -16.61128  6.716290e-52
## movies$imdb_rating  20.03167  0.6618978  30.26398 3.743006e-126
summary(lm(movies$critics_score ~ movies$imdb_num_votes))[[4]] # 7.105787e-08
##                           Estimate   Std. Error   t value      Pr(>|t|)
## (Intercept)           5.463854e+01 1.224634e+00 44.616225 6.927730e-200
## movies$imdb_num_votes 5.300671e-05 9.723417e-06  5.451449  7.105787e-08
summary(lm(movies$critics_score ~ movies$audience_rating))[[4]] # 1.850804e-61
##                               Estimate Std. Error  t value      Pr(>|t|)
## (Intercept)                   38.21818   1.388085 27.53303 3.754628e-111
## movies$audience_ratingUpright 33.71001   1.826470 18.45637  1.850804e-61
summary(lm(movies$critics_score ~ movies$audience_score))[[4]] # 1.214966e-98
##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)           -3.9987221 2.56578617 -1.558478 1.196074e-01
## movies$audience_score  0.9891662 0.03913966 25.272732 1.214966e-98
summary(lm(movies$critics_score ~ movies$won_award))[[4]] # 5.681539e-02
##                      Estimate Std. Error  t value      Pr(>|t|)
## (Intercept)         56.422917   1.293787 43.61068 4.581765e-195
## movies$won_awardyes  4.816849   2.524382  1.90813  5.681539e-02

Step 2

# step 2
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$imdb_num_votes))[[4]] # 6.447851e-02
##                            Estimate   Std. Error    t value      Pr(>|t|)
## (Intercept)           -7.444564e+01 4.489947e+00 -16.580514  9.898082e-52
## movies$imdb_rating     2.046108e+01 7.001676e-01  29.223115 2.129242e-120
## movies$imdb_num_votes -1.254511e-05 6.773766e-06  -1.852014  6.447851e-02
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_rating))[[4]] # 2.479742e-03
##                                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                   -63.35563  5.2508843 -12.06571 2.207924e-30
## movies$imdb_rating             18.09940  0.9150093  19.78056 1.726793e-68
## movies$audience_ratingUpright   6.09938  2.0078938   3.03770 2.479742e-03
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]] 
##                          Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           -62.3239246  5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating     16.2014831  1.3080084  12.386376 8.941152e-32
## movies$audience_score   0.2375537  0.0701619   3.385794 7.526017e-04

Step 3

# step 3
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating))[[4]]
##                                  Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)                   -61.3467693  5.3740917 -11.4152816 1.263006e-27
## movies$imdb_rating             16.4346011  1.3376499  12.2861750 2.474313e-31
## movies$audience_score           0.1748909  0.1026356   1.7039977 8.886167e-02
## movies$audience_ratingUpright   2.4533719  2.9322439   0.8366875 4.030772e-01

Step 4

# step 4
#final model
summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]] 
##                          Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           -62.3239246  5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating     16.2014831  1.3080084  12.386376 8.941152e-32
## movies$audience_score   0.2375537  0.0701619   3.385794 7.526017e-04

Backward selection using p-values

Step 1

summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating + movies$won_award + movies$imdb_num_votes))[[4]]
##                                    Estimate   Std. Error     t value
## (Intercept)                   -6.331079e+01 5.482227e+00 -11.5483723
## movies$imdb_rating             1.674909e+01 1.361012e+00  12.3063497
## movies$audience_score          1.819399e-01 1.026526e-01   1.7723856
## movies$audience_ratingUpright  2.396771e+00 2.929228e+00   0.8182262
## movies$won_awardyes            1.043272e+00 1.647156e+00   0.6333776
## movies$imdb_num_votes         -1.319090e-05 6.787363e-06  -1.9434496
##                                   Pr(>|t|)
## (Intercept)                   3.575083e-28
## movies$imdb_rating            2.063833e-31
## movies$audience_score         7.680255e-02
## movies$audience_ratingUpright 4.135302e-01
## movies$won_awardyes           5.267116e-01
## movies$imdb_num_votes         5.239649e-02

Step 2

summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$audience_rating + movies$imdb_num_votes))[[4]]
##                                    Estimate   Std. Error     t value
## (Intercept)                   -6.342409e+01 5.476767e+00 -11.5805712
## movies$imdb_rating             1.684144e+01 1.352552e+00  12.4516018
## movies$audience_score          1.784088e-01 1.024536e-01   1.7413630
## movies$audience_ratingUpright  2.352686e+00 2.927044e+00   0.8037755
## movies$imdb_num_votes         -1.261082e-05 6.722171e-06  -1.8760035
##                                   Pr(>|t|)
## (Intercept)                   2.598154e-28
## movies$imdb_rating            4.730279e-32
## movies$audience_score         8.209599e-02
## movies$audience_ratingUpright 4.218223e-01
## movies$imdb_num_votes         6.110549e-02

Step 3

summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes))[[4]]
##                            Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)           -6.437715e+01 5.345409e+00 -12.043448 2.779897e-30
## movies$imdb_rating     1.662116e+01 1.324134e+00  12.552477 1.683568e-32
## movies$audience_score  2.385074e-01 7.002456e-02   3.406054 6.998851e-04
## movies$imdb_num_votes -1.270989e-05 6.719202e-06  -1.891577 5.899372e-02

Step 4

summary(lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score))[[4]]
##                          Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)           -62.3239246  5.2444415 -11.883806 1.330077e-29
## movies$imdb_rating     16.2014831  1.3080084  12.386376 8.941152e-32
## movies$audience_score   0.2375537  0.0701619   3.385794 7.526017e-04

Selecting final models

cs_ar2_final <- lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score + movies$imdb_num_votes + movies$audience_rating) 
cs_pv_final <- lm(movies$critics_score ~ movies$imdb_rating + movies$audience_score)

We see a difference in these two models, namely that the adjusted R2 model includes the variables audience_rating and imdb_num_votes.

If we plot the residuals of the adjusted R2 model, this is what we get:

ggplot(movies, aes(x=cs_ar2_final$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The result is a nearly normal distribution with a mean of approximately 0. This satisfies the necessary criteria for being a legitimate linear model.

Let’s take a look at a histogram plot of the p-value model’s residuals:

ggplot(movies, aes(x=cs_pv_final$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The result is a nearly normal distribution, which is what we want. I don’t want to lose some of the accuracy the variables in the adjusted R2 model could give me, though. For that reason, I am selecting the adjusted R2 model, as it contains all the variables in the p-value model as well as additional variables that will provide us better accuracy.

Plotting residuals against individual variables

plot(cs_ar2_final$residuals ~ movies$imdb_rating, col="red")
abline(h=0, lty=2)

plot(cs_ar2_final$residuals ~ movies$audience_score, col="red")
abline(h=0, lty=2)

plot(cs_ar2_final$residuals ~ movies$imdb_num_votes, col="red")
abline(h=0, lty=2)

We see random scatter for all plots, which meets the criteria for a linear model.

Let’s rename cs_ar2_final, giving it the name cs_final to denote its finality. We’ll then plot the residuals in a scatterplot and histogram and plot the normal probability plot of residuals.

Plotting the final model

cs_final <- lm(critics_score ~ imdb_rating + audience_score + audience_rating + imdb_num_votes, data=movies)
plot(cs_final$residuals, col="red")
abline(h=0, lty=2)

ggplot(movies, aes(x=cs_final$residuals)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qqnorm(cs_final$residuals, col="red")
qqline(cs_final$residuals)

We definitely see a random scatter around 0 in this residuals plot.

Checking for constant variability of the final model

Now let’s plot cs_final$residuals ~ cs_final$fitted and abs(cs_final$residuals) ~ cs_final$fitted to check for constant variability.

plot(cs_final$residuals ~ cs_final$fitted, col="red")
abline(h=0, lty=2)

plot(abs(cs_final$residuals) ~ cs_final$fitted, col="red")

We do not see a fan shape in these plots, so the constant variability conditions appears to be met.


Part 5: Prediction

The movie whose critics_score I will predict is Star Trek Beyond.

It was released on July 22, 2016. All information is from imdb.com and rottentomatoes.com.

For reference, critics_score = 84

First, I’ll create the data.frame newmovie to include the data about Star Trek Beyond

newmovie <- data.frame(imdb_rating = 7.2, imdb_num_votes = 122567, audience_score = 81, audience_rating = "Upright")

Now, I will run the predict function.

predict(cs_final, newmovie, interval = "prediction", level = 0.95)
##        fit      lwr     upr
## 1 73.09239 37.43775 108.747

We predict that this movie would receive a critics’ score of 73.09.

We can also see that based on this model, with 95% confidence, a movie (in this case Star Trek Beyond) with an imdb_rating of 7.2, an audience_score of 81, an audience_rating of “Upright”, and and imdb_num_votes of 122,567 is expected to receive a critics’ score between 43.18 and 100 (score can not exceed 100).


Part 6: Conclusion

Using our model, we predicted a critics’ score that reached 87% accuracy. The true score did in fact fall within the 95% confidence interval. The accuracy rating is decent, and could likely be improved.

To improve these results using this model, we would need more observations. To improve the model in general, we would need a greater number of useful variables.

I think many useful questions can be addressed using a similar model. One would be to forecast how much money a movie would make.