Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data


load("movies.Rdata")

Part 1: Data

Generalizability:
The dataset contains 651 randomly sampled movies, so the observations can be treated as independent. The sample is less than 10% of all movies registered in the IMDB and Rotten Tomatoes databases, and it is large enough (well above the rule-of-thumb minimum of 10 observations, or 30 for highly skewed data). With random sampling and an adequate sample size, the findings should generalize to the movies in these databases.

Causality: This is an observational study with no random assignment, so we can only establish association (correlation), not causation. Causal conclusions would require a follow-up controlled experiment.


Part 2: Research question

With the given dataset, we would like to construct a parsimonious linear regression model to predict the IMDB rating of a movie. A parsimonious model is desirable because it achieves the desired level of prediction with as few predictors as possible.

In order to achieve parsimony, we select variables that hypothetically correlate with IMDB rating. We will then investigate these questions:
1. Is there a correlation between critics_score, audience_score and imdb_rating?
2. Is there a correlation between imdb_num_votes and imdb_rating? Movies with more votes are hypothetically more popular or more anticipated, and hence could have a higher rating.
3. Is there a correlation between critics_rating, audience_rating and imdb_rating?
4. Do movies of specific genres have higher ratings? We will evaluate this using genre and imdb_rating.
5. Movies that win awards could have higher ratings than those that don't. We will evaluate this using the best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box and imdb_rating variables.


Part 3: Exploratory data analysis

Q1. Is there a correlation between critics_score, audience_score and imdb_rating?
Let's construct scatter plots with fitted regression lines.

# Side-by-side scatter plots of imdb_rating against each score,
# each with a simple linear fit overlaid
par(mfcol = c(1,2))
plot(movies$audience_score, movies$imdb_rating)
abline(lm(movies$imdb_rating ~ movies$audience_score))
plot(movies$critics_score, movies$imdb_rating)
abline(lm(movies$imdb_rating ~ movies$critics_score))

From the plots, imdb_rating appears linearly correlated with both audience_score and critics_score. Let's quantify how strongly they correlate.

with(movies, cor(imdb_rating, critics_score))
## [1] 0.7650355
with(movies, cor(imdb_rating, audience_score))
## [1] 0.8648652

With correlations of 0.76 with critics_score and 0.86 with audience_score, imdb_rating is highly linearly correlated with both variables.

Q2. Is there a correlation between imdb_num_votes and imdb_rating? Movies with more votes are hypothetically more popular or more anticipated, and hence could have a higher rating.

As in the previous question, let's construct a scatter plot and see how strongly the two variables are correlated.

plot(movies$imdb_num_votes, movies$imdb_rating)
abline(lm(movies$imdb_rating ~ movies$imdb_num_votes))

with(movies, cor(imdb_num_votes, imdb_rating))
## [1] 0.3311525

We see a moderate correlation (0.33), and we also observe that movies with an extremely large number of votes tend to have high IMDB ratings.

Q3. Is there a correlation between critics_rating, audience_rating and imdb_rating?

Let's first evaluate critics_rating against imdb_rating. We construct a summary table of the mean imdb_rating, grouped by the three levels of critics_rating, and plot the means as a bar plot.

summary1 <- movies %>%
        group_by(critics_rating) %>%
        summarise(mean_rating = mean(imdb_rating))

summary1
## # A tibble: 3 x 2
##   critics_rating  mean_rating
##   <fct>                 <dbl>
## 1 Certified Fresh        7.40
## 2 Fresh                  6.95
## 3 Rotten                 5.78
ggplot(data = summary1, aes(x=critics_rating, y = mean_rating)) + geom_bar(stat = "identity")

We see that movies with a critics_rating of "Certified Fresh" have the highest mean imdb_rating, followed by "Fresh" and "Rotten". How about audience_rating? Let's construct a similar summary table and bar plot.

summary2 <- movies %>%
        group_by(audience_rating) %>%
        summarise(mean_rating = mean(imdb_rating))

summary2
## # A tibble: 2 x 2
##   audience_rating mean_rating
##   <fct>                 <dbl>
## 1 Spilled                5.61
## 2 Upright                7.14
ggplot(data = summary2, aes(x=audience_rating, y = mean_rating)) + geom_bar(stat = "identity")

Movies with an audience_rating of "Upright" have a higher mean rating than those rated "Spilled".

Q4. Do movies of specific genres have higher ratings? We will evaluate this using genre and imdb_rating.

Let's construct a summary table of mean imdb_rating, grouped by the levels of genre, together with its bar plot.

summary3 <- movies %>%
        group_by(genre) %>%
        summarise(mean_rating = mean(imdb_rating))

summary3
## # A tibble: 11 x 2
##    genre                     mean_rating
##    <fct>                           <dbl>
##  1 Action & Adventure               5.97
##  2 Animation                        5.9 
##  3 Art House & International        6.61
##  4 Comedy                           5.74
##  5 Documentary                      7.65
##  6 Drama                            6.67
##  7 Horror                           5.76
##  8 Musical & Performing Arts        7.3 
##  9 Mystery & Suspense               6.48
## 10 Other                            6.63
## 11 Science Fiction & Fantasy        5.76
ggplot(data = summary3, aes(x=genre, y = mean_rating)) + geom_bar(stat = "identity")

We leave the x-axis labels of the bar plot untidied, since this is only exploratory analysis. From the summary table and bar plot above, we can see that certain genres have higher mean ratings: for example, documentaries average as high as 7.65, while comedies average as low as 5.74.
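
Should we later want to tidy those labels, one option (a minimal sketch, not used in this report) is to rotate them with ggplot2's theme():

# Rotate genre labels so they do not overlap (exploratory plot only)
ggplot(data = summary3, aes(x = genre, y = mean_rating)) +
        geom_bar(stat = "identity") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))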

Q5. Movies that win awards could have higher ratings than those that don't. We will evaluate this using the best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box and imdb_rating variables.

Let's construct a summary table relating each of these explanatory variables to the response variable imdb_rating. Since this is a repetitive task, we define a function summarise_func and loop over the variables, printing the mean imdb_rating and the count for each level of each explanatory variable.

summarise_func <- function(col_name){
        # Mean imdb_rating and observation count for each level of col_name
        a <- movies %>%
                group_by(.data[[col_name]]) %>%
                summarise(mean_rating = mean(imdb_rating), count = n())
        print(a)
}

variables <- names(movies)[19:24]

for (var in variables){
        summarise_func(var)
}
## # A tibble: 2 x 3
##   best_pic_nom mean_rating count
##   <fct>              <dbl> <int>
## 1 no                  6.45   629
## 2 yes                 7.75    22
## # A tibble: 2 x 3
##   best_pic_win mean_rating count
##   <fct>              <dbl> <int>
## 1 no                  6.48   644
## 2 yes                 7.9      7
## # A tibble: 2 x 3
##   best_actor_win mean_rating count
##   <fct>                <dbl> <int>
## 1 no                    6.46   558
## 2 yes                   6.66    93
## # A tibble: 2 x 3
##   best_actress_win mean_rating count
##   <fct>                  <dbl> <int>
## 1 no                      6.47   579
## 2 yes                     6.71    72
## # A tibble: 2 x 3
##   best_dir_win mean_rating count
##   <fct>              <dbl> <int>
## 1 no                  6.45   608
## 2 yes                 7.04    43
## # A tibble: 2 x 3
##   top200_box mean_rating count
##   <fct>            <dbl> <int>
## 1 no                6.48   636
## 2 yes               7.14    15

All of these variables show a higher mean imdb_rating when the value is "yes". However, the number of "yes" observations is small in every case, and we need to be wary of this, especially for best_pic_win, where there are fewer than 10 "yes" responses.


Part 4: Modeling

We will construct our linear regression model using forward selection. Starting from a model with the single explanatory variable audience_score, we add explanatory variables one by one and keep a variable only if it increases the adjusted R-squared and decreases the sum of squared residuals. We use forward selection because we are more confident that some variables will yield a better predictive model than others, and we add the variables in that order of confidence.
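
To keep these comparisons consistent, a small helper can report both metrics for any candidate model (a minimal sketch; the name model_metrics is ours and is not part of any package or of the original analysis):

# Hypothetical helper: adjusted R-squared and sum of squared residuals
# for a candidate model formula fitted on the movies data
model_metrics <- function(formula) {
        m <- lm(formula, data = movies)
        c(adj_r_squared = summary(m)$adj.r.squared,
          ssr = sum(m$residuals^2))
}

# Example: the single-predictor starting model
model_metrics(imdb_rating ~ audience_score)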

m <- lm(imdb_rating ~ audience_score, data = movies)
summary(m)
## 
## Call:
## lm(formula = imdb_rating ~ audience_score, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2082 -0.1866  0.0712  0.3093  1.1516 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.599992   0.069291   51.95   <2e-16 ***
## audience_score 0.046392   0.001057   43.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.545 on 649 degrees of freedom
## Multiple R-squared:  0.748,  Adjusted R-squared:  0.7476 
## F-statistic:  1926 on 1 and 649 DF,  p-value: < 2.2e-16
sum(m$residuals^2)
## [1] 192.7457

Add critics_score

m1 <- lm(imdb_rating ~ critics_score + audience_score, data = movies)
summary(m1)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.51964 -0.19767  0.03466  0.30671  1.22691 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.647241   0.062471   58.38   <2e-16 ***
## critics_score  0.011816   0.000954   12.39   <2e-16 ***
## audience_score 0.034703   0.001340   25.90   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4904 on 648 degrees of freedom
## Multiple R-squared:  0.7962, Adjusted R-squared:  0.7956 
## F-statistic:  1266 on 2 and 648 DF,  p-value: < 2.2e-16
sum(m1$residuals^2)
## [1] 155.847

There is an increase in adjusted R-squared and a decrease in the sum of squared residuals. Both variables also appear to be statistically significant (small p-values). Although critics_score and audience_score are subject to collinearity (their correlation is as high as 0.70), we will still use both variables.

with(movies, cor(critics_score, audience_score))
## [1] 0.7042762
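
As a sanity check on this collinearity, one could compute variance inflation factors, for example with the car package (assumed to be installed; this check is not part of the original analysis):

# Assumes the 'car' package is available; VIFs well below 10 would
# suggest the collinearity is tolerable for prediction purposes
library(car)
vif(lm(imdb_rating ~ critics_score + audience_score, data = movies))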

Next, we add imdb_num_votes.

m2 <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes, data = movies)
summary(m2)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49004 -0.18552  0.02332  0.29450  1.17298 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.683e+00  6.192e-02  59.471  < 2e-16 ***
## critics_score  1.178e-02  9.387e-04  12.552  < 2e-16 ***
## audience_score 3.340e-02  1.347e-03  24.794  < 2e-16 ***
## imdb_num_votes 8.335e-07  1.764e-07   4.726 2.82e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4825 on 647 degrees of freedom
## Multiple R-squared:  0.803,  Adjusted R-squared:  0.8021 
## F-statistic: 879.3 on 3 and 647 DF,  p-value: < 2.2e-16
sum(m2$residuals^2)
## [1] 150.6472

We get better metrics. Next, we add critics_rating.

m3 <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes + critics_rating, data = movies)
summary(m3)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     critics_rating, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36752 -0.18711  0.03324  0.27457  1.19303 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.083e+00  1.354e-01  22.769  < 2e-16 ***
## critics_score        1.766e-02  1.553e-03  11.374  < 2e-16 ***
## audience_score       3.311e-02  1.328e-03  24.937  < 2e-16 ***
## imdb_num_votes       1.031e-06  1.858e-07   5.548 4.22e-08 ***
## critics_ratingFresh  1.531e-01  5.815e-02   2.633  0.00866 ** 
## critics_ratingRotten 4.626e-01  9.333e-02   4.957 9.18e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4743 on 645 degrees of freedom
## Multiple R-squared:  0.8103, Adjusted R-squared:  0.8088 
## F-statistic:   551 on 5 and 645 DF,  p-value: < 2.2e-16
sum(m3$residuals^2)
## [1] 145.1046

Add audience_rating.

m4 <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes + critics_rating + audience_rating, data = movies)
summary(m4)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     critics_rating + audience_rating, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.29886 -0.19762  0.03352  0.26021  1.19221 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.880e+00  1.402e-01  20.547  < 2e-16 ***
## critics_score           1.692e-02  1.537e-03  11.013  < 2e-16 ***
## audience_score          4.079e-02  2.101e-03  19.413  < 2e-16 ***
## imdb_num_votes          9.597e-07  1.835e-07   5.230 2.29e-07 ***
## critics_ratingFresh     1.290e-01  5.747e-02   2.244   0.0251 *  
## critics_ratingRotten    4.181e-01  9.235e-02   4.527 7.12e-06 ***
## audience_ratingUpright -3.472e-01  7.441e-02  -4.666 3.74e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4669 on 644 degrees of freedom
## Multiple R-squared:  0.8165, Adjusted R-squared:  0.8148 
## F-statistic: 477.5 on 6 and 644 DF,  p-value: < 2.2e-16
sum(m4$residuals^2)
## [1] 140.36

Add genre.

m5 <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes + critics_rating + audience_rating + genre, data = movies)
summary(m5)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     critics_rating + audience_rating + genre, data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36827 -0.18255  0.04097  0.25011  1.05042 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.944e+00  1.461e-01  20.144  < 2e-16 ***
## critics_score                   1.532e-02  1.502e-03  10.197  < 2e-16 ***
## audience_score                  3.985e-02  2.054e-03  19.399  < 2e-16 ***
## imdb_num_votes                  1.109e-06  1.859e-07   5.968 4.00e-09 ***
## critics_ratingFresh             9.707e-02  5.567e-02   1.744 0.081698 .  
## critics_ratingRotten            3.814e-01  8.921e-02   4.275 2.21e-05 ***
## audience_ratingUpright         -3.486e-01  7.185e-02  -4.852 1.54e-06 ***
## genreAnimation                 -3.546e-01  1.603e-01  -2.212 0.027314 *  
## genreArt House & International  3.501e-01  1.339e-01   2.615 0.009137 ** 
## genreComedy                    -1.349e-01  7.377e-02  -1.828 0.067982 .  
## genreDocumentary                3.386e-01  9.304e-02   3.639 0.000296 ***
## genreDrama                      1.433e-01  6.350e-02   2.256 0.024389 *  
## genreHorror                     6.206e-02  1.096e-01   0.566 0.571536    
## genreMusical & Performing Arts  1.434e-01  1.450e-01   0.989 0.323160    
## genreMystery & Suspense         2.917e-01  8.154e-02   3.578 0.000373 ***
## genreOther                     -2.075e-02  1.264e-01  -0.164 0.869691    
## genreScience Fiction & Fantasy -1.626e-01  1.602e-01  -1.015 0.310396    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4483 on 634 degrees of freedom
## Multiple R-squared:  0.8334, Adjusted R-squared:  0.8292 
## F-statistic: 198.2 on 16 and 634 DF,  p-value: < 2.2e-16
sum(m5$residuals^2)
## [1] 127.428

Up to this point we still see improvements in the adjusted R-squared and the sum of squared residuals, and the p-values remain small. Now we add best_pic_nom.

m6 <- lm(imdb_rating ~ critics_score + audience_score + imdb_num_votes + critics_rating + audience_rating + genre + best_pic_nom, data = movies)
summary(m6)
## 
## Call:
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     critics_rating + audience_rating + genre + best_pic_nom, 
##     data = movies)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.36911 -0.18049  0.03844  0.25089  1.05228 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.941e+00  1.465e-01  20.072  < 2e-16 ***
## critics_score                   1.534e-02  1.504e-03  10.196  < 2e-16 ***
## audience_score                  3.990e-02  2.060e-03  19.364  < 2e-16 ***
## imdb_num_votes                  1.123e-06  1.903e-07   5.903 5.83e-09 ***
## critics_ratingFresh             9.518e-02  5.597e-02   1.701 0.089504 .  
## critics_ratingRotten            3.803e-01  8.933e-02   4.257 2.38e-05 ***
## audience_ratingUpright         -3.497e-01  7.197e-02  -4.859 1.49e-06 ***
## genreAnimation                 -3.548e-01  1.604e-01  -2.212 0.027332 *  
## genreArt House & International  3.508e-01  1.340e-01   2.618 0.009059 ** 
## genreComedy                    -1.339e-01  7.388e-02  -1.812 0.070392 .  
## genreDocumentary                3.381e-01  9.311e-02   3.632 0.000304 ***
## genreDrama                      1.449e-01  6.372e-02   2.274 0.023276 *  
## genreHorror                     6.299e-02  1.097e-01   0.574 0.566138    
## genreMusical & Performing Arts  1.428e-01  1.451e-01   0.984 0.325644    
## genreMystery & Suspense         2.928e-01  8.166e-02   3.586 0.000362 ***
## genreOther                     -1.740e-02  1.269e-01  -0.137 0.890964    
## genreScience Fiction & Fantasy -1.628e-01  1.603e-01  -1.015 0.310380    
## best_pic_nomyes                -3.665e-02  1.056e-01  -0.347 0.728576    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4486 on 633 degrees of freedom
## Multiple R-squared:  0.8334, Adjusted R-squared:  0.829 
## F-statistic: 186.3 on 17 and 633 DF,  p-value: < 2.2e-16
sum(m6$residuals^2)
## [1] 127.4037

best_pic_nom has no significant impact, so we exclude this variable. We have done the same with best_pic_win, best_actor_win, best_actress_win, best_dir_win, and top200_box; none of them significantly improves the predictive ability of our model, so we exclude them all. One way to script this repeated check is sketched below.
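
A minimal sketch of that check (the variable names come from the dataset; the loop itself is ours) compares the adjusted R-squared of m5 against m5 plus each award-related variable in turn:

# Adjusted R-squared of m5 extended by each award-related variable
award_vars <- c("best_pic_nom", "best_pic_win", "best_actor_win",
                "best_actress_win", "best_dir_win", "top200_box")

for (v in award_vars) {
        m_new <- update(m5, as.formula(paste(". ~ . +", v)))
        cat(v, ": adj. R-squared =",
            round(summary(m_new)$adj.r.squared, 4), "\n")
}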

Our finalized model is the following:

m5$call
## lm(formula = imdb_rating ~ critics_score + audience_score + imdb_num_votes + 
##     critics_rating + audience_rating + genre, data = movies)
m5$coefficients
##                    (Intercept)                  critics_score 
##                   2.943809e+00                   1.531663e-02 
##                 audience_score                 imdb_num_votes 
##                   3.984636e-02                   1.109416e-06 
##            critics_ratingFresh           critics_ratingRotten 
##                   9.706667e-02                   3.813685e-01 
##         audience_ratingUpright                 genreAnimation 
##                  -3.486205e-01                  -3.545662e-01 
## genreArt House & International                    genreComedy 
##                   3.501387e-01                  -1.348769e-01 
##               genreDocumentary                     genreDrama 
##                   3.385743e-01                   1.432736e-01 
##                    genreHorror genreMusical & Performing Arts 
##                   6.205769e-02                   1.433776e-01 
##        genreMystery & Suspense                     genreOther 
##                   2.917188e-01                  -2.074673e-02 
## genreScience Fiction & Fantasy 
##                  -1.626365e-01

Each coefficient is the expected change in the mean imdb_rating for a one-unit increase in the corresponding numerical explanatory variable, or for belonging to the corresponding category rather than the reference level, holding all other variables constant.
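
To make this concrete, the fitted value for any movie can be reconstructed by hand from these coefficients. A minimal sketch for a hypothetical Drama (the input numbers below are made up for illustration):

# Manual prediction from the coefficients for an illustrative Drama:
# intercept + slope * value for numeric terms, plus the matching dummies.
# predict(m5, newdata = ...) with the same inputs returns the same value.
b <- m5$coefficients
unname(
        b["(Intercept)"] +
        b["critics_score"]  * 80 +
        b["audience_score"] * 75 +
        b["imdb_num_votes"] * 50000 +
        b["critics_ratingFresh"] +     # critics_rating = "Fresh"
        b["audience_ratingUpright"] +  # audience_rating = "Upright"
        b["genreDrama"]                # genre = "Drama"
)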

Model Diagnostics

    I. Linear relationship between each numerical explanatory variable and y  

Yes; we evaluated this in the exploratory data analysis above (imdb_rating shows roughly linear trends against critics_score, audience_score and imdb_num_votes). A direct residual check is sketched below.
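
For completeness, the condition can also be re-checked on the residuals: plotting the residuals against each numerical predictor should show no curved pattern (a minimal sketch, not part of the original write-up):

# Residuals vs. each numerical predictor: random scatter around zero
# supports the linearity condition
par(mfcol = c(1, 3))
plot(movies$critics_score,  m5$residuals); abline(h = 0)
plot(movies$audience_score, m5$residuals); abline(h = 0)
plot(movies$imdb_num_votes, m5$residuals); abline(h = 0)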

    II. Nearly normal residuals with mean 0
hist(m5$residuals)

qqnorm(m5$residuals)
qqline(m5$residuals)

Yes. We can see this from the histogram and the Q-Q plot: the residuals are slightly left-skewed, but they are centered near zero and mostly follow the normal line.

    III. Constant variability of residuals  
plot(m5$residuals ~ m5$fitted)

plot(abs(m5$residuals) ~ m5$fitted)

Mostly yes. The lower fitted values show larger variability, but the variability is roughly constant overall; only a handful of the 651 observations fall in the highly variable low-fitted-value region.

    IV. Independent residuals
plot(m5$residuals)

Yes. The residuals appear randomly scattered, with no pattern in the order of the observations.


Part 5: Prediction

Let's predict the IMDB rating of the movie "Peter Rabbit". Its actual IMDB rating is 6.3.

peter_rabbit <- data.frame(critics_score = 58, audience_score = 60, imdb_num_votes = 2109, critics_rating = "Rotten", audience_rating = "Upright", genre = "Animation")
predict(m5, peter_rabbit, interval="predict") 
##        fit      lwr      upr
## 1 5.903476 4.969089 6.837863

Our prediction is 5.90, with a 95% prediction interval of 4.97 to 6.84. That is, we are 95% confident that the IMDB rating of a movie with these characteristics lies within this range; the interval does capture the actual rating of 6.3.

Let's try to predict a lower-rated movie, "The Emoji Movie". Its actual IMDB rating is 3.0.

emoji <- data.frame(critics_score = 9, audience_score = 39, imdb_num_votes = 35502, critics_rating = "Rotten", audience_rating = "Spilled", genre = "Animation")
predict(m5, emoji, interval="predict") 
##        fit      lwr      upr
## 1 4.701855 3.771232 5.632478

Our 95% prediction interval fails to capture the actual rating. Lower-rated movies have higher residual variability, as described in the previous section.

How about a highly rated movie? Let's try to predict "Call Me by Your Name". Its actual IMDB rating is 8.1.

call_me <- data.frame(critics_score = 96, audience_score = 86, imdb_num_votes = 64883, critics_rating = "Certified Fresh", audience_rating = "Upright", genre = "Drama")
predict(m5, call_me, interval="predict") 
##        fit      lwr      upr
## 1 7.707627 6.822195 8.593058

Our 95% prediction interval succeeds in capturing the actual rating.


Part 6: Conclusion

We have built a parsimonious linear regression model to predict IMDB rating. Although all of the candidate explanatory variables appeared to correlate with IMDB rating, some of them turn out not to be statistically significant in the model.

Our model is less accurate for lower-rated movies, as shown by the higher residual variability at lower predicted ratings. This could be because our sample is concentrated in mid-range ratings, and because of how people rate movies they dislike: a disliked movie may be rated anywhere from 0 to 5, while a liked movie tends to be rated between 8 and 10.