Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(plotly)
library(GGally)

Introduction

Congratulations on getting a job as a data scientist at Paramount Pictures!

Your boss has just acquired data about how much audiences and critics like movies as well as numerous other variables about the movies. This dataset is provided below, and it includes information from Rotten Tomatoes and IMDB for a random sample of movies.

She is interested in learning what attributes make a movie popular. She is also interested in learning something new about movies. She wants you team to figure it all out.

As part of this project you will complete exploratory data analysis (EDA), modeling, and prediction.


Load data

load("movies.RData")

Part 1: Data

The data set is comprised of 651 randomly sampled movies produced and released before 2016. Some of the features provided are only for informational purposes and do not make any sense to include in a statistical analysis.

My exploratory analysis (see below) confirms that these data appear to be random samples and the results achieved through linear regression can be generalized to a larger popultion.


Part 2: Research question

My research question is: What movie features can influence its imdb_rating?


Part 3: Exploratory data analysis

First, let’s see what the variable types are in the movies dataset… It looks like a good mix of numerical and categorical variables.

str(movies)
## tibble [651 × 32] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num [1:651] 2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num [1:651] 2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

Now, let’s develop a correlation matrix for the numerical variables to see if there is any colinearity…

num_cols <- c("runtime","imdb_rating","imdb_num_votes","critics_score","audience_score")
num_df <- movies %>% select(all_of(num_cols))
ggpairs(num_df)

If we isolate the variable pairs with correlations > 0.7, we have the following:

If we keep imdb_rating as our dependent (response) variable, we might consider dropping audience_score and critics_score.

Now, let’s check the distribution of the numerical variables we intend to use, namely runtime,imdb_rating,imdb_num_votes:

movies_no_NA <- movies[!(is.na(movies$runtime)),]

runtime_density <- density(movies_no_NA$runtime)
fig <- plot_ly(data=movies, x=~runtime,type="histogram",nbinsx=40,
               name="Histogram") %>%
  
  add_trace(x = runtime_density$x, y = runtime_density$y, type = "scatter", mode = "lines",
            fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  
  layout(autosize = F, width = 800, height = 400,
         title="Histogram and Density Plot of 'runtime'",
         xaxis=list(title="runtime"),yaxis=list(title="Number of Instances"),
         yaxis2 = list(overlaying = "y", side = "right", title="Density"),
         legend = list(x = 5, y = 1))
fig

runtime looks like it is normally-distributed.

movies_no_NA <- movies[!(is.na(movies$imdb_rating)),]

imdb_rating_density <- density(movies_no_NA$imdb_rating)
fig <- plot_ly(data=movies, x=~imdb_rating,type="histogram",nbinsx=40,
               name="Histogram") %>%
  
  add_trace(x = imdb_rating_density$x, y = imdb_rating_density$y, type = "scatter", mode = "lines",
            fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  
  layout(autosize = F, width = 800, height = 400,
         title="Histogram and Density Plot of 'imdb_rating'",
         xaxis=list(title="imdb_rating"),yaxis=list(title="Number of Instances"),
         yaxis2 = list(overlaying = "y", side = "right", title="Density"),
         legend = list(x = 5, y = 1))
fig

imdb_rating looks like it is normally-distributed.

movies_no_NA <- movies[!(is.na(movies$imdb_num_votes)),]

imdb_num_votes_density <- density(movies_no_NA$imdb_num_votes)
fig <- plot_ly(data=movies, x=~imdb_num_votes,type="histogram",nbinsx=40,
               name="Histogram") %>%
  
  add_trace(x = imdb_num_votes_density$x, y = imdb_num_votes_density$y, type = "scatter", mode = "lines",
            fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  
  layout(autosize = F, width = 800, height = 400,
         title="Histogram and Density Plot of 'imdb_num_votes'",
         xaxis=list(title="imdb_num_votes"),yaxis=list(title="Number of Instances"),
         yaxis2 = list(overlaying = "y", side = "right", title="Density"),
         legend = list(x = 5, y = 1))
fig

The distribution plot of imdb_num_votes is heavily skewed to the right. Let’s transform this variable by taking the log.

movies <- movies %>% mutate(log_imdb_num_votes=log(imdb_num_votes))
movies_no_NA <- movies[!(is.na(movies$log_imdb_num_votes)),]

imdb_num_votes_density <- density(movies_no_NA$log_imdb_num_votes)
fig <- plot_ly(data=movies, x=~log_imdb_num_votes,type="histogram",nbinsx=40,
               name="Histogram") %>%
  
  add_trace(x = imdb_num_votes_density$x, y = imdb_num_votes_density$y, type = "scatter", mode = "lines",
            fill = "tozeroy", yaxis = "y2", name = "Density") %>%
  
  layout(autosize = F, width = 800, height = 400,
         title="Histogram and Density Plot of 'Log(imdb_num_votes)'",
         xaxis=list(title="Log(imdb_num_votes)"),yaxis=list(title="Number of Instances"),
         yaxis2 = list(overlaying = "y", side = "right", title="Density"),
         legend = list(x = 5, y = 1))
fig


We can see that log_imdb_num_votes is normally distributed.

Any categorial variables we decide to include need to have a relatively small number of values (or factors) so that our analysis does not become unwieldy. Let’s look at categorical values with factors <= 6. Therefore, we will consider title_type, mpaa_rating, critics_rating, audience_rating, and top200_box.

Let’s refactor mpaa_rating from 6 to 2 factors, namely, ‘family-friendly’ (G and PG ratings) and ‘not family-friendly’ (PG-13, R, NC-17, and Unrated ratings).

levels(movies$mpaa_rating)[levels(movies$mpaa_rating) =="G"|
                               levels(movies$mpaa_rating)=="PG"] <- "Family-Friendly"

levels(movies$mpaa_rating)[levels(movies$mpaa_rating) =="PG-13"|
                               levels(movies$mpaa_rating)=="R" |
                               levels(movies$mpaa_rating)=="NC-17" |
                               levels(movies$mpaa_rating)=="Unrated"] <- "Not Family-Friendly"

Let’s also investigate the effect of what time of year a particular movie is released. The big release times are Summer (May, June, July) and Holiday (November, December)

movies$thtr_rel_season <- ifelse(movies$thtr_rel_month>=1 & movies$thtr_rel_month<=4 |
                                  movies$thtr_rel_month>=8 & movies$thtr_rel_month<=10, "Other",
                                ifelse(movies$thtr_rel_month>=5 &
                                         movies$thtr_rel_month<=7,"Summer","Holiday" ))

Part 4: Modeling

The dependent variable for our linear model will be imdb_rating. Our independent variables will be

mod1 <- lm(imdb_rating ~ runtime+log_imdb_num_votes+mpaa_rating+thtr_rel_season+
            title_type+critics_rating+audience_rating+top200_box, data=movies)

summary(mod1)
## 
## Call:
## lm(formula = imdb_rating ~ runtime + log_imdb_num_votes + mpaa_rating + 
##     thtr_rel_season + title_type + critics_rating + audience_rating + 
##     top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6751 -0.2777  0.0677  0.4034  1.6452 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     5.160538   0.237600  21.719  < 2e-16 ***
## runtime                         0.008106   0.001435   5.647 2.46e-08 ***
## log_imdb_num_votes              0.108605   0.019384   5.603 3.14e-08 ***
## mpaa_ratingNot Family-Friendly  0.011014   0.064246   0.171   0.8639    
## thtr_rel_seasonOther            0.015398   0.071486   0.215   0.8295    
## thtr_rel_seasonSummer          -0.181071   0.079646  -2.273   0.0233 *  
## title_typeFeature Film         -0.825702   0.110567  -7.468 2.69e-13 ***
## title_typeTV Movie             -1.145300   0.306128  -3.741   0.0002 ***
## critics_ratingFresh            -0.028633   0.079235  -0.361   0.7179    
## critics_ratingRotten           -0.707828   0.082008  -8.631  < 2e-16 ***
## audience_ratingUpright          0.958280   0.062759  15.269  < 2e-16 ***
## top200_boxyes                  -0.119786   0.178232  -0.672   0.5018    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6489 on 638 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.6483, Adjusted R-squared:  0.6422 
## F-statistic: 106.9 on 11 and 638 DF,  p-value: < 2.2e-16

We see that the p-value for two variables, mpaa_rating and top200_box, are quite high and we will remove them.

mod2 <- lm(imdb_rating ~ runtime+log_imdb_num_votes+thtr_rel_season+
            title_type+critics_rating+audience_rating, data=movies)

summary(mod2)
## 
## Call:
## lm(formula = imdb_rating ~ runtime + log_imdb_num_votes + thtr_rel_season + 
##     title_type + critics_rating + audience_rating, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6716 -0.2736  0.0689  0.4077  1.6434 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.175213   0.235848  21.943  < 2e-16 ***
## runtime                 0.008093   0.001417   5.711 1.72e-08 ***
## log_imdb_num_votes      0.107027   0.019162   5.585 3.45e-08 ***
## thtr_rel_seasonOther    0.022410   0.070554   0.318 0.750869    
## thtr_rel_seasonSummer  -0.177880   0.079393  -2.240 0.025401 *  
## title_typeFeature Film -0.825115   0.110336  -7.478 2.49e-13 ***
## title_typeTV Movie     -1.143423   0.305556  -3.742 0.000199 ***
## critics_ratingFresh    -0.025339   0.078842  -0.321 0.748023    
## critics_ratingRotten   -0.703857   0.081589  -8.627  < 2e-16 ***
## audience_ratingUpright  0.957645   0.062641  15.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6482 on 640 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.648,  Adjusted R-squared:  0.643 
## F-statistic: 130.9 on 9 and 640 DF,  p-value: < 2.2e-16

Model mod2 seems to perform relatively well. Our p-value for the model is very close to zero (< 2.2e-16) so mod2 is significant. We note that the F-statistic (130.9) is quite a bit higher than that of mod1 (106.9), so we can conclude that the \(\beta\) of at least one of our predictors is non-zero. Let’s plot the residuals:

fig <- plot_ly(y=mod2$residuals,type="scatter",mode="markers") %>%
  layout(title="Residual Plot - mod2", xaxis=list(title="Observation"),
         yaxis=list(title="ypred-yobs"))

fig

We can see that the residuals seem to be evenly distributed around \(y_{pred}-y_{obs}=0\)


Part 5: Prediction

The movie I would like to consider for my prediction is “John Wick” which has an imdb_rating of 7.8 and is not included in the original movies dataset.

john_wick <- read.csv("John Wick ratings.csv")
john_wick <- john_wick %>% mutate(log_imdb_num_votes=log(imdb_num_votes))

Let’s get a prediction of imdb_rating based on the John Wick data:

pred <- predict(mod2,newdata=john_wick,interval="prediction",type="response")
pred
##        fit      lwr      upr
## 1 7.562931 6.280879 8.844984

The 95% prediction interval associated with an imdb_rating of 7.56 is (6.28, 8.84). This means that, according to the mod2 model, 95% the movies that are predicted to have an imdb_rating of 7.56 should have an actual rating that falls between 6.28 and 8.84.

pred <- predict(mod2,newdata=john_wick,interval="confidence",type="response")
pred
##        fit      lwr      upr
## 1 7.562931 7.409074 7.716788

The confidence interval reflects the uncertainty around the mean predictions. The 95% confidence interval associated with an imdb_rating of 7.56 is (7.40, 7.72). This means that, according to the mod2 model, a movie with an imdb_rating of 7.56 on average will have a rating within the interval (7.40, 7.72).

We calculate the percent error knowing \(y_{obs}=7.8\) from the data:

\(\frac{y_{obs}-y_{pred}}{y_{obs}} = \frac{7.8-7.56}{7.8} = 0.031\) or \(3.1\%\)

Not bad!!


Part 6: Conclusion

In this analysis, we fitted selected features about movies to a linear model to predict imdb_rating. To improve the robustness of our model, we ensured that all numerical variables being considered were normally distributed. We were able to build a model using a mixture of numerical and categorical variables that achieved an \(R^2\) value of 0.648. It is possible that, with a larger dataset and some additional feature engineering, an even higher value of \(R^2\) might be achieved. Future work might include a linear model of non-linear terms, however, we must exercise caution so as to not overfit the data.