Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.2

library(statsr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called movies. Delete this note when before you submit your work.

load("movies.Rdata")

Part 1: Data

This data set consists of 651 randomly sampled movies produced and released before 2016. In total, there are 651 observations and 32 variables.

Since random sampling is used, this sample could be generalized to all movies produced and released before 2016. We should be wary of potential bias arising from easier access to English movies when compared wuth other languages. Furthermore, since this is an observational study without random assignment, there is no causality involved

Part 2: Research question

From a business perspective, we are interested in popular movies and what attributes make a movie popular. >Based on this data set, we would explore what variables are associated with the popularity of a movie

Note:For this study, we would define the popularity of a movie based on its ratings/scores.

Part 3: Exploratory data analysis

To start with, we would; 1. Take a brief look at the variables in data 2. Select our response variable 3. Look at the distribution of the response variable 4. Explore the relationship between select variables and the response variable

Let’s take a brief look at the data

names(movies)

##  [1] "title"            "title_type"       "genre"            "runtime"         
##  [5] "mpaa_rating"      "studio"           "thtr_rel_year"    "thtr_rel_month"  
##  [9] "thtr_rel_day"     "dvd_rel_year"     "dvd_rel_month"    "dvd_rel_day"     
## [13] "imdb_rating"      "imdb_num_votes"   "critics_rating"   "critics_score"   
## [17] "audience_rating"  "audience_score"   "best_pic_nom"     "best_pic_win"    
## [21] "best_actor_win"   "best_actress_win" "best_dir_win"     "top200_box"      
## [25] "director"         "actor1"           "actor2"           "actor3"          
## [29] "actor4"           "actor5"           "imdb_url"         "rt_url"

As noted in the research question above. We would use. audience_score as our response_variable (i.e. the movie popularity). 2. Next, we would select some variables of interest that may be associated with the audience_score.

#create a new data-frame variables of interest
movie_response <-movies %>%
  select(audience_score, genre, runtime, thtr_rel_year, critics_score, imdb_rating, best_pic_win, best_actor_win, best_actress_win, best_dir_win, top200_box) %>%
  na.omit()

Next, we look at the distribution of audience_score our response variable.

ggplot(data = movie_response, aes(x=audience_score)) +geom_histogram(binwidth =1)

The distribution of audience_score appears to be left-skewed.

quantile(movie_response$audience_score)

##   0%  25%  50%  75% 100% 
##   11   46   65   80   97

boxplot(movie_response$audience_score)

We now have a better picture of the range of values for audience_score (11 - 97) as well as the median ~ 65.

Lastly, we would explore the relationship between select variables and the response variable audience_score.

Directors vs. Audience Score

Let’s check to see how the directors relate with the audience_score

movie_response %>%
  group_by(best_dir_win) %>%
  summarise(median=median(audience_score), sd=sd(audience_score))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 3
##   best_dir_win median    sd
##   <fct>         <dbl> <dbl>
## 1 no               65  20.2
## 2 yes              73  19.0

From the summaries above, we see significant differences between median scores when grouped by best_dir_win, which may imply an association between best_dir_win and audience_score. * * *

Critics Score vs. Audience Score

ggplot(data = movie_response, aes(x=critics_score, y=audience_score)) +geom_jitter()

From the plot above, there appears to be a strong positive linear association between critics score and audience score. We then proceed to calculate the correlation coefficient.

movie_response %>%
  summarise(cor(audience_score, critics_score))

## # A tibble: 1 x 1
##   `cor(audience_score, critics_score)`
##                                  <dbl>
## 1                                0.704

As expected, there is a strong correlation coefficient between audience_score and critics_score.

Imdb vs. Audience Score

ggplot(data = movie_response, aes(x=imdb_rating, y=audience_score)) +geom_jitter()

As expected, we see from the plot above that there appears to be a strong positive linear association between imdb_rating and audience_score. Similarly, we calculate the correlation coefficient

movie_response %>%
  summarise(cor(audience_score, imdb_rating))

## # A tibble: 1 x 1
##   `cor(audience_score, imdb_rating)`
##                                <dbl>
## 1                              0.865

Release year vs Audience Score

Does the year a movie gets released in have an association with the audience_score ? Below we would examine that relationship.

medianscore_year <-movie_response %>%
  group_by(thtr_rel_year) %>%
  summarise(scoremedian =median(audience_score))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = medianscore_year, aes(x = thtr_rel_year, y = scoremedian)) +geom_jitter() +stat_smooth(method = lm) +theme(axis.text.x =element_text(angle =90, hjust =1))

## `geom_smooth()` using formula 'y ~ x'

From the plot above, we observe a decrease in median score from 1970 to 2014. Hinting that there may be a relationship between these two variables.

In summary, we have seen some strong relationships between some variables in the data ( imdb_rating, critics_score, etc.) and the response variable audience_score. In order to accurately predict the audience_score, we would need to utilize a multiple linear regression model.

Part 4: Modeling

Based on the results from our exploratory data analysis, we would include the following variables to our model. - imdb_rating - critics_score - best_dir_win - tht_rel_year

and then we also include some other variables from the data - runtime - best_pic_win - top200_box

For this project we would not be including the following very specific variables title, director, actor1. ..actor5,imdb_url, rt_url.

For this model, we would use backwards elimination, using the adjusted R² approach for a more reliable prediction.

Step 1

We would start the mutiple linear regression with the full model(using all selected variables above)

#lm using all variables
lm_full <-
  lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score + best_pic_win + top200_box + best_dir_win, data = movie_response)

summary(lm_full)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + runtime + thtr_rel_year + 
##     critics_score + best_pic_win + top200_box + best_dir_win, 
##     data = movie_response)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.547  -6.393   0.656   5.476  52.745 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     70.12586   72.93220   0.962  0.33665    
## imdb_rating     15.03104    0.57922  25.950  < 2e-16 ***
## runtime         -0.05533    0.02196  -2.519  0.01200 *  
## thtr_rel_year   -0.05173    0.03645  -1.419  0.15631    
## critics_score    0.06763    0.02173   3.113  0.00194 ** 
## best_pic_winyes  1.85741    4.09747   0.453  0.65048    
## top200_boxyes    1.97028    2.66271   0.740  0.45960    
## best_dir_winyes -1.70476    1.72363  -0.989  0.32301    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.04 on 642 degrees of freedom
## Multiple R-squared:  0.7563, Adjusted R-squared:  0.7536 
## F-statistic: 284.6 on 7 and 642 DF,  p-value: < 2.2e-16

#lm all adjusted r^2
summary(lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score + best_pic_win + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7536016

Next, we drop one variable and calculate the adjusted R²

#lm -imdb_rating
summary(lm(audience_score ~ runtime + thtr_rel_year + critics_score + best_pic_win + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.4959296

#lm -runtime
summary(lm(audience_score ~ imdb_rating + thtr_rel_year + critics_score + best_pic_win + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7515527

#lm -tht_rel_year
summary(lm(audience_score ~ imdb_rating + runtime + critics_score + best_pic_win + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.753213

#lm -critics_score
summary(lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + best_pic_win + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7502722

#lm -best_pic_win
summary(lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7539061

#lm -top_200_box
summary(lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score + best_pic_win + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.753775

#lm -best_dir_win
summary(lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score + best_pic_win + top200_box, data = movie_response))$adj.r.squared

## [1] 0.75361

After comparing the adjusted R² of the models above, we remove best_pic_win since it has the highest adjusted R², and repeat step 1 all over again.

Step 2

#lm -imdb_rating
summary(lm(audience_score ~ runtime + thtr_rel_year + critics_score + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.4962107

#lm -runtime
summary(lm(audience_score ~ imdb_rating + thtr_rel_year + critics_score + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7519245

#lm -tht_rel_year
summary(lm(audience_score ~ imdb_rating + runtime  + critics_score + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7535176

#lm -critics_score
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year + top200_box + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7505571

#lm -top200box
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year + critics_score + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7540642

#lm -best_win_dir
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year +  critics_score + top200_box, data = movie_response))$adj.r.squared

## [1] 0.7539811

Following the same methodology, we eliminate top200_box and repeat the same series of steps

Step 3

#lm -imdb_rating
summary(lm(audience_score ~ runtime  + thtr_rel_year + critics_score + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.49667

#lm -runtime
summary(lm(audience_score ~ imdb_rating + thtr_rel_year + critics_score + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7522146

#lm -tht_rel_year
summary(lm(audience_score ~ imdb_rating + runtime  + critics_score + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7536306

#lm -critics_score
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7506349

#lm -best_win_dir
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year +  critics_score, data = movie_response))$adj.r.squared

## [1] 0.7541373

Following the same methodology, we eliminate best_win_dir and repeat the same series of steps

Step 4

#lm -imdb_rating
summary(lm(audience_score ~ runtime  + thtr_rel_year + critics_score, data = movie_response))$adj.r.squared

## [1] 0.4972347

#lm -runtime
summary(lm(audience_score ~ imdb_rating + thtr_rel_year + critics_score, data = movie_response))$adj.r.squared

## [1] 0.7517863

#lm -tht_rel_year
summary(lm(audience_score ~ imdb_rating + runtime  + critics_score + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7536306

#lm -critics_score
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7506349

Again, we eliminate thtr_rel_year

Step 5

#lm -imdb_rating
summary(lm(audience_score ~ runtime  + thtr_rel_year + critics_score, data = movie_response))$adj.r.squared

## [1] 0.4972347

#lm -runtime
summary(lm(audience_score ~ imdb_rating + thtr_rel_year + critics_score, data = movie_response))$adj.r.squared

## [1] 0.7517863

#lm -critics_score
summary(lm(audience_score ~ imdb_rating + runtime  + thtr_rel_year + best_dir_win, data = movie_response))$adj.r.squared

## [1] 0.7506349

Since none of our adjusted R² yielded a higher value than our full model, we have now a final model. This is the parsimonuos model for audience_score.

lm_final <-
  lm(audience_score ~ imdb_rating + runtime + thtr_rel_year + critics_score, data = movie_response)

Performing model diagnostics on our model.

ggplot(data = lm_final, aes(x = .fitted, y = .resid)) +geom_point() +geom_hline(yintercept =0, linetype ="dashed") +xlab("Fitted values") +ylab("Residuals")

ggplot(data = lm_final, aes(x = .resid)) +geom_histogram(binwidth =1) +xlab("Residuals")

ggplot(data = lm_final, aes(sample = .resid)) +stat_qq()

From the plots above, we see a nearly normal histogram for the residuals distribution and a fairly normal qq-plot.

summary(lm_final)

## 
## Call:
## lm(formula = audience_score ~ imdb_rating + runtime + thtr_rel_year + 
##     critics_score, data = movie_response)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.522  -6.449   0.720   5.422  52.775 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   67.91543   72.58015   0.936  0.34976    
## imdb_rating   15.03767    0.57837  26.000  < 2e-16 ***
## runtime       -0.05672    0.02117  -2.679  0.00757 ** 
## thtr_rel_year -0.05059    0.03625  -1.396  0.16329    
## critics_score  0.06746    0.02164   3.117  0.00191 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.03 on 645 degrees of freedom
## Multiple R-squared:  0.7557, Adjusted R-squared:  0.7541 
## F-statistic: 498.7 on 4 and 645 DF,  p-value: < 2.2e-16

We interpret the model as shown below; All else held constant, each 1 point increase in imdb_rating, the model predicts audience_score to be higher on average by 15.04 points.

All else held constant, each 1 minute increase in runtime, the model predicts audience_score to be lower on average by 0.057 points.

All else held constant, each 1 point increase in critics_score, the model predicts audience_score to be higher on average by 0.067 points.

We also note that the intercept here does not have any real meaning.

Part 5: Prediction

In this section, we would use our newly built model to predict audience_score for a movie.

Let’s create a data frame for this new movie

newmovie <-
  data.frame(title="The Hangover", imdb_rating = 7.8, runtime = 100, thtr_rel_year = 2009, critics_score = 79)

Then, we predict using the predict function with our model.

predict(lm_final, newmovie)

##        1 
## 83.23385

For the measure of prediction uncertainty, we construct a prediction interval.

predict(lm_final, newmovie, interval ="prediction", level =0.90)

##        fit      lwr      upr
## 1 83.23385 66.65742 99.81029

The model predicts, with 90% confidence, that a 100 minute long movie, with an imdb rating of 7.8 and a critics score of 79, released in 2009 would have an audience_score between 66.65742 and 99.81029

Looking at the actual data, the movie The Hangover has an audience_score of 84. So our model is pretty close. * * *

Part 6: Conclusion

At the beginning of this research we set out to understand what attributes make up a popular movie (high audience_score). From this research we have investigated and found that several factors affect the popularity of a movie. These are ; - the movie release year - the imdb rating - the critics score ` and the runtime(the length) of the movie

We however note that the data collected was for movies between 1970 and 2014 and that some other features which may affect a movie’s popularity may not be included in the data.