Modeling and prediction for movies

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.1.1

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.1

## Warning: package 'tibble' was built under R version 4.1.2

## Warning: package 'tidyr' was built under R version 4.1.1

## Warning: package 'readr' was built under R version 4.1.2

## Warning: package 'purrr' was built under R version 4.1.1

## Warning: package 'dplyr' was built under R version 4.1.1

## Warning: package 'forcats' was built under R version 4.1.1

library(statsr)

## Warning: package 'BayesFactor' was built under R version 4.1.2

## Warning: package 'coda' was built under R version 4.1.2

## Warning: package 'Matrix' was built under R version 4.1.1

library(GGally)

## Warning: package 'GGally' was built under R version 4.1.2

library(Metrics)

## Warning: package 'Metrics' was built under R version 4.1.2

Load data

load("movies.Rdata")

Part 1: Data

The data was collected by the simple random sample. The data can be used to generalize and not causal.

Part 2: Research question

Create a linear regression to predict audience score?
Plot a residual plot and Histgram to see if the model is good?
Calculate the RMSE?

Part 3: Exploratory data analysis

First, I will select the variables that will be used from the data set.

imdb_data <- movies %>% select(title, title_type, genre, runtime, mpaa_rating, studio, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_nom, best_pic_win, top200_box, best_actor_win, best_actress_win, best_dir_win) %>% drop_na()

A quick analysis to see how the data looks like.

str(imdb_data)

## tibble [642 x 18] (S3: tbl_df/tbl/data.frame)
##  $ title           : chr [1:642] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num [1:642] 80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ imdb_rating     : num [1:642] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int [1:642] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num [1:642] 45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num [1:642] 73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...

Statistical summary of the data as a whole.

summary(imdb_data)

##     title                  title_type                 genre        runtime   
##  Length:642         Documentary : 53   Drama             :303   Min.   : 39  
##  Class :character   Feature Film:585   Comedy            : 86   1st Qu.: 93  
##  Mode  :character   TV Movie    :  4   Action & Adventure: 64   Median :103  
##                                        Mystery & Suspense: 59   Mean   :106  
##                                        Documentary       : 51   3rd Qu.:116  
##                                        Horror            : 22   Max.   :267  
##                                        (Other)           : 57                
##   mpaa_rating                               studio     imdb_rating 
##  G      : 18   Paramount Pictures              : 37   Min.   :1.9  
##  NC-17  :  1   Warner Bros. Pictures           : 30   1st Qu.:5.9  
##  PG     :117   Sony Pictures Home Entertainment: 27   Median :6.6  
##  PG-13  :133   Universal Pictures              : 23   Mean   :6.5  
##  R      :324   Warner Home Video               : 19   3rd Qu.:7.3  
##  Unrated: 49   20th Century Fox                : 18   Max.   :9.0  
##                (Other)                         :488                
##  imdb_num_votes           critics_rating critics_score    audience_rating
##  Min.   :   180   Certified Fresh:135    Min.   :  1.00   Spilled:269    
##  1st Qu.:  4861   Fresh          :205    1st Qu.: 33.00   Upright:373    
##  Median : 15470   Rotten         :302    Median : 61.00                  
##  Mean   : 58255                          Mean   : 57.78                  
##  3rd Qu.: 59034                          3rd Qu.: 83.00                  
##  Max.   :893008                          Max.   :100.00                  
##                                                                          
##  audience_score best_pic_nom best_pic_win top200_box best_actor_win
##  Min.   :11.0   no :620      no :635      no :627    no :550       
##  1st Qu.:46.0   yes: 22      yes:  7      yes: 15    yes: 92       
##  Median :65.0                                                      
##  Mean   :62.5                                                      
##  3rd Qu.:80.0                                                      
##  Max.   :97.0                                                      
##                                                                    
##  best_actress_win best_dir_win
##  no :570          no :599     
##  yes: 72          yes: 43     
##                               
##                               
##                               
##                               
##

A histogram to see the distribution of the audience_score.

ggplot(imdb_data, aes(audience_score)) + geom_histogram(binwidth = 10, col = "black", fill = "purple", alpha = 0.8) + labs(title = "Histogram of audience score", y = "Count", x = "Audience Score")

From the histogram, we see that the data is left skewed. Next, we will generate a ggpairs plot with the numeric variables we have from our data.

linear <- imdb_data %>% select(audience_score, runtime, imdb_rating, imdb_num_votes, critics_score)
ggpairs(data = linear, title = "Pairs plot for Numerical data")

From the ggpairs plot, we see variables that are highly correlated to the audience_score causing collinearity. We will ignore this because we will be building a model using the adjusted r^2 for predicting.

Part 4: Modeling

To create a model to predict the audience_score, we will use the backwards elimination method to find the model with the highest adjusted r^2.

Below is the full model.

audi_model <- lm(audience_score ~ title_type + runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model)

## 
## Call:
## lm(formula = audience_score ~ title_type + runtime + mpaa_rating + 
##     imdb_rating + imdb_num_votes + critics_rating + audience_rating + 
##     best_pic_nom + best_pic_win + top200_box + best_actor_win + 
##     best_actress_win + best_dir_win, data = imdb_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1000  -4.6202   0.4403   4.2037  24.4228 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.951e+00  3.571e+00  -0.826   0.4090    
## title_typeFeature Film -2.817e-01  1.312e+00  -0.215   0.8300    
## title_typeTV Movie     -1.127e+00  3.617e+00  -0.311   0.7555    
## runtime                -3.293e-02  1.663e-02  -1.980   0.0481 *  
## mpaa_ratingNC-17       -8.555e+00  7.118e+00  -1.202   0.2298    
## mpaa_ratingPG          -1.407e+00  1.771e+00  -0.794   0.4273    
## mpaa_ratingPG-13       -2.408e+00  1.788e+00  -1.346   0.1787    
## mpaa_ratingR           -3.091e+00  1.710e+00  -1.808   0.0712 .  
## mpaa_ratingUnrated     -1.886e+00  2.027e+00  -0.931   0.3524    
## imdb_rating             9.268e+00  4.226e-01  21.930   <2e-16 ***
## imdb_num_votes          3.171e-06  3.125e-06   1.015   0.3107    
## critics_ratingFresh    -2.966e-01  8.499e-01  -0.349   0.7272    
## critics_ratingRotten   -1.133e+00  9.445e-01  -1.199   0.2309    
## audience_ratingUpright  2.057e+01  7.874e-01  26.131   <2e-16 ***
## best_pic_nomyes         3.731e+00  1.809e+00   2.063   0.0396 *  
## best_pic_winyes        -2.499e+00  3.189e+00  -0.784   0.4335    
## top200_boxyes          -9.303e-01  1.924e+00  -0.484   0.6289    
## best_actor_winyes      -2.103e-01  8.203e-01  -0.256   0.7978    
## best_actress_winyes    -1.377e+00  9.022e-01  -1.526   0.1275    
## best_dir_winyes         4.689e-01  1.194e+00   0.393   0.6947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.894 on 622 degrees of freedom
## Multiple R-squared:  0.8873, Adjusted R-squared:  0.8839 
## F-statistic: 257.7 on 19 and 622 DF,  p-value: < 2.2e-16

Now we will iterate until we find the best adjusted R^2 model.

audi_model1 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model1)$adj.r.squared

## [1] 0.8842088

audi_model2 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model2)$adj.r.squared

## [1] 0.8842105

audi_model3 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model3)$adj.r.squared

## [1] 0.8843208

audi_model4 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model4)$adj.r.squared

## [1] 0.8844904

audi_model5 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + best_actor_win + best_actress_win , data = imdb_data)
summary(audi_model5)$adj.r.squared

## [1] 0.8846628

audi_model6 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + best_actress_win , data = imdb_data)
summary(audi_model6)$adj.r.squared

## [1] 0.8848365

Best Model.

audi_model6 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating  + critics_rating + audience_rating + best_pic_nom + best_actress_win , data = imdb_data)
summary(audi_model6)

## 
## Call:
## lm(formula = audience_score ~ runtime + mpaa_rating + imdb_rating + 
##     critics_rating + audience_rating + best_pic_nom + best_actress_win, 
##     data = imdb_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.0190  -4.5540   0.4005   4.2443  24.6114 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -3.81844    3.15526  -1.210   0.2267    
## runtime                -0.03141    0.01544  -2.034   0.0424 *  
## mpaa_ratingNC-17       -8.71445    7.07340  -1.232   0.2184    
## mpaa_ratingPG          -1.39961    1.75667  -0.797   0.4259    
## mpaa_ratingPG-13       -2.24224    1.76711  -1.269   0.2050    
## mpaa_ratingR           -2.99429    1.68385  -1.778   0.0758 .  
## mpaa_ratingUnrated     -1.90798    1.91039  -0.999   0.3183    
## imdb_rating             9.37019    0.40208  23.304   <2e-16 ***
## critics_ratingFresh    -0.48581    0.79793  -0.609   0.5428    
## critics_ratingRotten   -1.28464    0.91337  -1.406   0.1601    
## audience_ratingUpright 20.54857    0.78041  26.330   <2e-16 ***
## best_pic_nomyes         3.40536    1.61919   2.103   0.0359 *  
## best_actress_winyes    -1.47000    0.89235  -1.647   0.1000 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.865 on 629 degrees of freedom
## Multiple R-squared:  0.887,  Adjusted R-squared:  0.8848 
## F-statistic: 411.4 on 12 and 629 DF,  p-value: < 2.2e-16

Diagnostic for the MLR.

Using the model, we will plot the residual plot.

ggplot(data = audi_model6, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

It seems that the points are randomly scattered into two different clusters.

A histogram for the residuals.

ggplot(data = audi_model6, aes(.resid)) + geom_histogram(binwidth = 3, col = "black", alpha = .8, fill = "purple") + labs(title = "Histogram of the Residuals", x = "Frequency", y = "Residuals")

From the histogram above, we see that the residuals are nearly normal centered around 0.

Part 5: Prediction

We will make predictions with the data provided by movies.Rdata and calculate the RMSE (Root Mean Square Error) to see how far our predicted values are from the observed values.

pred <- imdb_data %>% select(runtime, mpaa_rating, imdb_rating, critics_rating, audience_rating, best_pic_nom, best_actress_win)
rmse(predict(audi_model6, pred), imdb_data$audience_score)

## [1] 6.794786

Our RMSE is low so we can say the model is good.

Part 6: Conclusion

In conclusion, the model is okay to be used to predict audience scores with an adjusted r^2 of 88% and an RMSE of 6.8.