library(ggplot2)## Warning: package 'ggplot2' was built under R version 4.1.1
library(tidyverse)## Warning: package 'tidyverse' was built under R version 4.1.1
## Warning: package 'tibble' was built under R version 4.1.2
## Warning: package 'tidyr' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.2
## Warning: package 'purrr' was built under R version 4.1.1
## Warning: package 'dplyr' was built under R version 4.1.1
## Warning: package 'forcats' was built under R version 4.1.1
library(statsr)## Warning: package 'BayesFactor' was built under R version 4.1.2
## Warning: package 'coda' was built under R version 4.1.2
## Warning: package 'Matrix' was built under R version 4.1.1
library(GGally)## Warning: package 'GGally' was built under R version 4.1.2
library(Metrics)## Warning: package 'Metrics' was built under R version 4.1.2
load("movies.Rdata")The data was collected by the simple random sample. The data can be used to generalize and not causal.
First, I will select the variables that will be used from the data set.
imdb_data <- movies %>% select(title, title_type, genre, runtime, mpaa_rating, studio, imdb_rating, imdb_num_votes, critics_rating, critics_score, audience_rating, audience_score, best_pic_nom, best_pic_win, top200_box, best_actor_win, best_actress_win, best_dir_win) %>% drop_na()A quick analysis to see how the data looks like.
str(imdb_data)## tibble [642 x 18] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:642] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:642] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ imdb_rating : num [1:642] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:642] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:642] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:642] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
Statistical summary of the data as a whole.
summary(imdb_data)## title title_type genre runtime
## Length:642 Documentary : 53 Drama :303 Min. : 39
## Class :character Feature Film:585 Comedy : 86 1st Qu.: 93
## Mode :character TV Movie : 4 Action & Adventure: 64 Median :103
## Mystery & Suspense: 59 Mean :106
## Documentary : 51 3rd Qu.:116
## Horror : 22 Max. :267
## (Other) : 57
## mpaa_rating studio imdb_rating
## G : 18 Paramount Pictures : 37 Min. :1.9
## NC-17 : 1 Warner Bros. Pictures : 30 1st Qu.:5.9
## PG :117 Sony Pictures Home Entertainment: 27 Median :6.6
## PG-13 :133 Universal Pictures : 23 Mean :6.5
## R :324 Warner Home Video : 19 3rd Qu.:7.3
## Unrated: 49 20th Century Fox : 18 Max. :9.0
## (Other) :488
## imdb_num_votes critics_rating critics_score audience_rating
## Min. : 180 Certified Fresh:135 Min. : 1.00 Spilled:269
## 1st Qu.: 4861 Fresh :205 1st Qu.: 33.00 Upright:373
## Median : 15470 Rotten :302 Median : 61.00
## Mean : 58255 Mean : 57.78
## 3rd Qu.: 59034 3rd Qu.: 83.00
## Max. :893008 Max. :100.00
##
## audience_score best_pic_nom best_pic_win top200_box best_actor_win
## Min. :11.0 no :620 no :635 no :627 no :550
## 1st Qu.:46.0 yes: 22 yes: 7 yes: 15 yes: 92
## Median :65.0
## Mean :62.5
## 3rd Qu.:80.0
## Max. :97.0
##
## best_actress_win best_dir_win
## no :570 no :599
## yes: 72 yes: 43
##
##
##
##
##
A histogram to see the distribution of the audience_score.
ggplot(imdb_data, aes(audience_score)) + geom_histogram(binwidth = 10, col = "black", fill = "purple", alpha = 0.8) + labs(title = "Histogram of audience score", y = "Count", x = "Audience Score")From the histogram, we see that the data is left skewed. Next, we will generate a ggpairs plot with the numeric variables we have from our data.
linear <- imdb_data %>% select(audience_score, runtime, imdb_rating, imdb_num_votes, critics_score)
ggpairs(data = linear, title = "Pairs plot for Numerical data")From the ggpairs plot, we see variables that are highly correlated to the audience_score causing collinearity. We will ignore this because we will be building a model using the adjusted r^2 for predicting.
To create a model to predict the audience_score, we will use the backwards elimination method to find the model with the highest adjusted r^2.
Below is the full model.
audi_model <- lm(audience_score ~ title_type + runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model)##
## Call:
## lm(formula = audience_score ~ title_type + runtime + mpaa_rating +
## imdb_rating + imdb_num_votes + critics_rating + audience_rating +
## best_pic_nom + best_pic_win + top200_box + best_actor_win +
## best_actress_win + best_dir_win, data = imdb_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.1000 -4.6202 0.4403 4.2037 24.4228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.951e+00 3.571e+00 -0.826 0.4090
## title_typeFeature Film -2.817e-01 1.312e+00 -0.215 0.8300
## title_typeTV Movie -1.127e+00 3.617e+00 -0.311 0.7555
## runtime -3.293e-02 1.663e-02 -1.980 0.0481 *
## mpaa_ratingNC-17 -8.555e+00 7.118e+00 -1.202 0.2298
## mpaa_ratingPG -1.407e+00 1.771e+00 -0.794 0.4273
## mpaa_ratingPG-13 -2.408e+00 1.788e+00 -1.346 0.1787
## mpaa_ratingR -3.091e+00 1.710e+00 -1.808 0.0712 .
## mpaa_ratingUnrated -1.886e+00 2.027e+00 -0.931 0.3524
## imdb_rating 9.268e+00 4.226e-01 21.930 <2e-16 ***
## imdb_num_votes 3.171e-06 3.125e-06 1.015 0.3107
## critics_ratingFresh -2.966e-01 8.499e-01 -0.349 0.7272
## critics_ratingRotten -1.133e+00 9.445e-01 -1.199 0.2309
## audience_ratingUpright 2.057e+01 7.874e-01 26.131 <2e-16 ***
## best_pic_nomyes 3.731e+00 1.809e+00 2.063 0.0396 *
## best_pic_winyes -2.499e+00 3.189e+00 -0.784 0.4335
## top200_boxyes -9.303e-01 1.924e+00 -0.484 0.6289
## best_actor_winyes -2.103e-01 8.203e-01 -0.256 0.7978
## best_actress_winyes -1.377e+00 9.022e-01 -1.526 0.1275
## best_dir_winyes 4.689e-01 1.194e+00 0.393 0.6947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.894 on 622 degrees of freedom
## Multiple R-squared: 0.8873, Adjusted R-squared: 0.8839
## F-statistic: 257.7 on 19 and 622 DF, p-value: < 2.2e-16
Now we will iterate until we find the best adjusted R^2 model.
audi_model1 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + imdb_num_votes + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model1)$adj.r.squared## [1] 0.8842088
audi_model2 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + best_pic_win + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model2)$adj.r.squared## [1] 0.8842105
audi_model3 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + top200_box + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model3)$adj.r.squared## [1] 0.8843208
audi_model4 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + best_actor_win + best_actress_win + best_dir_win, data = imdb_data)
summary(audi_model4)$adj.r.squared## [1] 0.8844904
audi_model5 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + best_actor_win + best_actress_win , data = imdb_data)
summary(audi_model5)$adj.r.squared## [1] 0.8846628
audi_model6 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + best_actress_win , data = imdb_data)
summary(audi_model6)$adj.r.squared## [1] 0.8848365
Best Model.
audi_model6 <- lm(audience_score ~ runtime + mpaa_rating + imdb_rating + critics_rating + audience_rating + best_pic_nom + best_actress_win , data = imdb_data)
summary(audi_model6)##
## Call:
## lm(formula = audience_score ~ runtime + mpaa_rating + imdb_rating +
## critics_rating + audience_rating + best_pic_nom + best_actress_win,
## data = imdb_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.0190 -4.5540 0.4005 4.2443 24.6114
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.81844 3.15526 -1.210 0.2267
## runtime -0.03141 0.01544 -2.034 0.0424 *
## mpaa_ratingNC-17 -8.71445 7.07340 -1.232 0.2184
## mpaa_ratingPG -1.39961 1.75667 -0.797 0.4259
## mpaa_ratingPG-13 -2.24224 1.76711 -1.269 0.2050
## mpaa_ratingR -2.99429 1.68385 -1.778 0.0758 .
## mpaa_ratingUnrated -1.90798 1.91039 -0.999 0.3183
## imdb_rating 9.37019 0.40208 23.304 <2e-16 ***
## critics_ratingFresh -0.48581 0.79793 -0.609 0.5428
## critics_ratingRotten -1.28464 0.91337 -1.406 0.1601
## audience_ratingUpright 20.54857 0.78041 26.330 <2e-16 ***
## best_pic_nomyes 3.40536 1.61919 2.103 0.0359 *
## best_actress_winyes -1.47000 0.89235 -1.647 0.1000 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.865 on 629 degrees of freedom
## Multiple R-squared: 0.887, Adjusted R-squared: 0.8848
## F-statistic: 411.4 on 12 and 629 DF, p-value: < 2.2e-16
Using the model, we will plot the residual plot.
ggplot(data = audi_model6, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") It seems that the points are randomly scattered into two different clusters.
A histogram for the residuals.
ggplot(data = audi_model6, aes(.resid)) + geom_histogram(binwidth = 3, col = "black", alpha = .8, fill = "purple") + labs(title = "Histogram of the Residuals", x = "Frequency", y = "Residuals")From the histogram above, we see that the residuals are nearly normal centered around 0.
We will make predictions with the data provided by movies.Rdata and calculate the RMSE (Root Mean Square Error) to see how far our predicted values are from the observed values.
pred <- imdb_data %>% select(runtime, mpaa_rating, imdb_rating, critics_rating, audience_rating, best_pic_nom, best_actress_win)
rmse(predict(audi_model6, pred), imdb_data$audience_score)## [1] 6.794786
Our RMSE is low so we can say the model is good.
In conclusion, the model is okay to be used to predict audience scores with an adjusted r^2 of 88% and an RMSE of 6.8.