This analysis performs exploratory data anlaysis and linear regression, modeling, and prediction with a data set of 651 randomly sampled movies. The movie data used in the analysis was sourced from IMDB and Rotten Tomatoes APIs.
We will first prepare the workspace environment by setting global options.
#Install Knitr pckage if necessary and load Knitr library
list.of.packages <- c("knitr")
new.packages <-
list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings (suppressMessages (library (knitr)))
knitr::opts_chunk$set(
fig.width = 8,
fig.height = 4,
fig.path = 'figures/DataAnalysisProject_',
echo = TRUE,
warning = FALSE,
message = FALSE
)
#Clear variables
rm (list = ls (all = TRUE))
#Get and set working directory
setwd (getwd ())Install and load required libraries if neccessary.
#Check installed status of requried packages, and install if necessary
list.of.packages <-
c("statsr", "dplyr", "ggplot2", "kableExtra", "olsrr", "lmtest", "car")
new.packages <-
list.of.packages[!(list.of.packages %in% installed.packages()[, "Package"])]
if (length(new.packages))
install.packages(new.packages, repos = "http://cran.us.r-project.org")
suppressWarnings (suppressMessages (library (statsr)))
suppressWarnings (suppressMessages (library (dplyr)))
suppressWarnings (suppressMessages (library (ggplot2)))
suppressWarnings (suppressMessages (library (kableExtra)))
suppressWarnings (suppressMessages (library (olsrr)))
suppressWarnings (suppressMessages (library (lmtest)))
suppressWarnings (suppressMessages (library (car)))Load the data set.
load ( url ("https://d18ky98rnyall9.cloudfront.net/_e1fe0c85abec6f73c72d73926884eaca_movies.Rdata?Expires=1516924800&Signature=WUuGqHLm1BK2nWJPFSb~ONIiNylTRh4amRdLaX0bWK8T1daAC~XVI~Kw4rHHXsWMMp2pa2HLTbEUhk4sW3TA1MNvtocSihOxm33Dc~oEJuQ7alE0Wmm4PLyTGi~XUvNejiyoVpugdfuDphurmxsEBtF2xo59jrRRjjY7XjfJIQs_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"))According to the project’s code book, the observations in the “Movies” data set were randomly sampled from IMDB and Rotten Tomatoes APIs. For the sake of the analysis, we will assume the sampling method used was simple random sampling.
Since simple random sampling was used, we know that each population observation has an equal chance of being selected. Thus, we can infer generalizability about the “Movies” data set. Note that we cannot infer causality becase random assignment of the observations was not used.
We will research the association and predictive value between the response variable “Audience Score on Rotten Tomatoes,” and multiple explanatory variables to build a multiple linear regression model. Variable selection criteria will be covered in Part 4: Modeling. Note the explanatory variables used in the model were selected based on their contextual relevance, and we won’t be excluding variables upfront other than those that have obviously no value.
The “Audience score on Rotten Tomatoes” variable as a response is of interest because Rotten Tomatoes is more of a “popularity” score generator based on likes and dislikes, and as percentage score, this metric is easily digested. So, as a very simplified and condensed metric, uncovering any meaningful relationships will be interesting.
For our Exploratory Data Analysis, let’s first get familiar with our response variable, “Audience Score on Rotten Tomatoes,” by determining its distribution and summary statistics.
Using a histogram, we will survey the normality.
ggplot(data = movies, aes(movies$audience_score)) +
geom_histogram(
breaks = seq(0, 100, by = 10),
col = "white",
fill = "blue",
alpha = 1
) +
scale_x_continuous(
expand = c(0, 0),
limits = c(0, 100),
breaks = seq(0, 100, by = 10)
) +
scale_y_continuous(
expand = c(0, 0),
limits = c(0, 150),
breaks = seq(0, 150, by = 25)
) +
labs(title = "Histogram for Audience Score on Rotten Tomatoes") +
labs(x = "Audience Score", y = "Count")There data is slightly skewed to the left, suggesting Rotten Tomatoes reviews tend to be more positive. Let’s calculate summary statistics.
#Compute summary stats
AudienceScoreSummary <- movies %>%
select (audience_score) %>%
filter(audience_score != "NA") %>%
summarise (
Total = n (),
MinAudienceScore = min(audience_score, na.rm = TRUE),
MaxAudienceScore = max(audience_score, na.rm = TRUE),
AverageAudienceScore = mean(audience_score, na.rm = TRUE),
MedianAudienceScore = median(audience_score, na.rm = TRUE),
AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
)
#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummary %>%
kable("html") %>%
kable_styling()| Total | MinAudienceScore | MaxAudienceScore | AverageAudienceScore | MedianAudienceScore | AudienceScoreIQR |
|---|---|---|---|---|---|
| 651 | 11 | 97 | 62.36252 | 65 | 34 |
We see the median score is higher than the mean score, confirming the left skewness. Now, let’s compare the distribution of the rating on IMDB to see any potential bias. We will again use a histogram to survey the normality.
ggplot(data = movies, aes(movies$imdb_rating)) +
geom_histogram(
breaks = seq(0, 10, by = 1),
col = "white",
fill = "blue",
alpha = 1
) +
scale_x_continuous(
expand = c(0, 0),
limits = c(0, 10),
breaks = seq(0, 10, by = 1)
) +
scale_y_continuous(
expand = c(0, 0),
limits = c(0, 250),
breaks = seq(0, 250, by = 25)
) +
labs(title = "Histogram for IMDB Rating") +
labs(x = "IMDB Rating", y = "Count")There data is also slightly skewed to the left, suggesting IMDB ratings tend to be more positive. Let’s calculate summary statistics.
#Compute summary stats
IMDBRatingSummary <- movies %>%
select (imdb_rating) %>%
filter(imdb_rating != "NA") %>%
summarise (
Total = n (),
MinIMDBRating = min(imdb_rating, na.rm = TRUE),
MaxIMDBRating = max(imdb_rating, na.rm = TRUE),
AverageIMDBRating = mean(imdb_rating, na.rm = TRUE),
MedianIMDBRating = median(imdb_rating, na.rm = TRUE),
IMDBRatingIQR = IQR(imdb_rating, na.rm = TRUE)
)
#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
IMDBRatingSummary %>%
kable("html") %>%
kable_styling()| Total | MinIMDBRating | MaxIMDBRating | AverageIMDBRating | MedianIMDBRating | IMDBRatingIQR |
|---|---|---|---|---|---|
| 651 | 1.9 | 9 | 6.493088 | 6.6 | 1.4 |
Again, we see the median score is higher than the mean score, confirming the left skewness of the IMDB Ratings. At this point, we want to call out potential bias in the sample and its potential effects on generalizability.
However, this bias may disappear in a multi-variate view. Let’s breakdown our Rotten Tomatoes Audience Score by MPAA rating.
ggplot(data = subset(movies, !is.na(mpaa_rating) &
!is.na(audience_score)),
aes(x = mpaa_rating, y = audience_score)) +
geom_boxplot(fill = "#56B4E9") +
labs(title = "Rotten Tomatoes Audience Score by MPAA Rating", x = "MPAA Rating", y = "Audience Score")We can see that “Unrated” movies have overwhelmingly positive reviews with little variation, and PG-13 movies have the lowest median score with much variability. To get a more detailed view of skewness, let’s look at a quantile-quantile plot.
qplot(
sample = audience_score,
data = subset(movies,!is.na(mpaa_rating) &
!is.na(audience_score)),
color = mpaa_rating,
shape = mpaa_rating
)Left-skewness is definitive except for the PG-13 rating. Let’s see what the summary statistics say.
#Compute summary stats
AudienceScoreSummaryMPAA <- movies %>%
select (mpaa_rating, audience_score) %>%
filter(audience_score != "NA") %>%
group_by (mpaa_rating) %>%
summarise (
Total = n (),
MinAudienceScore = min(audience_score, na.rm = TRUE),
MaxAudienceScore = max(audience_score, na.rm = TRUE),
AverageAudienceScore = mean(audience_score, na.rm = TRUE),
MedianAudienceScore = median(audience_score, na.rm = TRUE),
AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
) %>%
arrange (desc(AverageAudienceScore))
#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummaryMPAA %>%
kable("html") %>%
kable_styling()| mpaa_rating | Total | MinAudienceScore | MaxAudienceScore | AverageAudienceScore | MedianAudienceScore | AudienceScoreIQR |
|---|---|---|---|---|---|---|
| Unrated | 50 | 19 | 96 | 78.48000 | 81.5 | 14.00 |
| G | 19 | 18 | 92 | 68.47368 | 74.0 | 25.50 |
| NC-17 | 2 | 56 | 71 | 63.50000 | 63.5 | 7.50 |
| R | 329 | 14 | 97 | 62.04255 | 64.0 | 35.00 |
| PG | 118 | 13 | 93 | 61.83051 | 65.0 | 32.75 |
| PG-13 | 133 | 11 | 94 | 56.67669 | 56.0 | 26.00 |
We can conclude that left-skenewness is pervasive for almost all of the MPAA ratings except the normally distributed PG-13 rating.
As an additional angle on multi-variate analysis of skewness, we will look at the movie genre and repeat our set of visuals and summary stats.
ggplot(data = subset(movies,!is.na(genre) &
!is.na(audience_score)),
aes(x = genre, y = audience_score)) +
geom_boxplot(fill = "#56B4E9") +
labs(title = "Rotten Tomatoes Audience Score by Genre", x = "Genre", y = "Audience Score") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))Genre variability is incredibly diverse. We can see that a movie’s rating is definitely affected by its genre, where documentaries fare much better than action movies. This revelation begs the question of significance between genres, which we may wan to address in additional analysis. Let’s look at the quantile-quantile plot.
p <- qplot(
sample = audience_score,
data = subset(movies, !is.na(genre) &
!is.na(audience_score)),
color = genre,
shape = genre
)
p + scale_shape_manual(values = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11))The “Documentary” and “Musical & Performing Arts” genres are notable examples of left-skew. Now for the summary stats.
#Compute summary stats
AudienceScoreSummaryGenre <- movies %>%
select (genre, audience_score) %>%
filter(audience_score != "NA") %>%
group_by (genre) %>%
summarise (
Total = n (),
MinAudienceScore = min(audience_score, na.rm = TRUE),
MaxAudienceScore = max(audience_score, na.rm = TRUE),
AverageAudienceScore = mean(audience_score, na.rm = TRUE),
MedianAudienceScore = median(audience_score, na.rm = TRUE),
AudienceScoreIQR = IQR(audience_score, na.rm = TRUE)
) %>%
arrange (desc(AverageAudienceScore))
#Create summary table
suppressWarnings (suppressMessages (library (kableExtra)))
AudienceScoreSummaryGenre %>%
kable("html") %>%
kable_styling()| genre | Total | MinAudienceScore | MaxAudienceScore | AverageAudienceScore | MedianAudienceScore | AudienceScoreIQR |
|---|---|---|---|---|---|---|
| Documentary | 52 | 57 | 96 | 82.75000 | 86.0 | 12.75 |
| Musical & Performing Arts | 12 | 55 | 95 | 80.16667 | 80.5 | 13.75 |
| Other | 16 | 21 | 91 | 66.68750 | 73.5 | 29.50 |
| Drama | 305 | 13 | 95 | 65.34754 | 70.0 | 28.00 |
| Art House & International | 14 | 29 | 86 | 64.00000 | 65.5 | 29.00 |
| Animation | 9 | 18 | 88 | 62.44444 | 65.0 | 11.00 |
| Mystery & Suspense | 59 | 15 | 97 | 55.94915 | 54.0 | 30.00 |
| Action & Adventure | 65 | 11 | 94 | 53.78462 | 52.0 | 28.00 |
| Comedy | 87 | 19 | 93 | 52.50575 | 50.0 | 30.50 |
| Science Fiction & Fantasy | 9 | 17 | 85 | 50.88889 | 47.0 | 53.00 |
| Horror | 23 | 24 | 84 | 45.82609 | 43.0 | 17.50 |
The “Drama” genre had the highest number of observations, and also had the 4th largest mean Audience score. This is helping to shift our global average.
As mentioned in Part 2, the “Audience Score on Rotten Tomatoes” (audience_score) is our dependent variable. The below explanatory variables will be excluded from modeling because in terms of dimensional richness, they offer no meaningful information gain or they add model noise because they are very high-cardinality (unique). For example, the “Link to Rotten Tomatoes page for the movie” variable would have virtually no model value due to its tangential nature and its high-cardinality - in other words, a URL has no domain context from a decision making perspective.
The below explanatory variables will be included in our modeling as they offer the most information gain potential and dimensional richness.
To build our mulitple linear regression model, we will use stepwise forward selection via the “ols_step_forward” function from the “olsrr” R package. The function will build the model from our set of explanatory variables by entering predictors based on p-values in a stepwise manner. The first variable to be added to the model is most the significant, and more variables are included until none of remaining variables are “significant” when added to the model.
Compared to stepwise backward selection, stepwise forward selection was used for its low resource overhead and thus ease of reproducibility. It is worth noting the forward selection key disadvantage, such that each addition of a new variable may undermine the significance of variables already in the model.
Let’s run the model.
#create the list of predictor coefficients
MultiRegression1 <- lm(
audience_score ~ title_type +
genre +
runtime +
mpaa_rating +
thtr_rel_month +
dvd_rel_month +
imdb_rating +
critics_rating +
critics_score +
best_pic_nom +
best_pic_win +
best_actor_win +
best_actress_win +
best_dir_win +
top200_box,
data = movies
)
#generate the multi-regression model
ols_step_forward (MultiRegression1, details = TRUE)## Variable Selection Procedure
## Dependent Variable: audience_score
##
## Forward Selection: Step 1
##
## Variable imdb_rating Entered
##
## Model Summary
## ---------------------------------------------------------------
## R 0.865 RMSE 10.160
## R-Squared 0.748 Coef. Var 16.291
## Adj. R-Squared 0.748 MSE 103.219
## Pred R-Squared 0.745 MAE 7.742
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------------
## Regression 198831.499 1 198831.499 1926.312 0.0000
## Residual 66988.946 649 103.219
## Total 265820.445 650
## -------------------------------------------------------------------------
##
## Parameter Estimates
## --------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## --------------------------------------------------------------------------------------------
## (Intercept) -42.328 2.418 -17.503 0.000 -47.077 -37.580
## imdb_rating 16.123 0.367 0.865 43.890 0.000 15.402 16.845
## --------------------------------------------------------------------------------------------
##
##
## Forward Selection: Step 2
##
## Variable critics_score Entered
##
## Model Summary
## ---------------------------------------------------------------
## R 0.867 RMSE 10.079
## R-Squared 0.752 Coef. Var 16.162
## Adj. R-Squared 0.752 MSE 101.581
## Pred R-Squared 0.749 MAE 7.702
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------
## Regression 199995.983 2 99997.991 984.417 0.0000
## Residual 65824.463 648 101.581
## Total 265820.445 650
## ------------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------------
## (Intercept) -37.032 2.864 -12.930 0.000 -42.656 -31.408
## imdb_rating 14.658 0.566 0.786 25.901 0.000 13.546 15.769
## critics_score 0.073 0.022 0.103 3.386 0.001 0.031 0.116
## ----------------------------------------------------------------------------------------------
##
##
## Forward Selection: Step 3
##
## Variable genre Entered
##
## Model Summary
## --------------------------------------------------------------
## R 0.877 RMSE 9.817
## R-Squared 0.769 Coef. Var 15.742
## Adj. R-Squared 0.764 MSE 96.378
## Pred R-Squared 0.759 MAE 7.415
## --------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------
## Regression 204331.585 12 17027.632 176.676 0.0000
## Residual 61488.860 638 96.378
## Total 265820.445 650
## ------------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------------
## (Intercept) -37.153 3.109 -11.950 0.000 -43.258 -31.048
## imdb_rating 14.766 0.573 0.792 25.777 0.000 13.642 15.891
## critics_score 0.067 0.021 0.094 3.122 0.002 0.025 0.109
## genreAnimation 9.117 3.498 0.053 2.606 0.009 2.247 15.987
## genreArt House & International 0.030 2.905 0.000 0.010 0.992 -5.674 5.734
## genreComedy 2.092 1.614 0.035 1.296 0.195 -1.078 5.261
## genreDocumentary 1.194 1.968 0.016 0.607 0.544 -2.671 5.059
## genreDrama -0.203 1.380 -0.005 -0.147 0.883 -2.913 2.507
## genreHorror -5.028 2.387 -0.046 -2.106 0.036 -9.716 -0.340
## genreMusical & Performing Arts 4.398 3.138 0.029 1.401 0.162 -1.765 10.561
## genreMystery & Suspense -6.253 1.779 -0.089 -3.515 0.000 -9.746 -2.759
## genreOther 1.582 2.763 0.012 0.573 0.567 -3.843 7.007
## genreScience Fiction & Fantasy -0.291 3.503 -0.002 -0.083 0.934 -7.170 6.588
## ---------------------------------------------------------------------------------------------------------------
## Forward Selection Method
##
## Candidate Terms:
##
## 1 . title_type
## 2 . genre
## 3 . runtime
## 4 . mpaa_rating
## 5 . thtr_rel_month
## 6 . dvd_rel_month
## 7 . imdb_rating
## 8 . critics_rating
## 9 . critics_score
## 10 . best_pic_nom
## 11 . best_pic_win
## 12 . best_actor_win
## 13 . best_actress_win
## 14 . best_dir_win
## 15 . top200_box
##
## --------------------------------------------------------------------------------
## Selection Summary
## --------------------------------------------------------------------------------
## Variable Adj.
## Step Entered R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------
## 1 imdb_rating 0.7480 0.7476 49.1726 4870.0444 10.1597
## 2 critics_score 0.7524 0.7516 39.0709 4860.6284 10.0787
## 3 genre 0.7687 0.7643 -3.9862 4836.2722 9.8172
## --------------------------------------------------------------------------------
Our model has selected the variables of IMDB Rating, Genre, and the Critics score on Rotten Tomatoes. Let’s peform diagnostics to evaluate how well the model fits the data. We will use the standard diagnostic plot set of:
First, we need to generate a new model based on the stepwise forward selection.
#generate the new model
MultiRegression2 <- lm(
audience_score ~
imdb_rating +
critics_score +
genre,
data = movies
)Next, we create our plots.
#plot diagnostics with the generated model
plot (MultiRegression2)Now we can step through each plot to understand the diagnostic assessment.
To verify our residuals vs. fitted values’ heteroscedasticity, we will use a Breusch-Pagan Test and an NCV Test.
bptest (MultiRegression2)##
## studentized Breusch-Pagan test
##
## data: MultiRegression2
## BP = 108.51, df = 12, p-value < 2.2e-16
ncvTest (MultiRegression2)## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 164.1416 Df = 1 p = 1.4087e-37
Since both tests have a p-value less than a significance level of 0.05, we can conclude heteroscedasticity exists.
To conclude our modeling section, we will provide the interpretation of the model coefficients. Here’s a summary of the model.
summary (MultiRegression2)##
## Call:
## lm(formula = audience_score ~ imdb_rating + critics_score + genre,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.708 -6.446 0.614 5.479 50.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -37.15275 3.10900 -11.950 < 2e-16 ***
## imdb_rating 14.76644 0.57286 25.777 < 2e-16 ***
## critics_score 0.06687 0.02142 3.122 0.001879 **
## genreAnimation 9.11698 3.49847 2.606 0.009375 **
## genreArt House & International 0.03008 2.90452 0.010 0.991740
## genreComedy 2.09167 1.61412 1.296 0.195493
## genreDocumentary 1.19414 1.96833 0.607 0.544281
## genreDrama -0.20316 1.38018 -0.147 0.883022
## genreHorror -5.02795 2.38740 -2.106 0.035591 *
## genreMusical & Performing Arts 4.39791 3.13828 1.401 0.161588
## genreMystery & Suspense -6.25279 1.77914 -3.515 0.000472 ***
## genreOther 1.58228 2.76266 0.573 0.567024
## genreScience Fiction & Fantasy -0.29079 3.50319 -0.083 0.933872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.817 on 638 degrees of freedom
## Multiple R-squared: 0.7687, Adjusted R-squared: 0.7643
## F-statistic: 176.7 on 12 and 638 DF, p-value: < 2.2e-16
Let’s review our coefficients:
At this point, we are ready to test the predictive ability of our multi-regession model. We will use the 2016 comedy film “Bad Santa 2” to predict the Rotten Tomatoes Audience Score.
Using the URLs https://www.rottentomatoes.com/m/bad_santa_2 and http://www.imdb.com/title/tt1798603/?ref_=ttls_li_tt, we see that our Rotten Tomatoes Critics Score is 23%, and the IMDB rating is 5.6.
Using R’s “predict” function, let’s run our model.
#Create a data frame for the variables we will pass to the model
NewMovie <- data.frame(imdb_rating = 5.6, critics_score = 23, genre = "Comedy" )
#Predict the Audience score with the provided variables
predict (MultiRegression2, NewMovie, interval="predict", level = 0.95)## fit lwr upr
## 1 49.16892 29.76966 68.56818
With the provided variables, the predicted Audience Score is 49.16892 which is generous compared to the actual Audience Score of 34.
The 95% prediction interval of the Audience Score for an IMDB Rating of 5.6, a Critic’s Score of 23, and a genre of “Comedy” is between 29.76966 % and 68.56818 %. If we were to collect additional sample movie data sets, there is a 95% probability that our provided variables will generate a predicted value within the prediction interval.
Our prediction interval width is allowing for a high degree of accuracy, but it is not very precise because it spans almost 40 percentage points. If we are concerned with precision, we may want to select a narrower prediction interval or increase our sample size.
When looking at any ratings methodology, our desire is to see ratings that do not have inherent bias in so far that we don’t have too many critical or complimentary reviews. Our exploratory data analysis indicates that our given sample may actually have a complimentary bias. Therefore, we need to be careful with any inference or modeling applications.
Despite a resulting simple model, additional mulitple regression techniques, such as backward selection, should also be entertained to check model convergence and help refine prediction. We would also want to investigate the sampling method and experiment with larger sample sizes to increase our prediction precision. It is assumed that the IMDB and Rotten Tomatoe APIs would not prohibit us from obtaining an adequately sized data set.