options(warn = -1)
library(tidyverse)
library(statsr)
library(GGally)
library(gridExtra)
library(SignifReg)
library(tidyr)load("movies.Rdata")The data set is comprised of 651 randomly sampled movies produced and released before 2016. It contains 651 observations with 32 variables including title, genre and ratings from IMDB and Rotten Tomatoes. Each row within the dataframe represents a single movie.
As all the movies are being randomly sampled (Not much details provided on how or which type of random sampling is being used), we could assume the result of the study could be made to all movies produced and released before 2016.
The study is an observational study, no random assignment and controls are being used. The data collected in this survey could not be used in the establishment of any causal relationships.
From a articles written by Mark Bridge on The Times (link in Reference). Research shows that the voters on online movies database like Rotten Tomato, IMDb and Metaritic are biased towards male voters, and the distribution of rankings are skwered by mostly male voters (means that generally male voters tend to gives high score to the movie). As no random sampling techniques are used for the voting mechanisms, voters vote completely on voluntary base. We would expect bias exsist in our data.
In this assignment, our research questions is to find out which variables presented within the dataframe are good predictors on movies’ popularity (i.e. audience score). The model established would hopefully help us to make recommendations to audience on popular movies.
Our first step is to have a glance on the data set provided in this assignment.
Notice that some of the variables provided little to no useful information for us in constructing the linear regression model, for example, title type, studio, actor 1-5, imdb url and rotten tomatoes url. Not all the variables will be of interest in this study, we are going to filter out a few of them before we apply modeling techniques to optimize the linear regression model.
Some of the variables convey a similar meaning, for example, best_actor_win and best_actor_nom. As we expect a winning in Oscar awards would be of a higher weight than a nomination in predicting the popularity of the movie, in view of that, we only select the Oscar awards winning variables (best_pic_win, best_actor_win, best_actress_win, best_dir_win), and leave out all the Oscar awards nomination variables.
# Let's look at the structure of the dataframe.
str(movies)## tibble [651 x 32] (S3: tbl_df/tbl/data.frame)
## $ title : chr [1:651] "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
## $ title_type : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
## $ genre : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
## $ runtime : num [1:651] 80 101 84 139 90 78 142 93 88 119 ...
## $ mpaa_rating : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
## $ studio : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## $ thtr_rel_year : num [1:651] 2013 2001 1996 1993 2004 ...
## $ thtr_rel_month : num [1:651] 4 3 8 10 9 1 1 11 9 3 ...
## $ thtr_rel_day : num [1:651] 19 14 21 1 10 15 1 8 7 2 ...
## $ dvd_rel_year : num [1:651] 2013 2001 2001 2001 2005 ...
## $ dvd_rel_month : num [1:651] 7 8 8 11 4 4 2 3 1 8 ...
## $ dvd_rel_day : num [1:651] 30 28 21 6 19 20 18 2 21 14 ...
## $ imdb_rating : num [1:651] 5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
## $ imdb_num_votes : int [1:651] 899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
## $ critics_rating : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
## $ critics_score : num [1:651] 45 96 91 80 33 91 57 17 90 83 ...
## $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
## $ audience_score : num [1:651] 73 81 91 76 27 86 76 47 89 66 ...
## $ best_pic_nom : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_pic_win : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_actor_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
## $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ best_dir_win : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ top200_box : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ director : chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
## $ actor1 : chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
## $ actor2 : chr [1:651] "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
## $ actor3 : chr [1:651] "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
## $ actor4 : chr [1:651] "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
## $ actor5 : chr [1:651] "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
## $ imdb_url : chr [1:651] "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
## $ rt_url : chr [1:651] "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...
# We will select only the variables that we are interested in.
df <- movies %>% select(title, genre, runtime,
thtr_rel_year, imdb_rating, imdb_num_votes, critics_score,
audience_score, best_pic_win, best_actor_win,
best_actress_win, best_dir_win, top200_box)Next, we are going to perform EDA to look at how some of the variables in the data set are distributed. The main idea here is to have general understanding of the characteristics of the movies in the data set. For example, are most of the movies from the same genre? What are the average runtime of the movies in the data set?
One thing to note here, the EDA here may or may not be helpful in engineering the linear regression model, and it is not the main goal in this study.
We will visualize the data using ggplot2 function and other visualization tools come with standard R package.
# Let's first look at the genre distribution of the movies.
ggplot(df, aes(x=genre)) + geom_bar() +
labs(title = "Frequency of movies of different genre") +
theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))# And which genre receives the highest audience score?
ggplot(df, aes(x=genre, y= audience_score)) + geom_boxplot() +
labs(title="Genre v.s. Audience Score") +
theme(axis.text.x = element_text(angle = 90),
plot.title = element_text(hjust = 0.5))It is interesting to see that nearly half of the movies inside the database are Drama. And seems that Musical & Performing Arts genre generally receive a higher audience_score and with less variation.
# Next, let's look at the runtime distribution
ggplot(df, aes(x=runtime)) + geom_histogram(bins=50) +
labs(title = "Distribution of runtime") +
theme(plot.title = element_text(hjust = 0.5))mean(df$runtime, na.rm=TRUE)## [1] 105.8215
df <- drop_na(df)The distribution is a bit right-skewed, and notice that there is one NA value, we will exclude the movie out of our dataframe. The average runtime of the movie is 105.8 mins.
# Next, we are going to look at the distribution of imdb_rating and audience_score.
p1 <- ggplot(df, aes(y=imdb_rating)) + geom_boxplot()
p2 <- ggplot(df, aes(y=critics_score)) + geom_boxplot()
p3 <- ggplot(df, aes(y=audience_score)) + geom_boxplot()
grid.arrange(p1,p2,p3, ncol=3)More then half of the movies received a IMDb rating more than 6.5 and more than half of the movies received an audience score more than 65. For critics rating, more than half of the movies received a critics score more above 60.
# Let's also look at the proportion of movies that has won Oscar best picture award, and the proportion of movies that has a Oscar best actor, best actress, or best director participate in it.
prop.table(table(df$best_pic_win))##
## no yes
## 0.98923077 0.01076923
prop.table(table(df$best_actor_win))##
## no yes
## 0.8569231 0.1430769
prop.table(table(df$best_actress_win))##
## no yes
## 0.8892308 0.1107692
prop.table(table(df$best_dir_win))##
## no yes
## 0.93384615 0.06615385
There are only about 10 percent of the movies has won a Oscar best picture award, about 14 percent of the movies has a Oscar best actor starring in it, about 11 percent of the movies has a Oscar best actress starring in it, and about 6 percent of the movies were directed by a Oscar best director.
Before we start to engineer our prediction model, we have to think which check the collinearity between numeric variables, as a large association between explanatory numeric variables might weaken the predictive power of our model. To check for association between explanatory variables in the data set, we are going to use a pair chart to analyze for correlation.
# Let's look at the relationship between the variables.
ggpairs(df, columns = c(3:8))We can obviously notice that the association between imdb rating and critics score are pretty strong, with a positive correlation coefficient of 0.704. It is not a surprise that these two movie online database give a similar score to the same movie, as both sites are English website hosted in western country, the visitor of the both website might come from the same population. Due to collinearity problem, we are going to exclude one of them out of our model.
So the equation of our full model to start with will pretty much looks like this:
\(AudienceScore = S_1 * Genre + S_2 * Runtime + S_3 * ReleaseYear + S_4 * imdbRating + S_5 *imdbVotes + S_6 * BestPicture + S_7 * BestActor + S_8 * BestActress + S_9 * BestDirector + S_{10} * Top200Box\)
Where S denote for the slope associated with each variable.
Before we start to construct our model, we have to divide our data set into training set and test set.
Training set is a segment of the original data set, it serves the purposes of model training and is used to fit the parameters of the model. Test set is another subset of the original data set, which should not come from segment of the training set, but it should follow the same distribution of the training set. The purpose of test set is to provide unbiased evaluation of the final model. Training of model should not be done using test set, it could result in a very high accuracy which is not exactly the case.
# Set the seeds before using random number generating functions.
set.seed(123)
# We are going to use a 80:20 split ratio to create our training set and test set.
# Define training set size, i.e. 0.8 * nrow(df)
training_smpsize <- floor(0.8 * nrow(df))
# Randomly select row indices from the original data set.
train_ind <- sample(seq_len(nrow(df)), size=training_smpsize)
# Using the sampled index to split the data set into training and test set.
training_df <- df[train_ind, ]
test_df <- df[-train_ind, ]We will start with the full model established above, and perform backward selection process to optimize our model base on using Adjusted R squared as the criterion, our goal is to create model with highest predictive power so that we could recommend movies to audience base on the popularity of the movies.
# The first step is to create our full model, with all the variables in the dataframe.
lm_full <- lm(audience_score ~ genre + runtime + thtr_rel_year +
imdb_rating + imdb_num_votes + critics_score + best_pic_win +
best_actor_win + best_actress_win + best_dir_win + top200_box,
data=training_df)
# Let's look at our full model.
summary(lm_full)##
## Call:
## lm(formula = audience_score ~ genre + runtime + thtr_rel_year +
## imdb_rating + imdb_num_votes + critics_score + best_pic_win +
## best_actor_win + best_actress_win + best_dir_win + top200_box,
## data = training_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.642 -6.101 0.376 5.618 42.454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.211e+02 8.390e+01 2.635 0.00867 **
## genreAnimation 1.057e+01 3.830e+00 2.760 0.00599 **
## genreArt House & International -3.507e+00 3.810e+00 -0.920 0.35777
## genreComedy 8.968e-01 1.726e+00 0.520 0.60350
## genreDocumentary 2.969e+00 2.314e+00 1.283 0.20005
## genreDrama 4.486e-01 1.515e+00 0.296 0.76732
## genreHorror -4.384e+00 2.487e+00 -1.763 0.07848 .
## genreMusical & Performing Arts 7.874e+00 3.665e+00 2.149 0.03215 *
## genreMystery & Suspense -5.483e+00 1.933e+00 -2.837 0.00474 **
## genreOther 2.279e+00 2.943e+00 0.774 0.43902
## genreScience Fiction & Fantasy -2.101e+00 3.800e+00 -0.553 0.58056
## runtime -2.286e-02 2.845e-02 -0.803 0.42222
## thtr_rel_year -1.285e-01 4.166e-02 -3.083 0.00216 **
## imdb_rating 1.489e+01 6.667e-01 22.338 < 2e-16 ***
## imdb_num_votes 9.396e-06 5.113e-06 1.838 0.06671 .
## critics_score 5.251e-02 2.305e-02 2.278 0.02312 *
## best_pic_winyes -1.328e+00 4.407e+00 -0.301 0.76336
## best_actor_winyes -1.381e+00 1.232e+00 -1.121 0.26289
## best_actress_winyes -2.145e+00 1.419e+00 -1.512 0.13120
## best_dir_winyes -6.542e-01 1.831e+00 -0.357 0.72105
## top200_boxyes -1.232e-01 2.771e+00 -0.044 0.96455
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.406 on 499 degrees of freedom
## Multiple R-squared: 0.7873, Adjusted R-squared: 0.7788
## F-statistic: 92.36 on 20 and 499 DF, p-value: < 2.2e-16
The full model has a R-squared value of 0.7873, which is considerably high, to interpret this, we say that this model account for 78.73% variation of the output variables, while the value is also adjusted for number of predictors in the model.
To further modified the model using backward elimination with adjusted R-squared as the criterion. The first step is to drop each variable one at a time, and record the updated adjusted-r-squared. Here we are going to drop genre first, and then runtime, and then thtr_rel_year and so on. And we to pick the model with the highest adjusted r squared, and perform next iteration until none of the model yields an increase in adjusted r square anymore.
The whole iterative process could be very tedious and prone to error, and here we are going to use SignifReg function to do the work for us.
lm_final <- SignifReg(lm_full, direction="backward", criterion="r-adj")
# And let's look at our final model.
summary(lm_final)##
## Call:
## lm(formula = audience_score ~ genre + thtr_rel_year + imdb_rating +
## critics_score, data = training_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.355 -5.877 0.372 5.678 43.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.02371 77.68224 1.880 0.060716 .
## genreAnimation 10.41367 3.79291 2.746 0.006256 **
## genreArt House & International -4.12208 3.78744 -1.088 0.276955
## genreComedy 0.42846 1.70704 0.251 0.801920
## genreDocumentary 1.85580 2.15544 0.861 0.389655
## genreDrama -0.63096 1.44622 -0.436 0.662819
## genreHorror -4.42438 2.47049 -1.791 0.073908 .
## genreMusical & Performing Arts 6.90677 3.61743 1.909 0.056788 .
## genreMystery & Suspense -6.69820 1.87330 -3.576 0.000383 ***
## genreOther 2.03339 2.92730 0.695 0.487605
## genreScience Fiction & Fantasy -1.75166 3.79558 -0.461 0.644638
## thtr_rel_year -0.09237 0.03883 -2.379 0.017747 *
## imdb_rating 15.10031 0.61735 24.460 < 2e-16 ***
## critics_score 0.05334 0.02292 2.327 0.020365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.417 on 506 degrees of freedom
## Multiple R-squared: 0.7838, Adjusted R-squared: 0.7782
## F-statistic: 141.1 on 13 and 506 DF, p-value: < 2.2e-16
We can see that in our final model, best_pic_win, best_actor_win, best_actress_win, best_dir_win and top200_box have all been dropped. This yields a slightly increase of adjusted r squared. And using backward elimination with adjusted r square as the criterion help us to achieve a model with highest predictive power.
Before we move on to prediction using our parsimonious model, we are going to perform Model Diagnostics to check whether critical conditions for multiple linear regression has been met or not.
1) Normality of the residuals
hist(lm_final$residuals, main="Distribution of Residuals", xlab="Model Residuals", ylab="Frequency")qqnorm(lm_final$residuals)
qqline(lm_final$residuals)Notice that the distribution is a little bit right skewed, other than that, the distribution look fairly normally distributed, and the qqplot also suggest that within the lower limit to the upper 2nd quantile, the residuals are normally distributed. We could say that the conditions is fairly satisfied.
2) Constant variability of Residuals
plot(lm_final$residuals ~ lm_final$fitted.values, main="Residuals vs fitted Values") +
abline(0,0)## integer(0)
From the residual plot above, we could see that the residuals are randomly distributed around zero, a little bit fan shape is observed, the constant variability condition seems to be fairly satisfied.
# Let's look at the our model ummary once again.
summary(lm_final)##
## Call:
## lm(formula = audience_score ~ genre + thtr_rel_year + imdb_rating +
## critics_score, data = training_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.355 -5.877 0.372 5.678 43.186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 146.02371 77.68224 1.880 0.060716 .
## genreAnimation 10.41367 3.79291 2.746 0.006256 **
## genreArt House & International -4.12208 3.78744 -1.088 0.276955
## genreComedy 0.42846 1.70704 0.251 0.801920
## genreDocumentary 1.85580 2.15544 0.861 0.389655
## genreDrama -0.63096 1.44622 -0.436 0.662819
## genreHorror -4.42438 2.47049 -1.791 0.073908 .
## genreMusical & Performing Arts 6.90677 3.61743 1.909 0.056788 .
## genreMystery & Suspense -6.69820 1.87330 -3.576 0.000383 ***
## genreOther 2.03339 2.92730 0.695 0.487605
## genreScience Fiction & Fantasy -1.75166 3.79558 -0.461 0.644638
## thtr_rel_year -0.09237 0.03883 -2.379 0.017747 *
## imdb_rating 15.10031 0.61735 24.460 < 2e-16 ***
## critics_score 0.05334 0.02292 2.327 0.020365 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.417 on 506 degrees of freedom
## Multiple R-squared: 0.7838, Adjusted R-squared: 0.7782
## F-statistic: 141.1 on 13 and 506 DF, p-value: < 2.2e-16
Intercept: When all numeric variables being set to 0, and all categorical variables assigned to reference level (i.e. x=0), the expected audience score would be 146. Of course that does not make any sense, the purpose of intercept here only serves to adjust the height of the line.
Genre slope coefficient: Most popular movies type seems to be Animation, which depicts that, with all other variables being held unchanged, a movie type of Animation would expected to have the highest audience score.
Release year slope coefficient: Release year seems to have a very weak positive influence on audience score according to our model. With all other variables being held constant, an increase in one year would expect to decrease the audience score of the movie by 0.09.
IMDB rating slope coefficient: imdb rating seems to have a very strong positive influence on audience score according to our model. An increase of 1 imdb rating is expected to increase 15 audience score if every other variables in our model held constant.
Critics score slope coefficient: critics score seems to have a very weak positive rating on audience score according to our model. An increase of 1 critics score is expect to lower the audience score by 0.05 if every other variables in the model held constant.
In this part, we are going to use our model to predict the audience score of all 130 movies in our test set.
# Let's elect the useful variables only.
new_test_df <- test_df %>% select(genre, thtr_rel_year, imdb_rating,
critics_score)
# Predict the audience score with 95% of confidence level.
predictions <- predict(lm_final, new_test_df,
interval ="prediction",
level = 0.95)
# Display the predict audience score interval along with the actual audience score of each movie.
output <- cbind(head(select(test_df,title,audience_score)),
head(predictions))
output## title audience_score fit lwr upr
## 1 Filly Brown 73 44.90627 26.3047316 63.50780
## 2 Waiting for Guffman 91 81.70006 62.9829197 100.41720
## 3 Beauty Is Embarrassing 89 80.08613 61.3332531 98.83901
## 4 Rhinestone 36 18.35371 -0.4246826 37.13210
## 5 The Wood 92 68.19324 49.6499822 86.73649
## 6 The Gleaners and I 86 82.71889 63.9772249 101.46056
From our prediction, we are 95% confident that:
The movie with title Filly Brown is expected to have an audience score between 26.3 to 63.5. The actual audience score is 73, our prediction does not contain the true value.
The movie with title Waiting for Guffman is expected to have an audience score between 70.0 to 100.0. The actual audience score is 91, our prediction contain the true value.
The movie with title Beauty Is Embarrassing is expected to have an audience score between 61.3 to 98.9. The actual audience score is 89, our prediction contain the true value.
The movie with title Rhinestone is expected to have an audience score between -0.4 to 37.1. The actual audience score is 36, our prediction contain the true value.
The movie with title The Wood is expected to have an audience score between 49.6 to 86.7. The actual audience score is 36, our prediction does not contain the true value.
The movie with title The Gleaners and I is expected to have an audience score between 49.6 to 86.7. The actual audience score is 86, our prediction contain the true value.
# Let's also work out the success rate of our model (i.e. successfully predict the audience score within a 95% confidence interval range)
output <- output %>% mutate(Success = (audience_score<=upr & audience_score>=lwr))
prop.table(table(output$Success))["TRUE"]## TRUE
## 0.6666667
For all the 130 movie in the test set, our model successfully predict 67% of the movies’ audience score within a 95% confidence level prediction interval.
The R-squared and adjusted R-squared are both less than 0.8, the predictive power of the model still have some room for improvements. In the beginning stage of the study, I intentionally left out some variables which I think would not be of much use to my model, however, it is completely based on my personal speculation, maybe including those variables would yield a higher adjusted-r-squared.
As I have mentioned in the early part of my study, sex could play an important role on the movies’ rating on online movie database, I suggest to improve the model by also creating a variable to account for proportion of male and female voters for that particular movie.
To conclude, the model we obtained state that we could predict a movie’s popularity (e.g. audience score) by using 4 predictors (e.g. genre, thtr_rel_year, imdb_rating, critics_score). And our model successfully predict around 67% of the movies’ popularity in the test set.
One last thing to note is, the model only apply to movies that is released before 2016, as the movies from the training set are all released before 2016.