library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)load("movies.Rdata")This dataset was gathered from a random sampling of movies from IMDB and Rotten Tomatoes that have release years between 1970 and 2016.
Since this was a random sample, as opposed to an experiment, only associations can be suggested, no causal relation can be confirmed from this data.
Which variables are most significant in determining a movie’s popularity?
This question is of interest because it can highlight the top areas studios can focus on to produce a popular movie, or it could highlight a trend in historical record (i.e. if some years are highly correlated with success, it could show that time can play a role in popularity).
Step 1 - prepare the dataset
First, let’s get rid of the variables we know will have no bearing on the model. These include things like URLs and titles - variables that are unique to each observation, and thus will not generalize well.
df <- movies[,2:30]Next, let’s get rid of the actor variables. These don’t provide much benefit in their current state. The ordering is fairly arbitrary, and would not generalize to a model very well. With other variables on the actors we could possibly create new aggregate variables. But given the scope of this dataset, it is best to remove them.
df <- df[,1:24]Now, let’s get rid of the ‘Day’ variables, as the day of month likely has little bearing on a movie’s popularity.
df <- df[,-c(8,11)]To return an intercept of value, let’s calibrate the year variables to begin at 1970 (the minimum year in the observed data).
df$thtr_rel_year <- df$thtr_rel_year-1970
df$dvd_rel_year <- df$dvd_rel_year-1970Step 2 - Build response variable
First,calibrate the IMDB Rating to be on the same scale as the Audience Score (Rotten Tomatoes). Then, multiply the two values together. This combination will be the response variable and will replace the two different ratings.
df$imdb_rating <- df$imdb_rating*10
df$responseVar <- df$imdb_rating*df$audience_score
df <- df[,-c(10,15)]Since the categorical variable audience_rating is a byproduct of the audience_score, we can’t (in good conscience) keep it in the model.
df <- df[,-13]Since the values in imdb_num_votes were captured as of a particular point in time, we cannot use this variable in prediction, given the prediction process defined in the assignment. The values currently would be much higher. There is no means on IMDB to capture the number of votes at a given point in time, so this is not a useful variable.
df <- df[,-10]Now, we’re finally ready to look at some data.
Step 3 - Plotting to find collinearity
Check for collinearity between the numeric variables prior to building the model.
ggpairs(df[,c(20,3,6,7,8,9,10,12)])The only pair of variables that seem highly correlated are the theatrical release year (thtr_rel_year) and the DVD release year (dvd_rel_year). Since the theatrical release year is more complete (DVD’s didn’t come out until the 1990’s), we will keep theatrical release and remove DVD release year.
df <- df[,-8]It is also worth noting that the highest correlating variable to the response variable is the critic’s score. We will see if that ends up being the most significant in the model.
I will approach the model build using backward elimination until I get the highest Adjusted R-Squared. I will not overload you with the output of fit summary, since there are so many directors and studios (many of which are statistically significant).
The output of the following code block will be the Adjusted R-Squared using all variables.
fit <- lm(data=df,responseVar~.)
fitSum <- summary(fit)
fitSum$adj.r.squared## [1] 0.6614396
The pruning was done based on the highest P-Value, and I continued pruning until the Adjusted R-Squared no longer improved.
Pruning occurred by removing the following variables in order based on their P-Values…
thtr_rel_year
thtr_rel_month
best_actress_win
best_actor_win
best_pic_win
critics_rating
runtime
This left us with the following model and resulting R-Squared…
fit <- lm(data=df,responseVar~top200_box + dvd_rel_month +
critics_score + title_type +
genre + mpaa_rating +
best_dir_win + best_pic_nom +
studio + director)
fitSum <- summary(fit)
fitSum$adj.r.squared## [1] 0.7301131
The next variable with the highest P-Value (dvd_rel_month) caused the Adjusted R-Squared to decrease, so based on the Backward Elimination model pruning method, this is the best model.
Per assignment instructions, pick a movie not included in the original dataset, enter its details into a data frame with the model variables, and see how the model performs.
movies$title[grep("Interstellar",movies$title)]## character(0)
This movie was not in the original dataset, so we will go with that.
The output from the following code chunk will display the fitted value (prediction), as well as the confidence interval for the prediction.
test <- data.frame(top200_box = "no",
dvd_rel_month = 3,
title_type = "Feature Film",
genre = "Science Fiction & Fantasy",
mpaa_rating = "PG-13",
studio = "Paramount Pictures",
critics_score = 71,
best_pic_nom = "no",
best_dir_win = "no",
director = "Christopher Nolan")
pred <- predict(fit,test,interval="predict")
pred## fit lwr upr
## 1 8944.614 3678.816 14210.41
The ratings for Interstellar are as follows…
IMDB - 8.6
Rotten Tomatoes - 85
Following our method of creating a response variable, the value would be: (8.6x10)x85 = 7310.
The predicted value is 8945. The actual value (7310) is within the 95% confidence interval of 3679 to 14210.
This means we are 95% confident the true rating is within this interval, and as it turns out, it was.
References for populating test criteria (2/13/2018)…
https://www.rottentomatoes.com/m/interstellar_2014/
http://www.imdb.com/title/tt0816692/?ref_=nv_sr_1
http://www.boxofficemojo.com/alltime/domestic.htm?page=1&p=.htm
Conclusion not repetitive of earlier statements:
More than half of the original variables have been kept out of the model. We are left with the most parsimonious model given the process (backward elimination) and the tools allowed in this assignment.
Cohesive synthesis of findings
Within the scope of the data and this assignment, the findings are that the following variables are most valuable to the model…
top200_box
dvd_rel_month
critics_score
title_type
genre
mpaa_rating
best_dir_win
best_pic_nom
studio
director
Discussion of shortcomings
This project has the major shortcoming of testing with data outside of the original dataset. We are asked to grab information for a movie that is ‘as of today’, but run it through a model that is determined with data as of two years ago. Box Office numbers change, number of votes and ratings change. It would have been much more appropriate to set aside a portion of the original data for testing, then calculate accuracy based on the test data. As it stands, we are asking the model to extrapolate, which typically doesn’t perform well.