library(ggplot2)
library(dplyr)
library(statsr)
load("movies.Rdata")
John Eugene Driscoll | Module 3 - Data Analysis Project | August 2017 Submission
The purpose of this document is to complete the data analysis project required in week 4 of the Linear Regression and Modeling course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/linear-regression-model/peer/gIqqL/data-analysis-project
Investigate which parameters are major influences on the audience score (audience_score). Parameters are pulled from the movies data set.
If a meaningful prediction can be drawn from this exercise, I would be interested in applying the same concept to video game ratings, a form of media I am passionate about.
Scope of Inference
For the purposes of inference, this should be considered an observational study that uses random sampling to obtain a representative sample of U.S. movies released between 1974 and 2016. Since random sampling was applied in data collection, the results can be generalized to the movies released between 1974 and 2016.
Causation can only be inferred from a randomized experiment. This study does not meet the requirements of a randomized experiment, therefore causation cannot be determined.
Sources of Bias
As the Rotten Tomatoes audience score is created by volunteers, the study may suffer from voluntary response bias, since people with strong opinions are more likely to participate. The voluntary participants may not be representative of the U.S. population.
The following features will be included in the first iteration of the multiple linear regression model.
audience_score,genre,thtr_rel_month,imdb_rating,critics_score,best_pic_nom,best_actor_win,best_actress_win,top200_box
Data Pair Plots
The EDA section will begin with a plot of paired variables to examine the relationships, if any, that exist between the selected variables. This method will be used as a quick “eye ball” test ahead of the more in-depth model analysis below. Only numeric categories can be used for this test, so title, genre, actor 1 and actor 2 have been held back from it.
Genre will be included in the first iteration of the multiple linear regression model; actor 1 and actor 2 are not among the features selected for the working set.
The response variable will be audience_score.
The linear relationships between critics_score and audience_score, and between imdb_rating and audience_score, lead the analyst to believe these two variables will have the highest impact on audience score in the linear models created in this analysis.
workingset = select(movies, audience_score,genre,thtr_rel_month,imdb_rating,critics_score,best_pic_nom,best_actor_win,best_actress_win,top200_box)
workingset = na.exclude(workingset)
pairs(~audience_score+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win+top200_box,
data=workingset,
main="Pair Assessment")
In this project we will use linear regression, starting with a model fit on the 9 selected variables described in the previous section (the response, audience_score, plus 8 explanatory variables). Backward elimination will help determine whether comparable or better results can be obtained from a smaller set of attributes. Its advantage is that it starts with all the variables and deletes one at a time until removing another no longer improves the model.
First, let’s fit an initial model with the 9 variables. The adjusted R-squared is 76.68%.
model<-lm(audience_score~genre+thtr_rel_month+imdb_rating+critics_score+best_pic_nom+best_actor_win+best_actress_win+top200_box,data=workingset)
summary(model)
##
## Call:
## lm(formula = audience_score ~ genre + thtr_rel_month + imdb_rating +
## critics_score + best_pic_nom + best_actor_win + best_actress_win +
## top200_box, data = workingset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.100 -6.198 0.343 5.584 50.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.48683 3.17716 -11.169 < 2e-16 ***
## genreAnimation 9.72637 3.49086 2.786 0.00549 **
## genreArt House & International 0.22529 2.90339 0.078 0.93818
## genreComedy 2.44484 1.62145 1.508 0.13210
## genreDocumentary 1.43704 1.99090 0.722 0.47068
## genreDrama 0.27363 1.40151 0.195 0.84527
## genreHorror -5.06609 2.38562 -2.124 0.03409 *
## genreMusical & Performing Arts 4.89165 3.13877 1.558 0.11962
## genreMystery & Suspense -5.67166 1.80100 -3.149 0.00171 **
## genreOther 1.39281 2.76225 0.504 0.61428
## genreScience Fiction & Fantasy -0.54853 3.48758 -0.157 0.87507
## thtr_rel_month -0.21644 0.11053 -1.958 0.05063 .
## imdb_rating 14.75962 0.57625 25.613 < 2e-16 ***
## critics_score 0.06241 0.02139 2.918 0.00365 **
## best_pic_nomyes 4.71253 2.28180 2.065 0.03930 *
## best_actor_winyes -1.52565 1.13692 -1.342 0.18010
## best_actress_winyes -2.29217 1.27634 -1.796 0.07299 .
## top200_boxyes 2.23707 2.63012 0.851 0.39534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.765 on 633 degrees of freedom
## Multiple R-squared: 0.7729, Adjusted R-squared: 0.7668
## F-statistic: 126.8 on 17 and 633 DF, p-value: < 2.2e-16
We will use the step function with direction = "backward" to remove model features with low predictive value, to see whether we can arrive at a model that is simpler (in terms of number of features) and just as effective (the same adjusted R-squared or better).
SimpleModel<-step(model, direction = "backward", trace=FALSE )
summary(SimpleModel)
##
## Call:
## lm(formula = audience_score ~ genre + thtr_rel_month + imdb_rating +
## critics_score + best_pic_nom + best_actress_win, data = workingset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.866 -6.371 0.254 5.717 50.253
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -35.53970 3.17794 -11.183 < 2e-16 ***
## genreAnimation 9.54767 3.48475 2.740 0.006319 **
## genreArt House & International 0.20499 2.89196 0.071 0.943514
## genreComedy 2.33460 1.61205 1.448 0.148049
## genreDocumentary 1.30592 1.96719 0.664 0.507026
## genreDrama 0.03110 1.38463 0.022 0.982086
## genreHorror -5.07665 2.37557 -2.137 0.032978 *
## genreMusical & Performing Arts 4.70946 3.12658 1.506 0.132496
## genreMystery & Suspense -6.05681 1.78267 -3.398 0.000722 ***
## genreOther 1.24672 2.76104 0.452 0.651755
## genreScience Fiction & Fantasy -0.31778 3.48571 -0.091 0.927389
## thtr_rel_month -0.22019 0.11006 -2.001 0.045867 *
## imdb_rating 14.76556 0.57585 25.641 < 2e-16 ***
## critics_score 0.06363 0.02136 2.979 0.003006 **
## best_pic_nomyes 4.42753 2.26734 1.953 0.051289 .
## best_actress_winyes -2.35008 1.26980 -1.851 0.064671 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.768 on 635 degrees of freedom
## Multiple R-squared: 0.7721, Adjusted R-squared: 0.7667
## F-statistic: 143.4 on 15 and 635 DF, p-value: < 2.2e-16
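As a rough illustration of what step() does with direction = "backward": by default it compares candidate models by AIC and drops a term only when doing so lowers the AIC. The sketch below uses R's built-in mtcars data as a stand-in, which is an assumption made only so the example runs without movies.Rdata. The names mpg, wt, hp, qsec and drat belong to mtcars, not to the movies data set.

```r
# Stand-in example: mtcars ships with base R; it is NOT the movies data.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Backward elimination: start from the full model and drop one term at a
# time for as long as the AIC (step()'s default criterion) keeps decreasing.
reduced <- step(full, direction = "backward", trace = FALSE)

# Which terms survived, and which were dropped?
full_terms    <- attr(terms(full), "term.labels")
reduced_terms <- attr(terms(reduced), "term.labels")
dropped_terms <- setdiff(full_terms, reduced_terms)

# Compare the two fits on size, AIC and adjusted R-squared.
comparison <- data.frame(
  model   = c("full", "reduced"),
  n_terms = c(length(full_terms), length(reduced_terms)),
  aic     = c(AIC(full), AIC(reduced)),
  adj_r2  = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
)

print(dropped_terms)
print(comparison)
```

The same comparison table could be built for model and SimpleModel in this report by substituting those two objects for full and reduced.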
The SimpleModel has only 7 variables (the response plus 6 explanatory variables) and a nearly unchanged adjusted R-squared of 0.7667 (versus 0.7668), using 2 fewer variables than the full model: best_actor_win and top200_box were dropped. The imdb_rating, genre and critics_score variables are the most significant. The Mystery & Suspense genre has a strong negative relationship with the response variable, while the Animation genre has a strong positive relationship, each measured relative to the baseline genre that is omitted from the coefficient table.
While not significantly different from our first model, we will use the SimpleModel in adherence with Occam’s Razor.
The large Min residual (about -26.9) indicates this model may not be effective when dealing with films with a low audience_score.
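The adjusted R-squared values reported for the two models can be reproduced by hand from the multiple R-squared and the residual degrees of freedom in the two summary outputs. A small sketch follows; the helper name adj_r2 is made up for this illustration.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (residual degrees of freedom)
# adj_r2 is a hypothetical helper name, not part of any package.
adj_r2 <- function(r2, n, resid_df) 1 - (1 - r2) * (n - 1) / resid_df

# n = residual df + number of estimated coefficients:
# 633 + 18 for the full model, 635 + 16 for the SimpleModel, both 651.
n <- 651

# Multiple R-squared and residual df as printed by summary().
adj_full   <- adj_r2(r2 = 0.7729, n = n, resid_df = 633)
adj_simple <- adj_r2(r2 = 0.7721, n = n, resid_df = 635)

print(round(c(full = adj_full, simple = adj_simple), 4))
# full 0.7668, simple 0.7667
```

Because the penalty grows with the number of coefficients, dropping two weak predictors costs almost nothing in adjusted R-squared even though the multiple R-squared falls slightly.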
Check Conditions
Check for Multi-collinearity
Per the pairwise plot below, none of the included explanatory variables appear to share the same or a similar relationship with one another, so multicollinearity should not be an issue.
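A numeric check can complement the visual one. One common diagnostic is the variance inflation factor (VIF): regress each numeric predictor on the others and compute 1 / (1 - R^2). The sketch below does this by hand on the built-in mtcars data as a stand-in. The helper vif_manual is a hypothetical name, and this simple version handles numeric predictors only (a factor such as genre would need a generalized VIF).

```r
# vif_manual is a hypothetical helper written for this illustration.
# For each predictor, fit predictor ~ all other predictors and turn the
# resulting R-squared into a variance inflation factor.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux    <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Stand-in data: mtcars ships with base R; it is NOT the movies data.
vifs <- vif_manual(mtcars, c("wt", "hp", "qsec"))
print(round(vifs, 2))

# A VIF of 1 means no linear overlap with the other predictors; larger
# values mean the coefficient's variance is inflated by collinearity.
```

With the working set loaded, the equivalent call would be vif_manual(workingset, c("imdb_rating", "critics_score", "thtr_rel_month")).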
Linear relationship between explanatory and response variables
The strongest linear relationships exist between critics_score and audience_score, and between imdb_rating and audience_score.
m<-lm(formula = audience_score ~ genre + thtr_rel_month + imdb_rating +
critics_score + best_pic_nom + best_actress_win, data = workingset)
pairs(~best_actress_win+best_pic_nom+thtr_rel_month+genre+critics_score+imdb_rating+audience_score,
data=workingset)
Nearly Normal Residuals with mean 0
par(mfrow=c(1,2))
hist(SimpleModel$residuals, main='Histogram of Residuals')
qqnorm(SimpleModel$residuals,main='Normal Probability Plot of Residuals')
qqline(SimpleModel$residuals)
The histogram of residuals above shows a roughly normal shape with a right skew: the largest positive residual (about 50) is nearly twice the magnitude of the largest negative one (about -27). In the Q-Q plot, most of the deviation from the reference line occurs at the tails of the distribution. The residuals are centered near zero and follow the line closely away from the tails, so we will consider this condition met.
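The visual impression can be backed by a few numbers: the mean of the residuals, their sample skewness, and a Shapiro-Wilk test. This is a minimal sketch on a stand-in model fit to the built-in mtcars data (an assumption made so the example runs on its own); the same lines would apply to residuals(SimpleModel).

```r
# Stand-in model on built-in data; it is NOT the movies data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# With an intercept in the model, least-squares residuals average to zero
# by construction, so "mean 0" holds up to floating-point error.
mean_res <- mean(res)

# Sample skewness: positive values indicate a longer right tail.
centered <- res - mean_res
skewness <- mean(centered^3) / mean(centered^2)^(3 / 2)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(res)

print(c(mean = mean_res, skewness = skewness, shapiro_p = sw$p.value))
```

With several hundred observations, even mild departures from normality can yield a small p-value, so the test is best read alongside the histogram and Q-Q plot rather than instead of them.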
Constant Variability of Residuals
The chart below shows roughly constant variability of the residuals.
par(mfrow=c(1,2))
plot(SimpleModel$residuals~SimpleModel$fitted,main='Residuals vs. Predicted (fitted) ')
Independent Residuals
The residuals in the chart below seem generally homoscedastic. However, there is some heteroscedasticity at the left end of the plot, so the model will be less accurate when predicting lower values.
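One simple way to put a number on the spread is to compare the standard deviation of the residuals for low fitted values against that for high fitted values. The sketch below is an informal check rather than a formal test, and it uses a stand-in model on the built-in mtcars data (an assumption); fitted(SimpleModel) and residuals(SimpleModel) could be substituted directly.

```r
# Stand-in model on built-in data; it is NOT the movies data.
fit         <- lm(mpg ~ wt + hp, data = mtcars)
res         <- residuals(fit)
fitted_vals <- fitted(fit)

# Split the observations at the median fitted value.
low_half <- fitted_vals <= median(fitted_vals)

# Residual spread in each half of the fitted range.
spread <- c(
  low_fitted  = sd(res[low_half]),
  high_fitted = sd(res[!low_half])
)
print(round(spread, 3))

# A ratio near 1 suggests similar spread across the fitted range; a ratio
# well above 1 means the residuals fan out at the low end.
spread_ratio <- unname(spread["low_fitted"] / spread["high_fitted"])
print(round(spread_ratio, 3))
```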
plot(SimpleModel$residuals~SimpleModel$fitted,main="Residuals vs. fitted")
abline(0,0)
We want to predict the audience score for a new movie that was not used to fit the model: “Kung Fu Panda 3.” The data below, obtained from IMDb and Rotten Tomatoes, supply each value required to populate the SimpleModel.
Impressively, the model produced a range of possible values that captured the actual audience score! The actual audience score for Kung Fu Panda 3 per Rotten Tomatoes was 79, and the model predicted 83 (83.29, rounded down to the nearest whole number). The actual score falls within the lower and upper bounds of the SimpleModel's 95% prediction interval (about 62.9 to 103.7), and the point prediction is close to the actual score.
KFP3<-data.frame(genre="Animation", thtr_rel_month=1,imdb_rating=7.2,critics_score=87,best_pic_nom="no",best_actress_win="yes")
predict(SimpleModel,KFP3)
## 1
## 83.28528
predict(SimpleModel, KFP3, interval="predict")
## fit lwr upr
## 1 83.28528 62.86262 103.7079
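The point prediction can be reproduced by hand from the SimpleModel coefficient table, which makes it easier to see how each input contributes. The sketch below copies the rounded coefficients from the summary output, so the result matches predict() only to about three decimal places. Capping the upper bound at 100 is an added assumption, based on the audience score being a percentage.

```r
# Coefficients copied (rounded) from the summary(SimpleModel) output.
# Only the genre level that applies to this film (Animation) is needed.
coefs <- c(
  "(Intercept)"         = -35.53970,
  "genreAnimation"      =   9.54767,
  "thtr_rel_month"      =  -0.22019,
  "imdb_rating"         =  14.76556,
  "critics_score"       =   0.06363,
  "best_pic_nomyes"     =   4.42753,
  "best_actress_winyes" =  -2.35008
)

# Kung Fu Panda 3 inputs in the same order: 1 for the intercept, 1 for the
# Animation dummy, month 1, IMDb 7.2, critics 87, no nomination, actress win.
x <- c(1, 1, 1, 7.2, 87, 0, 1)

# Linear predictor: sum of coefficient * input.
manual_fit <- sum(coefs * x)
print(round(manual_fit, 2))
# 83.29

# Bounds as printed by predict(..., interval = "predict").
lwr <- 62.86262
upr <- 103.7079
half_width <- (upr - lwr) / 2   # the interval is symmetric around the fit

# Assumption: the score is a percentage, so values above 100 are not
# attainable and the usable upper bound is capped.
upr_capped <- min(upr, 100)
print(c(half_width = half_width, upr_capped = upr_capped))
```

The half-width of roughly 20 points shows that, although the point prediction landed close to the actual score of 79, the model leaves considerable uncertainty around any single film.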
The initial research objective was to investigate which parameters are major influences on the audience score (audience_score). Using the SimpleModel, generated with the approach presented above, we identified a 7-feature model with statistically significant components that was able to predict accurately the audience score of a film from 2016 that was not included in the movies data set initially presented.
Potentially, by using the whole data set or by introducing external factors such as net sales adjusted for inflation or social media data, we could have come up with an even more effective model, but that can be left to the sequel to this blockbuster!