library(ggplot2)
library(dplyr)
library(statsr)
## Warning: package 'dplyr' was built under R version 3.3.2
getwd()
## [1] "C:/Albert/Coursera Statitics with R/Linear Regression and Modeling/Wk4-"
load("eaca_movies.Rdata")
msmp = select(movies, critics_rating, imdb_rating,runtime,title, critics_score,genre,title_type,imdb_num_votes,audience_rating,audience_score,mpaa_rating)
msmp = na.exclude(msmp)
The Internet Movie Database (abbreviated IMDb) is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews, operated by IMDb.com, Inc., a subsidiary of Amazon.com.
Actors and crew can post their own résumé and upload photos of themselves for a yearly fee. U.S. users can view over 6,000 movies and television shows from CBS, Sony, and various independent filmmakers.
As of September 2016, IMDb has approximately 3.9 million titles (including episodes) and 7.4 million personalities in its database,[2] as well as 67 million registered users.[1]
Rotten Tomatoes is a website launched in August 1998 devoted to film reviews, news and details; it is widely known as a film review aggregator. Coverage now includes TV content as well. The name derives from the practice of audiences throwing rotten tomatoes when disapproving of a poor stage performance. The company was created by Senh Duong and since January 2010 has been owned by Flixster, which itself was acquired in 2011 by Warner Bros.[3]
Generalization is allowable since the data represent 456 randomly sampled movies released between 1972 and 2014 in the United States. The data are taken from both the IMDb and Rotten Tomatoes databases. However, since the study does not use random assignment, causality cannot be inferred. The sample may also be biased, since it is limited to entries contributed by members.
Investigate which variables most strongly influence the audience score (audience_score). The audience score is, of course, the acid test for all movies. It will ultimately determine the box office purse for the selected movie.
The EDA section begins with a plot of paired variables to examine what relationship, if any, exists between the selected variables. This plot will be used to select the variables used in the modeling part of the project. The response variable will be audience_score, since the audience evaluation is the most likely to reflect a movie's popularity.
pairs(~audience_score+critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,
data=msmp,
main="Simple Scatterplot Matrix")
The linear model summary and ANOVA table are provided below.
m<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,data=msmp)
summary(m)
##
## Call:
## lm(formula = audience_score ~ critics_score + mpaa_rating + runtime +
## genre + critics_rating + imdb_rating + audience_rating +
## imdb_num_votes + title_type, data = msmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1421 -4.4953 0.4327 4.2316 24.5563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.525e+00 4.426e+00 -2.152 0.0318 *
## critics_score 9.521e-03 2.522e-02 0.377 0.7059
## mpaa_ratingNC-17 -5.586e-01 5.216e+00 -0.107 0.9148
## mpaa_ratingPG -8.903e-02 1.899e+00 -0.047 0.9626
## mpaa_ratingPG-13 -1.002e+00 1.959e+00 -0.511 0.6092
## mpaa_ratingR -1.176e+00 1.886e+00 -0.623 0.5332
## mpaa_ratingUnrated -2.789e-01 2.166e+00 -0.129 0.8976
## runtime -2.613e-02 1.616e-02 -1.617 0.1063
## genreAnimation 2.431e+00 2.697e+00 0.902 0.3677
## genreArt House & International -2.422e+00 2.104e+00 -1.151 0.2501
## genreComedy 1.554e+00 1.148e+00 1.354 0.1761
## genreDocumentary 2.525e+00 2.759e+00 0.915 0.3605
## genreDrama -4.738e-01 1.009e+00 -0.469 0.6389
## genreHorror -1.614e+00 1.723e+00 -0.937 0.3493
## genreMusical & Performing Arts 3.757e+00 2.368e+00 1.587 0.1131
## genreMystery & Suspense -2.903e+00 1.288e+00 -2.253 0.0246 *
## genreOther 8.901e-02 1.957e+00 0.045 0.9637
## genreScience Fiction & Fantasy -2.839e-01 2.460e+00 -0.115 0.9081
## critics_ratingFresh 9.706e-02 8.602e-01 0.113 0.9102
## critics_ratingRotten -7.752e-01 1.399e+00 -0.554 0.5798
## imdb_rating 9.439e+00 4.878e-01 19.349 <2e-16 ***
## audience_ratingUpright 2.008e+01 7.892e-01 25.441 <2e-16 ***
## imdb_num_votes 3.589e-06 3.069e-06 1.169 0.2427
## title_typeFeature Film 2.512e+00 2.575e+00 0.976 0.3296
## title_typeTV Movie 7.692e-01 4.039e+00 0.190 0.8490
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.871 on 625 degrees of freedom
## Multiple R-squared: 0.889, Adjusted R-squared: 0.8847
## F-statistic: 208.5 on 24 and 625 DF, p-value: < 2.2e-16
anova(m)
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## critics_score 1 131758 131758 2791.0826 < 2.2e-16 ***
## mpaa_rating 5 877 175 3.7150 0.002541 **
## runtime 1 1291 1291 27.3549 2.313e-07 ***
## genre 10 8096 810 17.1511 < 2.2e-16 ***
## critics_rating 2 1714 857 18.1517 2.174e-08 ***
## imdb_rating 1 61850 61850 1310.1895 < 2.2e-16 ***
## audience_rating 1 30510 30510 646.3062 < 2.2e-16 ***
## imdb_num_votes 1 69 69 1.4713 0.225600
## title_type 2 58 29 0.6165 0.540164
## Residuals 625 29504 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The linear model and ANOVA summaries show an adjusted \(R^2\) of 88.47%. This means that only about 11.53% of the variability in audience_score is left in the residuals, while the remainder is accounted for by the predictor variables. A value this high should be treated with suspicion, since it may reflect an error that would undermine the reliability of the study.
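The adjusted \(R^2\) reported by summary() can be checked by hand from the multiple \(R^2\), the sample size and the number of estimated slopes. The sketch below is only an arithmetic check using the rounded values printed in the full-model output above; the helper name adj_r2 is introduced here for illustration.

```r
# Adjusted R^2 from R^2, sample size n, and number of estimated slopes p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Full model: R^2 = 0.889 with 24 slopes and 625 residual degrees of freedom,
# so n = 625 + 24 + 1 = 650 observations.
full <- adj_r2(r2 = 0.889, n = 650, p = 24)
round(full, 4)   # 0.8847
```

The penalty factor (n - 1) / (n - p - 1) grows with the number of slopes, which is why removing a weak predictor can raise the adjusted value even though the multiple \(R^2\) falls slightly.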
Multicollinearity must be evaluated to remove any interdependencies among the predictors. This will be left for the model building section.
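One common way to quantify interdependence among predictors is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the remaining predictors. The sketch below uses simulated data with hypothetical variables x1, x2 and x3, not the movie data, and only numeric predictors. For factor terms such as genre, a generalized VIF (for example the one reported by vif() in the car package) would be needed instead.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
x3 <- rnorm(n)                  # generated independently of x1 and x2
d  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j): regress each predictor on all the others.
vif <- sapply(names(d), function(v) {
  f  <- reformulate(setdiff(names(d), v), response = v)
  r2 <- summary(lm(f, data = d))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

In this toy setup x1 and x2 get large values while x3 stays close to 1. A frequently quoted rule of thumb treats values above 5 or 10 as a sign that a predictor is largely redundant.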
A backward elimination method will be used to build the stepwise model. The selection criterion will be the highest p-value. Beginning with the full model identified in the EDA section, one predictor at a time will be eliminated until a parsimonious model is arrived at.
m<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,data=msmp)
x<-summary(m)
y<-data.frame(x$coefficients[,4])
colnames(y)<-c('Pr(>|t|)')
The following iterations were made, and the adjusted \(R^2\) (ARS) was recorded after each:
* Base - ARS = 88.47
* Removal of critics_score - ARS = 88.49
* Removal of mpaa_rating - ARS = 88.53
* Removal of title_type - ARS = 88.55
* Removal of imdb_num_votes - ARS = 88.55
At this point Adjusted \(R^2\) starts to diminish.
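The iteration above can be automated. The following is a minimal sketch of backward elimination that stops as soon as no single removal raises the adjusted \(R^2\). It is an illustration on the built-in mtcars data, not the procedure actually run for this report: the function name backward_adj_r2 is hypothetical, and it selects on adjusted \(R^2\) rather than on the highest p-value.

```r
# Backward elimination driven by adjusted R^2 (illustrative helper).
backward_adj_r2 <- function(response, predictors, data) {
  adj <- function(vars) {
    rhs <- if (length(vars) > 0) vars else "1"   # intercept-only fallback
    summary(lm(reformulate(rhs, response = response), data = data))$adj.r.squared
  }
  current <- predictors
  best <- adj(current)
  repeat {
    if (length(current) == 0) break
    # Adjusted R^2 obtained after dropping each remaining predictor in turn.
    candidates <- sapply(current, function(v) adj(setdiff(current, v)))
    if (max(candidates) <= best) break           # no removal helps: stop
    best <- max(candidates)
    current <- setdiff(current, names(which.max(candidates)))
  }
  list(predictors = current, adj_r2 = best)
}

fit <- backward_adj_r2("mpg", c("wt", "hp", "qsec", "drat", "cyl"), mtcars)
fit$predictors
fit$adj_r2
```

Because whole terms are dropped by name, a factor such as genre would be removed with all of its levels at once, which matches how the predictors are removed in the list above.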
m<-lm(audience_score~runtime+genre+critics_rating+imdb_rating+audience_rating,data=msmp)
x<-summary(m)
y<-data.frame(x$coefficients[,4])
colnames(y)<-c('Pr(>|t|)')
x
##
## Call:
## lm(formula = audience_score ~ runtime + genre + critics_rating +
## imdb_rating + audience_rating, data = msmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4194 -4.7080 0.6925 4.3510 24.8869
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.20919 2.97135 -2.763 0.00590 **
## runtime -0.02181 0.01530 -1.426 0.15437
## genreAnimation 3.06600 2.46274 1.245 0.21361
## genreArt House & International -2.93234 2.02995 -1.445 0.14908
## genreComedy 1.30392 1.12904 1.155 0.24857
## genreDocumentary 0.04005 1.39784 0.029 0.97715
## genreDrama -0.90540 0.96471 -0.939 0.34833
## genreHorror -2.00209 1.67640 -1.194 0.23282
## genreMusical & Performing Arts 2.50272 2.18911 1.143 0.25336
## genreMystery & Suspense -3.32832 1.25185 -2.659 0.00804 **
## genreOther 0.10508 1.93104 0.054 0.95662
## genreScience Fiction & Fantasy -0.20922 2.44861 -0.085 0.93193
## critics_ratingFresh -0.23492 0.78294 -0.300 0.76424
## critics_ratingRotten -1.36926 0.89623 -1.528 0.12706
## imdb_rating 9.64472 0.41387 23.304 < 2e-16 ***
## audience_ratingUpright 20.05979 0.78471 25.563 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.848 on 634 degrees of freedom
## Multiple R-squared: 0.8881, Adjusted R-squared: 0.8855
## F-statistic: 335.5 on 15 and 634 DF, p-value: < 2.2e-16
anova(m)
## Analysis of Variance Table
##
## Response: audience_score
## Df Sum Sq Mean Sq F value Pr(>F)
## runtime 1 8702 8702 185.59 < 2.2e-16 ***
## genre 10 49167 4917 104.86 < 2.2e-16 ***
## critics_rating 2 62427 31214 665.69 < 2.2e-16 ***
## imdb_rating 1 85062 85062 1814.12 < 2.2e-16 ***
## audience_rating 1 30641 30641 653.48 < 2.2e-16 ***
## Residuals 634 29728 47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
m<-lm(audience_score~runtime+genre+critics_rating+imdb_rating+audience_rating,data=msmp)
pairs(~runtime+genre+critics_rating+imdb_rating+audience_rating,
data=msmp)
plot(m$residuals~msmp$genre, main='Residuals vs. Genre')
par(mfrow=c(1,2))
hist(m$residuals, main='Histogram of Residuals')
qqnorm(m$residuals,main='Normal Probability Plot of Residuals')
qqline(m$residuals)
par(mfrow=c(1,1))
par(mfrow=c(1,2))
plot(m$residuals~m$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(m$residuals)~m$fitted,main='Absolute Residuals vs. Predicted')
#plot(m$residuals~msmp$imdb_rating)
par(mfrow=c(1,1))
plot(m$residuals, main='Residuals vs Collection Index')
Using the MLR model developed above, build a prediction algorithm. The response variable, audience_score, is computed from a formula based on the five explanatory variables and the intercept.
In the sample table, predicted score is given along with the actual audience score and the difference between the two (DIFF).
## [1] "TITLE: Burn After Reading AUDIENCE SCORE: 64 PREDICTION: 76 DIFF: -12"
## [1] "TITLE: Max AUDIENCE SCORE: 64 PREDICTION: 72 DIFF: -8"
## [1] "TITLE: Basic AUDIENCE SCORE: 64 PREDICTION: 70 DIFF: -6"
## [1] "TITLE: The Man Without a Face AUDIENCE SCORE: 64 PREDICTION: 73 DIFF: -9"
## [1] "TITLE: The Tortured AUDIENCE SCORE: 35 PREDICTION: 38 DIFF: -3"
## [1] "TITLE: The Thin Blue Line AUDIENCE SCORE: 90 PREDICTION: 88 DIFF: 2"
The data sources for the new movies are provided in [4], [5].
## title runtime genre critics_rating imdb_rating
## 1 Revenant 156 Action & Adventure Certified Fresh 8.0
## 2 Exposed 102 Drama Rotten 4.2
## 3 Finding Dory 100 Animation Certified Fresh 7.6
## audience_rating audience_score
## 1 Upright 84
## 2 Spilled 14
## 3 Upright 86
for(i in 1:nrow(newMovie)){
x<-newMovie[i,]
cr<-as.numeric(x$critics_rating)
ar<-as.numeric(x$audience_rating)
g<-as.numeric(x$genre)
rt<-x$runtime
imdb<-x$imdb_rating
#PREDICTION
# Intercept runtime Critic's Rating Genre imdb rating Audience Rating
p <- -8.20919 -0.02181*rt + CR[cr] + G[g] + 9.64472*imdb + AR[ar]
#DISPLAY
print(paste('TITLE: ',x$title,' AUDIENCE SCORE: ',x$audience_score,' PREDICTION: ',round(p),'DIFF: ',x$audience_score - round(p)))
}
## [1] "TITLE: Revenant AUDIENCE SCORE: 84 PREDICTION: 86 DIFF: -2"
## [1] "TITLE: Exposed AUDIENCE SCORE: 14 PREDICTION: 33 DIFF: -19"
## [1] "TITLE: Finding Dory AUDIENCE SCORE: 86 PREDICTION: 80 DIFF: 6"
For each of the three new movies, the audience score has a 95% chance of lying within the prediction interval below. Here, as above, fit is the predicted score, and lwr and upr are the lower and upper bounds of the interval.
predict(m,newMovie,interval='prediction')
## fit lwr upr
## 1 85.60548 71.93000 99.28095
## 2 27.79901 14.25758 41.34043
## 3 86.03514 71.81460 100.25569
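The hand-coded model's predicted scores (86, 33 and 80) all fall within the 95% prediction intervals returned by predict(). The two methods use the same fitted coefficients, so they should agree up to rounding of the coefficients and of the final score. They do for Revenant, but not for Exposed (33 versus 27.8) or Finding Dory (80 versus 86.0). Gaps of that size are too large to be truncation error. They are consistent with the integer codes returned by as.numeric() for the factors in newMovie not matching the level order of the factors in msmp, so that CR[cr] and G[g] pick up the coefficient of the wrong level. predict() matches factor levels by name and so avoids this problem.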
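Looking up coefficients by level name, rather than by the integer code from as.numeric(), makes a hand-coded formula independent of how the factors in the new data happen to be ordered. The sketch below is an illustration under assumptions: the data frame new_movies is a hypothetical reconstruction of the three rows in the table above, and the lookup vectors G, CR and AR are rebuilt here from the printed reduced-model coefficients, with 0 for each baseline level.

```r
# Coefficients copied from the reduced-model summary; baseline levels are 0.
b0 <- -8.20919; b_runtime <- -0.02181; b_imdb <- 9.64472
G <- c("Action & Adventure" = 0, "Animation" = 3.06600,
       "Art House & International" = -2.93234, "Comedy" = 1.30392,
       "Documentary" = 0.04005, "Drama" = -0.90540, "Horror" = -2.00209,
       "Musical & Performing Arts" = 2.50272, "Mystery & Suspense" = -3.32832,
       "Other" = 0.10508, "Science Fiction & Fantasy" = -0.20922)
CR <- c("Certified Fresh" = 0, "Fresh" = -0.23492, "Rotten" = -1.36926)
AR <- c("Spilled" = 0, "Upright" = 20.05979)

# Hypothetical reconstruction of the three new movies.
new_movies <- data.frame(
  title           = c("Revenant", "Exposed", "Finding Dory"),
  runtime         = c(156, 102, 100),
  genre           = c("Action & Adventure", "Drama", "Animation"),
  critics_rating  = c("Certified Fresh", "Rotten", "Certified Fresh"),
  imdb_rating     = c(8.0, 4.2, 7.6),
  audience_rating = c("Upright", "Spilled", "Upright"),
  stringsAsFactors = FALSE
)

# Index each lookup vector by level NAME, never by integer factor code.
pred <- with(new_movies,
  b0 + b_runtime * runtime + unname(CR[critics_rating]) + unname(G[genre]) +
    b_imdb * imdb_rating + unname(AR[audience_rating]))
round(pred, 1)   # 85.6 27.8 86.0

# Why integer codes are fragile: they depend on the level set of each factor.
codes_subset <- as.integer(factor(new_movies$genre))
codes_full   <- as.integer(factor(new_movies$genre, levels = names(G)))
```

With name-based lookup the three values agree with the fit column from predict() to within the rounding of the printed coefficients. The two code vectors at the end differ, because a factor built from only three genres numbers its levels differently from one that carries all eleven training levels.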
AR<-rbind(0,20.05979)
#ANALYZE ALL MOVIE PREDICTIONS
pred<-as.numeric()
for(i in 1:nrow(smp1)){
x<-smp1[i,];x
cr<-as.numeric(x$critics_rating)
ar<-as.numeric(x$audience_rating)
g<-as.numeric(x$genre)
rt<-x$runtime
imdb<-x$imdb_rating
p<- -8.20919 -0.02181*rt + CR[cr] + G[g] + 9.64472*imdb + AR[ar]
pred[i]<-p
}
smp1$pred <- c(round(pred))
smp1$diff<-c(smp1$audience_score - smp1$pred)
hist(smp1$diff, main='Histogram of Model Score Differences',xlab='Differences between Model and Audience Score', col='red')
summary(smp1$diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -21.000000 -5.000000 1.000000 0.004615 4.000000 25.000000
inference(y = diff,data = smp1, statistic = "mean", type = "ci", method = "theoretical",alternative ='twosided' )
## Single numerical variable
## n = 650, y-bar = 0.0046, s = 6.7736
## 95% CI: (-0.5171 , 0.5263)
The differences between the actual and predicted audience scores have a mean near 0, a standard deviation of 6.8 and an IQR of 9. If the differences are roughly normal, about 95% of them satisfy \(|diff| \le 13.3\) (that is, \(1.96 \times 6.8\)), and the middle 50% lie between the first and third quartiles, -5 and 4.
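The interval reported by inference() and the 13.3 figure can be reproduced with a few lines of arithmetic. This is only a check using the rounded summary values printed above (n, y-bar, s and the quartiles); it assumes the theoretical interval is the usual t-based one for a single mean.

```r
n <- 650; ybar <- 0.004615; s <- 6.7736   # from the inference() and summary() output

# 95% confidence interval for the MEAN difference (t-based).
se <- s / sqrt(n)
ci <- ybar + c(-1, 1) * qt(0.975, df = n - 1) * se
round(ci, 4)        # -0.5171  0.5263

# Spread of INDIVIDUAL differences.
band <- 1.96 * s    # about 13.3 under a normal approximation
iqr  <- 4 - (-5)    # third quartile minus first quartile = 9
```

The confidence interval is narrow because it describes the mean of 650 differences, while the band of about 13.3 describes how far a single prediction can be expected to miss.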