Modeling and prediction for movies

Setup

Load packages

library(ggplot2);library(dplyr);library(statsr)

## Warning: package 'dplyr' was built under R version 3.3.2

Load data

getwd()

## [1] "C:/Albert/Coursera Statitics with R/Linear Regression and Modeling/Wk4-"

load("eaca_movies.Rdata")

msmp = select(movies, critics_rating, imdb_rating,runtime,title, critics_score,genre,title_type,imdb_num_votes,audience_rating,audience_score,mpaa_rating) 

msmp = na.exclude(msmp)

Part 1: Data

IMDb

The Internet Movie Database (abbreviated IMDb) is an online database of information related to films, television programs and video games, including cast, production crew, fictional characters, biographies, plot summaries, trivia and reviews, operated by IMDb.com, Inc., a subsidiary of Amazon.com.

Actors and crew can post their own résumé and upload photos of themselves for a yearly fee. U.S. users can view over 6,000 movies and television shows from CBS, Sony, and various independent filmmakers.

As of September 2016, IMDb has approximately 3.9 million titles (including episodes) and 7.4 million personalities in its database,[2] as well as 67 million registered users.[1]

Rotten Tomatoes

Rotten Tomatoes is a website launched in August 1998 devoted to film reviews, news and details; it is widely known as a film review aggregator. Coverage now includes TV content as well. The name derives from the practice of audiences throwing rotten tomatoes when disapproving of a poor stage performance. The company was created by Senh Duong and since January 2010 has been owned by Flixster, which itself was acquired in 2011 by Warner Bros.[3]

Resolution

Generalization to the wider population of movies is reasonable because the data are a random sample of movies released in the United States between 1972 and 2014, taken from both the IMDb and Rotten Tomatoes databases; 650 movies remain after rows with missing values are removed. However, since the study does not make use of random assignment, causality cannot be inferred from it. The sample may also be biased, since it is limited to entries contributed by members.

Part 2: Research question

Investigate which variables have the most influence on the audience score (audience_score). The audience score is the acid test for a movie, since it reflects how viewers actually received it, and it is plausibly tied to the movie's box office performance.

Part 3: Exploratory data analysis

Analysis of Data Pair Plots

The EDA section begins with a plot of paired variables to examine the relationships, if any, among the selected variables. This plot is used to choose the variables for the modeling part of the project. The response variable is audience_score, since the audience evaluation is the most direct indicator of movie popularity.

pairs(~audience_score+critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,
      data=msmp, 
   main="Simple Scatterplot Matrix")

Preliminary Data Analysis

The linear model summary and ANOVA table are provided below.

m<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,data=msmp)
summary(m)

## 
## Call:
## lm(formula = audience_score ~ critics_score + mpaa_rating + runtime + 
##     genre + critics_rating + imdb_rating + audience_rating + 
##     imdb_num_votes + title_type, data = msmp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1421  -4.4953   0.4327   4.2316  24.5563 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -9.525e+00  4.426e+00  -2.152   0.0318 *  
## critics_score                   9.521e-03  2.522e-02   0.377   0.7059    
## mpaa_ratingNC-17               -5.586e-01  5.216e+00  -0.107   0.9148    
## mpaa_ratingPG                  -8.903e-02  1.899e+00  -0.047   0.9626    
## mpaa_ratingPG-13               -1.002e+00  1.959e+00  -0.511   0.6092    
## mpaa_ratingR                   -1.176e+00  1.886e+00  -0.623   0.5332    
## mpaa_ratingUnrated             -2.789e-01  2.166e+00  -0.129   0.8976    
## runtime                        -2.613e-02  1.616e-02  -1.617   0.1063    
## genreAnimation                  2.431e+00  2.697e+00   0.902   0.3677    
## genreArt House & International -2.422e+00  2.104e+00  -1.151   0.2501    
## genreComedy                     1.554e+00  1.148e+00   1.354   0.1761    
## genreDocumentary                2.525e+00  2.759e+00   0.915   0.3605    
## genreDrama                     -4.738e-01  1.009e+00  -0.469   0.6389    
## genreHorror                    -1.614e+00  1.723e+00  -0.937   0.3493    
## genreMusical & Performing Arts  3.757e+00  2.368e+00   1.587   0.1131    
## genreMystery & Suspense        -2.903e+00  1.288e+00  -2.253   0.0246 *  
## genreOther                      8.901e-02  1.957e+00   0.045   0.9637    
## genreScience Fiction & Fantasy -2.839e-01  2.460e+00  -0.115   0.9081    
## critics_ratingFresh             9.706e-02  8.602e-01   0.113   0.9102    
## critics_ratingRotten           -7.752e-01  1.399e+00  -0.554   0.5798    
## imdb_rating                     9.439e+00  4.878e-01  19.349   <2e-16 ***
## audience_ratingUpright          2.008e+01  7.892e-01  25.441   <2e-16 ***
## imdb_num_votes                  3.589e-06  3.069e-06   1.169   0.2427    
## title_typeFeature Film          2.512e+00  2.575e+00   0.976   0.3296    
## title_typeTV Movie              7.692e-01  4.039e+00   0.190   0.8490    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.871 on 625 degrees of freedom
## Multiple R-squared:  0.889,  Adjusted R-squared:  0.8847 
## F-statistic: 208.5 on 24 and 625 DF,  p-value: < 2.2e-16

anova(m)

## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq   F value    Pr(>F)    
## critics_score     1 131758  131758 2791.0826 < 2.2e-16 ***
## mpaa_rating       5    877     175    3.7150  0.002541 ** 
## runtime           1   1291    1291   27.3549 2.313e-07 ***
## genre            10   8096     810   17.1511 < 2.2e-16 ***
## critics_rating    2   1714     857   18.1517 2.174e-08 ***
## imdb_rating       1  61850   61850 1310.1895 < 2.2e-16 ***
## audience_rating   1  30510   30510  646.3062 < 2.2e-16 ***
## imdb_num_votes    1     69      69    1.4713  0.225600    
## title_type        2     58      29    0.6165  0.540164    
## Residuals       625  29504      47                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The linear model output shows an adjusted \(R^2\) of 88.47%. This means that only about 11.5% of the variability in audience_score is left in the residuals; the remainder is accounted for by the predictor variables. A value this high is suspect. One likely cause is that audience_rating (Upright or Spilled) is itself derived from the audience score, which would inflate the fit and undermine the reliability of the study.
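As a worked check, the \(R^2\) and adjusted \(R^2\) values can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. This is only a sketch: the numbers are copied by hand from the output above, and the variable names (`ss_terms`, `ss_resid`, and so on) are introduced here for illustration.

```r
# Sums of squares copied from the anova(m) table above
ss_terms <- c(critics_score = 131758, mpaa_rating = 877, runtime = 1291,
              genre = 8096, critics_rating = 1714, imdb_rating = 61850,
              audience_rating = 30510, imdb_num_votes = 69, title_type = 58)
ss_resid <- 29504
df_resid <- 625   # residual degrees of freedom
df_model <- 24    # model degrees of freedom from the F-statistic line

ss_total <- sum(ss_terms) + ss_resid
df_total <- df_model + df_resid   # n - 1

# R^2 compares sums of squares; adjusted R^2 compares mean squares
r2     <- 1 - ss_resid / ss_total
adj_r2 <- 1 - (ss_resid / df_resid) / (ss_total / df_total)

round(c(r2 = r2, adj_r2 = adj_r2), 4)
```

Up to the rounding of the printed sums of squares, the two values should agree with the 0.889 and 0.8847 reported by summary(m).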

Multicollinearity must be evaluated to remove any interdependencies among the predictors. This is left for the model building section.

Part 4: Modeling

A backward elimination method is used to build the model. Beginning with the full model identified in the EDA section, the predictor with the highest p-value is removed, one at a time, and the adjusted \(R^2\) is recorded after each removal, until a parsimonious model is reached.

m<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+imdb_rating+audience_rating+imdb_num_votes+title_type,data=msmp)
x<-summary(m)
y<-data.frame(x$coefficients[,4])
colnames(y)<-c('Pr(>|t|)')

Iterative Removal of Predictors

The following iterations were made, and the adjusted \(R^2\) (ARS, in percent) was recorded after each one:

  * Base - ARS = 88.47
  * Removal of critics_score - ARS = 88.49
  * Removal of mpaa_rating - ARS = 88.53
  * Removal of title_type - ARS = 88.55
  * Removal of imdb_num_votes - ARS = 88.55

Removing any further predictor makes the adjusted \(R^2\) decrease, so elimination stops here.
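The iterations above were run by hand. A minimal sketch of how such a loop could be automated is given below, using adjusted \(R^2\) as the stopping rule. It is not the code used in this report: the helper names `adj_r2` and `backward_adj_r2` are hypothetical, and the built-in mtcars data stand in for the movie data so that the example runs on its own.

```r
# Adjusted R^2 of a fitted lm object
adj_r2 <- function(fit) summary(fit)$adj.r.squared

# Drop one term at a time for as long as a removal raises adjusted R^2
backward_adj_r2 <- function(fit) {
  repeat {
    terms_now <- attr(terms(fit), "term.labels")
    if (length(terms_now) <= 1) break
    # Refit once per candidate, each time leaving one term out
    candidates <- lapply(terms_now, function(tm) {
      update(fit, as.formula(paste(". ~ . -", tm)))
    })
    scores <- vapply(candidates, adj_r2, numeric(1))
    best <- which.max(scores)
    # Stop when no single removal improves adjusted R^2
    if (scores[best] <= adj_r2(fit)) break
    fit <- candidates[[best]]
  }
  fit
}

# Illustration on mtcars (an assumption; swap in msmp and the full formula)
full  <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
final <- backward_adj_r2(full)
formula(final)
c(full = adj_r2(full), final = adj_r2(final))
```

Choosing the removal that gives the largest adjusted \(R^2\) is one common variant; dropping the term with the highest p-value, as described in the text, can select a different sequence of models.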

Final Model Parameters

m<-lm(audience_score~runtime+genre+critics_rating+imdb_rating+audience_rating,data=msmp)
x<-summary(m)
y<-data.frame(x$coefficients[,4])
colnames(y)<-c('Pr(>|t|)')
x

## 
## Call:
## lm(formula = audience_score ~ runtime + genre + critics_rating + 
##     imdb_rating + audience_rating, data = msmp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4194  -4.7080   0.6925   4.3510  24.8869 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -8.20919    2.97135  -2.763  0.00590 ** 
## runtime                        -0.02181    0.01530  -1.426  0.15437    
## genreAnimation                  3.06600    2.46274   1.245  0.21361    
## genreArt House & International -2.93234    2.02995  -1.445  0.14908    
## genreComedy                     1.30392    1.12904   1.155  0.24857    
## genreDocumentary                0.04005    1.39784   0.029  0.97715    
## genreDrama                     -0.90540    0.96471  -0.939  0.34833    
## genreHorror                    -2.00209    1.67640  -1.194  0.23282    
## genreMusical & Performing Arts  2.50272    2.18911   1.143  0.25336    
## genreMystery & Suspense        -3.32832    1.25185  -2.659  0.00804 ** 
## genreOther                      0.10508    1.93104   0.054  0.95662    
## genreScience Fiction & Fantasy -0.20922    2.44861  -0.085  0.93193    
## critics_ratingFresh            -0.23492    0.78294  -0.300  0.76424    
## critics_ratingRotten           -1.36926    0.89623  -1.528  0.12706    
## imdb_rating                     9.64472    0.41387  23.304  < 2e-16 ***
## audience_ratingUpright         20.05979    0.78471  25.563  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.848 on 634 degrees of freedom
## Multiple R-squared:  0.8881, Adjusted R-squared:  0.8855 
## F-statistic: 335.5 on 15 and 634 DF,  p-value: < 2.2e-16

anova(m)

## Analysis of Variance Table
## 
## Response: audience_score
##                  Df Sum Sq Mean Sq F value    Pr(>F)    
## runtime           1   8702    8702  185.59 < 2.2e-16 ***
## genre            10  49167    4917  104.86 < 2.2e-16 ***
## critics_rating    2  62427   31214  665.69 < 2.2e-16 ***
## imdb_rating       1  85062   85062 1814.12 < 2.2e-16 ***
## audience_rating   1  30641   30641  653.48 < 2.2e-16 ***
## Residuals       634  29728      47                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Check for Multi-collinearity

m<-lm(audience_score~runtime+genre+critics_rating+imdb_rating+audience_rating,data=msmp)

pairs(~runtime+genre+critics_rating+imdb_rating+audience_rating,
      data=msmp)
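A pairs plot gives only a visual impression of collinearity. A numeric complement is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The sketch below computes it by hand for numeric predictors; the helper name `vif_manual` is hypothetical, and mtcars is used only so that the example is self-contained.

```r
# Hand-rolled variance inflation factors for numeric predictors
vif_manual <- function(data, predictors) {
  vapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # Regress predictor p on all the other predictors
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

# Illustration on mtcars (an assumption; the movie data are not bundled with R)
vif_manual(mtcars, c("wt", "hp", "qsec", "drat"))
```

Values close to 1 indicate little collinearity. In the movie model only runtime and imdb_rating are numeric; factor predictors such as genre need a generalized VIF, which packages such as car provide.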

MLR Diagnostics

Linear relationship between explanatory and response variables

Check using residuals plot

Random scatter around 0
Boxplots demonstrate this for the separate genre categories

plot(m$residuals~msmp$genre, main='Residuals vs. Genre')

Nearly Normal Residuals with mean 0

par(mfrow=c(1,2))
hist(m$residuals, main='Histogram of Residuals')
qqnorm(m$residuals,main='Normal Probability Plot of Residuals')
qqline(m$residuals)

par(mfrow=c(1,1))

Constant Variability of Residuals

par(mfrow=c(1,2))

plot(m$residuals~m$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(m$residuals)~m$fitted,main='Absolute Residuals vs. Predicted')

#plot(m$residuals~msmp$imdb_rating)
par(mfrow=c(1,1))

Independent Residuals

plot(m$residuals, main='Residuals vs Collection Index')

Part 5: Prediction

Overview

Using the MLR model developed above, build a prediction algorithm. The response variable, audience_score, is computed from a formula based on the five explanatory variables and the intercept.

PREDICTION ALGORITHM
- Terms: intercept, runtime (rt), critics rating (CR), genre (G), IMDb rating (imdb), audience rating (AR)
- CR, G and AR are lookup vectors holding the coefficient for each factor level, with 0 for the reference level; cr, g and ar are the level indices
- p <- -8.20919 - 0.02181*rt + CR[cr] + G[g] + 9.64472*imdb + AR[ar]

In the sample table, predicted score is given along with the actual audience score and the difference between the two (DIFF).

## [1] "TITLE:   Burn After Reading      AUDIENCE SCORE:  64      PREDICTION:  76 DIFF:  -12"
## [1] "TITLE:   Max      AUDIENCE SCORE:  64      PREDICTION:  72 DIFF:  -8"
## [1] "TITLE:   Basic      AUDIENCE SCORE:  64      PREDICTION:  70 DIFF:  -6"
## [1] "TITLE:   The Man Without a Face      AUDIENCE SCORE:  64      PREDICTION:  73 DIFF:  -9"
## [1] "TITLE:   The Tortured      AUDIENCE SCORE:  35      PREDICTION:  38 DIFF:  -3"
## [1] "TITLE:   The Thin Blue Line      AUDIENCE SCORE:  90      PREDICTION:  88 DIFF:  2"

New Movies

The data sources for the new movies are provided in [4], [5].

The Revenant
Exposed
Finding Dory

##          title runtime              genre  critics_rating imdb_rating
## 1     Revenant     156 Action & Adventure Certified Fresh         8.0
## 2      Exposed     102              Drama          Rotten         4.2
## 3 Finding Dory     100          Animation Certified Fresh         7.6
##   audience_rating audience_score
## 1         Upright             84
## 2         Spilled             14
## 3         Upright             86

Model Predictions

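# CR, G and AR are lookup vectors of the factor-level coefficients
# (0 for the reference level). Indexing them with as.numeric() assumes
# that newMovie uses the same factor levels, in the same order, as msmp.
for(i in 1:nrow(newMovie)){
    x<-newMovie[i,]
    cr<-as.numeric(x$critics_rating)
    ar<-as.numeric(x$audience_rating)
    g<-as.numeric(x$genre)
    rt<-x$runtime
    imdb<-x$imdb_rating
   
    #PREDICTION
    # Intercept   runtime  Critic's Rating Genre  imdb rating  Audience Rating
    p <- -8.20919 -0.02181*rt + CR[cr] + G[g] + 9.64472*imdb + AR[ar]
   
    #DISPLAY: DIFF is the actual audience score minus the rounded prediction
    print(paste('TITLE:  ',x$title,'     AUDIENCE SCORE: ',x$audience_score,'     PREDICTION: ',round(p),'DIFF: ',x$audience_score - round(p)))
}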

## [1] "TITLE:   Revenant      AUDIENCE SCORE:  84      PREDICTION:  86 DIFF:  -2"
## [1] "TITLE:   Exposed      AUDIENCE SCORE:  14      PREDICTION:  33 DIFF:  -19"
## [1] "TITLE:   Finding Dory      AUDIENCE SCORE:  86      PREDICTION:  80 DIFF:  6"
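The loop above relies on the lookup vectors CR, G and AR, whose definitions are not shown in this report. The sketch below is one possible reconstruction, not the original code: the coefficients are copied from the final model summary, the reference levels (Action & Adventure, Certified Fresh, Spilled) are inferred from their absence in the coefficient table, and the vectors are indexed by level name rather than by integer factor code so that the result does not depend on how the factor levels of newMovie are ordered.

```r
# Coefficients copied from the final model summary; reference levels get 0
G <- c("Action & Adventure" = 0, "Animation" = 3.06600,
       "Art House & International" = -2.93234, "Comedy" = 1.30392,
       "Documentary" = 0.04005, "Drama" = -0.90540, "Horror" = -2.00209,
       "Musical & Performing Arts" = 2.50272, "Mystery & Suspense" = -3.32832,
       "Other" = 0.10508, "Science Fiction & Fantasy" = -0.20922)
CR <- c("Certified Fresh" = 0, "Fresh" = -0.23492, "Rotten" = -1.36926)
AR <- c("Spilled" = 0, "Upright" = 20.05979)

# The three new movies, re-entered from the table above
newMovie <- data.frame(
  title = c("Revenant", "Exposed", "Finding Dory"),
  runtime = c(156, 102, 100),
  genre = c("Action & Adventure", "Drama", "Animation"),
  critics_rating = c("Certified Fresh", "Rotten", "Certified Fresh"),
  imdb_rating = c(8.0, 4.2, 7.6),
  audience_rating = c("Upright", "Spilled", "Upright"),
  stringsAsFactors = FALSE
)

# Index by label (as.character) instead of by integer factor code
p <- with(newMovie,
  -8.20919 - 0.02181 * runtime + unname(CR[as.character(critics_rating)]) +
    unname(G[as.character(genre)]) + 9.64472 * imdb_rating +
    unname(AR[as.character(audience_rating)]))
setNames(round(p, 1), newMovie$title)
```

Because the coefficients are rounded to five decimals, small differences from the fitted model's own predictions are expected.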

The R Predict Method Results

For each of the three new movies, the 95% prediction interval for the audience score is:

The Revenant (71.9, 99.3) with predicted score of 85.6
Exposed (14.3, 41.3) with predicted score of 27.8
Finding Dory (71.8, 100.3) with predicted score of 86.0; since scores cannot exceed 100, the upper bound is effectively 100

In the output below, fit is the predicted score, and lwr and upr are the lower and upper bounds of the 95% prediction interval.

predict(m,newMovie,interval='predict')

##        fit      lwr       upr
## 1 85.60548 71.93000  99.28095
## 2 27.79901 14.25758  41.34043
## 3 86.03514 71.81460 100.25569
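For reference, the prediction interval returned by predict() can be reproduced by hand as \(\hat{y} \pm t_{0.975,\,df} \sqrt{SE_{fit}^2 + \hat{\sigma}^2}\), where \(\hat{\sigma}\) is the residual standard error. The sketch below checks this on the built-in mtcars data, which is an assumption made only to keep the example self-contained; the same steps should apply to the movie model m and newMovie.

```r
# Illustration on mtcars (an assumption; substitute m and newMovie)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = c(2.5, 3.5), hp = c(110, 180))

pr    <- predict(fit, new, se.fit = TRUE)
sigma <- summary(fit)$sigma                    # residual standard error
tcrit <- qt(0.975, df = fit$df.residual)       # two-sided 95%
# Prediction interval: uncertainty in the fit plus residual variance
half  <- tcrit * sqrt(pr$se.fit^2 + sigma^2)

manual <- cbind(fit = pr$fit, lwr = pr$fit - half, upr = pr$fit + half)
auto   <- predict(fit, new, interval = "prediction")
all.equal(manual, auto, check.attributes = FALSE)
```

Dropping the sigma^2 term would give the narrower confidence interval for the mean response instead of the interval for an individual movie.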

Part 6: Conclusion

New Movie Data

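The hand-coded model's predicted scores (86, 33 and 80) all fall within the 95% prediction intervals from predict(). The actual audience scores for The Revenant (84) and Finding Dory (86) also fall inside their intervals, while the score for Exposed (14) falls just below its lower bound of 14.3. Because predict() uses the same fitted coefficients, the hand-coded formula should match fit up to the rounding of the coefficients, as it does for The Revenant (86 versus 85.6). The larger gaps for Exposed (33 versus 27.8) and Finding Dory (80 versus 86.0) are therefore more likely caused by a mismatch in the rudimentary model's coefficient lookups, for example factor levels in newMovie that are ordered differently from those in msmp, than by the truncation of parameters.

Analysis of the Difference between the linear model and true audience scores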

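# Audience rating coefficients: Spilled (reference) = 0, Upright = 20.05979
AR <- c(0, 20.05979)

#ANALYZE ALL MOVIE PREDICTIONS
# CR and G are the matching lookup vectors for critics_rating and genre;
# smp1 is the analysis sample (its construction is not shown here)
pred <- numeric(nrow(smp1))

for(i in 1:nrow(smp1)){
    x <- smp1[i,]
    cr <- as.numeric(x$critics_rating)
    ar <- as.numeric(x$audience_rating)
    g <- as.numeric(x$genre)
    rt <- x$runtime
    imdb <- x$imdb_rating
    pred[i] <- -8.20919 - 0.02181*rt + CR[cr] + G[g] + 9.64472*imdb + AR[ar]
}
# Difference = actual audience score minus rounded prediction
smp1$pred <- round(pred)
smp1$diff <- smp1$audience_score - smp1$pred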

hist(smp1$diff, main='Histogram of Model Score Differences',xlab='Differences between Model and Audience Score', col='red')

summary(smp1$diff)

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -21.000000  -5.000000   1.000000   0.004615   4.000000  25.000000

inference(y = diff,data = smp1, statistic = "mean", type = "ci", method = "theoretical",alternative ='twosided' )

## Single numerical variable
## n = 650, y-bar = 0.0046, s = 6.7736
## 95% CI: (-0.5171 , 0.5263)

Summary of Findings

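The differences between the actual audience scores and the model's predictions have a mean near 0, and the 95% confidence interval for that mean, (-0.52, 0.53), includes 0, so there is no evidence of systematic over- or under-prediction. The standard deviation is 6.8 and the IQR is 9. Assuming the differences are approximately normal, about 95% of the predictions have \(|diff| \le 13.3\) (that is, \(1.96 \times 6.8\)), and the middle 50% of the differences lie between -5 and 4.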
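The interval reported by inference() can be checked by hand from the summary statistics with \(\bar{y} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}\). This is a sketch using only the numbers printed above; the variable names are introduced here for illustration.

```r
n    <- 650
ybar <- 0.004615   # mean of diff, from summary(smp1$diff)
s    <- 6.7736     # standard deviation reported by inference()

se    <- s / sqrt(n)                 # standard error of the mean
tcrit <- qt(0.975, df = n - 1)       # two-sided 95% critical value
ci    <- ybar + c(-1, 1) * tcrit * se
round(ci, 4)

# Rough spread of individual differences, assuming approximate normality
spread95 <- 1.96 * s
round(spread95, 1)
```

The first result should agree with the reported interval up to rounding, and the second is the source of the 13.3 quoted above.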