Modeling and prediction for movies

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)

Load data

setwd("C:/Users/rahul/Desktop/Duke University/Linear regression and Modeling/Week4")
load("movies.Rdata")
dim(movies)

## [1] 651  32

Part 1: Data

The dataset contains information about 651 movies released before 2016 and information about these are stored across 32 variables.The data is randomly selected not randomly assigned. Consequently any conclusion can only be generlised to population; causality requires random assignment. it is only possible to do an observational study & no causal analysis can be done.

str(movies)

## Classes 'tbl_df', 'tbl' and 'data.frame':    651 obs. of  32 variables:
##  $ title           : chr  "Filly Brown" "The Dish" "Waiting for Guffman" "The Age of Innocence" ...
##  $ title_type      : Factor w/ 3 levels "Documentary",..: 2 2 2 2 2 1 2 2 1 2 ...
##  $ genre           : Factor w/ 11 levels "Action & Adventure",..: 6 6 4 6 7 5 6 6 5 6 ...
##  $ runtime         : num  80 101 84 139 90 78 142 93 88 119 ...
##  $ mpaa_rating     : Factor w/ 6 levels "G","NC-17","PG",..: 5 4 5 3 5 6 4 5 6 6 ...
##  $ studio          : Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
##  $ thtr_rel_year   : num  2013 2001 1996 1993 2004 ...
##  $ thtr_rel_month  : num  4 3 8 10 9 1 1 11 9 3 ...
##  $ thtr_rel_day    : num  19 14 21 1 10 15 1 8 7 2 ...
##  $ dvd_rel_year    : num  2013 2001 2001 2001 2005 ...
##  $ dvd_rel_month   : num  7 8 8 11 4 4 2 3 1 8 ...
##  $ dvd_rel_day     : num  30 28 21 6 19 20 18 2 21 14 ...
##  $ imdb_rating     : num  5.5 7.3 7.6 7.2 5.1 7.8 7.2 5.5 7.5 6.6 ...
##  $ imdb_num_votes  : int  899 12285 22381 35096 2386 333 5016 2272 880 12496 ...
##  $ critics_rating  : Factor w/ 3 levels "Certified Fresh",..: 3 1 1 1 3 2 3 3 2 1 ...
##  $ critics_score   : num  45 96 91 80 33 91 57 17 90 83 ...
##  $ audience_rating : Factor w/ 2 levels "Spilled","Upright": 2 2 2 2 1 2 2 1 2 2 ...
##  $ audience_score  : num  73 81 91 76 27 86 76 47 89 66 ...
##  $ best_pic_nom    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_pic_win    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_actor_win  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 2 1 1 ...
##  $ best_actress_win: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ best_dir_win    : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ top200_box      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ director        : chr  "Michael D. Olmos" "Rob Sitch" "Christopher Guest" "Martin Scorsese" ...
##  $ actor1          : chr  "Gina Rodriguez" "Sam Neill" "Christopher Guest" "Daniel Day-Lewis" ...
##  $ actor2          : chr  "Jenni Rivera" "Kevin Harrington" "Catherine O'Hara" "Michelle Pfeiffer" ...
##  $ actor3          : chr  "Lou Diamond Phillips" "Patrick Warburton" "Parker Posey" "Winona Ryder" ...
##  $ actor4          : chr  "Emilio Rivera" "Tom Long" "Eugene Levy" "Richard E. Grant" ...
##  $ actor5          : chr  "Joseph Julian Soria" "Genevieve Mooy" "Bob Balaban" "Alec McCowen" ...
##  $ imdb_url        : chr  "http://www.imdb.com/title/tt1869425/" "http://www.imdb.com/title/tt0205873/" "http://www.imdb.com/title/tt0118111/" "http://www.imdb.com/title/tt0106226/" ...
##  $ rt_url          : chr  "//www.rottentomatoes.com/m/filly_brown_2012/" "//www.rottentomatoes.com/m/dish/" "//www.rottentomatoes.com/m/waiting_for_guffman/" "//www.rottentomatoes.com/m/age_of_innocence/" ...

As we can see dataset cantains 32 variables

9 are character variables (@)12 are factor variable (@)10 are numerical variable

also veriable like audience score,actor1,2,3,4,5, URL based, and studio won’t not taken in consderation. for further analysis.

movies_new <- na.omit(movies) %>% 
  select (title_type, genre, runtime, 
         mpaa_rating, thtr_rel_year,imdb_rating, imdb_num_votes, critics_score,
         critics_rating, audience_rating, audience_score)
dim(movies_new)

## [1] 619  11

head(movies_new)

## # A tibble: 6 x 11
##   title_type   genre  runtime mpaa_rating thtr_rel_year imdb_rating
##   <fct>        <fct>    <dbl> <fct>               <dbl>       <dbl>
## 1 Feature Film Drama     80.0 R                    2013        5.50
## 2 Feature Film Drama    101   PG-13                2001        7.30
## 3 Feature Film Comedy    84.0 R                    1996        7.60
## 4 Feature Film Drama    139   PG                   1993        7.20
## 5 Feature Film Horror    90.0 R                    2004        5.10
## 6 Feature Film Drama    142   PG-13                1986        7.20
## # ... with 5 more variables: imdb_num_votes <int>, critics_score <dbl>,
## #   critics_rating <fct>, audience_rating <fct>, audience_score <dbl>

Part 2: Research question

Investigate what parameters are major influences on the audience score (audience_score).

Part 3: Exploratory data analysis

ggplot(data= movies_new,aes(x=genre))+geom_bar( fill='skyblue',color='black')+
        theme(axis.text.x=element_text(angle = 60, hjust = 1))+
         ggtitle("movie count by genre")

ggplot(data=movies_new,aes(x=critics_score,y=audience_score))+geom_point()+
        geom_smooth(method=lm,se=F)+ggtitle("relationship between critic score Vs audience score")

ggplot(data=movies_new, aes(x=imdb_rating, y=audience_score)) + geom_point() + geom_smooth(method=lm,se=F)+ggtitle("relationship between imdb Vs audience score")

ggplot(data=movies_new,aes(x=audience_rating, y=audience_score))+geom_point()+theme(axis.text.x=element_text(angle = 60, hjust = 1))+ geom_smooth(method=lm)+ggtitle("relationship between genre Vs audience score")

There seems to be a strong positive linear relationship between critic score Vs audience score

The average runtime of a movie is around 105 and its distribution is slightly right skewed with some outliers movies with runtime around 250 minutes. The critics scores are nearly uniformly distributed, but slightly left skewed.The score given by the audience has a similar behavior has the critics scores, with a nearly uniform left skewed distribution. * * *

Part 4: Modeling

In this project we will use linear regression and start by fitting a model with 8 variables(described in the previous section). Backward elimination will help us to define if better results can be obtained by using a smaller set of attributes. The advantage of backward elimination is that it allows to start with all the variables, deleting one variable at a time until there are no improvements in the model.

m_full<-lm(audience_score~critics_score+genre+critics_rating+imdb_rating+audience_rating+mpaa_rating+runtime
              +imdb_num_votes+title_type+thtr_rel_year,data=movies_new)
m_full<-summary(m_full)
m_full

## 
## Call:
## lm(formula = audience_score ~ critics_score + genre + critics_rating + 
##     imdb_rating + audience_rating + mpaa_rating + runtime + imdb_num_votes + 
##     title_type + thtr_rel_year, data = movies_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.7520  -4.4252   0.5607   4.1959  24.3954 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     8.334e+01  6.145e+01   1.356   0.1755    
## critics_score                   1.046e-02  2.571e-02   0.407   0.6842    
## genreAnimation                  4.025e+00  2.855e+00   1.410   0.1591    
## genreArt House & International -1.723e+00  2.275e+00  -0.758   0.4490    
## genreComedy                     1.641e+00  1.168e+00   1.405   0.1607    
## genreDocumentary                4.479e+00  2.929e+00   1.529   0.1267    
## genreDrama                     -1.170e-01  1.032e+00  -0.113   0.9097    
## genreHorror                    -1.476e+00  1.765e+00  -0.836   0.4032    
## genreMusical & Performing Arts  4.636e+00  2.404e+00   1.928   0.0543 .  
## genreMystery & Suspense        -2.761e+00  1.319e+00  -2.093   0.0368 *  
## genreOther                      3.546e-01  2.021e+00   0.175   0.8608    
## genreScience Fiction & Fantasy -1.491e-01  2.601e+00  -0.057   0.9543    
## critics_ratingFresh            -2.292e-01  9.075e-01  -0.253   0.8007    
## critics_ratingRotten           -8.477e-01  1.439e+00  -0.589   0.5562    
## imdb_rating                     9.321e+00  5.000e-01  18.642   <2e-16 ***
## audience_ratingUpright          2.016e+01  8.081e-01  24.943   <2e-16 ***
## mpaa_ratingNC-17               -8.167e+00  7.193e+00  -1.135   0.2567    
## mpaa_ratingPG                  -1.213e+00  2.037e+00  -0.595   0.5518    
## mpaa_ratingPG-13               -1.590e+00  2.121e+00  -0.750   0.4536    
## mpaa_ratingR                   -2.010e+00  2.039e+00  -0.986   0.3247    
## mpaa_ratingUnrated             -6.683e-01  2.442e+00  -0.274   0.7845    
## runtime                        -4.014e-02  1.711e-02  -2.346   0.0193 *  
## imdb_num_votes                  5.209e-06  3.175e-06   1.641   0.1014    
## title_typeFeature Film          3.378e+00  2.783e+00   1.214   0.2253    
## title_typeTV Movie              2.290e+00  4.433e+00   0.517   0.6057    
## thtr_rel_year                  -4.558e-02  3.044e-02  -1.497   0.1349    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.876 on 593 degrees of freedom
## Multiple R-squared:  0.8886, Adjusted R-squared:  0.8839 
## F-statistic: 189.2 on 25 and 593 DF,  p-value: < 2.2e-16

The linear model and anova output summary reveals an adjusted \(R^2\) value of 88.39%. Ths means that only 11.7 of the variability is accounted for in the residuals and the remainder is accounted for in the predictor varibles. This extremely high value may be suspect to error, that would invalidate the reliability of the study.

The multi-collinearity must be evaluated to remove any interdependecies amoung the predictors. This will be left for the model building section

#remove veriable thtr_rel_year
m1<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
               imdb_rating+audience_rating+imdb_num_votes+title_type,data=movies_new)
a<-summary(m1)$adj.r.squared

#menove veriable title_type
m2<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
               imdb_rating+audience_rating+imdb_num_votes,data=movies_new)
b<-summary(m2)$adj.r.squared
#remove veriable imdb_num_votes
m3<-lm(audience_score~critics_score+mpaa_rating+runtime+genre+critics_rating+
               imdb_rating+audience_rating,data=movies_new)
c<-summary(m3)$adj.r.squared
#remove veraible runtime
m4<-lm(audience_score~critics_score+mpaa_rating+genre+
               imdb_rating+audience_rating+critics_rating,data=movies_new)
d<-summary(m4)$adj.r.squared

#Removing mpaa_rating
m5<-lm(audience_score~critics_score+genre+
               imdb_rating+audience_rating+critics_rating,data=movies_new)

e<-summary(m5)$adj.r.squared

Analysising Adjusted \(R^2\)

cbind(c("m_full","-thtr_rel_year","-title_type","-imdb_num_votes","-runtime","mpaa_rating"),
      round(c(m_full$adj.r.squared,a,b,c,d,e),4))

##      [,1]              [,2]    
## [1,] "m_full"          "0.8839"
## [2,] "-thtr_rel_year"  "0.8836"
## [3,] "-title_type"     "0.8837"
## [4,] "-imdb_num_votes" "0.8835"
## [5,] "-runtime"        "0.8831"
## [6,] "mpaa_rating"     "0.8834"

from the output of above table it seems veriable thtr_rel_year,“title_type”,“imdb_num_votes”,“mpaa_rating,imdb_rating and”runtime" are not significant predictors as adjusted \(R^2\)= 0.8889 does not increase infact it remains same indicating that there is weak relationship between them.

going forward

Now we have left with 4 veriables critics_score,genre,imdb_rating and audience_rating lets check them step by step

#removing audiance_rating
m6<-lm(audience_score~critics_score+genre+critics_rating+
               imdb_rating,data=movies_new)

f<-summary(m6)$adj.r.squared
f

## [1] 0.7610629

It seems audience rating is having strong posive correlation with audience score so keep it in model and continue further

#removing imdb_rating
m7<-lm(audience_score~critics_score+genre+audience_rating+critics_rating,data=movies_new)

g<-summary(m7)$adj.r.squared

#removing critic_score
m8<-lm(audience_score~critics_rating+audience_rating+imdb_rating+genre,data=movies_new)

h<-summary(m8)$adj.r.squared

#removing genre
m9<-lm(audience_score~critics_rating+audience_rating+imdb_rating,data=movies_new)

i<-summary(m9)$adj.r.squared

#removing critics rating
m10<-lm(audience_score~audience_rating+imdb_rating+genre,data=movies_new)

j<-summary(m10)$adj.r.squared

Analysising Adjusted \(R^2\)

cbind(c("m_full","audiance_rating","imdb_rating","critic_score","genre","critics_rating"),
      round(c(m_full$adj.r.squared,f,g,h,i,j),4))

##      [,1]              [,2]    
## [1,] "m_full"          "0.8839"
## [2,] "audiance_rating" "0.7611"
## [3,] "imdb_rating"     "0.8078"
## [4,] "critic_score"    "0.8836"
## [5,] "genre"           "0.8799"
## [6,] "critics_rating"  "0.8834"

The final model depicts a Parsimonius Model: the simplest model with the highest predictive power. Only four predictors are used : imdb_rating, audience_rating,critics_rating and genre.from the output of above table it seems veriable critic_score is not significant predictors as adjusted \(R^2\)= 0.885 does not increase infact it remains same indicating that there is weak relationship between them on the other hand audiance_rating,imdb_rating and “critics_rating is having strong relationship between them.

so our final model will be

m8<-lm(audience_score~audience_rating+imdb_rating+genre,data=movies_new)

summary(m8)

## 
## Call:
## lm(formula = audience_score ~ audience_rating + imdb_rating + 
##     genre, data = movies_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.620  -4.502   0.627   4.364  24.975 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -12.2663     2.2718  -5.399 9.61e-08 ***
## audience_ratingUpright          20.5042     0.7937  25.834  < 2e-16 ***
## imdb_rating                      9.7107     0.3802  25.543  < 2e-16 ***
## genreAnimation                   5.5041     2.6000   2.117   0.0347 *  
## genreArt House & International  -2.2434     2.1928  -1.023   0.3067    
## genreComedy                      1.6229     1.1509   1.410   0.1590    
## genreDocumentary                 1.3614     1.4789   0.921   0.3577    
## genreDrama                      -0.6963     0.9828  -0.708   0.4790    
## genreHorror                     -1.3892     1.7145  -0.810   0.4181    
## genreMusical & Performing Arts   2.7493     2.2083   1.245   0.2136    
## genreMystery & Suspense         -3.2959     1.2818  -2.571   0.0104 *  
## genreOther                       0.7362     1.9985   0.368   0.7127    
## genreScience Fiction & Fantasy   0.5883     2.5888   0.227   0.8203    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.89 on 606 degrees of freedom
## Multiple R-squared:  0.8856, Adjusted R-squared:  0.8834 
## F-statistic: 391.1 on 12 and 606 DF,  p-value: < 2.2e-16

As we started with full model with adjusted \(R^2\)= 88.3 reached to adjusted \(R^2\) = 88.5 with backword elimination with increase in adjusted \(R^2\).

Variables that were excluded from the table are listed below-

runtime - weak linear relationship and not significant thtr_rel_year no linear relationship and not significant rel_year - no linear relationship and not significant rel_month - no linear relationship and not significant mpaa_rating - not significant imdb_num_votes - not significant critics_score - collinearity and not significant

checking mutiple colinearity

pairs(data=movies_new,~audience_score+critics_rating+audience_rating+imdb_rating+genre)

* * * ## Part 5 Model Diagnostics

Checking for the linear relationship between numerical predictor (s) and residual (s) using the code written below.

plot(m8$residuals~movies_new$imdb_rating,pch=20,main="Linearity Condition")
abline(h=0)

by looking at plot I can say the plot depicts a complete random scatter around zero; no descernable pattern.

par(mfrow=c(1,2),mar=c(3,2,3,2))
hist(m8$residuals,main="histogram of residuals",col="skyblue")
qqnorm(m8$residuals,pch=20,main="Normal Probability Plot of Residuals")
qqline(m8$residuals)

Condition met - the majority of the points lie on the line, but because of skeness, a few points do not. Also note that there are no apparent outliners the histogram confirms the skewness (right skewness) but the distribution still appear to be nearly normal.

Check for variability

par(mfrow=c(1,2))

plot(m8$residuals~m8$fitted,main='Residuals vs. Predicted (fitted) ')
plot(abs(m8$residuals)~m8$fitted,main='Absolute Residuals vs. Predicted')

The plot of predicted values shows that residuals are equally variable for low and heigh values and there is no visible fan pattern.

Independancy condition

plot(m8$residuals,pch=20 ,main="Independany Conditions")

Above plot depicts residuals being randomly scatterd around zero.

Part 5: Prediction

Build test data cases for the movie “The jungle book (2016)” using the data gathered from IMDB and rotten tomatoes website and storing the data in the variable named “new_data” using the following code.

imdb_rating<-7.5
genre <- "Drama"
critics_rating <-"Fresh"
audience_score= 86
audience_rating<- "Upright"

new_data<- data.frame(critics_rating,audience_rating,imdb_rating,genre)

predict(m8, new_data)

##       1 
## 80.3719

predict(m8,new_data,interval="confidence")

##       fit      lwr      upr
## 1 80.3719 79.44515 81.29865

Note: The actual audience_score for movie The jungle book (2016) is 86 and model predicted audience sccore is 80.37

With this information we can conclude that we are 95% confident that the actual audience_score for The jungle book (2016) is between 79.44 and 81.29 The model returns an interval that does not includes our predicted value of 86.

Part 6: Conclusion

Predicting model is not easy task with few veriable to deal with I think model will be more accurate if we add more veriables model as mention in imdb website http://www.imdb.com/title/tt3040964 genre of movie is Adventure, Drama, Family while in my final model I dont have such genre. so I left with no option to mention “Drama” as gente. beacuse of this there is variation in predicted model.

The data is biased toward drama movies, consequently the model was trained primarily by drame movie dataset thus it would have been better to predict the audience score about drama movies.