Final Project - Regression Models

Data analysis project for the Linear Regression and Modeling course by Duke University (Coursera)

Load packages

library(tidyverse)
library(ggplot2)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)
library(GGally)
library(car)

Load data

load("movies.RData")

Part 1: DATA

This project is interested in learning what attributes make a movie popular.

The data set comprises 651 randomly sampled movies produced and released before 2016, with information from Rotten Tomatoes and IMDB for each movie.

As the sample was randomly selected, the results are generalizable to movies released before 2016.
However, we are not able to evaluate causality: this is an observational study, not a controlled experiment with random assignment. We can assess correlation and association, but not causation.

Part 2: RESEARCH QUESTION

What are the factors that determine a movie’s audience score?

The opinion of the public and critics about a movie can be controversial. In fact, some movies awarded by critics may be criticized or poorly evaluated by the general public. One example of that is when we have Oscar winners that are a surprise for the general public. Therefore, for the general audience, which are the characteristics of a good movie? What do they take into consideration to rate a movie?

Part 3: EXPLORATORY DATA ANALYSIS (EDA)

The first part of the EDA is to clean the database so that it includes only the variables of interest. The codebook for the entire database is available in the GitHub repository for the course: https://github.com/ldbatista/Statistics-with-R.

Many variables are for information purposes only, such as the URL of the movie on the IMDB and Rotten Tomatoes websites. These variables are not relevant for modeling; therefore, they were removed from the following analyses.

The following variables were selected for the further analyses:

DEPENDENT VARIABLE:

audience_score : Audience score on Rotten Tomatoes

INDEPENDENT VARIABLES:

title_type : Type of movie (Documentary, Feature Film, TV Movie)
genre : Genre of movie (Action & Adventure, Animation, Art House & International, Comedy, Documentary, Drama, Horror, Musical & Performing Arts, Mystery & Suspense, Science Fiction & Fantasy, Other)
runtime : Runtime of movie (in minutes)
thtr_rel_year : Year the movie is released in theaters
thtr_rel_month : Month the movie is released in theaters
imdb_rating : Rating on IMDB
imdb_num_votes : Number of votes on IMDB
critics_score : Critics score on Rotten Tomatoes
audience_rating : Categorical variable for audience rating on Rotten Tomatoes (Spilled, Upright)
best_pic_nom : Whether or not the movie was nominated for a best picture Oscar (no, yes)
best_pic_win : Whether or not the movie won a best picture Oscar (no, yes)
best_actor_win : Whether or not one of the main actors in the movie ever won an Oscar (no, yes)
best_actress_win : Whether or not one of the main actresses in the movie ever won an Oscar (no, yes)
best_dir_win : Whether or not the director of the movie ever won an Oscar (no, yes)

#Selecting the variables of interest 

model <- select(movies, audience_score, title_type, genre, runtime, thtr_rel_year, thtr_rel_month,imdb_rating, imdb_num_votes, critics_score, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win) 

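#Removing NAs: na.omit() drops every row with at least one missing value
#(!is.na() would only return a TRUE/FALSE mask, not a cleaned table)
modeldata <- na.omit(model)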
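The difference between a missing-value mask and row-wise removal can be seen on a small toy example. This is only a sketch: the data frame `toy` and its values are hypothetical and are not taken from the movies data.

```r
# Hypothetical toy data frame (not the movies data) with one missing value.
toy <- data.frame(score   = c(80, 65, 42),
                  runtime = c(120, NA, 95))

# !is.na() returns a logical mask with the same shape as the input;
# it does not remove anything.
mask <- !is.na(toy)

# na.omit() drops every row that contains at least one NA.
clean_omit <- na.omit(toy)

# complete.cases() gives one TRUE/FALSE per row and can be used for indexing.
clean_cc <- toy[complete.cases(toy), ]

print(mask)
print(nrow(clean_omit))
print(nrow(clean_cc))
```

Both `na.omit()` and `complete.cases()` keep the same rows here; the second form is convenient when only some columns should be checked for missing values.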

Linear Regression - Assumptions:

Because the movies were randomly sampled, we can assume that the observations are independent, meeting the independence of observations assumption.

Linearity - I will check linearity between the dependent variable and the quantitative independent variables by analyzing the scatterplots in a paired matrix. We can also assess collinearity by analyzing the correlation coefficients between the variables.
Normality - I will check the distribution of the dependent variable as well as the normality of the residuals of the model (nearly normal residuals with mean 0, which will be checked later).
Homoscedasticity - The variance of the residuals is the same for any value of X. This assumption will be checked later, after modeling, by analyzing the residual diagnostic plots.

NORMALITY

summary(model$audience_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.00   46.00   65.00   62.36   80.00   97.00

#Histogram
histograma<-hist(model$audience_score, breaks = 10, col="lightsteelblue", border="midnightblue", xlim=c(0,100),main="Histogram of Audience Score", xlab="Audience Score (Points)", ylab="Frequency");
xfit<-seq(min(model$audience_score),max(model$audience_score))
yfit<-dnorm(xfit,mean=mean(model$audience_score),sd=sd(model$audience_score))
yfit <- yfit*diff(histograma$mids[1:2])*length(model$audience_score)
lines(xfit, yfit, col="aquamarine4", lwd=1)
abline(v = c(median(model$audience_score), mean(model$audience_score)),
       col = c("brown4", "lightsalmon1"),
       lwd = c(1,1), lty=c(1,2));
legend(x="topleft", #Position of the legend
       c("Median","Mean"), #Names on the legend
       cex=1, col=c("brown4","lightsalmon1"),lty=c(1,2),lwd=c(1,1))

#Boxplot
boxplot(model$audience_score,
        ylab="Audience Score (Points)",
        col="lightsteelblue3",
        border="midnightblue")

The variable of interest to be modeled is “audience_score”. Based on the summary statistics, the lowest score in the sample is 11 points and the highest is 97 points. The mean and the median are quite close (mean = 62.36 points; median = 65 points).

The boxplot and the histogram of the dependent variable (audience_score) were used as a first look at its distribution. Strictly, the normality assumption of a linear regression model (Gaussian errors with an identity link) applies to the residuals rather than to the dependent variable itself, so the residuals are checked after modeling.

Both plots showed reasonably good adherence to the Normal distribution. The histogram follows the normal curve fairly closely; the mean lies slightly below the median, which suggests a mild left skew. The boxplot is fairly symmetric, with no outliers.

LINEARITY

quantmodel <- select(model, audience_score, runtime, thtr_rel_year, thtr_rel_month, imdb_rating, imdb_num_votes, critics_score)

ggpairs(quantmodel)

Another assumption of linear regression is a LINEAR relationship between the dependent variable and the quantitative independent variables. We can check this assumption by analyzing the scatterplots of these variables against the audience score.

In the matrix above, besides the linear relationship between the variables, we can also check for multicollinearity, which means that two independent variables are highly correlated with each other. The highest correlation coefficient in the matrix was between the audience score (our dependent variable) and the IMDB rating (r=0.865). A high correlation with the dependent variable marks a strong candidate predictor rather than a collinearity problem. For the correlations among the independent variables, I used r>0.90 as the threshold for collinearity, and no pair reached it. Therefore, I decided to include the IMDB rating in the initial model and check its performance later in the modeling step.
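As a complement to reading pairwise correlations off the matrix, collinearity among predictors can also be summarized with variance inflation factors (VIF). The following is a minimal sketch on simulated data, not on the movies data: the predictors `x1`, `x2`, `x3` are hypothetical, and `x2` is deliberately constructed to be strongly correlated with `x1`.

```r
set.seed(1)
n <- 200

# Simulated predictors (hypothetical): x2 is built from x1, x3 is independent.
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# Pairwise correlations among the predictors only (response excluded).
print(round(cor(preds), 2))

# VIF for predictor j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  aux <- lm(reformulate(others, response = v), data = preds)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_manual, 2))
```

The `car` package loaded at the top of this report provides `vif()` for a fitted `lm` object, which computes the same kind of quantity directly from a model.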

The scatterplots against the dependent variable were mostly linear. Only “runtime” and “imdb_num_votes” did not show a clear linear trend. As the linearity assumption could not be met for these two variables, I decided to remove them from the further analyses (modeling).

Part 4: MODELING

#Selecting the variables to be included in the modeling

modeldta <- select(model, audience_score, title_type, genre, thtr_rel_year, thtr_rel_month,imdb_rating, critics_score, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win)

As the variables “runtime” and “imdb_num_votes” were excluded from the modeling analysis, I first selected only the variables that will be used in the further analyses.

The selection method that I used was “backward” elimination based on p-values. I start with the full model (all variables included) and, at each step, remove the variable with the highest p-value until reaching a parsimonious model with significant predictors.
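One practical point in backward elimination is that a categorical variable with several levels (such as genre) produces one coefficient p-value per level, while the variable is removed as a whole. A hedged sketch of testing each term as a unit with `drop1()` is shown below. The data are simulated and the names `g`, `x`, `z`, `y` are hypothetical; this illustrates the idea and is not the procedure used in this report.

```r
set.seed(42)
n <- 300

# Simulated data (hypothetical): a 4-level factor g and a numeric predictor x
# with real effects, plus a pure-noise predictor z.
dat <- data.frame(
  g = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE)),
  x = rnorm(n),
  z = rnorm(n)
)
dat$y <- 2 * dat$x + ifelse(dat$g == "d", 3, 0) + rnorm(n)

full <- lm(y ~ g + x + z, data = dat)

# One F-test per term: the factor g is tested as a whole (3 degrees of freedom)
# instead of through its individual level coefficients.
tab <- drop1(full, test = "F")
print(tab)

# Candidate for removal = term with the largest p-value (first row is "<none>").
pvals <- tab[["Pr(>F)"]][-1]
names(pvals) <- rownames(tab)[-1]
candidate <- names(which.max(pvals))
print(candidate)
```

Refitting without the candidate term and repeating the call gives one step of a term-wise backward elimination.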

#Full Model
model1 <- lm(audience_score~., data=modeldta)
summary(model1)

## 
## Call:
## lm(formula = audience_score ~ ., data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4277  -4.3491   0.5061   4.1344  24.2203 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    55.19999   50.75895   1.087   0.2772    
## title_typeFeature Film          2.37581    2.53638   0.937   0.3493    
## title_typeTV Movie              0.81661    4.00597   0.204   0.8385    
## genreAnimation                  3.97088    2.45491   1.618   0.1063    
## genreArt House & International -2.53867    2.02932  -1.251   0.2114    
## genreComedy                     1.70505    1.13050   1.508   0.1320    
## genreDocumentary                2.66034    2.69466   0.987   0.3239    
## genreDrama                     -0.82680    0.97237  -0.850   0.3955    
## genreHorror                    -1.92742    1.66889  -1.155   0.2486    
## genreMusical & Performing Arts  3.42674    2.32538   1.474   0.1411    
## genreMystery & Suspense        -3.22274    1.25716  -2.564   0.0106 *  
## genreOther                     -0.46069    1.94969  -0.236   0.8133    
## genreScience Fiction & Fantasy -0.25822    2.44190  -0.106   0.9158    
## thtr_rel_year                  -0.03378    0.02532  -1.334   0.1826    
## thtr_rel_month                 -0.16985    0.07735  -2.196   0.0285 *  
## imdb_rating                     9.41132    0.45614  20.632   <2e-16 ***
## critics_score                   0.02197    0.01524   1.441   0.1500    
## audience_ratingUpright         20.03581    0.78060  25.667   <2e-16 ***
## best_pic_nomyes                 4.18083    1.78590   2.341   0.0195 *  
## best_pic_winyes                -2.42457    3.11298  -0.779   0.4364    
## best_actor_winyes              -0.13828    0.80138  -0.173   0.8631    
## best_actress_winyes            -1.42723    0.89370  -1.597   0.1108    
## best_dir_winyes                 0.07818    1.17103   0.067   0.9468    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.83 on 628 degrees of freedom
## Multiple R-squared:  0.8898, Adjusted R-squared:  0.8859 
## F-statistic: 230.5 on 22 and 628 DF,  p-value: < 2.2e-16

#"best_dir_winyes" had the highest p-value: Removed for the next step
model2 <- lm(audience_score~title_type+genre+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actor_win+best_actress_win, data=modeldta)
summary(model2)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + thtr_rel_year + 
##     thtr_rel_month + imdb_rating + critics_score + audience_rating + 
##     best_pic_nom + best_pic_win + best_actor_win + best_actress_win, 
##     data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4305  -4.3552   0.5206   4.1319  24.2223 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    55.41022   50.62106   1.095   0.2741    
## title_typeFeature Film          2.38252    2.53238   0.941   0.3472    
## title_typeTV Movie              0.82028    4.00242   0.205   0.8377    
## genreAnimation                  3.96680    2.45221   1.618   0.1062    
## genreArt House & International -2.54395    2.02617  -1.256   0.2097    
## genreComedy                     1.70384    1.12946   1.509   0.1319    
## genreDocumentary                2.65846    2.69238   0.987   0.3238    
## genreDrama                     -0.82813    0.97139  -0.853   0.3942    
## genreHorror                    -1.92883    1.66744  -1.157   0.2478    
## genreMusical & Performing Arts  3.42733    2.32353   1.475   0.1407    
## genreMystery & Suspense        -3.22258    1.25616  -2.565   0.0105 *  
## genreOther                     -0.46246    1.94797  -0.237   0.8124    
## genreScience Fiction & Fantasy -0.25468    2.43940  -0.104   0.9169    
## thtr_rel_year                  -0.03389    0.02524  -1.343   0.1798    
## thtr_rel_month                 -0.16957    0.07717  -2.197   0.0284 *  
## imdb_rating                     9.41323    0.45489  20.694   <2e-16 ***
## critics_score                   0.02203    0.01520   1.449   0.1478    
## audience_ratingUpright         20.03333    0.77909  25.714   <2e-16 ***
## best_pic_nomyes                 4.17438    1.78187   2.343   0.0195 *  
## best_pic_winyes                -2.36316    2.97161  -0.795   0.4268    
## best_actor_winyes              -0.13467    0.79892  -0.169   0.8662    
## best_actress_winyes            -1.42637    0.89290  -1.597   0.1107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.824 on 629 degrees of freedom
## Multiple R-squared:  0.8898, Adjusted R-squared:  0.8861 
## F-statistic: 241.8 on 21 and 629 DF,  p-value: < 2.2e-16

#"best_actor_win" next variable to be removed in the next step
model3 <- lm(audience_score~title_type+genre+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actress_win, data=modeldta)
summary(model3)

## 
## Call:
## lm(formula = audience_score ~ title_type + genre + thtr_rel_year + 
##     thtr_rel_month + imdb_rating + critics_score + audience_rating + 
##     best_pic_nom + best_pic_win + best_actress_win, data = modeldta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.465  -4.378   0.530   4.140  24.235 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    55.01620   50.52805   1.089  0.27665    
## title_typeFeature Film          2.37573    2.53011   0.939  0.34810    
## title_typeTV Movie              0.83257    3.99867   0.208  0.83513    
## genreAnimation                  3.96398    2.45026   1.618  0.10621    
## genreArt House & International -2.52968    2.02284  -1.251  0.21156    
## genreComedy                     1.70642    1.12849   1.512  0.13100    
## genreDocumentary                2.66105    2.69025   0.989  0.32297    
## genreDrama                     -0.83435    0.96994  -0.860  0.39000    
## genreHorror                    -1.91298    1.66350  -1.150  0.25059    
## genreMusical & Performing Arts  3.42903    2.32171   1.477  0.14019    
## genreMystery & Suspense        -3.24064    1.25062  -2.591  0.00979 ** 
## genreOther                     -0.46776    1.94621  -0.240  0.81014    
## genreScience Fiction & Fantasy -0.23964    2.43588  -0.098  0.92166    
## thtr_rel_year                  -0.03369    0.02519  -1.337  0.18164    
## thtr_rel_month                 -0.17040    0.07695  -2.215  0.02715 *  
## imdb_rating                     9.40940    0.45397  20.727  < 2e-16 ***
## critics_score                   0.02202    0.01519   1.450  0.14763    
## audience_ratingUpright         20.04290    0.77643  25.814  < 2e-16 ***
## best_pic_nomyes                 4.13717    1.76678   2.342  0.01951 *  
## best_pic_winyes                -2.33725    2.96534  -0.788  0.43088    
## best_actress_winyes            -1.43827    0.88941  -1.617  0.10636    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.819 on 630 degrees of freedom
## Multiple R-squared:  0.8898, Adjusted R-squared:  0.8863 
## F-statistic: 254.3 on 20 and 630 DF,  p-value: < 2.2e-16

#"genre" next variable to be removed in the next step
model4 <- lm(audience_score~title_type+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actress_win, data=modeldta)
summary(model4)

## 
## Call:
## lm(formula = audience_score ~ title_type + thtr_rel_year + thtr_rel_month + 
##     imdb_rating + critics_score + audience_rating + best_pic_nom + 
##     best_pic_win + best_actress_win, data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.1248  -4.6419   0.5008   4.4517  24.6679 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            55.52388   50.62880   1.097   0.2732    
## title_typeFeature Film -1.09823    1.07369  -1.023   0.3068    
## title_typeTV Movie     -2.95786    3.26927  -0.905   0.3659    
## thtr_rel_year          -0.03147    0.02523  -1.247   0.2127    
## thtr_rel_month         -0.14331    0.07766  -1.845   0.0654 .  
## imdb_rating             9.04339    0.44760  20.204   <2e-16 ***
## critics_score           0.02036    0.01521   1.339   0.1810    
## audience_ratingUpright 20.56335    0.77024  26.697   <2e-16 ***
## best_pic_nomyes         3.99103    1.77953   2.243   0.0253 *  
## best_pic_winyes        -2.07882    2.99565  -0.694   0.4880    
## best_actress_winyes    -1.63699    0.88986  -1.840   0.0663 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.912 on 640 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.8832 
## F-statistic: 492.4 on 10 and 640 DF,  p-value: < 2.2e-16

#"best_pic_win" next variable to be removed in the next step
model5 <- lm(audience_score~title_type+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model5)

## 
## Call:
## lm(formula = audience_score ~ title_type + thtr_rel_year + thtr_rel_month + 
##     imdb_rating + critics_score + audience_rating + best_pic_nom + 
##     best_actress_win, data = modeldta)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.133  -4.618   0.426   4.430  24.631 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            54.67257   50.59347   1.081   0.2803    
## title_typeFeature Film -1.10643    1.07320  -1.031   0.3029    
## title_typeTV Movie     -2.95952    3.26795  -0.906   0.3655    
## thtr_rel_year          -0.03102    0.02521  -1.230   0.2190    
## thtr_rel_month         -0.14203    0.07761  -1.830   0.0677 .  
## imdb_rating             9.03594    0.44729  20.202   <2e-16 ***
## critics_score           0.02026    0.01520   1.333   0.1831    
## audience_ratingUpright 20.57081    0.76986  26.720   <2e-16 ***
## best_pic_nomyes         3.44840    1.59788   2.158   0.0313 *  
## best_actress_winyes    -1.67505    0.88781  -1.887   0.0597 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.909 on 641 degrees of freedom
## Multiple R-squared:  0.8849, Adjusted R-squared:  0.8833 
## F-statistic: 547.6 on 9 and 641 DF,  p-value: < 2.2e-16

#"title_type" next variable to be removed in the next step
model6 <- lm(audience_score~thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model6)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month + 
##     imdb_rating + critics_score + audience_rating + best_pic_nom + 
##     best_actress_win, data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9387  -4.5129   0.4377   4.3389  24.8079 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            44.90423   49.57334   0.906   0.3654    
## thtr_rel_year          -0.02692    0.02483  -1.084   0.2787    
## thtr_rel_month         -0.14190    0.07748  -1.831   0.0675 .  
## imdb_rating             9.10429    0.44347  20.530   <2e-16 ***
## critics_score           0.02215    0.01500   1.477   0.1401    
## audience_ratingUpright 20.57956    0.76862  26.775   <2e-16 ***
## best_pic_nomyes         3.26258    1.58578   2.057   0.0401 *  
## best_actress_winyes    -1.77618    0.88330  -2.011   0.0448 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.906 on 643 degrees of freedom
## Multiple R-squared:  0.8846, Adjusted R-squared:  0.8834 
## F-statistic: 704.4 on 7 and 643 DF,  p-value: < 2.2e-16

#"thtr_rel_year" next variable to be removed in the next step
model7 <- lm(audience_score~thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model7)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating + 
##     critics_score + audience_rating + best_pic_nom + best_actress_win, 
##     data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.3836  -4.5036   0.4579   4.4158  24.6069 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -8.78659    2.25412  -3.898 0.000107 ***
## thtr_rel_month         -0.14230    0.07749  -1.836 0.066774 .  
## imdb_rating             9.07664    0.44280  20.498  < 2e-16 ***
## critics_score           0.02353    0.01495   1.574 0.115915    
## audience_ratingUpright 20.59801    0.76854  26.802  < 2e-16 ***
## best_pic_nomyes         3.31645    1.58521   2.092 0.036819 *  
## best_actress_winyes    -1.74852    0.88305  -1.980 0.048119 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.907 on 644 degrees of freedom
## Multiple R-squared:  0.8844, Adjusted R-squared:  0.8833 
## F-statistic: 821.4 on 6 and 644 DF,  p-value: < 2.2e-16

#"critics_score" next variable to be removed in the next step
model8 <- lm(audience_score~thtr_rel_month+imdb_rating+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model8)

## 
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating + 
##     audience_rating + best_pic_nom + best_actress_win, data = modeldta)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.8460  -4.5718   0.5043   4.4143  24.5659 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -10.22759    2.06227  -4.959 9.05e-07 ***
## thtr_rel_month          -0.14736    0.07751  -1.901   0.0577 .  
## imdb_rating              9.49944    0.35245  26.953  < 2e-16 ***
## audience_ratingUpright  20.73992    0.76411  27.143  < 2e-16 ***
## best_pic_nomyes          3.44325    1.58498   2.172   0.0302 *  
## best_actress_winyes     -1.74052    0.88405  -1.969   0.0494 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.915 on 645 degrees of freedom
## Multiple R-squared:  0.884,  Adjusted R-squared:  0.8831 
## F-statistic: 982.9 on 5 and 645 DF,  p-value: < 2.2e-16

After eight models, we reached a parsimonious model in which most variables are statistically significant predictors of audience_score. Although “thtr_rel_month” was not significant at the 0.05 level (p = 0.0577), I decided to keep it in the final model because removing it affected the significance of the other predictors. Besides, its p-value is borderline (very close to the significance cutoff), and its inclusion did not affect the final adjusted R2.

Another important parameter is the adjusted R2. During the modeling process its value barely changed (0.8859 for the full model versus 0.8831 for the final model), which indicates that the variables removed during backward elimination contributed little explanatory power. Therefore, we have a final model that is parsimonious, significant, and with a high adjusted R2.
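The adjusted R2 penalizes R2 for the number of estimated coefficients. As a rough check, assuming the standard formula with n observations and k predictor coefficients, the value reported for model8 can be reproduced from the rounded output (n = 651, obtained from 645 residual degrees of freedom plus 6 estimated coefficients; k = 5; R2 = 0.884):

```latex
\begin{aligned}
R^2_{\text{adj}} &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1} \\
                 &= 1 - (1 - 0.884)\,\frac{651 - 1}{651 - 5 - 1} \\
                 &= 1 - 0.116 \times \frac{650}{645} \approx 0.8831
\end{aligned}
```

The same calculation for the full model (R2 = 0.8898, k = 22) gives approximately 0.8859, in agreement with the summary output.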

Model Diagnostics

par(mfrow = c(1, 2))

#Nearly Normal Residuals
hist(model8$residuals, main='Histogram of Residuals')
qqnorm(model8$residuals,main='Normal Probability Plot of Residuals')
qqline(model8$residuals)

par(mfrow = c(1, 1))

#Homoscedasticity (Constant variability of residuals)
plot(model8$residuals~model8$fitted,main='Residuals vs.Predicted (fitted)')
abline(0,0)

To check the remaining assumptions of the linear regression model, I inspected diagnostic plots. First, I analyzed whether the residuals of the model are nearly normally distributed. As we can see in the histogram and Q-Q plot, the residuals presented a nearly normal distribution. There are some points above the normal line in the Q-Q plot, as well as a slightly right-skewed distribution, but overall the residuals do not show a marked departure from normality.

Homoscedasticity was assessed by analyzing the plot of the residuals against the predicted (fitted) values. The model seems to be homoscedastic because the points are evenly scattered around zero, which means that the variability of the residuals is roughly constant. Even though the points seem to form some clouds, the variability remains constant around zero.
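The visual check can be complemented with a formal test. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test in base R on simulated data. The helper name `bp_pvalue` and the simulated variables are hypothetical, and this test was not part of the original analysis.

```r
set.seed(7)
n <- 500
x <- runif(n, 1, 10)

# Two simulated responses (hypothetical data):
# constant error spread vs. spread that grows with x.
y_const <- 1 + 2 * x + rnorm(n, sd = 2)
y_fan   <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)

# Studentized Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the predictors, compared with a chi-square distribution whose
# degrees of freedom equal the number of predictor coefficients.
bp_pvalue <- function(fit) {
  aux_data <- model.frame(fit)[-1]        # predictors only (response dropped)
  aux_data$e2 <- residuals(fit)^2
  aux <- lm(e2 ~ ., data = aux_data)
  stat <- nobs(fit) * summary(aux)$r.squared
  df <- length(coef(aux)) - 1
  pchisq(stat, df = df, lower.tail = FALSE)
}

p_const <- bp_pvalue(lm(y_const ~ x))
p_fan   <- bp_pvalue(lm(y_fan ~ x))
print(c(constant = p_const, fan_shaped = p_fan))
```

A small p-value, as expected for the fan-shaped response, would indicate non-constant residual variance. Applying the helper to the final model would be `bp_pvalue(model8)`.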

Interpreting the model coefficients

IMDB rating: All else held constant, for each 1 point increase in IMDB rating the model predicts the audience score to be greater on average by 9.50 points.
Audience Rating: All else held constant, the model predicts that upright movies are, on average, 20.74 points greater in audience score than spilled movies.
Best Picture “Oscar” Nomination: All else held constant, the model predicts that movies that were nominated for a best picture Oscar are, on average, 3.44 points greater in audience score than movies that were not nominated.
Best Actress “Oscar” Winner: All else held constant, the model predicts that movies casting Oscar-winning actresses are, on average, 1.74 points lower in audience score than movies that do not cast Oscar-winning actresses.
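For reference, the fitted equation implied by the summary(model8) output can be written out as follows. The coefficients are rounded to two decimals, and the square brackets denote indicator variables equal to 1 when the condition holds and 0 otherwise (a notational choice made here, not taken from the course material):

```latex
\begin{aligned}
\widehat{\text{audience\_score}} ={}& -10.23
  - 0.15\,\text{thtr\_rel\_month}
  + 9.50\,\text{imdb\_rating} \\
  & + 20.74\,[\text{audience\_rating} = \text{Upright}]
  + 3.44\,[\text{best\_pic\_nom} = \text{yes}] \\
  & - 1.74\,[\text{best\_actress\_win} = \text{yes}]
\end{aligned}
```

The intercept (-10.23) is the predicted score when every numeric predictor is 0 and every indicator is at its reference level, a combination outside the observed data, so it has no direct interpretation.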

Part 5: PREDICTION

For the prediction task, I decided to test two movies that were not in the original modeling database - “Mad Max: Fury Road” (2015) and “Nurse Betty” (2000). I chose two movies with different characteristics, as well as different audience scores on the Rotten Tomatoes website. Mad Max: Fury Road has a higher score, a best picture nomination, and a higher IMDB rating. On the other hand, Nurse Betty is a movie with a lower score, categorized as “Spilled” by the audience.

The information about each movie used as input for the prediction can be found on the IMDB and Rotten Tomatoes websites:

Mad Max: Fury Road

IMDB: https://www.imdb.com/title/tt1392190/

Rotten Tomatoes: https://www.rottentomatoes.com/m/mad_max_fury_road

Nurse Betty

IMDB: https://www.imdb.com/title/tt0171580/?ref_=nv_sr_srsg_0

Rotten Tomatoes: https://www.rottentomatoes.com/m/nurse_betty

#Predicting the audience score of each movie (95% prediction interval)
madmax <- data.frame(thtr_rel_month = 5, imdb_rating = 8.1, audience_rating = "Upright", best_pic_nom = "yes", best_actress_win = "yes")
predict(model8, madmax, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 88.42373 74.48485 102.3626

nurse <- data.frame(thtr_rel_month = 12, imdb_rating = 6.1, audience_rating = "Spilled", best_pic_nom = "no", best_actress_win = "yes")
predict(model8, nurse, interval = "prediction", level = 0.95)

##        fit      lwr      upr
## 1 44.21013 30.49373 57.92653

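The actual audience scores on the Rotten Tomatoes website were 85% for “Mad Max: Fury Road” and 45% for “Nurse Betty”, and the model predicted 88.4% and 44.2%, respectively. Both actual scores fall well within the 95% prediction intervals, so the model predicted these two movies accurately. The intervals are wide (roughly 14 points on each side), and the upper bound for “Mad Max: Fury Road” (102.4) exceeds the maximum possible score of 100, so two examples illustrate the model rather than validate it.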
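The point predictions can be reproduced by hand from the coefficients printed in the summary(model8) output. The sketch below copies those coefficients as plain numbers, so it runs without the data; the helper name `predict_by_hand` and its argument names are hypothetical.

```r
# Coefficients copied from the summary(model8) output.
coefs <- c(intercept              = -10.22759,
           thtr_rel_month         = -0.14736,
           imdb_rating            =  9.49944,
           audience_ratingUpright = 20.73992,
           best_pic_nomyes        =  3.44325,
           best_actress_winyes    = -1.74052)

# Hypothetical helper: categorical inputs are passed as 0/1 indicators.
predict_by_hand <- function(month, imdb, upright, nominated, actress) {
  x <- c(1, month, imdb, upright, nominated, actress)
  sum(coefs * x)
}

# Mad Max: Fury Road: May release, IMDB 8.1, Upright, nominated, actress winner.
madmax_fit <- predict_by_hand(5, 8.1, 1, 1, 1)

# Nurse Betty: December release, IMDB 6.1, Spilled, not nominated, actress winner.
nurse_fit <- predict_by_hand(12, 6.1, 0, 0, 1)

print(round(c(madmax = madmax_fit, nurse = nurse_fit), 2))
# madmax  nurse
#  88.42  44.21
```

These values agree with the `fit` column returned by `predict()`. The prediction interval bounds additionally depend on the residual standard error and the design matrix, so they are not reproduced here.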

Part 6: CONCLUSION

The initial aim of this project was to investigate the factors associated with a movie’s audience score (audience_score). The final model identified 4 factors that were statistically significantly associated with the dependent variable, adjusted for the month of release of the movie.

IMDB rating
Audience Rating
Best Picture “Oscar” Nomination
Best Actress “Oscar” Winner

Together with the month of release, these factors explain 88.31% (Adjusted R2 = 0.8831) of the variance in the dependent variable (audience_score), which means that 11.69% of the variance in a movie’s audience score could not be explained by this model.

August 25, 2020