Data analysis project for the Linear Regression and Modeling course by Duke University (Coursera)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)
library(GGally)
library(car)
load("movies.RData")
This project investigates which attributes make a movie popular.
The data set comprises 651 randomly sampled movies produced and released before 2016, with information from Rotten Tomatoes and IMDB for each movie.
Because the movies were randomly sampled, the results can be generalized to movies released before 2016.
However, we are not able to evaluate causality. This is an observational study, not a controlled experiment with random assignment, so we can assess correlation and association but not causation.
The opinions of the public and of critics about a movie can diverge. In fact, some movies praised by critics may be criticized or poorly rated by the general public. One example is when an Oscar winner comes as a surprise to the general public. Therefore, for the general audience, what are the characteristics of a good movie? What do they take into consideration when rating a movie?
The first part of the EDA is to clean the database so that it includes only the variables of interest. The codebook for the entire database is available in the GitHub repository for the course: https://github.com/ldbatista/Statistics-with-R.
Many variables are for information purposes only, such as the URL of the movie on the IMDB and Rotten Tomatoes websites. These variables are not relevant for modeling, so they were removed from the following analyses.
The following variables were then selected for the further analyses:
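DEPENDENT VARIABLE: audience_score
INDEPENDENT VARIABLES: title_type, genre, runtime, thtr_rel_year, thtr_rel_month, imdb_rating, imdb_num_votes, critics_score, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win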
#Selecting the variables of interest
model <- select(movies, audience_score, title_type, genre, runtime, thtr_rel_year, thtr_rel_month,imdb_rating, imdb_num_votes, critics_score, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win)
#Removing rows with missing values (NAs)
modeldata <- na.omit(model)
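Before dropping rows, it can help to count how many observations contain missing values. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and are not taken from the movies data); it uses only base R.

```r
# Toy data frame standing in for the selected movie variables (hypothetical values)
toy <- data.frame(
  audience_score = c(73, 81, NA, 35),
  runtime        = c(80, NA, 94, 119)
)

# TRUE for rows with no missing value in any column
complete_rows <- complete.cases(toy)

# Number of rows that would be dropped
sum(!complete_rows)

# Keep only the complete rows; columns and values are preserved
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

Unlike `!is.na()`, which returns a table of TRUE/FALSE flags, `na.omit()` returns the data themselves with the incomplete rows removed.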
We can assume that the observations are independent, meeting the independence of observations assumption.
Linearity - I will check linearity between the dependent variable and the quantitative independent variables by analyzing the scatterplots in a paired matrix. We can also assess collinearity by analyzing the correlation coefficients between the variables.
Normality - I will check normality for the dependent variable as well as for the residuals of the model (nearly normal residuals with mean 0, which will be tested later).
Homoscedasticity - The variance of the residuals is the same for any value of X. This assumption will be checked later, after modeling, by analyzing the residual diagnostic plots.
NORMALITY
summary(model$audience_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 46.00 65.00 62.36 80.00 97.00
#Histogram
histograma <- hist(model$audience_score, breaks = 10, col = "lightsteelblue", border = "midnightblue", xlim = c(0, 100), main = "Histogram of Audience Score", xlab = "Audience Score (Points)", ylab = "Frequency")
#Normal curve with the sample mean and SD, scaled from density to counts (bin width * n)
xfit <- seq(min(model$audience_score), max(model$audience_score))
yfit <- dnorm(xfit, mean = mean(model$audience_score), sd = sd(model$audience_score))
yfit <- yfit * diff(histograma$mids[1:2]) * length(model$audience_score)
lines(xfit, yfit, col = "aquamarine4", lwd = 1)
#Vertical lines for the median (solid) and the mean (dashed)
abline(v = c(median(model$audience_score), mean(model$audience_score)),
       col = c("brown4", "lightsalmon1"),
       lwd = c(1, 1), lty = c(1, 2))
legend(x = "topleft", #Position of the legend
       legend = c("Median", "Mean"), #Names on the legend
       cex = 1, col = c("brown4", "lightsalmon1"), lty = c(1, 2), lwd = c(1, 1))
#Boxplot
boxplot(model$audience_score,
ylab="Audience Score (Points)",
col="lightsteelblue3",
border="midnightblue")
The variable of interest to be modeled is “audience_score”. Based on the summary statistics, the lowest-scoring movie has 11 points and the highest-scoring movie has 97 points. The mean and the median are quite close (mean = 62.36 points; median = 65 points).
The boxplot and the histogram of the dependent variable (audience_score) were used as a first check of normality. Strictly, a linear regression model with Gaussian errors requires the residuals, not the response itself, to be nearly normal; the residuals are checked after modeling.
Both plots show reasonably good adherence to the normal distribution: the histogram follows a nearly normal curve, and the boxplot is fairly symmetric, with no outliers.
LINEARITY
quantmodel <- select(model, audience_score, runtime, thtr_rel_year, thtr_rel_month, imdb_rating, imdb_num_votes, critics_score)
ggpairs(quantmodel)
Another assumption of linear regression is a LINEAR relationship between the dependent variable and the quantitative independent variables. We can check this assumption by analyzing the scatterplot of each of these variables against the audience score.
Besides the linear relationships, the matrix above also lets us check for multicollinearity, i.e., pairs of highly correlated variables. The highest correlation coefficient was between the audience score (our dependent variable) and the IMDB rating (r=0.865). I used r>0.90 as the threshold for collinearity; since no coefficient exceeded it, I kept this variable in the initial model and will check its performance later in the modeling stage.
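The threshold rule described above can also be applied programmatically instead of reading the coefficients off the plot. The sketch below uses simulated data (`x1`, `x2`, `x3` are made-up variables, with `x2` built to be nearly a copy of `x1`) and base R only; the 0.90 cutoff is the one chosen in the text.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly correlated with it
x3 <- rnorm(n)                  # unrelated to the other two
predictors <- data.frame(x1, x2, x3)

# Pairwise Pearson correlations
r <- cor(predictors)

# Positions in the upper triangle (each pair once) whose |r| exceeds the threshold
high <- which(abs(r) > 0.90 & upper.tri(r), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
flagged
```

Only the pair (`x1`, `x2`) is flagged here; on real data the same code would be run on the quantitative predictors only.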
#Selecting the variables to be included in the modeling
modeldta <- select(model, audience_score, title_type, genre, thtr_rel_year, thtr_rel_month,imdb_rating, critics_score, audience_rating, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, best_dir_win)
As the variables “runtime” and “imdb_num_votes” were excluded from the modeling analysis, I first selected only the variables that will be used in the further analyses.
The selection method I used was “backward” elimination: I start with the full model (all variables included) and, at each step, remove the variable with the highest p-value until I reach a parsimonious model with significant predictors.
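The elimination loop described above can also be automated. The following is only a sketch under stated assumptions: it runs on the built-in `mtcars` data rather than the movies data, `backward_by_p` is a hypothetical helper name, and it uses `drop1()` with an F-test, which tests each categorical variable as a whole instead of level by level (a variation on reading p-values from `summary()`).

```r
# Stand-in data: the built-in mtcars set, with cyl treated as a categorical predictor
cars <- transform(mtcars, cyl = factor(cyl))

# Hypothetical helper: repeatedly drop the term with the largest p-value
backward_by_p <- function(fit, alpha = 0.05) {
  repeat {
    tests <- drop1(fit, test = "F")       # one F-test per term; a factor is tested as a whole
    p <- tests[["Pr(>F)"]][-1]            # first row is "<none>", which has no p-value
    candidates <- rownames(tests)[-1]
    if (length(p) == 0 || max(p) < alpha) break
    worst <- candidates[which.max(p)]
    fit <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

full  <- lm(mpg ~ wt + hp + qsec + drat + cyl, data = cars)
final <- backward_by_p(full)
formula(final)
```

Every term left in `final` has a p-value below `alpha`. The manual procedure used in this project allows judgment calls at each step, which a strict loop like this one does not.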
#Full Model
model1 <- lm(audience_score~., data=modeldta)
summary(model1)
##
## Call:
## lm(formula = audience_score ~ ., data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4277 -4.3491 0.5061 4.1344 24.2203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.19999 50.75895 1.087 0.2772
## title_typeFeature Film 2.37581 2.53638 0.937 0.3493
## title_typeTV Movie 0.81661 4.00597 0.204 0.8385
## genreAnimation 3.97088 2.45491 1.618 0.1063
## genreArt House & International -2.53867 2.02932 -1.251 0.2114
## genreComedy 1.70505 1.13050 1.508 0.1320
## genreDocumentary 2.66034 2.69466 0.987 0.3239
## genreDrama -0.82680 0.97237 -0.850 0.3955
## genreHorror -1.92742 1.66889 -1.155 0.2486
## genreMusical & Performing Arts 3.42674 2.32538 1.474 0.1411
## genreMystery & Suspense -3.22274 1.25716 -2.564 0.0106 *
## genreOther -0.46069 1.94969 -0.236 0.8133
## genreScience Fiction & Fantasy -0.25822 2.44190 -0.106 0.9158
## thtr_rel_year -0.03378 0.02532 -1.334 0.1826
## thtr_rel_month -0.16985 0.07735 -2.196 0.0285 *
## imdb_rating 9.41132 0.45614 20.632 <2e-16 ***
## critics_score 0.02197 0.01524 1.441 0.1500
## audience_ratingUpright 20.03581 0.78060 25.667 <2e-16 ***
## best_pic_nomyes 4.18083 1.78590 2.341 0.0195 *
## best_pic_winyes -2.42457 3.11298 -0.779 0.4364
## best_actor_winyes -0.13828 0.80138 -0.173 0.8631
## best_actress_winyes -1.42723 0.89370 -1.597 0.1108
## best_dir_winyes 0.07818 1.17103 0.067 0.9468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.83 on 628 degrees of freedom
## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8859
## F-statistic: 230.5 on 22 and 628 DF, p-value: < 2.2e-16
#"best_dir_winyes" had the highest p-value: Removed for the next step
model2 <- lm(audience_score~title_type+genre+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actor_win+best_actress_win, data=modeldta)
summary(model2)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + thtr_rel_year +
## thtr_rel_month + imdb_rating + critics_score + audience_rating +
## best_pic_nom + best_pic_win + best_actor_win + best_actress_win,
## data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4305 -4.3552 0.5206 4.1319 24.2223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.41022 50.62106 1.095 0.2741
## title_typeFeature Film 2.38252 2.53238 0.941 0.3472
## title_typeTV Movie 0.82028 4.00242 0.205 0.8377
## genreAnimation 3.96680 2.45221 1.618 0.1062
## genreArt House & International -2.54395 2.02617 -1.256 0.2097
## genreComedy 1.70384 1.12946 1.509 0.1319
## genreDocumentary 2.65846 2.69238 0.987 0.3238
## genreDrama -0.82813 0.97139 -0.853 0.3942
## genreHorror -1.92883 1.66744 -1.157 0.2478
## genreMusical & Performing Arts 3.42733 2.32353 1.475 0.1407
## genreMystery & Suspense -3.22258 1.25616 -2.565 0.0105 *
## genreOther -0.46246 1.94797 -0.237 0.8124
## genreScience Fiction & Fantasy -0.25468 2.43940 -0.104 0.9169
## thtr_rel_year -0.03389 0.02524 -1.343 0.1798
## thtr_rel_month -0.16957 0.07717 -2.197 0.0284 *
## imdb_rating 9.41323 0.45489 20.694 <2e-16 ***
## critics_score 0.02203 0.01520 1.449 0.1478
## audience_ratingUpright 20.03333 0.77909 25.714 <2e-16 ***
## best_pic_nomyes 4.17438 1.78187 2.343 0.0195 *
## best_pic_winyes -2.36316 2.97161 -0.795 0.4268
## best_actor_winyes -0.13467 0.79892 -0.169 0.8662
## best_actress_winyes -1.42637 0.89290 -1.597 0.1107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.824 on 629 degrees of freedom
## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8861
## F-statistic: 241.8 on 21 and 629 DF, p-value: < 2.2e-16
#"best_actor_win" next variable to be removed in the next step
model3 <- lm(audience_score~title_type+genre+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actress_win, data=modeldta)
summary(model3)
##
## Call:
## lm(formula = audience_score ~ title_type + genre + thtr_rel_year +
## thtr_rel_month + imdb_rating + critics_score + audience_rating +
## best_pic_nom + best_pic_win + best_actress_win, data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.465 -4.378 0.530 4.140 24.235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.01620 50.52805 1.089 0.27665
## title_typeFeature Film 2.37573 2.53011 0.939 0.34810
## title_typeTV Movie 0.83257 3.99867 0.208 0.83513
## genreAnimation 3.96398 2.45026 1.618 0.10621
## genreArt House & International -2.52968 2.02284 -1.251 0.21156
## genreComedy 1.70642 1.12849 1.512 0.13100
## genreDocumentary 2.66105 2.69025 0.989 0.32297
## genreDrama -0.83435 0.96994 -0.860 0.39000
## genreHorror -1.91298 1.66350 -1.150 0.25059
## genreMusical & Performing Arts 3.42903 2.32171 1.477 0.14019
## genreMystery & Suspense -3.24064 1.25062 -2.591 0.00979 **
## genreOther -0.46776 1.94621 -0.240 0.81014
## genreScience Fiction & Fantasy -0.23964 2.43588 -0.098 0.92166
## thtr_rel_year -0.03369 0.02519 -1.337 0.18164
## thtr_rel_month -0.17040 0.07695 -2.215 0.02715 *
## imdb_rating 9.40940 0.45397 20.727 < 2e-16 ***
## critics_score 0.02202 0.01519 1.450 0.14763
## audience_ratingUpright 20.04290 0.77643 25.814 < 2e-16 ***
## best_pic_nomyes 4.13717 1.76678 2.342 0.01951 *
## best_pic_winyes -2.33725 2.96534 -0.788 0.43088
## best_actress_winyes -1.43827 0.88941 -1.617 0.10636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.819 on 630 degrees of freedom
## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8863
## F-statistic: 254.3 on 20 and 630 DF, p-value: < 2.2e-16
#"genre" next variable to be removed in the next step
model4 <- lm(audience_score~title_type+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_pic_win+best_actress_win, data=modeldta)
summary(model4)
##
## Call:
## lm(formula = audience_score ~ title_type + thtr_rel_year + thtr_rel_month +
## imdb_rating + critics_score + audience_rating + best_pic_nom +
## best_pic_win + best_actress_win, data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.1248 -4.6419 0.5008 4.4517 24.6679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.52388 50.62880 1.097 0.2732
## title_typeFeature Film -1.09823 1.07369 -1.023 0.3068
## title_typeTV Movie -2.95786 3.26927 -0.905 0.3659
## thtr_rel_year -0.03147 0.02523 -1.247 0.2127
## thtr_rel_month -0.14331 0.07766 -1.845 0.0654 .
## imdb_rating 9.04339 0.44760 20.204 <2e-16 ***
## critics_score 0.02036 0.01521 1.339 0.1810
## audience_ratingUpright 20.56335 0.77024 26.697 <2e-16 ***
## best_pic_nomyes 3.99103 1.77953 2.243 0.0253 *
## best_pic_winyes -2.07882 2.99565 -0.694 0.4880
## best_actress_winyes -1.63699 0.88986 -1.840 0.0663 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.912 on 640 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8832
## F-statistic: 492.4 on 10 and 640 DF, p-value: < 2.2e-16
#"best_pic_win" next variable to be removed in the next step
model5 <- lm(audience_score~title_type+thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model5)
##
## Call:
## lm(formula = audience_score ~ title_type + thtr_rel_year + thtr_rel_month +
## imdb_rating + critics_score + audience_rating + best_pic_nom +
## best_actress_win, data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.133 -4.618 0.426 4.430 24.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.67257 50.59347 1.081 0.2803
## title_typeFeature Film -1.10643 1.07320 -1.031 0.3029
## title_typeTV Movie -2.95952 3.26795 -0.906 0.3655
## thtr_rel_year -0.03102 0.02521 -1.230 0.2190
## thtr_rel_month -0.14203 0.07761 -1.830 0.0677 .
## imdb_rating 9.03594 0.44729 20.202 <2e-16 ***
## critics_score 0.02026 0.01520 1.333 0.1831
## audience_ratingUpright 20.57081 0.76986 26.720 <2e-16 ***
## best_pic_nomyes 3.44840 1.59788 2.158 0.0313 *
## best_actress_winyes -1.67505 0.88781 -1.887 0.0597 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.909 on 641 degrees of freedom
## Multiple R-squared: 0.8849, Adjusted R-squared: 0.8833
## F-statistic: 547.6 on 9 and 641 DF, p-value: < 2.2e-16
#"title_type" next variable to be removed in the next step
model6 <- lm(audience_score~thtr_rel_year+thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model6)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_year + thtr_rel_month +
## imdb_rating + critics_score + audience_rating + best_pic_nom +
## best_actress_win, data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9387 -4.5129 0.4377 4.3389 24.8079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.90423 49.57334 0.906 0.3654
## thtr_rel_year -0.02692 0.02483 -1.084 0.2787
## thtr_rel_month -0.14190 0.07748 -1.831 0.0675 .
## imdb_rating 9.10429 0.44347 20.530 <2e-16 ***
## critics_score 0.02215 0.01500 1.477 0.1401
## audience_ratingUpright 20.57956 0.76862 26.775 <2e-16 ***
## best_pic_nomyes 3.26258 1.58578 2.057 0.0401 *
## best_actress_winyes -1.77618 0.88330 -2.011 0.0448 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.906 on 643 degrees of freedom
## Multiple R-squared: 0.8846, Adjusted R-squared: 0.8834
## F-statistic: 704.4 on 7 and 643 DF, p-value: < 2.2e-16
#"thtr_rel_year" next variable to be removed in the next step
model7 <- lm(audience_score~thtr_rel_month+imdb_rating+critics_score+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model7)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating +
## critics_score + audience_rating + best_pic_nom + best_actress_win,
## data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3836 -4.5036 0.4579 4.4158 24.6069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.78659 2.25412 -3.898 0.000107 ***
## thtr_rel_month -0.14230 0.07749 -1.836 0.066774 .
## imdb_rating 9.07664 0.44280 20.498 < 2e-16 ***
## critics_score 0.02353 0.01495 1.574 0.115915
## audience_ratingUpright 20.59801 0.76854 26.802 < 2e-16 ***
## best_pic_nomyes 3.31645 1.58521 2.092 0.036819 *
## best_actress_winyes -1.74852 0.88305 -1.980 0.048119 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.907 on 644 degrees of freedom
## Multiple R-squared: 0.8844, Adjusted R-squared: 0.8833
## F-statistic: 821.4 on 6 and 644 DF, p-value: < 2.2e-16
#"critics_score" next variable to be removed in the next step
model8 <- lm(audience_score~thtr_rel_month+imdb_rating+audience_rating+best_pic_nom+best_actress_win, data=modeldta)
summary(model8)
##
## Call:
## lm(formula = audience_score ~ thtr_rel_month + imdb_rating +
## audience_rating + best_pic_nom + best_actress_win, data = modeldta)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.8460 -4.5718 0.5043 4.4143 24.5659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.22759 2.06227 -4.959 9.05e-07 ***
## thtr_rel_month -0.14736 0.07751 -1.901 0.0577 .
## imdb_rating 9.49944 0.35245 26.953 < 2e-16 ***
## audience_ratingUpright 20.73992 0.76411 27.143 < 2e-16 ***
## best_pic_nomyes 3.44325 1.58498 2.172 0.0302 *
## best_actress_winyes -1.74052 0.88405 -1.969 0.0494 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.915 on 645 degrees of freedom
## Multiple R-squared: 0.884, Adjusted R-squared: 0.8831
## F-statistic: 982.9 on 5 and 645 DF, p-value: < 2.2e-16
After eight models we reached a parsimonious one in which most variables are statistically significant predictors of audience_score. Although “thtr_rel_month” was not significant, I kept it in the final model because removing it affected the significance of the other predictors. Besides, its p-value was borderline (0.0577, very close to the 0.05 cutoff), and its inclusion did not affect the final adjusted R2.
Another important parameter is the adjusted R2. During the modeling process, its value did not change substantially (from 0.8859 in the full model to 0.8831 in the final model), which indicates that the variables removed during backward elimination contributed little explanatory power. Therefore, we have a final model that is parsimonious, significant, and with a high adjusted R2.
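As a worked check, the adjusted R2 can be recomputed from the multiple R2 with the standard formula 1 - (1 - R2)(n - 1)/(n - p - 1). The sketch below plugs in the figures printed for the final model (R2 = 0.884, n = 651 movies, p = 5 coefficients besides the intercept); `adj_r2` is a hypothetical helper name, and the input R2 is itself rounded in the output above.

```r
# Adjusted R-squared from the multiple R-squared,
# n observations and p slope coefficients (intercept not counted)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Final model: R2 = 0.884, 645 residual df + 5 slopes + 1 intercept = 651 observations
round(adj_r2(r2 = 0.884, n = 651, p = 5), 4)   # 0.8831
```

The penalty factor (n - 1)/(n - p - 1) is close to 1 here because the sample is large relative to the number of predictors, which is why R2 and adjusted R2 are nearly identical.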
par(mfrow = c(1, 2))
#Nearly Normal Residuals
hist(model8$residuals, main='Histogram of Residuals')
qqnorm(model8$residuals,main='Normal Probability Plot of Residuals')
qqline(model8$residuals)
par(mfrow = c(1, 1))
#Homoscedasticity (Constant variability of residuals)
plot(model8$residuals~model8$fitted,main='Residuals vs.Predicted (fitted)')
abline(0,0)
To check the remaining assumptions of the linear regression model, I examined diagnostic plots of the residuals. First, I checked whether the residuals of the model are nearly normally distributed. As the histogram and Q-Q plot show, the residuals follow a nearly normal distribution. There are some points above the normal line in the Q-Q plot, as well as a slightly right-skewed distribution, but overall the residuals do not show a biased distribution.
IMDB rating: All else held constant, for each 1 point increase in IMDB rating the model predicts the audience score to be greater on average by 9.50 points.
Audience Rating: All else held constant, the model predicts that upright movies are, on average, 20.74 points greater in audience score than spilled movies.
Best Picture “Oscar” Nomination: All else held constant, the model predicts that movies that were nominated for a best picture Oscar are, on average, 3.44 points greater in audience score than movies that were not nominated.
Best Actress “Oscar” Winner: All else held constant, the model predicts that movies casting Oscar-winning actresses are, on average, 1.74 points lower in audience score than movies that do not cast Oscar-winning actresses.
For the prediction task, I decided to test two movies that were not in the original modeling database: “Mad Max: Fury Road” (2015) and “Nurse Betty” (2000). I chose two movies with different characteristics, as well as different audience scores on the Rotten Tomatoes website. Mad Max: Fury Road has a higher score, a best picture nomination, and a higher IMDB rating. Nurse Betty, on the other hand, has a lower score and is categorized as “Spilled” by the audience.
The information about each movie used for the prediction can be found on the IMDB and Rotten Tomatoes websites:
Mad Max: Fury Road
IMDB: https://www.imdb.com/title/tt1392190/
Rotten Tomatoes: https://www.rottentomatoes.com/m/mad_max_fury_road
Nurse Betty
IMDB: https://www.imdb.com/title/tt0171580/?ref_=nv_sr_srsg_0
Rotten Tomatoes: https://www.rottentomatoes.com/m/nurse_betty
#Predicting the audience score of each movie with a 95% prediction interval
madmax <- data.frame(thtr_rel_month = 5, imdb_rating = 8.1, audience_rating = "Upright", best_pic_nom = "yes", best_actress_win = "yes")
predict(model8, madmax, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 88.42373 74.48485 102.3626
nurse <- data.frame(thtr_rel_month = 12, imdb_rating = 6.1, audience_rating = "Spilled", best_pic_nom = "no", best_actress_win = "yes")
predict(model8, nurse, interval = "prediction", level = 0.95)
## fit lwr upr
## 1 44.21013 30.49373 57.92653
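The fitted values returned by predict() can be reproduced by hand from the coefficients of the final model, which shows how each movie attribute contributes to the prediction. This is a small sketch: the coefficients are copied from the summary(model8) output, and the names `b`, `madmax_x`, and `nurse_x` are made up for illustration.

```r
# Coefficients copied from the summary(model8) output
b <- c(intercept            = -10.22759,
       thtr_rel_month       = -0.14736,
       imdb_rating          =  9.49944,
       audience_upright     = 20.73992,
       best_pic_nom_yes     =  3.44325,
       best_actress_win_yes = -1.74052)

# One value per coefficient: 1 for the intercept, numeric predictors as they are,
# and 1/0 for the dummy variables (Upright, nominated, actress won)
madmax_x <- c(1,  5, 8.1, 1, 1, 1)
nurse_x  <- c(1, 12, 6.1, 0, 0, 1)

# Fitted value = sum of coefficient * predictor value
round(sum(b * madmax_x), 2)   # 88.42
round(sum(b * nurse_x), 2)    # 44.21
```

Both results agree with the `fit` column above. The prediction interval bounds additionally depend on the residual standard error and the design matrix, so they are not reproduced here.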
The actual audience scores on the Rotten Tomatoes website were 85% for “Mad Max: Fury Road” and 45% for “Nurse Betty”; the model predicted 88.4 and 44.2 points, respectively. Both actual scores fall within the 95% prediction intervals, which suggests that the model predicted these two movies reasonably well, although two movies are too few to assess its accuracy in general.
The initial aim of this project was to investigate the factors associated with a movie’s audience score (audience_score). The final model identified 4 factors that were statistically significantly associated with the dependent variable, adjusted for the month of release of the movie:
IMDB rating
Audience Rating
Best Picture “Oscar” Nomination
Best Actress “Oscar” Winner
Together with the month of release, these factors explain 88.31% (Adjusted R2 = 0.8831) of the variance in the dependent variable (audience_score), which means that 11.69% of the variance in a movie’s audience score could not be explained by this model.
© Lais Duarte Batista
All Rights Reserved
August 25, 2020